Terminal-Bench Eval Harness Demos

Multiple ways to explore AgentReady’s empirical validation system

Choose your preferred learning style:

🖥️

Terminal Demo

Watch a live CLI demonstration with interactive playback controls. See the exact commands and outputs.

Watch Demo ~3 min

📊

Slide Presentation

Conference-ready slides with architecture diagrams and visual workflow explanations.

View Slides ~15 slides

📖

Complete Walkthrough

In-depth guide with Mermaid diagrams, interactive examples, and full command outputs.

Read Guide ~10 min read

⚡

Quick Reference

One-page cheat sheet with all commands, file structure, and statistical criteria.

Get Reference 1 page

What is the Eval Harness?

The Terminal-Bench eval harness empirically measures the impact of each AgentReady assessor on agentic development performance through systematic A/B testing.

graph LR
    A[Baseline] --> B[Test Assessor]
    B --> C[Measure Delta]
    C --> D[Statistical Analysis]
    D --> E[Rank by Impact]

    style A fill:#e1f5ff
    style B fill:#fff3cd
    style C fill:#d4edda
    style D fill:#cce5ff
    style E fill:#d1ecf1
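The flow in the diagram can be sketched in a few lines of Python. This is only an illustration of the "measure delta, then rank" steps, not the harness's actual code: the function name `rank_by_impact`, the `hypothetical_assessor` ID, and all scores below are made up for the example.

```python
from statistics import mean


def rank_by_impact(baseline_scores, assessor_scores):
    """Return (assessor_id, delta) pairs sorted by mean score delta, largest first.

    baseline_scores: list of benchmark scores from the baseline runs.
    assessor_scores: dict mapping an assessor ID to its list of run scores.
    """
    baseline_mean = mean(baseline_scores)
    deltas = {
        assessor_id: mean(scores) - baseline_mean
        for assessor_id, scores in assessor_scores.items()
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for illustration only; these are not real harness output.
baseline = [50.0, 52.0, 51.0]
runs = {
    "hypothetical_assessor": [51.0, 52.0, 50.0],
    "claude_md_file": [56.0, 57.0, 55.0],
}

ranking = rank_by_impact(baseline, runs)
for assessor_id, delta in ranking:
    print(f"{assessor_id}: {delta:+.2f}")
```

The statistical analysis step is left out here on purpose; a raw delta only says how large the change was, not whether it is distinguishable from run-to-run noise.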

Key Features

Empirical Validation: Measure actual impact on Terminal-Bench scores
Statistical Rigor: P-values for significance testing, Cohen’s d for effect size
Systematic A/B Testing: Test each assessor independently
Interactive Dashboard: GitHub Pages visualization with Chart.js
Comprehensive Reporting: JSON, Markdown, and HTML outputs
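To make the two statistics concrete, here is a minimal standard-library sketch. It assumes Cohen's d is computed with a pooled standard deviation and uses an exact two-sided permutation test for the p-value; the harness may use a different test (for example a t-test), so treat the function names and the choice of test as assumptions. The scores are invented.

```python
from itertools import combinations
from math import copysign, sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d: difference in means divided by the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    diff = mean(treatment) - mean(control)
    if pooled_var == 0:
        # No spread at all: report 0 when the means match, otherwise a signed
        # infinity. This convention is a choice made for this sketch.
        return 0.0 if diff == 0 else copysign(float("inf"), diff)
    return diff / sqrt(pooled_var)


def permutation_p_value(treatment, control):
    """Exact two-sided permutation test on the difference in means."""
    pooled = list(treatment) + list(control)
    n1 = len(treatment)
    observed = abs(mean(treatment) - mean(control))
    indices = range(len(pooled))
    extreme = total = 0
    # Enumerate every way of relabelling n1 of the pooled scores as "treatment".
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical scores for illustration only.
with_assessor = [56.0, 57.0, 55.0]
baseline = [50.0, 52.0, 51.0]

d = cohens_d(with_assessor, baseline)
p = permutation_p_value(with_assessor, baseline)
print(f"Cohen's d = {d:.2f}, p = {p:.2f}")
```

One property of this particular test is worth knowing: with three runs per arm there are only 20 possible relabellings, so the smallest two-sided p-value it can produce is 2/20 = 0.1, however large the effect. When every run returns the same score, the sketch reports d = 0 and p = 1.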

Current Demo Results

All demo commands were executed on the AgentReady repository itself (2025-12-07):

Baseline Score: 58.35 ± 0.00
Assessors Tested: 5 (all Tier 1)
Significant Improvements: 0 (AgentReady already passes all five assessors)
Tests Passing: 56/56 ✅

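Why a +0.00 delta? AgentReady already has a CLAUDE.md, a README, type annotations, and a standard layout, and it intentionally excludes lock files because it is a library project. Every tested assessor is therefore already satisfied, so there is nothing left to improve over the baseline. A non-compliant repository is where the harness would be expected to show measurable deltas.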

Quick Start

# Establish baseline
agentready eval-harness baseline . --iterations 3

# Test single assessor
agentready eval-harness test-assessor --assessor-id claude_md_file --iterations 3

# Aggregate results
agentready eval-harness summarize

# Generate dashboard
agentready eval-harness dashboard
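For scripted or CI use, the same four steps can be driven from Python. The sketch below assumes `agentready` is installed and on `PATH`, and it uses only the subcommands and flags shown above; the names `QUICK_START_STEPS` and `run_quick_start` are hypothetical helpers introduced for this example.

```python
import subprocess

# The four Quick Start commands, as argument lists, in the order they should run.
QUICK_START_STEPS = [
    ["agentready", "eval-harness", "baseline", ".", "--iterations", "3"],
    [
        "agentready", "eval-harness", "test-assessor",
        "--assessor-id", "claude_md_file",
        "--iterations", "3",
    ],
    ["agentready", "eval-harness", "summarize"],
    ["agentready", "eval-harness", "dashboard"],
]


def run_quick_start(steps=QUICK_START_STEPS, runner=subprocess.run):
    """Run each step in order; check=True stops at the first failing command."""
    for argv in steps:
        runner(argv, check=True)


# Call run_quick_start() from a checkout where agentready is installed.
```

Passing `runner` as a parameter keeps the function testable without invoking the real CLI, and `check=True` makes a failed baseline abort the sequence before any assessor is tested.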

Dashboard - Interactive visualization with Chart.js
Methodology - Statistical methods explained
User Guide - Complete AgentReady documentation

Last Updated: 2025-12-07 | Version: 2.14.1 | Status: Phase 1A-1F Complete ✅