Terminal-Bench Eval Harness Demos
Multiple ways to explore AgentReady’s empirical validation system
Choose your preferred learning style:
Terminal Demo
Watch a recorded CLI demonstration with interactive playback controls, showing the exact commands and their output.
Slide Presentation
Conference-ready slides with architecture diagrams and visual workflow explanations.
Complete Walkthrough
In-depth guide with Mermaid diagrams, interactive examples, and full command outputs.
Quick Reference
One-page cheat sheet with all commands, file structure, and statistical criteria.
What is the Eval Harness?
The Terminal-Bench eval harness empirically measures the impact of each AgentReady assessor on agentic development performance through systematic A/B testing.
graph LR
A[Baseline] --> B[Test Assessor]
B --> C[Measure Delta]
C --> D[Statistical Analysis]
D --> E[Rank by Impact]
style A fill:#e1f5ff
style B fill:#fff3cd
style C fill:#d4edda
style D fill:#cce5ff
style E fill:#d1ecf1
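The workflow in the diagram can be sketched as a small ranking routine. This is a minimal illustration, not AgentReady's implementation: the function name `rank_by_delta`, the assessor IDs, and the scores are all hypothetical, and the statistical analysis step is left out here.

```python
from statistics import mean


def rank_by_delta(baseline_scores, assessor_scores):
    """Return (assessor_id, delta) pairs, largest mean improvement first.

    delta = mean score with the assessor's change applied - mean baseline score.
    """
    baseline_mean = mean(baseline_scores)
    deltas = {
        assessor_id: mean(scores) - baseline_mean
        for assessor_id, scores in assessor_scores.items()
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only; these are not harness output.
baseline = [50.0, 52.0, 51.0]
runs = {
    "assessor_a": [55.0, 56.0, 57.0],  # hypothetical assessor ID
    "assessor_b": [51.0, 51.0, 51.0],  # hypothetical assessor ID
}

ranking = rank_by_delta(baseline, runs)
for assessor_id, delta in ranking:
    print(f"{assessor_id}: {delta:+.2f}")
```

Each assessor is compared against the same baseline, so one assessor's delta does not depend on the others.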
Key Features
- Empirical Validation: Measure actual impact on Terminal-Bench scores
- Statistical Rigor: P-values for significance testing, Cohen’s d for effect size
- Systematic A/B Testing: Test each assessor independently
- Interactive Dashboard: GitHub Pages visualization with Chart.js
- Comprehensive Reporting: JSON, Markdown, and HTML outputs
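The two statistics named above can be computed with the Python standard library alone. The sketch below is an assumption-laden illustration: the source does not say which significance test the harness uses, so an exact two-sided permutation test stands in for it, and the function names and sample scores are hypothetical.

```python
from itertools import combinations
from math import copysign, inf, sqrt
from statistics import mean, variance


def cohens_d(control, treatment):
    """Effect size: difference of means divided by the pooled sample std dev."""
    n1, n2 = len(control), len(treatment)
    pooled_var = (
        (n1 - 1) * variance(control) + (n2 - 1) * variance(treatment)
    ) / (n1 + n2 - 2)
    diff = mean(treatment) - mean(control)
    if pooled_var == 0:
        # No spread in either group: report 0.0 for equal means,
        # otherwise a signed infinity rather than dividing by zero.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / sqrt(pooled_var)


def permutation_p_value(control, treatment):
    """Exact two-sided permutation test on the difference of means.

    Stand-in test chosen for this sketch; not necessarily the harness's method.
    """
    pooled = list(control) + list(treatment)
    n_treat = len(treatment)
    n_ctrl = len(pooled) - n_treat
    total = sum(pooled)
    observed = abs(mean(treatment) - mean(control))
    hits = count = 0
    for idx in combinations(range(len(pooled)), n_treat):
        treat_sum = sum(pooled[i] for i in idx)
        diff = treat_sum / n_treat - (total - treat_sum) / n_ctrl
        # Small tolerance so floating-point noise does not drop exact ties.
        if abs(diff) >= observed - 1e-12:
            hits += 1
        count += 1
    return hits / count


# Made-up scores for illustration only.
control = [50.0, 52.0, 51.0]
treatment = [55.0, 56.0, 57.0]
d = cohens_d(control, treatment)
p = permutation_p_value(control, treatment)

# Identical, zero-variance runs: no effect and no evidence of a difference.
flat = [58.35, 58.35, 58.35]
d_flat = cohens_d(flat, flat)
p_flat = permutation_p_value(flat, flat)
```

With three iterations per arm there are only 20 ways to split the six scores, so the smallest two-sided p-value this particular test can return is 2/20 = 0.1. More iterations are needed before it can report anything smaller.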
Current Demo Results
All demo commands were executed on the AgentReady repository (2025-12-07):
- Baseline Score: 58.35 ± 0.00
- Assessors Tested: 5 (all Tier 1)
- Significant Improvements: 0 (AgentReady already passes every assessor tested)
- Tests Passing: 56/56 ✅
Why a +0.00 delta? AgentReady already has CLAUDE.md, a README, type annotations, and a standard layout, and it intentionally excludes lock files because it is a library project. With every tested assessor already satisfied, the A/B comparison has nothing to change; testing a non-compliant repository should produce nonzero deltas.
Quick Start
# Establish baseline
agentready eval-harness baseline . --iterations 3
# Test single assessor
agentready eval-harness test-assessor --assessor-id claude_md_file --iterations 3
# Aggregate results
agentready eval-harness summarize
# Generate dashboard
agentready eval-harness dashboard
Related Pages
- Dashboard - Interactive visualization with Chart.js
- Methodology - Statistical methods explained
- User Guide - Complete AgentReady documentation
Last Updated: 2025-12-07
Version: 2.14.1
Status: Phase 1A-1F Complete ✅