🖥️ Terminal-Bench Eval Harness Demo
Interactive CLI demonstration with playback controls
⌨️ Playback Controls
- Play/Pause: Click the play button or press Space
- Speed: Click the speed indicator (1x, 2x, 3x)
- Fullscreen: Click the fullscreen icon
- Seek: Click anywhere on the progress bar
📋 Commands Demonstrated
- agentready eval-harness baseline . --iterations 3 --verbose
  Establishes baseline Terminal-Bench performance (3 runs)
- agentready eval-harness test-assessor --assessor-id claude_md_file --iterations 3
  Tests single assessor impact on benchmark scores
- agentready eval-harness summarize --verbose
  Aggregates results and ranks assessors by impact
- agentready eval-harness dashboard --verbose
  Generates interactive dashboard data files
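
To reproduce the demo outside the player, the four commands can be chained in a small wrapper script. This is a minimal sketch, not part of the project: it assumes `agentready` is installed and on `PATH`, and that the steps are meant to run in the order listed. The `DRY_RUN` switch and the `run_step` and `run_demo` helpers are hypothetical names introduced for illustration. By default the script only prints each command, prefixed with `+ `, so it is safe to try first.

```shell
#!/bin/sh
# Sketch: run the four demonstrated commands in the order they are listed.
# Assumptions (not confirmed by the source): `agentready` is on PATH and the
# steps are meant to run sequentially. DRY_RUN, run_step and run_demo are
# hypothetical helpers of this script, not agentready features.
set -eu

# Default to a dry run: print the commands without executing them.
DRY_RUN="${DRY_RUN:-1}"

run_step() {
  # Echo the command, then execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

run_demo() {
  run_step agentready eval-harness baseline . --iterations 3 --verbose
  run_step agentready eval-harness test-assessor --assessor-id claude_md_file --iterations 3
  run_step agentready eval-harness summarize --verbose
  run_step agentready eval-harness dashboard --verbose
}

run_demo
```

Running the script with `DRY_RUN=0` executes the commands for real. Because of `set -e`, the script stops at the first command that fails, so a failed baseline run does not feed into the later steps.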