Skip to main content

🖥️ Terminal-Bench Eval Harness Demo

Interactive CLI demonstration with playback controls

⌨️ Playback Controls

  • Play/Pause: Click the play button or press Space
  • Speed: Click the speed indicator (1x, 2x, 3x)
  • Fullscreen: Click the fullscreen icon
  • Seek: Click anywhere on the progress bar

📋 Commands Demonstrated

  1. agentready eval-harness baseline . --iterations 3 --verbose
    Establishes baseline Terminal-Bench performance (3 runs)
  2. agentready eval-harness test-assessor --assessor-id claude_md_file --iterations 3
    Tests single assessor impact on benchmark scores
  3. agentready eval-harness summarize --verbose
    Aggregates results and ranks assessors by impact
  4. agentready eval-harness dashboard --verbose
    Generates interactive dashboard data files