
Terminal-Bench Eval Harness Demos

Multiple ways to explore AgentReady’s empirical validation system

Choose your preferred learning style:


🖥️

Terminal Demo

Watch a live CLI demonstration with interactive playback controls. See the exact commands and outputs.

Watch Demo ~3 min
📊

Slide Presentation

Conference-ready slides with architecture diagrams and visual workflow explanations.

View Slides ~15 slides
📖

Complete Walkthrough

In-depth guide with Mermaid diagrams, interactive examples, and full command outputs.

Read Guide ~10 min read

Quick Reference

One-page cheat sheet with all commands, file structure, and statistical criteria.

Get Reference 1 page

What is the Eval Harness?

The Terminal-Bench eval harness empirically measures the impact of each AgentReady assessor on agentic development performance through systematic A/B testing.

graph LR
    A[Baseline] --> B[Test Assessor]
    B --> C[Measure Delta]
    C --> D[Statistical Analysis]
    D --> E[Rank by Impact]

    style A fill:#e1f5ff
    style B fill:#fff3cd
    style C fill:#d4edda
    style D fill:#cce5ff
    style E fill:#d1ecf1
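
The workflow in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AgentReady's implementation: the function names `measure_delta` and `rank_by_impact`, the assessor ids, the scores, and the choice of Cohen's d as the effect-size statistic are all hypothetical. The harness's actual statistical criteria are not stated on this page.

```python
from math import sqrt
from statistics import mean, stdev


def measure_delta(baseline, treated):
    """Return (mean delta, Cohen's d) for two lists of benchmark scores.

    Hypothetical helper: the real harness may use different statistics.
    Each list needs at least two scores so a standard deviation exists.
    """
    delta = mean(treated) - mean(baseline)
    n1, n2 = len(baseline), len(treated)
    s1, s2 = stdev(baseline), stdev(treated)
    # Pooled standard deviation across both groups.
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    # Cohen's d is undefined with zero variance; report 0.0 in that case.
    effect = delta / pooled if pooled > 0 else 0.0
    return delta, effect


def rank_by_impact(results):
    """Sort assessor ids by mean delta, largest first."""
    return sorted(results, key=lambda name: results[name][0], reverse=True)


# Made-up scores from three iterations each (illustrative only).
baseline = [50.0, 52.0, 51.0]
results = {
    "hypothetical_assessor_a": measure_delta(baseline, [55.0, 57.0, 56.0]),
    "hypothetical_assessor_b": measure_delta(baseline, [50.0, 52.0, 51.0]),
}

for name in rank_by_impact(results):
    delta, effect = results[name]
    print(f"{name}: delta={delta:+.2f}, effect size={effect:.2f}")
```

With these made-up scores, the second assessor leaves every score unchanged, so its delta is exactly zero. The same thing happens when the repository under test already satisfies an assessor.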

Key Features


Current Demo Results

All demo commands were executed on the AgentReady repository (2025-12-07).

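Why a +0.00 delta? The AgentReady repository already has CLAUDE.md, a README, type annotations, and a standard layout, and it intentionally excludes lock files because it is a library project. With nothing left for the assessors to fix, the baseline and test scores match. Testing on a non-compliant repository should show measurable improvements.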


Quick Start

# Establish baseline
agentready eval-harness baseline . --iterations 3

# Test single assessor
agentready eval-harness test-assessor --assessor-id claude_md_file --iterations 3

# Aggregate results
agentready eval-harness summarize

# Generate dashboard
agentready eval-harness dashboard
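
To run the four steps as one unit, they can be wrapped in a small Python script. This is a sketch that assumes `agentready` is installed and on the PATH. The helper names `build_commands` and `run_pipeline` are hypothetical, and only the subcommands and flags shown above are used.

```python
import subprocess


def build_commands(assessor_id, iterations=3, repo="."):
    """Return the Quick Start commands as argument lists, in order."""
    count = str(iterations)
    return [
        ["agentready", "eval-harness", "baseline", repo, "--iterations", count],
        ["agentready", "eval-harness", "test-assessor",
         "--assessor-id", assessor_id, "--iterations", count],
        ["agentready", "eval-harness", "summarize"],
        ["agentready", "eval-harness", "dashboard"],
    ]


def run_pipeline(assessor_id, iterations=3, repo="."):
    """Run each step in sequence, stopping at the first failure."""
    for command in build_commands(assessor_id, iterations, repo):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)
```

Calling `run_pipeline("claude_md_file")` would then reproduce the sequence above. Using `check=True` stops the run at the first failing step, so later steps do not summarize incomplete results.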


Last Updated: 2025-12-07 | Version: 2.14.1 | Status: Phase 1A-1F Complete ✅