
Terminal-Bench Eval Harness Results

Systematic A/B testing of each AgentReady assessor’s impact on Terminal-Bench performance


📊 Overview

-

Assessors Tested

-

Significant Improvements

-%

Significance Rate

-

Baseline Score


🎯 Impact by Tier

Tier 1: Essential

Most critical for AI assistance

Tier 2: Critical

Major impact on velocity

Tier 3: Important

Meaningful quality gains

Tier 4: Advanced

Polish and optimization


🏆 Top Performing Assessors

| Rank | Assessor | Tier | Delta Score | Effect Size | Significant? |
|------|----------|------|-------------|-------------|--------------|

📈 Complete Results

| Rank | Assessor | Tier | Delta (%) | Cohen's d | P-value | Status | Fixes |
|------|----------|------|-----------|-----------|---------|--------|-------|

📖 Methodology

### A/B Testing Workflow

1. **Establish Baseline**: Run Terminal-Bench 5 times on unmodified repository
2. **For Each Assessor**:
   - Clone repository to temporary directory
   - Run single assessor assessment
   - Apply remediation using AgentReady's `align` command
   - Run Terminal-Bench 5 times post-remediation
   - Calculate delta score and statistical significance
3. **Aggregate Results**: Combine all assessor impacts with tier-level statistics

### Statistical Rigor

- **Significance Threshold**: p-value < 0.05 AND |Cohen's d| > 0.2
- **T-Test**: Two-sample t-test comparing baseline vs. post-remediation scores
- **Effect Size**: Cohen's d measures standardized difference
  - Small: 0.2 ≤ |d| < 0.5
  - Medium: 0.5 ≤ |d| < 0.8
  - Large: |d| ≥ 0.8

### Current Status

**Phase 1 (MVP)**: Mocked Terminal-Bench integration for workflow validation

**Phase 2 (Planned)**: Real Harbor framework integration and leaderboard submission
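
### Example: Applying the Significance Rule

The significance rule described under Statistical Rigor can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the function names (`cohens_d`, `effect_label`, `is_significant`) are hypothetical, the pooled-standard-deviation form of Cohen's d is an assumption (the methodology does not say which variant is used), and the "negligible" label for |d| < 0.2 is added here for completeness. The p-value is taken as an input and is assumed to come from a two-sample t-test run elsewhere, for example with SciPy's `scipy.stats.ttest_ind`.

```python
import math
import statistics


def cohens_d(baseline, treated):
    """Standardized mean difference, using the pooled sample standard deviation.

    Assumption: the pooled-SD variant of Cohen's d; the source does not specify one.
    """
    n1, n2 = len(baseline), len(treated)
    var1 = statistics.variance(baseline)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(treated)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    diff = statistics.mean(treated) - statistics.mean(baseline)
    if pooled_sd == 0:
        # Both groups are constant: no effect if the means match, unbounded otherwise.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / pooled_sd


def effect_label(d):
    """Map |d| onto the small / medium / large bands from the methodology."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"  # hypothetical label for values below the smallest band


def is_significant(p_value, d, alpha=0.05, min_effect=0.2):
    """Both conditions must hold: p-value < alpha AND |Cohen's d| > min_effect."""
    return p_value < alpha and abs(d) > min_effect


# Illustrative scores only (five runs each, matching the workflow); not real results.
baseline_scores = [10, 12, 11, 13, 14]
treated_scores = [12, 14, 13, 15, 16]
d = cohens_d(baseline_scores, treated_scores)
print(f"d = {d:.3f} ({effect_label(d)})")
```

Requiring both conditions means a small p-value alone does not mark an assessor as significant: a change with |d| at or below 0.2 is rejected even when the t-test passes.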