# Terminal-Bench Eval Harness Results
Systematic A/B testing of each AgentReady assessor's impact on Terminal-Bench performance.
## Overview

- **Assessors Tested**: -
- **Significant Improvements**: -
- **Significance Rate**: -%
- **Baseline Score**: -
## Impact by Tier

- **Tier 1 (Essential)**: most critical for AI assistance
- **Tier 2 (Critical)**: major impact on velocity
- **Tier 3 (Important)**: meaningful quality gains
- **Tier 4 (Advanced)**: polish and optimization
## Top Performing Assessors

| Rank | Assessor | Tier | Delta Score | Effect Size | Significant? |
|------|----------|------|-------------|-------------|--------------|
## Complete Results

| Rank | Assessor | Tier | Delta (%) | Cohen's d | P-value | Status | Fixes |
|------|----------|------|-----------|-----------|---------|--------|-------|
## Methodology
### A/B Testing Workflow
1. **Establish Baseline**: Run Terminal-Bench 5 times on the unmodified repository
2. **For Each Assessor**:
- Clone repository to temporary directory
- Run single assessor assessment
- Apply remediation using AgentReady's `align` command
- Run Terminal-Bench 5 times post-remediation
- Calculate delta score and statistical significance
3. **Aggregate Results**: Combine all assessor impacts with tier-level statistics
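The loop above can be sketched as a small Python driver. This is a minimal sketch, not the harness itself: `ab_test_assessor`, `run_benchmark`, and `apply_remediation` are hypothetical names, and the two callables stand in for the real Terminal-Bench invocation and the assessor-plus-`align` remediation step; injecting them keeps the orchestration testable with stubs.

```python
import statistics
import tempfile
from typing import Callable, Dict, List


def ab_test_assessor(
    assessor: str,
    run_benchmark: Callable[[str], float],       # one Terminal-Bench run -> score
    apply_remediation: Callable[[str, str], None],  # assessor + `align` on a repo copy
    repo_path: str,
    runs: int = 5,
) -> Dict[str, object]:
    """A/B test a single assessor: baseline runs, remediate a copy, post runs."""
    # 1. Establish baseline on the unmodified repository.
    baseline: List[float] = [run_benchmark(repo_path) for _ in range(runs)]
    # 2. Work in a throwaway directory so the original repo stays pristine.
    with tempfile.TemporaryDirectory() as workdir:
        # In the real harness the repository is cloned into workdir here.
        apply_remediation(assessor, workdir)
        post: List[float] = [run_benchmark(workdir) for _ in range(runs)]
    # 3. The delta score feeds the aggregate / tier-level statistics.
    delta = statistics.mean(post) - statistics.mean(baseline)
    return {"assessor": assessor, "baseline": baseline, "post": post, "delta": delta}
```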
### Statistical Rigor
- **Significance Threshold**: p-value < 0.05 AND |Cohen's d| > 0.2
- **T-Test**: Two-sample t-test comparing baseline vs. post-remediation scores
- **Effect Size**: Cohen's d measures the standardized difference between baseline and post-remediation means
- Small: 0.2 β€ |d| < 0.5
- Medium: 0.5 β€ |d| < 0.8
- Large: |d| β₯ 0.8
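The statistics above can be computed with stdlib arithmetic alone. This is a minimal sketch assuming a pooled-variance two-sample t-test; the function names are illustrative, and the p-value itself would come from the t distribution with n1 + n2 - 2 degrees of freedom (e.g. via `scipy.stats.ttest_ind`), which is omitted here.

```python
import math
import statistics
from typing import Sequence


def _pooled_std(baseline: Sequence[float], post: Sequence[float]) -> float:
    """Pooled standard deviation of the two samples."""
    n1, n2 = len(baseline), len(post)
    return math.sqrt(
        ((n1 - 1) * statistics.variance(baseline)
         + (n2 - 1) * statistics.variance(post)) / (n1 + n2 - 2)
    )


def t_statistic(baseline: Sequence[float], post: Sequence[float]) -> float:
    """Pooled-variance two-sample t statistic (baseline vs. post-remediation)."""
    n1, n2 = len(baseline), len(post)
    return (statistics.mean(post) - statistics.mean(baseline)) / (
        _pooled_std(baseline, post) * math.sqrt(1 / n1 + 1 / n2)
    )


def cohens_d(baseline: Sequence[float], post: Sequence[float]) -> float:
    """Cohen's d: mean difference standardized by the pooled standard deviation."""
    return (statistics.mean(post) - statistics.mean(baseline)) / _pooled_std(baseline, post)


def effect_size_label(d: float) -> str:
    """Classify |d| into the bands listed above."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "negligible"
```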
### Current Status
**Phase 1 (MVP)**: Mocked Terminal-Bench integration for workflow validation
**Phase 2 (Planned)**: Real Harbor framework integration and leaderboard submission