**Do AgentReady recommendations actually improve agentic development performance?**
We needed proof.
# The Approach: A/B Testing at Scale
- **Baseline**: Measure performance before any fixes
- **Remediate**: Apply fixes from a single assessor
- **Re-measure**: Run the benchmark again
- **Compare**: Test for statistical significance
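The four steps above can be sketched as a minimal harness loop. This is an illustrative sketch, not the actual AgentReady implementation; `run_benchmark` and `apply_fixes` are hypothetical callables standing in for the real eval harness:

```python
def run_trials(run_benchmark, n=3):
    """Run the benchmark n times and collect the scores."""
    return [run_benchmark() for _ in range(n)]

def ab_measure(run_benchmark, apply_fixes, n=3):
    """Baseline -> remediate -> re-measure: return both score samples."""
    baseline = run_trials(run_benchmark, n)   # measure before any fixes
    apply_fixes()                             # apply one assessor's fixes in isolation
    post_fix = run_trials(run_benchmark, n)   # measure again under identical conditions
    return baseline, post_fix
```

Isolating a single assessor's fixes per run is what makes the comparison attributable: any score delta can be credited to that one change.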
## Eval Harness Architecture
```mermaid
graph LR
A[Repository] -->|Run 3x| B[Baseline: 58.35]
B --> C[Apply Fixes]
C -->|Run 3x| D[Post-Fix Score]
D --> E{Compare}
E -->|p-value + Cohen's d| F[Statistical Significance]
style B fill:#e1f5ff
style D fill:#d4edda
style F fill:#fff3cd
```
## Two-Factor Test
**BOTH criteria are required for significance:**

1. **P-value < 0.05**
   *95% confidence the change is not due to chance*
2. **|Cohen's d| > 0.2**
   *At least a small, meaningful effect size*
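The two-factor check can be sketched in a few lines. This is a minimal illustration, not the AgentReady source: Cohen's d is computed from the pooled standard deviation, while the p-value is assumed to come from an external t-test (e.g. `scipy.stats.ttest_ind`) and is passed in:

```python
import statistics
from math import sqrt

def cohens_d(a, b):
    """Effect size: difference of means over the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    s1, s2 = statistics.stdev(a), statistics.stdev(b)
    pooled = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (statistics.mean(b) - statistics.mean(a)) / pooled

def is_significant(p_value, baseline, post_fix):
    """Two-factor test: BOTH p < 0.05 AND |d| > 0.2 must hold."""
    return p_value < 0.05 and abs(cohens_d(baseline, post_fix)) > 0.2
```

Requiring both factors guards against two failure modes: with only three runs per arm, a tiny p-value can accompany a trivially small effect, and a large-looking effect can still be statistical noise.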