Eval Harnesses — how you know your agent isn't lying to itself

A model can demo anything. An agent ships only what an eval suite proves it does reliably. Ten lessons on building eval suites for autonomous systems — from golden sets to leaderboard hygiene to the difference between SWE-bench numbers and "would I let this ship on Friday."

Lessons

Foundations

01 Why the eval set is the only artifact that ships AI (7 min)
02 Task suites vs. leaderboards — what each one is for (7 min)

Reading the public benchmarks

03 SWE-bench, SWE-bench Verified, and what they actually measure (8 min)

Building your suite

04 Building a golden set from real user data (8 min)
05 Graders — string match, LLM-as-judge, human-in-the-loop (8 min)

Operating the suite

06 Regression testing as a culture, not a one-time event (8 min)
07 Eval drift — when your eval set stops catching regressions (7 min)

Reference implementations

08 The Ostronaut named-vector retrieval harness — a worked example (10 min)

Operating the suite

09 The cost of evals — when not to grade everything (8 min)