Eval Harnesses — how you know your agent isn't lying to itself
A model can demo anything. An agent ships only what an eval suite proves it does reliably. Ten lessons on building eval suites for autonomous systems — from golden sets to leaderboard hygiene to the difference between SWE-bench numbers and "would I let this ship on Friday."
Lessons
Foundations
01 Why the eval set is the only artifact that ships AI (7 min)
02 Task suites vs. leaderboards — what each one is for (7 min)
Reading the public benchmarks
03 SWE-bench, SWE-bench Verified, and what they actually measure (8 min)
Building your suite
04 Building a golden set from real user data (8 min)
05 Graders — string match, LLM-as-judge, human-in-the-loop (8 min)
Operating the suite
06 Regression testing as a culture, not a one-time event (8 min)
07 Eval drift — when your eval set stops catching regressions (7 min)
Reference implementations
08 The Ostronaut named-vector retrieval harness — a worked example (10 min)
Operating the suite
09 The cost of evals — when not to grade everything (8 min)