Harness Engineering — building the cage for autonomous agents

The next eighteen months of product surface area is being shaped by what's possible with multi-agent autonomy, not by what's possible in a single prompt. This path teaches the vocabulary, the failure modes, and the operating discipline behind agent loops, eval harnesses, tools, memory, multi-agent orchestration, and production runtime — so you can spec a system that doesn't fall over at hour six.

For: PMs, founders, and product leaders who have to scope, ship, or judge autonomous-agent systems and need a working mental model of the harness (the eval suites, agent loops, tools, memory, orchestration, and production runtime) without becoming an ML engineer.

Outcome: You can name the components of an agentic system, sketch a failure-mode tree on a whiteboard, push back on a vendor pitch that conflates capability with reliability, write the eval set that decides whether v1 ships, scope tool and memory surfaces deliberately, decide when a second agent is a multiplier versus a tax, and brief leadership on cost, observability, and recovery before a Sentry page wakes the team.

6 weeks · 5 stages · 20 items

This path is for the product person who is being asked to ship an autonomous-agent system and does not want to find out at hour six that the harness was a hope rather than an artifact.

The framing comes from a working definition: harness engineering is the work of building the cage that lets autonomous agents run safely for hours or days. Eval harnesses, agent loops, tool wiring, memory, checkpoints, observability, recovery, cost control. It is the unsexy plumbing under the sexy "AI agents" headline, and it is where the difference between a science-fair demo and a shipped product actually lives.

You do not need to be an ML engineer to do this work. You need vocabulary, judgment, and a mental model. This path supplies all three.

The structure mirrors the layers of an agentic system. Stage 1 is the single-agent loop — see it clearly, draw it from memory, name the five places it can fail. Stage 2 is the eval harness — the only artifact that lets you ship anything more than vibes. Stage 3 is the agent's hands and notebook — tools and memory as deliberate design surfaces, not as accidents. Stage 4 is multi-agent orchestration — when the swarm is a multiplier, when it is a tax, and how to tell the difference before you sign the bill. Stage 5 is production — the part of the system that wakes the on-call, the dashboards that prevent it, and the postmortem ritual that turns every 3am page into a regression test.

Each course in the path follows the same lesson shape, designed to keep cognitive load low and the operating-manual feel high:

The move — one line, the JTBD this lesson teaches.
The picture — a diagram or sketch of the mental model. Some are interactive, some are static, all are version-controlled in this repo.
Why it matters now — two or three sentences on what changed in 2024–2026 that makes this lesson current rather than 2019-era folklore.
A source you should trust — one or two citations, each with a sentence on why the source is credible, not just a URL.
A recipe — a ten- to thirty-line snippet or checklist you can use today.
The smell of it going wrong — three to five bullets on failure modes you should be able to pattern-match before they happen.
A judgment call from real work — one anecdote from PL, Ostronaut, talvinder.com, or sideb.club. What we tried, what broke, what we learned. The credibility layer.

The fifty lessons sit on top of the AI Manual's twelve-chapter spine. The Manual gives you the conceptual base — when AI is the right answer, how to climb the model-selection ladder, how to design for hallucination, how to think about cost and safety. This path takes you from "I can spec an AI feature" to "I can spec a system that runs autonomously, fails gracefully, and proves it works."

Work through the courses in order the first time. After that, treat the path as a reference — when an eval suite is the blocker, return to Course 2; when the on-call is paging, return to Course 5; when somebody on your team proposes a six-agent swarm for a problem one agent could solve, return to Course 4 before signing off.

A note on currency: the field is moving fast enough that any path written in 2026 will need a refresh by 2027. The lessons here are written to age well: they cite primary sources, they teach the loop rather than the framework du jour, and the case studies are chosen because the patterns they illustrate will outlast the specific products. When a model name, an SDK, or a vendor name dates, the underlying judgment will not.

The path is opinionated about what it does not teach: it does not teach "how to write a prompt" (covered in the AI Manual), it does not teach "AI ethics as a separate topic" (covered as constraints inside each course), and it does not climax in a "build your own ChatGPT" capstone (the wrong frame — the work is building the harness, not building the model). When you finish the fifth course, you should be able to walk into a vendor demo, a leadership review, or a postmortem and ask the question that turns the room toward signal: show me the harness.

Stage 1 — The agent loop, demystified

An "AI agent" is not magic. It is a loop with a model, a context, a tool-call surface, a memory layer, and a stopping condition — and every production failure traces back to one of those five pieces. Before you can spec a multi-agent system, you have to see the single-agent loop clearly enough to draw it from memory.
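To make the five pieces concrete, here is a minimal sketch of a single-agent loop in Python. Everything in it is illustrative rather than taken from any real framework: `AgentLoop`, `scripted_model`, and the `lookup` tool are hypothetical names, and the "model" is a scripted stand-in, not an actual LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentLoop:
    model: Callable[[list[str]], str]                  # 1. model: reads context, picks the next action
    tools: dict[str, Callable[[str], str]]             # 3. tool-call surface: the verbs it may use
    context: list[str] = field(default_factory=list)   # 2. context: what the model sees each turn
    memory: list[str] = field(default_factory=list)    # 4. memory layer: a log of actions and results
    max_steps: int = 5                                 # 5. stopping condition: a hard step budget

    def run(self, task: str) -> str:
        self.context.append(f"task: {task}")
        for _ in range(self.max_steps):
            action = self.model(self.context)
            if action.startswith("done:"):
                return action.removeprefix("done:").strip()
            name, _, arg = action.partition(" ")
            tool = self.tools.get(name)
            result = tool(arg) if tool else f"error: unknown tool {name!r}"
            self.memory.append(f"{action} -> {result}")
            self.context.append(result)
        return "stopped: step budget exhausted"


def scripted_model(context: list[str]) -> str:
    # Hypothetical stand-in for a model: look something up once, then finish.
    if not any(line.startswith("result:") for line in context):
        return "lookup weather"
    return "done: " + context[-1]


agent = AgentLoop(
    model=scripted_model,
    tools={"lookup": lambda query: f"result: {query} is sunny"},
)
answer = agent.run("what is the weather?")
```

Each field maps to one of the five places the loop can fail. Swap the scripted model for one that never says `done:` and the only thing standing between you and an endless run is `max_steps`.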

Stage 2 — Eval harnesses you can trust

A model can demo anything. An agent can ship only what an eval suite proves it does reliably. If you cannot tell whether v17 of your prompt is better than v16, you do not have a product — you have a vibes-based science fair. The eval harness is the part of the system that turns hope into a metric.
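A minimal sketch of what "better than v16" can mean in practice, assuming exact-match grading against a fixed eval set. The cases, the 0.9 threshold, and the `v16` / `v17` stand-ins are all hypothetical; a real harness would call the agent and would likely use richer graders.

```python
from typing import Callable

# Hypothetical eval set: (input, expected output) pairs, fixed before any version runs.
EVAL_SET = [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("8/2", "4")]


def pass_rate(agent: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the agent's output matches the expected answer exactly."""
    passed = sum(1 for prompt, expected in cases if agent(prompt).strip() == expected)
    return passed / len(cases)


def should_ship(candidate, baseline, cases, min_rate: float = 0.9) -> bool:
    """Ship only if the candidate clears the bar and does not regress against the baseline."""
    cand = pass_rate(candidate, cases)
    base = pass_rate(baseline, cases)
    return cand >= min_rate and cand >= base


# Stand-ins for two prompt versions; real ones would wrap a model call.
def v16(prompt: str) -> str:
    return {"2+2": "4", "3*3": "9"}.get(prompt, "unknown")


def v17(prompt: str) -> str:
    return {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "4"}.get(prompt, "unknown")


decision = should_ship(candidate=v17, baseline=v16, cases=EVAL_SET)
```

The design choice that matters is that the eval set and the threshold exist before the candidate runs. Otherwise the bar moves to wherever the latest version happens to land.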

Stage 3 — Tools and memory, designed deliberately

The agent's effective IQ is bounded by the toolbelt you hand it and the notebook you let it keep. Tools are the verbs; memory is the continuity. Most teams under-design both, then blame the model when the loop drifts at hour three. This stage is the product-design discipline applied to the agent's hands and its mind.
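One way to picture tools and memory as designed surfaces is sketched below. It assumes two design rules that are illustrative, not prescribed by this path: every tool declares whether it has side effects, and the notebook is bounded so it cannot grow without limit. `Tool`, `Notebook`, and `call_tool` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str            # what the model reads when choosing a verb
    run: Callable[[str], str]
    read_only: bool = True      # side-effecting tools must be flagged explicitly


class Notebook:
    """Bounded memory: keeps only the most recent `limit` notes."""

    def __init__(self, limit: int = 3):
        self.notes = deque(maxlen=limit)

    def write(self, note: str) -> None:
        self.notes.append(note)

    def recall(self) -> list[str]:
        return list(self.notes)


def call_tool(tools: dict[str, Tool], name: str, arg: str, allow_writes: bool = False) -> str:
    """Dispatch a tool call, refusing unknown tools and unapproved side effects."""
    tool = tools.get(name)
    if tool is None:
        return f"error: unknown tool {name!r}"
    if not tool.read_only and not allow_writes:
        return f"error: {name!r} has side effects and writes are not allowed"
    return tool.run(arg)


TOOLS = {
    "search": Tool("search", "Look up documents by keyword.", lambda q: f"hits for {q}"),
    "send_email": Tool("send_email", "Send an email.", lambda body: "sent", read_only=False),
}
```

The point of the sketch is the shape of the decisions, not the code: which verbs exist, which of them can change the world, and how much the agent is allowed to remember.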

Stage 4 — Multi-agent orchestration (and when not to)

The instinct to "add another agent" almost always loses. Cognition's "Don't Build Multi-Agents" and Anthropic's counter-essay frame the live debate; both are right in different regimes. The skill is knowing which regime you are in before the swarm becomes a coordination problem you cannot debug.

Stage 5 — Production harnesses: observability, recovery, the bill

Agents in production fail in ways single-prompt features do not. They burn money in their sleep, they get stuck in tool-call loops, they wake the on-call at 3am because a context window blew up on input nobody anticipated. This stage is the operational layer — the dashboards, the kill switches, the cost meters, the postmortem ritual.
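A kill switch and a cost meter can be as small as the sketch below. It assumes the runtime can report a dollar cost per tool call and that "stuck" means the same tool invoked with the same arguments too many times; both are simplifying assumptions, and `RunGuard` and its exceptions are hypothetical names.

```python
from collections import Counter


class BudgetExceeded(RuntimeError):
    """Raised when a run spends more than its cost ceiling."""


class LoopDetected(RuntimeError):
    """Raised when the same tool call repeats too many times."""


class RunGuard:
    def __init__(self, max_cost_usd: float, max_repeats: int = 3):
        self.max_cost_usd = max_cost_usd
        self.max_repeats = max_repeats
        self.spent_usd = 0.0
        self.calls = Counter()

    def record(self, tool: str, args: str, cost_usd: float) -> None:
        """Call once per tool invocation; raises to halt the run."""
        self.spent_usd += cost_usd
        self.calls[(tool, args)] += 1
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(f"spent ${self.spent_usd:.2f} of ${self.max_cost_usd:.2f}")
        if self.calls[(tool, args)] > self.max_repeats:
            raise LoopDetected(f"{tool}({args!r}) called more than {self.max_repeats} times")
```

The agent loop would call `guard.record(...)` after every tool call and let either exception end the run. The exception is the kill switch; the `spent_usd` counter is the number that belongs on a dashboard.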