//pragmatic leaders

Harness Engineering — building the cage for autonomous agents


Over the next eighteen months, product surface area will be shaped by what is possible with multi-agent autonomy, not by what is possible in a single prompt. This path teaches the vocabulary, the failure modes, and the operating discipline behind agent loops, eval harnesses, tools, memory, multi-agent orchestration, and production runtime — so you can spec a system that doesn't fall over at hour six.

For: PMs, founders, and product leaders who have to scope, ship, or judge autonomous-agent systems and need a working mental model of the harness — the eval suites, agent loops, tools, memory, orchestration, and production runtime — without becoming an ML engineer.
Outcome: You can name the components of an agentic system, sketch a failure-mode tree on a whiteboard, push back on a vendor pitch that conflates capability with reliability, write the eval set that decides whether v1 ships, scope tool and memory surfaces deliberately, decide when a second agent is a multiplier versus a tax, and brief leadership on cost, observability, and recovery before a Sentry page wakes the team.
6 weeks · 5 stages · 20 items

This path is for the product person who is being asked to ship an autonomous-agent system and does not want to find out at hour six that the harness was a hope rather than an artifact.

The framing comes from a working definition: harness engineering is the work of building the cage that lets autonomous agents run safely for hours or days. Eval harnesses, agent loops, tool wiring, memory, checkpoints, observability, recovery, cost control. It is the unsexy plumbing under the sexy "AI agents" headline, and it is where the difference between a science-fair demo and a shipped product actually lives.

You do not need to be an ML engineer to do this work. You need vocabulary, judgment, and a mental model. This path supplies all three.

The structure mirrors the layers of an agentic system. Stage 1 is the single-agent loop — see it clearly, draw it from memory, name the five places it can fail. Stage 2 is the eval harness — the only artifact that lets you ship anything more than vibes. Stage 3 is the agent's hands and notebook — tools and memory as deliberate design surfaces, not as accidents. Stage 4 is multi-agent orchestration — when the swarm is a multiplier, when it is a tax, and how to tell the difference before you sign the bill. Stage 5 is production — the part of the system that wakes the on-call, the dashboards that prevent it, and the postmortem ritual that turns every 3am page into a regression test.

Each course in the path follows the same lesson shape, designed to keep cognitive load low and the operating-manual feel high:

  1. The move — one line, the JTBD this lesson teaches.
  2. The picture — a diagram or sketch of the mental model. Some are interactive, some are static, all are version-controlled in this repo.
  3. Why it matters now — two or three sentences on what changed in 2024–2026 that makes this lesson current rather than 2019-era folklore.
  4. A source you should trust — one or two citations, each with a sentence on why the source is credible, not just a URL.
  5. A recipe — a ten- to thirty-line snippet or checklist you can use today.
  6. The smell of it going wrong — three to five bullets on failure modes you should be able to pattern-match before they happen.
  7. A judgment call from real work — one anecdote from PL, Ostronaut, talvinder.com, or sideb.club. What we tried, what broke, what we learned. The credibility layer.

The fifty lessons sit on top of the AI Manual's twelve-chapter spine. The Manual gives you the conceptual base — when AI is the right answer, how to climb the model-selection ladder, how to design for hallucination, how to think about cost and safety. This path takes you from "I can spec an AI feature" to "I can spec a system that runs autonomously, fails gracefully, and proves it works."

Work through the courses in order the first time. After that, treat the path as a reference — when an eval suite is the blocker, return to Course 2; when the on-call is paging, return to Course 5; when somebody on your team proposes a six-agent swarm for a problem one agent could solve, return to Course 4 before signing off.

A note on currency: the field is moving fast enough that any path written in 2026 will need a refresh by 2027. The lessons here are written to age well — they cite primary sources, they teach the loop rather than the framework du jour, and the case studies are chosen because the patterns they illustrate will outlast the specific products. When a model name, an SDK, or a vendor name dates, the underlying judgment will not.

The path is opinionated about what it does not teach: it does not teach "how to write a prompt" (covered in the AI Manual), it does not teach "AI ethics as a separate topic" (covered as constraints inside each course), and it does not climax in a "build your own ChatGPT" capstone (the wrong frame — the work is building the harness, not building the model). When you finish the fifth course, you should be able to walk into a vendor demo, a leadership review, or a postmortem and ask the question that turns the room toward signal: show me the harness.


Stage 1 — The agent loop, demystified

An "AI agent" is not magic. It is a loop with a model, a context, a tool-call surface, a memory layer, and a stopping condition — and every production failure traces back to one of those five pieces. Before you can spec a multi-agent system, you have to see the single-agent loop clearly enough to draw it from memory.
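The five pieces above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `run_loop`, `AgentState`, `scripted_model`, and the `search` tool are hypothetical names, and the "model" is a scripted function standing in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Model = Callable[[list], dict]  # assumption: a model maps context to one action


@dataclass
class AgentState:
    context: list = field(default_factory=list)  # what the model sees each turn
    memory: dict = field(default_factory=dict)   # last result per tool, kept across turns
    steps: int = 0


def run_loop(model: Model, tools: dict, task: str, max_steps: int = 10):
    """Run one agent until it finishes or exhausts its step budget."""
    state = AgentState(context=[f"task: {task}"])
    while state.steps < max_steps:              # stopping condition: step budget
        state.steps += 1
        action = model(state.context)           # model decides the next action
        if action["type"] == "finish":          # stopping condition: model says done
            return action["answer"], state
        tool: Optional[Callable] = tools.get(action["tool"])  # tool-call surface
        if tool is None:
            result = f"error: unknown tool {action['tool']!r}"
        else:
            result = tool(action["input"])
        state.memory[action["tool"]] = result   # memory layer
        state.context.append(f"{action['tool']} -> {result}")
    return None, state                          # budget exhausted, no answer


def scripted_model(context: list) -> dict:
    # Hypothetical stand-in for a real model: one tool call, then finish.
    if len(context) == 1:
        return {"type": "tool", "tool": "search", "input": "refund policy"}
    return {"type": "finish", "answer": f"done after {len(context) - 1} tool call(s)"}


tools = {"search": lambda query: f"3 results for {query!r}"}
answer, state = run_loop(scripted_model, tools, "summarise the refund policy")
print(answer, state.steps)
```

Each named piece is a separate line you can point at, which is the point of the exercise: when the loop misbehaves, the question is which of these lines was under-specified.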

  1. The Agent Loop — what's actually running when an agent "works for hours" (course). Course 1 of the path. Eight lessons on what's actually running when an agent "works for hours" — the loop, the context window, the tool-call surface, the memory layer, the stopping condition, and the failure modes that show up between hour one and hour six.
  2. Tool Use, Function Calling, Agents — The Maturity Ladder (manual, 12 min read). Chapter 6 of the AI Manual. The maturity ladder from single prompt to function calling to multi-step agents. Read it as the prerequisite frame — each rung adds value and adds blast radius. This path lives at the top of the ladder.
  3. Cursor — The AI Code Editor That Competed with GitHub (case, 6 min read). The clearest public anatomy of an agent loop a product audience can follow without reading code. Indexing, relevance ranking, prompt curation, model routing, and the accept/delete/modify signal as implicit online eval — every component is a product decision, not an engineering detail.
  4. Anthropic — Research Lab to Product Company (case, 7 min read). Building a consumer product around your own foundation model. Tool use, fallback, and trust instrumented at the agent layer — a worked example of where the loop, the harness, and the product all meet.

Stage 2 — Eval harnesses you can trust

A model can demo anything. An agent can ship only what an eval suite proves it does reliably. If you cannot tell whether v17 of your prompt is better than v16, you do not have a product — you have a vibes-based science fair. The eval harness is the part of the system that turns hope into a metric.
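To make the v16-versus-v17 question concrete, here is a minimal sketch of a golden-set comparison with a ship gate. Everything in it is an assumption for illustration: the four cases, the two rule-based "agents", the exact-match grader, and the 0.9 floor. A real suite would be larger and would use graders suited to the task.

```python
from typing import Callable

# Hypothetical golden set: inputs paired with the answer a reviewer signed off on.
GOLDEN_SET = [
    {"input": "refund for order 123", "expected": "refund"},
    {"input": "where is my parcel", "expected": "tracking"},
    {"input": "cancel my subscription", "expected": "cancel"},
    {"input": "parcel arrived broken, money back", "expected": "refund"},
]


def pass_rate(agent: Callable[[str], str], cases: list) -> float:
    """Fraction of cases where the agent's output exactly matches the expected label."""
    passed = sum(1 for case in cases if agent(case["input"]) == case["expected"])
    return passed / len(cases)


def agent_v16(text: str) -> str:
    # Stand-in for prompt v16: misses refunds phrased as "money back".
    if "refund" in text:
        return "refund"
    if "parcel" in text:
        return "tracking"
    return "cancel"


def agent_v17(text: str) -> str:
    # Stand-in for prompt v17: adds the "money back" phrasing.
    if "refund" in text or "money back" in text:
        return "refund"
    if "parcel" in text:
        return "tracking"
    return "cancel"


def should_ship(candidate, baseline, cases, floor: float = 0.9) -> bool:
    """Ship only if the candidate clears the floor and does not regress the baseline."""
    candidate_rate = pass_rate(candidate, cases)
    return candidate_rate >= floor and candidate_rate >= pass_rate(baseline, cases)


print(pass_rate(agent_v16, GOLDEN_SET), pass_rate(agent_v17, GOLDEN_SET))
print(should_ship(agent_v17, agent_v16, GOLDEN_SET))
```

The gate has two conditions on purpose: an absolute floor answers "is it good enough", and the baseline comparison answers "did this change make it worse".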

  1. Eval Harnesses — how you know your agent isn't lying to itself (course). Course 2 of the path. Ten lessons on building eval suites for autonomous systems — task suites, graders, golden sets, regressions, leaderboard hygiene, and the difference between SWE-bench numbers and "would I let this ship on Friday."
  2. Eval Before Launch (manual, 15 min read). Chapter 4 of the AI Manual. The small, sharp eval set you build before any AI feature touches a real user — and run as a regression suite on every prompt and index change after. Course 2 is the agent-scale extension of this chapter.
  3. Klarna — When AI Customer Support Was Half-Right (case, 17 min read). The cleanest cautionary anatomy of a weak eval harness — measuring deflection rate and resolution time while treating true resolution quality as a soft claim. The gap between the Feb 2024 launch announcement and the 2025 walk-back is what an incomplete eval suite costs in public.
  4. Harvey — Vertical AI for a High-Stakes Profession (case, 8 min read). Vertical AI in a high-accountability domain. Eval design, auditability, and how a narrow fine-tuned model became the moat over a general API. The reason buyers trust this product is the harness.

Stage 3 — Tools and memory, designed deliberately

The agent's effective IQ is bounded by the toolbelt you hand it and the notebook you let it keep. Tools are the verbs; memory is the continuity. Most teams under-design both, then blame the model when the loop drifts at hour three. This stage is the product-design discipline applied to the agent's hands and its mind.
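One way to treat the toolbelt as a designed surface is to route every call through a registry that validates it. The sketch below is illustrative only: `Tool`, `ToolBelt`, `lookup_order`, and `issue_refund` are hypothetical names, and the append-only `notebook` is a deliberately simple stand-in for a memory layer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    required_args: tuple
    side_effect: bool          # True if the call changes something in the world
    run: Callable[..., str]


class ToolBelt:
    def __init__(self, tools: list):
        self._tools = {tool.name: tool for tool in tools}
        self.notebook: list = []   # append-only record of successful calls

    def call(self, name: str, args: dict, approved: bool = False) -> str:
        tool = self._tools.get(name)
        if tool is None:
            # The agent asked for a tool that does not exist: tell it what does.
            return f"error: no tool named {name!r}; available: {sorted(self._tools)}"
        missing = [arg for arg in tool.required_args if arg not in args]
        if missing:
            return f"error: {name} is missing args {missing}"
        if tool.side_effect and not approved:
            return f"error: {name} has side effects and needs approval"
        result = tool.run(**args)
        self.notebook.append(f"{name}({args}) -> {result}")
        return result


belt = ToolBelt([
    Tool("lookup_order", ("order_id",), False, lambda order_id: f"order {order_id}: shipped"),
    Tool("issue_refund", ("order_id",), True, lambda order_id: f"refunded {order_id}"),
])

print(belt.call("lookup_order", {"order_id": "A1"}))
print(belt.call("send_email", {}))                       # tool that was never registered
print(belt.call("issue_refund", {"order_id": "A1"}))     # blocked until approved
print(belt.call("issue_refund", {"order_id": "A1"}, approved=True))
```

Errors come back as strings rather than exceptions so the agent can read them and correct course, which is a design choice for this sketch rather than a rule.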

  1. Tool & Memory Design — when the agent's effective IQ depends on its toolbelt (course). Course 3 of the path. Ten lessons on tool surface design, MCP, memory architectures (episodic, semantic, procedural), retrieval as a memory substrate, and the failure modes when an agent forgets, repeats itself, or hallucinates a tool that does not exist.
  2. RAG, Fine-Tune, or Context Window? (manual, 16 min read). Chapter 7 of the AI Manual. The decision tree for giving a model your data. Course 3 extends this from "the model" to "the agent that comes back to the same data tomorrow."
  3. Anthropic — Research Lab to Product Company (case, 7 min read). The most detailed public account of principled tool and memory design — explicit permission prompts before side-effect actions, visible tool-call transparency, graceful degradation when retrieval fails, and the Artifacts separation of conversation from output. Tool design as a trust problem made product-shaped.

Stage 4 — Multi-agent orchestration (and when not to)

The instinct to "add another agent" almost always loses. Cognition's "Don't Build Multi-Agents" and Anthropic's counter-essay frame the live debate; both are right in different regimes. The skill is knowing which regime you are in before the swarm becomes a coordination problem you cannot debug.
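One concrete form of that coordination cost is a shared rate limit: agents fanned out in parallel draw on the same provider quota. The sketch below shows one possible mitigation, a single `asyncio.Semaphore` pooled across workers. It is an assumption-laden illustration, not a description of any incident or product: `worker`, `fan_out`, and the concurrency numbers are hypothetical, and `asyncio.sleep` stands in for a real model or API call.

```python
import asyncio


async def worker(name: str, quota: asyncio.Semaphore, stats: dict) -> str:
    async with quota:                      # wait for a slot in the shared quota
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)          # stand-in for a model or API call
        stats["in_flight"] -= 1
    return f"{name}: done"


async def fan_out(n_workers: int, max_concurrent: int):
    """Run n_workers agents in parallel while capping concurrent calls."""
    quota = asyncio.Semaphore(max_concurrent)   # one pool shared by every worker
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(worker(f"agent-{i}", quota, stats) for i in range(n_workers))
    )
    return results, stats["peak"]


results, peak = asyncio.run(fan_out(n_workers=6, max_concurrent=2))
print(len(results), peak)
```

Six agents finish, but never more than two calls are in flight at once. Without the shared pool, each added agent multiplies request volume against the same limit.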

  1. Multi-Agent Orchestration — fan-out, swarms, and the cost of doing both badly (course). Course 4 of the path. Ten lessons on fan-out patterns, swarms, handoffs, shared state, quota pooling, and the failure modes when two agents disagree, deadlock, or quietly do each other's work. The lived-lesson chapter is the PL parallel-triage 429 incident.
  2. Building With AI vs. Building AI Products (manual, 12 min read). Chapter 11 of the AI Manual. The strategy lens — when adding agents is a productivity choice and when it is a business choice. Course 4 is the operational layer that follows.
  3. Replit Agent — What Multi-Step Orchestration Looks Like When It Ships (case, 11 min read). Fan-out in a live coding environment — orchestrator/worker patterns, parallel-vs-sequenced sub-tasks, state passing without context blowup, and what to do when a sub-agent fails mid-sequence. The canonical multi-agent problems made concrete in a shipping product.
  4. GitHub Copilot — The First Real AI Product, and What Five Years Taught Us (case, 14 min read, optional). The bracket against the "we need ten agents" instinct. A single-agent product that resisted the urge to fan out and chose adoption durability over swarm complexity.

Stage 5 — Production harnesses: observability, recovery, the bill

Agents in production fail in ways single-prompt features do not. They burn money in their sleep, they get stuck in tool-call loops, they wake the on-call at 3am because a context window blew up on input nobody anticipated. This stage is the operational layer — the dashboards, the kill switches, the cost meters, the postmortem ritual.
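The three failure modes named above (runaway cost, tool-call loops, and the need for a manual stop) can each be turned into a check that runs before every tool call. This is a minimal sketch under stated assumptions: `RunGuard`, `HarnessStop`, the dollar budget, and the repeat threshold are all hypothetical, and real cost figures would come from your provider's usage data.

```python
from dataclasses import dataclass
from typing import Optional


class HarnessStop(Exception):
    """Raised when the harness, not the agent, decides the run is over."""


@dataclass
class RunGuard:
    max_cost_usd: float
    max_repeats: int = 3                 # identical consecutive calls tolerated
    spent_usd: float = 0.0
    killed: bool = False                 # flipped by an operator or an alert
    _last_call: Optional[tuple] = None
    _repeat_count: int = 0

    def check(self, tool: str, args: str, cost_usd: float) -> None:
        """Call before each tool invocation; raises HarnessStop to end the run."""
        if self.killed:
            raise HarnessStop("kill switch engaged")
        self.spent_usd += cost_usd       # cost meter
        if self.spent_usd > self.max_cost_usd:
            raise HarnessStop(f"cost budget exceeded: ${self.spent_usd:.2f}")
        call = (tool, args)              # loop detector: same call, back to back
        self._repeat_count = self._repeat_count + 1 if call == self._last_call else 1
        self._last_call = call
        if self._repeat_count >= self.max_repeats:
            raise HarnessStop(f"tool-call loop: {tool} repeated {self._repeat_count} times")


def run_until_stopped(guard: RunGuard, calls: list) -> str:
    """Feed a sequence of (tool, args, cost) calls through the guard."""
    for tool, args, cost in calls:
        try:
            guard.check(tool, args, cost)
        except HarnessStop as stop:
            return str(stop)
    return "completed"


stuck = [("search", "q=refunds", 0.10)] * 5
print(run_until_stopped(RunGuard(max_cost_usd=1.00), stuck))
```

The stop reason is a plain string so it can go straight into a trace or an alert, and later into the regression test written after the postmortem.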

  1. Production Harnesses — observability, recovery, the bill (course). Course 5 of the path. Twelve lessons on tracing, structured eval logging, cost meters, kill switches, recovery, replay, the Sentry-plus-Checkly stack that PL uses, and the postmortem ritual that turns every 3am page into a regression test.
  2. Cost & Latency as First-Class Product Constraints (manual, 13 min read). Chapter 9 of the AI Manual. Cost-per-inference and p95 latency as dashboard numbers, not afterthoughts. Course 5 extends this from "the model" to "the agent that runs for six hours and may or may not finish."
  3. Safety, Privacy, Compliance for Shipping Teams (manual, 15 min read). Chapter 10 of the AI Manual. The controls that keep a shipping team out of the news — PII scrubbing, audit logs, human approval gates. Read it alongside Course 5's lessons on kill switches and recovery.
  4. Air Canada — The Chatbot That Made a Promise the Airline Had to Keep (case, 15 min read). The single best public case for what a production harness failure looks like from the outside. The tribunal's ruling — "a chatbot's words are the company's words" — is the production-stage constraint every harness engineer needs to design against. Pair it with Course 5's lessons on kill switches and human-fallback escalation.
  5. DeepSeek — The Week the AI Capex Story Broke (case, 16 min read, optional). Inference economics at frontier scale. Cost as a strategic lever, not just a P&L line. Pair it with Course 5's cost-meter lessons.