Harness Engineering: Single-Agent and Multi-Agent Systems Explained

When I built my first "AI agent," I made the same mistake almost everyone makes. I assumed the magic lived in the model. Pick a smart enough model, write a clever enough prompt, and the agent would more or less build itself.

It didn't. My agent would do three useful things, then confidently announce it was finished while the actual task was nowhere near done. Other times it would loop forever, calling the same tool with slightly different arguments like a person rattling a locked door. The model wasn't the problem. The model was fine. What I was missing was everything around the model — the code that decides when it runs, what it can see, which tools it can reach for, and how it knows when to stop.

That "everything around the model" has a name now: the harness. And learning to design it well is its own discipline, which people have started calling harness engineering. This article is the explainer I wish I'd had on day one. We'll start from the absolute basics, build a working single-agent system in code, and then walk up to multi-agent systems — when they help, when they hurt, and what the people building this stuff for a living actually recommend.

No prior agent experience assumed. Just bring a little Python.

First, the vocabulary

The agent world is drowning in jargon, and a lot of it is used loosely. Here are the handful of words you genuinely need, defined plainly.

LLM (Large Language Model). The text-prediction engine — Claude, GPT, Gemini, and so on. On its own, an LLM is a function: text goes in, text comes out. It has no memory between calls, can't click anything, and can't run code. It's a very smart brain in a jar.

Tool. A function you let the model call. A weather lookup, a database query, a file write, a web search. Tools are how the brain in the jar grows hands. You describe the tool to the model (its name, what it does, what arguments it takes), and when the model wants to use it, it emits a structured request instead of plain text.

The loop. This is the single most important idea in the whole article, so I'll repeat it later. An agent is not one model call. It's a *loop*: the model thinks, optionally calls a tool, sees the result, thinks again, and repeats until the job is done. Anthropic's working definition of an agent is exactly this — an LLM "autonomously using tools in a loop." That phrasing is worth tattooing somewhere.

Context window. The model's short-term memory: the total amount of text (your instructions, the conversation, tool results) it can consider at once. It's finite. Fill it up and the model starts forgetting the beginning. Managing what goes into it is half the battle.
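To make "finite" concrete, here is a minimal sketch of one way a harness might keep a conversation under a budget: keep the original task, then keep as many of the newest messages as fit. Both function names are hypothetical, and the four-characters-per-token estimate is a crude assumption rather than a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: assume roughly 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], budget: int) -> list[dict]:
    """Keep the first message (the task) plus the newest messages that fit."""
    if not messages:
        return []
    first, rest = messages[0], messages[1:]
    used = estimate_tokens(first["content"])
    kept = []
    for msg in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [first] + kept[::-1]  # restore chronological order
```

A real harness would also need to keep each tool request together with its result when dropping messages; this sketch ignores that and treats every message as plain text.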

Context engineering. The practice of curating what's in the context window at each step: putting the right information in front of the model at the right time, and keeping the junk out. Cognition's Walden Yan has called this "effectively the #1 job of engineers building AI agents," and it's a more dynamic cousin of old-school prompt engineering.

Workflow vs. agent. Anthropic draws a sharp line here that's worth internalizing early. In a workflow, the LLM and tools are orchestrated through *predefined code paths*: you, the developer, hard-code the steps. In an agent, the LLM *dynamically directs its own process*, deciding for itself which tools to use and in what order. Workflows are predictable; agents are flexible. Most real systems are a blend.
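The difference is easier to see without a model in the picture at all. In this sketch, two toy steps are either chained in a fixed order (the workflow) or picked one at a time by a chooser function (the agent). Every name here is made up for illustration, and scripted_chooser is a hard-coded stand-in for the decision an LLM would make in a real system.

```python
def summarize(text: str) -> str:
    return text[:20]  # stand-in for a real summarization step


def translate(text: str) -> str:
    return text.upper()  # stand-in for a real translation step


STEPS = {"summarize": summarize, "translate": translate}


def workflow(text: str) -> str:
    """Workflow: the developer fixes the order of the steps in code."""
    return translate(summarize(text))


def agent(text: str, choose_next) -> str:
    """Agent: a chooser (the model, in a real system) picks each step."""
    done = []
    while True:
        step = choose_next(text, done)
        if step is None:  # the chooser decides when the job is finished
            return text
        text = STEPS[step](text)
        done.append(step)


def scripted_chooser(text, done):
    """Hypothetical stand-in for an LLM decision: skip summarizing short text."""
    if len(text) > 20 and "summarize" not in done:
        return "summarize"
    if "translate" not in done:
        return "translate"
    return None
```

On long input both paths do the same work. On short input the agent skips the summarize step, which is exactly the flexibility (and the unpredictability) the definition is pointing at.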

The harness. The scaffolding around the model that turns a raw LLM into a working agent: the loop, the tool definitions, the memory and context management, the error handling, the guardrails, the stopping conditions. The model is the engine; the harness is the rest of the car.

Harness engineering. The discipline of designing that scaffolding well. As Anthropic puts it, every component of a harness "encodes an assumption about what the model can't do on its own" — which is a quietly profound idea we'll come back to.

Hold onto those last three. They're the spine of everything below.

What a harness actually is (an analogy)

Imagine you've hired a brilliant, fast, slightly overconfident new engineer. They can write code, reason through problems, and learn quickly. But they have a peculiar condition: every few minutes their memory resets completely. They forget what they were doing, what they've already tried, and what the goal was.

You could fire them. Or you could build them a system: a clear task list pinned to the wall, a notebook where they write down what they finished, a rule that they tackle one item at a time, and a colleague who checks their work. With that scaffolding, the same forgetful-but-brilliant person becomes wildly productive.

That scaffolding is the harness. The LLM is the brilliant amnesiac. Harness engineering is the art of building the task list, the notebook, the rules, and the colleague.

Figure 2 — Anatomy of a harness: the model sits at the core, surrounded by the components a harness provides — the loop, tool definitions, context and memory management, stopping conditions, guardrails, and error handling. The model is the engine; the harness is the rest of the car.

This analogy isn't even a stretch — it's roughly how Anthropic describes building harnesses for long-running coding agents. They drew inspiration, in their words, from "knowing what effective software engineers do every day": break the work into tractable chunks, write progress to a file (they use a claude-progress.txt), and lean on git history so a fresh context window can quickly understand the state of the work. The model forgets; the harness remembers for it.

The simplest possible harness: a single agent

Let's build one. The cleanest way to understand an agent is to write the loop yourself, by hand, before reaching for any framework.

Here's a complete single-agent system using Anthropic's API. It has one tool. Read it once for shape, then we'll go line by line.

import anthropic

client = anthropic.Anthropic()  # reads your API key from the environment

# 1. DESCRIBE the tools the model is allowed to use.
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a given city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Lisbon'"}
            },
            "required": ["city"],
        },
    }
]

# 2. IMPLEMENT the actual code behind each tool name.
def get_weather(city: str) -> str:
    # In real life this would hit a weather API.
    return f"It's 21°C and sunny in {city}."

TOOL_IMPLEMENTATIONS = {"get_weather": get_weather}


def run_agent(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    while True:  # <-- this loop IS the agent
        response = client.messages.create(
            model="claude-sonnet-4-6",  # use whatever the current model is when you read this
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )

        # If the model didn't ask for a tool, it's done thinking. Return the answer.
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")

        # Otherwise: run every tool the model requested, then feed results back in.
        messages.append({"role": "assistant", "content": response.content})

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                fn = TOOL_IMPLEMENTATIONS[block.name]
                result = fn(**block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })
        messages.append({"role": "user", "content": tool_results})


print(run_agent("What should I wear in Lisbon today?"))

Figure 1 — The single-agent loop: an LLM thinks, decides whether to call a tool, your code runs the tool and appends the result to the message list, and the cycle repeats until the model stops asking for tools and returns a final answer.

Now the walkthrough, because the whole concept of an agent is hiding in about fifteen lines.

The tools list is just a description. We're not giving the model our get_weather function; we're giving it a *menu*. The model reads the name, the description, and the input schema, and decides on its own whether to order off that menu. This separation matters: the model proposes, your code disposes. The model can *ask* to call get_weather, but it can never actually run anything. Your harness runs it. That gap is where all your control lives.
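Because your code sits in that gap, it can also decide what happens when a request goes wrong. Here is a hedged sketch of a dispatch helper that never crashes the loop: an unknown tool name or a failing call is reported back to the model as an error result. The helper name execute_tool_call and the toy registry are assumptions for illustration; the is_error field is part of the tool_result block in Anthropic's Messages API.

```python
def execute_tool_call(name: str, args: dict, tool_use_id: str, registry: dict) -> dict:
    """Run one requested tool and always return a tool_result block."""
    result = {"type": "tool_result", "tool_use_id": tool_use_id}
    fn = registry.get(name)
    if fn is None:
        # The model asked for something that is not on the menu.
        return {**result, "content": f"Unknown tool: {name}", "is_error": True}
    try:
        return {**result, "content": str(fn(**args))}
    except Exception as exc:
        # Report the failure to the model instead of crashing the harness.
        return {**result, "content": f"{type(exc).__name__}: {exc}", "is_error": True}


# Toy registry for illustration only.
REGISTRY = {"get_weather": lambda city: f"Sunny in {city}"}
```

In the loop above, this would stand in for the two lines that look up the function and call it directly.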

The `while True` loop is the agent. Seriously, that's it. Strip away the model and the tools and you're left with a loop that says: ask the model what to do; if it wants a tool, run the tool and tell it the result; repeat until it stops asking for tools. People build entire startups on top of this loop. The loop is the whole trick.

The stop condition (stop_reason != "tool_use") is how the agent knows it's finished. When the model stops requesting tools and just produces text, we hand that text back. In toy examples this is trivial. In real systems, *knowing when to stop* is one of the hardest parts of harness engineering — agents love to either quit early or never quit at all.
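Two cheap guards cover a lot of the "never quit" cases: a step budget and a check for identical repeated tool calls. The sketch below is model-free so the logic is easy to follow. next_action is a hypothetical stand-in for the model call, and the return shape is my own invention, not anything from an SDK.

```python
import json


def run_with_limits(next_action, run_tool, max_steps: int = 10, max_repeats: int = 2):
    """Loop until the model answers, the step budget runs out, or it repeats itself.

    next_action(history) stands in for the model call. It returns either
    ("answer", text) or ("tool", name, args).
    """
    history, seen = [], {}
    for _ in range(max_steps):
        action = next_action(history)
        if action[0] == "answer":
            return {"status": "done", "answer": action[1]}
        _, name, args = action
        key = (name, json.dumps(args, sort_keys=True))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:  # same tool, same arguments, again
            return {"status": "stuck", "answer": None}
        history.append((name, args, run_tool(name, args)))
    return {"status": "out_of_steps", "answer": None}
```

The repeat check only catches exact duplicates. An agent that keeps tweaking its arguments slips past it, which is why the step budget is there as a backstop.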

Feeding results back in is the part beginners forget. After a tool runs, you append the result to messages so the next loop iteration includes it. This is the agent's working memory. Forget this step and the model is Dory from *Finding Nemo* — asking the same question on every loop, never learning the answer.

That's a complete agent. One model, one loop, one growing list of messages. This is what people mean by a single-agent system: a single continuous thread of reasoning, with one context that accumulates everything as it goes. One worker, doing the whole job start to finish, never losing the thread of what came before.

For a huge number of tasks, this is all you need. Which brings us to a principle you should pin above your desk.

The most important rule in harness engineering

Anthropic's *Building Effective Agents* essay distills years of working with teams into one line: find the simplest solution possible, and only increase complexity when needed. OpenAI's own *Practical Guide to Building Agents* says effectively the same thing: maximize a single agent's capabilities first, because more agents mean more complexity and overhead, and "often a single agent with tools is sufficient."

This sounds obvious. It is not how most people behave. The agent ecosystem has a gravitational pull toward complexity — multi-agent swarms, elaborate frameworks, layers of orchestration — because complexity feels like sophistication. Resist it. The teams shipping reliable agents in production are almost universally the ones who started with the boring single-loop version and added structure only when something concretely broke.

Which connects to my favorite idea in this whole space. Anthropic frames it like this: every component you add to a harness encodes an assumption about what the model can't do on its own. A retry wrapper assumes the model can't recover from a failed call. A rigid step-by-step pipeline assumes it can't plan. A second agent assumes one agent can't hold the whole task.

Some of those assumptions are true today. But models keep getting better, and stale assumptions become dead weight — scaffolding that was load-bearing last year and is just friction this year. So the harness-engineering mindset isn't "add structure." It's "add the minimum structure, and keep questioning whether you still need it." A good harness is something you periodically try to delete parts of.
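One way to keep scaffolding deletable is to make each piece a thin wrapper with an off switch. Here is a minimal sketch using the retry example from above. The decorator name and its enabled flag are my own assumptions, not a pattern taken from any of the sources.

```python
import time
from functools import wraps


def with_retries(attempts: int = 3, delay_s: float = 0.0, enabled: bool = True):
    """Decorator that retries a flaky call; enabled=False removes the scaffolding."""
    def decorate(fn):
        if not enabled:
            return fn  # the assumption no longer holds, so add nothing
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: surface the real error
                    time.sleep(delay_s)
        return wrapper
    return decorate
```

Flipping enabled to False in one place is a cheap experiment: if nothing gets worse, the wrapper was dead weight and can go.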

Where a single agent starts to strain

So when do you need more? A single agent runs out of road in a few recognizable ways. Watch for these symptoms.

Tool overload. Not just too many tools, but too many *similar* tools. When your agent has fifteen functions and four of them sound alike, the model starts picking the wrong one. OpenAI specifically flags overlapping tools — not just the raw count — as a signal to split your system up.
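A rough way to spot sound-alike tools before the model trips over them is to compare their descriptions. This sketch flags pairs whose descriptions share most of their words (Jaccard similarity). It is a crude proxy I'm assuming for illustration, not a check OpenAI's guide prescribes, and the threshold is arbitrary.

```python
from itertools import combinations


def word_set(description: str) -> set[str]:
    return {w.strip(".,").lower() for w in description.split()}


def overlapping_tools(tools: list[dict], threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Return tool pairs whose descriptions share many words (Jaccard similarity)."""
    flagged = []
    for a, b in combinations(tools, 2):
        wa, wb = word_set(a["description"]), word_set(b["description"])
        union = wa | wb
        score = len(wa & wb) / len(union) if union else 0.0
        if score >= threshold:
            flagged.append((a["name"], b["name"], round(score, 2)))
    return flagged


# Hypothetical tool menu for illustration.
TOOLS = [
    {"name": "search_orders", "description": "Search customer orders by email"},
    {"name": "find_orders", "description": "Find customer orders by email address"},
    {"name": "get_weather", "description": "Get the current weather for a city"},
]
```

Here overlapping_tools(TOOLS) flags search_orders and find_orders and leaves get_weather alone. Anything it flags is a candidate for merging, renaming, or a sharper description.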

Conditional sprawl. When your system prompt grows into a thicket of if-this-then-that branches — "if the user is a refund case do X, unless it's over $500, in which case Y, but for enterprise accounts Z" — you're asking one prompt to be ten prompts in a trench coat. OpenAI's guide suggests that when prompt templates get this unwieldy, each logical segment may deserve its own agent. (Their gentler alternative, worth trying first: a single flexible base prompt with policy variables you swap in, rather than ten hand-maintained prompts.)
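The "one base prompt with policy variables" idea can be sketched with nothing more than the standard library's string.Template. The $500 limit echoes the refund example above; the other policy values and all of the names are invented for illustration.

```python
from string import Template

# One base prompt; "$$" is how Template writes a literal dollar sign.
BASE_PROMPT = Template(
    "You are a support agent for $account_tier accounts.\n"
    "Refunds up to $$${refund_limit} may be approved directly.\n"
    "Above that, $escalation."
)

# Hypothetical policy table: the only part that varies per case.
POLICIES = {
    "standard": {
        "account_tier": "standard",
        "refund_limit": 500,
        "escalation": "hand off to a human reviewer",
    },
    "enterprise": {
        "account_tier": "enterprise",
        "refund_limit": 5000,
        "escalation": "notify the account manager",
    },
}


def build_prompt(policy_name: str) -> str:
    """Fill the shared template with one policy's variables."""
    return BASE_PROMPT.substitute(POLICIES[policy_name])
```

Adding a new case means adding a row to the table, not another branch to the prompt.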

The context window fills up. This is the big one. Some tasks simply generate more relevant information than fits in a single context window — sprawling research questions, large codebases, anything that touches dozens of sources. One agent, one context, eventually hits a wall.

You need genuine parallelism. A single agent is sequential by nature: it does one thing, sees the result, does the next thing. If your task is "investigate these eight independent leads," doing them one at a time is slow, and worse, the early leads clog the context before you even reach the later ones.

That last pair — context limits and parallelism — is the real doorway to multi-agent systems. Not because multi-agent is fancier, but because it gives you something a single context physically cannot: more total working memory, explored in parallel.

Multi-agent systems

A multi-agent system is what it sounds like: multiple agents — each its own LLM-in-a-loop, each with its own separate context window — working together on one problem. The key word is *separate context*. That's the resource you're really buying.

The most common and most useful pattern is the orchestrator-worker pattern (sometimes called manager / sub-agents). One lead agent plans and delegates; several worker agents each handle a focused piece with a fresh, uncluttered context; the lead agent synthesizes their findings into a final answer.

This is exactly the architecture Anthropic described for Claude's Research feature. A LeadResearcher agent analyzes the query, decides on a strategy, and spins up several specialized subagents that search in parallel. Each subagent runs its own loop — Anthropic describes it as an OODA loop: observe what's been gathered, orient toward what's still needed, decide on a tool, act, repeat. The lead then pulls everything together (with, in their case, a separate pass just for citations).

Figure 3 — The orchestrator–worker pattern: a lead agent plans and fans the work out to several subagents that each run in parallel with their own context window, then synthesizes their findings into one answer. The catch: it can burn roughly 15× the tokens of a single chat.

Here's the shape of it in code. Notice that I'm reusing the single-agent `run_agent` from before — a worker agent is just our original single agent, pointed at a narrower task. Multi-agent systems are built out of single agents.

from concurrent.futures import ThreadPoolExecutor


def subagent_research(subtopic: str) -> str:
    """A WORKER: a focused single-agent loop with its own clean context."""
    return run_agent(
        f"Research this specific question and report concise findings:\n{subtopic}"
    )


def lead_researcher(user_query: str) -> str:
    # 1. PLAN — the orchestrator decides how to split the work.
    plan = run_agent(
        "Break this research request into 3 independent sub-questions. "
        "Return them as a numbered list and nothing else.\n\n"
        f"Request: {user_query}"
    )
    subtopics = [
        line.split(".", 1)[1].strip()
        for line in plan.splitlines()
        # Keep only lines shaped like "1. question"; skip anything else the model adds.
        if line.strip()[:1].isdigit() and "." in line
    ]

    # 2. DELEGATE — run workers in PARALLEL, each with a fresh context window.
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(subagent_research, subtopics))

    # 3. SYNTHESIZE — the orchestrator merges everything into one answer.
    joined = "\n\n".join(f"### {t}\n{f}" for t, f in zip(subtopics, findings))
    return run_agent(
        "You are the lead researcher. Synthesize these findings from your team "
        "into one coherent answer, resolving any contradictions.\n\n"
        f"Original question: {user_query}\n\n"
        f"Team findings:\n{joined}"
    )


print(lead_researcher("How have remote-work policies at big tech changed since 2020?"))

Three moves: plan, delegate, synthesize. The leverage is in step two: each worker explores its slice with a clean context window, so the system as a whole reasons over far more information than any single context could hold. Anthropic's own framing of *why* this works is refreshingly blunt: multi-agent systems excel largely because they "spend enough tokens to solve the problem." In their analysis of the BrowseComp evaluation, token usage alone explained around 80% of the performance variance. More agents means more parallel reasoning capacity, full stop.

The orchestrator-worker setup isn't the only flavor. There's also the handoff (or decentralized) pattern, where agents pass control to one another as peers rather than reporting up to a manager — good for routing. OpenAI's Agents SDK makes this a first-class concept:

from agents import Agent, Runner

spanish_agent = Agent(name="Spanish", instructions="Respond only in Spanish.")
english_agent = Agent(name="English", instructions="Respond only in English.")

triage = Agent(
    name="Triage",
    instructions="Detect the user's language and hand off to the right agent.",
    handoffs=[spanish_agent, english_agent],
)

result = Runner.run_sync(triage, "Hola, ¿cómo estás?")
print(result.final_output)

Orchestrator-worker centralizes the brain; handoffs distribute it. Both are valid; they suit different jobs. OpenAI's rule of thumb: the manager pattern fits structured workflows, the decentralized pattern fits dynamic routing.

The catch (there's always a catch)

Multi-agent systems are not a free upgrade. They're a trade, and the bill comes due in a few places.

Tokens, and a lot of them. Anthropic found their multi-agent research system burned roughly 15× more tokens than a normal chat interaction. That's the cost of all those parallel contexts. The system delivered real quality gains (they reported their multi-agent setup outperforming single-agent Claude Opus 4 by 90.2% on their internal research evaluation), but it only makes economic sense for high-value tasks where the answer is worth the spend. Asking a five-agent swarm to look up a fact is like chartering a jet to cross the street.
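To get a feel for the bill, here is the back-of-the-envelope arithmetic. Only the 15× multiplier comes from the figure quoted above; the 4,000-token chat and the $5 per million tokens are placeholder assumptions, so plug in your own numbers.

```python
MULTI_AGENT_MULTIPLIER = 15  # the rough figure quoted above


def estimated_cost(chat_tokens: int, price_per_million: float, multiplier: int = 1) -> float:
    """Dollar cost of one task, given a plain chat's token count and a token price."""
    return chat_tokens * multiplier * price_per_million / 1_000_000


# Placeholder inputs: a 4,000-token chat at $5 per million tokens.
chat = estimated_cost(4_000, price_per_million=5.0)
swarm = estimated_cost(4_000, price_per_million=5.0, multiplier=MULTI_AGENT_MULTIPLIER)
print(f"chat: ${chat:.2f}  multi-agent: ${swarm:.2f}")
# prints: chat: $0.02  multi-agent: $0.30
```

Thirty cents is nothing for a research report and a lot for a fact lookup repeated a million times a day.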

Coordination is genuinely hard. The moment you have multiple agents, you inherit every problem distributed systems have wrestled with for decades: handoffs, conflicting decisions, partial failures, state that has to stay consistent across agents that can't see each other. Anthropic is candid that they had to build systems to *resume from failure points* rather than restart, and pair the model's adaptability with old-fashioned deterministic safeguards like retries and checkpoints.
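The "resume, don't restart" idea fits in a few lines. This is a minimal sketch assuming each subtask has a unique string name and a JSON file is an acceptable checkpoint; run_with_checkpoints is a hypothetical helper, not how Anthropic's system is actually built.

```python
import json
from pathlib import Path


def run_with_checkpoints(subtasks: list[str], worker, checkpoint: Path) -> dict:
    """Run each subtask once; results already saved in the checkpoint are reused."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for task in subtasks:
        if task in done:
            continue  # finished in an earlier run
        done[task] = worker(task)  # may raise; earlier results are already on disk
        checkpoint.write_text(json.dumps(done))
    return done
```

If a worker blows up on the third subtask, the next run skips the first two and picks up where the failure happened.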

The shared-context problem. This is the deepest objection, and it deserves real weight. The team at Cognition (the people behind the Devin coding agent) published a pointed counterpoint titled, with no ambiguity, *Don't Build Multi-Agents*. Their core argument: parallel agents are fragile because *actions carry implicit decisions*, and sub-agents can't see each other's implicit decisions. Their now-famous illustration: ask two sub-agents to build a Flappy Bird clone in parallel, and one might build a Super Mario–style background while the other builds a bird that doesn't match it at all, because neither shares the context of the other's choices. The pieces don't fit, because nobody agreed on what they were building.

If that sounds like it directly contradicts Anthropic, here's the fun part: it mostly doesn't.

The debate that turned out to be agreement

In June 2025, within roughly a day of each other, Cognition published *Don't Build Multi-Agents* and Anthropic published *How We Built Our Multi-Agent Research System*. The titles read like a prizefight. The actual content reads like two people describing the same elephant from different ends.

The reconciliation comes from one distinction, articulated cleanly by Philipp Schmid: it's not really single vs. multi-agent. It's read vs. write.

Read-heavy tasks — research, search, analysis, gathering information — *parallelize beautifully.* Ten agents reading ten sources don't step on each other; you just collect what they each found. This is precisely Anthropic's research use case, and multi-agent shines.
Write-heavy tasks — generating code, editing a document, producing one coherent artifact — *parallelize terribly,* because every writer is making implicit decisions the other writers need to respect. This is precisely Cognition's coding use case, and single-threaded wins.

Look closely and even Anthropic's design respects Cognition's warning. In Claude Code, sub-agents famously never write code — they investigate and report back, and a single thread does the actual writing. Read in parallel; write in one place. Both teams, in the end, are saying: centralize the decisions, and only parallelize the parts where independent work won't collide.

So the grown-up takeaway isn't "multi-agent good" or "multi-agent bad." It's: match the architecture to the shape of the task. Which we can now turn into an actual decision procedure.

A practical decision framework

When you're staring at a new project, walk down this list in order. Stop at the first honest "yes."

Figure 5 — Choosing an architecture, stop at the first "yes": fixed known steps → a plain-code workflow; one agent with good tools is enough → a single agent (the default); read-heavy, parallel, or overflows one context window → multi-agent; write-heavy and must stay coherent → single-threaded with context management.

1. Can a workflow (plain code) do it? If the steps are known and fixed, don't use an agent at all. A predictable if/else pipeline is cheaper, faster, and infinitely more debuggable than an LLM deciding things. Reserve agents for genuine ambiguity: judgment calls, messy unstructured input, decisions that resist hard-coded rules.
2. Can a single agent with good tools do it? This is your default for real agentic work. Start here. Most tasks never need more. Invest your energy in clean tool descriptions, tight instructions, and solid context management before you even think about a second agent.
3. Is it read-heavy and parallelizable, or does it overflow one context window? *Now* multi-agent earns its keep: research, broad search, breadth-first exploration across many independent sources. Reach for orchestrator-worker.
4. Is it write-heavy, producing one coherent artifact? Stay single-threaded. If it's huge, use context management: summarize and compress the running history rather than splitting the work across parallel writers who'll disagree with each other.
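If it helps to see the checklist as code, here it is as a plain function. The argument names are my own shorthand for the four questions, and the final fallback to a single agent is an assumption based on the "start here" advice rather than something the list spells out.

```python
def choose_architecture(
    fixed_steps: bool,
    single_agent_enough: bool,
    read_heavy_or_overflows: bool,
    write_heavy: bool,
) -> str:
    """Walk the checklist in order and stop at the first 'yes'."""
    if fixed_steps:
        return "workflow"
    if single_agent_enough:
        return "single agent"
    if read_heavy_or_overflows:
        return "multi-agent (orchestrator-worker)"
    if write_heavy:
        return "single-threaded with context management"
    # Assumption: when nothing else applies, fall back to the default.
    return "single agent"
```

The order of the checks is the point: a task that could be a workflow never reaches the multi-agent branch, no matter how read-heavy it is.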

And one structural trick worth its own mention: for mixed tasks, separate the phases. Do the reading multi-agent and in parallel, then hand a clean summary to a single agent for the writing. You get parallelism where it's safe and coherence where it's needed. That "read in parallel, write in one place" split is, quietly, one of the most useful patterns in the entire field.
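The phase split is small enough to sketch in full. Here reader and writer are placeholders for agent calls (in a real system each could be a single-agent loop), and the fake versions exist only so the shape can be run without a model.

```python
from concurrent.futures import ThreadPoolExecutor


def read_then_write(sources: list[str], reader, writer) -> str:
    """Fan out the reading, then hand every note to one writer."""
    with ThreadPoolExecutor() as pool:
        notes = list(pool.map(reader, sources))  # results keep the order of sources
    return writer(notes)  # a single thread makes all the writing decisions


def fake_reader(source: str) -> str:
    """Stand-in for a research sub-agent."""
    return f"{source}: {len(source)} chars"


def fake_writer(notes: list[str]) -> str:
    """Stand-in for the one agent that produces the final artifact."""
    return "\n".join(notes)
```

The readers never see each other's work and don't need to. The writer sees all of it, so every decision about the final artifact is made in one place.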

A taste of long-running harnesses

One more flavor of harness engineering, because it's where the discipline gets most interesting: keeping an agent productive across many context windows on a task that takes hours.

You can't hold an hours-long task in one context window. So the harness has to give the agent a way to remember across resets — which loops us right back to our forgetful-brilliant-engineer analogy. The pattern Anthropic uses: an initializer agent runs once, sets up the environment, and writes out a detailed plan and feature list; then a coding agent runs over and over, each time with a fresh context, grounded by a progress file and git history so it can instantly answer "what's done, what's next?"

In skeleton form:

def run_long_task(spec: str):
    # read_or_init and update_progress are placeholder helpers you write yourself.
    progress = read_or_init("progress.md", spec)  # the agent's external memory

    while not progress["done"]:
        # Each iteration starts with a FRESH context window —
        # but is grounded by the spec + the progress file.
        context = (
            f"Goal:\n{spec}\n\n"
            f"Progress so far:\n{progress['log']}\n\n"
            "Do the next single unit of work, then update the progress file."
        )
        result = run_agent(context)
        progress = update_progress("progress.md", result)

The model is amnesiac; progress.md is the notebook that survives the amnesia. Notice the constraint baked into the prompt — "the next single unit of work." That one phrase exists to fight a specific failure: agents love to try to one-shot the whole thing and then declare victory early. The harness gently forces patience. Every line of a good harness is like that — a small, deliberate correction for a specific way the raw model tends to go wrong.
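The skeleton leans on two helpers it never defines. Here is one hedged way they could look, assuming the progress file is a plain text log and that the agent is instructed to include a marker string once nothing is left to do. The marker and the file layout are conventions I made up for illustration, not Anthropic's format.

```python
from pathlib import Path

DONE_MARKER = "ALL_DONE"  # hypothetical convention the agent is told to emit


def _load(path: str) -> dict:
    """Read the log and derive the 'done' flag from the marker."""
    log = Path(path).read_text()
    return {"log": log, "done": DONE_MARKER in log}


def read_or_init(path: str, spec: str) -> dict:
    """Load the progress file, creating it from the spec on the first run."""
    if not Path(path).exists():
        Path(path).write_text(f"# Progress for: {spec}\n")
    return _load(path)


def update_progress(path: str, result: str) -> dict:
    """Append the agent's report for this iteration and reload the state."""
    with open(path, "a") as f:
        f.write(f"- {result}\n")
    return _load(path)
```

Because the state lives in a file, killing the process and starting it again resumes from the last recorded step. A sturdier version would not trust the agent's own report and would check the work (run the tests, inspect the repo) before marking anything done.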

(Anthropic also notes the flip side, the thing that keeps harness engineering humble: as models improve, some of those corrections become unnecessary. Their hosted Managed Agents work is partly a bet on building stable interfaces around harnesses precisely because the harnesses themselves keep changing. Build your scaffolding to be deletable.)

Wrapping up

Let me compress everything down to what I'd actually want a past version of myself to remember.

The model is the easy part. The harness — the loop, the tools, the memory, the stopping conditions, the guardrails — is where agents are won or lost, and designing it well is the real skill. A single agent is just an LLM using tools in a loop, and for most tasks it's all you need; start there, always. Reach for multi-agent systems when the work is read-heavy and parallelizable, or when it simply won't fit in one context window — and pay attention to the cost, because you're buying capability with a steep token bill and real coordination overhead. When the task is write-heavy and has to cohere, keep it single-threaded. And whatever you build, keep it as simple as the problem allows, because every piece of scaffolding is an assumption that might already be obsolete.

Most of all: the famous "Don't Build Multi-Agents" vs. "Here's How We Built Multi-Agents" standoff was never really a contradiction. It was the field learning, out loud and in real time, that the question isn't single or multi. It's what does this particular task actually need? Answer that honestly, build the smallest thing that meets it, and you're already doing harness engineering better than most.

Now go write a loop. That's where it all starts.

References and further reading

These are the primary sources behind this article — all worth reading in full.

Anthropic

Building Effective Agents — the foundational essay on workflows vs. agents and the "keep it simple" principle: https://www.anthropic.com/research/building-effective-agents
How We Built Our Multi-Agent Research System — orchestrator-worker, the token economics, and the engineering challenges: https://www.anthropic.com/engineering/built-multi-agent-research-system
Effective Harnesses for Long-Running Agents — initializer + coding agents, progress files, harnesses as encoded assumptions: https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
Harness Design for Long-Running Application Development — the planner/generator/evaluator architecture and simplifying harnesses: https://www.anthropic.com/engineering/harness-design-long-running-apps
Scaling Managed Agents: Decoupling the Brain from the Hands — why stable interfaces matter as harnesses change: https://www.anthropic.com/engineering/managed-agents

OpenAI

A Practical Guide to Building Agents — model/tools/instructions, when to split agents, manager vs. decentralized patterns: https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf

Cognition

Don't Build Multi-Agents (Walden Yan) — the context-isolation counterargument and the Flappy Bird example. Widely discussed; see the Hacker News thread for the debate around it: https://news.ycombinator.com/item?id=45096962

Synthesis and commentary

Philipp Schmid, Single vs. Multi-Agent System? — the read-vs-write reframing that reconciles the debate: https://www.philschmid.de/single-vs-multi-agents
Simon Willison's notes on Anthropic's multi-agent system, including the OODA-loop framing: https://simonwillison.net/2025/Jun/14/multi-agent-research-system/