Multi-Agent Systems: Orchestrating a Team of AIs

The instinct, the first time an agent struggles on a big task, is to reach for more agents. Split the work, run pieces in parallel, let each one specialize. Sometimes that instinct is exactly right — and sometimes it turns a slow single-agent job into a fast, expensive, hard-to-debug mess of five agents stepping on each other. Multi-agent systems are a real architecture with real failure modes, not a universal upgrade you bolt on whenever a task feels too big for one context window.

This post is about when splitting actually pays off, the small set of patterns that account for most working multi-agent systems in practice, and the design discipline — self-contained task prompts, explicit output contracts, no two agents writing the same file — that keeps a team of agents from becoming a team you have to babysit.

Why split work across agents at all

There are three legitimate reasons to reach for more than one agent, and it's worth naming them separately because only one of them is the reason most people assume.

Context isolation: the real reason, most of the time

A single agent working through a large, multi-part task accumulates context from every part of it: files read for step one are still sitting in the transcript when the agent starts step four, whether or not they're relevant anymore. Past a certain point, that accumulated, mostly-irrelevant context doesn't just cost tokens — it measurably degrades the agent's ability to recall and reason about what actually matters for the step in front of it. Splitting a task across separate agents, each with its own clean context scoped to one part of the problem, is fundamentally a context-management technique before it's a speed technique. This is the number one reason well-designed multi-agent systems exist, and it's the one that has nothing to do with parallelism at all — you'd want the isolation even if the subagents ran one after another.

Parallelism: real, but smaller than it looks

Running independent pieces of a task at the same time genuinely saves wall-clock time, and it's the benefit people picture first. But it only pays off when the pieces are actually independent — if worker B needs something worker A produces, running them in parallel just means B blocks anyway, and you've paid the coordination cost for nothing.
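
One way to see this is to compare the total work against the longest dependency chain. The sketch below is only an illustration: the task names and durations are made up, and `wall_clock` is a hypothetical helper that assumes the dependency graph has no cycles.

```python
from functools import lru_cache


def wall_clock(durations: dict[str, float], deps: dict[str, list[str]]) -> float:
    """Best-case wall-clock time if every task starts as soon as its dependencies finish."""

    @lru_cache(maxsize=None)
    def finish(task: str) -> float:
        # A task cannot start before the slowest thing it depends on has finished.
        start = max((finish(d) for d in deps.get(task, [])), default=0.0)
        return start + durations[task]

    return max(finish(task) for task in durations)


# Hypothetical per-worker durations, in seconds.
durations = {"A": 40.0, "B": 30.0, "C": 20.0}

independent = wall_clock(durations, {})                       # no worker waits on another
chained = wall_clock(durations, {"B": ["A"], "C": ["B"]})     # B needs A, C needs B
sequential = sum(durations.values())                          # one agent doing all three

print(independent, chained, sequential)
```

With no dependencies the run takes as long as the slowest worker. With a full chain it takes exactly as long as running everything in sequence, so the coordination cost buys nothing.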

Specialization: useful, but rarely necessary on its own

Giving one agent a harness tuned for security review and another a harness tuned for performance profiling can produce sharper results than one generalist agent trying to hold both mindsets at once. This is real, but it's usually a secondary justification layered on top of context isolation, not a reason to add agents by itself.

The price you pay

None of this is free. Every additional agent adds coordination overhead — someone has to define the handoffs, someone has to merge the results. Token cost multiplies roughly with the number of agents involved, since each one carries its own system prompt and context. Latency can go up, not down, if agents end up waiting on each other despite being "parallel" in name. And debugging a five-agent pipeline that produced a wrong answer is categorically harder than debugging one agent's single transcript. For the large majority of tasks — a single file edit, a focused bug fix, a well-scoped feature — one agent with a good harness is not just adequate, it's the better choice. Reach for multiple agents when the task is genuinely large enough, or genuinely parallel enough, to justify the overhead — not by default.

Default to one agent

If you can't articulate which of the three benefits above a multi-agent design is actually buying you for a specific task, you're probably paying coordination and token overhead for nothing. Most tasks that feel too big for one agent are too big because the task itself needs to be scoped down, not because it needs more agents.

The patterns that actually work

Almost every working multi-agent system in production reduces to one of a small number of shapes. Learning these shapes is more useful than trying to design a bespoke topology for every new task.

Orchestrator–workers

A lead agent breaks a task into scoped pieces, dispatches one to each worker, and collects the results. This is the default shape for context isolation: each worker gets a clean context scoped to exactly its piece, does the work, and reports back a compact result rather than a running narration of everything it looked at along the way.

Figure 1

Orchestrator–workers

The lead agent dispatches a scoped task to each worker and collects a compact result back — the synthesis happens once, at the hub, not inside every worker.

The critical design rule for this pattern is where synthesis happens. Workers should return data — findings, diffs, a pass/fail signal — not a polished writeup of what they think the final answer should be. If every worker tries to synthesize its own conclusion, the orchestrator ends up reconciling four different framings of the same problem instead of combining four clean pieces of evidence. Keep synthesis at the hub; keep workers reporting facts.
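
A minimal sketch of that division of labor, assuming a thread pool as the dispatch mechanism. `run_worker` is a hypothetical stand-in for a real agent call, and the `Finding` fields are illustrative rather than a fixed schema.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    worker: str
    status: str   # "pass" or "fail"
    reason: str


def run_worker(name: str, task: str) -> Finding:
    """Hypothetical stand-in for an agent call. It returns data, not a written verdict."""
    failed = "TODO" in task
    reason = "found TODO marker" if failed else "nothing flagged"
    return Finding(name, "fail" if failed else "pass", reason)


def orchestrate(tasks: dict[str, str]) -> dict:
    # Fan out: each worker receives only its own scoped task.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(run_worker, name, task) for name, task in tasks.items()]
        findings = [future.result() for future in futures]
    # Synthesis happens once, here at the hub.
    failures = [f for f in findings if f.status == "fail"]
    return {"approved": not failures, "failures": [(f.worker, f.reason) for f in failures]}


summary = orchestrate({
    "security": "scan auth.py",
    "performance": "profile query.py  # TODO add index",
})
print(summary)
```

The workers never see each other's tasks or results, and the only place a conclusion is formed is inside `orchestrate`.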

Pipeline

Agents run in a fixed sequence, each stage consuming the previous stage's output as its input — a research agent hands findings to a drafting agent, which hands a draft to an editing agent. Useful when the task has genuinely sequential stages, each requiring a different mode of work, but it doesn't buy you parallelism: stage three simply cannot start before stage two finishes.
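
Structurally, a pipeline is function composition. The sketch below uses three hypothetical stage functions in place of real agents. The point is only that each stage receives the previous stage's output and nothing else.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[str], str]


def research(topic: str) -> str:
    return f"notes on {topic}"            # hypothetical research agent


def draft(notes: str) -> str:
    return f"draft from [{notes}]"        # hypothetical drafting agent


def edit(text: str) -> str:
    return text.replace("draft", "edited draft", 1)   # hypothetical editing agent


def run_pipeline(stages: list[Stage], initial: str) -> str:
    # Each stage starts only after the previous one has returned.
    return reduce(lambda value, stage: stage(value), stages, initial)


result = run_pipeline([research, draft, edit], "refund flows")
print(result)
```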

Parallel fan-out + synthesis

Several agents work the same underlying question from different angles at the same time (one checks correctness, one checks performance, one checks security), and a final step merges their independent findings into a single answer. This is orchestrator–workers with a specific twist: the workers aren't dividing the input, they're each examining the whole input through a different lens.

Adversarial verify

One agent proposes a solution; a second agent — deliberately instructed to be skeptical rather than agreeable — tries to poke holes in it before it's accepted. This catches a specific failure mode single-agent setups are bad at: an agent grading its own work tends to agree with itself. A separate agent with no stake in the original answer, and an explicit mandate to find problems, catches issues a self-review pass reliably misses.

Judge panel

Multiple agents independently evaluate the same output — say, three different reviewers scoring a generated response — and their verdicts are combined, by majority vote or averaging, into a final judgment. Useful specifically when you don't trust any single model's evaluation to be stable, and you're willing to pay for redundancy to reduce variance in the verdict.
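
The two combination rules mentioned above can be sketched in a few lines. The tie-handling choice here (returning `"no-consensus"` instead of picking a winner) is an assumption for illustration, and both helpers assume a non-empty panel.

```python
from collections import Counter
from statistics import mean


def majority_verdict(verdicts: list[str]) -> str:
    """Most common label. A tie for first place yields "no-consensus" rather than an arbitrary pick."""
    counts = Counter(verdicts).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "no-consensus"
    return counts[0][0]


def average_score(scores: list[float]) -> float:
    """Mean of the judges' numeric scores."""
    return mean(scores)


print(majority_verdict(["pass", "fail", "pass"]))
print(average_score([7, 8, 9]))
```

An even-sized panel can tie, which is one practical reason to run an odd number of judges when voting.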

Figure 2

Context isolation vs. one overloaded context

Three agents each holding a small, focused context stay sharp on their slice of the problem. One agent holding all three jobs at once ends up with a single context stuffed with material, most of which it doesn't need for the step in front of it, and recall degrades as the context fills.

Designing tasks for a subagent that can't see the conversation

The single most common bug in multi-agent systems isn't a bad pattern choice — it's a task prompt written as if the subagent had been following along with everything that came before. It hasn't. A dispatched worker typically starts from a blank context containing only what the orchestrator explicitly hands it. Every assumption, every piece of prior context, every "as we discussed" needs to be spelled out in the dispatch, or the worker is reasoning about a task it doesn't actually have enough information to complete.

Write self-contained prompts. A worker task should read like a ticket handed to someone who just joined the project this morning: the goal, the relevant file paths or scope, any constraints, and nothing assumed from context the worker never saw.
Define the output contract explicitly. Tell the worker exactly what shape its report should take — a JSON object with specific fields, a short bulleted list of findings, a single pass/fail line. An orchestrator trying to parse four differently formatted free-text reports is doing avoidable work.
Never let two agents write the same file. Two workers editing the same file concurrently is a race condition with an LLM instead of a thread — one of them will silently clobber the other's change, and neither will know it happened. Partition file ownership across workers up front, or serialize the writes through a single agent.
Scope tightly, not generously. A worker prompt that says "look into the payments module" invites scope creep and an unpredictable report. "Check whether payments/refund.py handles a partial refund on an already-cancelled subscription, and report pass or fail with a one-line reason" gives the worker — and the orchestrator parsing its result — something concrete to act on.
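
One way to make these rules hard to skip is to build every dispatch from a small structure instead of free-typing it. The `Dispatch` class below is a hypothetical sketch, and the phrases it lints for are examples rather than an exhaustive list.

```python
from dataclasses import dataclass, field


@dataclass
class Dispatch:
    worker: str
    scope: list[str]        # exact file paths the worker may touch
    goal: str
    report_format: str      # the output contract, spelled out verbatim
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render a self-contained task prompt for the worker."""
        lines = [
            f"## Task for {self.worker}",
            f"Scope: {', '.join(self.scope)} only. Do not read or modify any other file.",
            "",
            f"Goal: {self.goal}",
        ]
        if self.constraints:
            lines += ["", "Constraints:"] + [f"- {c}" for c in self.constraints]
        lines += ["", "Report format (exactly):", self.report_format]
        return "\n".join(lines)

    def problems(self) -> list[str]:
        """Cheap lint for the most common dispatch mistakes."""
        issues = []
        if not self.scope:
            issues.append("no files in scope")
        if not self.report_format.strip():
            issues.append("no output contract")
        leaked = ("as we discussed", "as above", "like before")
        if any(phrase in self.goal.lower() for phrase in leaked):
            issues.append("goal leans on context the worker never saw")
        return issues


good = Dispatch(
    worker="Worker B",
    scope=["payments/refund.py"],
    goal="Determine whether a partial refund is rejected on a cancelled subscription.",
    report_format='{ "check": "...", "result": "pass" | "fail", "reason": "<one line>" }',
)
print(good.render())
```

A dispatch with an empty scope, no report format, or a goal that says "as we discussed" fails `problems()` before any worker is started.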

Fan out, verify, synthesize

The pattern that combines most of the above into something worth using as a default template: split a task across parallel workers, route their combined output through an adversarial check before it's trusted, and only synthesize a final answer once that check passes. A failure at the verification step sends the run back for another pass rather than letting a flawed result through.

Figure 3

Fan-out → verify → synthesize

Workers examine the problem in parallel and report data back. A skeptic agent checks the combined result before it's trusted — failing claims loop back for another pass, and only a passing result reaches synthesis.

Two details make this pattern reliable rather than merely elaborate. First, the skeptic has to actually be adversarial — instructed to look for reasons the combined output might be wrong, not asked to rubber-stamp it. Second, the retry loop needs a bound. An unbounded "keep retrying until the skeptic is satisfied" loop is exactly the kind of runaway behavior the next post in this series, on loop engineering, is about avoiding.
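
A bounded version of that loop might look like the sketch below. `fan_out_verify` and both stub functions are hypothetical. In a real system `run_workers` and `skeptic` would wrap agent calls, and the default of three rounds is an arbitrary choice.

```python
from typing import Callable


def fan_out_verify(
    run_workers: Callable[[list[str]], list[dict]],
    skeptic: Callable[[list[dict]], list[str]],
    max_rounds: int = 3,
) -> dict:
    """run_workers(feedback) returns findings; skeptic(findings) returns objections (empty means pass)."""
    feedback: list[str] = []
    for round_number in range(1, max_rounds + 1):
        findings = run_workers(feedback)
        feedback = skeptic(findings)
        if not feedback:
            return {"status": "verified", "rounds": round_number, "findings": findings}
    # Bound reached: hand the open objections to a human instead of looping forever.
    return {"status": "unresolved", "rounds": max_rounds, "objections": feedback}


def stub_workers(feedback: list[str]) -> list[dict]:
    # Hypothetical: the retry pass adds the evidence the skeptic asked for.
    return [{"claim": "refund rejected", "evidence": "test log" if feedback else None}]


def stub_skeptic(findings: list[dict]) -> list[str]:
    # Objects to any claim that arrives without evidence.
    return [f"no evidence for: {f['claim']}" for f in findings if not f["evidence"]]


outcome = fan_out_verify(stub_workers, stub_skeptic)
print(outcome["status"], outcome["rounds"])
```

The skeptic's objections are fed back into the next worker pass, and a run that never satisfies the skeptic ends as `"unresolved"` after `max_rounds` rather than retrying indefinitely.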

dispatch-worker.md (excerpt)

## Task for Worker B
Scope: payments/refund.py only. Do not read or modify any other file.

Goal: Determine whether a partial refund is correctly rejected when the
underlying subscription has already been cancelled.

Report format (exactly):
{ "check": "partial-refund-on-cancelled", "result": "pass" | "fail", "reason": "<one line>" }
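
On the orchestrator side, a contract like this is only useful if replies that drift from it are rejected. The following is a minimal sketch of such a check for the three-field report above. `parse_report` is a hypothetical helper, and treating any extra key as an error is a design assumption.

```python
import json

REQUIRED_KEYS = {"check", "result", "reason"}
ALLOWED_RESULTS = {"pass", "fail"}


def parse_report(raw: str) -> dict:
    """Parse a worker reply and reject anything that drifts from the contract."""
    report = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on prose or broken JSON
    if not isinstance(report, dict) or set(report) != REQUIRED_KEYS:
        raise ValueError(f"expected exactly the keys {sorted(REQUIRED_KEYS)}")
    if report["result"] not in ALLOWED_RESULTS:
        raise ValueError(f"result must be one of {sorted(ALLOWED_RESULTS)}")
    if not isinstance(report["reason"], str) or "\n" in report["reason"]:
        raise ValueError("reason must be a single line")
    return report


reply = '{"check": "partial-refund-on-cancelled", "result": "fail", "reason": "refund accepted after cancel"}'
print(parse_report(reply)["result"])
```

Failing loudly here keeps a malformed report from being quietly folded into the synthesis step.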

A worked example: reviewing a large pull request across three domains

Say a pull request touches a database migration, a payments endpoint, and the frontend that calls it — three domains, each requiring a different kind of scrutiny, in one changeset too large for a careful single-agent read to stay sharp all the way through. This is a reasonable candidate for the orchestrator–workers pattern, and walking through it end to end, prompts included, shows why the design rules above aren't academic.

The orchestrator reads the diff once, up front, and splits it into scoped dispatches — one per domain, each naming the exact files in scope and the specific risk to check for. Its own prompt stays deliberately thin: read the diff, decide the domains actually touched, dispatch one worker per domain, and don't review anything itself.

orchestrator.md

## Role
You are the review orchestrator for pull request #4821. You do not review
code yourself — you split the diff into scoped worker dispatches and
synthesize their reports into one comment.

## Steps
1. Read the full diff for PR #4821.
2. For each domain actually touched (migration, payments endpoint,
   frontend), dispatch exactly one worker using the templates below,
   filling in the real file paths from the diff.
3. Wait for all dispatched workers to report back in the JSON contract
   specified in their dispatch.
4. Synthesize: merge the reports into one review comment. Flag any case
   where one worker's finding has implications the other worker's report
   doesn't mention (e.g. a renamed column referenced elsewhere by name).
5. Do not re-review any file a worker already covered.

Each worker's dispatch is self-contained — it names the exact files in scope, the one risk to check, and the exact shape the report must take. Worker A never sees worker B's dispatch or findings, and it doesn't need to: the migration reviewer has no reason to reason about frontend accessibility, and forcing it to would just be wasted context.

worker-a-migration.md

## Task for Worker A (migration reviewer)
Scope: db/migrations/0047_add_refund_reason.sql only. Do not read or
modify any other file.

Goal: Determine whether this migration is safely reversible and whether
any column it renames or drops is still referenced elsewhere by name.

Report format (exactly, no prose outside the JSON):
{
  "worker": "migration",
  "reversible": true | false,
  "renamed_or_dropped_columns": ["<column>", ...],
  "result": "pass" | "fail",
  "reason": "<one line>"
}

worker-b-frontend.md

## Task for Worker B (frontend reviewer)
Scope: web/src/components/RefundForm.tsx only. Do not read or modify any
other file.

Goal: Check for accessibility regressions (label associations, focus
order, aria attributes) introduced by this diff, and list every column
or field name this component references by string.

Report format (exactly, no prose outside the JSON):
{
  "worker": "frontend",
  "fields_referenced_by_name": ["<field>", ...],
  "a11y_regressions": ["<finding>", ...],
  "result": "pass" | "fail",
  "reason": "<one line>"
}

The orchestrator collects two compact, similarly structured reports and does the one job neither worker was asked to do: cross-reference them. Worker A's renamed_or_dropped_columns field and worker B's fields_referenced_by_name field are exactly the two pieces of data that need to sit next to each other for the real issue to surface, and neither worker had enough context, on its own, to spot it.

synthesis (orchestrator output, excerpt)

Cross-domain finding: worker A reports the migration renames
"refund_reason" -> "refund_reason_code". Worker B reports
RefundForm.tsx references "refund_reason" by name in a form label.
=> This diff will silently break that label. Blocking until the
   frontend field name is updated to match the new column.

That cross-reference is exactly the kind of thing that gets lost if each worker tries to write its own final verdict instead of reporting structured data the orchestrator can actually diff against the other worker's report.
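
Because both reports are structured, the cross-reference itself reduces to a set intersection. In the sketch below the field names come from the two contracts above. The sample values (the `reason` strings and the extra `amount` field) are invented for illustration.

```python
def cross_reference(migration: dict, frontend: dict) -> list[str]:
    """Columns the migration renames or drops that the frontend still references by name."""
    changed = set(migration["renamed_or_dropped_columns"])
    referenced = set(frontend["fields_referenced_by_name"])
    return sorted(changed & referenced)


migration_report = {
    "worker": "migration",
    "reversible": True,
    "renamed_or_dropped_columns": ["refund_reason"],
    "result": "pass",
    "reason": "rename is reversible",            # illustrative value
}
frontend_report = {
    "worker": "frontend",
    "fields_referenced_by_name": ["refund_reason", "amount"],   # "amount" is illustrative
    "a11y_regressions": [],
    "result": "pass",
    "reason": "no accessibility regressions",    # illustrative value
}

conflicts = cross_reference(migration_report, frontend_report)
print(conflicts)
```

Both workers report `"pass"` within their own scope here, and the blocking issue only appears once the two lists are compared.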

What this actually costs: token, latency, and failure surface

"Split it across agents" is not a free upgrade, and the worked example above is a reasonable size to put real numbers next to. Every worker carries its own system prompt and its own copy of whatever files it reads — none of that is shared or amortized across workers the way it would be inside one continuous conversation.

rough cost comparison — single agent vs. orchestrator + 2 workers

                          single agent             orchestrator + 2 workers
Input tokens (approx.)    ~18k                     ~6k (orch) + ~9k + ~7k = ~22k
Output tokens (approx.)   ~2.5k                    ~0.3k + ~0.4k + ~0.4k + ~1.2k synth = ~2.3k
Wall-clock                one pass, sequential     two workers in parallel, then a short
                          through 3 domains        synthesis pass; often faster than a
                                                   sequential single agent despite more
                                                   total tokens
Failure surface           1 transcript to debug    3 transcripts + 2 dispatch prompts +
                                                   1 output-contract mismatch to debug
                                                   if something breaks

The pattern that shows up almost every time: multi-agent designs spend more total tokens than a single agent doing the same work, because each worker's system prompt and dispatch context is paid for independently. What they buy back is quality per domain — each worker's context stays scoped to exactly the risk it's checking — and, when the pieces are truly independent, wall-clock time. If neither of those is actually worth the token premium for a given task, the honest conclusion is that the task didn't need multiple agents in the first place.
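
The token premium can be read straight off the rough figures in the comparison above. This short calculation uses those approximate numbers as-is. Labeling the first output figure as the orchestrator's dispatch step is an assumption.

```python
# Approximate token counts from the comparison above.
single_input, single_output = 18_000, 2_500

multi_input = {"orchestrator": 6_000, "worker_a": 9_000, "worker_b": 7_000}
# Assumption: the first output figure is the orchestrator's dispatch step.
multi_output = {"dispatch": 300, "worker_a": 400, "worker_b": 400, "synthesis": 1_200}

multi_input_total = sum(multi_input.values())
multi_output_total = sum(multi_output.values())

input_premium = multi_input_total / single_input - 1
total_premium = (multi_input_total + multi_output_total) / (single_input + single_output) - 1

print(f"input: {multi_input_total} vs {single_input} ({input_premium:.0%} more)")
print(f"total: {multi_input_total + multi_output_total} vs {single_input + single_output} ({total_premium:.0%} more)")
```

On these figures the multi-agent run spends about 22% more input tokens and slightly fewer output tokens, which is the premium that quality per domain and wall-clock time have to justify.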

How this fails in practice

The orchestrator can't reconcile two reports that don't line up

Symptom: synthesis produces a vague, hedged summary instead of a clear finding, even though the underlying issue was genuinely catchable. Cause: the two workers were given output contracts that don't share a join key — one reports "columns changed," the other reports "fields used," with no guarantee the field names line up in format (snake_case vs. camelCase, quoted vs. unquoted). The orchestrator ends up guessing whether two differently-formatted strings refer to the same thing. Fix: design output contracts as a pair, not independently — if synthesis needs to cross-reference two fields, name them consistently and specify the exact format each worker must use before either dispatch is sent.
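
Specifying the format up front is the real fix. As a fallback, the orchestrator can also normalize names before comparing them. The helper below is a hypothetical sketch that maps quoted, camelCase, and snake_case spellings onto a single snake_case key. It does not cover every naming style.

```python
import re


def normalize_field(name: str) -> str:
    """Map quoted, camelCase, or snake_case spellings onto one snake_case join key."""
    name = name.strip().strip("\"'`")
    # Insert an underscore at each lowercase-or-digit to uppercase boundary.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


columns_changed = {normalize_field(c) for c in ['"refund_reason"']}
fields_used = {normalize_field(f) for f in ["refundReason", "amount"]}
print(sorted(columns_changed & fields_used))
```

Without the normalization step, the quoted snake_case column and the camelCase field would compare as different strings and the overlap would be missed.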

A worker's report silently omits something the dispatch didn't ask for

Symptom: a real issue existed inside a worker's scope but never made it into the review. Cause: the dispatch asked about a single risk ("check reversibility") rather than everything in the scoped file, so the worker correctly reported pass/fail on the one thing it was asked about and had no mandate to mention anything else it noticed. This isn't the worker malfunctioning; it did exactly what it was told. Fix: add an explicit "also flag anything else that looks wrong in scope, even outside the primary check" line to every dispatch, with its own field in the output contract, so incidental findings have somewhere to go instead of being silently dropped.

Two workers touch the same file and one clobbers the other

Symptom: a change one worker made disappears, with no error anywhere in either transcript. Cause: file ownership wasn't partitioned before dispatch, and two workers both had write access to the same file — the second write simply overwrote the first, and neither worker had any way to know the other existed. Fix: partition file ownership explicitly in the orchestrator's dispatch step, before any worker starts, and if two workers genuinely need to touch the same file, serialize those specific writes through a single agent instead of running them concurrently.
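
A check like the one below can run in the orchestrator before any worker starts. `ownership_conflicts` is a hypothetical helper. It compares paths as plain strings, so it assumes the orchestrator writes every path in the same form.

```python
from collections import defaultdict


def ownership_conflicts(assignments: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each file claimed for writing by more than one worker to the workers claiming it."""
    owners: dict[str, list[str]] = defaultdict(list)
    for worker, files in assignments.items():
        for path in set(files):          # ignore duplicates within one worker's list
            owners[path].append(worker)
    return {path: sorted(names) for path, names in owners.items() if len(names) > 1}


plan = {
    "worker_a": ["db/migrations/0047_add_refund_reason.sql"],
    "worker_b": ["web/src/components/RefundForm.tsx"],
}
print(ownership_conflicts(plan))
```

An empty result means the write sets are disjoint. Any file that shows up in the result needs a single owner or a serialized write path before dispatch.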

The adversarial verify step never actually pushes back

Symptom: the skeptic agent in a fan-out-verify pipeline approves everything on the first pass, every time, making it expensive theater rather than a real check. Cause: the skeptic's prompt asks it to "review the findings" without an explicit mandate to look for reasons they might be wrong — with nothing pushing it toward skepticism, it defaults to the same agreeable tone as any other review request. Fix: write the skeptic's prompt to require it to attempt to falsify the specific claims in front of it, and give it something concrete to try — re-run the check the claim rests on, not just read the claim and nod.

Trade-offs: context handoff formats and when to skip delegation

Choosing a handoff format

A structured JSON contract, like the ones above, is the right default when the orchestrator needs to programmatically cross-reference fields from multiple workers — it forces every worker to commit to a shape before dispatch, which is exactly what made the column-rename cross-reference catchable. A shorter free-text bullet list is fine when there's exactly one worker and the orchestrator is just relaying its finding to a human, with no cross-referencing to do. A shared scratch file that multiple workers append to, rather than a return value each reports individually, works for loosely-coupled fan-out where order doesn't matter — but it reintroduces exactly the same-file-concurrent-write risk described above unless each worker appends to a distinct, named section rather than editing the same lines.

When the orchestrator should just do it itself

Delegation earns its overhead specifically when a piece of work needs a clean context the orchestrator's own conversation doesn't have — not merely because a task has multiple parts. If the orchestrator has already read the relevant file as part of planning the dispatch, dispatching a worker to re-read the same file and report back what it found is pure overhead: the orchestrator already has the context, and a round trip through a worker doesn't make that context any cleaner. The rule of thumb: dispatch when the worker needs to read something the orchestrator hasn't and shouldn't (to keep its own context clean for synthesis), and do it inline when the orchestrator already has everything a worker would need to fetch anyway.

Key takeaways

Context isolation — not parallelism — is the strongest reason to split work across agents; it's worth wanting even when the pieces run sequentially.
Orchestrator–workers, pipeline, parallel fan-out, adversarial verify, and judge panel cover almost every working multi-agent design — pick the shape that matches your actual dependency structure.
Synthesis belongs at the orchestrator. Workers should report data against a defined output contract, not a free-text conclusion — and paired contracts need a shared join key or cross-referencing breaks silently.
A subagent doesn't see the conversation that led to its dispatch — write task prompts that are self-contained, scope file ownership so no two agents write the same file, and default to one agent unless the task clearly earns the coordination cost.
Multi-agent designs usually cost more total tokens than a single agent doing the same work, not less — the payoff is quality per domain and wall-clock time on genuinely independent pieces, not a token discount.

Multi-agent systems only pay off when each worker is actually well-equipped for its slice of the task — which is exactly what the skills and commands in a Noddle Deck persona pack are built for: a scoped capability and knowledge layer a worker can load the moment its dispatch matches.

bash

noddle-deck pack install qa

Pairing a tightly-scoped worker prompt with a pack that already knows the review checklist for that role is a fast way to see the fan-out-and-verify pattern above hold up on a real codebase instead of a toy example.