Agent Engineering series

Loop Engineering: Agents That Run While You Sleep

Noddle Deck team · 11 min read
loop engineering · automation · agents

An agent that answers one question and stops is easy to reason about: you read the transcript, you judge the answer, you move on. An agent that runs unattended, on a schedule or in response to events, night after night, is a different kind of system entirely. It doesn't get judged once; it gets judged every time it runs, and by design there's often nobody watching in real time when it does. Loop engineering is the discipline of designing that kind of agent so it fails safely instead of failing unnoticed at 3 a.m. with nobody around to catch it.

This is the part of agent engineering where the stakes quietly go up. A single-shot agent that makes a bad call wastes one turn. A loop that makes the same bad call can repeat it a hundred times before anyone looks — or, worse, use its own flawed output as the input to its next iteration and compound the mistake. Getting loops right isn't about making agents smarter; it's about building the same kind of safety rails around a repeating agent that you'd build around any other automated system that runs without a human reviewing each step.

What a loop actually looks like in practice

"Loop" covers a range of concrete patterns, and it helps to name a few so the abstraction doesn't stay theoretical.

  • Scheduled runs. A cron-triggered agent that, say, checks for stale dependencies every morning and opens an update PR if it finds one.
  • Watch-fix loops. A CI pipeline goes red, an agent is triggered automatically, investigates the failing test, and proposes — or, in more autonomous setups, commits — a fix.
  • Inbox-zero patterns. An agent that processes a queue continuously — incoming support tickets, a backlog of flagged items — working through entries one at a time until the queue is empty, then waiting for more.
  • Self-verifying loops. Do the work, run a check, and if the check fails, use the failure output as the input to another attempt — repeating until the check passes or a retry budget runs out.

Every one of these shares the same underlying shape once you strip away the specific trigger, and that shape is the actual unit of design in loop engineering — not the business logic of any one use case.

The anatomy of a safe loop

A well-designed loop has five parts, and the order they run in matters as much as the parts themselves: a trigger starts the run, a scoped task defines what this particular iteration is supposed to accomplish, a verification gate renders an objective judgment on the result, bounded retries give the loop a limited number of chances to correct course, and a reporting step either closes out successfully or escalates to a human. Skip the verification gate and every other part of this list stops mattering.

Figure 1

Anatomy of an agent loop

[Diagram: Trigger (schedule / event) → Act (scoped task) → Verify (test / lint gate) → Report or retry. Pass → Report; fail → Act (bounded retries).]
Trigger starts the run, act performs the scoped task, and the verification gate renders judgment — pass reports out, fail sends the run back to act for another bounded attempt.
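
The shape in the figure can be sketched as a small runner. This is a minimal illustration rather than a reference implementation: `run_loop`, `LoopResult`, and the `act` and `verify` callables are hypothetical names, and the trigger is assumed to be whatever calls `run_loop` (a scheduler, a CI hook).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    status: str                                   # "passed" or "escalated"
    attempts: list = field(default_factory=list)  # one record per attempt


def run_loop(
    act: Callable[[int, list], None],
    verify: Callable[[], tuple],
    max_attempts: int = 3,
) -> LoopResult:
    """Act, verify, retry up to max_attempts, then report or escalate."""
    history = []
    for attempt in range(1, max_attempts + 1):
        act(attempt, history)       # scoped task; may read earlier gate output
        passed, output = verify()   # objective gate, e.g. a test suite result
        history.append({"attempt": attempt, "passed": passed, "output": output})
        if passed:
            return LoopResult("passed", history)
    # Budget exhausted: hand off to a human instead of trying again.
    return LoopResult("escalated", history)


# Hypothetical stand-ins: the gate fails once, then passes.
outcomes = iter([(False, "1 failed"), (True, "all passed")])
result = run_loop(act=lambda n, h: None, verify=lambda: next(outcomes))
print(result.status, len(result.attempts))  # prints: passed 2
```

The point of the structure is that the only two ways out of the loop are a passing gate or an exhausted budget; there is no branch where the agent decides for itself that it is done.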

Why the verification gate is the heart of the loop

Everything upstream of the gate — the trigger, the scoped task, the agent's reasoning about what to do — can be wrong in ways that are individually survivable. What makes a loop dangerous rather than just occasionally mistaken is running that same fallible reasoning repeatedly, without an outside, objective check on the result. A test suite passing or failing is objective. A linter returning a clean exit code is objective. An agent's own claim that "this looks correct" is not — it's the same reasoning process that produced the change in the first place, asked to grade its own work. A loop without an objective verification gate doesn't catch its own mistakes; it amplifies them, because each iteration builds on the unverified output of the last one.

Figure 2

Convergence vs. runaway

[Diagram: errors plotted against iterations, with two curves: "with a verification gate" and "no gate (compounding errors)".]
A loop with a real verification gate drives its error count toward zero, iteration by iteration. A loop without one has nothing forcing errors down — and they compound instead.

This is the difference between a loop that's useful to leave running overnight and one that needs to be watched the entire time it executes. The gate is what turns "keep trying until it works" from a hope into an actual guarantee bounded by something you can point to — a specific test, a specific check, a specific exit code.
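
One way to make "a specific exit code" concrete is a gate that only ever looks at process return codes. The sketch below assumes the checks are ordinary command-line tools; `run_gate` and `GateResult` are hypothetical names, and the commands passed in are up to the project.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    exit_code: int
    output: str   # raw stdout + stderr, kept verbatim for later review


def run_gate(commands: list[list[str]], cwd: str = ".") -> GateResult:
    """Run each check in order; the gate passes only if every exit code is 0."""
    collected = []
    for cmd in commands:
        proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
        collected.append(proc.stdout + proc.stderr)
        if proc.returncode != 0:
            # Stop at the first failing check and report its exit code.
            return GateResult(False, proc.returncode, "".join(collected))
    return GateResult(True, 0, "".join(collected))


# Example call for a Python project (assumes pytest and ruff are installed):
# run_gate([["pytest", "-q"], ["ruff", "check", "."]])
demo = run_gate([[sys.executable, "-c", "print('checks ran')"]])
print(demo.passed, demo.exit_code)  # prints: True 0
```

Nothing in this function reads the agent's description of its own work, which is the property that matters.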

Guardrails that make unattended execution survivable

None of these are exotic ideas — they're the same operational discipline any team already applies to a cron job or a background worker that touches production. Agent loops just make skipping them more tempting, because the failure mode looks like "the model made a mistake" rather than "we shipped an unguarded automated process," even though the second framing is the more accurate one.

  • Checkpointing. A loop that works through a multi-step task should persist its progress after each completed step, not just at the very end. If the process dies — a timeout, a crashed container, a killed job — the next run should resume from the last checkpoint instead of redoing everything from scratch. Without this, a long-running loop that fails on step nine of ten silently repeats the first eight every single retry, burning budget on work that already succeeded.
  • Budget and iteration caps. A hard limit on retry count, wall-clock time, or token spend per run. Without one, "retry until it passes" is indistinguishable from "retry forever," and forever is expensive.
  • Idempotency. Re-running the same iteration twice — because a trigger fired twice, or a retry re-executed a step that partially succeeded — should not leave the system in a worse state than running it once. Design each step so it can safely be repeated.
  • Dry-run before commit. Especially for the first few runs of a new loop, execute against a staging environment or in a mode that reports what it would have done without doing it, before trusting it to act directly on production.
  • A kill switch. A trivially accessible way to stop a running or scheduled loop immediately — a flag, an env var, a feature toggle — that doesn't require redeploying code to flip. If you can't stop it in under a minute, you don't have a kill switch, you have a plan to stop it eventually.
  • An audit log. Every iteration's trigger, action taken, verification result, and outcome, written somewhere a human can review after the fact. Unattended doesn't mean unaccountable — it means the accountability happens after the run instead of during it.
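
Two of the guardrails above, checkpointing and the kill switch, fit in one short step runner. This is a sketch under assumptions: the checkpoint is a JSON file on local disk, the kill switch is an environment variable named `LOOP_DISABLED` (a made-up name), and `run_steps` is hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Callable


def run_steps(
    steps: dict[str, Callable[[], None]],
    checkpoint_file: Path,
    kill_switch_env: str = "LOOP_DISABLED",   # hypothetical variable name
) -> list[str]:
    """Run steps in order, skipping finished ones; return the steps run now."""
    done = json.loads(checkpoint_file.read_text()) if checkpoint_file.exists() else []
    ran = []
    for name, step in steps.items():
        if os.environ.get(kill_switch_env) == "1":
            break                   # kill switch: stop before the next step
        if name in done:
            continue                # already completed in an earlier run
        step()
        done.append(name)
        # Persist after every step so a crash loses at most one step of work.
        checkpoint_file.write_text(json.dumps(done))
        ran.append(name)
    return ran
```

Because the switch is checked before every step and lives outside the code, flipping it takes effect at the next step boundary without a redeploy. Skipping completed steps is only safe if each step is itself idempotent, so the two guardrails depend on each other.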

Escalating instead of grinding forever

Bounded retries only solve half the problem. The other half is deciding what happens when the bound is hit and the task still isn't done. The answer should never be "keep going anyway" — it should be a clean handoff to a human, with enough context attached that the handoff is actually useful rather than just an alert saying something went wrong.

Figure 3

Human-in-the-loop escalation

[Diagram: Attempt 1 (fails, retries) → Attempt 2 (fails, retries) → Attempt 3 (fails, threshold hit) → Escalate (notify a human, stop).]
The loop retries on its own for the first couple of failures. Hitting the threshold routes the run to a human instead of attempting a third, fourth, or hundredth automatic retry.

A good escalation includes what was attempted and what the verification gate reported at each attempt. Critically, it does not include a fourth automatic attempt disguised as persistence. The threshold exists precisely so that a human makes the next call instead of the loop making it again with the same blind spots that caused the first three failures.

loop-config.yaml (excerpt)
trigger: on_ci_failure
task: investigate_and_fix
verification: run_test_suite
retries:
  max_attempts: 3
  on_exhausted: escalate
escalation:
  notify: "#eng-oncall"
  include: [attempts, verification_output, diff]
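
The `include` list in the config maps naturally onto a payload builder. The sketch below is an assumption about how a runner might assemble that handoff; `build_escalation` and the field names inside each attempt record are hypothetical, and actually sending the payload is left to the caller.

```python
import json


def build_escalation(
    task: str,
    attempts: list[dict],
    diff: str,
    channel: str = "#eng-oncall",
) -> dict:
    """Assemble the handoff: what was tried and what the gate said each time."""
    lines = [f"{task}: exhausted {len(attempts)} attempts, no further retries will run."]
    for a in attempts:
        lines.append(f"attempt {a['attempt']}: {a['action']} -> gate exit {a['exit_code']}")
    return {
        "channel": channel,
        "text": "\n".join(lines),
        "attachments": [
            # Raw gate output, not a summary written by the agent.
            {"title": "last verification output", "text": attempts[-1]["output"]},
            {"title": "diff", "text": diff},
        ],
    }


example = build_escalation(
    "investigate_and_fix",
    [{"attempt": 1, "action": "patch test fixture", "exit_code": 1, "output": "1 failed"}],
    diff="(diff text)",
)
print(json.dumps(example, indent=2))
```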

A worked example: a nightly dependency-update loop

It's easier to see how the five parts fit together with a concrete run instead of the abstract shape. Take a common loop: every night, check whether any dependency has a security patch available, and if one does, upgrade it and open a pull request.

The trigger is a schedule — 2 a.m., low traffic, nobody expected to be watching. The scoped task is narrow on purpose: check one dependency manifest, not "improve the codebase," and the scope is written into the dispatch explicitly so the agent isn't left guessing how far its mandate extends. The agent runs the upgrade, and here is where most home-grown versions of this loop go wrong: they treat "the upgrade command exited successfully" as good enough. It isn't. The verification gate for this loop has to be the project's actual test suite plus its lint config — exactly the same checks a human contributor would be required to pass before merging — run against the upgraded dependency, not a weaker proxy for correctness.

If the suite passes, the loop opens a pull request and stops — it does not merge it. That's the human checkpoint, preserved deliberately. If the suite fails, bounded retries give the agent a small number of chances to try a different upgrade path — maybe a minor version bump instead of a major one — before giving up. And if every attempt fails, the loop escalates: it posts to a channel with the dependency name, the versions it tried, and the specific test failures, and then it stops entirely rather than retrying a fourth time with no new information. Every part of the anatomy diagram above shows up in this one small loop, and every guardrail from the list before it earns its place: the run is idempotent (checking the same dependency twice on two triggers doesn't double-upgrade it), it has an iteration cap (three attempts, not unlimited), and it has an audit trail (the pull request itself, plus the escalation message, are both a permanent record of what happened and why).

Written out as an actual scheduled workflow, the trigger and the retry policy live in one file, separate from the task logic — which is what makes the cap and the escalation path something you can review in a diff instead of something buried inside a prompt.

.github/workflows/dependency-patch-loop.yml
name: nightly-dependency-patch
on:
  schedule:
    - cron: "0 2 * * *" # 2 a.m., low traffic
jobs:
  patch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run patch agent (bounded)
        run: |
          ./scripts/run-patch-agent.sh \
            --idempotency-key "dep-patch-$(date +%F)" \
            --max-attempts 3 \
            --scope requirements.txt

The verification gate is a plain script — no model call inside it, nothing for the agent to talk its way past — that the loop runner invokes after every attempt and trusts as the sole source of truth on pass or fail.

scripts/verify-upgrade.sh
#!/usr/bin/env bash
set -euo pipefail
# Objective gate: real test suite + lint, nothing the agent can talk its
# way around. Exit code is the only signal the loop runner trusts.
pip install -r requirements.txt
pytest -q
ruff check .

And the escalation path is its own small script too — triggered only once the attempt budget is exhausted, carrying exactly the context a human needs to pick up where the loop left off.

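scripts/escalate.sh
#!/usr/bin/env bash
set -euo pipefail
# $1: dependency name $2: attempts json $3: last verification output
# Build the payload with jq so quotes and newlines in the inputs are escaped.
payload=$(jq -n --arg dep "$1" --arg attempts "$2" --arg gate "$3" '{
  channel: "#eng-oncall",
  text: "dependency-patch-loop exhausted 3/3 attempts on \($dep). Last gate output attached. No further retries will run automatically; needs a human decision.",
  attachments: [{ text: $gate }, { title: "attempts", text: $attempts }]
}')
# -f makes curl exit non-zero on an HTTP error, so a failed post is not silent.
curl -fsS -X POST "$SLACK_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d "$payload"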

Simulated across three consecutive nights, the shape looks like this: two nights where the gate catches a real problem and the loop corrects itself, and one night where there is nothing to patch and the loop exits as a no-op.

run log (simulated, 3 nights)
Night 1 — dep-patch-2026-07-03
  attempt 1/3: upgrade requests 2.31.0 -> 2.32.0
    gate: pytest FAIL (test_retry_backoff expects old exception type)
  attempt 2/3: upgrade requests 2.31.0 -> 2.31.1 (patch only, not minor)
    gate: pytest PASS, ruff PASS
  result: PR opened, loop stops. Idempotency key marks 2026-07-03 done.
Night 2 — dep-patch-2026-07-04
  attempt 1/3: no CVE-flagged dependency found
  result: no-op, loop exits cleanly, nothing logged as a failure.
Night 3 — dep-patch-2026-07-05
  attempt 1/3: upgrade pyjwt 2.6.0 -> 2.8.0
    gate: pytest FAIL (jwt.decode signature changed)
  attempt 2/3: apply compat shim from changelog, retry same version
    gate: pytest FAIL (shim didn't cover one call site)
  attempt 3/3: same version, second shim covering remaining call site
    gate: pytest PASS, ruff PASS
  result: PR opened, loop stops at attempt 3/3 — one attempt away from
          escalating instead of converging.

Night three is the case worth sitting with: it converged, but only on the last attempt the budget allowed. That's the cap doing its job either way — either it converges inside the budget, or it escalates instead of silently trying a fourth time with a bigger blast radius each time.

How this fails in practice

The retry counter resets and the loop never escalates

Symptom: a loop that should have escalated after three failed nights instead shows up in the audit log as "attempt 1/3" every single night, forever, and nobody gets paged. Cause: the attempt count lives only in the process's memory for that run, not in persisted state keyed to the task. Each night's invocation is a fresh process, so each night looks like a first attempt even though it's conceptually the same failing task recurring. Fix: persist the attempt count in a small state file or database row that survives across invocations, keyed to the task rather than the process. For a failure that recurs across nights, that key has to stay the same from one night to the next: a date-stamped run key like dep-patch-2026-07-03 changes every night and would reset the count just as a fresh process does, so key the counter on what is failing (for example, the dependency and target version). That is what makes "3 attempts total" actually mean three, instead of three separate "attempt 1"s.
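
A persisted counter can be as small as a JSON file. The sketch below assumes a single runner with local disk; `record_failure`, `clear_failures`, and the `dep-patch:pyjwt` key are hypothetical, and any key works as long as it is the same every time the same failing task comes around.

```python
import json
from pathlib import Path


def _load(state_file: Path) -> dict:
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def record_failure(state_file: Path, task_key: str, max_attempts: int = 3) -> tuple[int, bool]:
    """Increment the stored failure count; return (count, budget_exhausted)."""
    state = _load(state_file)
    state[task_key] = state.get(task_key, 0) + 1
    state_file.write_text(json.dumps(state))
    return state[task_key], state[task_key] >= max_attempts


def clear_failures(state_file: Path, task_key: str) -> None:
    """Reset the count once the gate passes, so a later failure starts fresh."""
    state = _load(state_file)
    if state.pop(task_key, None) is not None:
        state_file.write_text(json.dumps(state))


# Usage (hypothetical path and key): each nightly process calls this on failure.
# count, exhausted = record_failure(Path("loop-state.json"), "dep-patch:pyjwt")
# if exhausted: escalate instead of scheduling another attempt.
```

With ephemeral CI runners the file would have to live somewhere durable (a cache, an artifact, a database row); the local path here is only for illustration.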

The gate reports success against stale state

Symptom: the loop opens a PR claiming the test suite passes, but the same suite fails the moment a human pulls the branch and runs it themselves. Cause: the verification step ran against a cached dependency install or a stale lockfile left over from a previous attempt, rather than a clean environment reflecting the actual current attempt. Fix: the gate has to run in an environment rebuilt for that specific attempt — reinstall dependencies, don't reuse a container or virtualenv another attempt already touched. A gate that's faster because it skipped a clean rebuild is a gate that's cheaper and wrong.
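
One way to guarantee a rebuilt environment is to give every attempt its own directory and never copy caches into it. A minimal sketch, assuming the checkout is a local directory; `fresh_workspace` and the ignore list are illustrative choices, not a fixed recipe.

```python
import shutil
import tempfile
from pathlib import Path


def fresh_workspace(source: Path, attempt: int) -> Path:
    """Copy the checkout into a brand-new directory used by one attempt only."""
    root = Path(tempfile.mkdtemp(prefix=f"attempt-{attempt}-"))
    workdir = root / "repo"
    # Leave behind environments and caches an earlier attempt may have touched.
    shutil.copytree(
        source,
        workdir,
        ignore=shutil.ignore_patterns(".venv", "__pycache__", ".pytest_cache"),
    )
    return workdir
```

The gate would then create its virtualenv and install dependencies inside the returned directory, so nothing installed during attempt one can make attempt two look healthier than it is.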

A flaky test burns through the retry budget on a real fix

Symptom: a correct upgrade gets escalated as "failed after 3 attempts" even though the actual code change was fine — a timing-sensitive test failed intermittently across all three attempts by bad luck. Cause: the gate treats every failure as equally meaningful, so a known-flaky test consumes the same one-of-three budget as a genuine regression, and on a bad night it can eat the entire budget by itself. Fix: distinguish flaky-gate from real-failure explicitly — re-run just the failing test once before counting the attempt as failed, and track which tests fail intermittently across runs so a known-flaky test doesn't silently consume budget that should be reserved for the change actually being evaluated.
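
The re-run-once rule can be sketched independently of any test runner. Here `run_suite` and `rerun_tests` are hypothetical callables that return the names of failing tests, and `flaky_counts` is a plain dict the caller persists between runs.

```python
from typing import Callable


def gate_with_flake_check(
    run_suite: Callable[[], list[str]],
    rerun_tests: Callable[[list[str]], list[str]],
    flaky_counts: dict[str, int],
) -> tuple[bool, list[str]]:
    """Return (passed, failures); a failure that passes on one re-run is flaky."""
    failed = run_suite()                  # names of failing tests, [] if clean
    if not failed:
        return True, []
    still_failing = rerun_tests(failed)   # re-run only the failures, once
    for name in failed:
        if name not in still_failing:
            # Passed on re-run: record it as intermittent, don't count it.
            flaky_counts[name] = flaky_counts.get(name, 0) + 1
    return (not still_failing), still_failing
```

Only tests that fail twice in a row spend the attempt. The counts give a human a list of tests to fix or quarantine, which is the longer-term remedy; the single re-run is a stopgap, not a licence to ignore intermittent failures.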

The same trigger fires twice and the loop runs the task twice

Symptom: two near-identical pull requests open for the same dependency upgrade on the same night. Cause: the scheduler or webhook fired the trigger twice (a retried cron job, a duplicate webhook delivery) and nothing checked whether that specific task had already run before starting again. Fix: check the idempotency key before doing any work, not just when writing the result. If dep-patch-2026-07-03 is already marked in-progress or done, the second invocation should exit immediately rather than re-running the upgrade from scratch.
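
Checking the key before doing any work needs to be atomic, or two invocations can both see "not started" and both proceed. A minimal sketch using an exclusive file create; it assumes both invocations share a filesystem, and `claim_run` and the `.claimed` suffix are made up for the example.

```python
import os
from pathlib import Path


def claim_run(lock_dir: Path, key: str) -> bool:
    """Return True if this invocation claimed the key, False if already taken."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one caller wins.
        fd = os.open(lock_dir / f"{key}.claimed", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


# Usage at the very top of the run, before any upgrade work:
# if not claim_run(Path("locks"), "dep-patch-2026-07-03"):
#     raise SystemExit(0)   # duplicate trigger: exit without doing anything
```

On separate machines the same idea needs a shared store with an equivalent create-if-absent operation; the file is just the smallest thing that shows the pattern.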

Trade-offs: sizing caps, designing keys, and what to log

Choosing a retry and iteration cap

There's no universal right number — the right cap is a function of how expensive one iteration is and how bad the worst case looks if the loop keeps going past where a human would have stopped it. A loop whose iteration is cheap (a lint fix, a single file edit) and whose worst case is low-stakes (a rejected PR, no real risk) can afford a slightly higher cap, five or six attempts, because the cost of grinding a bit longer is small. A loop that touches anything with real blast radius — infrastructure changes, anything that can affect production even indirectly — should cap low, two or three attempts, and escalate fast, because the cost of the wrong kind of persistence is much higher than the cost of asking a human sooner. When in doubt, start lower than feels necessary; raising a cap later is a one-line change, but a loop that already did damage on attempt six isn't undone by lowering the cap afterward.
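
A cap is easier to review when all of its limits live in one object. The sketch below tracks the three kinds of limit mentioned in the guardrails list (retry count, wall-clock time, token spend); the `Budget` class is hypothetical, and the example numbers only echo the ranges above rather than recommending specific values.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Budget:
    max_attempts: int
    max_seconds: float
    max_tokens: int
    attempts: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def charge(self, tokens_used: int) -> None:
        """Record one finished attempt and the tokens it spent."""
        self.attempts += 1
        self.tokens += tokens_used

    def exhausted(self) -> Optional[str]:
        """Return the name of the first cap that was hit, or None."""
        if self.attempts >= self.max_attempts:
            return "attempts"
        if time.monotonic() - self.started >= self.max_seconds:
            return "wall_clock"
        if self.tokens >= self.max_tokens:
            return "tokens"
        return None


# Illustrative sizing only: cheap and low-stakes vs. real blast radius.
lint_fix_budget = Budget(max_attempts=5, max_seconds=900, max_tokens=200_000)
infra_change_budget = Budget(max_attempts=2, max_seconds=600, max_tokens=100_000)
```

Returning which cap was hit, not just a boolean, gives the escalation message something specific to say about why the loop stopped.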

Designing an idempotency key

A good idempotency key encodes exactly the dimensions along which the same task should be considered "the same run" — no more, no less. dep-patch-2026-07-03 works because the task is genuinely scoped to one calendar day; a key that omitted the date would treat every night's run as identical forever and never re-check for new CVEs, while a key that included something irrelevant, like a timestamp down to the second, would make every invocation look unique and defeat deduplication entirely. The general rule: build the key from the task's actual identity — what makes two runs the same task recurring versus two genuinely different tasks — and nothing else.
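
In code, the rule comes down to which fields go into the key. A small sketch, with `idempotency_key` as a hypothetical helper, showing that a date-granular key deduplicates within a day and rolls over the next day:

```python
from datetime import date, datetime


def idempotency_key(task: str, run_date: date) -> str:
    """Key = task identity plus the calendar day the task is scoped to."""
    return f"{task}-{run_date.isoformat()}"


# Two triggers on the same night, a few seconds apart, map to the same key...
first = datetime(2026, 7, 3, 2, 0, 0)
duplicate = datetime(2026, 7, 3, 2, 0, 7)
print(idempotency_key("dep-patch", first.date()))      # prints: dep-patch-2026-07-03
print(idempotency_key("dep-patch", duplicate.date()))  # prints: dep-patch-2026-07-03

# ...while the next night's run gets a new one.
print(idempotency_key("dep-patch", date(2026, 7, 4)))  # prints: dep-patch-2026-07-04
```

Passing `first.isoformat()` instead of the date would produce two different keys for the two triggers, which is exactly the over-specific key described above.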

What an audit log needs, at minimum

The minimum that makes an audit log actually useful after the fact: the trigger event and its timestamp, the exact scope of the task that ran, the raw verification output for every attempt (not a paraphrase — the actual exit code and stdout/stderr), the action ultimately taken, and the escalation status if one occurred. Skip the raw verification output and you're left debugging a production incident by reading the agent's own summary of what happened, which is exactly the kind of unverifiable self-report this whole discipline exists to avoid trusting.
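
Those minimum fields fit in one JSON object per run, appended to a file. This is a sketch with assumed field names; `append_audit_record` is hypothetical, and a real deployment would likely write to a log store rather than a local file.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_audit_record(
    log_file: Path,
    *,
    trigger: str,
    scope: str,
    attempts: list[dict],
    action: str,
    escalated: bool,
) -> dict:
    """Append one JSON line describing a run; return the record written."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trigger": trigger,
        "scope": scope,
        # Each attempt carries the raw exit code, stdout and stderr verbatim.
        "attempts": attempts,
        "action": action,
        "escalated": escalated,
    }
    with log_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

One line per run keeps the log greppable, and storing the gate's raw output means a reviewer never has to rely on a paraphrase.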

Anti-patterns that turn a loop into an incident

  • No stopping condition. A loop whose exit criteria are vague — "keep improving until it's good" — has no defined way to know it's done, which means in practice it either runs forever or stops arbitrarily.
  • Auto-merge without review. A loop that opens a PR is accountable. A loop that merges its own PR because the test suite passed has quietly removed the human checkpoint that catches the things tests don't — a subtly wrong approach that happens to pass every existing test.
  • Judging success by the model's own narration. "I've verified this is correct" in the agent's output is not a verification result — it's a sentence. The gate needs to be a signal the agent doesn't control the wording of: an exit code, a diff against an expected value, a test result.
  • Feeding a loop's own unverified output back into itself. If iteration two treats iteration one's unchecked result as ground truth, any error in iteration one becomes an assumption baked into everything after it.

If you wouldn't trust it unattended, don't run it unattended

A loop that needs a human sanity-checking its output every run hasn't earned automation yet — the verification gate isn't good enough. Fix the gate before you widen how often, or how unsupervised, the loop is allowed to run. Loosening the automation layer to compensate for a weak gate is how a minor bug becomes an overnight incident.

Key takeaways

  • The verification gate — an objective, model-independent check like a test or lint result — is the part of a loop that actually matters; everything else is scaffolding around it.
  • Bounded retries plus a real escalation path beat unbounded retries every time — a threshold that hands off to a human is a feature, not a failure to fully automate.
  • Idempotency, budget caps, dry-run modes, a fast kill switch, and an audit log are the guardrails that make unattended execution survivable, not optional extras.
  • Persist the attempt counter against an idempotency key derived from the task's real identity, not the process — a fresh process every run should not mean a fresh retry budget every run.
  • Distinguish a flaky gate from a real failure before it burns through the retry budget, and size the iteration cap by blast radius: cheap and low-stakes can afford more attempts, anything touching production should cap low and escalate fast.
  • Never judge a loop's success by the agent's own claim that it worked — judge it by a signal the agent doesn't control the wording of, and log the raw signal, not a paraphrase of it.

A loop is only as safe as the checks it runs against, and a Noddle Deck persona pack's commands are exactly the kind of concrete, exit-code-driven checks a verification gate should be built on instead of a model's own narration.

bash
noddle-deck pack install developer

Wiring one of those commands into a gate is a reasonable starting point before you build a bespoke check for a new scheduled or watch-fix loop of your own.
