
Concept 03

Self-Healing Loops

Workflows that diagnose failures, attempt fixes, and escalate only when self-correction is exhausted. Designed to fail gracefully, not to never fail.

"A workflow that stops on first failure requires human intervention for every routine error. Loops handle those automatically."

AI agents produce imperfect output. Tests will fail. Reviews will find issues. This is expected, not exceptional. Self-healing loops make this manageable.

Why loops improve on linear workflows

The straightforward approach: run Plan, run Build, run Test, run Review. If anything fails, throw an error and stop. This works for deterministic systems like compilers, where the same input always produces the same output. But AI agents are probabilistic. A builder might produce code that fails a test case. A reviewer might flag a style issue. These are not catastrophic failures. They are expected imperfections in a probabilistic process.

Self-healing treats failures as normal operations to handle, not exceptions that stall the workflow. When a test fails or a review finds issues, the workflow diagnoses the problem, applies a fix, and retries automatically.

Without loops

Build -> Test -> FAIL -> Stop. The developer reviews the failure and restarts.

With self-healing

Build -> Test -> FAIL -> Diagnose -> Patch -> Re-test -> PASS -> Continue. Routine failures are resolved automatically, saving developer time for non-routine problems.
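The self-healing flow above can be sketched as a plain retry loop. This is a minimal illustration, not a specific framework's API: `build`, `run_tests`, `patch`, and `escalate` are hypothetical callables standing in for the workflow's real phases.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TestResult:
    passed: bool
    details: str = ""


def self_healing_loop(build: Callable[[], str],
                      run_tests: Callable[[str], TestResult],
                      patch: Callable[[str, TestResult], str],
                      escalate: Callable[[str, TestResult], None],
                      max_retries: int = 3) -> Optional[str]:
    """Build once, then test / diagnose / patch until green or retries run out."""
    code = build()
    result = run_tests(code)
    attempts = 0
    while not result.passed:
        if attempts == max_retries:
            escalate(code, result)      # retries exhausted: hand off to a human
            return None
        code = patch(code, result)      # diagnose the failure and apply a fix
        result = run_tests(code)        # re-test
        attempts += 1
    return code                         # happy path: workflow continues
```

Note the two exits: success (return the working code) and exhaustion (escalate with context). There is no path that loops forever.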

The test retry loop

Triggered when tests fail after the Build phase produces code. The workflow packages the failure details (test name, expected vs. actual, stack trace, relevant source file) and hands them to a builder agent with the original plan. The builder patches the code, tests re-run, and the cycle repeats until all pass or the retry limit is reached.

[Diagram] Run Tests -> All suites pass? YES -> Review. NO -> Builder patches the failures -> re-test (max 3 retries) -> Escalate on exhaustion.
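The failure details the workflow packages for the builder might look like the following. The field names and prompt format are illustrative assumptions, not a defined schema:

```python
from dataclasses import dataclass


@dataclass
class TestFailure:
    """One failing test, packaged as input for the builder agent."""
    test_name: str
    expected: str
    actual: str
    stack_trace: str
    source_file: str


def format_failure(f: TestFailure) -> str:
    """Render a failure as prompt text for the builder (hypothetical format)."""
    return (f"Test `{f.test_name}` failed in {f.source_file}\n"
            f"  expected: {f.expected}\n"
            f"  actual:   {f.actual}\n"
            f"  trace:    {f.stack_trace}")
```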

What the builder agent receives on retry

Each retry gives the builder progressively richer context. This is artifact chaining in action. Failure details become input for the fix:

  1. The original plan artifact (what was supposed to be built)
  2. The current code state (what exists now)
  3. Specific test failure details (expected vs. actual, stack traces)
  4. Previous fix attempts (what was already tried and did not work)

The fourth item is critical. Without the history of previous attempts, the builder might try the same fix twice. The retry prompt explicitly instructs: "If a failure was present in the previous iteration and your prior fix did not work, try a DIFFERENT approach."
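Assembling those four items into a retry prompt can be sketched as below. The section headings and function name are assumptions; the only behavior taken from the text is that the fix-history section, with its "try a DIFFERENT approach" instruction, appears only once there is history to show:

```python
def build_retry_prompt(plan: str, code: str, failures: list[str],
                       previous_attempts: list[str]) -> str:
    """Assemble the four context items the builder receives on each retry."""
    sections = [
        "## Original plan\n" + plan,
        "## Current code state\n" + code,
        "## Test failures\n" + "\n".join(failures),
    ]
    if previous_attempts:
        # Item 4: the history that prevents repeating the same fix.
        sections.append(
            "## Previous fix attempts (did NOT work)\n"
            + "\n".join(previous_attempts)
            + "\nIf a failure was present in the previous iteration and your "
              "prior fix did not work, try a DIFFERENT approach.")
    return "\n\n".join(sections)
```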

The review-patch loop

Triggered when the Review phase finds blocker-severity issues. The workflow packages the issue details (file, line, severity, description, suggested fix) and sends them to a builder agent. After patching, tests re-run (the test retry loop may fire within this), then review runs again. The cycle continues until there are no blockers or the retry limit is reached.

[Diagram] Review analyzes the code -> Clean? YES -> Document. BLOCKER -> Builder patches the issues -> re-test -> re-review (max 3 retries) -> Escalate on exhaustion.
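The review-patch cycle differs from the test loop in that the gate is a review pass, not a test run, and the patch step re-runs tests internally. A minimal sketch, with `review`, `patch_and_retest`, and `escalate` as hypothetical stand-ins for the real phases:

```python
from typing import Callable, Optional


def review_patch_loop(code: str,
                      review: Callable[[str], list],
                      patch_and_retest: Callable[[str, list], str],
                      escalate: Callable[[str, list], None],
                      max_retries: int = 3) -> Optional[str]:
    """Review, patch blockers, re-test, re-review; escalate after max_retries."""
    for attempt in range(max_retries + 1):
        blockers = [i for i in review(code) if i["severity"] == "blocker"]
        if not blockers:
            return code                          # clean: continue to Document
        if attempt == max_retries:
            escalate(code, blockers)             # retries exhausted
            return None
        # The test retry loop may fire inside this patch-and-retest step.
        code = patch_and_retest(code, blockers)
```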

Severity classification matters

Not every review finding triggers the loop. The Review agent classifies issues by severity:

Blocker

Triggers the loop. Must be fixed before the workflow can continue.

Tech-debt

Logged but does not block. Becomes a future task.

Skippable

Noted, no action. Style preferences, minor suggestions.
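The three-way classification above amounts to a simple triage step before the loop decides whether to fire. The policy-table names below are illustrative:

```python
# Severity -> action mapping (names are illustrative, not a defined schema).
ISSUE_POLICY = {
    "blocker": "fix_now",     # triggers the review-patch loop
    "tech_debt": "log_task",  # recorded as a future task, does not block
    "skippable": "ignore",    # noted only
}


def triage(issues: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split review findings: issues to fix now vs. issues to log for later."""
    fix = [i for i in issues if ISSUE_POLICY[i["severity"]] == "fix_now"]
    log = [i for i in issues if ISSUE_POLICY[i["severity"]] == "log_task"]
    return fix, log
```

Only a non-empty `fix` list starts the loop; everything else passes through without blocking.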

The full rebuild loop

The nuclear option. Triggered when Review finds architectural issues that patches cannot fix. The approach itself is wrong, not just the implementation details. The workflow returns to the Build phase with the review feedback as additional context.

Default max retries: 1. This is expensive. If the approach is fundamentally wrong, a human should probably weigh in. But one automatic retry with the review feedback often resolves the issue. The builder now knows what went wrong and can take a different approach.

Loop termination conditions

Happy path

All gates pass. Loop exits, workflow continues to the next phase.

Max retries exhausted

Loop exits, workflow escalates to human with full context.

No new information

If the retry produces the same failure as the previous attempt, the loop detects this and exits early rather than wasting retries. The system diffs failure outputs between iterations.
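Diffing failure outputs between iterations requires normalizing away volatile details first, or two identical failures will never compare equal. A sketch of the idea, where the normalization rule (dropping `time=` lines) is a placeholder for whatever noise the real test output contains:

```python
import hashlib


def normalize(failure_output: str) -> str:
    """Strip volatile details before comparing. Dropping `time=` lines is a
    placeholder: real normalization is workflow-specific (timestamps,
    memory addresses, temp paths, etc.)."""
    return "\n".join(line for line in failure_output.splitlines()
                     if not line.startswith("time="))


def same_failure(prev: str, curr: str) -> bool:
    """True if the retry reproduced the identical failure: exit the loop early."""
    digest = lambda s: hashlib.sha256(normalize(s).encode()).hexdigest()
    return digest(prev) == digest(curr)
```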

Why 3 retries is the default

In practice, most fixable issues resolve within 1-2 retries. If 3 retries have not fixed the problem, the issue is likely deeper: wrong approach, underspecified plan, or missing context. Each retry invokes an agent, so 3 retries triples the cost of a phase. Diminishing returns set in fast.

Teams can configure this: set the max to 1 for fail-fast workflows, or to 5 for complex integration tests where more iterations are justified.

Escalation: when loops fail

When a loop exhausts its retries, the workflow does not just say "Error: tests failed." It hands over a full forensic picture: what happened, what was tried, and why it did not work.

Original plan artifact

Add token refresh logic to src/auth/tokens.py. Modify the refresh_token() method to check expiry before attempting refresh. Add a fallback to re-authenticate if the refresh token itself has expired.

Files targeted: src/auth/tokens.py:112-130, src/auth/session.py:55-67

After providing input, the workflow resumes from the phase where it stopped, not from scratch (unless the human directs otherwise).
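The full forensic picture handed to the human might be bundled like this. The field names follow the `escalation.include` list in the workflow config; the dict structure and the `failed_phase` parameter are assumptions for illustration:

```python
def escalation_packet(plan: str, build_report: str, test_results: str,
                      review_issues: list[str], fix_attempts: list[str],
                      failed_phase: str) -> dict:
    """Bundle everything a human needs: what happened, what was tried, why
    it did not work, and where the workflow should resume."""
    return {
        "plan": plan,                   # what was supposed to be built
        "build_report": build_report,   # what exists now
        "test_results": test_results,   # what happened
        "review_issues": review_issues,
        "fix_attempts": fix_attempts,   # what was already tried
        "resume_from": failed_phase,    # resume here, not from scratch
    }
```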

Anti-patterns


Infinite loops

Never remove the max retry limit. Always have a termination condition. An uncapped loop can burn through your entire budget on a single unfixable issue.


Retrying without new information

If the builder agent patches the same code the same way twice, the loop should break. Inject the failure history into the retry prompt so the agent tries a different approach each time.


Retrying non-deterministic failures

Flaky tests should be identified and quarantined, not retried endlessly. Add a "known flaky" list to the test gate config so the loop can distinguish real failures from noise.
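Quarantining is just a set difference against the known-flaky list before the loop decides whether to retry. The test names below are invented examples:

```python
# Names from the "known flaky" list in the test gate config (examples only).
KNOWN_FLAKY = {"test_network_timeout", "test_cache_race"}


def real_failures(failed_tests: set[str]) -> set[str]:
    """Filter quarantined flaky tests so the loop retries only real failures."""
    return failed_tests - KNOWN_FLAKY
```

If `real_failures` returns an empty set, the gate passes without burning a retry.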

Configuring loops

# workflow.yaml -- loop configuration
loops:
  test_retry:
    max: 3
    trigger: test_failure
    on_exhaust: escalate
  review_patch:
    max: 3
    trigger: blocker
    ignore: [tech_debt, skippable]
    on_exhaust: escalate
  full_rebuild:
    max: 1
    trigger: architectural_issue
    on_exhaust: escalate

escalation:
  target: human
  include: [plan, build_report, test_results, review_issues]