Concept 03
Self-Healing Loops
Workflows that diagnose failures, attempt fixes, and escalate only when self-correction runs out of options. Designed to fail gracefully, not to never fail.
"A workflow that stops on first failure requires human intervention for every routine error. Loops handle those automatically."
AI agents produce imperfect output. Tests will fail. Reviews will find issues. This is expected, not exceptional. Self-healing loops make this manageable.
Why loops improve on linear workflows
The straightforward approach: run Plan, run Build, run Test, run Review. If anything fails, throw an error and stop. This works for deterministic systems like compilers, where the same input always produces the same output. But AI agents are probabilistic. A builder might produce code that fails a test case. A reviewer might flag a style issue. These are not catastrophic failures. They are expected imperfections in a probabilistic process.
Self-healing treats failures as normal operations to handle, not exceptions that stall the workflow. When a test fails or a review finds issues, the workflow diagnoses the problem, applies a fix, and retries automatically.
Without loops
Build -> Test -> FAIL -> Stop. The developer reviews the failure and restarts.
With self-healing
Build -> Test -> FAIL -> Diagnose -> Patch -> Re-test -> PASS -> Continue. Routine failures are resolved automatically, saving developer time for non-routine problems.
The test retry loop
Triggered when tests fail after the Build phase produces code. The workflow packages the failure details (test name, expected vs. actual, stack trace, relevant source file) and hands them to a builder agent with the original plan. The builder patches the code, tests re-run, and the cycle repeats until all pass or the retry limit is reached.
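A minimal sketch of the loop's shape, in Python. The helpers (run_tests, builder.patch, escalate) and the result object's fields are illustrative assumptions, not a real workflow API:

def test_retry_loop(plan, code, max_retries=3):
    # run_tests() is assumed to return an object with .passed and .failures;
    # builder.patch() invokes the builder agent; escalate() hands off to a human.
    attempts = []                                   # history of prior fix attempts
    result = run_tests(code)
    while not result.passed:
        if len(attempts) >= max_retries:
            escalate(plan, code, result, attempts)  # retry limit reached
            return None
        code = builder.patch(plan, code, result.failures, attempts)
        attempts.append(result.failures)            # remember what failed this round
        result = run_tests(code)                    # re-test and repeat
    return code                                     # all tests pass: continue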
What the builder agent receives on retry
Each retry gives the builder progressively richer context. This is artifact chaining in action. Failure details become input for the fix:
1. The original plan artifact (what was supposed to be built)
2. The current code state (what exists now)
3. Specific test failure details (expected vs. actual, stack traces)
4. Previous fix attempts (what was already tried and did not work)
The fourth item is critical. Without the history of previous attempts, the builder might try the same fix twice. The retry prompt explicitly instructs: "If a failure was present in the previous iteration and your prior fix did not work, try a DIFFERENT approach."
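One way to assemble that retry context. The structure and names are assumptions; the final instruction is quoted from the retry prompt above:

def build_retry_prompt(plan, code, failures, previous_attempts):
    # previous_attempts is assumed to be a list of {"patch": ..., "failures": ...}
    # dicts, one per earlier iteration -- the artifact chain for this retry.
    history = "\n".join(
        f"Attempt {i + 1}: tried {a['patch']}; still failing: {a['failures']}"
        for i, a in enumerate(previous_attempts)
    )
    return (
        f"Original plan:\n{plan}\n\n"
        f"Current code:\n{code}\n\n"
        f"Test failures (expected vs. actual, stack traces):\n{failures}\n\n"
        f"Previous fix attempts:\n{history or 'none'}\n\n"
        "If a failure was present in the previous iteration and your prior fix "
        "did not work, try a DIFFERENT approach."
    )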
The review-patch loop
Triggered when the Review phase finds blocker-severity issues. The workflow packages the issue details (file, line, severity, description, suggested fix) and sends them to a builder agent. After patching, tests re-run (the test retry loop may fire within this step), then review runs again. The cycle continues until there are no blockers or the retry limit is reached; a sketch follows the severity definitions below.
Severity classification matters
Not every review finding triggers the loop. The Review agent classifies issues by severity:
Blocker
Triggers the loop. Must be fixed before the workflow can continue.
Tech-debt
Logged but does not block. Becomes a future task.
Skippable
Noted, no action. Style preferences, minor suggestions.
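A sketch of that gating and the outer loop, assuming each review issue carries one of the three severities above; review, builder.patch, and log_future_tasks are illustrative names:

def review_patch_loop(plan, code, max_retries=3):
    for attempt in range(max_retries + 1):
        issues = review(code)                           # assumed reviewer agent call
        blockers  = [i for i in issues if i.severity == "blocker"]
        tech_debt = [i for i in issues if i.severity == "tech_debt"]
        log_future_tasks(tech_debt)                     # logged, never blocks
        # "skippable" issues are noted and dropped here
        if not blockers:
            return code                                 # gate passes, workflow continues
        if attempt == max_retries:
            break                                       # retries exhausted
        code = builder.patch(plan, code, blockers)      # file, line, suggested fix
        code = test_retry_loop(plan, code)              # the test retry loop may fire here
        if code is None:                                # inner loop already escalated
            return None
    escalate(plan, code, blockers)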
The full rebuild loop
The nuclear option. Triggered when Review finds architectural issues that patches cannot fix. The approach itself is wrong, not just the implementation details. The workflow returns to the Build phase with the review feedback as additional context.
Default max retries: 1. This is expensive. If the approach is fundamentally wrong, a human should probably weigh in. But one automatic retry with the review feedback often resolves the issue. The builder now knows what went wrong and can take a different approach.
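A sketch of that branch, under the same assumed helpers; extra_context is an illustrative parameter name:

def handle_architectural_issue(plan, code, feedback, rebuilds_used, max_rebuilds=1):
    # Fires when review decides the approach itself is wrong, not the details.
    if rebuilds_used >= max_rebuilds:
        escalate(plan, code, feedback)   # fundamentally wrong: a human weighs in
        return None
    # Return to the Build phase with the review feedback as added context,
    # so the builder knows what went wrong and can take a different approach.
    return builder.build(plan, extra_context=feedback)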
Loop termination conditions
Happy path
All gates pass. Loop exits, workflow continues to the next phase.
Max retries exhausted
Loop exits, workflow escalates to human with full context.
No new information
If the retry produces the same failure as the previous attempt, the loop detects this and exits early rather than wasting retries. The system diffs failure outputs between iterations.
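A sketch of that early-exit check. Normalizing volatile details (addresses, timestamps) before comparing is an assumption about what diffing failure outputs involves:

import re

def failure_signature(failures):
    # failures is assumed to be a list of objects with .test and .message.
    text = "\n".join(sorted(f"{f.test}: {f.message}" for f in failures))
    text = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", text)        # memory addresses
    text = re.sub(r"\d{2}:\d{2}:\d{2}", "HH:MM:SS", text)   # timestamps
    return text

def made_progress(previous_failures, current_failures):
    # Identical signatures mean the retry reproduced the same failure:
    # exit the loop early instead of spending the remaining retries.
    return failure_signature(previous_failures) != failure_signature(current_failures)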
Why 3 retries is the default
In practice, most fixable issues resolve within 1-2 retries. If 3 retries have not fixed the problem, the issue is likely deeper: wrong approach, underspecified plan, or missing context. Each retry invokes an agent, so a phase that exhausts all 3 retries costs up to four times as much as one that passes on the first run. Diminishing returns set in fast.
Teams can configure this: set it to 1 for fail-fast workflows, or 5 for complex integration tests where more iterations are justified.
Escalation: when loops fail
When a loop exhausts its retries, the workflow does not just say "Error: tests failed." It hands over a full forensic picture: what happened, what was tried, and why it did not work.
Original plan artifact
Add token refresh logic to src/auth/tokens.py. Modify the refresh_token() method to check expiry before attempting refresh. Add a fallback to re-authenticate if the refresh token itself has expired.
Files targeted: src/auth/tokens.py:112-130, src/auth/session.py:55-67
Once the human provides input, the workflow resumes from the phase where it stopped, not from scratch (unless the human directs otherwise).
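A sketch of the handover payload. The first four fields mirror the include list in the configuration at the end of this page; the rest are assumptions:

def escalation_payload(state):
    # Everything a human needs to diagnose without re-running the workflow:
    # what happened, what was tried, and why it did not work.
    return {
        "plan": state.plan,                    # original plan artifact
        "build_report": state.build_report,    # what the builder produced
        "test_results": state.test_results,    # final failing output
        "review_issues": state.review_issues,  # outstanding blockers, if any
        "fix_attempts": state.attempts,        # every patch tried, in order
        "resume_from": state.phase,            # resume here, not from scratch
    }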
Anti-patterns
Infinite loops
Never remove the max retry limit. Always have a termination condition. An uncapped loop can burn through your entire budget on a single unfixable issue.
Retrying without new information
If the builder agent patches the same code the same way twice, the loop should break. Inject the failure history into the retry prompt so the agent tries a different approach each time.
Retrying non-deterministic failures
Flaky tests should be identified and quarantined, not retried endlessly. Add a "known flaky" list to the test gate config so the loop can distinguish real failures from noise.
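A sketch of that filter. The known_flaky set and the log_quarantined helper are illustrative; the configuration shown below does not include this key yet:

def real_failures(failures, known_flaky):
    # known_flaky is the quarantine list from the test gate config.
    flaky = [f for f in failures if f.test in known_flaky]
    real  = [f for f in failures if f.test not in known_flaky]
    if flaky:
        log_quarantined(flaky)   # surfaced for maintenance, never retried
    return real                  # only real failures trigger the loop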
Configuring loops
# workflow.yaml -- loop configuration
loops:
  test_retry:
    max: 3
    trigger: test_failure
    on_exhaust: escalate
  review_patch:
    max: 3
    trigger: blocker
    ignore: [tech_debt, skippable]
    on_exhaust: escalate
  full_rebuild:
    max: 1
    trigger: architectural_issue
    on_exhaust: escalate

escalation:
  target: human
  include: [plan, build_report, test_results, review_issues]
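A sketch of loading this file into loop policies, assuming PyYAML; the dataclass shape is illustrative:

from dataclasses import dataclass, field
import yaml  # PyYAML

@dataclass
class LoopPolicy:
    max: int
    trigger: str
    on_exhaust: str
    ignore: list = field(default_factory=list)

def load_loop_policies(path="workflow.yaml"):
    # Reads only the loops: section; escalation settings are handled separately.
    with open(path) as f:
        config = yaml.safe_load(f)
    return {name: LoopPolicy(**spec) for name, spec in config["loops"].items()}

policies = load_loop_policies()
# policies["test_retry"].max == 3, policies["full_rebuild"].max == 1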