Concept 07
Autonomous Software Engineering
The graduation path from assisted to fully autonomous. Four levels of maturity, each earned through gate quality, not ambition.
"You cannot skip levels by just turning on auto-merge. The autonomy level must match the gate quality."
Think of self-driving cars. L1 is cruise control. L2 is lane assist. L3 is highway autopilot. L4 is fully autonomous. The conditions matter, and so does the engineering that got you there.
The Path from Manual to Autonomous
Most teams start by running prompts manually and reviewing every output. This is fine. The goal is to graduate, not to skip levels. Each level adds autonomy while maintaining or increasing safety through better gates, templates, and monitoring.
Autonomy is gate authority
The four levels don't describe different workflows. They describe different answers to the question: when a gate says "fail," who decides what happens next?
At Level 1, the human decides. Gates are informational. At Level 2, gates decide within phases (self-healing loops run), but humans decide between phases (checkpoints). At Level 3, gates decide everything except the final merge. At Level 4, gates decide everything including the merge.
The workflow phases, prompt templates, classification gates, and healing gates stay the same across levels. What changes is the disposition gate configuration: which phase failures are fatal, which are warnings, and whether a human checkpoint exists.
This is why you can't skip levels by configuration alone. You need to observe your gates making accurate decisions at lower authority before granting them higher authority. See Quality Gates: The three concerns inside every gate for the full gate taxonomy.
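The level-by-level shift in authority can be sketched as a small lookup: for each autonomy level, who resolves a gate failure within a phase, between phases, and at merge. The level numbers come from this section; the scope names and `decider` helper are illustrative, not framework API.

```python
# Who decides when a gate says "fail", by autonomy level.
# Scopes: within_phase (healing loops), between_phases (checkpoints), merge.
AUTHORITY = {
    1: {"within_phase": "human", "between_phases": "human", "merge": "human"},
    2: {"within_phase": "gate",  "between_phases": "human", "merge": "human"},
    3: {"within_phase": "gate",  "between_phases": "gate",  "merge": "human"},
    4: {"within_phase": "gate",  "between_phases": "gate",  "merge": "gate"},
}

def decider(level: int, scope: str) -> str:
    """Return who resolves a gate failure at the given scope."""
    return AUTHORITY[level][scope]
```

Note that the table is monotone: each level hands one more scope to the gates, which is why levels are earned in order rather than toggled.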
Level 1: Assisted
The agent helps. The human drives. You run each phase manually, inspect every output, decide whether to proceed. Write an issue, run the Plan prompt, read the plan, run Build, read the code, run tests, review, ship. All manual.
Risk
Low. Human in full control at every step.
Use when
Learning the framework. High-risk projects. New workflows.
You need
Prompt templates. Basic tool config. You are the gate.
Disposition profile
| Phase | Disposition |
|---|---|
| All phases | Human decides after each phase |
Level 2: Supervised
The agent runs phases on its own. The human approves at checkpoints. Agent runs Plan and pauses. Human reviews, approves. Agent runs Build + Test (including retry loops). Human reviews test results. Agent runs Review. Human reviews findings. Agent runs Document + Deploy.
Risk
Medium. Human validates at strategic checkpoints.
Use when
Established templates. Moderate confidence. Non-critical projects.
You need
Reliable templates. Quality gates. Checkpoint mechanism.
```yaml
# workflow.yaml -- Level 2
autonomy: supervised
checkpoints:
  after_plan:   { require: human_approval, timeout: 24h }
  after_test:   { require: human_approval, timeout: 24h }
  after_review: { require: human_approval, timeout: 24h }
```
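A checkpoint of this kind is just a blocking wait on a human signal with a timeout. The sketch below assumes a `poll` callable that reports whether approval has been recorded (how approval is stored is up to the implementation); the names are illustrative.

```python
import time

class CheckpointTimeout(Exception):
    """Raised when no human approval arrives before the deadline."""

def await_approval(checkpoint: str, poll, timeout_s: float,
                   interval_s: float = 1.0) -> None:
    """Block after a phase until a human approves, or raise on timeout.

    `poll` is any zero-argument callable returning True once approval
    for this checkpoint has been recorded.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if poll():
            return  # approval recorded: the workflow may proceed
        time.sleep(interval_s)
    raise CheckpointTimeout(f"no approval for {checkpoint!r} within {timeout_s}s")
```

In the config above, a 24h timeout means a stalled checkpoint fails loudly rather than silently proceeding, which keeps the human genuinely in the loop.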
Disposition profile
| Phase | Disposition |
|---|---|
| Plan | Checkpoint: human approves |
| Build | Continue (healing loops run autonomously) |
| Test | Checkpoint: human reviews failures |
| Review | Checkpoint: human reviews verdict |
| Document | Continue |
| Deploy | Human merges PR |
Level 3: Autonomous
The agent runs the entire workflow end-to-end. The human reviews only the final output, the PR. Issue goes in, PR comes out. Quality gates and self-healing loops are the safety net. The human reviews the PR like any code review.
Risk
Medium-high. PR review catches issues if gates miss them.
Use when
Mature workflow. Well-tuned gates. Strong test coverage.
You need
Strong gates. Coverage thresholds. Reliable loops. Monitoring.
Disposition profile
| Phase | Disposition |
|---|---|
| Plan | Required: abort on failure |
| Build | Required: abort on failure |
| Test | Optional: warn on failure, continue |
| Review | Required: abort on failure |
| Document | Required: abort on failure |
| Deploy | Creates PR, human merges |
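The required/optional dispositions in the table above can be read as a simple control rule: a required failure aborts the run, an optional failure logs a warning and continues. A minimal sketch, with phase and disposition names taken from the table (the function itself is illustrative):

```python
# Disposition profile for Level 3, mirroring the table above.
L3_PROFILE = {
    "plan": "required", "build": "required", "test": "optional",
    "review": "required", "document": "required",
}

def run_pipeline(profile: dict, results: dict):
    """Apply a disposition profile to per-phase pass/fail results.

    Returns (phases_that_ran, warnings). A required failure aborts,
    so later phases never run; an optional failure only warns.
    """
    ran, warnings = [], []
    for phase, disposition in profile.items():
        ran.append(phase)
        if not results[phase]:
            if disposition == "required":
                break               # abort: remaining phases never run
            warnings.append(phase)  # optional: warn and continue
    return ran, warnings
```

This is why a Test failure at Level 3 still reaches PR review (as a warning on the PR), while a Build failure stops the run outright.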
Level 4: Autonomous Software Engineering (ASE)
The agent runs. The gates pass. The PR auto-merges. No human in the loop. Issue goes in, code goes to production. The Monitor phase watches for anomalies. If something breaks, a new issue is created automatically, and the workflow runs again.
Risk
High. Requires excellent gates, monitoring, and automated rollback.
Use when
Proven workflow (95%+ success over 50+ runs). Low-risk changes.
You need
Everything from L3, plus auto-merge, monitoring, rollback, and anomaly detection.
```yaml
# workflow.yaml -- Level 4: ASE
autonomy: ase
checkpoints: none
gates:
  test:
    coverage_min: 90
    all_pass: true
  review:
    max_blockers: 0
    max_critical: 0
    tech_debt_logged: true
  deploy:
    max_files_changed: 20
    auto_merge:
      enabled: true
      delay: 5m
monitor:
  anomaly_detection: true
  rollback_on: [error_rate_spike, latency_increase]
  auto_create_issue: true
```
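Before auto-merge fires, every gate threshold in the config must hold simultaneously. A sketch of that final check, using the thresholds from the config above (the `run` metric names are assumptions for illustration):

```python
# Gate thresholds copied from the Level 4 workflow.yaml above.
GATES = {
    "coverage_min": 90,
    "max_blockers": 0,
    "max_critical": 0,
    "max_files_changed": 20,
}

def can_auto_merge(run: dict) -> bool:
    """True only if every gate threshold holds for this run's metrics."""
    return (
        run["coverage"] >= GATES["coverage_min"]
        and run["all_tests_pass"]
        and run["blockers"] <= GATES["max_blockers"]
        and run["critical"] <= GATES["max_critical"]
        and run["files_changed"] <= GATES["max_files_changed"]
    )
```

The conjunction matters: a run with perfect tests but 25 changed files still stops, because `max_files_changed` caps blast radius independently of correctness.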
Disposition profile
| Phase | Disposition |
|---|---|
| Plan | Required: abort on failure |
| Build | Required: abort on failure |
| Test | Required: abort on failure |
| Review | Required: abort on failure |
| Document | Optional: warn on failure, continue |
| Deploy | Auto-merge |
Measuring readiness
Moving between levels is not about changing what the workflow does. It's about changing how much authority you grant to gate decisions. The readiness criteria below measure whether your gates have earned that authority.
You earn the right to remove humans from the loop by building gates that catch what humans would catch. These metrics tell you when you are ready to move up:
Readiness criteria
| Metric | L1 -> L2 | L2 -> L3 | L3 -> L4 |
|---|---|---|---|
| Successful runs | 10+ | 50+ | 100+ |
| Gate accuracy | Any | >90% | >98% |
| Human override rate | Any | <20% | <5% |
| Average retries | Any | <2 | <1.5 |
Gate quality and autonomy are coupled. If your gates catch 70% of issues, Level 2 is a good fit. Better gates earn you more autonomy. Each improvement in gate accuracy earns the team confidence to let the workflow proceed further on its own.
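The readiness table above is mechanical enough to automate. A sketch of a promotion check, with thresholds copied from the table ("Any" encoded as an always-satisfied bound); the metric and transition names are illustrative:

```python
# Thresholds from the readiness criteria table. "Any" is encoded as a
# bound that any real value satisfies.
THRESHOLDS = {
    "L1->L2": {"runs": 10,  "gate_accuracy": -1.0, "override_rate": 2.0,  "avg_retries": float("inf")},
    "L2->L3": {"runs": 50,  "gate_accuracy": 0.90, "override_rate": 0.20, "avg_retries": 2.0},
    "L3->L4": {"runs": 100, "gate_accuracy": 0.98, "override_rate": 0.05, "avg_retries": 1.5},
}

def ready(transition: str, m: dict) -> bool:
    """True if the observed metrics clear every bar for this transition."""
    t = THRESHOLDS[transition]
    return (m["runs"] >= t["runs"]
            and m["gate_accuracy"] > t["gate_accuracy"]
            and m["override_rate"] < t["override_rate"]
            and m["avg_retries"] < t["avg_retries"])
```

All four criteria must clear at once: 60 clean runs with 93% gate accuracy qualifies a team for Level 3, but not Level 4, where the run count and accuracy bars both rise.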
Why most teams should stay at Level 2-3
For critical systems (payments, authentication, infrastructure), Level 2-3 with human oversight is the responsible choice. Level 4 is appropriate for low-risk changes, proven workflows, and projects with solid monitoring and rollback.
The framework does not push teams toward Level 4. It supports whatever level matches the team's risk tolerance and gate maturity. The destination is not always the top of the pyramid. It is the level where your quality and your comfort intersect.
The monitor phase as safety valve
At Level 3-4, the Monitor phase becomes critical. It closes the outer loop. Production anomalies (error rate spikes, performance degradation) automatically generate new issues that feed back into Plan. The workflow is self-correcting at the production level, not just the code level.
If Monitor detects a regression caused by a recent workflow run, it can trigger a revert, a hotfix workflow run, or a human alert, depending on configuration. This is what makes Level 4 viable: the workflow can fix its own mistakes in production, not just in development.
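The routing described here, revert, hotfix issue, or human alert depending on configuration, reduces to a small dispatch. The anomaly names mirror the `rollback_on` list in the Level 4 config; the action strings and function are illustrative:

```python
def monitor_action(anomaly: str, config: dict) -> str:
    """Route a detected production anomaly to a configured response."""
    if anomaly in config.get("rollback_on", []):
        return "revert"            # known-bad signal: roll back immediately
    if config.get("auto_create_issue", False):
        return "create_issue"      # feeds back into the Plan phase
    return "alert_human"           # nothing configured: escalate
```

The ordering encodes a policy: rollback beats issue creation, because restoring production comes before diagnosing the regression.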