
Concept 07

Autonomous Software Engineering

The graduation path from assisted to fully autonomous. Four levels of maturity, each earned through gate quality, not ambition.

"You cannot skip levels by just turning on auto-merge. The autonomy level must match the gate quality."

Think of self-driving cars. L1 is cruise control. L2 is lane assist. L3 is highway autopilot. L4 is fully autonomous. The conditions matter, and so does the engineering that got you there.

The Path from Manual to Autonomous

Most teams start by running prompts manually and reviewing every output. This is fine. The goal is to graduate, not to skip levels. Each level adds autonomy while maintaining or increasing safety through better gates, templates, and monitoring.

L1 Assisted: human drives every phase
L2 Supervised: human approves at checkpoints
L3 Autonomous: human reviews the PR
L4 ASE: zero human intervention

Autonomy and gate quality increase together from L1 to L4.

Autonomy is gate authority

The four levels don't describe different workflows. They describe different answers to the question: when a gate says "fail," who decides what happens next?

At Level 1, the human decides. Gates are informational. At Level 2, gates decide within phases (self-healing loops run), but humans decide between phases (checkpoints). At Level 3, gates decide everything except the final merge. At Level 4, gates decide everything including the merge.

The workflow phases, prompt templates, classification gates, and healing gates stay the same across levels. What changes is the disposition gate configuration: which phase failures are fatal, which are warnings, and whether a human checkpoint exists.

This is why you can't skip levels by configuration alone. You need to observe your gates making accurate decisions at lower authority before granting them higher authority. See Quality Gates: The three concerns inside every gate for the full gate taxonomy.
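The "disposition gate configuration" that changes between levels can be pictured as a small config fragment. This is a hypothetical sketch, not part of the framework: the field names are assumptions loosely modeled on the workflow.yaml examples later on this page, showing the three dials that move between levels.

```yaml
# Hypothetical sketch: the only block that differs between levels.
# Field names are assumptions modeled on the workflow.yaml examples below.
dispositions:
  plan: fatal        # a failed Plan phase aborts the run
  test: warning      # a failed Test phase logs a warning and continues
checkpoints:
  after_plan: true   # does a human checkpoint exist at this boundary?
auto_merge: false    # who owns the final merge decision
```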


Level 1: Assisted

The agent helps. The human drives. You run each phase manually, inspect every output, decide whether to proceed. Write an issue, run the Plan prompt, read the plan, run Build, read the code, run tests, review, ship. All manual.

Risk

Low. Human in full control at every step.

Use when

Learning the framework. High-risk projects. New workflows.

You need

Prompt templates. Basic tool config. You are the gate.

Disposition profile

Phase        Disposition
All phases   Human decides after each phase
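At this level the configuration is mostly the absence of automation. A hypothetical sketch, reusing the checkpoint schema from the Level 2 example on this page; "assisted" as an autonomy value is an assumption extrapolated from the level's name.

```yaml
# workflow.yaml -- Level 1 (sketch; reuses the checkpoint schema from
# the Level 2 example; "assisted" as an autonomy value is an assumption)
autonomy: assisted
checkpoints:
  after_plan:     { require: human_approval }
  after_build:    { require: human_approval }
  after_test:     { require: human_approval }
  after_review:   { require: human_approval }
  after_document: { require: human_approval }
```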

Level 2: Supervised

The agent runs phases on its own. The human approves at checkpoints. Agent runs Plan and pauses. Human reviews, approves. Agent runs Build + Test (including retry loops). Human reviews test results. Agent runs Review. Human reviews findings. Agent runs Document + Deploy.

Risk

Medium. Human validates at strategic checkpoints.

Use when

Established templates. Moderate confidence. Non-critical projects.

You need

Reliable templates. Quality gates. Checkpoint mechanism.

# workflow.yaml -- Level 2
autonomy: supervised
checkpoints:
  after_plan: { require: human_approval, timeout: 24h }
  after_test: { require: human_approval, timeout: 24h }
  after_review: { require: human_approval, timeout: 24h }

Disposition profile

Phase      Disposition
Plan       Checkpoint: human approves
Build      Continue (healing loops run autonomously)
Test       Checkpoint: human reviews failures
Review     Checkpoint: human reviews verdict
Document   Continue
Deploy     Human merges PR

Level 3: Autonomous

The agent runs the entire workflow end-to-end. The human reviews only the final output, the PR. Issue goes in, PR comes out. Quality gates and self-healing loops are the safety net. The human reviews the PR like any code review.

Risk

Medium-high. PR review catches issues if gates miss them.

Use when

Mature workflow. Well-tuned gates. Strong test coverage.

You need

Strong gates. Coverage thresholds. Reliable loops. Monitoring.

Disposition profile

Phase      Disposition
Plan       Required: abort on failure
Build      Required: abort on failure
Test       Optional: warn on failure, continue
Review     Required: abort on failure
Document   Required: abort on failure
Deploy     Creates PR, human merges
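Level 3 has no config block on this page, unlike Levels 2 and 4. A hypothetical sketch, extrapolated from the Level 2 and Level 4 examples; the gate fields used here are assumptions.

```yaml
# workflow.yaml -- Level 3 (sketch; extrapolated from the Level 2 and
# Level 4 examples on this page; the "required" field is an assumption)
autonomy: autonomous
checkpoints: none        # no mid-workflow human approvals
gates:
  test:
    required: false      # warn on failure, continue (per disposition profile)
  review:
    max_blockers: 0      # any blocker aborts the run
auto_merge:
  enabled: false         # agent opens the PR; a human merges it
```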

Level 4: Autonomous Software Engineering (ASE)

The agent runs. The gates pass. The PR auto-merges. No human in the loop. Issue goes in, code goes to production. The Monitor phase watches for anomalies. If something breaks, a new issue is created automatically, and the workflow runs again.

Risk

High. Requires excellent gates, monitoring, and automated rollback.

Use when

Proven workflow (95%+ success over 50+ runs). Low-risk changes.

You need

Everything from L3, plus auto-merge, monitoring, rollback, and anomaly detection.

# workflow.yaml -- Level 4: ASE
autonomy: ase
checkpoints: none
gates:
  test:
    coverage_min: 90
    all_pass: true
  review:
    max_blockers: 0
    max_critical: 0
    tech_debt_logged: true
  deploy:
    max_files_changed: 20
auto_merge:
  enabled: true
  delay: 5m
monitor:
  anomaly_detection: true
  rollback_on: [error_rate_spike, latency_increase]
  auto_create_issue: true

Disposition profile

Phase      Disposition
Plan       Required: abort on failure
Build      Required: abort on failure
Test       Required: abort on failure
Review     Required: abort on failure
Document   Optional: warn on failure, continue
Deploy     Auto-merge

Measuring readiness

Moving between levels is not about changing what the workflow does. It's about changing how much authority you grant to gate decisions. The readiness criteria below measure whether your gates have earned that authority.

You earn the right to remove humans from the loop by building gates that catch what humans would catch. These metrics tell you when you are ready to move up:

Readiness criteria

Metric                L1 -> L2   L2 -> L3   L3 -> L4
Successful runs       10+        50+        100+
Gate accuracy         Any        >90%       >98%
Human override rate   Any        <20%       <5%
Average retries       Any        <2         <1.5
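One way to keep these thresholds honest is to encode them as data rather than tribal knowledge, so a promotion is a reviewed config change instead of a judgment call. A hypothetical sketch of the L2 -> L3 row above; the file and field names are not part of the framework.

```yaml
# promotion.yaml -- hypothetical encoding of the L2 -> L3 row above
promote_to: autonomous
require:
  successful_runs: 50          # rolling or consecutive, your call
  gate_accuracy_min: 0.90
  human_override_rate_max: 0.20
  average_retries_max: 2
```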

Gate quality and autonomy are coupled. If your gates catch 70% of issues, Level 2 is a good fit. Better gates earn more autonomy: each improvement in gate accuracy widens the set of decisions the team can confidently delegate to the workflow.

Why most teams should stay at level 2-3

For critical systems (payments, authentication, infrastructure), Level 2-3 with human oversight is the responsible choice. Level 4 is appropriate for low-risk changes, proven workflows, and projects with solid monitoring and rollback.

The framework does not push teams toward Level 4. It supports whatever level matches the team's risk tolerance and gate maturity. The destination is not always the top of the pyramid. It is the level where your quality and your comfort intersect.

The monitor phase as safety valve

At Level 3-4, the Monitor phase becomes critical. It closes the outer loop. Production anomalies (error rate spikes, performance degradation) automatically generate new issues that feed back into Plan. The workflow is self-correcting at the production level, not just the code level.

Workflow -> Production -> Monitor
  on anomaly: create a new issue (feeds back into Plan)
  on ok: done

If Monitor detects a regression caused by a recent workflow run, it can trigger a revert, a hotfix workflow run, or a human alert, depending on configuration. This is what makes Level 4 viable: the workflow can fix its own mistakes in production, not just in development.
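Those three responses (revert, hotfix run, human alert) can be made explicit in configuration. A hypothetical expansion of the monitor block from the Level 4 example on this page; the window, anomaly names beyond those already listed, and action names are assumptions.

```yaml
# monitor.yaml -- hypothetical expansion of the Level 4 monitor block;
# the window and the response action names are assumptions
monitor:
  window: 15m                         # assumed observation window
  anomaly_detection: true
  responses:
    error_rate_spike: rollback        # revert the offending merge
    latency_increase: hotfix_run      # re-enter the workflow at Plan
    unknown_anomaly:  alert_human     # page a person, pause auto-merge
  auto_create_issue: true             # every response files an issue
```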