Concept 07
Autonomous Software Engineering
The graduation path from assisted to fully autonomous. Four levels of maturity, each earned through gate quality, not ambition.
"You cannot skip levels by just turning on auto-merge. The autonomy level must match the gate quality."
Think of self-driving cars. L1 is cruise control. L2 is lane assist. L3 is highway autopilot. L4 is fully autonomous. The conditions matter, and so does the engineering that got you there.
The Path from Manual to Autonomous
Most teams start by running prompts manually and reviewing every output. This is fine. The goal is to graduate, not to skip levels. Each level adds autonomy while maintaining or increasing safety through better gates, templates, and monitoring.
Autonomy is gate authority
The four levels don't describe different workflows. They describe different answers to the question: when a gate says "fail," who decides what happens next?
At Level 1, the human decides. Gates are informational. At Level 2, gates decide within phases (self-healing loops run), but humans decide between phases (checkpoints). At Level 3, gates decide everything except the final merge. At Level 4, gates decide everything including the merge.
The workflow phases, prompt templates, classification gates, and healing gates stay the same across levels. What changes is the disposition gate configuration: which phase failures are fatal, which are warnings, and whether a human checkpoint exists.
This is why you can't skip levels by configuration alone. You need to observe your gates making accurate decisions at lower authority before granting them higher authority. See Quality Gates: The three concerns inside every gate for the full gate taxonomy.
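The level-by-level shift in authority can be sketched as a small lookup: for each autonomy level, who resolves a gate failure within a phase, between phases, and at merge. The level numbers come from this section; the scope names and `decider` helper are illustrative, not framework API.

```python
# Who decides when a gate says "fail", by autonomy level.
# Scopes: within_phase (healing loops), between_phases (checkpoints), merge.
AUTHORITY = {
    1: {"within_phase": "human", "between_phases": "human", "merge": "human"},
    2: {"within_phase": "gate",  "between_phases": "human", "merge": "human"},
    3: {"within_phase": "gate",  "between_phases": "gate",  "merge": "human"},
    4: {"within_phase": "gate",  "between_phases": "gate",  "merge": "gate"},
}

def decider(level: int, scope: str) -> str:
    """Return who resolves a gate failure at the given scope."""
    return AUTHORITY[level][scope]
```

Note that the table is monotone: each level hands one more scope to the gates, which is why levels are earned in order rather than toggled.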
Level 1: Assisted
The agent helps. The human drives. You run each phase manually, inspect every output, decide whether to proceed. Write an issue, run the Plan prompt, read the plan, run Build, read the code, run tests, review, ship. All manual.
Risk
Low. Human in full control at every step.
Use when
Learning the framework. High-risk projects. New workflows.
You need
Prompt templates. Basic tool config. You are the gate.
Disposition profile
| Phase | Disposition |
|---|---|
| All phases | Human decides after each phase |
Level 2: Supervised
The agent runs phases on its own. The human approves at checkpoints. Agent runs Plan and pauses. Human reviews, approves. Agent runs Build + Test (including retry loops). Human reviews test results. Agent runs Review. Human reviews findings. Agent runs Document + Deploy.
Risk
Medium. Human validates at strategic checkpoints.
Use when
Established templates. Moderate confidence. Non-critical projects.
You need
Reliable templates. Quality gates. Checkpoint mechanism.
```yaml
# workflow.yaml -- Level 2
autonomy: supervised
checkpoints:
  after_plan:   { require: human_approval, timeout: 24h }
  after_test:   { require: human_approval, timeout: 24h }
  after_review: { require: human_approval, timeout: 24h }
```
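A checkpoint of this kind is just a blocking wait on a human signal with a timeout. The sketch below assumes a `poll` callable that reports whether approval has been recorded (how approval is stored is up to the implementation); the names are illustrative.

```python
import time

class CheckpointTimeout(Exception):
    """Raised when no human approval arrives before the deadline."""

def await_approval(checkpoint: str, poll, timeout_s: float,
                   interval_s: float = 1.0) -> None:
    """Block after a phase until a human approves, or raise on timeout.

    `poll` is any zero-argument callable returning True once approval
    for this checkpoint has been recorded.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if poll():
            return  # approval recorded: the workflow may proceed
        time.sleep(interval_s)
    raise CheckpointTimeout(f"no approval for {checkpoint!r} within {timeout_s}s")
```

In the config above, a 24h timeout means a stalled checkpoint fails loudly rather than silently proceeding, which keeps the human genuinely in the loop.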
Disposition profile
| Phase | Disposition |
|---|---|
| Plan | Checkpoint: human approves |
| Build | Continue (healing loops run autonomously) |
| Test | Checkpoint: human reviews failures |
| Review | Checkpoint: human reviews verdict |
| Document | Continue |
| Deploy | Human merges PR |
Level 3: Autonomous
The agent runs the entire workflow end-to-end. The human reviews only the final output, the PR. Issue goes in, PR comes out. Quality gates and self-healing loops are the safety net. The human reviews the PR like any code review.
Risk
Medium-high. PR review catches issues if gates miss them.
Use when
Mature workflow. Well-tuned gates. Strong test coverage.
You need
Strong gates. Coverage thresholds. Reliable loops. Monitoring.
Disposition profile
| Phase | Disposition |
|---|---|
| Plan | Required: abort on failure |
| Build | Required: abort on failure |
| Test | Optional: warn on failure, continue |
| Review | Required: abort on failure |
| Document | Required: abort on failure |
| Deploy | Creates PR, human merges |
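The required/optional dispositions in the table above can be read as a simple control rule: a required failure aborts the run, an optional failure logs a warning and continues. A minimal sketch, with phase and disposition names taken from the table (the function itself is illustrative):

```python
# Disposition profile for Level 3, mirroring the table above.
L3_PROFILE = {
    "plan": "required", "build": "required", "test": "optional",
    "review": "required", "document": "required",
}

def run_pipeline(profile: dict, results: dict):
    """Apply a disposition profile to per-phase pass/fail results.

    Returns (phases_that_ran, warnings). A required failure aborts,
    so later phases never run; an optional failure only warns.
    """
    ran, warnings = [], []
    for phase, disposition in profile.items():
        ran.append(phase)
        if not results[phase]:
            if disposition == "required":
                break               # abort: remaining phases never run
            warnings.append(phase)  # optional: warn and continue
    return ran, warnings
```

This is why a Test failure at Level 3 still reaches PR review (as a warning on the PR), while a Build failure stops the run outright.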
Level 4: Autonomous Software Engineering (ASE)
The agent runs. The gates pass. The PR auto-merges. No human in the loop. Issue goes in, code goes to production. The Monitor phase watches for anomalies. If something breaks, a new issue is created automatically, and the workflow runs again.
Risk
High. Requires excellent gates, monitoring, and automated rollback.
Use when
Proven workflow (95%+ success over 50+ runs). Low-risk changes.
You need
Everything from L3, plus auto-merge, monitoring, rollback, and anomaly detection.
```yaml
# workflow.yaml -- Level 4: ASE
autonomy: ase
checkpoints: none
gates:
  test:
    coverage_min: 90
    all_pass: true
  review:
    max_blockers: 0
    max_critical: 0
    tech_debt_logged: true
  deploy:
    max_files_changed: 20
    auto_merge:
      enabled: true
      delay: 5m
monitor:
  anomaly_detection: true
  rollback_on: [error_rate_spike, latency_increase]
  auto_create_issue: true
```
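Before auto-merge fires, every gate threshold in the config must hold simultaneously. A sketch of that final check, using the thresholds from the config above (the `run` metric names are assumptions for illustration):

```python
# Gate thresholds copied from the Level 4 workflow.yaml above.
GATES = {
    "coverage_min": 90,
    "max_blockers": 0,
    "max_critical": 0,
    "max_files_changed": 20,
}

def can_auto_merge(run: dict) -> bool:
    """True only if every gate threshold holds for this run's metrics."""
    return (
        run["coverage"] >= GATES["coverage_min"]
        and run["all_tests_pass"]
        and run["blockers"] <= GATES["max_blockers"]
        and run["critical"] <= GATES["max_critical"]
        and run["files_changed"] <= GATES["max_files_changed"]
    )
```

The conjunction matters: a run with perfect tests but 25 changed files still stops, because `max_files_changed` caps blast radius independently of correctness.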
Disposition profile
| Phase | Disposition |
|---|---|
| Plan | Required: abort on failure |
| Build | Required: abort on failure |
| Test | Required: abort on failure |
| Review | Required: abort on failure |
| Document | Optional: warn on failure, continue |
| Deploy | Auto-merge |
Measuring readiness
Moving between levels is not about changing what the workflow does. It's about changing how much authority you grant to gate decisions. The readiness criteria below measure whether your gates have earned that authority.
You earn the right to remove humans from the loop by building gates that catch what humans would catch. These metrics tell you when you are ready to move up:
Readiness criteria
| Metric | L1 -> L2 | L2 -> L3 | L3 -> L4 |
|---|---|---|---|
| Successful runs | 10+ | 50+ | 100+ |
| Gate accuracy | Any | >90% | >98% |
| Human override rate | Any | <20% | <5% |
| Average retries | Any | <2 | <1.5 |
Gate quality and autonomy are coupled. If your gates catch 70% of issues, Level 2 is a good fit. Better gates earn you more autonomy. Each improvement in gate accuracy earns the team confidence to let the workflow proceed further on its own.
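The readiness table above is mechanical enough to automate. A sketch of a promotion check, with thresholds copied from the table ("Any" encoded as an always-satisfied bound); the metric and transition names are illustrative:

```python
# Thresholds from the readiness criteria table. "Any" is encoded as a
# bound that any real value satisfies.
THRESHOLDS = {
    "L1->L2": {"runs": 10,  "gate_accuracy": -1.0, "override_rate": 2.0,  "avg_retries": float("inf")},
    "L2->L3": {"runs": 50,  "gate_accuracy": 0.90, "override_rate": 0.20, "avg_retries": 2.0},
    "L3->L4": {"runs": 100, "gate_accuracy": 0.98, "override_rate": 0.05, "avg_retries": 1.5},
}

def ready(transition: str, m: dict) -> bool:
    """True if the observed metrics clear every bar for this transition."""
    t = THRESHOLDS[transition]
    return (m["runs"] >= t["runs"]
            and m["gate_accuracy"] > t["gate_accuracy"]
            and m["override_rate"] < t["override_rate"]
            and m["avg_retries"] < t["avg_retries"])
```

All four criteria must clear at once: 60 clean runs with 93% gate accuracy qualifies a team for Level 3, but not Level 4, where the run count and accuracy bars both rise.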
Why most teams should stay at Level 2-3
For critical systems (payments, authentication, infrastructure), Level 2-3 with human oversight is the responsible choice. Level 4 is appropriate for low-risk changes, proven workflows, and projects with solid monitoring and rollback.
The framework does not push teams toward Level 4. It supports whatever level matches the team's risk tolerance and gate maturity. The destination is not always the top of the pyramid. It is the level where your quality and your comfort intersect.
The monitor phase as safety valve
At Level 3-4, the Monitor phase becomes critical. It closes the outer loop. Production anomalies (error rate spikes, performance degradation) automatically generate new issues that feed back into Plan. The workflow is self-correcting at the production level, not just the code level.
If Monitor detects a regression caused by a recent workflow run, it can trigger a revert, a hotfix workflow run, or a human alert, depending on configuration. This is what makes Level 4 viable: the workflow can fix its own mistakes in production, not just in development.
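The routing described here, revert, hotfix issue, or human alert depending on configuration, reduces to a small dispatch. The anomaly names mirror the `rollback_on` list in the Level 4 config; the action strings and function are illustrative:

```python
def monitor_action(anomaly: str, config: dict) -> str:
    """Route a detected production anomaly to a configured response."""
    if anomaly in config.get("rollback_on", []):
        return "revert"            # known-bad signal: roll back immediately
    if config.get("auto_create_issue", False):
        return "create_issue"      # feeds back into the Plan phase
    return "alert_human"           # nothing configured: escalate
```

The ordering encodes a policy: rollback beats issue creation, because restoring production comes before diagnosing the regression.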