
Is it really "the end of software engineering"?

A paper landed on my feed with the title "The End of Software Engineering." I opened it ready to roll my eyes.

Six days later the same paper had a new title, "Agentic Software," and a conclusion that said the opposite of the first one. The most honest edit in the paper was the title itself.

My short answer to the question is no. Software engineering is not ending. It is being relocated, from writing the decision logic to specifying, orchestrating, and verifying the systems that generate it.

Here is the gap that makes the case, and the paper buried it under a dramatic title. On isolated coding tasks, agents score above 80 percent. On continuous, real-world evolution of a codebase, they drop to at most 38 percent. That gap is the whole story: agents are real as augmentation, not yet as autonomy.

The cleanest argument for that view is the paper's own revision history. The author got there before I did.

The title that changed in six days

The first version, arXiv:2606.05608v1 dated 4 June 2026, was called "The End of Software Engineering: How AI Agents Are Fundamentally Restructuring the Software Paradigm."

The second version, arXiv:2606.05608v2 dated 10 June 2026, dropped the two words that did all the dramatic work: "End" and "Fundamentally." It became "Agentic Software: How AI Agents Are Restructuring the Software Paradigm."

Same author. Same evidence. Same formal models. Softer claims.

When a paper walks back its strongest framing within a week, the revision is not noise. It is data about how much the author trusted the original claim once it was on the page.

What the paper actually argues

Strip away the framing and the paper builds a tidy, four-move case. I will walk through all of it, because the useful parts are the ones the title drowned out.

NOTE

The paper rests on three central claims: agentic software is a first-principles necessity, not a fad; software is redefined, not replaced; and the work becomes an emergent discipline rather than a dead one. Everything else is scaffolding for those three.

Move 1: code is no longer where the decisions live

In traditional software, code is the carrier of decision logic. A human decides what the system should do, writes that logic down, and the system runs it.

The paper formalizes traditional software as a triple:

$$S = (C, D, E)$$

Here $C$ is compute resources, $D$ is the set of deterministic decision rules written into the source code, and $E$ is the execution environment. The load-bearing detail is that $D$ is static: it is fixed before any input arrives, so every change means a human finds the right rule and edits it by hand.

An agent is formalized as a different tuple:

$$A = (M_{\text{LLM}},\ T,\ M_{\text{mem}},\ \Pi)$$

The model $M_{\text{LLM}}$ is the reasoning engine, $T$ is a set of callable tools, $M_{\text{mem}}$ is a memory subsystem, and $\Pi$ is a planner. It runs as a loop: pick an action from the current state and memory, execute it, observe the new state, repeat.

$$a_t = M_{\text{LLM}}(s_t, M_{\text{mem}}), \qquad s_{t+1} = \text{exec}(a_t)$$

Decision logic that used to be written down ahead of time is now produced at runtime, on demand. The agent generates code to solve the task in front of it, runs it, and discards it. What persists is the agent's capability, not the code it emitted along the way.
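The loop is easier to see in code than in notation. What follows is a toy sketch, not the paper's implementation: `toy_model` is a hypothetical stand-in for the LLM, the state is a plain integer, the tool names are invented for illustration, and the planner $\Pi$ is folded into the model for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

State = int  # toy state: a counter the agent drives toward a target
Memory = List[Tuple[State, str]]  # (state seen, action chosen) pairs


@dataclass
class Agent:
    model: Callable[[State, Memory], str]       # stand-in for M_LLM
    tools: Dict[str, Callable[[State], State]]  # T, the callable tools
    memory: Memory = field(default_factory=list)  # M_mem

    def step(self, state: State) -> State:
        action = self.model(state, self.memory)  # a_t = M_LLM(s_t, M_mem)
        self.memory.append((state, action))
        return self.tools[action](state)         # s_{t+1} = exec(a_t)


def toy_model(state: State, memory: Memory, target: int = 3) -> str:
    """Hypothetical policy: increment until the target is reached."""
    return "increment" if state < target else "stop"


def run(agent: Agent, state: State, max_steps: int = 10) -> State:
    for _ in range(max_steps):
        state = agent.step(state)
        if agent.memory[-1][1] == "stop":
            break
    return state


agent = Agent(
    model=toy_model,
    tools={"increment": lambda s: s + 1, "stop": lambda s: s},
)
final_state = run(agent, 0)
print(final_state, agent.memory)
```

Nothing in `Agent` encodes the rule "count to three." The decision is made inside `model` at each step, which is the contrast with a static rule set $D$ that the paper is drawing.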

TIP

Keep this one sentence and you have the paper's spine: the durable asset stops being the code and becomes the agent that writes it.

Move 2: the complexity argument for why this is happening now

Human cognition is roughly fixed, while the interaction surface of a system grows combinatorially. For $n$ components, each pair may or may not interact, so the number of possible dependency graphs is:

$$2^{\binom{n}{2}}$$

The paper bounds this growth as $\Theta(2^{n^2})$, super-exponential, against a human capacity that stays flat. That mismatch, it argues, is why outsourcing reasoning to a model is attractive: model capability scales with training compute, so it can ride the curve that human cognition cannot.
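A few concrete values show how quickly that count grows. This is a small sketch that assumes undirected interactions between labelled components, which is how I read "each pair may or may not interact"; the function name is mine, not the paper's. The exponent is $\binom{n}{2} = n(n-1)/2$, so it grows quadratically in $n$.

```python
from math import comb


def dependency_graph_count(n: int) -> int:
    """Number of possible undirected dependency graphs on n labelled components."""
    return 2 ** comb(n, 2)


for n in (3, 5, 10):
    print(f"n={n:>2}  pairs={comb(n, 2):>2}  graphs={dependency_graph_count(n)}")
```

Ten components already allow $2^{45}$ possible graphs, roughly 35 trillion.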

WARNING

Even this argument was edited between versions. Version 1 stated the growth as a clean exponential, $\Theta(2^{n})$. Version 2 corrected it to the super-exponential $\Theta(2^{n^2})$ above. A paper that fixes its own math mid-revision is telling you how settled any of it is.

Move 3: three generations of software delivery

The paper places this as the third step in a familiar progression, where each generation hands more of the complexity to the provider.

| Dimension | Software 1.0 (Local) | Software 2.0 (SaaS) | Software 3.0 (Agent-as-a-Service) |
| --- | --- | --- | --- |
| Core mechanism | Installed binaries you run | Hosted apps you log into | Agents that act on your behalf |
| Complexity owner | You | The vendor's servers | The agent |
| Revenue model | License sale | Subscription | Outcome and usage |
| Exemplars | Microsoft, Oracle | Salesforce, AWS | OpenAI, Anthropic |

The framing is borrowed in spirit from Karpathy's "Software 2.0," but pushed one generation further: from shipping software, to hosting it, to selling the result an agent produces.

Move 4: what the agent is made of

The paper's Figure 1, adapted from a 2024 survey by Wang and colleagues, sketches the agent as a reasoning core wrapped in perception, memory, action, and tools, all looping against an external environment.

flowchart LR
    Env[External environment] -->|Perception| Core
    Core[LLM reasoning core] -->|Action| Env
    Core <-->|read / write| Mem[(Memory)]
    Core -->|Planning| Plan[Plan and next step]
    Plan --> Core
    Core -->|invoke| Tools[Tools]
    Tools --> Env

Set that against the old model and the contrast becomes a table the paper spells out dimension by dimension. This is its Table 2, the clearest single page in the whole document.

| Dimension | Traditional engineering | Agentic engineering |
| --- | --- | --- |
| Core artifact | Source code | Agent capability |
| Control center | Explicit logic in code | LLM reasoning core |
| Decision mechanism | Predefined rules | Runtime inference |
| Development cycle | Write, compile, test, deploy | Specify, generate, verify |
| Human role | Author of the logic | Director of intent |
| Complexity ceiling | Human cognition | Model capacity, scales with compute |
| Output unit | Shipped software | Resolved outcome |
| Error handling | Debug the code | Re-prompt, re-plan, re-verify |
| Evolution | Manual edits | Self-directed adaptation |

That is the idea worth keeping. Not "agents replace software," but "code stops being the durable asset." It is a real shift in where the value sits, and a useful lens even if you never touch an agent framework.

What the paper offers as evidence

The argument would be just a nice diagram without numbers, so the paper brings three kinds, and they are not equally strong.

The hardest of them is SWE-bench Verified, a benchmark of real GitHub issues. An open model, Lingma SWE-GPT 72B, resolves 30.20 percent of issues, against GPT-4o at 31.80 percent, with a smaller 7B variant at 18.20 percent. The paper notes the 72B model beats a Llama 3.1 405B baseline by 22.76 percent relative, despite being roughly six times smaller.
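It is worth unpacking what "22.76 percent relative" means, because it is easy to misread as percentage points. The sketch below assumes the usual definition, (new - base) / base. The implied baseline it prints is derived from the two figures quoted above; it is not a number stated in this article.

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100


def implied_baseline(new: float, gain_pct: float) -> float:
    """Baseline score implied by a score and its relative gain over that baseline."""
    return new / (1 + gain_pct / 100)


swe_gpt_72b = 30.20        # resolve rate quoted above, in percent
quoted_relative_gain = 22.76

baseline = implied_baseline(swe_gpt_72b, quoted_relative_gain)
size_ratio = 405 / 72      # parameter counts, in billions

print(f"implied 405B baseline: {baseline:.2f}%")
print(f"gap in points: {swe_gpt_72b - baseline:.2f}")
print(f"size ratio: {size_ratio:.3f}x")
```

Under that assumption the gap is under six percentage points, and the size ratio is closer to 5.6 than to six.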

The softer numbers come from a coordination pilot and a self-improving agent.

The paper reports a multi-agent debugging deployment with a 93 percent reduction in root-cause time and 200-plus engineering hours saved per month across 20-plus enterprise workflows (Kumar and Ramagopal, the paper's reference 7).

It also points to an open-source agent with 179,000-plus GitHub stars that self-patches its own skills, searches its conversation history, and delegates to subagents.

CAUTION

I read the last two as the paper reports them, not as established fact. The self-improving agent in particular reads partly aspirational, and it maps suspiciously well onto features people wish their tooling already had.

The honest part: the EvoClaw cliff

Here is where the paper is at its best, and it is the part the dramatic title buried.

The paper cites a benchmark called EvoClaw (Deng et al., arXiv:2603.13428). It tests 12 frontier models across four agent frameworks on continuous software evolution: sustained work across a commit history, where errors accumulate and each change has to preserve what already worked.

The result is a cliff.

Performance falls from above 80 percent on isolated tasks to at most 38 percent in continuous settings.

The paper's own Figure 2 plots the drop from 82 to 38, a fall of 44 points, or about 54 percent in relative terms.
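The size of that drop can be stated in absolute points or relative to the starting score, and the two are easy to confuse. A quick check, using only the two values read off Figure 2:

```python
isolated = 82     # success rate on isolated tasks, percent
continuous = 38   # success rate on continuous evolution, percent

point_drop = isolated - continuous            # absolute drop, percentage points
relative_drop = point_drop / isolated * 100   # drop as a share of the starting score

print(f"{point_drop} points, {relative_drop:.1f}% relative")
```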

xychart-beta
    title "EvoClaw: isolated tasks vs continuous evolution"
    x-axis ["Isolated tasks", "Continuous evolution"]
    y-axis "Success rate (%)" 0 --> 100
    bar [82, 38]

The paper names four reasons, and every working engineer will recognize them:

  • Context drift, as the codebase grows past what the agent can hold in view.
  • Error propagation, where a small early mistake compounds downstream.
  • Technical-debt blindness, where the agent optimizes for finishing the task and not for living with it later.
  • Verification fidelity, where the agent passes the tests while quietly introducing a semantic bug that only shows up on new inputs.
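The last failure mode is the hardest to picture, so here is a deliberately small illustration. It is my own example, not one from the paper or from EvoClaw: a leap-year function that satisfies every existing test and is still wrong.

```python
def is_leap_year(year: int) -> bool:
    # Plausible, test-passing, and wrong: ignores the century rule.
    return year % 4 == 0


def is_leap_year_correct(year: int) -> bool:
    # Gregorian rule: divisible by 4, except centuries not divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# The tests the (hypothetical) agent was asked to satisfy.
existing_tests = {2023: False, 2024: True, 2000: True}
# Inputs nobody wrote a test for.
new_inputs = {1900: False, 2100: False}

passes_existing = all(is_leap_year(y) == want for y, want in existing_tests.items())
passes_new = all(is_leap_year(y) == want for y, want in new_inputs.items())

print(passes_existing, passes_new)
```

The suite is green and the semantics are broken. In a one-off task that bug may never surface; across a long commit history, later changes get built on top of it.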

IMPORTANT

If you remember one thing from this paper, remember the gap between above 80 percent and 38 percent. That is the line between "agents are real as augmentation" and "agents are not yet real as autonomy." It is far more useful than any headline.

Where the paper says this is heading

The paper closes its argument with a four-stage roadmap. I include it because the dates are the giveaway: most of the bold capability sits comfortably in the future.

| Stage | Era | Agent capability | Human role | Representative systems |
| --- | --- | --- | --- | --- |
| I. Tool-augmented | 2023-2025 | Suggests inside a human-driven workflow | Author | Copilot, Claude Code |
| II. Single-task autonomous | 2025-2027 | Completes scoped tasks end to end | Reviewer | Devin, OpenHands |
| III. Multi-agent teams | 2026-2029 | Coordinated agents split the work | Orchestrator | LangChain, MetaGPT |
| IV. Self-evolving ecosystems | 2028+ | Systems improve themselves | Steward | General AI assistants |

The recommendations follow the stages. Practitioners are told to shift toward intent engineering, agent orchestration, and observability, and to move from human-in-the-loop to agent-in-the-driver's-seat with human oversight. Researchers are pointed at long-context state, verification, alignment, and the economics of all this. Organizations are told to find agent-ready workflows, build evaluation frameworks, and redesign team structures around them.

The clickbait test

Line up the two versions and the pattern is consistent. Every change softens the same nerve.

The title lost "End" and "Fundamentally." The abstract moved from "not an incremental improvement but a fundamental restructuring" to "a fundamental restructuring of what software is."

Claim 2 was renamed from "Paradigm Shift, Not Optimization" to "Software Redefined, Not Replaced." Its substance flipped too, from agents that "eliminate the software artifact" to a careful note that the agent itself is software.

The section once called "Eliminating the Intermediary" became "The Agent as Software." The discipline section moved from "A New Discipline" to "Expanding the Discipline."

Then the conclusion, where the walk-back is impossible to miss. Version 1 ended: "The old software engineering is ending; the new one has already begun." Version 2 ended: "The old software engineering is not ending; it is growing into something larger."

That is the whole article in two sentences the author wrote himself. Version 1 sold an ending. Version 2 sold an expansion.

What I actually believe

I used to treat "end of X" headlines as a tax on my attention. Skim, dismiss, move on. I read this one because the title change made me curious, and the title change turned out to be the most interesting thing in it.

Here is the judgment I will commit to, with its trade-off attached. "Agent as software" is a better frame than "agent replaces software." Adopting it costs you the clean drama of a replacement story, the tweet-sized prediction.

What you get back is a frame that survives contact with a real codebase. It tells you where to look: at the durable capability, not the disposable code.

The second thing I believe is narrower and more practical. I would hand an agent an isolated, well-scoped task with good tests today. I would not hand it sustained evolution of a system I have to maintain.

That is not caution for its own sake. It is exactly what the 80-to-38 gap predicts. The trade-off is real: drawing that line means accepting slower automation of the messy, high-value work, in exchange for not shipping a subtle semantic bug that passes every test.

What this means for engineers and managers

If even half of the agentic framing holds, two things change for teams and one thing does not.

The first shift is that intent and specification quality become a first-class engineering skill, not a soft one. When the agent generates the logic, the leverage moves to how precisely you can state the goal, the constraints, and what "good" looks like.

Vague tickets used to produce slow humans. Now they produce confident, wrong agents. The trade-off is that this work is harder to measure than lines shipped, so it is easy to under-invest in.

The second shift is that evaluation and observability stop being optional and become infrastructure. You cannot supervise a system whose reasoning you cannot see.

Tracing an agent's steps, scoring its output against a rubric, and catching the test-passing semantic bug are now platform concerns. They sit where logging and monitoring landed a decade ago.
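To make "tracing" and "scoring against a rubric" less abstract, here is a minimal sketch. Every name in it (`Trace`, `score`, the rubric keys, the recorded steps, `parser.py`) is hypothetical and chosen for illustration; a real platform would hook into whatever agent framework is in use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Trace:
    """Ordered record of what the agent did."""
    steps: List[str] = field(default_factory=list)

    def record(self, description: str) -> None:
        self.steps.append(description)


# A rubric maps a check name to a predicate over the output and the trace.
Rubric = Dict[str, Callable[[str, Trace], bool]]


def score(output: str, trace: Trace, rubric: Rubric) -> Dict[str, bool]:
    """Evaluate each rubric check and report pass/fail per check."""
    return {name: check(output, trace) for name, check in rubric.items()}


trace = Trace()
trace.record("read failing test")
trace.record("edit parser.py")
trace.record("run test suite")

rubric: Rubric = {
    "ran_tests": lambda out, tr: any("run test" in s for s in tr.steps),
    "non_empty_output": lambda out, tr: bool(out.strip()),
    "within_step_budget": lambda out, tr: len(tr.steps) <= 5,
}

result = score("patched parser", trace, rubric)
print(result)
```

Checks like these are shallow on purpose. They catch an agent that skipped the tests or ran away with the step budget, but not the test-passing semantic bug, which still needs new inputs or a human reviewer.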

The thing that does not change yet is the need for humans on sustained, multi-commit work. The benchmark says it plainly. Architectural oversight, quality calibration, and governance are still ours.

That is engineering. It just sits higher up the stack than typing the logic by hand.

Reading AI papers like an architect

This paper is a good reminder to separate three things that often get blended.

There are verified facts, like the SWE-bench numbers, which trace to a primary source. There are forecasts, like the roadmap that puts self-evolving ecosystems in "2028 and beyond," which are arguments about the future, not evidence about the present. And there are self-referential or vendor-flavored claims, like the self-patching agent, which deserve a raised eyebrow until someone independent checks them.

It also helps to read the provenance. This is a single-author position paper, dated mid-2026, from an author whose affiliation is an investment firm rather than a software-research lab.

None of that makes it wrong. All of it tells you to weigh the framing as a sharp opinion, not as settled consensus. The useful posture is to take the distinction and the EvoClaw gap, and to hold the dramatic numbers loosely.

The best edit was the title

So, is it really the end of software engineering? No. The author answered that himself, six days after asking it, by changing "is ending" to "is not ending; it is growing into something larger."

The work is not disappearing. It is moving. Less of it is writing the decision logic. More of it is specifying intent, orchestrating the systems that generate the logic, and verifying what they produce.

The strongest evidence says the same thing in numbers. Above 80 percent on isolated tasks, 38 percent on continuous evolution. Agents are real as augmentation, not yet as autonomy.

The paper made one genuinely honest move, and it was not a claim or a benchmark. It was the title. The strong version did not survive contact with its own author. That is usually the tell.