Coding agents grow up: from autocomplete to multi-agent orchestration.

For most of the past two years, the phrase "coding agent" meant something fairly modest: an autocomplete engine wrapped in a chat interface, occasionally able to run a shell command or read a file. That definition is now obsolete. The leading systems from Anthropic, OpenAI, Google, and a growing cluster of startups have crossed into a different category entirely. They plan tasks, decompose them across specialized sub-agents, evaluate their own outputs, and revise their approach when verification fails. The competitive frontier has moved from model quality to system architecture.
What changed in the underlying architecture.
Early coding assistants treated each request as a stateless prediction problem: given a prompt and some surrounding context, produce the most likely next tokens. This worked for short completions but degraded sharply on anything requiring more than a few coherent steps. The newer generation operates on a fundamentally different loop: plan, act, observe, critique, repeat.
A typical modern agent run begins with a planner that breaks a request into discrete sub-goals. Each sub-goal is handed to an executor that can read files, run tests, invoke linters, or call other tools. After execution, a separate evaluator examines the result against the original specification. If the evaluator rejects the output, the loop restarts with revised context. This separation of roles, often implemented as distinct prompts or even distinct models, is what distinguishes multi-agent orchestration from prompt chaining.
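The role separation described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: `plan`, `execute`, and `evaluate` are hypothetical stand-ins for what would be separate prompts or models, and the toy versions here are plain Python functions so that the control flow stays visible.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Attempt:
    sub_goal: str
    output: str
    accepted: bool
    feedback: str = ""

# Hypothetical role signatures; in a real system each would wrap a model call.
Planner = Callable[[str], list[str]]                  # request -> sub-goals
Executor = Callable[[str, str], str]                  # (sub_goal, context) -> output
Evaluator = Callable[[str, str], tuple[bool, str]]    # (sub_goal, output) -> (ok, feedback)

def orchestrate(request: str, plan: Planner, execute: Executor,
                evaluate: Evaluator, max_retries: int = 2) -> list[Attempt]:
    """Plan once, then run an execute/evaluate loop for every sub-goal."""
    history: list[Attempt] = []
    for sub_goal in plan(request):
        context = ""
        for _ in range(max_retries + 1):
            output = execute(sub_goal, context)
            ok, feedback = evaluate(sub_goal, output)
            history.append(Attempt(sub_goal, output, ok, feedback))
            if ok:
                break
            context = feedback  # rejected: retry with revised context
    return history

# Toy roles, for illustration only.
def toy_plan(request: str) -> list[str]:
    return [part.strip() for part in request.split(" and ")]

def toy_execute(sub_goal: str, context: str) -> str:
    # The first draft ignores the (made-up) requirement; the retry honours the feedback.
    return sub_goal.upper() if context else sub_goal

def toy_evaluate(sub_goal: str, output: str) -> tuple[bool, str]:
    ok = output.isupper()
    return ok, ("" if ok else "output must be upper case")

history = orchestrate("write parser and add tests", toy_plan, toy_execute, toy_evaluate)
for attempt in history:
    print(attempt.sub_goal, "accepted" if attempt.accepted else "rejected")
```

Keeping the evaluator behind its own signature is the point of the exercise: it can be swapped for a different prompt, a different model, or a deterministic check without touching the planner or the executor.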
Self-evaluation as a first-class capability.
The most consequential shift is that self-evaluation has stopped being a research curiosity and become a product feature. When an agent writes code, it now routinely runs the code, parses the failure output, and treats that output as new input. Test suites, type checkers, and static analyzers function as cheap external graders. No model weights are updated during this loop, so the agent learns nothing in the machine-learning sense, but it searches the solution space using feedback that earlier systems ignored.
A simplified version of this loop looks like the following:
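The listing below is an illustrative sketch under stated assumptions, not any vendor's implementation: `generate` is a hypothetical stand-in for a model call, and `verify` grades a candidate by running it together with its tests in a fresh Python interpreter.

```python
import subprocess
import sys
from typing import Callable, Optional

def verify(candidate: str, test_code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run candidate plus tests in a separate interpreter; exit status is the grade."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", candidate + "\n" + test_code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "verification timed out"
    return proc.returncode == 0, proc.stderr

def solve(task: str, test_code: str, generate: Callable[[str, str], str],
          feedback: str = "", budget: int = 3) -> Optional[str]:
    """Generate, verify, and recurse with the failure output as new input."""
    if budget == 0:
        return None  # budget exhausted without a passing candidate
    candidate = generate(task, feedback)
    ok, errors = verify(candidate, test_code)
    if ok:
        return candidate
    return solve(task, test_code, generate, feedback=errors, budget=budget - 1)

def toy_generate(task: str, feedback: str) -> str:
    # Hypothetical model stand-in: the first draft is buggy, the retry reacts to the traceback.
    if "AssertionError" in feedback:
        return "def add(a, b):\n    return a + b"
    return "def add(a, b):\n    return a - b"

tests = "assert add(2, 3) == 5"
result = solve("implement add", tests, toy_generate)
print(result)
```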
The recursion terminates when verification passes or a budget is exhausted. Production systems add caching, parallelism, and rollback, but the skeleton is consistent across vendors.
Why labs are differentiating on architecture.
Differences in raw model capability have compressed. The gap between the top three or four frontier models on standard coding benchmarks is small enough that benchmark scores no longer settle purchasing decisions. What separates products now is how the surrounding system handles long-horizon tasks: context management, tool invocation latency, parallel sub-agent coordination, and recovery from partial failure.
Anthropic has emphasized sub-agents that operate in isolated contexts and report back compressed summaries. OpenAI has invested in persistent task environments that survive across sessions. Google has pushed integration with its own developer tooling and large-context retrieval. Each approach reflects a bet about which bottleneck matters most, and each produces a noticeably different user experience even when the underlying model quality is comparable.
Practical consequences for developers.
Three implications follow for anyone integrating these tools into real work. First, specification quality matters more than prompt cleverness. Agents that self-evaluate need something concrete to evaluate against. Vague tasks produce vague verification, which produces drift. Writing precise acceptance criteria, even informally, materially improves outcomes.
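One informal way to make acceptance criteria concrete is to write them as named, executable checks. The sketch below assumes a hypothetical task (implement a `slugify` function) and hypothetical criteria; neither comes from any particular tool, and the `unmet` helper is invented for this example.

```python
import re
from typing import Callable

def slugify(text: str) -> str:
    """Candidate implementation an agent might produce (illustrative only)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

# Each criterion pairs a human-readable statement with a check on the candidate.
CRITERIA: dict[str, Callable[[Callable[[str], str]], bool]] = {
    "lowercases input": lambda f: f("Hello") == "hello",
    "collapses runs of non-alphanumerics into one hyphen": lambda f: f("a  &  b") == "a-b",
    "has no leading or trailing hyphens": lambda f: f("  hi! ") == "hi",
    "maps empty input to an empty string": lambda f: f("") == "",
}

def unmet(candidate: Callable[[str], str]) -> list[str]:
    """Return the criteria the candidate fails, in declaration order."""
    return [name for name, check in CRITERIA.items() if not check(candidate)]

print(unmet(slugify))    # criteria the full implementation misses
print(unmet(str.lower))  # criteria a lazy "just lowercase it" attempt misses
```

A request such as "make the slugs nicer" gives an evaluator nothing comparable to check; the list returned by `unmet` is exactly the kind of concrete feedback a self-evaluating agent can act on.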
Second, tool surfaces are part of the contract. An agent is only as capable as the tools it can call. Exposing a clean CLI, a typed API, or a deterministic test command gives the agent verifiable footholds. Exposing only a GUI or an underspecified script forces the agent to guess.
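As an illustration of a verifiable foothold, the sketch below wraps a command in a function with a typed result. The `ToolResult` shape and the `run_tool` name are assumptions made for this example, and the commands shown simply invoke the Python interpreter so that the snippet runs anywhere; a real project would substitute its own test command.

```python
import subprocess
import sys
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolResult:
    ok: bool
    exit_code: int
    stdout: str
    stderr: str

def run_tool(argv: list[str], timeout_s: float = 60.0) -> ToolResult:
    """Run a command without a shell and return a structured, unambiguous result."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # -1 is an arbitrary sentinel chosen for this sketch.
        return ToolResult(False, -1, "", f"timed out after {timeout_s}s")
    return ToolResult(proc.returncode == 0, proc.returncode, proc.stdout, proc.stderr)

# Stand-ins for a passing and a failing test command.
passing = run_tool([sys.executable, "-c", "print('2 passed')"])
failing = run_tool([sys.executable, "-c", "import sys; sys.exit(3)"])
print(passing.ok, failing.ok, failing.exit_code)
```

The agent never has to interpret free-form output to learn whether the step succeeded: the exit code decides, and the captured streams are available as context for the next attempt.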
Third, cost and latency profiles have changed. A single user request may now trigger dozens of model calls across multiple agents. Token budgets that were generous for chat are tight for orchestration. Teams adopting these tools at scale need to model usage in terms of agent runs, not individual completions.
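A back-of-the-envelope model shows why per-completion budgeting breaks down. Every number below is a placeholder chosen for illustration, not a measured figure or any vendor's price, and the `RunProfile` type is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunProfile:
    calls_per_run: int           # model calls triggered by one user request
    input_tokens_per_call: int   # average prompt size, context included
    output_tokens_per_call: int  # average completion size

def tokens_per_run(p: RunProfile) -> int:
    return p.calls_per_run * (p.input_tokens_per_call + p.output_tokens_per_call)

def cost_per_run(p: RunProfile, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one run given per-million-token prices (hypothetical values)."""
    input_cost = p.calls_per_run * p.input_tokens_per_call * usd_per_m_input / 1_000_000
    output_cost = p.calls_per_run * p.output_tokens_per_call * usd_per_m_output / 1_000_000
    return input_cost + output_cost

# Placeholder profiles: one chat completion versus one orchestrated agent run.
chat = RunProfile(calls_per_run=1, input_tokens_per_call=2_000, output_tokens_per_call=500)
agent = RunProfile(calls_per_run=30, input_tokens_per_call=8_000, output_tokens_per_call=700)

print(tokens_per_run(chat), tokens_per_run(agent))
print(f"{cost_per_run(chat, 1.0, 4.0):.3f} {cost_per_run(agent, 1.0, 4.0):.3f}")
```

With these placeholder inputs, the agent run consumes roughly a hundred times the tokens of the chat completion, which is why the unit of budgeting shifts from the completion to the run.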
Where the category is heading.
The trajectory points toward agents that maintain longer-lived state, coordinate with other agents across organizational boundaries, and take on tasks measured in hours rather than minutes. The open question is not whether this happens but how: how reliability scales with task length, how verification holds up when specifications are themselves ambiguous, and how teams will divide responsibility between human reviewers and automated critics. The autocomplete era produced useful tools. The orchestration era is producing something closer to a junior collaborator, with all the supervision overhead that implies.
Chamith Dilshan
Editor in Chief
Founder of C2Labs. Writing about AI, science, and technology in Sinhala and English.