Beyond Prompt Engineering: Context, Harness, and the Product Architecture of AI Agents

Much of the early conversation around applied AI focused on prompt engineering: if the output was weak, the prompt needed improvement. Recent work in agent engineering suggests a broader shift. Prompt engineering still matters, yet it now functions as only one layer in a larger system. Context engineering determines what information the model sees, when it sees it, and what gets preserved across work. Harness engineering goes further still: it defines the runtime environment, verification loops, handoff artifacts, decomposition strategy, and coordination structures that allow an agent to work productively across long horizons rather than a single short session. Anthropic’s recent engineering posts trace this progression clearly, from context curation to initializer agents, structured progress files, parallel agent teams, planner-generator-evaluator loops, and decoupled execution architectures. This essay argues that the next generation of AI products will be differentiated less by clever prompts alone than by the systems built around model reasoning (Anthropic Engineering, 2025a; Anthropic Engineering, 2025b; Anthropic Engineering, 2026a; Anthropic Engineering, 2026b; Anthropic Engineering, 2026c).

1. Introduction

There is a persistent tendency in AI discourse to compress everything into prompt engineering. The framing is attractive because it is intuitive: better instructions, better outputs. But as models are increasingly asked to operate over longer time horizons, use tools, recover from failures, and build complete applications, the central challenge shifts. The question is no longer only what the model should be told; it is also what the model should remember, what it should retrieve, how it should verify its work, and how it should recover when its reasoning begins to drift. In that sense, the frontier has moved from prompt design toward system design (Anthropic Engineering, 2025a; Anthropic Engineering, 2026c).

For product builders, this matters because agent quality is not just a model-capability question. It is a product-architecture question. The best agent is not merely the one with the strongest base model, but the one embedded in the most supportive environment: the right memory design, the right retrieval pattern, the right decomposition logic, the right evaluator, and the right execution boundaries. In this sense, agent engineering is becoming closer to product architecture than to prompt writing (Anthropic Engineering, 2025a; Anthropic Engineering, 2026b).

2. Prompt Engineering Is Now Only One Layer of the Stack

Anthropic’s post on context engineering explicitly describes context engineering as the “natural progression” of prompt engineering. Prompt engineering remains the craft of writing and organizing instructions for model behavior, but context engineering expands the frame to the entire set of tokens and artifacts available during inference. That includes prompts, tool outputs, retrieved files, external data, MCP surfaces, and message history. Once the problem is framed this way, it becomes clear that output quality depends not only on instruction quality, but on context quality (Anthropic Engineering, 2025a).

This distinction becomes even more important once an agent works across multiple turns. A one-shot system may succeed or fail primarily on prompt phrasing, but a long-running system accumulates state. That state must be filtered, prioritized, compressed, and refreshed. As Anthropic notes, the key engineering question becomes: what configuration of context is most likely to yield the desired behavior? That is a much more expansive and product-relevant question than prompt wording alone (Anthropic Engineering, 2025a).

For an AI product manager, this reframing changes where differentiation comes from. A product can no longer assume that stronger models alone will guarantee strong outcomes. Two teams can use the same underlying model and get materially different results if one system handles memory, retrieval, and coordination better than the other. In practice, this means the product layer increasingly determines whether model intelligence becomes usable work.

3. Context Engineering as Runtime State Design

Anthropic defines context engineering as the set of strategies for curating and maintaining the optimal set of tokens during inference. This matters because context is both finite and degradable: as the window fills, models may lose focus, become confused, or recall information less reliably. The practical implication is that context should be treated as a scarce resource rather than an infinitely expandable buffer (Anthropic Engineering, 2025a).

One of the most useful ideas in this post is the move away from dumping everything into context upfront. Instead, agentic systems can maintain lightweight references to information and retrieve the full material only when necessary. This turns context from a static transcript into an actively managed working set. The model no longer needs all relevant data in immediate view; it needs a good enough index into what can be fetched on demand (Anthropic Engineering, 2025a).

This also clarifies a broader architectural lesson. In agent systems, not every capability belongs in the model’s immediate working context. Some capabilities should remain at the boundary layer, invoked only when needed, rather than continuously exposed inside the reasoning loop. Otherwise, the model may spend too much attention on tool surfaces and transport mechanics rather than the task itself. This reinforces a central point of context engineering: good performance depends not on maximizing what is available in context, but on carefully controlling what enters context and when (Anthropic Engineering, 2025a).

In product terms, context engineering is therefore not just a technical discipline; it is a design philosophy. It asks what should stay in active working memory, what should be externalized, what should be reloaded only when needed, and what should be transformed before being shown to the model. These are precisely the kinds of product decisions that begin to separate robust agents from flashy demos.

4. Harness Engineering: From Reasoning to Sustained Work

If context engineering manages what the model sees, harness engineering manages what the model is able to do over time. Anthropic’s November 2025 post on long-running agents shows that good long-horizon performance does not come from simply looping a model call indefinitely. Instead, it comes from structuring the environment so that each fresh session can orient itself, make incremental progress, and leave durable artifacts for the next session (Anthropic Engineering, 2025b).

Anthropic’s solution is a two-part harness: an initializer agent and a coding agent. The initializer sets up the environment on the first run, including an init.sh script, a progress file, and an initial git commit. Subsequent coding agents then make incremental progress while leaving structured updates behind. If every new session begins with no memory, then continuity cannot be assumed; it must be manufactured by the harness (Anthropic Engineering, 2025b).

Anthropic also found that agents needed an explicit feature list to avoid either trying to one-shot the whole application or prematurely declaring the work complete. In one example, the initializer produced a structured feature file in JSON, with each feature initially marked as failing. Later agents were instructed to change only the pass status after proper verification. The article notes that JSON worked better than Markdown because the model was less likely to overwrite or corrupt it. This illustrates how harness design is not merely procedural scaffolding; it is part of the cognition support system itself (Anthropic Engineering, 2025b).

The broader implication is that agent quality depends on whether the environment makes progress legible. A model performs much better when the next task is visible, success criteria are explicit, and bad states are recoverable. In that sense, harness engineering is less about constraining the model than about giving it a world it can navigate coherently.

5. Parallel Agents and the Importance of Feedback Structure

Anthropic’s C compiler experiment extends this logic to a much larger scale. In that project, 16 agents worked in parallel on a shared Rust codebase. Over nearly 2,000 Claude Code sessions and about $20,000 in API costs, the agent team produced a roughly 100,000-line compiler that could build Linux 6.9 on x86, ARM, and RISC-V (Anthropic Engineering, 2026a).

The central implication of that post is not just the headline artifact, but the design of the loop around it. Anthropic emphasizes tests that keep agents on track without human oversight, work structures that allow multiple agents to make progress in parallel, and a simple loop that lets one task immediately give way to the next. Much of the engineering value came from making progress observable and sustained, rather than from model prompting alone (Anthropic Engineering, 2026a).

This has an important implication for product design. When an AI product fails in long-running work, the instinct is often to blame the model. But many such failures are environment failures. The model may not know which subproblem matters, how to detect regressions efficiently, or how to avoid drowning in irrelevant output. Better harnesses address these by improving the structure of feedback. A good agent product, then, is partly a model product and partly a feedback-design product.

6. Why the March 2026 Harness Post Stands Out

The March 2026 article on harness design for long-running application development most clearly shows that harnesses do not merely improve reliability; they can also expand the ambition, richness, and coherence of what gets built (Anthropic Engineering, 2026c).

Rajasekaran begins from a useful observation: prompt engineering and earlier harness work had already pushed Claude well above baseline, but both eventually hit ceilings. To move further, he introduced a multi-agent structure inspired partly by generator-evaluator dynamics: first a generator-evaluator setup for frontend design, then a three-agent architecture — planner, generator, and evaluator — for long-running autonomous coding. The planner turns a short prompt into a richer product specification, the generator builds against that spec, and the evaluator verifies the result against concrete criteria (Anthropic Engineering, 2026c).

This separation is productively elegant. A naive single-agent workflow collapses planning, implementation, and judgment into one stream of reasoning. The three-agent loop deliberately splits them apart. The planner handles scope and ambition. The generator handles execution. The evaluator handles correctness and quality control. Once separated, each stage becomes easier to tune and easier to improve independently.

Anthropic’s examples show why this matters. In the frontend-design setup, the evaluator used explicit grading criteria and the Playwright MCP to interact with the live page directly before scoring the result. In the long-running application setup, the generator and evaluator negotiated sprint contracts, and the evaluator used Playwright to test UI features, API endpoints, and database states against granular criteria. In one retro game maker run, the planner expanded a one-sentence prompt into a 16-feature spec across ten sprints, going far beyond a minimal implementation (Anthropic Engineering, 2026c).

The significance of this structure is that harness design affects not only reliability, but also product depth. The planner prevents underscoping. The evaluator prevents shallow completion. Together, they create a system that is more likely to produce something that feels like a product rather than a demo.

7. Harnesses Should Shrink as Models Improve

One of the most valuable arguments in the March 2026 post is that harnesses are not static truths. They are temporary supports around current model weaknesses. Rajasekaran explains that, after Opus 4.6 improved long-horizon performance and long-context retrieval, parts of the earlier harness became unnecessary. He removed the sprint construct while keeping the planner and evaluator, arguing that the newer model could sustain the work more natively without that additional decomposition layer (Anthropic Engineering, 2026c).

This is an important lesson for AI product management. Good harness design is not about adding the most scaffolding possible. It is about identifying the minimum amount of structure that still produces quality. A harness that once created lift can later become overhead: extra orchestration, extra token cost, extra latency, extra complexity, and even extra context clutter. In strong products, supports should be removed once they no longer pay for themselves.

That principle also reinforces a broader lesson in agent design: structure should remain only where it materially improves quality, reliability, or coherence. As models improve, the product challenge is not just to add new layers, but to know when to remove old ones.

8. Managed Agents and the Move Toward Stable Interfaces

Anthropic’s Managed Agents post takes this one step further by asking what happens when harness assumptions continue to go stale across model generations. Its answer is architectural: build around interfaces that outlast today’s harness. In this design, the “brain” is decoupled from the “hands,” meaning that Claude and the harness are separated from sandboxes, tools, and session storage. Each becomes a replaceable interface rather than a tightly coupled containerized bundle (Anthropic Engineering, 2026b).

This architecture has clear product and infrastructure advantages. Anthropic reports that when everything lived inside one container, the system became fragile: if the container failed, the session was lost, debugging became difficult, and the server effectively became a “pet” rather than interchangeable infrastructure. Decoupling reduced that fragility and allowed sessions, harnesses, and execution environments to fail and recover independently (Anthropic Engineering, 2026b).

It also improved security and performance. Anthropic writes that in the coupled design, model-generated code could potentially access credentials living in the same container, so the structural fix was to make sure tokens were never reachable from the sandbox where generated code ran. The same architectural shift also reduced latency: Anthropic reports that p50 time-to-first-token dropped by roughly 60% and p95 by over 90% once containers were provisioned only when needed (Anthropic Engineering, 2026b).

This reframes harness design as part of platform design. Once agents are expected to operate over longer time horizons and more complex environments, reliability and capability are no longer enough. The system must also be debuggable, replaceable, and safe under failure.

9. Conclusion

The common story in AI product circles is that stronger models will gradually reduce the need for engineering around them. Anthropic’s recent engineering work suggests a more nuanced reality. Better models do reduce some scaffolding, but they also expand the range of what can be delegated. As a result, the engineering challenge shifts upward: away from one-shot prompting and toward systems that let models remember selectively, retrieve intelligently, hand off work cleanly, coordinate over time, and act safely in complex environments (Anthropic Engineering, 2025a; Anthropic Engineering, 2025b; Anthropic Engineering, 2026b; Anthropic Engineering, 2026c).

Prompt engineering is therefore not obsolete. It has simply been demoted from the whole problem to one layer of the stack. Context engineering decides what the model can think with. Harness engineering decides whether that thinking becomes durable work. Platform architecture decides whether that work becomes trustworthy, scalable, and worth shipping. For anyone hoping to build AI-native products, that is the real frontier now.

References

Anthropic Engineering. “Building a C compiler with a team of parallel Claudes.” 2026. Link

Anthropic Engineering. “Effective context engineering for AI agents.” 2025. Link

Anthropic Engineering. “Effective harnesses for long-running agents.” 2025. Link

Anthropic Engineering. “Harness design for long-running application development.” 2026. Link

Anthropic Engineering. “Scaling Managed Agents: Decoupling the brain from the hands.” 2026. Link

1. Introduction#

2. Prompt Engineering Is Now Only One Layer of the Stack#

3. Context Engineering as Runtime State Design#

4. Harness Engineering: From Reasoning to Sustained Work#

5. Parallel Agents and the Importance of Feedback Structure#

6. Why the March 2026 Harness Post Stands Out#

7. Harnesses Should Shrink as Models Improve#

8. Managed Agents and the Move Toward Stable Interfaces#

9. Conclusion#