Whitepaper No. 01 · 2026 · April 9, 2026

Engineering History · Tug-of-war for Control

The Evolution of LLM Application Engineering · 2023 to 2026

Prompt → Context → Agent → Harness. A four-year tug-of-war for control.

By

Leo Song

Contents

  1. Opening · An overlooked main line
  2. 2023 · Prompt Engineering
  3. 2024 · Context Engineering
  4. 2025 · Agentic Engineering
  5. 2026 · Harness Engineering
  6. The main line across four years
  7. Coda · Where the curve is pointing

Abstract


The usual narrative of the past four years of LLMs reads as “stronger models, more applications,” a monotonically increasing curve of progress. But if you've actually shipped LLM systems, that curve is deeply misleading. The real story is different: every time the model takes a step forward, humans lose a layer of control, and engineers have to invent a higher-order mechanism to take it back. This isn't a curve of progress. It is a continuous tug-of-war.

§ 00

This isn't a curve of progress. It's a continuous tug-of-war.

Opening · An overlooked main line

The usual narrative of the past four years reads as “stronger models, more applications,” a monotonically increasing curve of progress. But the real story is different:

Every time the model takes a step forward, humans lose a layer of control, and engineers have to invent a higher-order mechanism to take it back.

This isn't a curve of progress; it's a continuous tug-of-war. Prompt, Context, Agent, Harness. On the surface they look like technical upgrades. Underneath, control is swinging back and forth: every leap in capability strips away an old method of constraint, forcing engineers to invent a higher-order one.

In a more mathematical framing: a two-axis system in which the X-axis (capability) rises monotonically while the Y-axis (control) oscillates. The “tug-of-war” isn't a metaphor. It is the actual trajectory of the system.

FIG. 01 · FOUR ERAS, FOUR DOMAINS
| Year | Problem | Control mechanism | Underlying nature |
| ---- | ------- | ----------------- | ----------------- |
| 2023 | Doesn't listen | Prompt | Activate latent reasoning |
| 2024 | Doesn't know | Context | Attention budgeting |
| 2025 | Acts wrongly | Agent framework | Stateful decision loop |
| 2026 | Uncontrollable | Harness | Deterministic shell + probabilistic core |

Hold this main line, and the four chapters below stop reading like a glossary.

§ 01 · 2023

How to talk to an AI correctly

Prompt Engineering

ChatGPT detonated the consumer market in late 2022; GPT-4, Claude 2, and Gemini followed in 2023. Models became unprecedentedly capable of generation but were highly unpredictable: prone to hallucination, drift, and brittle reasoning. For most developers, this was the first time they had to seriously think about “how do I collaborate with a probabilistic model.”

Out-of-the-box behavior was unstable, and context windows were small (GPT-4 launched at 8K, later 32K; Claude 2 hit 100K mid-year). The bottleneck was Instruction. The capability already existed in the weights; it just needed to be “woken up.” Developers used linguistic tricks to activate latent reasoning.

Core techniques

  • Role & format constraints · assign an identity (“You are a senior frontend expert”), enforce output structure (JSON, markdown tables).
  • Few-shot prompting · paste a few exemplars and let the model imitate.
  • Chain-of-Thought (CoT) · “let's think step by step,” the most famous magic spell of 2023, dramatically improving math and logic accuracy.
  • Self-Consistency & Tree of Thoughts · sample multiple reasoning paths and vote, or organize reasoning as a tree search (a minimal few-shot CoT + voting sketch follows this list).
  • ReAct · alternate between “thought” and “tool call.” The conceptual seed of every Agent that came later.
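
To ground the era's signature tricks, a minimal sketch combining few-shot prompting, CoT, and self-consistency voting. The `sample_completion` stub and the `Answer:` extraction format are assumptions for illustration; wire in any chat-completion API:

```python
import re
from collections import Counter

# Hypothetical stub: wire in any chat-completion API here.
def sample_completion(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError

FEW_SHOT = (
    "Q: A pen costs $2 and a notebook costs 3x as much. Total for both?\n"
    "A: Let's think step by step. The notebook costs 3 * 2 = 6. "
    "Total is 2 + 6 = 8. Answer: 8\n"
)

def self_consistent_answer(question: str, k: int = 5) -> str:
    """Sample k chain-of-thought paths at temperature > 0, majority-vote the answer."""
    prompt = f"{FEW_SHOT}\nQ: {question}\nA: Let's think step by step."
    answers = []
    for _ in range(k):
        text = sample_completion(prompt)             # each sample is one reasoning path
        match = re.search(r"Answer:\s*(\S+)", text)  # assumed answer format
        if match:
            answers.append(match.group(1))
    return Counter(answers).most_common(1)[0][0] if answers else ""
```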

Training-side undercurrent: RLHF became standard; Anthropic's Constitutional AI proposed an alternative alignment approach. As models became more obedient, the necessity of prompt engineering quietly diminished.

On the surface, this era is “surface-level linguistic optimization.” Underneath, it is an entire generation of developers learning, for the first time, how to collaborate with a probabilistic system.

Prompt Engineering's later “retreat” is widely misread. Prompts didn't disappear. They got compiled. In 2023, prompts were a hand-written interface language. In subsequent years, prompts became the system's intermediate representation (IR): system prompts are auto-assembled by frameworks, tool schemas are auto-generated prompts, RAG is dynamic prompt injection, agent planning is the LLM writing prompts to itself. Humans write prompts in fewer places, but prompts as token sequences are everywhere. They descended from interface to IR, the way C didn't disappear but became the compilation target of other languages.

Representative work: Wei et al.'s Chain-of-Thought paper (NeurIPS 2022, 9000+ citations); Yao et al.'s ReAct and Tree of Thoughts; Khattab et al.'s DSPy (ICLR 2024), which turned “writing prompts” from a human craft into a compiler task: the watershed moment for prompt-engineering culture moving toward systems engineering.
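
To make the “compiler” framing concrete, a minimal sketch in DSPy's string-signature style (setup details and field names vary across DSPy versions; the model name is a placeholder): the developer declares input/output behavior, and the framework assembles and optimizes the actual prompt.

```python
import dspy

# Placeholder backend; any LM that DSPy supports works here.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Declare behavior, not wording: "question -> answer" is the signature.
qa = dspy.ChainOfThought("question -> answer")

result = qa(question="A pen costs $2 and a notebook 3x as much. Total?")
print(result.reasoning)  # the compiled-in CoT trace (field name varies by version)
print(result.answer)
```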

§ 02 · 2024

How to feed an AI the right knowledge

Context Engineering

Models grew smarter; out-of-the-box instruction-following improved dramatically. Discussions began over whether “prompt engineering is a pseudo-discipline.” At the same time, Gemini 1.5 Pro pushed context to 1M tokens; Claude 3 followed. AI applications entered their first genuine enterprise era.

The bottleneck shifted from “instruction” to Knowledge. The best prompts in the world cannot help if the model has never seen your company's private data, financial reports, or codebase. But at a deeper level, the year's real question was never “does the model have the knowledge,” it was “under a finite token and attention budget, which information deserves to be seen?” On the surface Context Engineering feeds knowledge. At its core, it is resource scheduling.

Core techniques

  • RAG · vectorize enterprise documents, retrieve, inject. LangChain and LlamaIndex became the de facto frameworks.
  • Chunking & Embedding · how to split long documents, choose embedding models, design hybrid retrieval.
  • GraphRAG · by mid-2024, pure vector retrieval was failing at cross-document, multi-hop, global questions. Microsoft's GraphRAG ignited the year: extract entities and relations with an LLM, run community detection, generate hierarchical summaries.
  • Hybrid Retrieval · the industry accepted “no silver bullet.” BM25 + Vector + Graph + Rerank pipelines became the standard for any serious system.
  • RRF + Learned Reranker · by H2 2024, pure vector + BM25 was insufficient. The industry converged on a two-stage retrieval pattern: Reciprocal Rank Fusion (RRF) for candidate fusion, then a cross-encoder reranker (BGE-reranker, Voyage reranker) for fine-grained reordering (a minimal RRF sketch follows this list). This pipeline turned “which information deserves attention” from art into reproducible engineering.
  • Long Context vs. RAG · Gemini 1.5's million-token window sparked a foundational debate: “If we can stuff everything in, why retrieve?” The eventual answer: both, each in its place.
  • Lost in the Middle & Context Rot · attention to the middle of long contexts decays sharply, birthing “information architecture” techniques: layout, front-loading, strategic repetition.
  • Prompt Caching · caching high-frequency system context made “long system prompt + tool definitions + knowledge docs” economically viable. The infrastructure that made the Agent era possible.
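
RRF itself is small enough to show whole. A minimal sketch of the fusion stage (document IDs are illustrative; the constant k=60 comes from the original RRF paper, Cormack et al., 2009):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists from several retrievers (BM25, vector, graph) via RRF:
    score(d) = sum over rankers of 1 / (k + rank_of_d)."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse two retrievers' top lists, then hand the head of the fused
# list to a cross-encoder reranker for fine-grained reordering.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d3", "d9"]])
```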

Training-side undercurrent: model vendors began specifically training long-context capability (needle-in-a-haystack as standard eval); DPO and simpler alignment methods replaced complex RLHF pipelines; model iteration accelerated.

Every technique of the year (RAG, GraphRAG, hybrid retrieval, lost-in-the-middle, prompt caching, structured context) is, at bottom, the same problem from a different angle: how to do optimal information scheduling under a finite token and attention budget.

The year's deepest cognitive upgrade: “similar” doesn't mean “relevant”; “relevant” doesn't mean “reasonable to act on”; “reasonable” doesn't mean “worth the budget”.

A bridge to the next era: in November 2024, Anthropic released the Model Context Protocol (MCP), an open standard for LLMs to talk to external tools and data sources. At the time it looked like just another Context-era addition. In hindsight, it laid the foundation for 2025's Agentic explosion: MCP pulled “tool ecosystems” from fragmented function-calling toward a standardized protocol, paving the path for Agent deployment at scale.
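
What “standardized protocol” means in practice: a tool becomes a self-describing server that any MCP-capable agent can discover. A minimal sketch, assuming the official Python SDK's FastMCP interface (module paths and decorators may differ by version); the retrieval backend is a stub:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-search")  # server name shown to connecting clients

@mcp.tool()
def search_docs(query: str, top_k: int = 5) -> list[str]:
    """Search internal documents and return the top matches."""
    # Hypothetical retrieval backend; wire up your own index here.
    return [f"doc matching {query!r} #{i}" for i in range(top_k)]

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio to any MCP-capable agent
```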

§ 03 · 2025

How to let AI plan and execute multi-step tasks autonomously

Agentic Engineering

Late 2024 brought OpenAI o1 and the rise of “reasoning models”; early 2025 saw DeepSeek-R1 open-sourced, turning RL-based reasoning from a closed black box into reproducible technology. Claude's extended thinking and Gemini 2.0 Flash Thinking followed. Meanwhile, Claude Computer Use, OpenAI's Operator, and various coding agents (Devin, Cursor Agent, Cline) showed the industry, for the first time, AI that could actually do work. The MCP protocol, shipped a year earlier, arrived at exactly the right moment for agent deployment at scale.

The bottleneck shifted from “knowledge” to Autonomy. Single-turn Q&A was largely solved, but real-world tasks demand multi-step planning, tool calls, self-correction. The question stopped being “does the model know,” and became “can the model figure out what to do, and then actually do it, step by step.”

Core techniques

  • Reasoning Models · o1, o3, R1 internalized CoT from a user trick into a training objective. Models learned via RL to reason at length internally before producing output. The prompt-level “think step by step” was made obsolete.
  • Tool Use standardization · function calling (2023) → parallel tool calls (2024) → MCP protocol (late 2024–2025). MCP turned “every vendor builds their own” into “reusable tool ecosystems,” the USB standard for agents.
  • Multi-agent orchestration mainstreamed · CrewAI, AutoGen, LangGraph turned ad-hoc agent coordination into production code. LangGraph's stateful graph runtime, with persistent checkpoints, human-in-the-loop hooks, and visual debugging, became the de facto template.
  • Agentic Workflows · Planning, Tool Use, Reflection, Multi-Agent Collaboration crystallized as standard components.
  • Computer Use / GUI Agents · Claude Computer Use and successors let models see screens, click mice, type keys, breaking the “only API calls” boundary.
  • Coding Agents explosion · Cursor, Claude Code, Devin made “AI ships a full PR” an everyday occurrence. Software engineering was the first profession to feel itself being rewritten.

Training-side undercurrent: Agentic Training and Tool-use RL became the frontier. Models were taught, during training, how to use tools, plan, and recover from failure. This is the root reason agents could run reliably; the application layer's “orchestration tricks” were merely cashing in training-side dividends.

The essence: from “function call” to “persistent process”

The shift from “single generation” to “multi-step decision process” doesn't go far enough. The more accurate framing: the LLM stopped being a function call and became a persistent process. It acquired, for the first time, state, time, and an interface to the outside world. Traditional LLMs are stateless, single-turn, synchronous. Agents are stateful, long-trajectory, asynchronous, running a perception → planning → action → feedback closed loop.

The transition sounds elegant; the cost is brutal. Once you give a probabilistic system “time” and “state,” errors cascade, costs explode, observability collapses. AI gained, for the first time, the ability to autonomously advance a task under incomplete information, and exposed, for the first time, engineering risks that are uncontrollable, unpredictable, unauditable.
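
The stateful loop is easier to see in code than in prose. A minimal sketch of the perception → planning → action → feedback cycle; `call_model` and the tool registry are hypothetical stubs, not any specific framework's API:

```python
def call_model(messages: list[dict]) -> dict:
    """Planning step: returns {'final': True, 'content': ...}
    or {'final': False, 'tool': ..., 'args': {...}}."""
    raise NotImplementedError("wire up any LLM API here")

TOOLS = {"search": lambda query: f"results for {query!r}"}  # stub tool registry

def run_agent(goal: str, max_steps: int = 10) -> str:
    state = [{"role": "user", "content": goal}]              # persistent state
    for _ in range(max_steps):                               # bounded time
        decision = call_model(state)                         # planning
        if decision["final"]:
            return decision["content"]
        observation = TOOLS[decision["tool"]](**decision["args"])   # action
        state.append({"role": "assistant", "content": str(decision)})
        state.append({"role": "tool", "content": observation})      # feedback
    return "step budget exhausted"  # a guard the Harness era will formalize
```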

Five systematic failure modes

This year, agent failure modes converged into a manual-grade taxonomy:

  1. Error amplification · small per-step errors compound exponentially across multi-step decisions, dragging the trajectory in completely the wrong direction.
  2. Goal drift · the agent slowly diverges from its original goal, running ever further down a richly detailed but misdirected path.
  3. Tool misuse · picking the wrong tool at the right time, or passing wrong arguments to the right tool. Wastes context, wastes budget, poisons subsequent reasoning.
  4. Infinite / dead loops · the agent oscillates between two actions, never converging on a terminal state (a minimal detector is sketched after this list).
  5. Context poisoning · an early erroneous output gets written into context; subsequent reasoning compounds on the error, dragging the entire trajectory off course.
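
Most of these failure modes yield to deterministic checks long before they yield to better prompting. A toy sketch of one such check, an oscillation detector over the action trajectory (the `(tool, args)` encoding is an assumption for illustration):

```python
def detect_dead_loop(trajectory: list[tuple[str, str]], window: int = 4) -> bool:
    """trajectory entries are (tool_name, serialized_args) pairs.
    True when the recent window holds only 1-2 distinct actions,
    i.e. an A-A-A-A stall or an A-B-A-B oscillation."""
    recent = trajectory[-window:]
    return len(recent) == window and len(set(recent)) <= 2
```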

Which sets the stage for 2026's main battleground.

§ 04 · 2026

Deterministic Shell + Probabilistic Core

Harness Engineering

Agents could run, but not for long, not stably, not affordably. Contexts overflowed, costs spiraled, errors cascaded, audit trails vanished. The industry exited the “can we build agents” honeymoon and entered the “how do we ship agents to production” sober period. “Harness,” an old term from ML evaluation (lm-evaluation-harness), was given new meaning: the entire runtime framework wrapping the model, spanning context management, tool orchestration, safety guardrails, eval systems, observability, and rollback.

OpenClaw vs Claude Code · two fates of the same capability

In early 2026, Peter Steinberger's OpenClaw became the most concrete demonstration of why Harness Engineering matters. The same project, within a single quarter, exposed both the opportunity and the danger of the agent paradigm.

The opportunity side: Peter himself recounted an “accidental discovery.” He sent his agent a voice message. With no preprogramming, the agent autonomously: detected the file lacked an extension → read the file header to identify it as opus → converted it via ffmpeg to wav → noticed Whisper was missing → found an OpenAI API key in the environment → called the API directly via curl → returned the transcription. Peter's words: “How the f*** did you do that?” That kind of emergent tool composition is what makes the agent paradigm so attractive.

The danger side: after OpenClaw was handed to ordinary users, the same quarter produced a string of incidents: Meta's superintelligence head Summer had her entire inbox deleted by her agent (which then volunteered “sabotaging her career” as the rationale); a Klein NPM supply-chain attack (prompt injection in a GitHub issue title caused 4,000 developer machines to silently install OpenClaw); an Australian Commonwealth Bank A$1B mortgage fraud (AI-fabricated payslips); an Amazon production outage (an agent “rewrote” the prod environment from scratch); $90/day cost explosions.

FIG. 02 · CLAUDE CODE vs OPENCLAW
| Dimension | Claude Code | OpenClaw |
| --------- | ----------- | -------- |
| Release strategy | 4–5 mo internal + 3-layer safety | Open-sourced direct to public |
| Sandbox | Built-in open-source sandbox | None |
| Guardrails | Full Deterministic Shell | Almost none |
| Permission model | Per-tool, human-in-the-loop | Blanket authorization |
| Outcome | $2B ARR, 4% of all GitHub commits | Absorbed by OpenAI; founder poached |
| Major incidents | None public | Email deletion · NPM supply-chain · $1B AU loan fraud · Amazon outage |

Almost the entire difference traces back to investment in the Harness Engineering layer. In one year, Claude Code reached $2B ARR, captured 4% of public GitHub commits, and continues to accelerate. OpenClaw was absorbed by OpenAI, its founder poached, its product spinning out of control in users' hands.

The core architectural paradigm

The most important engineering consensus of 2026 reduces to one phrase:

Deterministic Shell + Probabilistic Core

AI systems are bifurcating into two layers: the upper layer is the non-deterministic LLM (reasoning, generation, judgment); the lower layer is deterministic software (state machines, guardrails, eval, logging, permissions). Harness Engineering, at its core, forces a probabilistic system into a deterministic engineering framework. This isn't weakening the model. It's acknowledging a humble truth: the only way to make a probabilistic system reliable in the real world is to wrap it in a deterministic shell.
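
A minimal sketch of what the shell looks like in code, assuming pydantic v2 for schema validation; `model_call`, the `RefundDecision` schema, and the approval limit are all hypothetical. The core proposes; the shell validates, bounds, logs, and only then mutates state:

```python
import logging
from pydantic import BaseModel, ValidationError

log = logging.getLogger("harness")

class RefundDecision(BaseModel):     # deterministic contract for a state mutation
    order_id: str
    approve: bool
    amount_cents: int

def model_call(prompt: str) -> str:  # probabilistic core: any LLM API
    raise NotImplementedError

def decide_refund(prompt: str, max_retries: int = 2) -> RefundDecision:
    """Deterministic shell: validate before any state mutation, bound retries, audit."""
    for attempt in range(max_retries + 1):
        raw = model_call(prompt)
        try:
            decision = RefundDecision.model_validate_json(raw)
        except ValidationError as err:
            log.warning("attempt %d rejected by schema: %s", attempt, err)  # audit trail
            continue
        if decision.amount_cents > 10_000:  # hard guardrail lives in code, not in the prompt
            raise PermissionError("amount exceeds autonomous approval limit")
        return decision
    raise RuntimeError("core never produced a schema-valid decision")
```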

Four clusters of concern

  • Context governance: Context Compaction, Subagent isolation, cross-session Memory, dynamic budget allocation (a toy budget-packing sketch follows this list).
  • Observability and evaluation: Eval Harness (LLM-as-a-Judge, trajectory-level eval, regression tests, adversarial red-teaming); Observability for Agents (tracing, cost attribution, failure clustering).
  • Constraints and permissions: Guardrails / Policy layers, agent permission models, Deterministic Scaffolding, Structured Output (schema validation before state mutation).
  • Cost and scheduling: Inference Optimization, Economic Harness, Model Hierarchy (planner / executor / verifier / guardrail), Multi-Agent Orchestration.
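
“Dynamic budget allocation” sounds abstract but is often a greedy knapsack. A toy sketch (the scoring fields are assumptions; real harnesses score by recency, relevance, and task phase):

```python
def pack_context(items: list[dict], budget_tokens: int) -> list[dict]:
    """Greedy knapsack over candidate context items.
    Each item: {'text': str, 'tokens': int, 'score': float} (fields assumed)."""
    ranked = sorted(items, key=lambda it: it["score"] / max(it["tokens"], 1), reverse=True)
    packed, used = [], 0
    for it in ranked:
        if used + it["tokens"] <= budget_tokens:
            packed.append(it)
            used += it["tokens"]
    return packed
```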

The 12 Agentic Harness Patterns

Bilgin Ibryam's 12 patterns, reverse-engineered from the Claude Code source leak, fall into four clusters mapping precisely to the four concerns above:

Memory & Context (5):

  • Persistent Instruction File: a root-level CLAUDE.md auto-loaded each session.
  • Scoped Context Assembly: multi-level loading (org / user / project / subdirectory).
  • Tiered Memory: compact index + topic files + on-disk archive.
  • Dream Consolidation: autoDream background process, 8 phases + 5 compaction types.
  • Progressive Context Compaction: HISTORY_SNIP → Microcompact → CONTEXT_COLLAPSE → Autocompact.

Workflow & Orchestration (3): Explore-Plan-Act Loop, Context-Isolated Subagents, Fork-Join Parallelism.

Tools & Permissions (3): Progressive Tool Expansion, Command Risk Classification, Single-Purpose Tool Design.

Automation (1): Deterministic Lifecycle Hooks. Claude Code exposes 26 publicly documented lifecycle hooks (PreToolUse, PostToolUse, SessionStart, CwdChanged, SubagentStart, PreCompact, …), executing entirely outside the prompt. The Shell is no longer “a guard at the door.” It is “a set of hooks nailed into every lifecycle point of the process.”
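
The hook names below are borrowed from the pattern description above; the dispatcher itself is a generic illustration, not Claude Code's implementation. The point is that every handler is plain deterministic code: no tokens, no sampling, no drift.

```python
from collections import defaultdict
from typing import Callable

HOOKS: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def on(event: str):
    """Register a handler for a lifecycle event; handlers run outside the prompt."""
    def register(fn: Callable[[dict], None]) -> Callable[[dict], None]:
        HOOKS[event].append(fn)
        return fn
    return register

def emit(event: str, payload: dict) -> None:
    for handler in HOOKS[event]:
        handler(payload)  # deterministic shell code executes at the lifecycle point

@on("PreToolUse")
def block_risky_commands(payload: dict) -> None:
    if payload.get("tool") == "bash" and "rm -rf" in payload.get("args", ""):
        raise PermissionError("blocked by deterministic policy, before the tool runs")
```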

The essence: a shift from “single model interaction” to “system-level software engineering.” The AI engineer's role looks more and more like a hybrid of SRE + security engineer + data engineer. The core insight: the model is just one component of the system; the framework around it determines whether it can actually deliver value.

§ 05

The main line across four years

The four layers are not replacements; they accumulate. Any serious AI system today must do all four well simultaneously:

  • Prompt didn't die. It became the system's IR. Humans no longer write it by hand, but it's everywhere.
  • Context didn't retreat. It evolved from “RAG one-trick pony” to “attention budget engineering.” In long-trajectory agents, its importance grew.
  • Agent didn't leave Context behind. It is Context in a higher form, dynamically deciding what to look at on every step.
  • Harness wraps the previous three. It puts them inside an observable, constrainable, rollback-capable shell.

Undercurrent one: parallel evolution on the training side. Behind this entire curve: RLHF made Prompt retreat, long-context training made Context viable, Reasoning RL made Agent possible, Agentic RL made Harness a manageable engineering problem. Application and model layers push each other forward; this is not a one-way dependency.

FIG. 03 · TRAINING-SIDE DIVIDENDS
| Application layer | Training-side dividend |
| ----------------- | ---------------------- |
| Prompt | RLHF / Constitutional AI |
| Context | Long-context pretraining + needle-in-a-haystack |
| Agent | Tool-use RL + Reasoning RL (o1 / R1) |
| Harness | Agentic RL + Safety RL (long-trajectory stability, error recovery) |

Undercurrent two: severely underappreciated latency. Every transition between stages is an order-of-magnitude jump: Prompt is milliseconds to seconds; Context is seconds; Agent is seconds to minutes; Harness (long-trajectory agents) is minutes to hours. Latency has become a second “budget” alongside tokens, and in interactive settings, it's the more rigid of the two.

§ 06

Coda · Where the curve is pointing

The next station is likely System-of-Agents Engineering, a qualitative jump from “how do we make one agent run reliably” to “how do we make a population of agents collaborate, price themselves, attribute responsibility, and be governed.”

Four future problems

  1. Communication and protocol layer. MCP solved “how an agent invokes a tool,” not “how an agent talks to another agent.” The next generation of protocols must handle negotiation and planning, trust and identity, and state synchronization.
  2. Responsibility attribution. When A delegates to B, B calls C, and C errs, who is responsible? A joint problem for engineering, contract law, insurance, and compliance.
  3. Economic systems. Tokens become currency: agent-to-agent interaction stops being “program calls” and becomes “economic behavior.” Internal markets and reputation systems will emerge.
  4. The collapse of explainability. A single agent is already hard to explain; multi-agent systems are virtually impossible. This will birth a new discipline: Agent Forensics.

When the future arrives early · Mythos and Glasswing

In April 2026, Anthropic's red team published Mythos Preview, pulling all four problems into the present tense. Mythos is an unreleased successor to Claude, run through a cybersecurity evaluation harness:

  • A 27-year-old OpenBSD TCP SACK bug was found. 1000 evaluation runs, each under $50, total under $20,000.
  • A 16-year-old FFmpeg H.264 codec vulnerability was found. FFmpeg is “one of the most thoroughly tested open-source projects in the world,” yet the bug was missed by every fuzzer for 16 years.
  • A 17-year-old FreeBSD NFS vulnerability was exploited fully autonomously. The previous-gen Opus 4.6 needed human guidance; Mythos needed none.
  • A 7-stage Linux kernel exploit chain was constructed. Cost under $2,000. Time under one day.
  • Working exploits found in every major browser. Mythos produced 181 working exploits on Firefox's JS engine vs Opus 4.6's 2.
“Mitigations like KASLR and stack canaries provide security through complexity rather than hard barriers; language models efficiently grind through these steps.”

The implication reaches far beyond cybersecurity. Many of the “defense mechanisms” the industry has accumulated over decades are, at their core, complexity walls that merely “cost attackers more time,” and LLMs are removing time from the attack-defense equation. Security through complexity is failing across the board.

Anthropic responded with Project Glasswing: no general availability, $100M of usage credits earmarked exclusively for defensive red-teaming. The signal is unambiguous: the stronger the capability, the heavier the shell; and when the shell is too heavy to mass-produce, the only remaining answer is “don't mass-produce it yet.” This is Harness Engineering taken to its extreme form: the constraint escalates from the “system layer” to the “release-policy layer.”

Closing

Not a march of progress. A tug-of-war. And the meaning lies in the fact that it never ends.

Three plain sentences for those still in the trenches:

  1. Never trust any “silver bullet.” Every declaration of “Prompt is dead,” “RAG is dead,” “Agents will solve everything” has been corrected by reality. No layer died; each merely retreated to its proper place. A good engineer doesn't chase the new; they know which technique belongs at which layer.
  2. Don't be afraid of “working down.” One of the most valuable skills in 2026 is rewriting problems that look like they need an LLM into a form traditional code can solve. The Deterministic Shell + Probabilistic Core architecture means “not using the LLM” is itself an architectural decision, and often the right one.
  3. Always remember: this is a tug-of-war, not a triumphal march. Every leap in model capability is not a finish line; it's a new starting line. New capability brings new loss of control; new loss of control calls for new restraint. The engineer's job is forever to balance “letting it do things” against “keeping it controllable,” and the balance point moves every year.

Prompt manages “I don't know how to say it.” Context manages “I don't know what to look at.” Agent manages “I don't know what to do next.” Harness manages “I don't know where the whole system is heading.” Each layer draws a finer boundary around a different face of not knowing.

May this curve, in the end, lead to deeper humility. In a probabilistic world, there remains a sovereign God who holds every uncertainty. Soli Deo Gloria.