
Agent Harness: The Core Infrastructure of AI Engineering in 2026

When models are no longer the bottleneck, Harness becomes the decisive battlefield. A comprehensive analysis of the Agent Harness engineering paradigm shift from Anthropic, OpenAI, LangChain, and Martin Fowler.

"Agent = Model + Harness" — The defining formula of the AI engineering circle in 2026

In the spring of 2026, the AI engineering community was ignited by a seemingly obscure term.

"Agent Harness."

OpenAI used it to build a product with a million lines of code, zero of which were handwritten by humans. Anthropic's Claude Code used it to achieve 100% self-hosting development. LangChain released Deep Agents, making Harness a first-class citizen. Martin Fowler wrote a lengthy article specifically defining this emerging engineering field.

This is not another round of framework hype. It is an engineering paradigm shift in progress: once model capabilities cross a threshold, what determines an Agent's success is no longer the model itself, but the "Harness" wrapped around it.


I. What is an Agent Harness?

1.1 One-Sentence Definition

An Agent Harness is the software infrastructure layer wrapped around an LLM, responsible for everything except the model itself.

Simon Willison's classic definition:

"Agent = LLM running tools in a loop"

And the Harness is the complete engineering system that makes this loop run.
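
A minimal sketch of that loop, under stated assumptions: `fake_model` is a stand-in for a real LLM call, and `TOOLS`, `run_agent`, and the action dictionary shape are hypothetical names chosen for illustration, not any vendor's API.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}

def fake_model(history: list[dict]) -> dict:
    """Stand-in for an LLM call: requests one tool, then gives a final answer."""
    last = history[-1]
    if last["role"] == "user":
        return {"type": "tool_call", "tool": "add", "arg": "2+3"}
    return {"type": "final", "text": f"The answer is {last['content']}."}

def run_agent(task: str, model=fake_model, max_steps: int = 5) -> str:
    """Harness side of the loop: call the model, run tools, feed results back."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "final":
            return action["text"]
        tool = TOOLS.get(action["tool"])
        result = tool(action["arg"]) if tool else f"unknown tool: {action['tool']}"
        history.append({"role": "tool", "content": result})
    return "stopped: step budget exhausted"
```

Everything except `fake_model` belongs to the harness: the history, the tool lookup, and the step budget that keeps the loop from running forever.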

1.2 Core Formula

Agent = Model + Harness
| Component | Responsibility | Analogy |
|---|---|---|
| Model | Reasoning, text generation | Brain |
| Harness | Tools, memory, environment, verification, orchestration | Body, senses, nervous system |

LangChain's Vivek Trivedy first systematically articulated this formula in March 2026, which was then rapidly adopted by OpenAI, Martin Fowler, and academia.

1.3 Why is it trending now?

In 2025-2026, the AI industry discovered a counterintuitive fact:

| Discovery | Impact |
|---|---|
| The same model with different harnesses can have performance gaps of up to 6x | Model is not the only bottleneck |
| OpenAI, Anthropic, Vercel published "We removed 80% of our Agent tools" | More tools ≠ better results |
| Claude Code, Codex and other products rely on harness structure | Harness becomes core competency |

Core insight: When model capabilities reach a threshold, engineering encapsulation (Harness) determines agent reliability more than incremental model improvements.


II. What Problems Does Harness Solve?

2.1 Native Limitations of Models

A pure LLM can only do one thing: receive text input and output text.

It cannot:

  • ❌ Remember information across sessions
  • ❌ Execute code or call APIs
  • ❌ Access external file systems
  • ❌ Self-verify and self-correct
  • ❌ Decompose complex tasks and track progress

2.2 Four Compensations by Harness

| Limitation | Harness Solution |
|---|---|
| No memory | Persistent storage, context compression, long-term memory systems |
| No action capability | Tool calling (MCP/A2A), code execution sandbox, browser control |
| No planning capability | Task decomposition, progress tracking, sub-agent orchestration |
| No verification capability | Automated testing, static analysis, self-correction loops |

2.3 A Concrete Example

Suppose you want an Agent to develop a web application:

Model without Harness:

User: Help me write a clone of claude.ai
Model: (generates thousands of lines of code)
User: Where does the code run?
Model: I cannot execute code, I can only generate text...

Agent with Harness:

User: Help me write a clone of claude.ai
Harness:
  1. Initialize project directory and git repository
  2. Create feature_list.json (200+ feature points)
  3. Generate base architecture code
  4. Start dev server for verification
  5. Record progress to claude-progress.txt
  6. Incrementally develop one feature per session
  7. Automated testing verification
  8. Submit PR

III. ETCLOVG: The Seven-Layer Harness Architecture

In April 2026, a survey paper Agent Harness Engineering: A Survey proposed the most systematic Harness engineering framework to date.

┌─────────────────────────────────────────────────────────────┐
│                    ETCLOVG Seven-Layer Architecture          │
├─────────────────────────────────────────────────────────────┤
│  G │ Governance                                              │
│    │ Permission models, auditing, declarative constraints    │
├────┼─────────────────────────────────────────────────────────┤
│  V │ Verification & Evaluation                               │
│    │ Benchmarking, fault attribution, regression feedback    │
├────┼─────────────────────────────────────────────────────────┤
│  O │ Observability & Operations                              │
│    │ Trace tracking, cost monitoring, reliability signals    │
├────┼─────────────────────────────────────────────────────────┤
│  L │ Lifecycle & Orchestration                               │
│    │ Single-agent loops, multi-agent orchestration           │
├────┼─────────────────────────────────────────────────────────┤
│  C │ Context & Memory Management                             │
│    │ Short-term chat, long-term memory, drift prevention     │
├────┼─────────────────────────────────────────────────────────┤
│  T │ Tool Interface & Protocol                               │
│    │ MCP, A2A protocols, tool discovery, session management  │
├────┼─────────────────────────────────────────────────────────┤
│  E │ Execution Environment & Sandbox                         │
│    │ Containers, microVMs, browser sandboxes, OS permissions │
└────┴─────────────────────────────────────────────────────────┘

3.1 Layer Details

E - Execution Environment

  • Determines where Agent code runs
  • Sandbox isolation, permission control, resource limits
  • From Docker containers to microVMs to browser environments
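
As a rough illustration of the weakest form of isolation, the sketch below runs a snippet in a child interpreter with a time limit, an output cap, and a scratch working directory. The function name `run_untrusted` and its return shape are assumptions for this example, and a child process is not a security boundary; a real harness would put a container or microVM around it.

```python
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout_s: float = 5.0, max_output: int = 10_000) -> dict:
    """Run a snippet in a child interpreter with a time limit and a scratch cwd.

    NOT a security boundary: this only limits runtime and captured output.
    """
    with tempfile.TemporaryDirectory() as scratch:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=scratch,
                capture_output=True,
                text=True,
                timeout=timeout_s,                   # child is killed on timeout
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "timed_out": True, "stdout": "", "stderr": ""}
    return {
        "ok": proc.returncode == 0,
        "timed_out": False,
        "stdout": proc.stdout[:max_output],
        "stderr": proc.stderr[:max_output],
    }
```

The structured result lets the rest of the harness treat a timeout or a crash as ordinary data to feed back to the model.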

T - Tool Interface

  • MCP (Model Context Protocol): Standardized tool description and calling
  • A2A (Agent-to-Agent Protocol): Inter-agent communication
  • Tool discovery, selection, execution, result feedback
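
The discovery, execution, and result-feedback steps can be sketched with a small in-process registry. This is not the MCP or A2A wire format; `ToolSpec`, `ToolRegistry`, and the `word_count` tool are hypothetical names used only to show the shape of the idea.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ToolSpec:
    name: str
    description: str
    required: tuple[str, ...]      # names of required arguments
    func: Callable[..., Any]

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def discover(self) -> str:
        """Discovery: a JSON catalogue the harness can place in the prompt."""
        return json.dumps([
            {"name": t.name, "description": t.description, "required": list(t.required)}
            for t in self._tools.values()
        ])

    def call(self, name: str, args: dict[str, Any]) -> dict[str, Any]:
        """Execution and feedback: errors come back as data instead of being raised."""
        spec = self._tools.get(name)
        if spec is None:
            return {"ok": False, "error": f"unknown tool '{name}'"}
        missing = [p for p in spec.required if p not in args]
        if missing:
            return {"ok": False, "error": f"missing arguments: {missing}"}
        try:
            return {"ok": True, "result": spec.func(**args)}
        except Exception as exc:   # feed the failure back to the model
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}

registry = ToolRegistry()
registry.register(ToolSpec("word_count", "Count words in a text.", ("text",),
                           lambda text: len(text.split())))
```

Returning errors as values is a deliberate choice here: the model sees what went wrong and can pick another tool or repair its arguments on the next turn.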

C - Context Management

  • Short-term: Current session conversation history
  • Medium-term: Cross-session memory summaries
  • Long-term: Knowledge bases, vector retrieval, persistent state
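
One way to picture the hand-off from short-term history to a medium-term summary is the sketch below. The function `compact_history` and its truncation-based summarizer are assumptions for illustration; a production harness would more likely ask a model to write the summary.

```python
def compact_history(messages: list[str], keep_recent: int = 3,
                    max_summary_chars: int = 80) -> list[str]:
    """Keep the newest messages verbatim; fold older ones into one summary line."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Placeholder summarizer: the first 20 characters of each older message.
    summary = " / ".join(m[:20] for m in older)[:max_summary_chars]
    return [f"[summary of {len(older)} earlier messages] {summary}"] + recent
```

The truncation makes the information loss explicit: whatever does not fit in the summary line is gone unless it was also written to a durable store.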

L - Lifecycle Orchestration

  • Single agent: ReAct loops, Plan-and-Execute
  • Multi-agent: Parent agent delegating to child agents
  • Task decomposition, progress tracking, failure retry
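
Progress tracking and failure retry can be sketched as a small plan runner. `Task`, `run_plan`, and the status strings are hypothetical; the sketch assumes subtasks run in order and that a permanently failed subtask should halt the plan.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    action: Callable[[], str]
    status: str = "pending"   # pending -> done | failed
    attempts: int = 0
    output: str = ""

def run_plan(tasks: list[Task], max_attempts: int = 3) -> list[Task]:
    """Run subtasks in order, retrying each one up to max_attempts times."""
    for task in tasks:
        while task.attempts < max_attempts:
            task.attempts += 1
            try:
                task.output = task.action()
                task.status = "done"
                break
            except Exception as exc:
                task.output = f"{type(exc).__name__}: {exc}"
                task.status = "failed"
        if task.status == "failed":
            break   # stop here: later subtasks may depend on this one
    return tasks
```

Because each `Task` carries its own status and attempt count, the list doubles as a progress record that could be persisted between sessions.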

O - Observability

  • Trace recording: Input/output of each step
  • Cost tracking: Token consumption, API call expenses
  • Performance monitoring: Latency, success rate, error rate
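
The three signals above can be collected by a thin wrapper around each step. In this sketch `Tracer`, `Span`, the whitespace-based token count, and the flat per-token price are all simplifying assumptions rather than how any particular tracing product works.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    step: str
    input: str
    output: str
    tokens: int
    latency_s: float
    ok: bool

@dataclass
class Tracer:
    price_per_1k_tokens: float                  # assumed flat price, for illustration
    spans: list[Span] = field(default_factory=list)

    def record(self, step, fn, text, count_tokens=lambda s: len(s.split())):
        """Run fn(text), recording input, output, token count, latency, success."""
        start = time.perf_counter()
        try:
            out, ok = fn(text), True
        except Exception as exc:
            out, ok = f"{type(exc).__name__}: {exc}", False
        latency = time.perf_counter() - start
        tokens = count_tokens(text) + count_tokens(out)
        self.spans.append(Span(step, text, out, tokens, latency, ok))
        return out

    def summary(self) -> dict:
        total = sum(s.tokens for s in self.spans)
        n = len(self.spans)
        return {
            "steps": n,
            "tokens": total,
            "cost": total / 1000 * self.price_per_1k_tokens,
            "success_rate": (sum(s.ok for s in self.spans) / n) if n else 0.0,
        }
```

Wrapping every model or tool call in `record` gives one span per step, which is the raw material for both cost dashboards and later failure diagnosis.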

V - Verification & Evaluation

  • Benchmarks: SWE-bench, Terminal-bench, etc.
  • Regression testing: Ensuring changes don't break existing functionality
  • Human evaluation: Human judgment for complex tasks

G - Governance & Security

  • Permission control: What resources can the Agent access
  • Audit logs: All operations traceable
  • Security policies: Preventing prompt injection, unauthorized access
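
Permission control and audit logging can be combined in a single checkpoint that every tool call passes through. The `Policy` class below is a hypothetical sketch: it assumes POSIX-style paths and a single writable root, and it does not resolve symlinks.

```python
import posixpath
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Policy:
    allowed_tools: set
    writable_root: str                       # only directory the agent may touch
    audit_log: list = field(default_factory=list)

    def authorize(self, tool: str, path: Optional[str] = None) -> bool:
        """Allow the call only if the tool is whitelisted and the path stays in root."""
        allowed = tool in self.allowed_tools
        if allowed and path is not None:
            root = posixpath.normpath(self.writable_root)
            # normpath collapses "..", so traversal attempts resolve outside root.
            target = posixpath.normpath(posixpath.join(root, path))
            allowed = target == root or target.startswith(root + "/")
        # Every decision is logged, including denials.
        self.audit_log.append({"tool": tool, "path": path, "allowed": allowed})
        return allowed
```

For example, with `Policy({"write_file"}, "/workspace")`, a write to `src/app.py` is allowed, while `../etc/passwd` and any call to an unlisted tool are denied, and all three decisions land in `audit_log`.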

IV. Industry Giants' Harness Practices

4.1 Anthropic: Claude Code's Harness Design

In November 2025, Anthropic published Effective Harnesses for Long-Running Agents, sharing Claude Code's core harness design.

Core Problem: Long-running agents "lose memory" across sessions

Solution: Initializer + Coding Agent Dual Mode

First Session: Initializer Agent
  ├── Create project directory structure
  ├── Initialize git repository
  ├── Write feature_list.json (200+ features, all marked failing)
  ├── Create claude-progress.txt (progress log)
  ├── Write startup script (init.sh)
  └── Submit initial commit

Subsequent Sessions: Coding Agent
  ├── Read claude-progress.txt to understand progress
  ├── Select highest priority failing feature from feature_list
  ├── Implement the feature
  ├── Run automated tests for verification
  ├── Mark feature as passing
  ├── Update claude-progress.txt
  └── Submit commit

Key Insights:

  • Initializer solves the "trying to do too much at once" problem
  • Coding Agent solves the "prematurely declaring victory" problem
  • claude-progress.txt is the bridge for cross-session memory
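
The two-mode pattern can be sketched with nothing more than files on disk. The file names `feature_list.json` and `claude-progress.txt` come from the description above; the JSON schema, the function names, and the `implement` callback (standing in for "write the feature and run its tests") are assumptions made for this example, and list order is treated as priority order.

```python
import json
import tempfile
from pathlib import Path

def initialize(workdir: Path, features: list[str]) -> None:
    """First session: record every feature as failing and start the progress log."""
    workdir.mkdir(parents=True, exist_ok=True)
    feature_list = [{"name": f, "passes": False} for f in features]
    (workdir / "feature_list.json").write_text(json.dumps(feature_list, indent=2))
    (workdir / "claude-progress.txt").write_text("session 0: project initialized\n")

def coding_session(workdir: Path, implement):
    """Later session: rebuild state from disk, attempt one feature, record the result."""
    path = workdir / "feature_list.json"
    features = json.loads(path.read_text())
    todo = next((f for f in features if not f["passes"]), None)
    if todo is None:
        return None                       # every feature is passing
    passed = implement(todo["name"])      # hypothetical: implement + run tests
    todo["passes"] = bool(passed)
    path.write_text(json.dumps(features, indent=2))
    with (workdir / "claude-progress.txt").open("a") as log:
        log.write(f"{todo['name']}: {'passing' if passed else 'still failing'}\n")
    return todo["name"]

def demo() -> list[str]:
    """Run sessions until nothing is left, then return the progress log lines."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "project"
        initialize(workdir, ["login", "chat"])
        while coding_session(workdir, implement=lambda name: True):
            pass
        return (workdir / "claude-progress.txt").read_text().splitlines()
```

Each call to `coding_session` starts with no in-memory state, which mirrors a fresh context window: everything it knows comes from the two files.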

4.2 OpenAI: Extreme Harness Engineering Practice

In February 2026, OpenAI published Harness engineering: leveraging Codex in an agent-first world, describing an extreme experiment:

A 3-person team, 5 months, building a million-line-code product from scratch, zero lines of human-written code.

Constraints:

  • No manual code writing allowed
  • All code must be generated by Codex Agent
  • Humans only: write requirements, review results, adjust Harness

Results:

  • Development speed improved 10x
  • Code coverage reached production standards
  • CI/CD, documentation, monitoring all generated by Agent

Core Methodology:

Human: Write requirement description (declarative)
   ↓
Harness: Decompose tasks, generate code, run tests
   ↓
Human: Review results, adjust requirements
   ↓
Harness: Iterate and optimize

4.3 LangChain: Deep Agents' Batteries-Included Harness

In March 2026, LangChain released Deep Agents, positioned as a "batteries-included agent harness".

Design Philosophy:

  • Out-of-the-box: Planning, context management, delegation all built-in
  • Model-agnostic: Support any model provider
  • Native integration: Seamless with LangSmith tracing and deployment

Core Architecture:

# Deep Agent's Harness structure
parent_agent
  ├── planning_model      # Planning task decomposition
  ├── sub_agent_registry  # Child agent registry
  ├── memory_files        # Persistent memory files
  ├── skills              # Reusable skill modules
  └── human_in_the_loop   # Human intervention points

4.4 Martin Fowler: Theoretical Framework for Harness Engineering

In April 2026, Martin Fowler defined Harness Engineering from a software engineering perspective.

Core Model: Feedforward + Feedback

        ┌─────────────┐
        │   Guides    │  ← Feedforward
        │  (Human)    │     Anticipate agent behavior, guide in advance
        └──────┬──────┘
               ↓
        ┌─────────────┐
        │ Coding Agent│
        │   (LLM)     │
        └──────┬──────┘
               ↓
        ┌─────────────┐
        │   Sensors   │  ← Feedback
        │  (Auto)     │     Observe agent output, help self-correct
        └─────────────┘

Three Types of Harness:

| Type | Goal | Examples |
|---|---|---|
| Maintainability Harness | Code maintainability | Code duplication, cyclomatic complexity, test coverage |
| Architecture Fitness Harness | Architecture compliance | Module boundary checks, performance tests, architecture rules |
| Behaviour Harness | Functional correctness | Functional test suites, end-to-end verification, manual testing |

Two Execution Types:

| Type | Characteristics | Execution |
|---|---|---|
| Computational | Deterministic, fast, reliable | CPU (tests, linters, type checkers) |
| Inferential | Semantic analysis, non-deterministic | GPU/NPU (AI code review, semantic verification) |
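
A computational sensor in this sense can be very small. The sketch below uses Python's standard `ast` module to run deterministic checks on generated code; the function name `computational_sensor` and the two specific rules (function length and missing docstrings) are arbitrary choices for illustration, not rules taken from Fowler's article.

```python
import ast

def computational_sensor(source: str, max_function_lines: int = 20) -> list[str]:
    """Deterministic checks whose findings can be fed back to the agent."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > max_function_lines:
                findings.append(
                    f"{node.name}: {length} lines (limit {max_function_lines})")
            if ast.get_docstring(node) is None:
                findings.append(f"{node.name}: missing docstring")
    return findings
```

An empty list means the output passed; otherwise the findings are plain strings that can be appended to the agent's next prompt, closing the feedback half of the loop.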

V. Key Data in Harness Engineering

5.1 Performance Gap Experiments

Core findings from the paper How Much Heavy Lifting Can an Agent Harness Do?:

In a noisy Collaborative Battleship game, using the same LLM:

| Harness Layer | Win Rate Improvement | LLM Calls |
|---|---|---|
| Belief-only | Baseline | 0 |
| + Declarative Planning | +24.1% | 0 |
| + Symbolic Reflection | ±0.140 F1 | Few |
| + LLM Revision Gate | Only 4.3% of turns trigger | Minimal |

Conclusion: The declarative planning layer alone delivers the largest gain, and it requires no LLM calls.

5.2 Industry Comparison

| Product/System | Harness Characteristics | Key Metrics |
|---|---|---|
| Claude Code | Self-hosting, Initializer + Coding Agent | 100% code generated by itself |
| Codex (OpenAI) | Agent-first, zero human code constraint | 10x development speed improvement |
| Deep Agents | Batteries-included, model-agnostic | Out-of-the-box usability |
| Manus | General computer use, browser + file system | End-to-end task completion |
| OpenHands | Open-source software engineering agent | SWE-bench leading |

VI. Harness vs Orchestrator

These concepts are frequently confused:

| Concept | Responsibility | Analogy |
|---|---|---|
| LLM | Reasoning, text generation | Brain |
| Harness | Tools, memory, environment, verification | Body and senses |
| Orchestrator | Control flow, decision logic, retry strategies | Nervous system |
| Framework | Development toolkit, abstraction interfaces | Toolbox |

Relationship:

  • Orchestrator decides "what to do"
  • Harness provides the capability for "how to do it"

Example:

Orchestrator: "This task requires calling a search tool, then analyzing results"
Harness: Provides search tool implementation, manages API keys, handles timeouts, formats results
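
The split in this example can be sketched as two pieces of code. Everything here is hypothetical: `SearchHarness`, the `backend` callable (assumed to take a query, an API key, and a timeout), the `SEARCH_API_KEY` environment variable, and the trivial "analysis" in `orchestrate` exist only to show who owns which concern.

```python
import os

class SearchHarness:
    """Harness side: owns credentials, timeout handling, and result formatting."""

    def __init__(self, backend, api_key_env: str = "SEARCH_API_KEY"):
        self._backend = backend          # hypothetical callable(query, api_key, timeout_s)
        self._api_key_env = api_key_env

    def search(self, query: str, timeout_s: float = 5.0) -> str:
        api_key = os.environ.get(self._api_key_env, "")
        try:
            hits = self._backend(query, api_key, timeout_s)
        except TimeoutError:
            return "search timed out; try a narrower query"
        return "\n".join(f"{i}. {h}" for i, h in enumerate(hits, 1)) or "no results"

def orchestrate(task: str, harness: SearchHarness) -> str:
    """Orchestrator side: decides what happens and in which order."""
    results = harness.search(task)                    # step 1: call the search tool
    count = sum(1 for line in results.splitlines() if line[:1].isdigit())
    return f"{count} result(s) found for '{task}'"    # step 2: (trivial) analysis
```

The orchestrator never sees the API key or the timeout exception; swapping the search provider or the retry policy would change only the harness class.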

VII. 5 Major Open Challenges in Harness Engineering

7.1 Execution Environment Hardening and Scaling

  • How to balance security (sandbox isolation) with functionality (file system access)?
  • Selection model between containers, microVMs, and browser environments

7.2 Reliable State in Long-Running Agents

  • How to quantify information loss during context compression?
  • How to recover from durable artifacts rather than compressed history?

7.3 Trace-Native Failure Diagnosis

  • Make tracing a first-class citizen of the system, not just a post-hoc analysis step
  • Automatically attribute faults from trajectories, generate regression tests

7.4 Standard Handoffs Across Agents, Tools, and Humans

  • Handoff content: intent, constraints, permissions, artifacts, budget, risk level...
  • Rich enough to ensure safety and recovery, yet simple enough for wide adoption
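
To make the trade-off concrete, here is one possible shape for such a handoff record. No standard exists yet, so the `Handoff` class, its field types, and the three-value risk scale are assumptions; only the field list itself (intent, constraints, permissions, artifacts, budget, risk level) comes from the text above.

```python
import json
from dataclasses import dataclass, field, asdict

RISK_LEVELS = ("low", "medium", "high")   # assumed scale, not a standard

@dataclass
class Handoff:
    intent: str
    constraints: list[str] = field(default_factory=list)
    permissions: list[str] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)   # e.g. file paths, commit ids
    budget_tokens: int = 0
    risk_level: str = "low"

    def __post_init__(self) -> None:
        # Validate on construction so a malformed handoff fails at the boundary.
        if not self.intent.strip():
            raise ValueError("intent must not be empty")
        if self.risk_level not in RISK_LEVELS:
            raise ValueError(f"risk_level must be one of {RISK_LEVELS}")
        if self.budget_tokens < 0:
            raise ValueError("budget_tokens must be non-negative")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Handoff":
        return cls(**json.loads(payload))
```

A flat, JSON-serializable record like this stays simple enough to pass between agents, tools, and human reviewers, while the validation step catches missing intent or an unknown risk level before any work starts.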

7.5 Adaptive Simplification as Models Improve

  • As models get stronger, some Harness interventions may become burdens
  • Need mechanisms to automatically ablate, optimize, and simplify Harness layers

VIII. Implications for Developers

8.1 Skill Shift

| From | To |
|---|---|
| Writing code | Designing Harness |
| Debugging programs | Debugging agent behavior |
| Unit testing | Harness verification strategies |
| Code review | Agent output review |

8.2 Interview Advantage

If you're interviewing for AI-related positions, understanding Harness allows you to answer like this:

"I understand Agent Harness as the engineering infrastructure layer wrapped around an LLM. In content moderation scenarios, Harness manages the context of moderation rules (C layer), calls classification model tools (T layer), orchestrates multi-round moderation workflows (L layer), and monitors model effectiveness metrics (V layer). Optimizing Harness structure may improve moderation accuracy more than switching models."

8.3 Learning Path

1. First, get hands-on experience with agent programming using Claude Code or Codex
2. Read LangChain's The Anatomy of an Agent Harness
3. Study engineering blogs from OpenAI and Anthropic
4. Try building a simple Harness yourself (tool calling + memory)
5. Focus on the ETCLOVG seven-layer model for systematic understanding

IX. Final Thoughts

In 2026, AI engineering is undergoing a silent revolution.

A year ago, everyone was competing over who had more model parameters and larger training datasets. Today, the industry consensus has shifted: models are fuel, Harness is the engine.

OpenAI achieved 10x development speed with Harness. Anthropic enabled Claude Code's self-hosting development with Harness. LangChain made Harness reusable infrastructure. Martin Fowler defined it as a new branch of software engineering.

For ordinary developers, this means:

  • You don't need to train models, but you need to understand how to harness them
  • Prompt engineering is just the starting point; Harness engineering is the destination
  • Agent = Model + Harness; neither can be omitted

In 2026, Harness Engineering is becoming the core infrastructure of AI engineering. Now is the right time to understand it.


References

  1. OpenAI. Harness engineering: leveraging Codex in an agent-first world (2026-02)
  2. Anthropic. Effective harnesses for long-running agents (2025-11)
  3. Anthropic. Scaling Managed Agents: Decoupling the brain from the hands (2026-04)
  4. LangChain. The Anatomy of an Agent Harness (2026-03)
  5. Martin Fowler. Harness engineering for coding agent users (2026-04)
  6. Survey Paper. Agent Harness Engineering: A Survey (OpenReview 2026)
  7. Core Paper. How Much Heavy Lifting Can an Agent Harness Do? (arXiv 2026)
  8. GitHub Repo. github.com/Gloriaameng/Awesome-Agent-Harness

This article is continuously updated. If you have good practices or discoveries, feel free to share.