How to Build an AI Agent Harness in 2026

Emily Winks, Data Governance Expert
Updated: 06/04/2026 | Published: 04/13/2026
22 min read

Key takeaways

  • Harness quality determines agent reliability — LangChain gained 13.7 benchmark points with zero model changes.
  • Step 0 (data layer certification) is the most-skipped step — uncertified tables cause confident wrong answers.
  • Start with a small tool set and a 20-line ReAct loop. Add infrastructure only after observing real failure modes.

What does it take to build an AI agent harness in 2026?

An AI agent harness is everything around the model that makes it reliable: the loop, the tools, the state, the permissions, the evals, and the certified data underneath. Across 522 enterprise queries, Atlan's research found that governed metadata lifts AI SQL accuracy by 38% overall and by 2.15x on medium-complexity queries. Building a harness takes 4 to 12 weeks and 5,000 to 20,000 lines of infrastructure code. This guide walks the 10-step sequence, plus the Step 0 it rests on, with code and a "done" test for each step.

Steps to build an AI agent harness:

  • Step 0: Certify the data layer — only certified assets reach the agent
  • Step 1: Scope the agent and make the build-or-buy call
  • Step 2: Build the core ReAct loop with a hard exit
  • Step 3: Author your AGENTS.md
  • Step 4: Layer the system prompt
  • Step 5: Classify permissions with lineage
  • Step 6: Manage context and wire the glossary handshake
  • Step 7: Build the persistence layer
  • Step 8: Deploy observability and verification
  • Step 9: Add guardrails and human checkpoints
  • Step 10: Evaluate the harness on its own

Atlan’s Context Engineering Studio is the IDE where AI builders design context repositories, test agent behavior against the Enterprise Data Graph, and ship governed context to production, so the data layer the harness depends on is certified before the loop runs.

One engineer ran the same coding task five times against the same model. The first run, against an empty repo, succeeded 20% of the time. After adding an AGENTS.md file, the success rate climbed to 60%. Verification commands pushed it to 80%. A progress file took it to near 100%. The model never changed once. Ramya Chinnadurai documented this stage-by-stage climb on X in May 2026, and the closing line lands hard: do not swap the model, add one more piece of the harness.

That curve is the whole argument for reading this guide. You are past comparing frameworks, and you can already write a reasoning loop. What you need now is the build order for everything else, with a clear test for when each piece is done.

Before the sequence, a quick frame so the steps make sense.

Understanding AI agent harness

According to Martin Fowler’s harness engineering framework, an agent is the combination of a model and a harness, and only the model reasons. The harness handles everything else. It’s the code, configuration, and execution logic that surrounds a language model.

The model cannot maintain persistent memory across sessions. It cannot call an external API with guaranteed retry logic. It cannot validate its own outputs against a schema, enforce a permission policy, or manage the state of a long-running task. These are harness responsibilities, and when they fail, the model keeps generating output regardless.

Three facts shape every decision you make while building a harness.

  • Frontier models now perform similarly on standard benchmarks, so the harness is where reliability is won or lost.
  • Harness changes outperform model changes.
  • Most failures trace back to the data the harness feeds to the agent, not to the orchestration code.

If you want to compare orchestration frameworks before you build, read the best AI agent harness tools breakdown of LangGraph, CrewAI, AutoGen, Mastra, and the rest. This guide stays framework-agnostic. The pattern works regardless of which runtime you pick.

Prerequisites to building an AI agent harness

You need four things in place before you start building a harness.

  • Access to a frontier model and its API. The choice matters less than people assume, because the harness carries most of the reliability. Pick one you can call reliably and move on.
  • A deferred framework decision. You do not need to commit to LangGraph or CrewAI yet. Author the build sequence first, then bind it to a runtime once the shape is clear.
  • A repository with conventions that an agent can read. Birgitta Böckeler of Thoughtworks frames the goal well in her InfoQ talk: you want to curate the information your agent sees to get better results. A repo built only for humans is a repo your agent cannot navigate.
  • An expectation about evals. You will not bolt them on at the end. They get designed alongside the build, which is why Step 10 derives them from real traces and gates releases on them.

How to build an AI agent harness in 2026

Building an AI agent harness is a 10-step process that rests on a Step 0. It starts with certifying the data layer and ends with proving that the harness works reliably. A reliable build also accounts for where build sequences break, which the section after Step 10 covers.

The substrate-first build: this guide opens at Step 0, the data layer, because that is where most harnesses fail, not at the reasoning loop.

10-Step AI Agent Harness Build Sequence

Step | What you build | Group
Step 0 | Certify the data layer: certified assets only, single metric definitions, live lineage signals | Data layer
Step 1 | Scope the agent and make the build-or-buy call: one workflow, decision written down | Configuration
Step 2 | Build the core ReAct loop: reason, act, observe, hard exit | Configuration
Step 3 | Author AGENTS.md: conventions, sensitive paths, verify command | Configuration
Step 4 | Layer the system prompt: persona, scope, format, thin layer | Configuration
Step 5 | Classify permissions with lineage: certified reads, approval writes, block destructive | Infrastructure / Safety
Step 6 | Manage context and wire the glossary handshake: curate per step, governed term resolution | Infrastructure / Safety
Step 7 | Build the persistence layer: append-only, event-sourced, replayable | Configuration
Step 8 | Deploy observability and verification: trace from log, cost/latency, pre-completion check | Configuration
Step 9 | Add guardrails and human checkpoints: input/output validation, human sign-off | Infrastructure / Safety
Step 10 | Evaluate the harness on its own: real-trace evals, regression gate, independent scorer | Infrastructure / Safety

Steps 1–2 form the minimum viable harness (MVH). Steps 3–10 are the path from MVH to production, with Step 0 as the data-layer foundation every other step depends on.

Step 0: Certify the data layer

A harness without a certified data layer is a hallucination machine with logging. The orchestration can be flawless while the agent reads a table that was renamed last week. The query runs, the schema matches, the harness logs a success, and the answer is wrong.

Atlan’s research across 522 enterprise queries found a 38% improvement in AI SQL accuracy when agents read governed metadata instead of bare schemas, with a 2.15x improvement on medium-complexity queries. The model was identical. The data context was not.

Three things have to be true before any agent touches your data.

  • Each asset the agent can read must be certified, so it never accidentally queries a deprecated table.
  • Each metric it reasons over must have a single definition, so “recognized revenue” means the same thing across all systems.
  • The lineage must be live, so that a schema change upstream reaches the agent before it produces an incorrect answer.

The three data-layer failure modes, freshness rot, uncertified table selection, and schema drift, are covered in depth in data quality for AI agent harnesses. Read it once. Then come back and build the rest of the harness on top of a layer you trust.
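
The three conditions listed above can be sketched as a pre-flight check that runs before any tool call. This is a minimal sketch, not Atlan's API: the `Asset` record, its fields, and the helper names are hypothetical, and in practice the certification status and schema version would come from your catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """Hypothetical catalog record; every field name here is an assumption."""
    name: str
    certified: bool
    deprecated: bool
    schema_version: int       # version the catalog currently reports
    seen_schema_version: int  # version the harness last validated against


def reachable_assets(assets):
    """Return the names of assets the agent may read."""
    allowed = []
    for asset in assets:
        if not asset.certified or asset.deprecated:
            continue  # uncertified or deprecated tables never reach the agent
        if asset.schema_version != asset.seen_schema_version:
            continue  # upstream schema changed; hold the asset until revalidated
        allowed.append(asset.name)
    return allowed


def single_definition(glossary, metric):
    """A metric is usable only if exactly one glossary definition exists."""
    return len(glossary.get(metric, [])) == 1


catalog = [
    Asset("finance.revenue_mart", True, False, 3, 3),
    Asset("finance.revenue_old", True, True, 1, 1),    # deprecated
    Asset("staging.orders_raw", False, False, 2, 2),   # uncertified
    Asset("finance.bookings", True, False, 5, 4),      # schema drift
]
glossary = {
    "recognized_revenue": ["placeholder definition"],
    "churn": ["definition A", "definition B"],         # ambiguous
}
```

With this sample catalog, only `finance.revenue_mart` passes all three checks, and `churn` is rejected because it has two competing definitions.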

Done when: the agent can only reach assets certified in Atlan, every metric it uses resolves to one glossary definition, and a schema change upstream raises a signal that the harness can read.

Step 1: Scope the agent and make the build-or-buy call

Step 0 gave the agent trustworthy inputs. Step 1 decides what the agent is for. Scope one workflow, not a platform. A fraud-detection agent that reviews flagged transactions is a buildable target. An “analytics assistant for the whole company” is not. A narrow scope is what makes every later step testable, because you can write an acceptance criterion for a single workflow.

The build-or-buy decision sits inside this step. You can adopt a vertical harness like Claude Code or Codex, extend an opinionated one like LangChain’s Deep Agents, or roll your own. Each option trades control for speed differently. Rather than rebuild that comparison here, use the Atlan decision matrix for harness tools, which scores eleven options across orchestration, observability, and licensing.

One caveat should temper the “just buy one and swap models later” instinct. Harnesses are not interchangeable across models. An X user put it bluntly: Claude in a Codex harness is bad, and so is GPT in Claude Code. Different models suit different harness profiles.

The portable thing is not the harness; it is the governed context underneath it, which is exactly what Step 0 built.

The build-or-buy trade-off

If your priority is | Lean toward | Because
Speed to a working prototype | A bought vertical harness | The batteries are included
Maximum control over the state and tools | Roll your own on a runtime | You decide every loop and gate
A specific stack or language fit | Match the framework to it | Fewer integration seams
Avoiding lock-in | An open, portable context layer | Context travels even when harnesses do not

Done when: the agent’s job fits in one sentence, the build-or-buy call is made and written down, and the team agrees on what “success” means for this single workflow.

Step 2: Build the core ReAct loop

The loop is simple to state and easy to get subtly wrong. The model reasons about the task, calls a tool, observes the result, and repeats until it reaches an answer or a stop condition. The danger is a missing stop condition, which lets the loop run forever. Define the exit before you define anything else.

Here is the minimal shape in Python. It is a pattern to adapt, not production code.

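class StopCondition(Exception):
    """Raised when the loop reaches max_steps without a final answer."""


def run_agent(task, tools, model, max_steps=10):
    # `model` and `tools` are placeholders for your own adapters:
    # model.reason(state) returns a thought, tools.call(action) runs a tool.
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        thought = model.reason(state)              # Reason
        if thought.is_final:
            return thought.answer
        result = tools.call(thought.action)        # Act
        state["history"].append(                   # Observe
            {"action": thought.action, "result": result}
        )
    raise StopCondition("max steps reached")        # Hard exit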

Two design choices earn their keep early. First, push as much of the loop toward determinism as you can. Harrison Chase of LangChain found that the reliable path is to make more and more of your agent deterministic rather than leaving every decision to the model. Second, give the agent a feedback signal inside the loop, such as the result of a verification command, so it can correct course before it exits.
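
To see both exits of the loop above in action, here is a self-contained sketch with stand-ins for the model and the tools. `Thought`, `ScriptedModel`, and `ToolBox` are hypothetical names invented for this example, and this variant of `run_agent` also returns the history so the run can be inspected afterward.

```python
from dataclasses import dataclass
from typing import Optional


class StopCondition(Exception):
    """Raised when the loop hits its step ceiling."""


@dataclass
class Thought:
    is_final: bool
    answer: Optional[str] = None
    action: Optional[dict] = None


class ScriptedModel:
    """Stand-in for a real model: replays a fixed list of thoughts."""

    def __init__(self, script):
        self._script = list(script)

    def reason(self, state):
        # Once the script runs out, keep repeating its last thought.
        index = min(len(state["history"]), len(self._script) - 1)
        return self._script[index]


class ToolBox:
    """Stand-in tool registry backed by a plain dispatch table."""

    def __init__(self, tools):
        self._tools = tools

    def call(self, action):
        return self._tools[action["tool"]](**action["args"])


def run_agent(task, tools, model, max_steps=10):
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        thought = model.reason(state)
        if thought.is_final:
            return thought.answer, state["history"]
        result = tools.call(thought.action)
        state["history"].append({"action": thought.action, "result": result})
    raise StopCondition("max steps reached")


tools = ToolBox({"add": lambda a, b: a + b})
step = Thought(False, action={"tool": "add", "args": {"a": 2, "b": 3}})

# Success path: one tool call, then a final answer.
answer, history = run_agent(
    "add", tools, ScriptedModel([step, Thought(True, answer="5")])
)

# Ceiling path: a model that never finishes must hit the hard exit.
try:
    run_agent("loop", tools, ScriptedModel([step]), max_steps=3)
    hit_ceiling = False
except StopCondition:
    hit_ceiling = True
```

A scripted model makes the loop deterministic, so the same test can run in CI without calling a real model.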

Done when: The loop runs a full task end-to-end, exits cleanly on both success and the step ceiling, and records every action and observation for later inspection.

Step 3: Author your AGENTS.md

The loop runs. Now tell the agent how your codebase and data world actually work. AGENTS.md is an open Markdown standard for agent-readable instructions, used by over 60,000 open-source projects. It tells the agent what conventions to follow, which paths are sensitive, how to run tests, and what to avoid.

The rule, from Google’s Addy Osmani, is that every line should trace to a specific thing that went wrong.

A starting structure for a data-facing agent looks like this.

# AGENTS.md

## Scope
This agent answers revenue questions for the finance domain.

## Data it may read
- Only assets certified in Atlan
- The glossary term "recognized_revenue" defines the metric

## Conventions
- Always cite the source table and its certification date
- Never query tables tagged "deprecated"

## How to verify
- Run "make eval-finance" before declaring a task complete

Keep it short and point outward. Instead of a thousand-line instruction block, a practitioner on the Fragmented podcast recommends pointing the agent at a docs folder organized by feature. The file is a map, not a manual.

Done when: The file fits on a screen or two, every rule traces to a real failure, and the agent can find your test command and your certified-data rule without a human in the loop.
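
Part of that "done" test can be automated. The sketch below is one possible lint, under stated assumptions: the required section names mirror the sample file above, and the 80-line limit is an arbitrary stand-in for "a screen or two". Neither is part of the AGENTS.md format itself.

```python
import re

# Assumed conventions for this example, taken from the sample file above.
REQUIRED_SECTIONS = ("Scope", "Data it may read", "Conventions", "How to verify")
MAX_LINES = 80  # assumed budget for "a screen or two"


def lint_agents_md(text, max_lines=MAX_LINES):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    lines = text.splitlines()
    if len(lines) > max_lines:
        problems.append(f"too long: {len(lines)} lines (limit {max_lines})")
    # Collect second-level headings such as "## Scope".
    headings = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+)$", text, re.MULTILINE)
    }
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")
    return problems


sample = """# AGENTS.md

## Scope
This agent answers revenue questions for the finance domain.

## Data it may read
- Only assets certified in Atlan

## Conventions
- Always cite the source table and its certification date

## How to verify
- Run "make eval-finance" before declaring a task complete
"""
```

Running `lint_agents_md(sample)` returns an empty list. A file that drops the verification section or grows past the line budget returns a readable list of problems that a CI job can print.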

Step 4: Layer the system prompt

AGENTS.md tells the agent about your repo. The system prompt tells it who it is. The system prompt sets the persona, task scope, output format, and the constraints the agent must adhere to across every turn. It works best as a thin layer that defers to the durable artifacts you already built.

The non-functional requirements, the testing rules, and the review standards all live better in docs and tests than in a swelling prompt.

The instruction hierarchy matters more than the prompt’s length. When the system prompt, AGENTS.md, and the task notes disagree, the agent picks one under pressure, and you rarely control which. Keep the prompt small, keep it consistent with the files beneath it, and resist the urge to solve every edge case with another paragraph.

Done when: the prompt fits in under a minute of reading, it does not contradict AGENTS.md, and it names the one job the agent does and the format it returns.
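
One way to keep the prompt thin is to assemble it from named layers and fail loudly when it grows. The sketch below is illustrative only: `build_system_prompt` is a hypothetical helper, and the 200-word budget is an assumed proxy for "under a minute of reading".

```python
def build_system_prompt(persona, scope, output_format, constraints, max_words=200):
    """Assemble a thin, layered system prompt and enforce a word budget."""
    layers = [
        f"Role: {persona}",
        f"Scope: {scope}",
        f"Output format: {output_format}",
        "Constraints:",
        *[f"- {constraint}" for constraint in constraints],
        # Defer to the durable artifacts rather than restating them here.
        "Follow AGENTS.md for repository conventions and verification steps.",
    ]
    prompt = "\n".join(layers)
    words = len(prompt.split())
    if words > max_words:
        raise ValueError(f"system prompt is {words} words; budget is {max_words}")
    return prompt


prompt = build_system_prompt(
    persona="Finance revenue analyst agent",
    scope="Answer revenue questions for the finance domain only",
    output_format="A short answer, then the source table and its certification date",
    constraints=["Read only certified assets", "Never run destructive statements"],
)
```

Because the builder raises instead of truncating, a prompt that swells past the budget fails in review rather than silently in production.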

Step 5: Classify permissions with lineage

The agent now reasons, reads instructions, and runs a loop. Before it can act on anything that matters, you must first classify what it is allowed to do.

Atlan drives the allow-list from lineage. A read permission should not just check a table name; it should also verify whether the asset is certified and trace its origin. Data lineage makes that allow-list governed rather than guessed. Governance travels with the context rather than living in a separate config that nobody updates.

permissions:
  read:
    - certified: true              # only assets certified in Atlan
  write:
    - require_approval: true       # draft, then human sign-off
  destructive:
    - blocked: true                # deletes and drops are off by default
allow_list:
  source: atlan_lineage            # derive the list from lineage, not a static file
  rule: upstream_of(certified_marts)

Enforce these deterministically, not by politely asking the model.
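
A deterministic gate for the policy above might look like the following sketch. The verb sets, the `Decision` enum, and `classify` are hypothetical names for illustration, and the certified set is hard-coded here where a real harness would refresh it from lineage.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    BLOCK = "block"


# Assumed verb groupings for this example.
DESTRUCTIVE = {"delete", "drop", "truncate"}
WRITES = {"insert", "update", "create"}


def classify(action, asset, certified_assets):
    """Decide in code, before the tool runs, what happens to a requested action."""
    verb = action.lower()
    if verb in DESTRUCTIVE:
        return Decision.BLOCK            # off by default, no model override
    if verb in WRITES:
        return Decision.NEEDS_APPROVAL   # draft, then human sign-off
    if verb == "read":
        # Reads are allowed only against the lineage-derived allow-list.
        return Decision.ALLOW if asset in certified_assets else Decision.BLOCK
    return Decision.BLOCK                # unknown verbs fail closed


certified = {"finance.revenue_mart"}  # in practice, refreshed when lineage changes
```

The final `return` matters most: any verb the policy does not recognize is blocked, so a new tool cannot slip through before someone classifies it.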

Done when: Read access is gated to certified assets, write actions route through approval, destructive operations are blocked by default, and the allow-list updates when lineage changes.

Step 6: Manage context and wire the glossary handshake

Permissions decide what the agent may touch. Context management decides what it actually sees, and this is where most long-running agents quietly degrade.

Context rot is real and measurable. Chroma’s research shows model performance varies significantly as input length changes, with simple tasks degrading as context fills with distractors. The fix is curation, not capacity.

Wire a handshake between your business glossary and AGENTS.md. Each glossary term that the agent needs becomes a context block that the harness injects at runtime, so “recognized revenue” arrives with its certified definition attached rather than as a bare column name. The glossary is the source of truth, the AGENTS.md reference points to it, and the harness resolves the term to its governed meaning before the model reasons.
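
The handshake can be sketched as a small resolver that runs before the model sees the task. Everything here is an assumption for illustration: the in-memory `GLOSSARY`, its placeholder definitions, and the `[glossary]` and `[task]` block labels. A real harness would fetch the terms from the governed glossary at runtime.

```python
# Hypothetical glossary snapshot; the definitions are placeholders.
GLOSSARY = {
    "recognized revenue": {
        "definition": "Placeholder for the certified finance definition.",
        "certified": True,
    },
    "pipeline": {
        "definition": "Draft definition, still under review.",
        "certified": False,
    },
}


def resolve_terms(question, glossary=GLOSSARY):
    """Return context blocks for certified glossary terms found in the question."""
    lowered = question.lower()
    blocks = []
    for term, entry in sorted(glossary.items()):
        if term in lowered and entry["certified"]:
            blocks.append(f"[glossary] {term}: {entry['definition']}")
    return blocks


def build_context(question, glossary=GLOSSARY):
    """Prepend governed definitions so the model reasons over them, not bare names."""
    return "\n".join(resolve_terms(question, glossary) + [f"[task] {question}"])


context = build_context("What was recognized revenue vs pipeline last quarter?")
```

Only the certified term is injected. The uncertified "pipeline" entry is left out, which keeps the context small and keeps draft definitions away from the model.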

Done when: The agent receives only relevant, certified context per step, glossary terms resolve to governed definitions at runtime, and context length stays inside the band where your model holds reasoning quality.

Step 7: Build the persistence layer

Context handles a single run. Persistence handles what survives between runs.

The payoff is restartability. An agent that persists its progress to disk can resume after a crash, hand off to a fresh context window, and be audited afterward. A minimal config makes the intent explicit.

state:
  backend: filesystem
  path: ./agent-state/
  events: append-only        # event-sourced, replayable
  reload_on_boot: true       # resume from the last event

This layer also feeds the next one. Because every event is recorded, observability in Step 8 reads from the same log rather than from a separate tracing system bolted on later.
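
The append-only, replayable behavior described in that config can be sketched with a JSON Lines file. `EventLog`, `rebuild_state`, and the event types are hypothetical names chosen for this example, not part of any particular framework.

```python
import json
import os
import tempfile


class EventLog:
    """Append-only JSON Lines log; state is rebuilt by replaying events."""

    def __init__(self, path):
        self.path = path

    def append(self, event):
        # Append mode means earlier events are never rewritten.
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(event) + "\n")

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]


def rebuild_state(events):
    """Fold events into the state the agent resumes from."""
    state = {"completed_steps": [], "done": False}
    for event in events:
        if event["type"] == "step_completed":
            state["completed_steps"].append(event["step"])
        elif event["type"] == "task_finished":
            state["done"] = True
    return state


with tempfile.TemporaryDirectory() as tmp:
    log = EventLog(os.path.join(tmp, "events.jsonl"))
    log.append({"type": "step_completed", "step": "fetch_schema"})
    log.append({"type": "step_completed", "step": "draft_query"})
    # Simulate a crash: a fresh object reads the same file and resumes.
    resumed = rebuild_state(EventLog(log.path).replay())
```

After the simulated restart, `resumed` lists both completed steps, so the agent can continue from the third step instead of starting over.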

Done when: The agent can stop and resume without losing progress, state reconstructs from the event log alone, and a completed run can be replayed step by step.

Step 8: Deploy observability and verification

Observability is not optional infrastructure. Harrison Chase of LangChain calls observability and evals a core part of the agent stack, the part that confirms the agent is working as intended. The first time an agent gives a confident wrong answer in production, the trace is the only thing that tells you whether the failure was the model, the loop, or the data.

The cost angle gives this step a sharp edge. You are paying rent on every turn. Observability is how you see the bill and trim it, and a governance layer outside the agents can further cut spending.
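
Seeing the bill can start as a simple fold over the event log. In the sketch below, the event shape and the per-token prices are assumptions made up for illustration; substitute your provider's actual rates and your own event fields.

```python
from collections import defaultdict

# Assumed prices for illustration only; replace with your provider's rates.
PRICE_PER_INPUT_TOKEN = 0.000003
PRICE_PER_OUTPUT_TOKEN = 0.000015


def summarize(events):
    """Aggregate call count, cost, and latency per task from model-call events."""
    totals = defaultdict(lambda: {"calls": 0, "cost": 0.0, "latency_ms": 0})
    for event in events:
        if event["type"] != "model_call":
            continue  # tool calls and other events carry no token cost here
        task = totals[event["task_id"]]
        task["calls"] += 1
        task["cost"] += (
            event["input_tokens"] * PRICE_PER_INPUT_TOKEN
            + event["output_tokens"] * PRICE_PER_OUTPUT_TOKEN
        )
        task["latency_ms"] += event["latency_ms"]
    return dict(totals)


events = [
    {"type": "model_call", "task_id": "t1", "input_tokens": 1000,
     "output_tokens": 200, "latency_ms": 850},
    {"type": "tool_call", "task_id": "t1", "tool": "run_sql"},
    {"type": "model_call", "task_id": "t1", "input_tokens": 1500,
     "output_tokens": 100, "latency_ms": 640},
]
report = summarize(events)
```

Because the summary reads the same events the persistence layer already writes, it needs no separate tracing system to get started.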

Verification belongs here too. Run security and reliability review agents on every push and in CI, checking proposed changes against the documented standards. The verification reads the same governed context that the agent does, which is why Step 0 had to come first.

Done when: Every run is traced from the event log, per-task cost and latency are visible, and a verification check runs automatically before the agent declares a task complete.

Step 9: Add guardrails and human checkpoints

Observability tells you what happened. Guardrails decide what is allowed to happen at all. Guardrails layer on top of the permission classification from Step 5 with input validation, output validation, network isolation, and human-in-the-loop checkpoints.

Rather than merging agent changes directly into the main branch, open a pull request that puts a human in the loop. Reversibility lowers the cost of any single mistake.

Hold the line between deterministic and inferential controls. A content filter that runs as code is a guarantee. A model instructed to behave is a hope. Put the controls that matter in code.
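
As a rough sketch of controls that run as code, the example below pairs a deterministic output check with a human checkpoint for irreversible actions. The email pattern, the length limit, and the function names are assumptions chosen for illustration, not a complete guardrail set.

```python
import re

# Deliberately simple pattern for the example; real PII detection needs more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def validate_output(text, max_chars=2000):
    """Deterministic output checks that run as code, not as instructions."""
    problems = []
    if len(text) > max_chars:
        problems.append("output too long")
    if EMAIL.search(text):
        problems.append("output contains an email address")
    return problems


def execute(action, approved_by=None):
    """Irreversible actions need a named human approver before they run."""
    if action["irreversible"] and not approved_by:
        return {"status": "pending_approval", "action": action["name"]}
    return {"status": "executed", "action": action["name"], "approved_by": approved_by}


refund = {"name": "issue_refund", "irreversible": True}
pending = execute(refund)
approved = execute(refund, approved_by="alice")
```

The same action returns `pending_approval` without a sign-off and `executed` with one, and that decision never depends on what the model was told in its prompt.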

Done when: Irreversible actions require human sign-off, inputs and outputs pass deterministic checks, the agent cannot loop indefinitely, and every guarded action is reversible or logged.

Step 10: Evaluate the harness on its own

Evals are not benchmark runs you do once. The best evals are discovered from real-world traces and run as regression tests, not written from scratch.

This is where you commit to a specific eval framework, because vague advice helps no one mid-build. Pick one, turn a handful of production traces into test cases, and gate releases on them.

Evaluate the harness independently of the model, so you know which one moved. LangChain’s own work is the cleanest proof that this matters: it lifted a coding agent from 52.8% to 66.5% on Terminal Bench, a 13.7-point swing, by tuning only the harness and keeping the model fixed.

One more guard. Do not let the agent grade its own homework. A self-scoring agent inflates its own grade even when the quality is obviously mediocre to a human. Use an independent scorer.
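
A regression gate with an independent scorer can be sketched in a few lines. The cases, the exact-match scorer, and `candidate_agent` below are hypothetical examples. Real suites usually need richer scoring, but the shape stays the same: replay trace-derived cases and fail the build on a regression.

```python
def exact_match_scorer(expected, actual):
    """Independent scorer: plain code, not the agent grading itself."""
    return expected.strip().lower() == actual.strip().lower()


def run_regression(cases, agent, scorer=exact_match_scorer, min_pass_rate=1.0):
    """Replay trace-derived cases; return (passed_gate, pass_rate, failures)."""
    failures = []
    for case in cases:
        actual = agent(case["input"])
        if not scorer(case["expected"], actual):
            failures.append(case["id"])
    pass_rate = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return pass_rate >= min_pass_rate, pass_rate, failures


# Cases distilled from production traces (hypothetical examples).
cases = [
    {"id": "trace-001", "input": "q3 revenue table", "expected": "finance.revenue_mart"},
    {"id": "trace-002", "input": "bookings table", "expected": "finance.bookings"},
]


def candidate_agent(question):
    """Stand-in for the harness under test; it regresses on the second case."""
    return {"q3 revenue table": "finance.revenue_mart"}.get(question, "unknown")


ok, rate, failed = run_regression(cases, candidate_agent)
```

In CI, a wrapper would exit with a non-zero status whenever `ok` is false, so the regression in `trace-002` blocks the release.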

Done when: Evals derive from real traces, they run on every release, and a regression in the harness fails the build before it reaches production.

Where do build sequences break?

You have the sequence. Here is where teams lose it, mapped back to the steps that prevent each break.

Five failure modes recur, and each ties to a step you just built.

  • Tool sprawl breaks Step 1. Vercel removed 80% of its agent’s tools and got a 100% success rate with fewer tokens and faster runs.
  • Context flooding breaks Step 6, the reason curation beats capacity.
  • Missing governance breaks Step 0 and Step 5, the silent failures that produce confident wrong answers.
  • Skipping evals breaks Step 10.
  • No AGENTS.md breaks Step 3, which is the cheapest fix on the list.

Best practices to lock in:

  • Certify the data layer before you write the loop, so the agent never reads an uncertified table.
  • Keep the tool set small, because selection quality drops as the options multiply.
  • Enforce permissions and guardrails in code, not in the prompt.
  • Keep AGENTS.md short and pointed at the docs it should rely on.
  • Build evals from real production traces and run them on every release.

Atlan has a full taxonomy of thirteen named anti-patterns across architectural, execution, and data-layer tiers in agent harness failures and anti-patterns. When a build stalls, that page is the diagnostic to reach for because it sorts failures by the layer they occur in.


Frequently asked questions

How long does it take to build a production AI agent harness?

Most teams reach a production-ready harness in 4 to 12 weeks, writing roughly 5,000 to 20,000 lines of infrastructure code. The range depends on scope, existing data governance, and how many of the 10 steps you can buy rather than build. Teams with a certified data layer already in place move faster, because Step 0 is the slowest step for organizations starting from ungoverned data.

What is the difference between an agent framework and an agent harness?

A framework is the programming model and runtime for building agents, such as tool definitions and coordination patterns. A harness is the full assembled system that runs in production: the framework plus the loop, state, permissions, context, guardrails, evals, and the certified data the agent reads.

What is the difference between a deterministic harness and an inferential harness?

A deterministic harness enforces behavior in code, so a control either runs or it does not. An inferential harness leaves the same decision to the model, which turns a guarantee into a hope. The reliable pattern pushes as much of the loop toward determinism as possible, then reserves model judgment for the steps that genuinely need reasoning. Harrison Chase of LangChain frames the goal as making more and more of the agent deterministic rather than leaving every decision to the model.

How many tools should an AI agent have?

Fewer than most teams assume. Vercel removed 80% of its agent’s tools, cutting from a large set to a focused few, and saw success rates rise to 100% with fewer tokens and faster runs. Tool selection quality degrades as the tool set grows past roughly 20 options.

How often should I retune evals?

Treat evals as living regression tests, not a one-time benchmark. Run the full suite on every release, so a harness change that breaks behavior fails the build before it ships. Whenever a production trace exposes a failure mode the suite missed, turn that trace into a new case, because the strongest evals are discovered from real traces. There is no fixed calendar here; the trigger is new evidence from production, not the passage of time.

How do I know if a failure is in the harness or the model?

Trace it, then check the data layer first. Most failures attributed to the model are harness failures, and most harness failures are data-layer failures: stale context, uncertified tables, or schema drift. Reproduce the failure from your event log, confirm that the data the harness fed to the agent was current and certified, and then check for schema drift in tool calls. Only conclude the model is at fault after ruling out the data and the harness.

Can I share a harness across teams?

Share the context layer, not the harness profile. A harness tuned for one workflow and one model rarely transfers cleanly, because different models suit different harness shapes. What does travel is the governed data underneath it: the certification, lineage, and glossary definitions that any harness can inherit through MCP, SQL, or APIs. Engineer that context once, and every team reads from the same trusted substrate.

How Atlan fits underneath any harness

Atlan is not a harness, and it does not orchestrate agents. It is the governed data substrate every harness in this guide reads from.

Atlan’s Context Lakehouse saw 8 billion reads in 90 days, all from agents and MCPs reading shared context across the enterprise. Its 522-query study measured a 38% improvement in AI SQL accuracy from governed metadata, with a 2.15x lift on medium-complexity queries.

Here is how that substrate maps to the steps you built. The fit is direct, not decorative.

  • Step 0 and Step 5 read certification and lineage from active metadata, so the agent only ever touches assets it can trust.
  • Step 6’s glossary handshake resolves business terms through the business glossary, so “recognized revenue” carries one governed definition.
  • Every step reaches this context the same way, through the Atlan MCP server, which exposes glossary, lineage, and certification to any MCP-aware harness.
  • Schema changes surface as signals through data contracts, so drift reaches the harness before it reaches a wrong answer.

The portability point closes the loop from Step 1. Harness profiles are not portable across models, but the governed context is. You engineer it once, certify it once, and let any harness inherit it through MCP, SQL, or APIs.

CME Group built the same foundation at scale. In its first year, the team cataloged over 18 million assets and defined more than 1,300 glossary terms, with data teams trusting and reusing that context across use cases.

In the same vein, DigiKey’s framing captures why this matters for harness builders. Sridher Arumugham, Chief Data and Analytics Officer, called Atlan more of a context operating system than a catalog of catalogs, activating metadata for everything from discovery to an MCP server delivering context to AI models.

"We're excited to build the future of AI governance with Atlan. All of the work that we did to get to a shared language amongst people...can be leveraged by AI via context infrastructure."

— Joe DosSantos, VP Enterprise Data & Analytics, Workday

"Atlan is much more than a catalog of catalogs. Atlan is the context layer for all our data and AI assets."

— Sridher Arumugham, Chief Data & Analytics Officer, DigiKey

Watch the Context Studio demo to see how the certified data layer feeds a live harness.

Atlan is the Context Layer for AI — a Leader in the Gartner Magic Quadrant for D&A Governance (2026) and the Forrester Wave for Data Governance (Q3 2025). Atlan unifies your data, business knowledge, and the meaning behind your terms into one Enterprise Data Graph that gives every team and every AI agent the trusted context they need. Trusted by Mastercard, Workday, General Motors, CME Group, HubSpot, FOX, Virgin Media O2, Elastic, and 400+ enterprises representing $10T+ in market cap.
