Harness Engineering

Amit Puri

The Problem

Engineering organizations adopting AI coding agents encounter a predictable and dangerous pattern:

Confident, Fast, and Wrong: Agents without sensors produce incorrect outputs at machine speed — defects scale with velocity
Context Drift: Critical rules injected into conversation history are lost to compaction; agents that worked yesterday forget their constraints today
State Degradation: Multi-session tasks lose context at session boundaries — agents rediscover solved problems and repeat resolved failures
Uncontrolled Autonomy: Agents with broad tool access take unexpected actions; without sensors and guardrails, scope creep is inevitable
Measurement Blindness: 94% of organizations accumulate AI costs in blind spots — tech debt, validation overhead, and developer burnout invisible in current metrics

The Solution: Harness Engineering

We will build a production-grade agent harness using the Guides-and-Sensors framework that can:

Apply the Ratchet Principle: Every agent failure permanently raises the floor — encoded in guides and sensors that prevent recurrence
Survive Context Compaction: Critical rules live in CLAUDE.md / AGENTS.md — the only guaranteed survivors of long-session compaction
Maintain Durable State: MEMORY.md, session journals, and task boards provide continuity across session boundaries and agent handoffs
Expand Trust Incrementally: Sensor pass rates — not intuition — gate each stage of autonomy expansion
Make Behavior Observable: Every tool call audited, every sensor result recorded, every cost tracked

End-to-End Harness Engineering Scenario

Throughout this guide, we walk through a complete harness engineering implementation from a real-world fintech deployment. The scenario covers all eight harness components working together in a production self-improving agent system.

Context: A fintech engineering team deploys AI coding agents across a 200K-LOC codebase serving regulated financial services. The scenario proceeds in five steps:

Failure Detection: PostToolUse hook catches secret exposure — audit logs provide full reproduction context
Root Cause via Observability: LLM traces reveal the exact context path that led the agent to copy credential patterns
Three-Part Ratchet Fix: Guide rule + sensor wiring + MCP index exclusion — atomic harness update merged as a single PR
Cross-Agent Verification: Fix validated across implementation, review, and orchestrator agents before deployment
Data-Backed Trust Expansion: 30-day sensor pass rate of 96.3% justifies promoting the agent from Stage 3 to Stage 4 autonomy

The scenario demonstrates the complete ratchet cycle, sensor-guided trust expansion, and the harness as a living organizational asset that improves with every failure.
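
The trust-expansion step in the scenario can be sketched as a small gate over recorded sensor results. This is a minimal illustration only: the `SensorResult` record shape, the 95% default threshold, and the one-stage-at-a-time policy are assumptions, since the scenario states the pass rate (96.3%) and the promotion but not the rule behind it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorResult:
    """One recorded sensor run (hypothetical record shape)."""
    sensor: str
    passed: bool


def pass_rate(results: list[SensorResult]) -> float:
    """Fraction of sensor runs that passed; 0.0 when nothing was recorded."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def next_stage(current_stage: int, results: list[SensorResult],
               threshold: float = 0.95) -> int:
    """Promote by one autonomy stage only when the pass rate clears the threshold."""
    if pass_rate(results) >= threshold:
        return current_stage + 1
    return current_stage
```

With 963 passing runs out of 1,000, `next_stage(3, results)` returns 4; with a stricter `threshold=0.99` the agent stays at Stage 3.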

SECTION 2: HARNESS ENGINEERING OVERVIEW

What Is Harness Engineering? The Fourth Paradigm of AI Engineering

Harness Engineering is the discipline of designing everything around an AI model that makes it a useful, reliable, and governable agent. It is named after the equestrian harness, the complete equipment set for channeling a powerful but unpredictable animal, and its core idea is captured in a single architectural equation: Agent = Model + Harness. The model is the stateless reasoning engine; the harness is the runtime software infrastructure that coordinates tool dispatch, context management, sensor validation, and safety enforcement. Coined by Mitchell Hashimoto and formalized by Ryan Lopopolo at OpenAI in February 2026, it is presented here as the fourth paradigm of AI engineering, following prompt engineering, context engineering, and agent engineering.

The Four Paradigms of AI Engineering

Prompt Engineering (2022–2023): A single instruction as the entire program
Context Engineering (2023–2024): Curating the context window with retrieval, memory, and tools
Agent Engineering (2024–2025): Handing the loop to the model to reason, act, and observe
Harness Engineering (2025–Present): Engineering the runtime infrastructure around the model

The Ratchet Principle

The core operating principle of harness engineering, attributed to Mitchell Hashimoto:

Never Repeat Mistakes: "Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."
Systematic Encoding: Encode fixes in guides (CLAUDE.md) or sensors (hooks, tests) rather than simply retrying
One-Way Improvement: Each fix permanently raises the floor of agent quality
Infrastructure as Knowledge: The harness accumulates organizational knowledge over time
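
One way to picture the ratchet is as a registry that only grows: each class of failure contributes a guide rule and a sensor. The sketch below is illustrative; the `Ratchet` class, its method names, and the example rule are hypothetical and not part of any agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable

Sensor = Callable[[str], bool]  # returns True when the output passes


@dataclass
class Ratchet:
    """Accumulates guide rules and sensors, one pair per class of failure."""
    guide_rules: list[str] = field(default_factory=list)
    sensors: dict[str, Sensor] = field(default_factory=dict)

    def record_failure(self, failure_class: str, rule: str, sensor: Sensor) -> None:
        """Encode a failure class as a rule plus a sensor.

        A repeat of a known class replaces its sensor without duplicating the rule.
        """
        if failure_class not in self.sensors:
            self.guide_rules.append(rule)
        self.sensors[failure_class] = sensor

    def failing_sensors(self, output: str) -> list[str]:
        """Names of the failure classes whose sensor rejects this output."""
        return [name for name, check in self.sensors.items() if not check(output)]
```

The rule list is what would be written into a guide file, and `failing_sensors` is what a hook or CI step would call on each new agent output.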

Agent = Model + Harness

The fundamental architectural equation of harness engineering:

Model (Stateless Reasoning Engine): Claude, GPT-4o, Gemini — interchangeable compute that reasons and generates
Harness (Runtime Infrastructure): Guides, sensors, tools, memory, orchestration, permissions, and observability
65% of Failures Are Harness Failures: Context drift, schema misalignment, and state degradation — not model limitations
Harness Is the Differentiator: Cursor and Codex run on overlapping models; the harness is what differs

The Eight Load-Bearing Harness Components

A complete, production-ready harness is a layered system of eight interdependent components. Each is a distinct engineering domain:

1. Guides (Pre-Action Steering)

Persistent instructions that steer the agent before it acts, surviving context compaction by living in the system prompt layer. Examples: CLAUDE.md, AGENTS.md, .cursorrules, Trellis spec files.

2. Sensors (Post-Action Validation)

Checks that validate the agent after it acts — linters, test runners, output parsers, evals, and semantic validators that enforce the ratchet principle.

3. MCP Tool Interfaces

Model Context Protocol servers that give agents controlled, auditable access to external systems — APIs, databases, CI/CD pipelines, ticketing systems, and compliance checks.

4. Memory & State

Durable context across sessions — MEMORY.md, vector stores, conversation history, and session journals that prevent state degradation over long-running tasks.

5. Orchestration

Multi-agent coordination — task boards, sub-agent dispatch, worktree isolation, A2A delegation patterns, and session handoff protocols.

6. Observability

End-to-end tracing, audit logs, cost controls, token budgets, and LLM observability tooling that make agent behavior inspectable and debuggable.

7. Permissions & Guardrails

Allowlists, tool-risk ratings, human-in-the-loop triggers, and approval gates that enforce what the agent can and cannot do — especially for high-stakes actions.

8. Garbage Collection

Codebase entropy management — periodic pruning of dead code, stale rules, and conflicting patterns that accumulate as agents generate at scale.

Why Harness Engineering Emerged in 2026

Four forces converged to make harness engineering the defining discipline of 2026:

Model Parity: Performance gaps between frontier models narrowed; the harness became the differentiating variable
Scale Evidence: OpenAI's February 2026 experiment shipped 1M+ LOC across 1,500+ PRs — all AI-authored. Engineers spent their time on harness, not code.
Measurement Crisis: The State of Engineering Excellence 2026 (700 practitioners, 5 countries) found that 94% of organizations had AI adoption costs accumulating in blind spots their metrics couldn't see
Production Pain: Teams learned that deploying agents without sensors and guardrails produced confident, fast, and wrong outputs at scale

SECTION 3: HARNESS FRAMEWORK & PRIMITIVES

The Guides-and-Sensors Framework: Building Reliable Agent Harnesses

The Guides-and-Sensors framework is the organizing principle of harness engineering. Guides are constraints that steer the agent before it acts — persistent instructions that survive context compaction. Sensors are checks that validate the agent after it acts — automated verification systems that enforce correctness and apply the ratchet principle. Together, they form a closed-loop control system around any AI model.

Guide Primitives: Pre-Action Steering

Core guide files that inject persistent instructions into agent sessions:

CLAUDE.md: Anthropic's Claude Code reads this file automatically on session start — injecting it into the system prompt where it survives context compaction. Keep under 100 lines; move critical rules here.
AGENTS.md: OpenAI Codex's equivalent. Hierarchical — a root AGENTS.md provides org-wide rules; sub-directory files narrow scope to individual repos or services.
.cursorrules: Cursor IDE's guide file. Team-shareable via version control; applied to every AI interaction in that workspace.
Trellis Spec Files: Progressive spec system — agents load only the standards, task PRDs, and session journals relevant to the current step, replacing monolithic guide files.

Sensor Primitives: Post-Action Validation

Core sensor types that validate agent outputs and enforce the ratchet principle:

Static Linters: ESLint, Pylint, Rubocop — deterministic checks run post-generation to catch syntax, style, and safety violations before commit
Test Runners: Unit and integration tests as sensors; the agent must pass the test suite before a change is accepted
Output Parsers: Schema validators that verify agent responses conform to expected JSON, YAML, or structured formats
LLM Evals: A second model reviews the first model's output — "LLM-as-judge" patterns for open-ended quality checks where deterministic sensors are insufficient

Hook Primitives: Programmatic Interception

Hooks are the lowest-level harness primitive — programmatic interception points that fire around every agent action, enabling fine-grained sensor composition:

PreToolUse Hooks: Fire before any tool call — validate inputs, enforce permissions, log intent, and block disallowed actions before execution begins
PostToolUse Hooks: Fire after tool execution to validate outputs, log results, trigger follow-on actions, and update memory with outcomes
Pre-commit Hooks: Git-level sensors that run the full sensor suite before any agent-generated code reaches version control
Session Hooks: Fire on session start/end — load context from MEMORY.md, write session journals, and trigger garbage collection workflows
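
The interception idea behind PreToolUse and PostToolUse hooks can be sketched as a wrapper around tool dispatch. This is a framework-agnostic illustration, not the hook API of Claude Code or any other agent; `run_tool`, the `ToolCall` shape, and the convention that a pre-hook vetoes by raising `PermissionError` are all assumptions.

```python
from typing import Any, Callable

ToolCall = dict[str, Any]                      # e.g. {"tool": "bash", "args": {...}}
PreHook = Callable[[ToolCall], None]           # raises PermissionError to block
PostHook = Callable[[ToolCall, Any], None]     # observes the call and its result


def run_tool(call: ToolCall, tools: dict[str, Callable[..., Any]],
             pre_hooks: list[PreHook], post_hooks: list[PostHook]) -> Any:
    """Run one tool call with interception points on both sides."""
    for hook in pre_hooks:       # any pre-hook may veto the call by raising
        hook(call)
    result = tools[call["tool"]](**call["args"])
    for hook in post_hooks:      # post-hooks run only if the tool executed
        hook(call, result)
    return result
```

A pre-hook that raises stops the call before the tool runs, so post-hooks never see a blocked action.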

MCP Tool Interface Primitives

Model Context Protocol (MCP) servers are the standard tool interface primitive in harness engineering:

Local MCP Servers: Run on the agent's machine — filesystem access, bash execution, browser automation (e.g., npx @playwright/mcp)
Remote MCP Servers: HTTP-based servers connecting agents to cloud services — Linear, Sentry, Jira, Slack, CI/CD systems
Symbol Indexing Servers: Token Savior-style MCP servers that index codebases by symbol (functions, classes, call graphs), cutting active tokens by 77% and wall time by 76%
AI Gateways: Kong AI Gateway, Portkey — governance layer between agent and MCP servers, enforcing rate limits, tool allowlists, and audit logging

Memory Primitives

Durable state primitives that prevent context degradation across sessions:

MEMORY.md: Markdown file loaded at session start — captures user preferences, project conventions, and lessons learned in a human-readable format
Session Journals: Per-session logs of decisions, errors, and resolutions — enables Trellis-style progressive spec loading
Vector Stores: Semantic memory for long-horizon recall — Pinecone, Chroma, Weaviate — indexed by session content and retrieved by similarity
Task Boards: Structured state objects (JSON/Markdown) tracking what has been done, what is in progress, and what is blocked — critical for multi-session tasks
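
A task board can be as small as a JSON file that each session reloads. The sketch below assumes a hypothetical `Task` shape and status vocabulary (`todo`, `in_progress`, `blocked`, `done`); the text only requires that done, in-progress, and blocked work be distinguishable.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Task:
    id: str
    title: str
    status: str = "todo"          # assumed values: todo, in_progress, blocked, done
    notes: list[str] = field(default_factory=list)


def save_board(tasks: list[Task]) -> str:
    """Serialize the board to JSON so the next session can reload it."""
    return json.dumps([asdict(t) for t in tasks], indent=2)


def load_board(text: str) -> list[Task]:
    """Rebuild the board from its JSON form."""
    return [Task(**item) for item in json.loads(text)]


def resume_points(tasks: list[Task]) -> list[str]:
    """IDs a new session should pick up: anything started but not finished."""
    return [t.id for t in tasks if t.status in ("in_progress", "blocked")]
```

A session-start hook could call `load_board` and hand `resume_points` to the agent, so that solved problems are not rediscovered.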

Harness Primitive Composition Patterns

How harness primitives compose into production-grade agent workflows:

Triage → Code → Review Pipeline: Orchestrator agent reads task board → Claude Code writes fix → review agent validates → pre-commit hook runs sensors → PR opened
Long-Session Coding Agent: CLAUDE.md (guides) + PreToolUse hooks (permissions) + test runner (sensors) + MEMORY.md (state) + session journal (continuity)
Regulated Industry Deployment: AGENTS.md (compliance rules) + AI Gateway (tool governance) + LLM eval (output quality) + audit log hook (observability) + human approval gate (guardrail)
Multi-Agent Codebase: Root AGENTS.md + service-level sub-guides + worktree isolation (no cross-contamination) + shared task board (coordination) + central observability (unified tracing)

Research Foundation

Harness Engineering draws on several foundational sources:

OpenAI Harness Engineering Post (Feb 2026): Ryan Lopopolo's account of shipping 1M+ LOC with zero human code — the discipline showed up in scaffolding, not code
Mitchell Hashimoto's Ratchet Principle (Feb 2026): "Engineer a solution such that the agent never makes that mistake again" — the core operating principle
Martin Fowler's Harness Engineering Guide: Practitioner-level treatment of guide and sensor patterns for coding agents
Enterprise Failure Analysis: 65% of agent failures trace to harness defects — context drift, schema misalignment, state degradation (Atlan, 2026)

SECTION 4: GUIDES & CONTEXT FILES

Guides: Persistent Instructions That Survive Context Compaction

Guides are the most important harness primitive for preventing context drift. Unlike instructions injected into the conversation history — which are vulnerable to compaction and forgetting — guides live in the system prompt layer and persist across every turn of a session. A CLAUDE.md file in the repository root is automatically injected into every Claude Code session; an AGENTS.md at the repo root does the same for OpenAI Codex. The content of these files encodes organizational knowledge: coding standards, security rules, architectural constraints, and lessons from past agent failures.

CLAUDE.md: The Claude Code Guide File

The primary guide primitive for Claude Code-based harnesses:

Auto-Injection: Claude Code reads CLAUDE.md at session start and injects it into the system prompt — survives all context compaction events
60-Line Discipline: Best-practice teams keep CLAUDE.md to roughly 60–100 lines. Bloated guide files create their own noise; move detailed rules to Trellis spec files or tool-level guides
Critical Rules Only: Compaction analysis (March 2026) confirms that rules in the conversation history can be lost; rules in CLAUDE.md cannot. Move any rule that has caused a repeated failure here.
Version Controlled: Treated as code — reviewed in PRs, tracked in git history, and updated via the ratchet principle after every class of agent failure
Hierarchical Scoping: Root CLAUDE.md for org-wide rules; subdirectory CLAUDE.md files for service-specific overrides — agents apply the most specific applicable guide

AGENTS.md: The OpenAI Codex Guide File

The guide primitive for Codex and other OpenAI agent deployments:

Same Role, Different Agent: AGENTS.md serves the identical function as CLAUDE.md — persistent system-prompt injection for Codex-based agents
Containerized Context: Codex runs in cloud sandboxes; AGENTS.md files are mounted into the container at runtime alongside the codebase
Hierarchical Inheritance: Root AGENTS.md → repo AGENTS.md → directory AGENTS.md — more specific files override more general ones for the same rule category
Tool Definition Inclusion: AGENTS.md may include tool schemas and MCP server references, making it a combined guide-and-tool-manifest file
Cross-Agent Portability: Many teams maintain both CLAUDE.md and AGENTS.md with shared content — guides should be agent-agnostic where possible
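
Hierarchical scoping can be illustrated by walking from the repository root down to the working directory and collecting each guide file on the way. This is a sketch under stated assumptions: `collect_guides` and `compose_guides` are hypothetical helpers, and placing the most specific file last is one simple way to model "more specific overrides more general", not a description of how Claude Code or Codex merge guides internally.

```python
from pathlib import Path


def collect_guides(repo_root: Path, working_dir: Path,
                   filename: str = "AGENTS.md") -> list[Path]:
    """Guide files from the repo root down to working_dir, most general first."""
    repo_root = repo_root.resolve()
    relative = working_dir.resolve().relative_to(repo_root)  # ValueError if outside
    directories = [repo_root]
    for part in relative.parts:
        directories.append(directories[-1] / part)
    return [d / filename for d in directories if (d / filename).is_file()]


def compose_guides(paths: list[Path]) -> str:
    """Concatenate guides so the most specific file appears last."""
    return "\n\n".join(p.read_text(encoding="utf-8") for p in paths)
```

Passing `filename="CLAUDE.md"` applies the same walk to Claude Code guide files.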

Trellis: Progressive Spec Systems (Beyond Monolithic Guides)

For larger harnesses, monolithic CLAUDE.md files become unmanageable. The Trellis pattern replaces them with a progressive spec system:

Standards Files: Org-wide coding standards, security policies, and architectural constraints — loaded for every session regardless of task
Task PRDs: Product requirement documents for the current task — loaded only when the agent is working on that specific feature or bugfix
Session Journals: Per-session logs of decisions and errors, loaded at session start to restore continuity from prior work
Step-Level Loading: Agents load only the specs relevant to the current step — not the full org knowledge base — preventing token waste
Spec Versioning: Each spec file is versioned independently; an agent working on an old task loads the spec that was current when work began
Dynamic Composition: An orchestrator determines which specs to inject based on current task context — similar to RAG for guide files
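
Step-level loading can be sketched as a filter over a spec catalog. The `Spec` record and `select_specs` function below are hypothetical; they only illustrate the selection rule described above (standards always load, task PRDs and journals load for their task, step-scoped specs load only for their steps) and do not reproduce any actual Trellis implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Spec:
    name: str
    kind: str                              # "standard", "task_prd", or "journal"
    task_id: Optional[str] = None
    steps: frozenset[str] = frozenset()    # empty means "relevant to every step"


def select_specs(specs: list[Spec], task_id: str, step: str) -> list[str]:
    """Pick the spec names to inject for one step of one task."""
    chosen = []
    for spec in specs:
        if spec.kind == "standard":
            chosen.append(spec.name)       # standards load for every session
            continue
        if spec.task_id != task_id:
            continue                       # PRDs and journals are scoped to their task
        if spec.steps and step not in spec.steps:
            continue                       # step-scoped specs load only for their steps
        chosen.append(spec.name)
    return chosen
```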

MEMORY.md: Durable Cross-Session State

Persistent user-level and project-level memory loaded at session start:

User Preferences: Coding style preferences, preferred libraries, test framework choices — reduces repetitive re-explanation across sessions
Project Conventions: Naming patterns, architectural decisions, and gotchas specific to the codebase — loaded from MEMORY.md as living documentation
Lesson History: Past agent failures and their resolutions — the ratchet's memory, ensuring the agent doesn't repeat solved problems
Agent-Writable: The agent may update MEMORY.md during a session, creating a feedback loop between execution and guide improvement

Guide Authoring Best Practices

Proven patterns for writing effective guide files:

Positive Framing: State what the agent should do, not just what to avoid — "Always write tests before implementation" over "Don't skip tests"
Ratchet-Driven Updates: Add a new rule after every class of failure, and avoid adding rules that no observed agent failure motivates
Section Structure: Organize guides by domain (Security, Code Style, Testing, Architecture) for predictable recall
Token Discipline: Every line in a guide costs tokens every session — audit and prune rules that no longer apply as the codebase and team evolve

Common Guide Anti-Patterns

Patterns that degrade guide effectiveness:

Guide Bloat: Hundreds-of-lines CLAUDE.md files that create noise, dilute critical rules, and waste tokens — use Trellis progressive specs for large harnesses
Relying on Conversation History: Placing critical rules in conversation turns rather than guide files — these are the first things lost to compaction
Stale Rules: Guide files that haven't been audited as the codebase evolved; rules that contradict the current architecture give the agent conflicting instructions
No Version Control: Treating CLAUDE.md as an informal scratch file — guides are organizational infrastructure and must be reviewed, versioned, and approved
Agent-Unreadable Format: Dense prose paragraphs rather than bulleted lists — guides parsed by an LLM should use clear imperative structure
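
Two of these anti-patterns, guide bloat and dense prose, are easy to check mechanically. The function below is a minimal sketch: the 100-line budget follows the guidance in this section, while the 200-character line limit and the name `audit_guide` are assumptions chosen for illustration.

```python
def audit_guide(text: str, max_lines: int = 100, max_line_chars: int = 200) -> list[str]:
    """Return human-readable warnings for a guide file's content."""
    warnings = []
    non_blank = [line for line in text.splitlines() if line.strip()]
    if len(non_blank) > max_lines:
        warnings.append(f"guide has {len(non_blank)} non-blank lines (budget {max_lines})")
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_line_chars:
            # Very long lines usually mean prose paragraphs rather than imperative bullets.
            warnings.append(f"line {number} is {len(line)} chars; consider splitting into bullets")
    return warnings
```

Run as a pre-commit check, this turns the line budget into a sensor on the guide itself.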

SECTION 5: MCP TOOLS & TOOL INTERFACES

Model Context Protocol: The Standard Tool Interface for AI Agents

The Model Context Protocol (MCP) has become the dominant standard for wiring AI agents to external tools, data sources, and services. MCP servers are the primary mechanism by which a harness extends an agent's capabilities beyond file I/O and bash commands. By May 2026, MCP's combined Python and TypeScript SDKs had surpassed 97 million monthly downloads, and every major AI provider ships MCP-compatible tooling. MCP is now the TCP/IP of the AI agent layer: the protocol that makes tool interfaces interoperable across models, frameworks, and vendors.

Local MCP Servers

Servers that run on the agent's local machine or container:

Filesystem Access: Controlled read/write access to local directories — respecting permission boundaries defined in the harness
Bash Execution: Sandboxed shell access for running builds, tests, and scripts — PreToolUse hooks validate commands before execution
Browser Automation: npx @playwright/mcp gives agents browser control for UI testing, scraping, and web interaction tasks
Symbol Indexing: Token Savior-style servers that build call-graph indices so agents navigate codebases by pointer, cutting token use by 77%

Remote MCP Servers

HTTP-based servers connecting agents to cloud services and enterprise systems:

Issue Trackers: Linear, Jira, GitHub Issues — agents can read tickets, update status, and create follow-up tasks as part of automated workflows
Observability Platforms: Sentry, Datadog — agents query error logs, traces, and alerts to diagnose production issues autonomously
Communication Tools: Slack, Teams — agents post status updates, escalate to humans, and read relevant threads for context
CI/CD Pipelines: Harness CD, GitHub Actions, CircleCI — agents trigger builds, query pipeline status, and respond to failures

AI Gateways: Governance at the Tool Interface Layer

AI gateways sit between the agent and MCP servers, enforcing the governance layer of the harness:

Tool Allowlisting: Define exactly which MCP tools the agent may call — block access to production databases, billing APIs, or destructive operations from dev agents
Rate Limiting: Prevent runaway agents from exhausting API quotas or generating unexpected costs through unchecked tool calls
Request Transformation: Normalize tool inputs and outputs to shield agents from upstream API changes that would otherwise cause schema misalignment
Audit Logging: Every tool call, input, and output logged with full context — the primary source for harness debugging and compliance reporting
Cost Controls: Per-session and per-agent token and dollar budgets — gate expensive operations behind human approval before they execute
Semantic Validation: Gateway-level output parsers that validate tool responses against expected schemas before returning to the agent

Key Products: Kong AI Gateway, Portkey, LiteLLM Proxy — all now ship MCP-specific governance features as of 2026.
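
The allowlisting and rate-limiting behavior can be sketched in a few lines. This is not the configuration model or API of Kong AI Gateway, Portkey, or LiteLLM Proxy; `ToolGateway`, its sliding-window policy, and the injectable `now` parameter are assumptions made to keep the example small and testable.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class ToolGateway:
    """Checks each tool call against an allowlist and a sliding-window rate limit."""

    def __init__(self, allowlist: dict[str, set[str]], max_calls: int, window_s: float):
        self.allowlist = allowlist            # agent name -> tools it may call
        self.max_calls = max_calls
        self.window_s = window_s
        self._calls = defaultdict(deque)      # agent name -> recent call timestamps

    def authorize(self, agent: str, tool: str,
                  now: Optional[float] = None) -> tuple[bool, str]:
        """Return (allowed, reason) for one planned tool call."""
        now = time.monotonic() if now is None else now
        if tool not in self.allowlist.get(agent, set()):
            return False, "tool not allowlisted"
        recent = self._calls[agent]
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()                  # drop timestamps outside the window
        if len(recent) >= self.max_calls:
            return False, "rate limit exceeded"
        recent.append(now)
        return True, "ok"
```

Unknown agents and unknown tools are both denied, which matches the least-privilege default argued for in this guide.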

Tool Schema Governance

Schema misalignment is the second most common harness failure mode. Preventing it requires discipline at the tool interface layer:

Schema-First Design: Define tool input/output schemas before implementation, because the agent sees the schema in its context, not the implementation
Versioned Tool APIs: Semantic versioning for MCP servers — agents declare the tool version they depend on and receive compatible responses
Backward Compatibility Gates: CI checks that verify new tool versions don't break existing agent behaviors — run against the full eval suite before deployment
Schema Migration Guides: When breaking changes are unavoidable, update CLAUDE.md/AGENTS.md with the new schema patterns before deploying the new tool version
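
A backward-compatibility gate can start as a diff over two versions of a tool's input schema. The sketch below assumes JSON-Schema-style dictionaries with `properties` and `required` keys and checks only three kinds of breaking change; `breaking_changes` is a hypothetical helper, and a real gate would also run the eval suite as described above.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List input-schema changes that would break callers written against `old`."""
    problems = []
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    for name, spec in old_props.items():
        if name not in new_props:
            problems.append(f"removed property: {name}")
        elif spec.get("type") != new_props[name].get("type"):
            problems.append(f"type changed: {name}")
    for name in new.get("required", []):
        if name not in old.get("required", []):
            problems.append(f"new required property: {name}")
    return problems
```

An empty list means the new version is safe for existing callers under these three checks; adding an optional property is not flagged.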

Tool Performance Optimization

Techniques for maximizing agent efficiency at the tool layer:

Symbol-Level Navigation: Index codebases by function, class, and call graph rather than file — agents navigate by pointer, reducing tokens by 77% and wall time by 76%
Tool Result Caching: Cache deterministic tool results (build outputs, static analysis, doc lookups) — serve cached results to multiple agent calls per session
Batch Tool Calls: Group related tool invocations where the MCP spec permits — reduce round-trip latency for multi-step reads
Lazy Loading: Don't pre-load all available tools into the agent's context — surface only the tools relevant to the current task phase
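
Tool result caching can be sketched as a per-session wrapper keyed on the tool name and its arguments. `ToolCache` is a hypothetical class, and the set of cacheable tools is supplied by the caller; deciding which tools are truly deterministic remains a judgment the harness owner has to make.

```python
import hashlib
import json
from typing import Any, Callable


class ToolCache:
    """Per-session cache for tools whose output depends only on their arguments."""

    def __init__(self, cacheable: set[str]):
        self.cacheable = cacheable
        self._store: dict[str, Any] = {}
        self.hits = 0

    def call(self, tool: str, args: dict, run: Callable[..., Any]) -> Any:
        """Run the tool, or return the stored result for an identical earlier call."""
        if tool not in self.cacheable:
            return run(**args)                 # side-effecting tools always execute
        key = hashlib.sha256(
            json.dumps([tool, args], sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = run(**args)
        return self._store[key]
```

The arguments must be JSON-serializable for the key to be computed; `sort_keys=True` makes the key independent of argument order.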

MCP Security Considerations

MCP's power creates commensurate security obligations. Key risks and mitigations:

Prompt Injection via Tools: Malicious content in tool responses can redirect agent behavior — validate and sanitize all MCP server outputs before they re-enter the context window
Skill Weaponization: Reviewed skill steps can still hide second-stage payload execution (MedusaLocker incident, Dec 2025) — runtime sandboxing and PreToolUse hooks are the primary defense
Hidden Unicode Instructions: Unicode tag characters in tool outputs can smuggle invisible instructions (Feb 2026 disclosure) — strip non-printable characters at the gateway layer
Credential Exposure: MCP servers that proxy authenticated APIs must never return raw credentials to the agent context — use gateway-level secret injection instead
Scope Creep: Agents with broad MCP access will use available tools in unexpected ways — allowlist by principle of least privilege, not by convenience
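
The hidden-Unicode mitigation can be illustrated with the standard `unicodedata` module. The sketch below drops characters in the Unicode general categories Cf (format, which includes the tag characters and zero-width characters), Cc (control), Co (private use), and Cn (unassigned), while keeping newlines, tabs, and carriage returns. The function name and the exact category set are assumptions. Removing every Cf character also removes the zero-width joiner used in some emoji and scripts, so a production filter may need a narrower rule.

```python
import unicodedata

STRIPPED_CATEGORIES = {"Cf", "Cc", "Co", "Cn"}
KEPT_WHITESPACE = "\n\t\r"


def strip_hidden_characters(text: str) -> str:
    """Remove invisible characters from tool output, keeping ordinary whitespace."""
    kept = []
    for ch in text:
        if ch in KEPT_WHITESPACE:
            kept.append(ch)                    # control characters we want to keep
        elif unicodedata.category(ch) not in STRIPPED_CATEGORIES:
            kept.append(ch)                    # visible text, including non-ASCII letters
    return "".join(kept)
```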

SECTION 6: SENSORS & HOOKS

Sensors: Enforcing the Ratchet Principle Through Post-Action Validation

Sensors are the harness primitives that close the control loop. A guide tells an agent what to do; a sensor verifies that it was done correctly. The ratchet principle requires that every class of agent failure produces a new sensor — a check that would have caught the failure and will catch any recurrence. Without sensors, the harness is one-directional: the agent acts, and humans discover problems after the fact. With sensors, the harness becomes self-correcting: the agent acts, the sensor validates, and failures are caught before they propagate.

Static Analysis Sensors

Deterministic checks that run without executing the generated code:

Linters: ESLint, Pylint, Rubocop, golangci-lint — catch syntax errors, style violations, and unsafe patterns immediately after generation
Type Checkers: mypy, TypeScript compiler, Pyright — verify type correctness in statically typed codebases before tests run
Dependency Scanners: Dependabot, OWASP Dependency-Check — flag newly introduced vulnerable dependencies in agent-written code
Secret Scanners: Trufflehog, GitLeaks — detect if the agent inadvertently writes credentials or API keys into the codebase
Architecture Linters: Dependency rules (no circular imports, no cross-service imports) — enforce architectural constraints the agent cannot know from local context alone
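
To show the shape of a secret-scanning sensor, here is a deliberately small sketch with two regular expressions. It is not a substitute for Trufflehog or GitLeaks, which ship far larger and better-tuned rule sets; the pattern names and the `scan_for_secrets` function are illustrative assumptions.

```python
import re

# Illustrative patterns only: an AWS-style access key ID and a PEM private key header.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan_for_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, name))
    return findings
```

Returning line numbers lets a hook hand the agent the exact lines to fix rather than a bare pass or fail.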

Dynamic Test Sensors

Sensors that execute code to verify runtime behavior:

Unit Tests: The agent must pass the existing test suite — a failing test is a sensor firing, not a suggestion to edit the test
Integration Tests: Verify that agent-written code integrates correctly with dependent services — essential for multi-service codebases
End-to-End Tests: Browser and API-level tests that verify user-visible behavior — the highest-confidence sensor, slowest to run
Mutation Testing: Verify that agent-written tests actually catch bugs by mutating the code and checking if tests fail
Chaos Sensors: Inject failures into agent-generated code paths to verify error handling is robust — especially critical for infrastructure code

LLM Eval Sensors (AI-as-Judge)

For open-ended quality checks where deterministic sensors are insufficient, a second model judges the first model's output:

Code Review Evals: A review agent reads the generated diff and flags potential issues — security, correctness, maintainability — before the PR is opened
Specification Adherence: Compare agent output against the task PRD — did the implementation match the requirement?
Regression Evals: Run the full eval suite against every harness change to confirm that guide updates don't degrade existing behavior
Output Schema Validation: Verify that structured agent outputs (JSON, YAML, API calls) conform to expected schemas — catch hallucinated fields before they reach downstream systems
Tone and Policy Evals: For customer-facing agents — verify outputs meet communication standards and policy requirements
Comparative Evals: A/B test harness changes against a held-out eval set before promoting to production
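
Output schema validation is the most mechanical item in this list and can be sketched without any model in the loop. The function below checks a JSON object against expected field names and Python types and reports unexpected (possibly hallucinated) fields. `validate_output` and its argument shapes are assumptions, and the `isinstance` check is intentionally simple (for example, it accepts `True` where an `int` is expected).

```python
import json


def validate_output(raw: str, required: dict[str, type],
                    optional: dict[str, type]) -> list[str]:
    """Check a JSON object against expected fields; return a list of problems."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    problems = []
    allowed = {**required, **optional}
    for name in required:
        if name not in data:
            problems.append(f"missing field: {name}")
    for name, value in data.items():
        if name not in allowed:
            problems.append(f"unexpected field: {name}")
        elif not isinstance(value, allowed[name]):
            problems.append(f"wrong type for {name}")
    return problems
```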

PreToolUse Hooks

Interceptors that fire before any tool call is executed:

Permission Enforcement: Block calls to disallowed tools or parameters before execution — the primary defense against scope creep
Input Sanitization: Strip prompt injection content from tool inputs before they reach external systems
Intent Logging: Record every planned tool call with full context — the audit trail for debugging unexpected agent behavior
Human Approval Triggers: For high-risk tools (production deployments, destructive operations, financial transactions), gate execution behind a human approval step
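
The decision a PreToolUse hook makes can be modeled as a three-way policy: allow, ask a human, or deny. The sketch below is an illustration only; the `TOOL_RISK` table, the tool names in it, and the deny-by-default treatment of unknown tools are assumptions, not the behavior of any specific agent runtime.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK_HUMAN = "ask_human"
    DENY = "deny"


# Hypothetical risk ratings; tools missing from the table are denied (least privilege).
TOOL_RISK = {"read_file": "low", "run_tests": "low", "deploy_production": "high"}


def pre_tool_use(tool: str, audit_log: list[dict]) -> Decision:
    """Decide whether a planned tool call may run and record the decision."""
    risk = TOOL_RISK.get(tool)
    if risk is None:
        decision = Decision.DENY
    elif risk == "high":
        decision = Decision.ASK_HUMAN
    else:
        decision = Decision.ALLOW
    audit_log.append({"tool": tool, "risk": risk, "decision": decision.value})
    return decision
```

Every decision is logged, including denials, so the audit trail shows what the agent attempted as well as what it did.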

PostToolUse Hooks

Interceptors that fire after tool execution completes:

Output Validation: Verify tool responses conform to expected schemas — catch schema misalignment before results enter the agent's context
Memory Updates: Write significant tool outcomes to MEMORY.md or the session journal — ensure durable state reflects the latest execution
Sensor Triggers: On file-write tool calls, automatically trigger linting and type-checking sensors — close the loop without agent intervention
Cost Tracking: Accumulate token and API cost counters after each tool call — trigger budget alerts before limits are exceeded

Production Hook Patterns (The 20-Recipe Cookbook)

Common production-ready hook configurations for coding agent harnesses:

Auto-Lint on Write: PostToolUse hook on file writes → runs linter → appends lint output to next agent turn as tool result
Test-on-Change: PostToolUse hook on file writes to /src → runs pytest/jest for affected modules → returns pass/fail to agent
Secret Scan Gate: PreToolUse hook on git commit → runs secret scanner → blocks commit if credentials detected, returns diff with flagged lines
Budget Enforcement: PostToolUse hook accumulating token cost → fires human approval request when 80% of session budget consumed
Context Refresh: Session start hook → loads MEMORY.md → prepends to system context before first agent turn
Audit Trail Write: PostToolUse hook → writes structured JSON log entry (tool, inputs, outputs, timestamp, session ID) to append-only audit store
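
The Audit Trail Write recipe can be sketched as a small function that appends one JSON line per tool call. This is a minimal illustration, assuming a local JSON Lines file as the audit store; the function name `write_audit_entry` and the file layout are hypothetical, and real tamper evidence would need storage-level controls that this sketch does not provide.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_audit_entry(log_path: Path, session_id: str, tool: str,
                      inputs: dict, outputs: dict) -> dict:
    """Append one JSON line describing a completed tool call."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "tool": tool,
        "inputs": inputs,
        "outputs": outputs,
    }
    # Mode "a" only appends; it never rewrites earlier entries.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


log_file = Path(tempfile.mkdtemp()) / "audit.jsonl"
write_audit_entry(log_file, "session-001", "write_file",
                  {"path": "src/app.py"}, {"status": "ok"})
write_audit_entry(log_file, "session-001", "run_tests", {}, {"passed": True})

entries = [json.loads(line)
           for line in log_file.read_text(encoding="utf-8").splitlines()]
print(len(entries), entries[0]["tool"])
```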

SECTION 7: MEMORY & MULTI-AGENT ORCHESTRATION

Durable Memory and Multi-Agent Coordination at Scale

State degradation — the loss of critical context across session boundaries and agent handoffs — is the third most common harness failure mode. Solving it requires two complementary capabilities: durable memory (persistent state within and across sessions) and orchestration (coordination protocols between multiple specialized agents). Together, they extend the effective context horizon of an agentic system from a single session to an indefinite run, and from a single agent to a coordinated team.

File-Based Memory

Lightweight, human-readable memory that integrates naturally with version control:

MEMORY.md: User preferences, project conventions, and lesson history — loaded at session start via a session hook, written by the agent when significant facts are established
Session Journals: Per-session markdown logs capturing decisions, errors, and resolutions — the primary continuity mechanism for multi-session tasks
Task Boards: Structured markdown or JSON files tracking task state (todo / in-progress / done / blocked) — the orchestration primitive for multi-agent workflows
Decision Logs: Append-only files recording architectural and design decisions with rationale — prevent agents from undoing past decisions they have no context for
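
To make the task board idea concrete, here is a small sketch of a JSON-style board with the four states named above. The field names (`id`, `title`, `state`, `owner`) and the helper functions are assumptions for illustration; the source does not prescribe a schema.

```python
import json

VALID_STATES = {"todo", "in-progress", "done", "blocked"}


def claim_task(board: dict, agent: str):
    """Move the first 'todo' task to 'in-progress' for this agent, or return None."""
    for task in board["tasks"]:
        if task["state"] == "todo":
            task["state"] = "in-progress"
            task["owner"] = agent
            return task
    return None


def set_state(board: dict, task_id: str, state: str) -> dict:
    """Update one task's state, rejecting states outside the allowed set."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state}")
    for task in board["tasks"]:
        if task["id"] == task_id:
            task["state"] = state
            return task
    raise KeyError(task_id)


board = {"tasks": [
    {"id": "T1", "title": "Scope login bug", "state": "todo", "owner": None},
    {"id": "T2", "title": "Add retry to client", "state": "todo", "owner": None},
]}
first = claim_task(board, "triage-agent")
second = claim_task(board, "coding-agent")
set_state(board, "T1", "done")
print(json.dumps(board, indent=2))
```

Because claiming a task changes its state immediately, a second agent reading the same board skips it, which is how the board prevents duplicate work. Concurrent writers would additionally need file locking or a merge strategy, which is not shown here.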

Vector Store Memory

Semantic retrieval for long-horizon recall beyond what fits in MEMORY.md:

Session Embedding: Encode each session's key outputs as vector embeddings — retrieve semantically similar past sessions when starting a new related task
Codebase Knowledge Base: Index architectural decisions, design docs, and ADRs — agents retrieve relevant prior decisions by semantic similarity to the current task
Error Memory: Store past failure patterns and their resolutions as embeddings — agents retrieve analogous past failures when encountering similar error signatures
Platforms: Pinecone, Chroma, Weaviate — all integrate with the MCP tool layer via purpose-built MCP servers

Multi-Agent Orchestration Patterns

Proven coordination patterns for production multi-agent harnesses:

Triage → Code → Review Pipeline: Orchestrator reads task board → triage agent scopes the problem → coding agent implements → review agent validates → human approval for merge. Each agent has its own guide and sensor set.
Shared Task Board: All agents in a pipeline read and write to a shared task board — the canonical coordination primitive that prevents duplicate work and deadlocks
Handoff Summaries: The agent handing off work writes a structured summary (not raw conversation history) to the task board, which prevents context window overflow at handoff points
A2A Delegation: Agent-to-agent delegation via the emerging A2A protocol — an orchestrator spawns a specialized sub-agent with a scoped task, receives a structured result, and resumes the main workflow
Worktree Isolation: Each agent in a parallel workflow operates in its own git worktree — prevents conflicts when multiple agents work on the same codebase simultaneously
Fan-Out / Fan-In: An orchestrator fans out independent subtasks to parallel agents, then fans in their results via a merge agent — critical for large-scale codebase tasks
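
The fan-out / fan-in pattern can be sketched with the standard library's thread pool. In this illustration `run_subagent` is a hypothetical stand-in for a real sub-agent call, and the merge step is a plain function rather than a merge agent.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(subtask: str) -> dict:
    """Stand-in for a real sub-agent call; returns a structured result."""
    return {"subtask": subtask, "status": "done", "summary": f"handled {subtask}"}


def fan_out_fan_in(subtasks: list[str], max_workers: int = 4) -> dict:
    # Fan-out: run independent subtasks in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        results = list(pool.map(run_subagent, subtasks))
    # Fan-in: merge the structured results into one report.
    return {
        "completed": [r["subtask"] for r in results if r["status"] == "done"],
        "summaries": [r["summary"] for r in results],
    }


report = fan_out_fan_in(["auth module", "billing module", "search module"])
print(report["completed"])
```

Returning structured results rather than transcripts keeps the merge step small, in line with the handoff summary guidance above.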

Managing Context Compaction

What compaction discards and how to protect critical state:

What Survives Compaction: Current task description, recent errors, file names of current edits — system prompt content (including CLAUDE.md) always survives
What Is Lost: Initial conversation instructions, intermediate reasoning steps, style rules injected as messages, early context about task scope
The Primary Defense: Move any rule that must survive compaction from conversation history to CLAUDE.md or AGENTS.md — the only guaranteed survivors
Compaction Checkpoints: Before long sessions, write a structured checkpoint to MEMORY.md so the agent can resume from a known state if compaction occurs
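
One possible shape for a compaction checkpoint is a marked section appended to MEMORY.md and read back at resume time. The `## Checkpoint` marker, the JSON payload, and both function names are assumptions made for this sketch, not a documented convention.

```python
import json
import tempfile
from pathlib import Path

CHECKPOINT_HEADER = "## Checkpoint"  # hypothetical section marker


def write_checkpoint(memory_path: Path, state: dict) -> None:
    """Append a structured checkpoint section to the memory file."""
    block = f"\n{CHECKPOINT_HEADER}\n{json.dumps(state, sort_keys=True)}\n"
    with memory_path.open("a", encoding="utf-8") as fh:
        fh.write(block)


def read_latest_checkpoint(memory_path: Path):
    """Return the most recent checkpoint, or None if none exists."""
    lines = memory_path.read_text(encoding="utf-8").splitlines()
    latest = None
    for i, line in enumerate(lines[:-1]):
        if line.strip() == CHECKPOINT_HEADER:
            latest = json.loads(lines[i + 1])  # payload is the line after the marker
    return latest


memory = Path(tempfile.mkdtemp()) / "MEMORY.md"
memory.write_text("# MEMORY\n- Prefer pytest over unittest\n", encoding="utf-8")
write_checkpoint(memory, {"task": "migrate auth", "step": 2, "files": ["src/auth.py"]})
write_checkpoint(memory, {"task": "migrate auth", "step": 3, "files": ["src/auth.py"]})

resumed = read_latest_checkpoint(memory)
print(resumed["step"])
```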

Session Lifecycle Management

Structured protocols for starting and ending agent sessions:

Session Start Hook: Load MEMORY.md → load task board → retrieve semantically similar past sessions → prepend structured context to system prompt
Mid-Session Checkpoint: After major milestones, write decision log entries and update task board — creates recovery points if the session is interrupted
Session End Hook: Write session journal → update MEMORY.md with new lessons → update task board with final state → run garbage collection workflow
Resume Protocol: New session loads journal from interrupted session → reconstructs state from task board → verifies environment integrity before proceeding

Garbage Collection: Managing Codebase Entropy

As agents generate code at scale, entropy accumulates: dead code, duplicate logic, conflicting style patterns, outdated guide rules. Garbage collection is a first-class harness concern:

Dead Code Sweeps: Periodic agent runs that identify and remove unreachable code, unused imports, and obsolete feature flags generated by prior agents
Guide Audits: Quarterly reviews of CLAUDE.md and AGENTS.md to prune rules that no longer apply — stale rules confuse agents and waste tokens
MEMORY.md Pruning: Remove outdated preferences and superseded conventions — the memory file should reflect current reality, not historical drift
Semantic Deduplication: Use embedding similarity to identify duplicate functions, classes, or modules written by different agents in parallel sessions — merge or remove redundancies
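
The semantic deduplication step can be illustrated with cosine similarity over precomputed embeddings. The vectors below are toy values standing in for real embeddings, and the function names and the 0.95 threshold are assumptions chosen for the example.

```python
import math
from itertools import combinations


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicates(embeddings: dict, threshold: float = 0.95) -> list:
    """Return name pairs whose embedding similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(embeddings), 2)
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]


# Toy vectors; a real run would embed function bodies with an embedding model.
embeddings = {
    "utils.parse_date": [0.9, 0.1, 0.0],
    "helpers.date_from_string": [0.88, 0.12, 0.01],
    "billing.compute_tax": [0.0, 0.2, 0.95],
}
print(find_duplicates(embeddings))
```

The pairwise comparison is quadratic in the number of symbols, so a large codebase would likely need an approximate nearest-neighbor index instead.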

HARNESS SKILLS & REUSABILITY

Skills: Reusable Harness Primitives for Organizational Scale

A skill is a self-contained, version-controlled harness component — a combination of guide snippets, sensor configurations, hook scripts, and MCP tool definitions that packages a complete capability. Skills are to harness engineering what libraries are to software engineering: reusable, tested units that teams share rather than rebuild. When one team solves the "how to safely run database migrations with an agent" problem, they encode the solution as a skill and other teams adopt it directly.

Coding Skills

Reusable harness components for code-generation workflows:

Test-First Skill: Guide snippet requiring TDD + test runner sensor + pre-commit hook — packages the complete test-first workflow as a single adoptable unit
PR Creation Skill: Hook scripts for creating well-formed PRs (title, description, linked issue, assignee, labels) from agent output — consistent PR hygiene across all agents
Refactoring Skill: Guide rules for safe refactoring + semantic diff sensor + coverage regression sensor — prevents agents from breaking behavior while refactoring
Documentation Skill: Guide rules for docstring standards + completeness sensor + spelling/grammar linter — maintains documentation quality as agents generate code

Security Skills

Hardened harness components for security-critical workflows:

Secret-Free Skill: PreToolUse hook (secret scanner on all file writes) + guide rules (no hardcoded secrets) + audit log sensor — prevents credential exposure at the harness layer
Dependency Review Skill: MCP tool for dependency scanner + guide rules (no unvetted new dependencies) + vulnerability threshold sensor — catches supply chain risks before they land
OWASP Skill: Guide rules for the OWASP Top 10 + SAST sensor (Semgrep) + output validation hook — systematic injection of security standards into every coding session
Prompt Injection Defense Skill: Input sanitization hooks + guide rules for handling untrusted data + test cases for injection payloads — protects agents that process external content

SKILL.md: The Skill Manifest

Each skill is described by a SKILL.md manifest — the self-documenting contract that tells an agent or orchestrator what the skill does and how to use it:

Trigger Conditions: When should this skill be loaded? (e.g., "any task involving database schema changes") — enables automatic skill selection by orchestrators
Guide Snippets: The specific guide rules this skill injects — appended to CLAUDE.md or AGENTS.md for the duration of the skill's scope
Sensor Definitions: Which sensors this skill activates, with configuration (linter rulesets, eval prompts, coverage thresholds)

MCP Tool Requirements: Which MCP servers this skill requires access to — allows permission verification before skill activation
Hook Scripts: Pre/PostToolUse hook implementations bundled with the skill — installed into the hook registry when the skill is activated
Incompatibilities: Skills that conflict with this one — prevents simultaneous activation of contradictory guide rules
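
As a sketch of how an orchestrator might use the manifest fields above, the example below models a subset of a SKILL.md manifest and checks MCP requirements and incompatibilities before activation. The `SkillManifest` fields, the `prototype-fast` skill, and the `activation_problems` helper are hypothetical; the source does not define a manifest schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillManifest:
    name: str
    trigger_keywords: tuple = ()
    required_mcp_servers: tuple = ()
    incompatible_with: tuple = ()


def activation_problems(skill, active_skills, granted_servers) -> list:
    """Return reasons the skill cannot be activated; an empty list means OK."""
    problems = []
    # Permission verification: every required MCP server must be granted.
    missing = set(skill.required_mcp_servers) - set(granted_servers)
    if missing:
        problems.append(f"missing MCP servers: {sorted(missing)}")
    # Conflict check in both directions, since either side may declare it.
    for other in active_skills:
        if other.name in skill.incompatible_with or skill.name in other.incompatible_with:
            problems.append(f"conflicts with {other.name}")
    return problems


test_first = SkillManifest("test-first", ("feature",), ("filesystem",))
prototype = SkillManifest("prototype-fast", incompatible_with=("test-first",))

print(activation_problems(test_first, [prototype], {"filesystem"}))
print(activation_problems(test_first, [], set()))
```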

Skill Composition Patterns

How skills combine into complete workflow harnesses:

Feature Development Stack: Test-First Skill + Documentation Skill + PR Creation Skill + OWASP Skill — full-cycle feature development with quality and security baked in
Infrastructure Change Stack: Secret-Free Skill + Dependency Review Skill + Chaos Sensor Skill + Human Approval Gate Skill — governed infrastructure changes with rollback
Refactoring Campaign Stack: Refactoring Skill + Coverage Sensor Skill + Semantic Diff Skill — large-scale refactoring with behavioral regression protection
Regulated Industry Stack: OWASP Skill + Audit Log Skill + Compliance Eval Skill + Human Review Skill — regulated workflow compliance built into every agent action

Skill Distribution and Discovery

Organizational patterns for scaling skill adoption:

Internal Skill Registry: A shared repository of org-specific skills — teams discover, adopt, and contribute skills via standard PR workflow
Public Skill Libraries: Awesome Harness Engineering and similar curated lists — community-maintained skills for common workflows available for immediate adoption
Automatic Skill Suggestion: Orchestrators analyze task context and suggest relevant skills — similar to package manager recommendations
Skill Versioning: Skills are versioned independently — pin the skill version in the harness config to prevent unexpected behavior changes from upstream updates

Skill Quality Standards

Requirements for skills published to shared registries:

Independently Testable: Every skill must include a test harness — a set of input tasks and expected sensor outcomes that verify the skill works as documented
Minimal Guide Footprint: Skills should inject only the guide rules necessary for their function — each rule must have a documented failure case that motivated it
Documented Security Model: Skills that require elevated tool permissions must document the security rationale and the safeguards that justify the access
Ratchet History: A changelog that shows which guide rules and sensors were added in response to which failure classes — provenance for every harness decision
Conflict Declaration: Explicitly declare incompatible skills — prevents silent guide contradictions when multiple skills are composed

SECTION 8: PERMISSIONS, GUARDRAILS & OBSERVABILITY

Governing Agent Behavior: From Trust Boundaries to Full Observability

As AI agents gain broader tool access and longer autonomous run-times, the governance layer of the harness becomes critical infrastructure. Permissions define what the agent can and cannot do; guardrails enforce those boundaries at runtime; observability makes all agent behavior inspectable after the fact. Together they form the trust framework that allows organizations to extend agent autonomy incrementally, expanding scope only as confidence — backed by data — justifies it.

Permission Architecture

Layered permission model for production agent harnesses:

Tool Allowlists: Enumerate exactly which MCP tools the agent may invoke — default-deny, with explicit grants. An agent with file-write access should not automatically have shell-exec access.
Parameter Constraints: Restrict tool call parameters (e.g., file writes to /src only, shell exec to non-destructive commands) — allowlisting at the argument level, not just the tool level
Tool Risk Ratings: Classify tools by risk tier (read-only / write / destructive / external) — higher-tier tools require additional approval or logging
Execution Modes: Sandboxed (no external network or filesystem), restricted (local only), and full (production access) — agents are promoted between modes based on demonstrated reliability
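
Parameter constraints are easy to get subtly wrong, so here is a small sketch of the "file writes to /src only" rule. It assumes POSIX-style absolute paths and a hypothetical `is_write_allowed` helper; symlink resolution is out of scope for this illustration.

```python
from pathlib import PurePosixPath

WRITE_ROOT = PurePosixPath("/src")  # from the example above: writes to /src only


def is_write_allowed(path: str) -> bool:
    """Allow writes only to absolute paths that stay inside WRITE_ROOT."""
    p = PurePosixPath(path)
    # Reject relative paths and any parent-directory traversal.
    if not p.is_absolute() or ".." in p.parts:
        return False
    # Compare path components, not string prefixes, so /srcfoo is rejected.
    return WRITE_ROOT in p.parents


for candidate in ["/src/app/main.py", "/etc/passwd",
                  "/src/../etc/passwd", "/srcfoo/x.py", "src/app.py"]:
    print(candidate, is_write_allowed(candidate))
```

A naive `path.startswith("/src")` check would accept both `/srcfoo/x.py` and `/src/../etc/passwd`, which is why the sketch compares path components instead.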

Human-in-the-Loop Guardrails

Escalation patterns that gate high-risk actions behind human review:

Approval Gates: Production deployments, database migrations, billing operations, and external communications require explicit human approval before the agent proceeds
Confidence Thresholds: If the agent's self-assessed confidence in a proposed action (expressed in its output) falls below a threshold, automatically route the action to human review
Novelty Detection: Flag actions the agent has never taken before in this harness — first execution of any new tool combination requires human sign-off
Budget Gates: When session cost exceeds 80% of budget, pause and surface a summary of remaining planned actions for human approval before continuing

Harness Observability: Making Agent Behavior Inspectable

Full-stack observability for production agent harnesses — the foundation for debugging, auditing, and continuously improving harness performance:

Distributed Tracing: Every agent action linked by trace ID — a single developer query can be traced through orchestrator → sub-agents → tool calls → sensor results as a single span tree
LLM Observability Platforms: LangSmith, Arize, Helicone — capture inputs, outputs, latency, token cost, and eval scores for every model call, enabling regression detection as the harness evolves
Audit Logs: Append-only, tamper-evident logs of every tool call with full context — the compliance record and primary debugging surface for unexpected agent behavior

Cost Dashboards: Per-session, per-agent, and per-tool token and dollar cost tracking — the primary mechanism for catching runaway agents before they generate unexpected bills
Sensor Pass Rate Metrics: Track the pass rate of each sensor (lint, test, eval) over time — degrading pass rates signal a harness that is drifting from the codebase it governs
Session Health Monitoring: Alert on unusually long sessions, high tool-call rates, or repeated failures — patterns that indicate a stuck or confused agent before human review is needed

Incremental Trust Expansion Model

A risk-managed approach to growing agent autonomy based on observed reliability:

Stage 1 – Sandboxed: Agent operates in isolated environment, no external access, all outputs reviewed by human before action is taken
Stage 2 – Restricted Local: Agent can write to local filesystem and run tests; no network access; PR created but not auto-merged
Stage 3 – Supervised Production: Agent can open PRs and trigger CI; human reviews before merge; no direct production access
Stage 4 – Autonomous with Gates: Agent merges to staging automatically; human approval required only for production deployment
Stage 5 – Full Autonomy: Agent deploys to production with automated rollback as the primary safety net — reserved for well-understood, fully instrumented workflows

Governance Anti-Patterns

Governance failures that expose organizations to significant risk:

Broad Tool Grants: Granting all available MCP tools to every agent — violations are inevitable when agents have access to destructive tools they don't need
No Audit Trail: Agents operating without logging — when something goes wrong, there is no record to reconstruct what happened
Skipping Sandboxed Stages: Moving directly to production access without first establishing baseline sensor pass rates in sandboxed environments
Manual Governance: Relying on humans to manually review all agent actions instead of encoding governance rules as machine-checkable sensors and PreToolUse hooks
No Rollback Path: Deploying agent-written code without automated rollback capability — when the agent writes a regression, recovery must be instant, not manual

State of Engineering Excellence 2026: Key Findings

Survey of 700 engineering leaders across US, UK, India, France, and Germany (Harness, May 2026):

AI Adoption: AI coding tools are now the default in engineering organizations
Blind Spot Cost: 94% say costs accumulate in blind spots metrics don't capture
Hidden Costs: Tech debt, validation time, and burnout missing from productivity metrics
Productivity Gains: Self-reported gains overwhelmingly positive but measurement frameworks lag

SECTION 5: HARNESS IMPLEMENTATION GUIDE

Building a Production-Ready Agent Harness from Scratch

This guide provides a practical, phased approach to implementing a harness engineering system. The goal is to move through six phases, from minimal viable harness to full autonomous deployment, with each phase de-risked by measurable sensor pass rates. Do not skip phases. The incremental trust expansion model exists because teams that jump to broad autonomy without first establishing baseline sensor coverage consistently encounter catastrophic failures.

Phase 1 – Minimal Viable Harness (Week 1)

Start with the three lowest-effort, highest-ROI harness components:

Create CLAUDE.md / AGENTS.md: Write 20–40 lines covering the top 5 code quality rules, the primary architectural constraint, and any security non-negotiables. Version-control it immediately.
Wire a Static Linter Sensor: Configure the existing linter (ESLint, Pylint) as a post-generation check. The agent must pass linting before any output is accepted.
Run Existing Tests as Sensors: Execute the existing test suite after every agent code generation. A failing test is feedback, not a reason to remove the test.
Outcome: Sensors catch ~40% of agent errors before human review; the guide immediately reduces the most common rule violations.

Phase 2 – MCP Tool Integration (Weeks 2–3)

Wire the first MCP servers and establish tool governance:

Local MCP First: Start with local filesystem and bash MCP servers before any remote/external integrations — establish the governance baseline in a contained environment
Write a Tool Allowlist: Enumerate permitted tools in the harness config. Default-deny everything not explicitly listed.
Install PreToolUse Hooks: Log every tool call with inputs. Add permission checks for any tool with write or exec access.
Add First Remote MCP: Wire the issue tracker (Linear/Jira) so the agent can read ticket context. Read-only first; write access comes in Phase 4 after sensors are stable.

Phase 3 – Memory & State (Week 4)

Establish durable state and prevent context compaction failures:

Create MEMORY.md: Seed with user preferences, project conventions, and any lessons from the first two weeks of agent use
Add Session Start Hook: Load MEMORY.md into system context at session start — verify it is injected before the first agent turn
Implement Task Board: Create a simple markdown task board for tracking agent work across sessions. The agent reads and updates it as part of each session.
Write Session Journal Hook: End every session by writing a structured journal entry (decisions made, files changed, tests passed/failed, next steps)

Phase 4 – Orchestration (Weeks 5–6)

Introduce multi-agent coordination for parallelism and specialization:

Define Agent Roles: Triage agent (reads ticket, scopes task), implementation agent (writes code), review agent (checks output) — each with its own scoped guide file
Establish Handoff Protocol: The agent handing off work writes a structured summary to the task board; never pass raw conversation history between agents
Configure Worktree Isolation: Parallel agents operate in separate git worktrees — prevents conflicts on shared files
Add Orchestrator Logic: A lightweight orchestrator reads the task board, assigns agents to tasks, and monitors for stuck or blocked states

Phase 5 – Observability (Week 7)

Add full-stack observability before expanding agent autonomy:

Install LLM Observability Platform: LangSmith, Arize, or Helicone — wire it to capture every model call with inputs, outputs, latency, and cost
Build Sensor Pass Rate Dashboard: Track lint, test, and eval pass rates per agent role over time — degrading rates signal drift before it becomes a crisis
Configure Cost Alerts: Set session and daily budget thresholds — fire alerts when 80% consumed, hard stops at 100%
Establish Audit Log Retention: Append-only audit store with minimum 90-day retention — required for compliance and post-incident reconstruction
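
The cost alert rule in this phase (alert at 80% consumed, hard stop at 100%) can be sketched as a small tracker. The `BudgetTracker` class and its return values are hypothetical; a real harness would wire the "alert" and "stop" outcomes to its own notification and termination paths.

```python
class BudgetTracker:
    """Accumulates spend and reports when alert or stop thresholds are crossed."""

    def __init__(self, budget_usd: float, alert_fraction: float = 0.8):
        self.budget_usd = budget_usd
        self.alert_fraction = alert_fraction
        self.spent_usd = 0.0
        self.alerted = False

    def record(self, cost_usd: float) -> str:
        """Add a cost and return 'ok', 'alert', or 'stop'."""
        self.spent_usd += cost_usd
        # Hard stop once the full budget is consumed.
        if self.spent_usd >= self.budget_usd:
            return "stop"
        # Fire the alert once, the first time the alert fraction is crossed.
        if not self.alerted and self.spent_usd >= self.alert_fraction * self.budget_usd:
            self.alerted = True
            return "alert"
        return "ok"


tracker = BudgetTracker(budget_usd=10.0)
for cost in [5.0, 3.5, 1.0, 1.0]:
    print(cost, tracker.record(cost))
```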

Phase 6 – Ratchet Operations (Ongoing)

Operate the harness as a continuously improving system:

Weekly Ratchet Review: Review sensor failures from the past week — each new failure class gets a new guide rule or sensor before the next week begins
Monthly Guide Audits: Review CLAUDE.md and AGENTS.md — remove stale rules, clarify ambiguous ones, add examples for rules that agents consistently misinterpret
Quarterly Garbage Collection: Run dead-code sweeps, prune MEMORY.md, deduplicate agent-generated code with semantic similarity search
Incremental Trust Expansion: Expand agent permissions only when sensor pass rates justify it — use the five-stage trust model to gate autonomy increases

Harness Implementation Best Practices

Version Control Everything: CLAUDE.md, AGENTS.md, MEMORY.md, hook scripts, MCP server configs — treat the entire harness as code
Sensor First, Guide Second: When you discover a failure, add a sensor before adding a guide rule — sensors are verifiable; guide rules are hopes
Keep Guides Short: A 300-line CLAUDE.md is a code smell. If the guide is longer than your shortest module, refactor it into Trellis-style progressive specs
Test the Harness: Maintain a suite of "harness tests" — tasks where the expected output is known and sensors should fire if the agent deviates

Never Trust Compaction: Assume everything in the conversation history will eventually be compacted. If it must survive, it belongs in CLAUDE.md.
Instrument Before Expanding: Do not grant new tool access before observability is in place — you need to see what the agent does with each new capability
Document Permission Decisions: Every allowlisted tool should have a documented reason in the harness README — "why does this agent have shell exec?" must have an answer
Rollback is a Feature: Every autonomous deployment must have a tested rollback path — automated where possible, clearly documented where not

Common Harness Implementation Pitfalls

Skipping the Sandbox Stage: Giving agents production access before sensor pass rates are established — the most common source of catastrophic early failures
Guide-Only Harnesses: Relying entirely on CLAUDE.md without sensors — guides express intent; sensors verify execution. You need both.
No Memory Architecture: Treating every session as stateless — agents rediscover solved problems, repeat past mistakes, and lose multi-day task context
Ignoring Entropy: Not running garbage collection — agent-generated codebases accumulate dead code, duplicate logic, and conflicting patterns that compound over time
Manual Governance at Scale: Expecting humans to review every agent action — harness governance must be machine-checkable or it doesn't scale

Step 1: Failure Detection & Ratchet Trigger

The harness detects a repeating class of agent failure:

Sensor Fired: PostToolUse hook on file write detects agent wrote hardcoded API credentials into a config file
Failure Class: Secret exposure — not covered by any guide rule or sensor (gap identified)
Scope: Reproducible across 3 sessions in the past week (audit log evidence)
Priority: P1 harness bug — triggers immediate ratchet sprint

Step 2: Root Cause Analysis via Observability

Harness observability data reveals the exact failure pattern:

Audit Log Review: Three sessions, same pattern — agent reading example configs that contained real credentials as illustrations
Guide Gap: CLAUDE.md had no secret-handling rule; agent followed the pattern it observed in the codebase
Sensor Gap: No secret scanner wired to the file-write PostToolUse hook
Context Source: Example files with real credentials indexed by the symbol MCP server and injected into agent context

Step 3: Three-Part Ratchet Fix

Three coordinated harness fixes applied atomically:

Guide Rule Added: "Never write literal credentials. Use env var references only: process.env.API_KEY, not the key value." Added to CLAUDE.md under Security section.
Sensor Wired: TruffleHog added as a PostToolUse hook on all file writes; on secret detection it fails the step and returns the flagged lines as agent feedback
Index Exclusion: Symbol indexer configured to exclude *.example files and credential-pattern files from the context index
Regression Test: New test in ratchet regression suite — expected outcome: sensor fires and agent rewrites using env var
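
To show the shape of the sensor and its regression test, the sketch below uses two regular expressions as a stand-in for a real scanner. It is not how TruffleHog works internally; the patterns, the `scan_for_secrets` function, and the sample strings are illustrative assumptions only.

```python
import re

# Illustrative patterns only; a real scanner covers far more credential types.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"),
]


def scan_for_secrets(text: str) -> list:
    """Return (line_number, line) pairs that match a secret pattern."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            flagged.append((number, line.strip()))
    return flagged


# Made-up key value for the failing case, env var reference for the passing case.
bad = 'const API_KEY = "abcd1234efgh5678ijkl";\n'
good = "const API_KEY = process.env.API_KEY;\n"

print(scan_for_secrets(bad))
print(scan_for_secrets(good))
```

The two sample strings mirror the regression test described above: the literal value should make the sensor fire, and the rewrite with an env var reference should pass.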

Step 4: Cross-Agent Verification

Ratchet fix verified across all agent roles before deployment:

Implementation Agent: Reproduction scenario run — scanner fires, agent receives feedback, rewrites with env var reference
Review Agent: Eval skill detects hardcoded secrets in submitted diffs — flags before PR is opened
Orchestrator: Task board updated — ratchet item resolved, regression test added to weekly suite
Harness Changelog: New rule, sensor, and exclusion documented with the failure class that motivated each

Step 5: Deployment & 30-Day Monitoring

Atomic Harness PR: CLAUDE.md, hook script, and index config merged as single harness update
Secret Scanner Fire Rate: Tracked daily; new fires are investigated immediately, and the rate approaches zero
Failure Class Recurrence: Zero recurrences over 30 days confirms ratchet effectiveness
Skill Published: Secret-Free skill (rule + sensor + exclusion) published to org skill registry

Step 6: Data-Backed Trust Expansion

30-Day Sensor Pass Rate: 96.3% — above the 95% threshold for Stage 3 to Stage 4 transition
Zero Recurring Failure Classes: All known failures have sensors; none have recurred in 30 days
Permission Expansion: Agent promoted from "PR creation only" to "auto-merge to staging" for low-risk service tier
Observability Confirmed: Full audit log coverage verified — every staging deployment traceable to originating session
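
The promotion decision in this step reduces to a conjunction of gates, sketched below. The function name and parameters are hypothetical, and treating a rate exactly at the threshold as passing is an assumption, since the text only says the observed rate was above 95%.

```python
def can_promote(pass_rate: float, recurring_failure_classes: int,
                audit_coverage_complete: bool, threshold: float = 0.95) -> bool:
    """Promote only when every gate from the case study holds."""
    return (
        pass_rate >= threshold              # 30-day sensor pass rate gate
        and recurring_failure_classes == 0  # no failure class has recurred
        and audit_coverage_complete         # every deployment is traceable
    )


print(can_promote(0.963, 0, True))
print(can_promote(0.963, 1, True))
```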

SECTION 9: HARNESS ANALYTICS & MONITORING

Measuring Harness Performance: From Sensor Pass Rates to Business Outcomes

Harness analytics transforms agent operations from intuition-driven to data-driven. The State of Engineering Excellence 2026 found that 94% of surveyed engineering leaders say costs accumulate in blind spots their current metrics cannot capture. Harness-native observability closes that gap: every tool call is logged, every sensor result recorded, and every session's cost and quality score tracked. The result is a measurement system that makes agent productivity legible, not just to developers but to engineering leadership.

Harness Quality Metrics

Primary indicators of harness health and effectiveness:

Sensor Pass Rate: Percentage of agent outputs passing all sensors on first attempt. Target: >90% for mature harnesses.
Pass Rate by Sensor Type: Decompose overall rate by lint/test/eval — reveals which guide rules need strengthening
Ratchet Velocity: New sensors and guide rules added per week — measures speed of response to failure classes
Failure Class Recurrence Rate: Percentage of failures that have occurred before. Target: zero recurring classes.
Context Drift Events: Sessions where agents deviated from guides due to compaction; each event points to rules that should migrate from conversation history to CLAUDE.md
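
The first two quality metrics can be computed directly from recorded sensor results. The record layout below (one dict of sensor name to pass/fail per agent output) is an assumption made for this sketch.

```python
from collections import Counter


def first_attempt_pass_rate(outputs: list) -> float:
    """Share of outputs where every sensor passed on the first attempt."""
    if not outputs:
        return 0.0
    passed = sum(1 for o in outputs if all(o["sensors"].values()))
    return passed / len(outputs)


def pass_rate_by_sensor(outputs: list) -> dict:
    """Pass rate for each sensor type across all outputs."""
    totals, passes = Counter(), Counter()
    for o in outputs:
        for sensor, ok in o["sensors"].items():
            totals[sensor] += 1
            passes[sensor] += int(ok)
    return {sensor: passes[sensor] / totals[sensor] for sensor in totals}


outputs = [
    {"sensors": {"lint": True, "test": True, "eval": True}},
    {"sensors": {"lint": True, "test": False, "eval": True}},
    {"sensors": {"lint": True, "test": True, "eval": True}},
    {"sensors": {"lint": False, "test": True, "eval": True}},
]
print(first_attempt_pass_rate(outputs))
print(pass_rate_by_sensor(outputs))
```

Note that the overall rate (0.5 here) is lower than any single sensor's rate, because an output must pass all sensors at once; decomposing by sensor shows where the failures come from.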

Cost & Efficiency Metrics

Token and dollar economics at session and fleet scale:

Cost per Task: Total token cost divided by completed tasks — fundamental unit economics
Token Efficiency Ratio: Tokens consumed per accepted line of code; symbol-indexed navigation achieves a 77% reduction
Tool Call Distribution: The highest-cost tool patterns, which are candidates for caching or optimization
Budget Utilization Rate: Average session cost as a percentage of budget; sustained rates above 70% signal a workflow optimization opportunity
Sensor Cost Overhead: Eval sensor token cost relative to generation cost — ensures the sensor layer earns its keep
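
Three of these metrics are simple ratios, shown here as a worked example. All function names and input figures are made up for illustration and are not survey or benchmark data.

```python
def cost_per_task(total_cost_usd: float, completed_tasks: int) -> float:
    """Total cost divided by completed tasks; infinite if nothing completed."""
    return total_cost_usd / completed_tasks if completed_tasks else float("inf")


def budget_utilization(session_costs: list, session_budget_usd: float) -> float:
    """Average session cost as a fraction of the per-session budget."""
    return (sum(session_costs) / len(session_costs)) / session_budget_usd


def sensor_cost_overhead(eval_tokens: int, generation_tokens: int) -> float:
    """Eval sensor tokens relative to generation tokens."""
    return eval_tokens / generation_tokens


# Illustrative inputs only.
print(cost_per_task(42.0, 12))
print(budget_utilization([6.0, 8.0, 7.0], 10.0))
print(sensor_cost_overhead(20_000, 100_000))
```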

Delivery & Velocity Metrics

DORA-equivalent metrics adapted for harness-governed agents:

Agent Lead Time: Time from task assignment to all sensors passing; the agent equivalent of DORA Lead Time for Changes
Deployment Frequency: How often harness agents successfully ship; daily or more for mature harnesses
Change Failure Rate: Share of agent-shipped changes requiring rollback; sensor coverage drives this toward zero
Mean Time to Ratchet: Time from failure detection to deployment of the sensor or guide fix; measures the speed of harness improvement cycles
Session Success Rate: Share of sessions completing without human intervention; measures overall harness maturity

Agent Health Signals

Signals indicating individual sessions are operating correctly:

Session Duration Distribution: Use the P99 duration as the alert threshold; outliers signal stuck or looping agents
Tool Call Rate Anomalies: A rate far above the historical baseline likely indicates a confused agent; alert before the budget is exhausted
Compaction Event Frequency: Multiple compactions per session indicate context window pressure; the task may need decomposition
Human Escalation Rate: Should decrease over time as the harness matures
Rollback Rate: Automated rollback trigger frequency — the ultimate quality sensor
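
The P99 duration alert can be sketched with the standard library's quantile function. The session IDs and durations are invented for the example, and using the "inclusive" quantile method is an assumption; other percentile definitions give slightly different thresholds.

```python
import statistics


def p99_threshold(durations: list) -> float:
    """99th percentile of historical session durations."""
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    return statistics.quantiles(durations, n=100, method="inclusive")[-1]


def flag_outliers(history: list, current: dict) -> list:
    """Return IDs of running sessions whose duration exceeds the P99 threshold."""
    threshold = p99_threshold(history)
    return [sid for sid, minutes in current.items() if minutes > threshold]


history = list(range(1, 101))          # past session durations in minutes
current = {"s-1": 42, "s-2": 250}      # sessions currently running
print(flag_outliers(history, current))
```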

Harness Observability Stack

LLM Tracing: LangSmith, Arize Phoenix, Helicone
Tool Audit Logs: Kong AI Gateway, Portkey (append-only)
Sensor Pass Rate Dashboards: Per sensor, per agent role, over time
Budget Enforcement Hooks: Real-time spend + anomaly alerts

Harness Continuous Optimization Cycle

Weekly Sensor Failure Review: Which sensors fired most — each high-frequency sensor is a guide rule candidate
Monthly Cost Optimization Pass: Top 3 high-cost tool patterns — optimize or add caching
Quarterly Harness Audit: Prune stale rules, update thresholds, retire obsolete skills
Failure Class Elimination Sprints: Recurring failure class = P1 harness bug — root cause, ratchet fix, regression eval
Trust Expansion Reviews: Data-backed permission expansion decisions, based only on sensor pass rate trends

HARNESS TESTING & EVALUATION

Harness Eval Metrics

Guide Adherence Rate: Did the agent follow CLAUDE.md rules? (Measured by sensor violations attributable to guide-covered cases)
Sensor Coverage: What percentage of known failure classes have a sensor? Target: 100%.
Ratchet Effectiveness: Did the new rule/sensor eliminate the targeted failure class? (Binary)

Harness Regression Tests

Needle-in-Haystack Tests: Insert a guide rule violation scenario early — verify the sensor catches it 50 turns later
Compaction Survival Tests: Force a compaction event mid-session — verify CLAUDE.md rules govern post-compaction behavior
Ratchet Regression Suite: Historical failure scenarios run against every harness update

SECTION 8: ADVANCED HARNESS APPLICATIONS & STRATEGIC IMPACT

Real-World Harness Engineering Applications and Business Impact

Harness engineering enables organizations to deploy AI agents safely, at scale, with measurable quality — transforming software delivery economics. This section explores the most advanced production harness deployments, domain-specific applications, and the strategic business impact of well-engineered agent harnesses across engineering organizations.

Full-Codebase Autonomous Delivery

The frontier of harness engineering is zero-human-code production systems:

OpenAI Frontier Product Exploration (Feb 2026): 1M+ LOC, 1,500+ PRs, zero manually written code — engineers spent 100% of time on harness, not implementation
1B Tokens/Day Harnesses: "Extreme harness engineering" deployments running at billion-token-per-day scale — observability and cost governance are existential at this scale
Deployment Acceleration: Harness-governed agents at United Airlines and Citibank accelerate deployments by up to 75% — speed from automation, safety from sensors
Infrastructure Cost Reduction: Harness-managed infrastructure agents reduce costs by up to 60% — optimization that would be impractical for human operators to perform continuously

Regulated Industry Deployments

Harness engineering as the compliance layer for AI in regulated contexts:

Financial Services: Audit log hooks + compliance eval skills + human approval gates for every external-facing change — Citibank deploys with harness-enforced regulatory compliance
Healthcare: HIPAA-compliant tool gateways + PHI-aware guide rules + mandatory human review sensors — agents assist clinical documentation while harness prevents data exposure
Aerospace & Defense: DO-178C-compatible sensor suites + traceability hooks (every line of code linked to its originating requirement) + mandatory dual-agent review
Government: FedRAMP-compliant MCP server configurations + air-gapped local-only harnesses + full audit trails — harness engineering enables AI adoption in zero-internet environments

Strategic Business Impact of Harness Engineering

Delivery Velocity & Lead Time

Up to 90% Lead Time Reduction: Harness-automated CI/CD pipelines collapse the time from merge to production
24/7 Development: Agent harnesses produce code continuously — no waiting for business hours or human availability
Instant Ratchet Application: Lessons encoded in guides and sensors are applied to every subsequent agent task immediately — organizational learning at machine speed
Parallel Delivery: Multi-agent orchestration ships features across the codebase simultaneously — not serialized by human bandwidth

Quality & Risk Reduction

Sensor-Enforced Standards: Coding standards applied uniformly by machine — no style drift across teams or time zones
Earlier Defect Detection: Sensors catch defects at generation time — before code is committed, reviewed, or deployed
Ratchet-Driven Quality: Each new failure class permanently raises the floor — quality improves monotonically over time
Knowledge Preservation: Harness guides encode institutional knowledge that survives team turnover — the CLAUDE.md doesn't quit when the senior engineer does

Cost Optimization

77% Token Reduction: Symbol-indexed MCP navigation vs. full-file context loading — direct cost reduction for every agent session
60% Infrastructure Cost Reduction: Harness-managed optimization agents run continuously — humans can't match the frequency or breadth of autonomous optimization
Defect Prevention Savings: Sensors catch defects before production — prevention is orders of magnitude cheaper than remediation
Scalability Economics: Harness-governed agents scale without proportional cost increase — the marginal cost of the nth agent task approaches zero

Organizational Learning

Ratchet as Organizational Memory: Every agent failure that is ratcheted produces a permanent improvement — the harness accumulates organizational wisdom
Skill Registry Growth: Teams contribute skills to internal registries — knowledge compounds across the engineering organization, not just within teams
Measurement Clarity: Sensor pass rates, cost-per-task, and lead time are objective harness metrics — grounding the productivity conversation in data
Human Skill Shift: Engineers move from writing code to designing harnesses, evaluating sensor effectiveness, and expanding agent trust — higher-leverage work

Harness Engineering ROI Metrics

Lead Time Reduction: Up to 90% (Harness customer data, 2026)
Infrastructure Cost Savings: Up to 60% via harness-governed optimization agents
Token Efficiency Gain: 77% reduction via symbol-indexed navigation
Failure Attribution: 65% of failures are harness defects (fixable without model changes)

Keys to Successful Harness Adoption

Start with Sensors, Not Autonomy: Establish sensor coverage before expanding agent permissions — measure before trusting
Treat the Harness as Product: Assign ownership, maintain a backlog, run retrospectives — the harness is the primary engineering artifact now
Ratchet Discipline: Every failure that isn't ratcheted is a failure the agent will repeat — make the ratchet review a non-negotiable weekly ritual
Invest in Observability Early: You cannot improve what you cannot see — LLM observability platforms pay for themselves in the first month of production use
Measure the Hidden Costs: The State of Engineering Excellence 2026 finding is clear — 94% of organizations are flying blind on tech debt, validation time, and burnout costs

Continuing Your Harness Engineering Journey

The harness engineering discipline is moving fast. The foundational papers and posts published in early 2026 are already being extended by practitioners deploying at production scale. This section surfaces the highest-signal resources for building on what you have learned here.

Foundational Harness Engineering Reading

OpenAI Harness Engineering Post (Feb 2026): Ryan Lopopolo's account of shipping 1M+ LOC with zero human code — the defining case study. openai.com/index/harness-engineering
Mitchell Hashimoto's Ratchet Principle: The foundational blog post defining harness engineering's core operating principle — "engineer a solution such that the agent never makes that mistake again"
Martin Fowler's Harness Engineering Guide: Practitioner-level treatment of guides and sensors for coding agents. martinfowler.com
Atlan: AI Agent Harness Failures — 13 Anti-Patterns: Empirical analysis of the most common harness failure modes and their root causes. atlan.com

Harness Engineering Tools

Claude Code: Built-in harness primitives — CLAUDE.md, hooks, skills, subagents, MCP. The most fully harness-native coding agent available as of 2026.
Awesome Harness Engineering (GitHub): Curated list of tools, patterns, resources, and MCP servers. Updated continuously by the community. github.com/ai-boost/awesome-harness-engineering
Token Savior: MCP server that indexes codebases by symbol — 77% token reduction, 76% wall time reduction. Essential for large-codebase harnesses.
Trellis: Progressive spec system replacing monolithic CLAUDE.md — agents load only the standards and PRDs relevant to the current step.

Community & Learning Resources

Software Mansion Agentic Engineering Guide: Deep practitioner coverage of harness engineering with security focus. agentic-engineering.swmansion.com
Faros AI Harness Engineering Blog: Applied harness patterns for engineering teams with measurement frameworks. faros.ai
HumanLayer: Skill Issue: Practical harness engineering for coding agents — real configuration examples and hook recipes. humanlayer.dev
Harness State of Engineering Excellence 2026: 700-practitioner survey on AI adoption, productivity measurement, and hidden costs across 5 countries.

Your Next Steps

This Week: Create CLAUDE.md or AGENTS.md — 20 lines covering your top 5 code quality rules. Version-control it. Wire your existing linter as a sensor.
This Month: Add your first PostToolUse hook. Instrument one session with LangSmith or Helicone. Build your first sensor pass rate dashboard.
This Quarter: Run your first monthly ratchet review. Publish your first skill to the team registry. Conduct a full harness audit.
Ongoing: Apply the ratchet after every class of failure. Expand agent trust only when sensor pass rate data justifies it. Treat the harness as a first-class product.

Advanced Topics to Explore

Extreme Harness Engineering: 1M+ LOC, 1B tokens/day harnesses — the architecture decisions that only matter at production scale
Multi-Model Harnesses: Routing different task types to different models within a single harness — cost and quality optimization via intelligent model dispatch
Harness-as-Code Pipelines: CI/CD for harness changes — automated regression suite, canary harness deployments, automated ratchet diff generation
Privacy-Preserving Harnesses: Federated harness patterns for multi-tenant and regulated deployments — shared sensors, isolated guides
Autonomous Ratchet Systems: Harnesses that identify their own failure classes and propose ratchet fixes — the next frontier beyond the current human-in-the-loop ratchet cycle

Additional References

Harness Engineering for AI Agents in 2026
Why the future of enterprise AI belongs to systems engineers, not prompt engineers. Introduces the 3-layer harness architecture (Information, Execution, Feedback) and the Inner/Outer Harness distinction. By Vishal Mysore, May 2026.
Harness Engineering: Build Systems Around AI Agents (2026)
The definitive guide to the Agent = Model + Harness formula. Covers the 7 components of a production harness, the LangChain 52.8%→66.5% accuracy experiment, and real-world results from OpenAI, Stripe, and Anthropic. Coined by Mitchell Hashimoto.
Harness Engineering: The Complete Guide (NxCode)
If 2025 was the year agents proved they could write code, 2026 is the year we learned the agent isn't the hard part; the harness is. Covers the OpenAI Codex 1M-line codebase case study and the metaphor's origin in horse tack.
The 12-Factor Agent
A modern evolution of the original 12-Factor App methodology, specifically tailored for AI-native agentic architectures. Provides the structural principles that underpin a well-engineered outer harness.
Model Context Protocol (MCP)
The industry-standard protocol ('USB-C for AI') for connecting AI agents to data and tools — the tooling backbone of the Information Layer in any production harness.
Agent2Agent (A2A) Protocol
The open standard for agent-to-agent communication (Linux Foundation). Provides the inter-agent communication layer for multi-agent harnesses where sub-agents hand off structured results via Context Firewalls.
AGENTS.md Standard
The open-format file for providing persistent, project-specific context to AI coding agents — the canonical mechanism for the Context Engineering component of a harness.
Beyond Vibe Coding: The Five Building Blocks of AI-Native Engineering
Thoughtworks (2026) — the professional shift from informal 'chat-oriented programming' to agentic engineering: orchestrating agents, models, methodologies, specs, and context. The natural predecessor discipline to harness engineering.
Complete Guide to Agentic Coding (TeamDay)
Everything you need to master agentic engineering: concepts, patterns, tools, and hard-won best practices. Covers the practitioner transition from vibe coding to systematic, harness-driven workflows.
5 Key Trends Shaping Agentic Development in 2026
The New Stack — parallel agent task execution, background runners, git worktrees, and context isolation. Essential reading for understanding the execution patterns that harnesses must manage.
AI Engineering Trends in 2025: Agents, MCP and Vibe Coding
The New Stack — how agentic technology and MCP became the defining story of 2025 AI engineering, setting the stage for the harness engineering discipline that emerged in 2026.
Harness AI Code Agent Review 2026
Verdent Guides — deep evaluation of Harness.io's AI Code Agent, DevOps Agent, and AppSec layer. Useful for teams evaluating the platform-level harness tooling available in 2026.
Ultimate Guide to AI Harness Engineering Examples in 2026
Teams utilizing advanced harness frameworks see deployment frequency increases of up to 145%. Covers AI-powered harness platforms, CI/CD efficiency gains, and enterprise adoption patterns.
OpenAI Codex — AI Coding Agent
OpenAI's Codex is the agent that brought harness engineering into the mainstream. The Codex team's result (3 engineers, 1M lines of code, none of them written manually) is the defining case study of production harness engineering.
Spec-Driven Development
Moving beyond vibe coding to structured, specification-driven agentic engineering — the specification layer that feeds directly into the harness's context and constraint components.
Context Engineering for AI Agents
The architecture of meaning for AI — building data layers, guardrails, and environment rules that ensure agents have exactly the right information. The theoretical foundation for the Information Layer of a harness.
AI-Native Enterprises: IT Architecture Strategy for 2026
The enterprise context for harness engineering — why 2026 marks the shift from AI adoption to AI-native architecture, and what that means for IT teams building production agentic systems.
The Great Rebuild: Architecting an AI-Native Tech Organization
Deloitte (2026) — 78% of tech leaders anticipate transformational integration of AI agents into architecture workflows. The organizational framing for why harness engineering is becoming a core engineering competency.

What You've Built

You now have the knowledge and vocabulary to design, implement, operate, and improve a production agent harness. The discipline is young, dating only from February 2026, but the principles are durable. Here is what you have gained:

Core Harness Competencies

Guides-and-Sensors Architecture: You can design a closed-loop control system around any AI model using CLAUDE.md, AGENTS.md, linters, test runners, LLM evals, and hooks
Ratchet Principle Mastery: You know how to turn every agent failure into a permanent harness improvement — the fundamental practice that makes harness quality improve monotonically
MCP Tool Governance: You can wire, govern, and secure MCP servers — giving agents controlled tool access with full audit trails and permission enforcement
Incremental Trust Expansion: You can use sensor pass rate data to make defensible, risk-managed decisions about expanding agent autonomy across five stages
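
A trust expansion decision based only on sensor pass rate trends might look like the sketch below. The stages are represented as the numbers 1 to 5, and the 98% threshold and four-week window are hypothetical values chosen for illustration.

```python
def next_trust_stage(current_stage, weekly_pass_rates,
                     threshold=0.98, min_weeks=4, max_stage=5):
    """Advance one stage only after a sustained run of high pass rates.

    weekly_pass_rates: oldest-to-newest sensor pass rates in [0, 1].
    threshold and min_weeks are illustrative, not recommended values.
    """
    recent = weekly_pass_rates[-min_weeks:]
    if len(recent) < min_weeks:
        return current_stage  # not enough data to justify expansion
    if all(rate >= threshold for rate in recent):
        return min(current_stage + 1, max_stage)
    return current_stage
```

The function never skips a stage and never expands on incomplete data, which keeps each permission change defensible. Demotion after a drop in pass rate is deliberately left out of this sketch.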

Key Insights to Carry Forward

Agent = Model + Harness: The model is the stateless reasoning engine. The harness is where your organizational knowledge lives. 65% of failures are harness failures.
CLAUDE.md Is Non-Negotiable: Everything in the conversation history will eventually be compacted. Rules that must survive belong in the guide file, not the conversation.
Sensors Before Autonomy: Never expand agent permissions without first establishing sensor coverage for the new action space. Measure before trusting.
The Harness Is the Product: At production scale, engineers spend their time on harness architecture, not code. The harness is the primary engineering artifact of the AI era.

Ready for Production

You now have the knowledge to:

Build a Minimum Viable Harness in One Week: CLAUDE.md + linter sensor + test sensor, the three primitives that deliver immediate value
Run a Ratchet Review: Turn every agent failure into a guide rule or sensor — systematically raising the quality floor
Design Multi-Agent Orchestration: Task boards, handoff summaries, worktree isolation, and A2A delegation for parallel agent teams
Measure What Matters: Sensor pass rates, cost per task, agent lead time, and failure class recurrence — the metrics that make harness performance legible
Scale Across the Organization: Skill registries, shared sensor suites, and federated guide inheritance — harness engineering at organizational scale

The Ratchet Principle

"Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."

— Mitchell Hashimoto, February 2026
