Harness engineering: the model is just the horse
Six months ago my main interface with AI was a prompt window. Now I spend most of my time on convention files, hooks, and quality gates. The shift happened fast, and the data explains why: the same model scores 42% or 78% on identical benchmarks depending on nothing but the environment around it. A 36-point swing, same model, same tasks.
The industry landed on a name for this in early 2026: harness engineering. The model is the horse. Our job is designing that harness.
What harness engineering actually is
Ryan Lopopolo at OpenAI coined the term in February 2026. His team ran a 5-month experiment in which 3 engineers shipped 1M+ lines of production code without writing any of it by hand: 3.5 PRs per engineer per day. The engineering effort wasn’t the application; it was getting the environment right so the agent could produce reliable output.
The terminology caught up fast. Prompt engineering (2023-2024) was about what to say. Context engineering (Tobi Lütke, mid-2025) was about what the model should see. Harness engineering is about the whole environment: how files are structured, what gates exist, how feedback flows back. Each one absorbed the last.
What a harness looks like in practice
It sounds fancy but it’s really a set of files and scripts you probably already have some version of:
| Artifact | Purpose |
|---|---|
| CLAUDE.md / AGENTS.md | Static project context: conventions, architecture, naming rules |
| Hooks (pre/post tool use) | Quality gates that fire automatically |
| Skills / sub-agent definitions | Decomposed reusable workflows |
| Memory systems (hot + cold) | Session persistence, knowledge bases |
| MCP server configuration | Tool access, sandboxing |
| CI/verification pipelines | Automated validation before human review |
If you have a CONTRIBUTING.md and a CI pipeline, you already have a harness. Most teams do. The question is how intentional it is.
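To make the hooks row concrete: a quality gate is just a script that runs after the agent edits something and feeds failures straight back into the loop. Here’s a minimal sketch in Python. The invocation convention (edited file path as an argument, non-zero exit meaning “block and retry”) is an assumption for illustration, not any specific framework’s hook API:

```python
#!/usr/bin/env python3
"""Minimal post-edit quality gate: a sketch, not a specific agent
framework's hook API. Assumes the runner passes the edited file path
as argv[1] and feeds this script's output back into the agent's context."""
import subprocess
import sys


def run_gate(cmd: list[str]) -> bool:
    """Run one check and surface its output so the agent can self-correct."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"GATE FAILED: {' '.join(cmd)}")
        print(result.stdout + result.stderr)
        return False
    return True


if __name__ == "__main__":
    edited_file = sys.argv[1]  # hypothetical: path supplied by the agent runner
    gates = [
        ["ruff", "check", edited_file],  # lint the file the agent just touched
        ["pytest", "-q", "--tb=short"],  # run the fast test suite
    ]
    # Run every gate so the agent sees all failures at once, then exit
    # non-zero if any failed -- the runner treats that as "fix and retry".
    results = [run_gate(g) for g in gates]
    sys.exit(0 if all(results) else 1)
```

The specific checks don’t matter; what matters is that failure output goes back to the agent automatically instead of waiting for human review.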
At dala.care we’ve built 25+ custom skills: ticket kickoff, PR creation, code review, TDD orchestration, CI triage, production debugging. Convention files encode our coding standards directly into the agent’s context. A knowledge vault captures architectural decisions so the agent keeps institutional knowledge across sessions.
We scored ourselves at 5.9/10 on a maturity assessment against 2025-2026 industry benchmarks. Scaffolding (guidance, skills, knowledge vault) at 7-8/10. Feedback loops (build speed, local testing, CI) at 3-4/10. That gap is where we’re focusing next.
The evidence: for
Some big numbers floating around from teams that went deep on this:
- OpenAI Codex: 3 engineers, 5 months, 1,500 PRs, 1M+ lines
- Stripe Minions: 1,300+ PRs/week, zero human-written code, 30% of bugs resolved autonomously
- Spotify Honk: 1,000 merged PRs every 10 days, 60-90% time savings on migrations
- Rakuten: 79% reduction in time-to-market (24 days to 5) on a 12.5M-line codebase
The interesting thing from the Faros AI study (10,000 developers, 1,255 teams) is that AI made individual developers faster (21% more tasks, 98% more PRs) but team-level DORA metrics didn’t move. Individuals sped up, teams didn’t. That’s a harness problem.
Anthropic’s 2026 report has a stat I keep coming back to: roughly 27% of AI-assisted work is stuff that wouldn’t have been done at all without the tooling. Too tedious, too small to justify. The harness made that work worth doing.
The evidence: against
The ETH Zurich study (138 repos, 5,694 PRs) found that LLM-generated context files reduce success by roughly 3% and increase costs by 20%. Human-written files help a bit, maybe 4% gain, but only for details the model can’t infer from the codebase itself. Their recommendation: write less, not more.
That study only measured AGENTS.md files, not full harnesses with hooks and CI. But the signal is worth paying attention to. Most of what people put in convention files is stuff the model could figure out by reading the code.
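In practice that means a convention file full of rules the model can infer is dead weight, while a handful of non-inferable decisions earns its tokens. A hypothetical before/after, purely to illustrate the distinction:

```text
# Inferable from the codebase -- delete these
- Use 4-space indentation
- Write tests for new features

# Not inferable -- keep these
- IDs are ULIDs, not UUIDs: the admin UI sorts by creation time
- Never hit the payments sandbox from tests; use the recorded fixtures
```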
The skills atrophy data is harder to dismiss. Three independent RCTs:
| Study | Finding |
|---|---|
| Anthropic (Jan 2026, 52 engineers) | 17% lower comprehension. Debugging hit hardest. |
| METR (2025-2026, 16 to 57 developers) | Original: 19% slower with AI, believed 24% faster. Follow-up: speed improved, but 30-50% refused to work without AI, making the data unreliable. |
| Microsoft/CMU (2025) | Heavy AI reliance reduced critical thinking applied to own code. |
The METR result is the one I keep coming back to. Their original study found a 43-point gap between how fast developers thought they were and how fast they actually were. Nobody could tell they were getting worse. They ran a follow-up with better tools and the speed picture did improve, but then a third to half of the participants just refused to do tasks without AI. METR called their own data “unreliable” and went back to redesign the whole experiment. So the miscalibration finding still stands. The speed finding is muddier now. And the fact that developers won’t even try working without AI anymore? That tells you something all on its own.
The mechanism makes sense when you think about it: better harness means less coding, less coding means skill degradation, skill degradation means you’re less able to judge what the agent gives you. You end up amplifying a signal you’re progressively less equipped to evaluate.
Scale AI’s SWE-Atlas found that performance differences between harnesses are “essentially noise.” HumanLayer warns it’s “entirely possible to spend more time optimizing your agent setup than shipping code.” Noam Brown’s take: “Your fancy AI scaffolds will be washed away by scale.” Models get better. Context windows grow. Maybe the harness is a workaround for temporary limitations.
Where I’ve landed on this
I’ve gone back and forth. The for case is backed by real production numbers. The against case is backed by controlled experiments. I don’t think they contradict each other; they’re measuring different things.
What helped me was ranking the artifacts by durability. If the AI situation shifts (or collapses), which pieces of a harness still have value?
- Knowledge vault / architecture docs: high value in every scenario. If AI fades tomorrow, you have the onboarding docs every team wishes they had.
- Convention files (CLAUDE.md / AGENTS.md): rename the file and it's a style guide. Useful with or without AI.
- CI / verification pipelines: standard engineering infrastructure regardless of AI tooling.
- Skills, hooks, MCP configuration: most format-dependent. When the tooling changes, these break first.
There’s a pattern underneath all of this. Nonaka and Takeuchi called it externalization in 1995: converting tacit knowledge to explicit knowledge. It’s the hardest and most valuable knowledge conversion. The Infrastructure-as-Code (IaC) movement proved the pattern works in software. Codifying tribal knowledge has permanent value regardless of the tool that reads it.
Build the harness. Invest most in the layers that survive a model change.
What I’d tell someone starting out
Write less context, not more. The ETH Zurich data backs this up: if the model can figure it out from the code, don’t document it. Only encode decisions it can’t infer.
Keep coding by hand on the hard stuff. The atrophy data is real. Use the harness for boilerplate, migrations, formatting. Keep your hands in the code for anything with ambiguity.
Invest in the portable layers first. Knowledge vault, coding standards, quality gates. Those survive a model change. Skills and hooks are the most disposable parts of the stack.
If your feedback loops are slow (builds, local testing, CI), fix that before adding more skills. That’s where we found the next increment at dala.care.
Next up: a deeper look at that perception gap, and what happens to your coding ability when the tooling does too much.