Seventeen comments
If you’ve been using agents for coding, you know the pattern: you ask an agent to implement a feature, all 900+ tests pass, builds are green, you give it a skim, it all looks good, so it gets shipped for review and you go on your merry way.
Then seventeen comments land on the PR.
None of them are about bugs; the code works. But by the time the review cycle finishes, the implementation looks different enough from what was first shipped that “working code” and “right code” become two entirely different things. This post is about the gap between those two and the steps I’ve taken to narrow it.
Working code isn’t the hard part
A key aspect of working with agents that most people miss is that they are non-deterministic by design: you give them the same prompt and get different outputs. This is a feature, not a bug; it means they can be creative and adapt to new information. But it also means that “shipped” doesn’t always mean “done.”
This gets compounded by the fact that these tools write code significantly faster than we do, so you can get to “shipped” much sooner than you might expect. In my case, the agent read the existing service patterns, generated the code and the GraphQL schema, wrote the tests, and shipped. Frictionless, just as configured.
Then the reviews came in, and several things turned out to be wrong. Seventeen comments in this case, split roughly three ways:
- where things lived and what they were called
- places where the existing codebase had a convention the new code ignored
- places where the new surface broke invariants from somewhere else in the system
None of these are things that would cause a crash, a security hole, or broken behavior. All seventeen still mattered.
Correct and also wrong
One of the clearest examples was validation. The agent had hardcoded a type check at the call site, a pattern it had seen elsewhere in the codebase. The reviews had valid push-back: in this codebase, entities implement an interface and resolvers fetch-then-run against it. The entity carries its own metadata.
Both versions work. Both pass tests. The first hardcodes the check at the call site. The second lets higher-level access fall out for free, because the entity already knows where in the tree it sits. Future resolvers don’t have to reinvent that check.
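To make the contrast concrete, here’s a minimal TypeScript sketch of the two shapes. Every name in it (`Entity`, `Validatable`, `assertReachableFrom`) is invented for illustration; none of it is the actual code.

```typescript
// Hypothetical sketch of the two approaches; all names are invented.

interface Entity {
  id: string;
  type: string;
}

// Version 1: the type check is hardcoded at the call site.
// It works, but every future resolver has to repeat it.
function resolveV1(entity: Entity): Entity {
  if (entity.type !== "document") {
    throw new Error(`expected document, got ${entity.type}`);
  }
  return entity;
}

// Version 2: entities implement an interface and carry their own
// metadata, so the resolver fetches, then runs the entity's check.
interface Validatable extends Entity {
  // The entity knows where in the tree it sits, so higher-level
  // access checks fall out for free.
  assertReachableFrom(root: string): void;
}

function resolveV2(entity: Validatable, root: string): Entity {
  entity.assertReachableFrom(root); // entity-owned validation
  return entity;
}
```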
The agent couldn’t have known this from just looking at the code. The pattern showed up in some entities but not all, because we’re constantly improving and refining. Reviewers pointed it out because they’d been in the room when the pattern was decided, and because a later ticket was going to touch something related to it. The check couldn’t be hardcoded because it would need to move.
That’s domain knowledge living in the reviewer, not the repo.
A question of taste
As I started peeling back more layers of the onion, more things came up wrong in unexpected ways. One example: I had a function implemented at the resolver level, looping over items and calling a single-item endpoint for each. It worked fine.
The problem came when looking at the codebase as a whole. The way the data flowed through the system, the shape of it, didn’t fit. This kind of discrepancy is incredibly common when making large changes through agentic development: the agents don’t have the same holistic understanding of the codebase that we do, at least not yet. Some call it taste, a very appropriate word. They can easily write code that works but doesn’t “taste” as it should.
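Roughly the mismatch, with an invented `Api` interface standing in for the real service (this is a sketch of the problem, not the actual code):

```typescript
// Hypothetical sketch; Api, archiveItem, and archiveItems are
// invented to illustrate the shape problem.

interface Api {
  archiveItem(id: string): Promise<void>;
  archiveItems(ids: string[]): Promise<void>;
}

// What the agent wrote: correct, but the wrong shape. N items means
// N round trips and N partial-failure modes, and a data flow that
// looks like nothing else in the system.
async function archiveAllV1(api: Api, ids: string[]): Promise<void> {
  for (const id of ids) {
    await api.archiveItem(id); // single-item endpoint, called in a loop
  }
}

// The shape the rest of the codebase follows: bulk work goes through
// a bulk operation, so batching, auditing, and error handling live
// in one place.
async function archiveAllV2(api: Api, ids: string[]): Promise<void> {
  await api.archiveItems(ids); // one call, one failure surface
}
```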
That shape matters for the long term: a predictable codebase is a healthy codebase.
Ask why
Sometimes the most useful review isn’t about what you built; it’s about why.
The agent added a field to the request path where the item’s status changes: stored it, returned it, done. Ticket satisfied, right? “Here’s what was built.”
Reviews pointed at a different question: why does this field even exist?
Not “what does it hold” or “where does it go.” Why. They were asking about the audit trail. Every time something transitions state, we need to know the why. Not just when a user does it through the API, but through every path that can cause that transition. The agent had modeled the field as belonging to the request because it had a limited perspective. The reviewer moved it to the transition itself.
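Sketched with invented names (the post doesn’t show the real types), the move looks roughly like this:

```typescript
// Hypothetical sketch; ArchiveRequest, StatusTransition, and
// transition are invented for illustration.

type Status = "active" | "archived";

// Version 1: the "why" belongs to the API request, so only
// API-driven changes ever carry it.
interface ArchiveRequest {
  itemId: string;
  reason: string; // lost once the request is handled
}

// Version 2: the "why" belongs to the transition itself. Every path
// that changes state (API, background job, migration) has to supply
// it, and the audit trail gets it for free.
interface StatusTransition {
  itemId: string;
  from: Status;
  to: Status;
  reason: string;
  at: Date;
}

function transition(
  item: { id: string; status: Status },
  to: Status,
  reason: string,
): StatusTransition {
  const record: StatusTransition = {
    itemId: item.id,
    from: item.status,
    to,
    reason,
    at: new Date(),
  };
  item.status = to;
  return record; // append to the audit log
}
```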
It’s a small change in where something lives. It’s a large change in what it means.
The loop
This is all part of an iterative process. I pushed fixes for these and it got reviewed again. Six new threads came back, all pointing at similar issues. The problem isn’t the iteration itself; it’s what you learn from it and what you can do as a developer to avoid falling into the same hole twice.
Yes, it can be pretty painful, but what survives is worth it. For some of these, the rules have been externalized, saved for future sessions to reference so they can be applied to upcoming PRs. For others we reach deeper: we extract what’s valuable from the experience and grow from it.
In a way it’s what we already do day to day; the only difference is that we’re doing it more consciously, and getting extra value by saving it for other endeavours.
The gap
This is the consistent failure mode I see with agentic backend work. Not bugs. Not security holes. Shape.
Agents are good at solving the ticket. They read the existing code, pattern-match, produce something that compiles and passes tests. On the axis of “does it do the thing,” they are reliable enough that I trust them to ship without watching.
However, on the axis of “does this fit the shape of the system,” they drift in the same directions. A check hardcoded at the call site because that’s what they see most. Bulk actions as loops because that’s the most literal reading of “go through this many items and do x.” Audit data as API plumbing because the requirements describe it as an API field. Each choice is locally defensible. Collectively they pull the codebase toward a shape that doesn’t hold up.
I’ve tried better prompts and added a domain-seams section to the backend AGENTS.md (covered in CLAUDE.md as project constitution). The rules helped on subsequent work, but they didn’t eliminate the pattern. The convention file can only carry so much, and the actual decision is entirely contextual: which seams matter for this feature, what’s planned in adjacent tickets, which patterns are converging and which are being deliberately changed.
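To give a sense of what the file can carry, here’s a hypothetical excerpt of such a section, reconstructed from the examples above rather than copied from the real file:

```markdown
## Domain seams (hypothetical excerpt)

- Validation lives on the entity, not at the call site. Entities
  implement the validation interface; resolvers fetch-then-run.
- Bulk actions go through bulk operations. Don't loop a single-item
  endpoint over a collection.
- Audit reasons belong to the state transition, not the API request.
  Every path that causes a transition must record why.
```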
That information will live in our heads unless it’s externalized.
Review is where the codebase teaches itself
I’ve changed the way I think about reviews now: they’re less a place where we verify the work and more a place where we do the teaching. I’m not talking about the agent-to-reviewer direction; I’m talking about the codebase-to-agent direction, through the reviewer.
Every comment left on PRs nowadays encodes something about the shape of the system. A pattern we’ve seen work. A cross-cutting invariant. A decision from a design discussion. A plan for a ticket that hasn’t been filed yet. None of that is in the diff. Not all of it is in AGENTS.md. Some of it isn’t written down anywhere at all.
If the agent had merged that PR solo, the code would work but the shape of the system would surely drift. Subsequent agent-generated work would pattern-match on the drifted code, compound it, and eventually look wrong enough that we would have to tear out a quarter of it and start again.
As I see it, whether intentional or not, the review process has turned into the mechanism that stops the drift. Not by catching bugs, the tests mostly do that, but by forcing the code back onto the seams we still carry in our heads.
What changed
Two things, after the work on this ticket.
One: I’ve stopped treating review as a formality. The agent ships green PRs. I review them like I would a PR from a junior engineer who’s new to the codebase, which is what the agent is, permanently, because it doesn’t carry state between sessions. There are ways to mitigate this, like injecting context from the memento vault, but the risk is still there and the bar for code quality stays high. Every review cycle is a chance to push the code back onto the domain seams before the drift compounds.
Two: I started capturing the seams that reviewers pushed toward. After this ticket I wrote the patterns up as vault notes. An inception pass turned those into a draft “domain seams” section for the backend AGENTS.md, ready for future improvements.
The generalizable pattern is real. The specifics need a reviewer. For now the reviewer stays human.
Next up: three LLMs reviewed the same PR in parallel. They disagreed in useful ways.
Part of a series on agentic development tooling. See also: Harness engineering, Five revisions of start-ticket, Ship-it, Composing over reimplementing, CLAUDE.md as project constitution