The Day 2 Problem: Why Vibe Coding Is Creating a Maintenance Crisis and What Good Engineers Do Instead

10 Apr 2026

I was reviewing a PR last week where a checkout flow had grown to 847 lines in a single component. The author had used an AI assistant to build out several features in a row. Each one worked fine. The tests passed. The component worked. Nobody was really sure what it did anymore.

That's the pattern I keep running into. Not broken code. Code that nobody wants to touch.

Day 1 is when you spin up a project with an AI assistant and features appear almost faster than you can think of them. It feels like a superpower and honestly it is. Day 2 is three months later when you need to change something in that codebase and realise the whole thing is a trap. Not because the code is wrong. Because it was written for an imagined version of your system, not the actual one.

The specific failure mode

AI models are good at writing code that's locally correct. They will give you a well-structured function, sensible variable names, idiomatic usage of whatever library you're using. The problem is that code doesn't live alone. It lives inside a system with history, constraints, and decisions the model has no way of knowing about.

The model doesn't know you deprecated that utility four months ago. It doesn't know your team specifically avoids that pattern after an incident. It doesn't know the "simple" approach it just suggested will explode under the threading model you're using. It's writing code for a plausible version of your codebase, not yours.

And because the output looks clean (no obvious red flags, no weird syntax) it slides through review. The problems are in the fit, not the code itself, and fit is harder to see.

What reviewing AI output actually requires

When I review code a colleague wrote, I can ask them questions. Why did you structure it this way? Did you consider this edge case? What happens when this is null? The code is backed by a person with reasoning.

When you review AI-generated code, there's no reasoning to query. You have to supply it yourself. That means asking different questions:

Does this code understand where it actually lives? Is it respecting the module's responsibilities or doing something that belongs somewhere else? Has it reintroduced a pattern we deliberately moved away from? Is it correct in isolation but wrong given how this module is actually called?

The model will also happily invent abstractions that duplicate things that already exist, because it doesn't know what already exists. I've caught duplicate utility functions, parallel implementations of the same logic in different styles, and import patterns that violated our layering. All in code that passed tests and looked fine on a quick scan.

Reviewing AI output is more demanding than reviewing human output, not less. The code looks cleaner, which makes it easier to miss the structural problems.

Fitness functions are more useful now than they've ever been

The idea of an architectural fitness function (an automated test for system-wide properties, not just individual functions) has been around since Neal Ford, Rebecca Parsons and Patrick Kua wrote about it in Building Evolutionary Architectures. I've found it genuinely useful for years. In an AI-assisted codebase it's basically essential.

The model doesn't know your architectural rules. So you need to encode them as tests that will fail when the rules are broken.

A dependency direction test is the obvious one: if your architecture says the data layer can't import from the presentation layer, write something that parses your import graph and fails if that invariant is violated. Models introduce cross-layer imports constantly because the shortest path to making something work often ignores your layering entirely.
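To make that concrete, here is a minimal sketch of a dependency direction check in Python. Everything named in it is an assumption for illustration: the FORBIDDEN rule table, the layer names, and the convention that a module's top-level package is its layer. It only looks at absolute imports, so treat it as a starting point, not a complete import-graph tool.

```python
import ast

# Hypothetical rules: each layer maps to the layers it must NOT import from.
FORBIDDEN = {
    "data": {"presentation"},
    "domain": {"presentation", "data"},
}


def imported_modules(source: str) -> set[str]:
    """Return the dotted names of modules imported absolutely by this source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names.add(node.module)
    return names


def layer_violations(sources: dict[str, str]) -> list[tuple[str, str]]:
    """Map of module name -> source text in, (importer, imported) rule breaks out.

    Assumes the first dotted component of a module name is its layer.
    """
    violations = []
    for module, source in sorted(sources.items()):
        banned = FORBIDDEN.get(module.split(".")[0], set())
        for target in sorted(imported_modules(source)):
            if target.split(".")[0] in banned:
                violations.append((module, target))
    return violations


# Made-up example: the data layer reaches up into the presentation layer.
example = {
    "data.orders": "from presentation.views import render_order\nimport json\n",
    "presentation.views": "from data.orders import load_order\n",
}

if __name__ == "__main__":
    for importer, imported in layer_violations(example):
        print(f"{importer} must not import {imported}")
```

In a real repository you would build the sources mapping by walking the source tree, and fail the CI job whenever the returned list is non-empty.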

Circular dependency detection is another one that's easy to add and catches a surprisingly frequent class of problem. Circular deps rarely fail tests, but they almost always mean a module boundary has been violated somewhere.
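A cycle check can be sketched as a depth-first search over the import graph. The graph shape here (module name mapped to the list of modules it imports) and the module names are assumptions for illustration; how you extract the graph depends on your language and tooling.

```python
from typing import Optional


def find_cycle(graph: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one import cycle as a path whose first and last entries match,
    or None if the graph has no cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in graph}
    path = []

    def visit(node: str) -> Optional[list[str]]:
        colour[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we have closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in sorted(graph):
        if colour[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Made-up example: cart -> pricing -> tax -> cart.
example_graph = {
    "cart": ["pricing"],
    "pricing": ["tax"],
    "tax": ["cart"],
    "utils": [],
}

if __name__ == "__main__":
    cycle = find_cycle(example_graph)
    if cycle:
        print("import cycle: " + " -> ".join(cycle))
```

This reports only the first cycle it finds, which is usually enough to fail a build with a useful message. The recursion is fine for a module graph of ordinary size; a very deep graph would want an iterative version.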

I'd also add module size thresholds. Not because line count is a perfect proxy for complexity, but because when a module quietly becomes a god object over several AI-assisted sessions, you want something to flag it before it gets deeply embedded.
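A size threshold is the simplest of the three to sketch. The 400-line limit and the choice to count only non-blank lines are arbitrary assumptions here; pick whatever number your team would actually act on.

```python
def oversized_modules(
    sources: dict[str, str], max_lines: int = 400
) -> list[tuple[str, int]]:
    """Return (module, non_blank_line_count) for every module over max_lines.

    sources maps a module name to its source text. The default limit is an
    illustrative assumption, not a recommendation.
    """
    report = []
    for module, source in sorted(sources.items()):
        count = sum(1 for line in source.splitlines() if line.strip())
        if count > max_lines:
            report.append((module, count))
    return report


# Made-up example: one module the size of the checkout component above.
example_sources = {
    "checkout": "x = 1\n" * 847,
    "cart": "y = 2\n\n" * 10,
}

if __name__ == "__main__":
    for module, count in oversized_modules(example_sources):
        print(f"{module}: {count} non-blank lines (limit 400)")
```

A ratchet works well in practice: record each module's current size, fail only when a module grows past its recorded size or the global limit, and lower the numbers as you split things up.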

These feel unglamorous. They're also the difference between a codebase you can still work in at the 12-month mark and one you're planning to rewrite.

The thing that actually matters

The engineers I've watched handle AI-assisted codebases well all have something in common: they're comfortable pushing back on the model. Not reflexively, but when it's wrong. They look at confident, fluent AI output and say "no, that's not right here" and they know why.

That's not a prompt engineering skill. You can't acquire it by getting better at writing prompts. It comes from having maintained enough bad codebases to understand why constraints exist, from having felt the cost of a "just for now" abstraction becoming permanent, from knowing what coupling actually costs when you're trying to ship a fix at 11pm.

Junior engineers (and I was one, not long ago) are especially exposed here. The model sounds certain. It doesn't hedge. If you don't already have a strong enough mental model to disagree with it, you'll merge things you shouldn't. That's not a criticism of junior engineers. It's just how confidence works on people who haven't yet built up enough counter-evidence.

The practical implication for teams is that AI-generated code probably shouldn't have junior engineers as its first-pass reviewers, at least not for anything in a critical path. I know that sounds backwards. The pitch for AI tools is often that they make junior engineers more productive. They do. They also generate output that requires senior judgment to evaluate safely.

A few things I'd actually do

If you're running a team right now and using AI generation heavily, the thing I'd prioritise above almost everything else is getting fitness functions into CI before you start a big AI-assisted sprint. Once the code is merged and other things depend on it, the violations are load-bearing and unwinding them is genuinely painful.

Beyond that: be honest about where the cost of a wrong answer is high. Auth flows, payment processing, anything touching user data. These aren't places to generate code casually and wave it through on a quick review. The model doesn't usually hallucinate dramatically wrong code; it hallucinates subtly wrong code, and subtle is exactly what you don't want in those areas.

The last thing, and this is more cultural than technical: track how long it takes to onboard someone new to a module. Velocity metrics look great in an AI-assisted codebase. Maintainability shows up later. Time-to-onboard is one of the few proxies you can measure before it becomes a crisis.

None of this is a reason not to use these tools. I use them constantly. They genuinely make me faster. But "it works" and "it's good" are different bars, and the gap between them is where the maintenance problems live.

This is exactly the gap I teach engineers to close. My next free live session, System Design for AI Agents: Senior vs Staff (Tuesday, July 21, 2026, 6:30 PM GMT+1, 45 minutes on Zoom), goes straight at it: the five things that break every LLM agent after the demo, and the design decisions that stop each one. Everything else I teach is on my Maven profile.

I write about system design and the senior-to-staff transition every week in Monday BY Gazar on Substack, and I break down architecture and AI-in-production reasoning on Gazar Breakpoint on YouTube.
