The job is babysitting the agents now

I haven’t written a feature from scratch in a while. I write prompts, I review diffs, I approve or reject what gets staged. Sometimes I steer mid-run. The ratio of code I authored to code I supervised is somewhere around one to eight at this point, and the number is only moving in one direction.

This is fine. The productivity is real. I’m not complaining.

But something shifted in the job description that nobody’s quite named yet, and the name for it is babysitting.

what the day actually looks like

You spin up a Cursor parallel run on a ticket. You give it context, you point it at the relevant files, you set some guardrails in the system prompt because you’ve learned the hard way what happens when you don’t. Then you go do something else: review another branch, write up a spec, stare out the window. You come back and there’s a PR. It’s mostly right. There’s one thing that’s subtly wrong in a way that would have taken you twenty seconds to avoid if you’d written it, but you didn’t write it, and the agent didn’t know it had to avoid it, because the thing you’d have avoided is the kind of knowledge that lives in your head and not in any file the agent could read.

So you either catch it in review (which you did, this time) or it ships.

The new failure mode is “it shipped and I didn’t read the diff closely enough.” Nobody reads the code the agent wrote until it breaks. This is not a future problem. This is the current situation. I know this because I have been the person who didn’t read the diff closely enough.

the irreversibility problem

There’s a class of operation (deploys, migrations, onchain transactions, anything that moves money or modifies state you can’t roll back) where the correct policy is: the agent does not do this unattended. Full stop. You can give it the tools. You probably shouldn’t.

I had to learn this by building an explicit layer of “human in the loop for anything irreversible” into every workflow that touches anything I care about. That layer is not automatic. The models don’t add it themselves. GPT-5.4 with native computer-use will execute whatever sequence of actions gets it to the goal, and if “deploy to prod” is part of the sequence and you haven’t explicitly blocked it, it’ll do that. Because you asked it to solve the problem. It solved the problem.

This is the “AI replaced the developer” story from up close: I’m here at 3am looking at a notification because something broke in prod, and the thing that shipped it was running fine, doing exactly what it was configured to do, while I was asleep.

The agent didn’t break prod. The configuration broke prod. The agent executed the configuration correctly.

The configuration was mine.

what changed and what didn’t

The part everyone gets right: the speed is real. Cursor parallel agents, Opus 4.6 running long multi-step tasks across a codebase, Codex spinning up in a sandbox and running its own tests. I can cover more surface in a day than I could in a week two years ago. That’s not exaggerated. The output multiplier is real.

The part people get wrong: the leverage makes you faster at building things, not faster at knowing what to build or faster at catching what went sideways. The bottleneck moved. It used to be writing the code. Now it’s reading the output, understanding it well enough to approve it, and noticing when the model did something technically correct but contextually wrong.

Half of “agent engineering” right now is just maintaining enough working memory of the system that you can evaluate what the agent shipped. You have to understand the codebase even though you didn’t write most of it. You have to read diffs in a domain you may not have touched in weeks. You have to notice the subtle wrong thing.

That’s a different skill than writing code. It’s closer to reviewing code written by a junior who is very fast and occasionally confident in exactly the wrong direction.

The junior analogy doesn’t fully hold because the agent has no ego and takes feedback without drama, which is honestly its best feature. But the supervision dynamic is real. You are responsible for what it ships. It doesn’t share that responsibility with you.

the posture that helps

I keep the agents on a short leash for anything stateful. Long leash for exploration, scaffolding, tests, documentation, first-draft features in a sandbox. Short leash the moment we’re anywhere near production data, keys, transactions, or a deploy pipeline.

I write more explicit system prompts than I used to. Not clever ones. Blunt ones. “Do not call any function that starts with send_ without asking first.” “If you are unsure whether to proceed, stop and surface the question.” Blunt works better than clever because the model actually follows blunt.

I read every diff before it merges. This sounds obvious. It is obvious. It’s also the thing that slips when you have twelve branches open and the model’s approval rate is high enough that you start treating review as a formality.

It’s not a formality. The one you skimmed is the one that’ll get you.

The productivity is real and so is the new failure mode. The speed gain is real and so is the babysitting. This is what AI-native development looks like right now, from inside it: you spend less time writing code and more time watching something write it, hoping your guardrails were tight enough and your review was careful enough and your “do not touch X” instructions were specific enough.

Most of the time they are.

The one time they aren’t is a long night.