Andy sent me this post about how Carl Kolon uses and thinks about coding agents. The post is pretty interesting, and when I went to reply to Andy I realized that at seven paragraphs the reply was perhaps better suited to here than a Signal message. Also, because what’s friendlier than replying to messages with links to your own blog to drive traffic and the adoration of your readers? That is of course a forward-looking statement for me; I do hope to have a reader here one day, but baby steps.
Like Carl, I certainly started using LLMs as “smarter search”, cutting and pasting queries into claude.ai and waiting patiently for an answer. I think this should make Google very worried, especially as it’s so good at finding answers. The improved performance over “raw Google” is largely down to persistence: the LLM doesn’t perform a single search, it keeps searching until it finds the answer.
The decreasing levels of supervision Carl talks about are what I mean when I talk about prompts and planning, which has been my favourite LLM topic for the last few weeks. Increasingly I write a plan for a feature with the LLM, and then ask it to implement that plan. I also find the structure of these planning documents is super important, and I am iterating towards a quite structured style for them. Assuming I have good unit and functional tests enforced by a pre-commit hook, I cannot remember a recent time when there’s been a serious bug that wasn’t a design flaw in that planning process. The days of trying to read a terabyte of data into RAM seem to be over.
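As a hypothetical illustration of what such a structured planning document might look like (the headings here are illustrative assumptions, not a prescribed template):

```markdown
# Plan: <feature name>

## Goal
One paragraph stating what the feature does and why.

## Constraints and guard rails
Memory limits, sandboxing requirements, interfaces that must not change.

## Design
The chosen approach, broken into numbered implementation steps.

## Testing
Unit and functional tests to add, enforced by the pre-commit hook.

## Deferred work
Anything explicitly out of scope, so it isn’t silently dropped.
```

The “Constraints” section is where a design flaw like reading a terabyte into RAM gets caught before any code is written.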
I think this might also be why I have gotten so into understanding the underlying mechanics of things recently. Not only is it interesting to me, but I no longer need to be too concerned with surface details like how to write the CSS for a single status dashboard. This frees me up to think about more fundamental things like “what even is a VM?” I think I always would have dug into those things if I’d had the time; it’s just that LLMs are the tools that freed up that time for me.
The big barrier for me right now is that it’s rare for me to let the LLM run in unsupervised mode while implementing. I do it with one personal project as a deliberate experiment, and maybe some random personal scripts, but nothing else. I think that’s really slowing me down at this point, because I spend all day reading diffs and almost always saying yes. It’s hard to stop though, because I do occasionally catch things that should have been decided in planning. I think the really productive people are probably better at planning than me and can therefore just let it do its thing. Or maybe they have more bugs and don’t notice?
The specific personal project I let the LLM run unsupervised on is, I think, a special case for now, because the guard rails are so concrete: the project is a clone of an existing tool, implemented in a much more sandboxed manner. That means that for any given possible target command line, if my project’s output differs from the original’s, that’s a bug and should be fixed. Claude has written something like 760 tests so far asserting exactly this, and when it finds a delta it just fixes it. Does it matter if the implementation is a little weird if the output is concretely, provably correct?
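That guard rail boils down to a differential test: run the original tool and the clone with identical arguments and demand identical observable behaviour. A minimal sketch in Python — the binary names below are stand-ins, not the real project:

```python
import subprocess


def outputs_match(reference: str, clone: str, args: list[str]) -> bool:
    """Run both binaries with identical arguments and compare the
    observable behaviour: stdout bytes and the exit code."""
    ref = subprocess.run([reference, *args], capture_output=True)
    new = subprocess.run([clone, *args], capture_output=True)
    return ref.stdout == new.stdout and ref.returncode == new.returncode


# A tool trivially agrees with itself on any command line...
assert outputs_match("echo", "echo", ["hello"])
# ...while any behavioural delta (here, differing exit codes) is flagged.
assert not outputs_match("true", "false", [])
```

Each of those 760 generated tests is essentially one `outputs_match` call for a specific command line, which is why a “weird” implementation can still be provably correct.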
The biggest gap I see for that project is that I need to prompt it at the end of a session to pay down technical debt before I push the PR to CI. My current prompt for that is this:
Thanks for your work on this. I appreciate it. Some final checks before I push the PR:
* Did the changes in this branch introduce any significant amount of duplicated code? Are there any missed opportunities for code reuse or refactoring?
* Has docs/ been updated?
* Is there unit and functional test coverage?
* Are we sure that all Rust and Python tests are run by both the pre-commit hook and CI? We’ve had historical problems with missing the guest operation Rust code, for example.
* All tests should pass. We need to fix any failing tests now before we commit.
* What tests are skipped? Could we reduce that number?
* Are there any TODO comments we should address as part of this work?
* Is all deferred work and pre-existing errors listed in a plan file?
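The Rust-and-Python checklist item above is the sort of thing a pre-commit config can enforce directly. A hedged sketch — the hook ids and commands here are illustrative assumptions, not my actual config — that runs both full test suites as local hooks so no crate gets silently skipped:

```yaml
repos:
  - repo: local
    hooks:
      - id: rust-tests
        name: cargo test (whole workspace, so no crate is missed)
        entry: cargo test --workspace
        language: system
        pass_filenames: false
        always_run: true
      - id: python-tests
        name: pytest (full suite)
        entry: pytest
        language: system
        pass_filenames: false
        always_run: true
```

Running the same commands in CI keeps the two in lockstep, which is the point of that checklist question.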
Here’s the bit that worries me right now. I’ve just said the important part is the planning process: that’s where you catch design flaws, avoid bugs, and lay down the guardrails that seem to be vital to a good outcome. That’s a thing I can do because I’ve been writing software for at least 38 years. What of the juniors of the world? How do they develop those skills without the battle scars of having done it the dumb way for a very long time?