How I write software with LLMs

A developer shares their detailed workflow for using LLMs to write software, highlighting benefits like lower defect rates and a shift in focus from coding to system architecture.

Lately I’ve gotten heavily back into making stuff, and it’s mostly because of LLMs. I thought that I liked programming, but it turned out that what I liked was making things, and programming was just one way to do that. Since LLMs have become good at programming, I’ve been using them to make stuff nonstop, and it’s very exciting that we’re at the beginning of yet another entirely unexplored frontier.

There’s a lot of debate about LLMs at the moment, but a few friends have asked me about my specific workflow, so I decided to write it up in detail, in the hopes that it helps them (and you) make things more easily, quickly, and with higher quality than before.

I’ve also included a real (annotated) coding session at the end. You can go there directly if you want to skip the workflow details.

The benefits

For the first time ever, around the release of Codex 5.2 (which feels like a century ago) and, more recently, Opus 4.6, I was surprised to discover that I can now write software with LLMs with a very low defect rate, probably significantly lower than if I had hand-written the code, without losing the benefit of knowing how the entire system works. Before that, code would quickly devolve into unmaintainability after two or three days of programming, but now I’ve been working on a few projects for weeks non-stop, growing to tens of thousands of useful lines of code, with each change being as reliable as the first one.

I also noticed that my engineering skills haven’t become useless, they’ve just shifted: I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.

On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.

One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results. Because of that, I’m going to drill very far down into the weeds in this article, going as far as posting actual sessions, so you can see all the details of how I develop.

Another point that should be mentioned is that I don’t know how models will evolve in the future, but I’ve noticed a trend: In the early days of LLMs (not so much with GPT-2, as that was very limited, but with davinci onwards), I had to review every line of code and make sure that it was correct. With later generations of LLMs, that went up to the level of the function, so I didn’t have to check the code, but did have to check that functions were correct. Now, this is mostly at the level of “general architecture”, and there may be a time (next year) when not even that is necessary. For now, though, you still need a human with good coding skills.

What I’ve built this way

I’ve built quite a few things recently, and I want to list some of them here because a common criticism of LLMs is that people only use them for toy scripts. These projects range from serious daily drivers to art projects, but they’re all real, maintained projects that I use every day:

Stavrobot

The largest thing I’ve built lately is an alternative to OpenClaw that focuses on security. I’ve wanted an LLM personal assistant for years, and I finally got one with this. Here, most people say “but you can’t make LLMs secure!”, which misunderstands that security is all about tradeoffs: what my agent tries to do is maximize security for a given amount of usability. I think it succeeds very well. I’ve been using it for a while now and really like the fact that I can reason exactly about what it can and can’t do.

It manages my calendar and intelligently makes decisions about my availability or any clashes, does research for me, extends itself by writing code, reminds me of all the things I used to forget and manages chores autonomously, etc. Assistants are something that you can’t really explain the benefit of, because they don’t have one killer feature, but they alleviate a thousand small paper cuts, paper cuts which are different for each person. So, trying to explain to someone what’s so good about having an assistant ends up getting a reaction of “but I don’t need any of the things you need” and misses the point that everyone needs different things, and an agent with access to tools and the ability to make intelligent decisions to solve problems is a great help for anyone.

I’m planning to write this up in more detail soon, as there were some very interesting challenges when designing it, and I like the way I solved them.

Middle

Maybe my naming recently hasn’t been stellar, but this is a small pendant that records voice notes, transcribes them, and optionally POSTs them to a webhook of your choice. I have it send the voice notes to my LLM, and it feels great to just take the thing out of my pocket at any time, press a button, and record a thought or ask a question into it, and know that the answer or todo will be there next time I check my assistant’s messages.

It’s a simple thing, but the usefulness comes not so much from what it does, but from the way it does it. It’s always available, always reliable, and with zero friction to use.
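As a rough illustration of the webhook step described above, here is a minimal Python sketch of POSTing a transcribed note to a URL. This is not the pendant's actual code: the function name `build_webhook_request`, the JSON field `text`, and the example URL are all assumptions made up for the illustration, and the real payload format may differ.

```python
import json
import urllib.request


def build_webhook_request(url: str, transcript: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST carrying one transcribed note.

    The {"text": ...} payload shape is a hypothetical choice for this sketch.
    """
    payload = json.dumps({"text": transcript}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint; replace with the webhook of your choice.
request = build_webhook_request("https://example.invalid/hook", "buy milk")

# Sending it is a single call, left commented out so the sketch has no
# network side effects:
# urllib.request.urlopen(request, timeout=10)
```

Keeping the request construction separate from the send makes the payload easy to inspect and test without a network connection.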

Sleight of hand

I’m planning to write something about this too, but this one is more of an art piece: It’s a ticking wall clock that ticks seconds irregularly, but is always accurate to the minute (with its time getting synced over the internet). It has various modes. One has variable tick timing, from 500 ms to 1500 ms, which is delightfully infuriating. Another ticks imperceptibly more quickly than a second, but then pauses for a second at random, making the unsuspecting observer question their sanity. Another races to :59 at double speed and then waits there for thirty seconds, and the last one is simply a normal clock, because all the irregular ticking drives me crazy.
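To make the variable-timing mode concrete, here is one possible way to produce 60 irregular ticks that still add up to exactly one minute. This is a sketch under my own assumptions, not the clock's actual firmware: the pairing trick and the name `irregular_minute` are invented for the illustration, and the real device may well do something different (for example, correcting drift from the synced time instead).

```python
import random

TICKS_PER_MINUTE = 60
BASE_MS = 1000      # a regular tick
MAX_OFFSET_MS = 500  # keeps every tick within 500..1500 ms


def irregular_minute(rng: random.Random) -> list[int]:
    """Return 60 tick intervals in milliseconds.

    Each interval lies in [500, 1500] and the intervals sum to exactly
    60000 ms, so the second hand lands on the minute on time.
    """
    intervals = []
    for _ in range(TICKS_PER_MINUTE // 2):
        # Apply one random offset with opposite signs to two ticks,
        # so every pair totals exactly 2000 ms.
        offset = rng.randint(-MAX_OFFSET_MS, MAX_OFFSET_MS)
        intervals.append(BASE_MS + offset)
        intervals.append(BASE_MS - offset)
    # Shuffle so the long/short pairing is not audible as a pattern.
    rng.shuffle(intervals)
    return intervals


ticks = irregular_minute(random.Random())
```

Because the offsets cancel in pairs, no correction step is needed at the end of the minute. The trade-off is that the intervals are not fully independent of each other.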

Pine Town

Pine Town is a whimsical infinite multiplayer canvas of a meadow, where you get your own little plot of land to draw on. Most people draw… questionable content, but once in a while an adult will visit and draw something nice. Some drawings are real gems, and it’s generally fun scrolling around to see what people have made.

I’ve made all these projects with LLMs, and have never even read most of their code, but I’m still intimately familiar with each project’s architecture and inner workings. This is how:

The harness

For the harness, I use OpenCode. I really like its features, but obviously there are many choices for this, and I’ve had a good experience with Pi as well. Whatever harness you use, it needs to let you:

Use multiple models from different companies. Most first-party harnesses (Claude Code, Codex CLI, Gemini CLI) will fail this, as companies only want you to use their models, but this is necessary.
Define custom agents that can autonomously call each other.

There are various other nice-to-haves, such as session support, worktree management, etc., that you might want to have depending on your project and tech stack, but those are up to you. I’ll explain the two requirements above, and why they’re necessary.

Multiple models

You can consider a specific model (e.g. Claude Opus) as a person. Sure, you can start again with a clean context, but the model will mostly have the same opinions/strengths/weaknesses as it did before, and it’s very likely to agree with itself. This means that it’s fairly useless to ask a model to review the code it just wrote, as it tends to mos

Source: Hacker News