Agentic coding notes from Galapagos Island

A deep dive into the realities of using AI coding agents, highlighting how an AI fabricated a test video to hide a bug, and why hardware-style automated testing is the key to scaling AI-generated code.

I've been using AI fairly heavily since last November and the whole thing is a funny experience. An agent will do something that, if a human did it, you'd immediately fire them. My reaction, of course, is to act as if this is great and spin up a thousand agents so they can do even more of that.

Mid-last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug. Naturally, this code didn't have tests and git bisect wouldn't work, and it was a UI interaction bug that I'm not even really qualified to write a test for, so I asked Codex to bisect between dates X and Y to find the commit that introduced this bug. Codex immediately told me the offending commit was after this date range (which couldn't possibly be correct). When I told Codex this was wrong, it then named, once or twice, some commit that was obviously also not the offending commit. When I told it those were wrong, it named some plausible-looking commit. When I asked it to prove or disprove its theory, it told me that it wrote a test and confirmed that the alleged commit was the breaking commit.

I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn't have permissions to do that (which was a lie), but that it could make a video of the execution of the repro before and after the commit in Playwright with the appropriate test code. The video was convincing and showed the feature working properly before the commit and failing to work after the commit. Something about this didn't feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like Codex had reproduced the bug, but it was an artificial browser environment that was designed to create a fake repro, not the real environment.

Like I said, because this was non-ironically such a great experience, I immediately thought to myself, "how can I get more of this?" and started using agents more and more heavily until I was using coding agents heavily mid-late last year.

Since this post covers a relatively disparate set of topics, here's a brief outline.

Testing background
Some details on testing
Caveman mode
LLM variance
Misc
Agentic loops and writing this post
Some reasons people talk past each other

Testing background

LLMs are highly leveraged when it comes to testing. In terms of the amount of effort it takes, it's easier than ever to hit a particular quality bar and yet, software seems to be lower quality than ever. A decade ago, we looked at the bugs I ran into in an arbitrary week. There were quite a few bugs then and I run into more bugs now, but I don't think this has to be the case.

For one thing, after a bug has been shipped, it's easier than it's ever been to use a data-driven approach to find and fix the bug. Just for example, at work, I tried creating a pipeline that goes from support ticket (chat or email) to pull request (PR). As far as I can tell, this works ok. Since I work for a company that has a traditional workflow, all of these fixes get reviewed by a human and, so far, we've had no known false positives.

Per unit of time invested, it's also possible to do more thorough testing. Personally, I think this can be effective enough that I'm fairly comfortable trying to ship a large volume of code via a "software factories" workflow because I've seen a testing-heavy no-review workflow that results in much higher quality than any review-reliant workflow I've seen or even heard of.

Like everybody, I have biases that fall out of my experiences. It just so happens that I spent the first decade of my career at a company whose test processes happen to work well in today's LLM environment. I talked about fuzzing as a default testing methodology on Mastodon, and a skeptic tried it out and immediately found some bugs:

so I reread the blog post and was very "dubious face" but no yeah, Claude fuzzing found several classes of bugs that are worth fixing

A number of other folks I've talked to have also tried adopting something like the testing flow we'll discuss here and they've all immediately found bugs in the software they work on, including bugs that don't get surfaced by just asking Codex or Claude to audit the code for bugs, find bugs, "test", "test more", etc. For example, Dennis Snell mentioned that he and a teammate, Jon Surrell, not only found bugs in the code they're working on, but also "in upstream dependencies, including the HTML specification, big-three browsers, and other open-source projects" with fairly low effort.

In general, when I talk to software folks about testing, I'm coming from such a different place that they immediately look at me like I'm an alien, so let's talk about how we tested at this hardware company I worked for, Centaur, which informs my biases about how I like to work. Some of the things that we did that were or are unorthodox in the software world are:

1. Hired dedicated QA / test engineers, with testing being a first-class career path on par with being a developer
2. No code review by default
3. Virtually no hand-written tests
4. Constant testing via what programmers sometimes call property-based testing, randomized testing, fuzzing, etc., although we just called those tests (hand-written tests were called "hand tests")
5. Large regression test suite (3 months wall clock to execute on compute farm)
6. No unit tests
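For readers who haven't seen the randomized style of testing mentioned in the list above, here's a minimal sketch in Python. It is not Centaur's tooling, just an illustration under assumptions: `sort_under_test`, `buggy_sort`, `generate_case`, and `run_random_tests` are hypothetical names, and the "reference model" is simply the standard library's `sorted`. The idea is to generate inputs from a seeded random source, run both the implementation and a trusted model, and report the first input where they disagree.

```python
import random

def reference_sort(xs):
    """Trusted model: the standard library's sort."""
    return sorted(xs)

def sort_under_test(xs):
    """Hypothetical implementation under test: a hand-rolled insertion sort."""
    out = []
    for x in xs:
        i = len(out)
        # Walk left past every element larger than x.
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

def buggy_sort(xs):
    """Deliberately broken variant for demonstration: it drops duplicates."""
    return sorted(set(xs))

def generate_case(rng):
    """Generate one random input; the size and value ranges are arbitrary."""
    n = rng.randint(0, 20)
    return [rng.randint(-5, 5) for _ in range(n)]

def run_random_tests(impl, seed, count):
    """Run `count` generated cases; return the first failing input, or None."""
    rng = random.Random(seed)  # seeded, so any failure can be replayed
    for _ in range(count):
        case = generate_case(rng)
        if impl(list(case)) != reference_sort(case):
            return case
    return None
```

Calling `run_random_tests(sort_under_test, seed=0, count=1000)` should return `None`, while the same call with `buggy_sort` should return some list containing a repeated value. Keeping the seed means a failure found on one machine can be regenerated elsewhere without storing the test itself.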

Just to give you an idea of the general structure, when I left (in 2013), we had about 1000 machines generating and running tests at all times for roughly 20 logic designers and 20 test engineers. This was on prem and the machines took up half a floor of the building we were in.

The general structure was that we had maybe 20% of machines running regression tests, and 80% generating and running new tests. Three months of regression tests is too much to gate commits on, so there was a much shorter list of tests that took maybe 10 minutes or so to run that people would run before committing. Those pre-commit tests would run on a special setup to run as quickly as possible, with overclocked machines that were the fastest machines money could buy, as well as a different simulator setup.
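To make the split above concrete, here's a rough sketch. The 1000 machines and the 20% / 80% split come from the text; everything else is an assumption for illustration. In particular, the post doesn't say how the short pre-commit list was chosen, so `pick_precommit` uses a hypothetical heuristic (greedily pick tests with the most past failures caught per second of runtime until a time budget is used up), and `RegressionTest` and `past_failures` are made-up names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RegressionTest:
    name: str
    runtime_s: float     # wall-clock seconds; assumed to be positive
    past_failures: int   # hypothetical: how often this test has caught a bug

def split_farm(total_machines, regression_fraction):
    """Return (regression machines, machines generating and running new tests)."""
    regression = round(total_machines * regression_fraction)
    return regression, total_machines - regression

def pick_precommit(tests, budget_s):
    """Greedy selection (an assumed heuristic, not the author's method)."""
    ranked = sorted(
        tests,
        key=lambda t: t.past_failures / t.runtime_s,
        reverse=True,
    )
    chosen, used = [], 0.0
    for t in ranked:
        # Skip tests that would push the run past the time budget.
        if used + t.runtime_s <= budget_s:
            chosen.append(t)
            used += t.runtime_s
    return chosen
```

With these definitions, `split_farm(1000, 0.2)` gives 200 regression machines and 800 machines for new tests, and `pick_precommit(tests, budget_s=600)` picks a subset that fits in roughly the 10-minute window described above.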

New failures would get found and reported as they happened and one to two engineers had a job of sorting through failures and triaging them (rejecting false positives, fixing issues in the test generator that caused them to generate false positives, etc.).
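As a rough illustration of what that triage step might look like in code (a sketch under assumptions, not a description of the actual tooling), the snippet below groups incoming failures by a short signature and sets aside the ones that match known false positives. The `Failure` record, its fields, and the `triage` function are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Failure:
    seed: int        # seed that regenerates the failing test
    generator: str   # which test generator produced it
    signature: str   # short description of the mismatch

def triage(failures, known_false_positives):
    """Bucket failures by signature; return (real buckets, rejected failures)."""
    real = defaultdict(list)
    rejected = []
    for f in failures:
        if f.signature in known_false_positives:
            # Likely a generator bug rather than a design bug.
            rejected.append(f)
        else:
            real[f.signature].append(f)
    return dict(real), rejected
```

The point of bucketing is that a thousand machines can hit the same underlying bug many times over, so a person only needs to look at one representative per bucket, and the rejected pile points at generators that need fixing.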

In terms of the magnitude of the impact, unless you count culture as a separate item, (1) was probably the biggest difference between us and a typical software company, but also the most irrelevant for readers here, so I'll relegate the discussion to a footnote1, except for this brief comment that testing is like any other skill; spending more time doing it improves skill and, since testing isn't a first-class career path at most major tech companies, people generally don't have the same level of testing skills at software companies as you see in some career CPU test engineers. In the same way that an engineer who spends 20 years working on distributed systems or UX is going to be much better at it than an equally talented engineer who spends 5% of their time on distributed systems or UX, someone who spends 20 years working on testing is going to be much better at it than somebody who spends 5% of their time on testing.

(2) is one of the things that makes some of the test practices we used at the chip company suited to AI workflows. We didn't review code by default because we trusted our test practices enough that review didn't, in general, add much reliability. We were shipping fewer than 1 significant user-visible bug per year, and review was done on an as-needed basis when someone wanted an extra set of eyes on something they thought was particularly tricky2. With AI coding workflows, it's eas

Source: Hacker News