We replaced Node.js with Bun for 5x throughput

Trigger.dev achieved a 5x throughput increase by migrating their Firestarter service from Node.js to Bun, while navigating challenges like memory leaks and over-engineered database queries.

Update (March 30, 2026): Shortly after this post went live, Bun shipped a fix for the memory leak. 🥳

We replaced Node.js with Bun in one of our most latency-sensitive services and got a 5x throughput increase. We also found a memory leak that only exists in Bun's HTTP model.

The service is called Firestarter. It's our warm start connection broker: it holds thousands of long-poll HTTP connections from idle run controllers, each waiting for work. When a task run arrives, Firestarter matches it to a waiting controller and sends the payload through the held connection. No cold start, no container spin-up. It's in the critical path of every task execution on Trigger.dev.
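To make the held-connection idea concrete, here is a minimal sketch of a long-poll broker. It is not Firestarter's actual code: the `LongPollBroker` class and its method names are hypothetical, payloads are plain strings, and metadata matching, timeouts, and disconnect handling are left out.

```typescript
// Hypothetical sketch: each idle controller parks a promise resolver;
// dispatching a run resolves the oldest one, which completes its held request.
type Resolver = (payload: string) => void;

class LongPollBroker {
  private waiting: Resolver[] = [];

  // Called when a controller's long-poll request arrives.
  // The returned promise stays pending until work is dispatched.
  waitForWork(): Promise<string> {
    return new Promise<string>((resolve) => {
      this.waiting.push(resolve);
    });
  }

  // Called when a task run arrives. Returns false if no controller is idle,
  // in which case the caller would fall back to a cold start.
  dispatch(payload: string): boolean {
    const resolve = this.waiting.shift();
    if (!resolve) return false;
    resolve(payload);
    return true;
  }

  // Number of controllers currently waiting for work.
  idleCount(): number {
    return this.waiting.length;
  }
}
```

An HTTP handler would await `waitForWork()` and write the resolved payload as the response body, so the connection is only answered once a run has been matched.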

The problem: Firestarter was using too much CPU. It was running on Node.js, spending 31% of its time inside a SQLite query, parsing every request with Zod, and converting headers with Object.fromEntries() on every GET. It worked, but it was slow.

Getting to 5x took four rounds of profiling, and we hit a few Bun surprises we haven't seen documented elsewhere.

Phase 1: kill the SQLite query engine

The original connection manager was designed as a generic queryable store. It accepted arbitrary nested metadata, flattened it recursively into key-value pairs, and indexed everything in an in-memory SQLite database. Node 22 shipped with node:sqlite built-in, so it was zero-dependency. SQL gave us flexible partial matching on any combination of fields. It made sense at the time because we didn't know the access pattern yet.
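For illustration, recursive flattening of nested metadata into key-value pairs could look roughly like this. The dot-joined key format, the `Metadata` type, and the sample field names are assumptions; the post does not show how the original store encoded its keys.

```typescript
// Hypothetical metadata shape: arbitrarily nested objects with scalar leaves.
type Metadata = { [key: string]: string | number | boolean | Metadata };

// Flatten nested metadata into [key, value] rows, one per leaf,
// which is the shape a key-value index table would store.
function flattenMetadata(meta: Metadata, prefix = ""): Array<[string, string]> {
  const rows: Array<[string, string]> = [];
  for (const [key, value] of Object.entries(meta)) {
    const path = prefix ? `${prefix}.${key}` : key; // assumed dot-joined path
    if (typeof value === "object" && value !== null) {
      rows.push(...flattenMetadata(value, path)); // recurse into nested objects
    } else {
      rows.push([path, String(value)]); // leaves are stored as strings
    }
  }
  return rows;
}
```

Every connection then costs one index row per leaf, and every match has to reassemble those rows, which is where the query below comes from.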

Turns out the access pattern was always the same 4 fields. Every match attempt ran this query:

SELECT DISTINCT c.id, c.metadata
FROM connections c
JOIN metadata_index mi ON c.id = mi.connection_id
WHERE c.id IN (
  SELECT connection_id
  FROM metadata_index
  WHERE (key = ? AND value = ?)
     OR (key = ? AND value = ?)
     OR (key = ? AND value = ?)
     OR (key = ? AND value = ?)
  GROUP BY connection_id
  HAVING COUNT(DISTINCT key) = ?
)
LIMIT 1

A subquery with JOIN, GROUP BY, and HAVING COUNT(DISTINCT) for what is fundamentally a hash table lookup (we really overengineered this one). The metadata is always the same 4 fields: deployment ID, version, CPU, and memory.

We ran node --prof under load (500 simulated controllers, 50 concurrent supervisor requests) and processed the output with --prof-process. getConnection was 31% of total CPU time.

We replaced SQLite with a composite-key Map<string, Set<string>>. The key is a null-delimited string of deployment + version + cpu + memory. Matching became O(1) instead of a SQL query.
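A minimal sketch of that structure, under stated assumptions: the `MatchKey` field names, the `ConnectionIndex` class, and its `add`/`remove`/`take` methods are hypothetical, and the real connection manager likely tracks more state per connection.

```typescript
// The four fields every match uses (field names are assumed).
interface MatchKey {
  deploymentId: string;
  version: string;
  cpu: number;
  memory: number;
}

// Null-delimited composite key: "\0" should not appear in any of the fields,
// so two different field combinations cannot collide.
function compositeKey(k: MatchKey): string {
  return [k.deploymentId, k.version, k.cpu, k.memory].join("\0");
}

class ConnectionIndex {
  private byKey = new Map<string, Set<string>>();

  add(connectionId: string, k: MatchKey): void {
    const key = compositeKey(k);
    let ids = this.byKey.get(key);
    if (!ids) {
      ids = new Set<string>();
      this.byKey.set(key, ids);
    }
    ids.add(connectionId);
  }

  remove(connectionId: string, k: MatchKey): void {
    const key = compositeKey(k);
    const ids = this.byKey.get(key);
    if (!ids) return;
    ids.delete(connectionId);
    if (ids.size === 0) this.byKey.delete(key); // avoid accumulating empty sets
  }

  // Claim one waiting connection for this key, or undefined if none is idle.
  take(k: MatchKey): string | undefined {
    const ids = this.byKey.get(compositeKey(k));
    if (!ids) return undefined;
    for (const id of ids) {
      this.remove(id, k);
      return id; // Sets iterate in insertion order, so the oldest waiter wins
    }
    return undefined;
  }
}
```

Each match is one string join and one Map lookup, with no query planning or row materialization on the hot path.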

The results:

| Metric | SQLite | Map |
|---|---|---|
| Throughput | 2,099 req/s | 4,534 req/s |
| p50 latency | 22.5ms | 10.1ms |
| p95 latency | 29.1ms | 14.9ms |
| max latency | 619ms | 403ms |

2.2x throughput, 2.2x better median latency. And we could drop the --experimental-sqlite Node.js flag.

Phase 2: move to Bun

With SQLite gone, re-profiling showed 50%+ of CPU time in node:http internals: writev, socket management, stream handling. Node.js's HTTP stack has overhead that matters when you're holding thousands of concurrent long-poll connections.

We added a Bun entry point (bun.ts) using Bun.serve() with its native routing API. The connection manager was already transport-agnostic (we'd extracted it during the SQLite removal), so it was mostly wiring.
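As a rough idea of what such wiring can look like, the sketch below builds a route table in the shape `Bun.serve()` accepts under `routes` and hands it to the server. The `WorkBroker` interface, the `/health` and `/dequeue` paths, and the port are hypothetical; the post does not show the real `bun.ts`.

```typescript
// Hypothetical transport-agnostic interface the entry point depends on.
interface WorkBroker {
  waitForWork(): Promise<string>; // resolves with a serialized payload
}

// Route table keyed by path, with one handler per HTTP method.
function buildRoutes(broker: WorkBroker) {
  return {
    "/health": {
      GET: () => new Response("ok"),
    },
    "/dequeue": {
      // Long-poll: the response is only produced once the broker has work,
      // so the connection stays open until then.
      GET: async () => {
        const payload = await broker.waitForWork();
        return new Response(payload, {
          headers: { "content-type": "application/json" },
        });
      },
    },
  };
}

// Entry point wiring; only meaningful when run under Bun.
function startServer(broker: WorkBroker, port = 8080) {
  // Looked up on globalThis so the sketch does not require Bun's type definitions.
  const bun = (globalThis as any).Bun;
  if (!bun) throw new Error("This entry point must be run with Bun");
  return bun.serve({
    port,
    routes: buildRoutes(broker),
    // Fallback for anything the route table does not match.
    fetch: () => new Response("Not Found", { status: 404 }),
  });
}
```

Because the handlers only depend on the `WorkBroker` interface, the same broker can sit behind a `node:http` entry point for side-by-side benchmarking.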

Benchmarks with 500 controllers and 50 concurrent supervisor requests:

| Metric | Node.js (Map) | Bun.serve() |
|---|---|---|
| Throughput | 4,534 req/s | 9,434 req/s |
| p50 latency | 10.1ms | 4.5ms |
| p95 latency | 14.9ms | 7.4ms |
| max latency | 403ms | 22ms |

Another 2x on throughput and on p50 and p95 latency, with max latency dropping from 403ms to 22ms (the Bun numbers above already include the Phase 3 optimizations below).

Phase 3: profile and strip the hot path

Bun was faster out of the box, but we weren't done profiling. Bun has a --cpu-prof-md flag that outputs CPU profiles as markdown instead of Chrome DevTools format. The output is grep-friendly and readable without any tooling.

# Start with CPU profiling, markdown output
bun --cpu-prof --cpu-prof-md --cpu-prof-dir /tmp/bun-prof src/bun.ts

The output is a markdown table you can read in any editor:

| Self% | Self | Function | Location |
|------:|-------:|-------------------|----------------------|
| 22.0% | 87.2ms | _parse | zod/v3/types.js |
| 10.5% | 41.6ms | fromEntries | [native code] |
| 8.6% | 34.1ms | #structuredLog | structuredLogger |

Three clear hotspots:

Zod DequeuedMessage.safeParse() on every POST: 22% of CPU. We replaced it with minimal field presence checks for internal traffic.
Object.fromEntries(req.headers.entries()) on every GET: 10.5% of CPU. Replaced with direct req.headers.get() calls.
Debug logging even when filtered: 8.6% of CPU. The logger was serializing JSON even when debug was off.
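The first two fixes can be sketched as follows. This is an illustration under assumptions, not the production code: the `runId` and `payload` fields, the `x-*` header names, and both function names are hypothetical.

```typescript
// Hypothetical shape of an internal dequeue message.
interface DequeuedMessageLike {
  runId: string;
  payload: unknown;
}

// Field presence check for trusted internal traffic, standing in for a full
// schema parse. It verifies only what the hot path actually reads.
function isDequeuedMessage(body: unknown): body is DequeuedMessageLike {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return typeof b.runId === "string" && "payload" in b;
}

// Anything with a Headers-style get(); the web-standard Headers class satisfies it.
interface HeaderReader {
  get(name: string): string | null;
}

// Before: Object.fromEntries(req.headers.entries()) copied every header
// into a new object on every request.
// After: read only the headers that matching needs.
function readMatchHeaders(headers: HeaderReader) {
  return {
    deploymentId: headers.get("x-deployment-id"), // header names are assumed
    version: headers.get("x-deployment-version"),
    cpu: headers.get("x-machine-cpu"),
    memory: headers.get("x-machine-memory"),
  };
}
```

The trade-off is weaker validation, which is only reasonable because both ends of this traffic are internal services.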

Combined, these three fixes cut CPU usage by ~40% under identical load.
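For the logging fix, the general technique is to check the level before doing any serialization work. A minimal sketch, assuming a hypothetical `GatedLogger` class that takes its fields as a callback so filtered calls never build or stringify them:

```typescript
type Level = "debug" | "info" | "warn" | "error";

const LEVEL_ORDER: Record<Level, number> = { debug: 10, info: 20, warn: 30, error: 40 };

class GatedLogger {
  private minLevel: Level;
  private sink: (line: string) => void;

  constructor(minLevel: Level, sink: (line: string) => void) {
    this.minLevel = minLevel;
    this.sink = sink;
  }

  // `fields` is a thunk so that building the object and JSON.stringify
  // are both skipped when the level is filtered out.
  log(level: Level, message: string, fields?: () => Record<string, unknown>): void {
    if (LEVEL_ORDER[level] < LEVEL_ORDER[this.minLevel]) return; // cheap early exit
    const extra = fields ? fields() : {};
    this.sink(JSON.stringify({ level, message, ...extra }));
  }
}
```

With the minimum level set to `info`, a `debug` call costs one comparison instead of a full JSON serialization.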

Phase 4: compile to a single binary

Next: the runtime itself. Running bun build with the --compile flag produces a single self-contained executable.

| Metric | Interpreted | Compiled |
|---|---|---|
| Throughput | baseline | +14% |
| p95 latency | baseline | -24% |
| Image size | ~120MB | ~68MB |

We also tried --bytecode and found it actually hurt steady-state performance. Bytecode helps cold starts, but for a long-running server, the larger binary and extra memory mapping overhead make it slower.

The Bun memory leak

After deploying to production, the Grafana dashboard told two stories. CPU was down. But RSS was climbing fast. Yellow on the left is Node.js, stable at 192 MiB. Green climbing to 250 MiB is Bun with the leak. Blue on the right is the final vers

Source: Hacker News