A tail-call interpreter in (nightly) Rust

The author explores using the new 'become' keyword in nightly Rust to implement a tail-call interpreter for the Uxn CPU, achieving performance that surpasses even hand-coded ARM64 assembly.

Last week, I wrote a tail-call interpreter using the become keyword, which was recently added to nightly Rust (seven months ago is recent, right?).

It was a surprisingly pleasant experience, and the resulting VM outperforms both my previous Rust implementation and my hand-coded ARM64 assembly. Tailcall-based techniques have been all the rage recently; consider this my trip report implementing a simple but non-trivial system.

For those keeping track at home, this is the latest in my exploration of high-performance emulation of the Uxn CPU, which runs a bunch of applications in the Hundred Rabbits ecosystem.

Basics of Uxn emulation

Uxn is a simple stack machine with 256 instructions. The whole CPU has just over 64K of space, split between a few memories:

Two 256-byte stacks, each with an index byte
65536 bytes of RAM, which is used for both data and program text
A 2-byte program counter
256 bytes of "device memory", used for peripherals

The simplest emulator reads a byte from RAM at the program counter, then calls into an instruction (which may update the program counter):

fn run(core: &mut Uxn, dev: &mut Device, mut pc: u16) -> u16 {
    loop {
        let op = core.next(&mut pc);
        let Some(next) = core.op(op, dev, pc) else {
            break pc;
        };
        pc = next;
    }
}

Threaded code in assembly

In our assembly implementation, we can instead use threaded code (specifically token threading). We store all of the CPU state in registers, then end each instruction with a jump to the subsequent instruction. This distributes the dispatch operation across every opcode, making it easier for the branch predictor to learn sequences of opcodes in the program. Overall speedups were significant: 40-50% faster on ARM64, and about 2x faster on x86-64.

Unfortunately, it requires maintaining about 2000 lines of code, and is incredibly unsafe.

Tail calls in Rust

We'd like to get the same behavior as our assembly implementation — VM state stored in registers, dispatch at the end of each opcode — without hand-writing every instruction in assembly.

The core idea involves two pieces:

Store program state in function arguments, which are mapped to registers based on your system's calling convention.
End each function by calling the next function.

Unfortunately, without explicit tail calls, even in a release build, the compiler may not optimize out the stack, leading to a stack overflow.

In nightly Rust, this is a one-word fix using the become keyword. With this change in place, the Rust compiler makes a guarantee: When tail calling a function, instead of its stack frame being added to the stack, the stack frame of the caller is directly replaced with the callee’s.

match core.inc::<FLAGS>(pc) {
    Some(pc) => {
        let op = core.next(&mut pc);
        become TABLE.0[op as usize](
            core.stack.data,
            core.stack.index,
            core.ret.data,
            core.ret.index,
            core.dev,
            core.ram,
            pc,
            vdev,
        )
    }
    None => (core, pc)
}

That's it, everything works! The Rust compiler generates a br (branch to register) instead of a bl (branch-and-link) instruction and does not allocate any persistent space on the stack.

Source: Hacker News