WATaBoy: JIT-Ing Game Boy Instructions to WASM Beats a Native Interpreter

An exploration of WATaBoy, a Game Boy emulator that bypasses iOS JIT restrictions by compiling instructions to WebAssembly at runtime, achieving better performance than a native interpreter.

Background

This text assumes the reader is familiar with the concept of just-in-time compilation.

Dolphin isn’t on iOS, because you can’t do JIT compilation on iOS. That’s a quick summary of OatmealDome’s blog post “Why Dolphin Isn’t Coming to the App Store”. Ever since reading that, I’ve wondered what it would take to get a CPU-bound emulator like Dolphin working on iOS. Do we just... have to wait a few years for iPhone CPUs to get fast enough to run Dolphin with an interpreter?

Well, Apple has one exception to its JIT restrictions: web browsers. JavaScriptCore, WebKit’s JS engine, uses JIT compilation for its higher-performance tiers. So, if a JS function is called enough times, eventually it’ll be optimised and compiled into native machine code. The same is true for WebAssembly.

So, what if we just piggyback off of this? Instead of generating native machine code directly, we could just generate Wasm bytecode, which will eventually be compiled to native machine code by the web browser. After reading Andy Wingo's blog post "just-in-time code generation within webassembly", I knew such a thing would be possible. In fact, a handful of projects already use this technique, namely The Jiterpreter and v86, but at the time of writing, no emulators for game consoles have used it, and nobody has compared the performance to an interpreter running natively to see if it's faster.

So, for my undergraduate final-year project, I decided I’d build a Game Boy emulator, first using an interpreter, and then using a JIT-to-Wasm. This project primarily serves as a proof of concept and benchmark to compare the performance of each approach. For the rest of this blog post, I'll call this a “JIT-to-Wasm” instead of a “Wasm JIT” to avoid confusion with what the JS engine itself does (recompile Wasm to machine code).

Anyone reading this who knows a bit about emulation just rolled their eyes, because how the hell is a Game Boy emulator going to benefit from JIT compilation? Luckily, GameRoy’s blog post describes exactly how it’s possible while remaining cycle-accurate:

predict when interrupts are going to occur
whenever a JIT block might be interrupted, fall back to an interpreter
lazily evaluate any non-CPU Game Boy components accessed via MMIO
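
To make the first two points a bit more concrete, here's a minimal sketch of the kind of check a dispatcher could make before running a compiled block. It's written in JavaScript purely for illustration, and every name in it (chooseExecutionPath, maxCycles, cyclesUntilNextInterrupt) is hypothetical; it isn't taken from GameRoy or WATaBoy, and it assumes the emulator can predict how many cycles remain until the next interrupt.

```javascript
// Hypothetical sketch: decide whether a cached JIT block is safe to run.
// `block.maxCycles` is assumed to be the worst-case cycle count of the block.
// `cyclesUntilNextInterrupt` is assumed to come from interrupt prediction.
function chooseExecutionPath(block, cyclesUntilNextInterrupt) {
  // No compiled block for this address yet: interpret.
  if (block === undefined) {
    return "interpreter";
  }
  // Only run the block if it is guaranteed to finish strictly before the
  // next predicted interrupt; otherwise it might be interrupted midway.
  if (block.maxCycles < cyclesUntilNextInterrupt) {
    return "jit";
  }
  return "interpreter";
}
```

The comparison is deliberately conservative: when the block could end on the same cycle as the interrupt, this sketch falls back to the interpreter rather than reasoning about the edge case.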

GameRoy’s JIT only targets x86, but nearly all of its optimisation techniques still apply to our JIT-to-Wasm. Definitely check it out if you’re interested in the nitty-gritty details of the Game Boy emulation side of things; it was a huge inspiration.

Still, a Game Boy emulator doesn't benefit from JIT compilation as much as, say, a sixth-gen console. But it was much faster to make, and actually fit within the scope of my final-year project.

Implementation

Now, to narrow the scope of this blog post, I’ll take you through the most broadly applicable part of WATaBoy that I couldn't find a guide for anywhere else: Wasm codegen and late-linking from within Rust. A lot makes WATaBoy interesting, specifically from a Game Boy emulation perspective (e.g., SIMD tile rendering), but those implementation details deserve separate write-ups (you can also just read WATaBoy’s source, of course). If you aren’t interested, skip to the results.

Normally, we'd reach for tools like wasm-bindgen and wasm-pack to generate glue code between Rust and JavaScript. But those tools cause some ergonomics issues when working with Wasm at a low level. Instead, I use an approach similar to the one described in “Rust to WebAssembly the hard way”. This just means we'll pass data across the Rust-JS boundary via the C ABI, using pointers and buffer lengths instead of JavaScript objects.

Just a heads up, you’ll need Nightly Rust, because we'll use a tiny bit of inline Wasm later. So run:

rustup default nightly

To switch back, just run this again but swap ‘nightly’ for ‘stable’. Create a new library:

cargo new --lib jit-to-wasm

Hey look, we've already got some code here:

pub fn add(left: u64, right: u64) -> u64 {
    left + right
}

For our simple example, let’s try producing some Wasm bytecode at runtime that does the same thing.

Wasm code generation

The wasm-encoder crate will be our only dependency. With it, we can emit the bytes for Wasm instructions using a sort of builder pattern. It wasn’t designed for our JIT use case, so there are some ergonomics issues and a tiny bit of boilerplate, but it definitely beats writing an array of raw bytes by hand. :)

[package]
name = "jit-to-wasm"
version = "0.1.0"
edition = "2024"

[lib]
crate-type = ["cdylib"]  # Required to produce a .wasm file.

[dependencies]
wasm-encoder = "0.252.0"

Now, let’s use it to produce the bytecode for a Wasm module containing an ‘add’ function. Here comes that boilerplate I mentioned:

use wasm_encoder::*;

fn make_add_module() -> Vec<u8> {
    let mut module = Module::new();

    // Encode the type section for the add function.
    // Parameters: 32-bit int left, 32-bit int right.
    // Returns: 32-bit result.
    let mut types = TypeSection::new();
    let params = vec![ValType::I32, ValType::I32];
    let results = vec![ValType::I32];
    types.ty().function(params, results);
    module.section(&types);

    // Encode the function section.
    let mut functions = FunctionSection::new();
    let type_index = 0;
    functions.function(type_index);
    module.section(&functions);

    // Encode the export section.
    let mut exports = ExportSection::new();
    exports.export("my_add_func", ExportKind::Func, 0);
    module.section(&exports);

    // Encode the code section.
    let mut codes = CodeSection::new();
    let locals = vec![];
    let mut my_add_func = Function::new(locals);
    my_add_func
        .instructions()
        // Get the first 32-bit int onto the stack (left).
        .local_get(0)
        // Get the second 32-bit int onto the stack (right).
        .local_get(1)
        // Add the two ints together.
        .i32_add()
        .end();
    codes.function(&my_add_func);
    module.section(&codes);

    // Extract the encoded Wasm bytes for this module.
    module.finish()
}

This example is almost exactly the same as the one from wasm_encoder’s documentation. Alright, now how do we actually execute this bytecode?

#[unsafe(no_mangle)]
pub extern "C" fn make_and_execute_add(left: i32, right: i32) -> i32 {
    let add_bytecode = make_add_module();
    // Execute add ...somehow???
}

Compiling and linking

Harkening back to Wingo’s blog post, Wasm is a Harvard architecture rather than a von Neumann architecture. Practically speaking, this means we can’t directly execute the bytecode generated by our programme. For WebAssembly specifically, we have to reach out to the embedder (typically JavaScript) to compile, instantiate and link in our new Wasm bytecode. The jit-interface proposal may provide a way to do this directly in Wasm with a func.new instruction, but for now, we gotta talk to JavaScript.

First, we use the synchronous compilation interface to compile and instantiate our bytecode. (Compile & Instantiate)
Then, we add the function from our generated module to our main module’s indirect function table, and keep track of its index in the table so we can invoke it later. (Link)
Finally, we can actually execute the function using the call_indirect instruction, which calls the nth function in our indirect function table. (Dispatch)

Let’s imagine we’re already importing a function called "linkNewModule" that compiles, instantiates, and links a buffer of bytecode; we’ll implement the real thing in JavaScript later.

#[link(wasm_import_module = "env")]
unsafe extern "C" {
    // Returns the new function's index in the table.
    #[link_name = "linkNewModule"]
    fn link_new_module(buffer: *const u8, len: usize) -> i32;
}

Next, we implement our dispatch function to call the nth function in our indirect function table. All we really need to do is execute the call_indirect Wasm instruction. Normally when you want to do something like this, you'd reach for an intrinsic function in std::arch, but there isn't one for call_indirect. So we're going to have to use a tiny bit of inline WebAssembly.

This is an unstable feature, so you'll have to put this at the top of lib.rs:

#![feature(asm_experimental_arch)]

use std::arch::asm;

// Indirectly call the function at index in this module's function table.
fn dispatch(index: i32, left: i32, right: i32) -> i32 {
    let result: i32;
    unsafe {
        asm!(
            // Push the arguments in parameter order (left, then right),
            // followed by the table index, which call_indirect pops first.
            "local.get {left}",
            "local.get {right}",
            "local.get {index}",
            "call_indirect (i32, i32) -> (i32)",
            "local.set {result}",
            index = in(local) index,
            left = in(local) left,
            right = in(local) right,
            result = lateout(local) result,
        );
    }
    result
}

Putting it all together, this is what we have:

#[unsafe(no_mangle)]
pub extern "C" fn make_and_execute_add(left: i32, right: i32) -> i32 {
    let add_bytecode = make_add_module();
    let func_idx = unsafe { link_new_module(add_bytecode.as_ptr(), add_bytecode.len()) };
    dispatch(func_idx, left, right)
}

And one last thing: we have to pass a couple of flags to LLD using a build.rs file in the crate root. The first one, --export-table, exports our main Wasm module's indirect function table, so we can access it from the embedder (JS). The second one, --growable-table, lets us grow the table so we can append our JIT-compiled functions. This flag is totally undocumented, but it works, and there's a test for it, so...

fn main() {
    println!("cargo:rustc-link-arg=--export-table");
    println!("cargo:rustc-link-arg=--growable-table");
}

Alright, that's the Rust side of things done. Let's build our main Wasm module:

cargo build --release --target wasm32-unknown-unknown

The embedder (JavaScript) side of things

Now, let's try to call our make_and_execute_add function from the embedder:

// Instantiate the main Wasm module for the JIT itself.
const source = fetch("target/wasm32-unknown-unknown/release/jit_to_wasm.wasm");
const { instance } = await WebAssembly.instantiateStreaming(source);

// Generate an add function at runtime and use it to add 2 and 3 together.
const result = instance.exports.make_and_execute_add(2, 3);
console.log(result);

Console output:

TypeError: import env:linkNewModule must be an object

Ah right, we haven't implemented that linking function yet. Let's do that now:
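
A minimal sketch of such a linking function might look like the following. It assumes the main module exports its memory as `memory` and its table as `__indirect_function_table` (the name LLD gives the table when --export-table is passed), and that the generated module exports `my_add_func`, as in make_add_module. The helper names makeLinkNewModule and instantiateMain are made up for illustration.

```javascript
// Sketch: build the linkNewModule import. The getters are needed because the
// main module's exports only exist after it has been instantiated, while the
// import object has to be supplied before that.
function makeLinkNewModule(getMemory, getTable) {
  return function linkNewModule(ptr, len) {
    // View the generated bytecode inside the main module's linear memory.
    const bytes = new Uint8Array(getMemory().buffer, ptr, len);

    // Compile & instantiate synchronously, so an index can be returned
    // to Rust straight away.
    const module = new WebAssembly.Module(bytes);
    const instance = new WebAssembly.Instance(module, {});

    // Link: grow the table by one slot and store the new function there.
    // Table.grow returns the previous length, i.e. the index of the new slot.
    const table = getTable();
    const index = table.grow(1);
    table.set(index, instance.exports.my_add_func);
    return index;
  };
}

// Assumed wiring for the main module: pass linkNewModule under "env",
// then fill in `exports` once instantiation has finished.
async function instantiateMain(source) {
  let exports = null;
  const linkNewModule = makeLinkNewModule(
    () => exports.memory,
    () => exports.__indirect_function_table,
  );
  const { instance } = await WebAssembly.instantiateStreaming(source, {
    env: { linkNewModule },
  });
  exports = instance.exports;
  return instance;
}
```

With something like this in place, the earlier snippet would call `instantiateMain(source)` instead of `WebAssembly.instantiateStreaming(source)` directly. The memory view is created fresh on every call because the underlying buffer is replaced whenever the Wasm memory grows.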

With linkNewModule implemented and passed in under env, the call goes through and the console logs the sum of 2 and 3:

5


And that’s the basis of WATaBoy’s codegen, linking, and dispatch. I'm sure you can guess how you might modify the function's signature and instructions in make_add_module to generate more useful Wasm modules at runtime. In WATaBoy, our JIT recompiles and appends each non-branching Game Boy instruction to create a basic block (a Wasm module with a single execute_block function) that we can cache and re-execute later. If you're curious, check out how part of the Game Boy's instruction set is recompiled.

Results

Now, let’s compare performance between WATaBoy’s JIT compiler running in Wasm, its interpreter running in Wasm, and its interpreter running natively. For this benchmark, all three variations were set to load a game’s ROM and emulate the looping title screen at uncapped speed for 10 seconds of wall-clock time. The three ROMs I tested were: Pokémon Blue, The Legend of Zelda: Link’s Awakening, and Tobu Tobu Girl (a free homebrew platformer). Results are measured in total number of frames emulated within the 10-second timespan.

| Environment | MacBook Air (13-inch, M2, 2022) |
|---|---|
| Memory | 16 GB |
| OS | macOS 26.5 (25F71) |
| Safari | 26.5 (21624.2.5.11.4) |
| Chrome | 148.0.7778.168 |
| Firefox | 150.0.3 |
| Rust Compiler | 1.97.0-nightly (82bee9650 2026-05-09) |
| WATaBoy | Commit c06850a |

Each benchmark configuration was run for 5 iterations. Wasm benchmarks were conducted in a web browser with no other tabs open, and the tab that ran the benchmark was refreshed after each iteration. The total number of frames emulated was averaged across iterations, then divided by the number of frames real hardware would render in 10 seconds (the Game Boy's refresh rate of 59.73 Hz × 10 s ≈ 597 frames) to get the relative speed shown below.
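
In code, that calculation boils down to something like the sketch below. The function and constant names are my own for illustration, not something taken from WATaBoy's benchmark harness.

```javascript
// Game Boy refresh rate in frames per second, as used in this post.
const GAME_BOY_REFRESH_HZ = 59.73;

// Average the frame counts from each iteration, then divide by the number
// of frames real hardware would produce in the same amount of time.
function relativeSpeed(frameCounts, seconds = 10) {
  const total = frameCounts.reduce((sum, frames) => sum + frames, 0);
  const meanFrames = total / frameCounts.length;
  return meanFrames / (GAME_BOY_REFRESH_HZ * seconds);
}
```

A relative speed of 1 means the emulator kept pace with real hardware; for example, five iterations that each emulated 5,973 frames in 10 seconds would come out at roughly 10x.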

Nice! Emulating Pokémon Blue, the JIT-to-Wasm managed to be ~1.2x faster than the interpreter running natively, so we benefit from JIT, despite being an extra layer of indirection away from the native machine code. It’s also ~1.5x faster than the interpreter running in Wasm.

I also ran the benchmark across the three major browser engines just to see how they stack up.

Looks like Safari pulls ahead! Just to be clear, our JIT wasn’t intentionally tuned to any specific browser; most of the profiling during development was actually done in Firefox. So it's nice to know that iOS being WebKit-only doesn’t hold us back (at least in this case). :)

One good thing about this whole thing being web-based is that I can just put the demo right here in the blog post. One bad thing about it being so fast is that it may trigger seizures in people with photosensitive epilepsy ⚠️, so please be careful before pressing start.


Further work

WATaBoy

Audio and GBC support are the most prominent missing features.

In terms of performance, profiling shows that emulating the PPU still takes up most of WATaBoy's runtime, because there are still a few PPU interrupts that I haven't implemented prediction for. This causes the JIT to fall back to the interpreter more often than it actually needs to, so it'll be my main priority before optimising the JIT compiler any further.

Our JIT-to-Wasm clearly beats out our interpreter running natively, and these results possibly apply to other emulators as well, especially those which are heavily CPU-bound. But looking at the results critically, we have only shown that our basic-block JIT compiler beats our basic fetch-decode-execute interpreter.

The interpreter is fast, and a lot of time was spent optimising it, but there are still niche optimisation techniques (e.g., a cached interpreter) that might help it catch up with our basic block JIT compiler.

The same goes for optimising our JIT compiler as well. For example, recompiling branching instructions would mean we’d stay executing JIT blocks for longer and spend less time falling back to the interpreter and dispatching between blocks.

I think it would be interesting to compare their relative performance with further optimisations, and I plan to continue working on this project as a hobby until I’m pushing the limits of both approaches. And if you know about cycle-accurate Game Boy emulation and you’d like to contribute, or if you're just curious, check out the project on GitHub.

JIT-to-Wasm in general

I'd argue that right now, the main pain point with JIT-ing to Wasm is codegen. Every project I've seen so far uses its own bespoke tooling for generating Wasm bytecode, and none of that tooling is as ergonomic or robust as tools like DynASM or Cranelift. For this technique to see more widespread adoption, emulator developers will probably want some way to write strings of human-readable WAT that get translated into bytecode at compile time, in the same way that DynASM translates ARM/x86 assembly into machine code.

It’s also worth acknowledging another limitation to this approach. There’s no way to do a few of the lower-level optimisations Dolphin relies on. For example, Dolphin's hardware fastmem wouldn't work since any invalid memory accesses are irrecoverable within the Wasm runtime.

Conclusion

This doesn't necessarily prove that a GameCube emulator would run at full speed just by implementing a JIT-to-Wasm. But given that even the basic block JIT was able to outperform an interpreter, I think it’s an avenue worth exploring. And hopefully, with more mature tooling for codegen, Wasm could become a common target for faster cross-platform console emulation, especially on iOS.

You can try out the full version of WATaBoy with your own ROMs. Yes, I know the interface looks more like Universal Paperclips than an emulator; that’s O.K. The primary focus is its fancy implementation details rather than its design. :)

Thanks for reading!

Source: Hacker News