Ohm's Peg-to-WASM Compiler

Ohm v18 introduces a major performance breakthrough by shifting from AST interpretation to WebAssembly compilation, achieving 50x faster parsing and 90% lower memory usage.

Ohm is a user-friendly parsing toolkit for JavaScript and TypeScript. You can use it to parse custom file formats or quickly build parsers, interpreters, and compilers for programming languages. Learn more

A few weeks ago, we announced the Ohm v18 beta, which involved a complete rewrite of the core parsing engine. Since then, we've implemented even more performance improvements: v18 is now more than 50x faster for real-world grammars while using about 10% of the memory.

The new parsing engine works by compiling an Ohm grammar — which is a form of parsing expression grammars, or PEG — into a WebAssembly module that implements a parser. In this post, we'll dive into the technical details of how that works, and talk about some of the optimizations that made it even faster.

PExpr trees

In previous versions of Ohm (up to and including v17), the parsing engine used an approach called AST interpretation. Here's how that works.

When you instantiate a grammar with Ohm, it parses your grammar and converts it to an abstract syntax tree. You can think of this tree as a kind of program, which describes a parser for the language. The nodes of the tree are parsing expressions, or PExprs as they're called in the source code.

We'll use the following grammar as an example:

JSONLike { Value = Object | "true" | "false" | "null" Object = "{" Members? "}" Members = Member ("," Member)* Member = string ":" Value string = """ (~""" any)* """ }

This grammar is parsed by Ohm (using the "grammar grammar", or metagrammar, defined in ohm-grammar.ohm). The result is a Map<string, PExpr>:

{ 'Value' => Alt(Apply('Object'), Term('true'), Term('false'), Term('null')), 'Object' => Seq(Term('{'), Opt(Apply('Members')), Term('}')), ... }

You can think of each rule ('Value', 'Object', etc.) as being like a function, and the function bodies are parsing expressions. Alt, Apply, Opt, Seq, and Term are all subclasses of the abstract PExpr class, and they all have an eval method. These methods are mostly pretty small and straightforward — here's the implementation for Alt:

pexprs.Alt.prototype.eval = function (state) { for (let idx = 0; idx < this.terms.length; idx++) { if (state.eval(this.terms[idx])) { return true; } } return false; };

To evaluate an Alt, we just recursively evaluate its children. If one of them succeeds, then the Alt succeeds; otherwise, it fails. This approach is straightforward, but not very performant.

Wasm compilation

At a high level, the v18 code has two parts:

Runtime support code written in AssemblyScript.
The WebAssembly codegen piece, which is written in TypeScript.

The code generation phase begins with the same PExpr tree as v17, but instead of interpreting it, we compile it to WebAssembly. Then we link the generated WebAssembly code with the runtime support code to produce the final Wasm module.

Here's what that code generation looks like for Alt:

function emitAlt(exp: ir.Alt): void { const {asm} = this; const saved = asm.maybeSaveBacktrackPoint(); asm.block(w.blocktype.empty, () => { for (const term of exp.children) { this.emitPExpr(term); asm.localGet('ret'); asm.condBreak(asm.depthOf('pexprEnd')); saved.pos.restore(); } }); }

Unsurprisingly, it has a similar structure: we loop over all the children, and emit the code for each child. But, notice that this is a compile-time loop, not a run-time one. So the structure of the final code, expressed as pseudocode, looks like this:

try matching terms[0] succeeded? return true try matching terms[1] succeeded? return true return false

Note that in the generated WebAssembly code, we're also not dispatching to any kind of generic eval function — we just inline the code for each individual expression.

Building syntax trees

So far we've described how v18 compiles a recognizer. But to do something useful with a valid input, we need to produce some kind of parse tree — or concrete syntax tree (CST), as they're called in Ohm.

In v17, CST nodes are regular JavaScript objects, allocated on the heap and managed by the garbage collector. From a memory management perspective, they have a few interesting properties:

The nodes themselves are relatively small, so the per-node memory management overhead is relatively large.
There are a large number of nodes (around one per input character).
All nodes generally have the same lifetime.

Bump allocation into Wasm linear memory

In v18 we use a bump allocator (provided by AssemblyScript's stub runtime) to allocate CST nodes in Wasm linear memory. This has lower overhead than heap-allocated JavaScript objects (only one 32-bit header field per object, vs 3–4 in most JS engines). We consider all CST nodes to be owned by the MatchResult they are associated with, so when the MatchResult is freed, we also reclaim the memory from all its CST nodes.

Node layout

Terminal nodes

Terminals are the most important thing to optimize, since a typical tree has approximately one terminal node per input character. So, rather than allocating a separate node for each terminal, we use a tagged 32-bit value: (matchLength << 1) | 1.

A regular reference to a full CST node is always 4-byte aligned: the offset is a multiple of 4, and thus the two low bits are always 0. So if the low bit is set, we can detect that it's not a true reference, and instead use the upper 31 bits as the payload — and for terminal nodes, the only thing we need to store is how many characters of input were consumed.

Source: Hacker News