What Are Skiplists Good For?

Once considered a niche data structure, skiplists proved their worth by solving complex tree traversal problems in BigQuery. This article explores how a 'skiptree' variant optimized data processing at Antithesis.

A while back, I joined Phil Eaton’s book club on The Art of Multiprocessor Programming, and the topic of skiplists came up.

For most of my career, skiplists had always seemed like a niche data structure, with a rabid cult following but not a whole ton of applicability to my life. Then six or so years ago, we encountered a problem at Antithesis that seemed intractable until it turned out that a generalization of skiplists was exactly what we needed.

Before I tell you about that, though, let me explain what skiplists are (feel free to skip ahead if you already know them well).

A skiplist is a randomized data structure that’s basically a drop-in replacement for a binary search tree, with the same interface and the same expected asymptotic complexity on each of its operations. Some people like them because you can produce relatively simple and understandable lock-free concurrent implementations, and others like them as a matter of taste, or because they enjoy listening to bands that you’ve totally never heard of.

In implementation terms, you can think of them roughly as linked lists plus “express lanes”.

You start with a basic linked list, and then add a hierarchy of linked lists with progressively fewer nodes in them. The nodes in the higher-level lists are chosen probabilistically, with each node typically having a 50% chance of being promoted to the next level.

This helps with search, because you can use the higher-level lists to skip more quickly to the node you want.

Say you’re looking for the node with an ID of 38. You start at the top level and work downwards: at each level you advance until the next node would have an ID that’s too high, then jump down a level.

In a regular linked list of n nodes, finding a node would take O(n) time, because you’re walking through the nodes one by one. Skiplists let you jump levels, with each level roughly halving the number of nodes you need to check, so you end up finding the node in expected O(log n) time.
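If you'd rather see the search as code, here is a minimal single-threaded sketch in Python. The class names, the MAX_HEIGHT cap and the seed parameter are my own choices for illustration and don't come from any particular library; the 50% promotion chance follows the description above.

```python
import random


class Node:
    def __init__(self, key, height):
        self.key = key
        # next[i] is this node's successor in the level-i list.
        self.next = [None] * height


class SkipList:
    MAX_HEIGHT = 16  # assumed cap on the number of levels

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        # Sentinel head node; it is present on every level and holds no key.
        self.head = Node(None, self.MAX_HEIGHT)

    def _random_height(self):
        # Each node has a 50% chance of being promoted one more level.
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < 0.5:
            height += 1
        return height

    def _predecessors(self, key):
        # For every level, find the last node whose key is smaller than `key`,
        # starting at the top level and working downwards.
        preds = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in reversed(range(self.MAX_HEIGHT)):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]  # advance along the express lane
            preds[level] = node          # next key would be too high: drop down
        return preds

    def insert(self, key):
        preds = self._predecessors(key)
        new = Node(key, self._random_height())
        # Splice the new node into each list it was promoted to.
        for level in range(len(new.next)):
            new.next[level] = preds[level].next[level]
            preds[level].next[level] = new

    def contains(self, key):
        candidate = self._predecessors(key)[0].next[0]
        return candidate is not None and candidate.key == key


skiplist = SkipList(seed=1)
for key in [3, 7, 12, 19, 25, 38, 41, 56]:
    skiplist.insert(key)
print(skiplist.contains(38))  # True
print(skiplist.contains(39))  # False
```

The seed only affects which nodes get promoted, so it changes how fast a lookup is, not what it returns.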

This is all very nice, but after reading about this data structure I literally never thought about it again, until one day we encountered the following problem at Antithesis…

Antithesis runs customers’ software many times to look for bugs. Each time, our fuzzer injects different faults and tells your testing code to make different random decisions. Over many runs, these choices create a branching tree of timelines: each path from root to leaf represents one sequence of choices the fuzzer made and what happened as a result.

There were a lot of queries that we wanted to do which basically amounted to fold operations up or down this tree. For example, given a particular log message, what’s the unique history of events that led to it? (Walk up the parent pointers from that node to the root.)

The trouble was that the amount of data output by the software we were testing was so huge, we had to throw it all into an analytic database, and at the time we were using Google BigQuery. Analytic databases are optimized for scanning massive amounts of data in parallel to compute aggregate results. The tradeoff is that they’re slow at point lookups, where you fetch a specific row by its ID.

This matters, because the natural way to represent a tree in a database is with parent pointers – each node is a row in the table, with a parent_id column pointing to its parent. To answer a question like “show me the history leading to this log message”, you’d need to walk up the tree one node at a time: look up the node, get its parent ID, look up the parent node, and so on. Each step is a point lookup. In an OLTP database designed for point lookups, that’s fine. But in BigQuery, basically every operation results in a full table scan, which means even the most basic queries would end up doing O(depth) reads over your entire data set. Yikes!
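To make the cost concrete, here is a small Python sketch of the parent-pointer walk. The dictionary is a stand-in for the table (node ID to parent_id), and the function name and the lookup counter are hypothetical, added only to show that the number of lookups grows with the depth of the node.

```python
def history(parent_of, node_id):
    """Walk parent pointers from node_id up to the root.

    Returns the path (starting at node_id) and the number of point lookups
    it took. In an analytic database each of those lookups could turn into
    a scan of the whole table.
    """
    path = []
    lookups = 0
    current = node_id
    while current is not None:
        path.append(current)
        current = parent_of[current]  # one point lookup per step
        lookups += 1
    return path, lookups


# A tiny timeline tree: root -> a -> b -> c, plus a sibling branch a -> d.
parent_of = {"root": None, "a": "root", "b": "a", "c": "b", "d": "a"}
print(history(parent_of, "c"))  # (['c', 'b', 'a', 'root'], 4)
```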

One alternative would have been to split the data: store just the tree structure (the parent pointers) in a database that’s good at point lookups, and keep the bulk data in BigQuery. But this approach would have created other problems. Every insert would need to write to both systems, and since we wanted to analyze the data online (while new writes are streaming in), keeping the two databases consistent would require something like two-phase commit (2PC). I prefer not to invent new 2PC problems where I don’t need them. And anyway, at the time BigQuery had very loose consistency semantics, so it wasn’t even clear that keeping the two systems in sync would have been possible.

Skiplists to the rescue! Or rather, a weird thing we invented called a “skiptree”… Well, it’s like a skiplist, but it’s a tree.

To store the skiptree, you create a SQL table for each level: tree0, tree1, and so on. Each table has a row for every node in that tree. Instead of having a single parent_id column, it has a column for the closest ancestor node in the tree above (we’ll call that next_level_ancestor) and another column (call it ancestors_between) with a list of all nodes between the current node and the next-level ancestor.

You can use these tables to find the ancestors of a node by chaining together JOINs, working your way up the tables. Now we can find ancestors with a single non-recursive SQL query with a fixed number of JOINs. We just had to do… 40 or so JOINs.
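Here is a rough in-memory model of that layout in Python, to show how the lookup chains through the levels. It is a sketch under several assumptions of mine, not the actual Antithesis schema or the generated SQL: each dictionary stands in for one treeN table, nodes are promoted to the next level with 50% probability (borrowed from the skiplist description), ancestors_between is read as every node of the original tree strictly between a node and its next-level ancestor, and a row with no next-level ancestor simply lists all remaining ancestors up to the root. The names build_skiptree, ancestors and Row are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Row:
    next_level_ancestor: object  # closest proper ancestor in the level above, or None
    ancestors_between: list      # nodes strictly between, nearest first


def build_skiptree(parent_of, num_levels, seed=0):
    """Build one table (dict of node id -> Row) per level from parent pointers."""
    rng = random.Random(seed)
    # levels[k] is the set of nodes present in tree k; tree 0 holds every node.
    levels = [set(parent_of)]
    for _ in range(1, num_levels):
        levels.append({n for n in sorted(levels[-1]) if rng.random() < 0.5})
    levels.append(set())  # nothing exists above the top level

    tables = []
    for k in range(num_levels):
        table = {}
        for node in levels[k]:
            between = []
            current = parent_of[node]
            # Climb until we reach a node that was promoted to level k + 1.
            while current is not None and current not in levels[k + 1]:
                between.append(current)
                current = parent_of[current]
            table[node] = Row(current, between)
        tables.append(table)
    return tables


def ancestors(tables, node):
    """Collect all proper ancestors of node with one lookup per level.

    The loop over tables plays the role of the fixed chain of JOINs.
    """
    result = []
    current = node
    for table in tables:
        row = table[current]
        result.extend(row.ancestors_between)
        if row.next_level_ancestor is None:
            break  # reached the root
        result.append(row.next_level_ancestor)
        current = row.next_level_ancestor
    return result


parent_of = {1: None, 2: 1, 3: 2, 4: 3, 5: 3}
tables = build_skiptree(parent_of, num_levels=3)
print(ancestors(tables, 4))  # [3, 2, 1]
```

Whichever nodes happen to get promoted, the result is the same list of ancestors; the promotion only decides how the path is divided among the levels. The number of lookups is bounded by the number of tables rather than by the depth of the tree, which is the property the fixed chain of JOINs relies on.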

Best of all, at the time BigQuery’s pricing charged you for the amount of data scanned, rather than for compute, and the geometric distribution of table sizes meant that each of these queries only cost about twice as much as a normal table scan. Of course, there were disadvantages, like the SQL itself. The textual size of these queries was often measured in kilobytes. But we didn’t write the SQL. We wrote a compiler in JavaScript that generated it. And that is how most test properties in Antithesis were evaluated for the first six years of the company.
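The "twice a normal table scan" figure is just a geometric series. As a back-of-the-envelope check, assuming each table is about half the size of the one below it (my assumption, matching a 50% promotion rate), scanning every level costs 1 + 1/2 + 1/4 + … times the cost of scanning the bottom table, which never reaches 2. The function name and parameters below are hypothetical.

```python
def relative_scan_cost(num_levels, promote_probability=0.5):
    """Total data scanned across all levels, relative to scanning tree0 once."""
    return sum(promote_probability ** k for k in range(num_levels))


print(relative_scan_cost(1))        # 1.0
print(relative_scan_cost(3))        # 1.75
print(relative_scan_cost(40) < 2)   # True
```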

Source: Hacker News