An ode to bzip

Despite its obscurity compared to modern algorithms like xz and zstd, bzip offers superior compression ratios for text-based data due to its unique BWT-based approach, making it a surprisingly efficient choice.

The story goes like this. ComputerCraft is a mod that adds programming to Minecraft. You write Lua code that gets executed by a bespoke interpreter with access to world APIs, and now you’re writing code instead of having fun. Computers have limited disk space, and my /nix folder is growing out of control, so I need to compress code.

The laziest option would be to use LibDeflate, but its decoder is larger than both the gains from compression and my personal boundary for copying code. So the question becomes: what’s the shortest, simplest, most ratio-efficient compression algorithm?

I initially thought this was a complex question full of tradeoffs, but it turns out it’s very clear-cut. My answer is bzip, even though this algorithm has been critiqued multiple times and has fallen into obscurity since xz and zstd became popular.

First look

I’m compressing a 327 KB file that contains Lua code with occasional English text sprinkled in comments and documentation. This is important: bzip excels at text-like data rather than binary data. However, my results should be reproducible on other codebases, as the percentages seem to be mostly constant within that category.

Let’s compare multiple well-known encoders on this data (output size in bytes):

zopfli --i100: 75882
zstd -22 --long --ultra: 69018
xz -9: 67940
brotli -Z: 67859 (recompiled without a dictionary)
lzip -9: 67651
bzip2 -9: 63727
bzip3: 61067

The bzip family is a clear winner by a large margin. It even beats lzip, whose docs say “‘lzip -9’ compresses most files more than bzip2” (I guess code is not “most files”). How does it achieve this? Well, it turns out that bzip is not like the others.

Algorithms

You see, all other popular compression algorithms are actually the same thing at the core. They’re all based on LZ77, a compression scheme that boils down to replacing repetitive text with short links to earlier occurrences.

The main difference is in how literal strings and backreferences are encoded as bit streams, and this is highly non-trivial. Since links can have wildly different offsets, lengths, and frequencies from location to location, a good algorithm needs to predict and succinctly encode these parameters.
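To make the “short links” idea concrete, here is a minimal sketch of expanding LZ77-style tokens, written in Python rather than Lua for brevity. The token shape (a bytes literal, or an (offset, length) pair) is an assumption made up for illustration; every real format packs these into bits differently, which is exactly the hard part described above.

```python
def lz77_decode(tokens):
    """Expand a list of tokens into bytes.

    Each token is either a bytes literal or an (offset, length) pair meaning
    "copy `length` bytes starting `offset` bytes back from the current end".
    This token layout is hypothetical, not taken from any real format.
    """
    out = bytearray()
    for token in tokens:
        if isinstance(token, bytes):
            out += token
        else:
            offset, length = token
            start = len(out) - offset
            # Copy one byte at a time so that overlapping copies
            # (length > offset) repeat the bytes that were just written.
            for i in range(length):
                out.append(out[start + i])
    return bytes(out)


# "hello" is stored once; the second occurrence is a link 13 bytes back.
print(lz77_decode([b"hello world, ", (13, 5), b"!"]))
# An overlapping copy turns a 2-byte seed into a longer repetition.
print(lz77_decode([b"ab", (2, 6)]))
```

The decoding loop really is this small; the size of a real decoder comes from turning a bit stream into these tokens in the first place.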

But bzip does not use LZ77. bzip uses BWT, which reorders characters in the text to group them by context – so instead of predicting tokens based on similar earlier occurrences, you just need to look at the last few symbols. And, surprisingly, with the BWT order, you don’t even need to store where each symbol came from!
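The claim that no positions need to be stored can be checked with a small round trip. This is a minimal sketch in Python, not how any production encoder does it: it assumes a naive sort of all rotations and a "\0" sentinel that never occurs in the input, whereas bzip2 stores the index of the original row instead, and real encoders use suffix sorting rather than materializing every rotation.

```python
def bwt(text):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    s = text + "\0"  # sentinel, assumed to be absent from the input
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last):
    """Rebuild the text from the last column alone."""
    # A stable sort of positions by symbol reconstructs the first column and
    # tells us, for each row, which row holds the same text rotated left by one.
    order = sorted(range(len(last)), key=lambda i: last[i])
    row = last.index("\0")  # the row that is the original text plus sentinel
    out = []
    for _ in range(len(last) - 1):
        row = order[row]
        out.append(last[row])
    return "".join(out)


encoded = bwt("banana")
print(repr(encoded))
print(inverse_bwt(encoded))
```

The inverse only ever looks at the transformed string itself, which is why the decoder side of BWT can stay so short.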

For example, if the word hello is repeated in text multiple times, with LZ77 you’ll need to find and insert new references at each occurrence. But with BWT, all continuations of hell are grouped together, so you’ll likely just have a sequence of many os in a row, and similarly with other characters, which simple run-length encoding can deal with.
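The grouping effect is easy to observe on a toy input. The sketch below (Python, with a naive rotation sort and a "\0" sentinel as assumptions) counts runs before and after the transform. One caveat: in this textbook formulation of BWT, the grouped symbols are the ones that precede each sorted context, so the long run shows up as hs in front of ello rather than os after hell; the effect on run lengths is the same.

```python
from itertools import groupby


def bwt(text):
    """Naive BWT via sorted rotations; "\0" is an assumed-unused sentinel."""
    s = text + "\0"
    rows = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rows)


def run_lengths(s):
    """Collapse a string into (symbol, run length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


# A made-up sample with twelve occurrences of "hello".
text = "hello there, hello again, hello world, " * 4
before = run_lengths(text)
after = run_lengths(bwt(text))
longest_h = max(n for symbol, n in after if symbol == "h")

print(len(before), "runs before BWT,", len(after), "runs after")
print("longest run of 'h':", longest_h)
```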

BWT comes with some downsides. For example, if you concatenate two texts in different English dialects, e.g. using color vs colour, BWT will mix the continuations of colo in an unpredictable order and you’ll have to encode a weird sequence of rs and us, whereas LZ77 would prioritize recent history. You can remedy this by separating input by formats, but for consistent data like code, it works just fine as is.

bzip2 and bzip3 are both based on BWT and differ mostly in how the BWT output is compressed. bzip2 uses a variation on RLE, while bzip3 tries to be more intelligent. I’ll focus on bzip2 for performance reasons, but most conclusions apply to bzip3, too.

Heuristics

There is another interesting thing about BWT. You might have noticed that I’m invoking bzip3 without passing any parameters like -9. That’s because bzip3 doesn’t take them. In fact, even invoking bzip2 with -9 doesn’t do much.

LZ77-based methods support different compression levels because searching for earlier occurrences is time-consuming, and sometimes it’s preferable to use a literal string instead of a difficult-to-encode reference, so some brute-force search is involved. BWT, on the other hand, is entirely deterministic and free of heuristics.

Furthermore, there is no degree of freedom in determining how to efficiently encode the lengths and offsets of backreferences, since there are none. There are run lengths, but that’s about it – it’s a single number, and it’s smaller than typical offsets.

All of that is to say: if you know what the bzip2 pipeline looks like, you can quickly achieve similar compression ratios without fine-tuning and worrying about edge cases. My unoptimized ad-hoc bzip2-like encoder compresses the same input to about 67 KB – better than lzip and with clear avenues for improvement.

Decoders

That covers the compression format, but what about the size of the decoder? Measuring ELFs is useless when targeting Lua, and Lua libraries like LibDeflate don’t optimize code size for self-extracting archives, so at risk of alienating readers with fancy words and girl math, I’ll have to eyeball this for everything but bzip2.

A self-extracting executable doesn’t have to decode every archive – just one. We can skip sanity checks and headers, inline metadata into code, and tune the format for easier decoding. As such, I will only look at the core decompression loops.

gzip, zstd, xz, brotli, and lzip all start by doing LZ77. Evaluating “copy” tokens is a simple loop that won’t take much code. Where they differ is in how those tokens are encoded into bits:

gzip does some light pre-processing and then applies Huffman coding, which assigns unambiguous bit sequences to tokens and then concatenates them, optimizing for total length based on the token frequency distribution. Huffman codes can be parsed in ~250 bytes, the bit trie might take ~700 bytes, and the glue should fit in ~500 bytes. Let’s say 1.5 KB in total.

xz encodes tokens bit-by-bit instead of treating them as atoms, which allows the coder to adjust probabilities dynamically, yielding good ratios without encoding any tables at the cost of performance. Bit-by-bit parsing will take more space than usual, but avoiding tables is a huge win, so let’s put it at 1 KB.

lzip is very similar to xz, only slightly changing token encodings, so let’s put it at 1 KB as well.

zstd complicates the pre-processing step and uses Finite State Entropy instead of Huffman coding, which effectively allows tokens to be encoded with fractional bit lengths. FSE is simple, but requires large tables, so let’s say ~2000 bytes for storing and parsing them. Adding glue, we should get about 3 KB.

brotli keeps Huffman coding, but switches between multiple static Huffman tables on the fly depending on context. I couldn’t find the exact count, but I get 7 tables on my input. That’s a lot of data that we can’t just inline – we’ll need to encode it and parse it. Let’s say ~500 bytes for the parser and ~100 bytes per table. Together with the rest of the code, we should get something like 2.2 KB.

For bzip decoders, BWT can be handled in ~250 bytes. As for the unique parts:

bzip2 compresses the BWT output with MTF + RLE + Huffman. With the default 6 Huffman tables, let’s assign ~1.5 KB to all Huffman-related code and data and ~400 bytes for MTF, RLE, and glue.

bzip3 uses XZ-like bit-by-bit coding with context mixing instead. Let’s say 1 KB for the former and ~500 bytes for the latter.
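For a feel of why MTF and RLE need so little code, here is a simplified sketch of both stages in Python. It is not bzip2’s actual format: the ("zeros", n) token is a hypothetical stand-in for bzip2’s real zero-run encoding (RUNA/RUNB symbols in bijective base 2), and the alphabet is passed in explicitly rather than read from a header.

```python
def mtf_encode(data, alphabet):
    """Move-to-front: emit each symbol's current rank, then move it to rank 0."""
    table = list(alphabet)
    out = []
    for symbol in data:
        index = table.index(symbol)
        out.append(index)
        table.insert(0, table.pop(index))
    return out


def mtf_decode(indices, alphabet):
    """Inverse of mtf_encode: look up each rank, then move that symbol to the front."""
    table = list(alphabet)
    out = []
    for index in indices:
        symbol = table.pop(index)
        out.append(symbol)
        table.insert(0, symbol)
    return "".join(out)


def collapse_zero_runs(indices):
    """Replace each run of zeros with one ("zeros", n) token (hypothetical layout)."""
    out = []
    for index in indices:
        if index == 0 and out and isinstance(out[-1], tuple):
            out[-1] = ("zeros", out[-1][1] + 1)
        elif index == 0:
            out.append(("zeros", 1))
        else:
            out.append(index)
    return out


def expand_zero_runs(tokens):
    """Inverse of collapse_zero_runs."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            out.extend([0] * token[1])
        else:
            out.append(token)
    return out


# Runs of equal symbols (typical BWT output) become runs of zeros after MTF.
ranks = mtf_encode("aaaabbbbaaaa", "ab")
tokens = collapse_zero_runs(ranks)
print(ranks)
print(tokens)
print(mtf_decode(expand_zero_runs(tokens), "ab"))
```

The decoder only needs the two inverse functions, and what remains after them is a short stream of small numbers for the Huffman stage to handle.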

Point is: by dropping compatibility with standard file formats, the decoder can become very small. I might be wrong on some of these figures, but it most likely won’t switch things up significantly.
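As an illustration of how small a format-free decoder step can get, here is a sketch of decoding a prefix code whose table is written straight into the source instead of being parsed from a header. It is in Python, and the code table, the bit-string input, and the explicit symbol count are all assumptions for the example; a real decoder would read packed bits and use a table derived from the actual symbol frequencies.

```python
def huffman_decode(bits, codes, count):
    """Decode `count` symbols from a string of '0'/'1' characters.

    `codes` maps bit strings to symbols and must be prefix-free, so the first
    match while reading left to right is always the right one.
    """
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in codes:
            out.append(codes[current])
            current = ""
            if len(out) == count:
                break
    return out


# Hypothetical table, inlined rather than stored in the archive.
CODES = {"0": "a", "10": "b", "110": "c", "111": "d"}

print(huffman_decode("0100110111", CODES, 5))
```

With one inlined table there is nothing to parse up front, which is the kind of saving the estimates above rely on.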

bzip-style methods are in the middle of the pack, but that’s somewhat misleading. While bzip2 typically uses 6 Huffman tables, I got good compression results with just one. With a single table, my bzip-style decoder fits in 1.5 KB, which is smaller than everything but xz and lzip, while being faster an

Source: Hacker News