Gzip decompression in 250 lines of Rust

A deep dive into how compression works by building a functional Gzip decompressor from scratch using only 250 lines of Rust code.

i wanted to have a deeper understanding of how compression actually works, so i wrote a gzip decompressor from scratch. the result is about 250 lines of rust that can decompress gzip from a file or stdin.

why bother?

gzip is everywhere. it compresses your web traffic, your log files, your documentation / man pages, it is the format commoncrawl stores 300 billion pages and hundreds of terabytes of web archives in. it sits invisibly between disks, networks, and CPUs, quietly shaving off bytes at planetary scale, so boringly reliable that the browser won't bother telling you itâs there at all. it's a fundamental tool in the software ecosystem. so let's read it, the first thing to do is clone zlib and start reading through the source code, right? well lets first see how long it is:

/mnt/g/repos/zlib
â¯ fd -e c -0 | xargs -0 cat | wc -l
25569

twenty five thousand lines of pure C not counting CMake files. (and whenever working with C always keep in mind that C stands for CVE).

ok maybe we can find a smaller implementation. perhaps zlib-rs is more digestible:

/mnt/g/repos/zlib-rs
â¯ fd -e rs -0 | xargs -0 cat | wc -l
36003

thirty six thousand lines of rust. that's a bit more than i wanted to read through.. actually there is a smaller implementation called miniz which is only 1261 lines of C if you combine the header and source:

/mnt/g/repos/miniz
â¯ cat miniz.c miniz.h | wc -l
1261

the problem is just looking at miniz is not going to give you a good understanding of how gzip works. it's a single file, but it's still 1200 lines of code with a lot of optimizations and c preprocessor string substitutions stitched together. i wanted something even simpler, something that just implements the core ideas without worrying about checksums or features that aren't used in practice.

the gzip wrapper

gzip itself is just a thin wrapper around the DEFLATE algorithm (the same one used by .zip files and .png files). the file starts with a magic number (0x1F 0x8B), followed by some flags and metadata, and then the compressed data. parsing it is straightforward:

assert!(buf[0] == 0x1F && buf[1] == 0x8B, "not a gzip file");
let mut offset = 10; // skip the fixed header
if buf[3] == 8 { // FNAME flag - filename is present, find null byte to skip it
offset += buf[10..].iter().position(|&b| b == 0).unwrap() + 1;
}

the only flag we care about is FNAME, which contains the name of the file that was compressed. we skip it by looking for the next null byte as is standard in C. the real work happens in the inflate function that processes the DEFLATE stream.

DEFLATE blocks

DEFLATE data is organized into blocks. each block starts with a bit indicating whether it's the last block, followed by a 2-bit type:

type 0: stored (uncompressed) data
type 1: compressed with fixed huffman codes
type 2: compressed with dynamic huffman codes

stored blocks are trivial - just copy the bytes directly. the interesting stuff happens in fixed and dynamic blocks.

reading bits

before we get to huffman codes, we need a way to read individual bits from the input stream. DEFLATE packs bits from least significant bit to most significant bit within each byte. we maintain a buffer of bits we've read but not yet consumed. when we need more bits than we have, we read another byte and shift it into place.

huffman coding

the insight of huffman coding is that common symbols should use fewer bits. if 'e' appears 1000 times and 'z' appears twice, it's wasteful to use 8 bits for both. huffman assigns shorter codes to frequent symbols and longer codes to rare ones. DEFLATE uses canonical huffman codes, which means the codes are determined entirely by the bit lengths assigned to each symbol.

fixed vs dynamic codes

fixed huffman blocks use a predefined set of code lengths. dynamic blocks include their own huffman tables at the start of the block. this adds overhead, but allows the compressor to tailor the codes to the actual data. the meta-twist is that the code lengths themselves are huffman encoded.

LZ77: back-references

huffman coding alone can't achieve great compression. the real power comes from LZ77, which replaces repeated sequences with back-references. when the decoder sees a literal symbol (0-255), it outputs that byte directly. but symbols 257-285 are special - they indicate a length/distance pair. "copy 9 bytes from 50 bytes ago" is much shorter than repeating those 9 bytes.

Source: Hacker News