Quantization from the Ground Up

Quantization is a vital technique that makes LLMs 4x smaller and 2x faster with minimal accuracy loss, enabling powerful models to run on consumer hardware.

Qwen-3-Coder-Next is an 80-billion-parameter model that is 159.4GB in size. That's roughly how much RAM you would need to run it, and that's before thinking about long context windows. This is not considered a big model. Rumors have it that frontier models have over 1 trillion parameters, which at 16 bits per parameter would require at least 2TB of RAM. The last time I saw that much RAM in one machine was never.
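To see where numbers like these come from, here is a rough back-of-the-envelope sketch. It assumes the size is just parameter count times bits per parameter, which ignores context windows and any file overhead. The function name `model_size_gb` is mine, and the 16-bit assumption is a guess that happens to land close to the 159.4GB figure.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate size of the parameters alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# An 80-billion-parameter model at several (assumed) precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2} bits per parameter: {model_size_gb(80e9, bits):6.1f} GB")

# A hypothetical 1-trillion-parameter model at 16 bits per parameter.
print(f"1T parameters at 16 bits: {model_size_gb(1e12, 16) / 1000:.1f} TB")
```

Going from 16 bits down to 4 bits per parameter is where a "4x smaller" figure would come from under this simple model.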

But what if I told you we can make LLMs 4x smaller and 2x faster, enough to run very capable models on your laptop, all while losing only 5-10% accuracy?

That's the magic of quantization.

Parameters, also called "weights," are the majority of what an LLM is when it's in memory or on disk. In my prompt caching post I wrote that LLMs are an "enormous graph of billions of carefully arranged operations." What do those graphs look like? Let's start with the simplest example: 1 input, 1 parameter, 1 output. It doesn't look like much, but this is the fundamental building block of modern AI. It takes the input of 2.0, multiplies it by the parameter 0.5, and gets the output 1.0.

LLMs, though, are much bigger. They have billions of these parameters in practice. One of the ways they get so big is that they have "layers." Every connection between two nodes gets a parameter. When 2 connections end at the same node, the values are added together. Modern LLMs have hundreds of thousands of inputs and outputs. They have many dozens of layers, each with thousands of nodes, all densely connected together. This all multiplies out to result in billions, sometimes trillions of parameters.
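Here is a minimal sketch of that idea in code. It only covers what is described above: every connection has a parameter, and values arriving at the same node are added together. Real models have more moving parts than this, and the names `dense_layer` and `count_params` are made up for illustration.

```python
def dense_layer(inputs, weights):
    """Compute one densely connected layer.

    weights[i][j] is the parameter on the connection from input node i
    to output node j. Values arriving at the same output node are summed.
    """
    num_outputs = len(weights[0])
    return [
        sum(x * row[j] for x, row in zip(inputs, weights))
        for j in range(num_outputs)
    ]

def count_params(layer_sizes):
    """Count connections in a stack of densely connected layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# The simplest case: 1 input, 1 parameter, 1 output.
print(dense_layer([2.0], [[0.5]]))

# Two connections ending at the same node are added together.
print(dense_layer([1.0, 2.0], [[0.5], [0.25]]))

# Layers of 3, 4, and 2 nodes: 3*4 + 4*2 connections.
print(count_params([3, 4, 2]))

# Thousands of nodes per layer multiplies out quickly.
print(count_params([4096] * 33))
```

The last line uses made-up sizes (33 layers of 4,096 nodes each) just to show how fast the count grows; it is not the shape of any particular model.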

Computers work in 1s and 0s, called "bits." Integers are nice to work with because they are discrete. It gets trickier when you start thinking about decimal places. How many numbers with decimal places are there between 1 and 3? Infinitely many. This is not good for computers, because computers can't represent an infinite number of things. So computers compromise. They promise to be accurate up to so many significant figures, and anything after that is best-effort.

For example, 32-bit floating point numbers span the range ±3.40×10^38 with about 7 significant figures of accuracy. They do this by dividing the 32 bits up into 3 parts: 1 sign bit, 8 exponent bits, and 23 significand bits. More exponent bits result in a larger range, while more significand bits result in more significant figures of accuracy.
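If you want to poke at those three parts yourself, here is a small sketch using Python's standard `struct` module. The helper name `float32_fields` is mine. One detail not covered above: in the IEEE 754 format the exponent is stored with an offset of 127, so a stored exponent of 127 means "times 2 to the power 0."

```python
import struct

def float32_fields(x):
    """Split x, stored as a 32-bit float, into (sign, exponent, significand)."""
    # Pack as a float32, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with an offset of 127
    significand = bits & 0x7FFFFF    # 23 bits
    return sign, exponent, significand

for value in (1.0, -0.5, 0.15625, 3.0):
    sign, exponent, significand = float32_fields(value)
    print(f"{value:>8}: sign={sign} exponent={exponent:08b} significand={significand:023b}")
```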

A lot of the representable 32-bit floats are small values. This is fantastic for LLMs, because parameters also tend to be small. Small parameters have been found to result in models that generalise better to problems they haven't seen before. Almost all parameters are very close to 0.
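You can check the "a lot of them are small" claim directly. The sketch below relies on a property of the 32-bit float layout: for non-negative values, a larger bit pattern means a larger number, so adding 1 to the bit pattern gives the next representable float. The helper names are made up for this example.

```python
import struct

def float32_bits(x):
    """Bit pattern of x stored as a 32-bit float, as an unsigned integer."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits

def float32_from_bits(bits):
    """The 32-bit float that has the given bit pattern."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits))
    return value

def float32_gap(x):
    """Distance from a positive float32 x to the next larger float32."""
    bits = float32_bits(x)
    return float32_from_bits(bits + 1) - float32_from_bits(bits)

# The gap between neighbouring floats grows as the numbers get bigger.
for x in (0.001, 1.0, 1000.0):
    print(f"gap after {x:>8}: {float32_gap(x):.3e}")

# Count the non-negative finite floats that sit below 1.0.
below_one = float32_bits(1.0)
all_finite = float32_bits(float("inf"))
print(f"share of non-negative float32 values below 1.0: {below_one / all_finite:.1%}")
```

Just under half of all non-negative 32-bit floats are smaller than 1.0, which is exactly where most parameters live.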

Do language models actually need 32-bit floats? No. LLMs work just fine with smaller, less accurate floats. A 16-bit float takes up half as much RAM and disk as a 32-bit float. We can mix and match the number of exponent and significand bits to get different precision/range tradeoffs. For example, the Google Brain team created the bfloat16 format. Some more extreme examples are float8 and float4.
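To make the tradeoff concrete, here is a sketch comparing two 16-bit formats. The standard float16 uses 5 exponent bits and 10 significand bits, while bfloat16 uses 8 exponent bits and 7 significand bits. Two caveats: my `to_bfloat16` simply chops off the bottom 16 bits of a float32 rather than rounding to nearest, which is a simplification, and both function names are mine.

```python
import struct

def to_float16(x):
    """Round-trip x through float16 (1 sign, 5 exponent, 10 significand bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values that are too big for float16.
        return float("inf") if x > 0 else float("-inf")

def to_bfloat16(x):
    """Round-trip x through bfloat16 (1 sign, 8 exponent, 7 significand bits).

    Simplification: keeps the top 16 bits of the float32 and drops the rest,
    which truncates instead of rounding to the nearest value.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

for value in (0.1, 1e10):
    print(f"{value}: float16={to_float16(value)!r} bfloat16={to_bfloat16(value)!r}")
```

float16 gets closer to 0.1 because it has more significand bits, but it cannot hold 1e10 at all. bfloat16 is less precise on 0.1 but shares the 8 exponent bits of a 32-bit float, so it keeps the same range.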

Quantization is the process of taking values from a large range, and packing them into a smaller range. It is a form of lossy compression. When we convert from, say, a float16 to a float8, we tend to round to the closest representable value.
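Here is a toy sketch of "round to the closest representable value." The format in it is entirely made up: `TOY_GRID` is a hypothetical 3-bit format with only 8 representable values, packed more tightly near 0 the way real float formats are. It is not float8, float4, or any real format.

```python
# Hypothetical 3-bit format: 8 representable values, denser near 0.
TOY_GRID = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0, 2.0]

def quantize(value, representable):
    """Return the representable value closest to `value`."""
    return min(representable, key=lambda r: abs(r - value))

weights = [0.07, -0.31, 0.62, 1.4]
quantized = [quantize(w, TOY_GRID) for w in weights]

for original, rounded in zip(weights, quantized):
    error = abs(original - rounded)
    print(f"{original:>6} -> {rounded:>6}  (error {error:.2f})")
```

The error column is the "lossy" part: once a weight has been rounded, the original value cannot be recovered from the quantized one.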

Source: Hacker News