What if AI doesn't need more RAM but better math?

Google's new TurboQuant algorithm achieves a 6x reduction in KV cache memory usage without sacrificing accuracy, potentially easing the global AI memory supply crunch through mathematical optimization rather than hardware expansion.

@adlrocha - What if AI doesn’t need more RAM but better math?

How TurboQuant compresses the KV cache without losing accuracy, and what that could mean for memory stocks

Last week I was writing about the hardware side of the AI memory problem: the HBM density penalty, the EUV bottleneck, and the supply chain pressure squeezing DRAM prices for everyone from data centre operators down to consumer electronics. This week, Google published something that attacks the exact same problem using another approach: not “build more memory”, but “need less of it.”

You guessed it! This post dives a bit deeper into what TurboQuant is, and what it may imply for the field of AI. What Pied Piper achieved in the Silicon Valley TV show with their general-purpose lossless compression algorithm, Google may have achieved for information represented as vectors in a high-dimensional space.

What is a transformer? And the KV cache?

But before getting into what TurboQuant does, let’s make a brief detour to understand what this algorithm is actually built to compress, and why it matters for LLMs and the memory problem.

GPT models are what are known as autoregressive: they generate text one token at a time, where each new token is conditioned on everything that came before. You send a prompt, the model reads all of it, picks the most likely next word, appends it, reads everything again, picks the next word, and so on. One token at a time, left to right, until it decides to stop.

The core mechanism that lets the model read everything at each step is called attention. For every token in the sequence, the model computes three vectors: a query, a key, and a value. You can think of the result as a slightly more sophisticated key-value store. To generate the next token, the model compares the current query against every previous key, essentially asking “which past tokens are relevant right now?”, and uses the answer to weigh the corresponding values and build up context.
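To make the query/key/value lookup concrete, here is a minimal sketch of attention for a single query, in plain Python. It assumes one attention head and the usual scaled dot-product form; the function names and the toy vectors are mine, picked only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention for one query over all previous tokens."""
    scale = 1.0 / math.sqrt(len(query))
    # "Which past tokens are relevant right now?": one dot product per key.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Blend the value vectors according to those relevance weights.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Toy example: three past tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]
weights, context = attend(query, keys, values)
```

The first and third keys point in the query's direction, so they receive the same, larger weight, and the second key gets the smallest one. A real model does this across many heads and layers, but the lookup has the same shape.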

This is implemented through the transformer architecture. Transformer layers are responsible for encoding the input sequences into a meaningful representation, applying the attention mechanism, and decoding into an output representation. All LLMs are architectural variations of this basic cell.

Without any optimisation, the keys and values for every previous token are recomputed from scratch on every single pass through the architecture. If your conversation is N tokens long and you’re generating token N+1, the model recalculates N sets of keys and values it already calculated on the previous step. This is slow and wasteful.

The obvious fix is to cache them. The keys and values are computed once per token and stored so they can be looked up in subsequent steps instead of being recalculated. The query is only needed for the token currently being generated, so it is not cached. This is the KV cache, a running store of the key and value vectors of all previous tokens, kept in GPU memory.
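A toy decoding loop shows the difference between recomputing and caching. This is a sketch under simplifying assumptions: a single head, one layer, random projection matrices (`W_Q`, `W_K`, `W_V` are made-up stand-ins for learned weights), and random vectors in place of token embeddings.

```python
import math
import random

DIM = 4
rng = random.Random(0)

def random_matrix():
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]

def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

# Hypothetical stand-ins for the learned query/key/value projections.
W_Q, W_K, W_V = random_matrix(), random_matrix(), random_matrix()

def attend(query, keys, values):
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(DIM) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(DIM)]

def step_without_cache(tokens):
    """Recompute keys and values for every token on every step."""
    keys = [matvec(W_K, t) for t in tokens]
    values = [matvec(W_V, t) for t in tokens]
    return attend(matvec(W_Q, tokens[-1]), keys, values)

class KVCache:
    """Keep the keys and values of past tokens; compute only the new ones."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        self.keys.append(matvec(W_K, token))
        self.values.append(matvec(W_V, token))
        return attend(matvec(W_Q, token), self.keys, self.values)

tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
cache = KVCache()
max_gap = 0.0
for n in range(1, len(tokens) + 1):
    slow = step_without_cache(tokens[:n])   # n key/value projections
    fast = cache.step(tokens[n - 1])        # 1 key/value projection
    max_gap = max(max_gap, max(abs(a - b) for a, b in zip(slow, fast)))
```

Both paths produce the same output at every step (`max_gap` stays at zero), but the cached version does one key/value projection per step instead of n. The price is that `cache.keys` and `cache.values` keep growing.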

The problem is that the KV cache grows with every token. With short messages this is trivial as all tokens fit in memory, but a long conversation can involve hundreds of thousands of tokens. For a model like Llama 3.1 70B, the KV cache for a single long context can consume more GPU memory than the model weights themselves.
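The growth is easy to put numbers on. The sketch below uses the common back-of-the-envelope formula (two vectors per token, per layer, per KV head). The model shape is hypothetical, chosen only to keep the arithmetic easy to follow, and is not the configuration of any particular model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    # Factor 2: one key vector and one value vector per token, layer and head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

# Hypothetical model shape with 16-bit (2-byte) cache entries.
per_token = kv_cache_bytes(1, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)
long_chat = kv_cache_bytes(100_000, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2)

print(per_token)        # 131072 bytes, i.e. 128 KiB per token
print(long_chat / 1e9)  # 13.1072 GB for a single 100,000-token context
```

That figure is per conversation, so serving many users at once multiplies it by the number of concurrent contexts.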

This is one of the key bottlenecks in production inference. Serve more users simultaneously? More KV cache. Support longer contexts? More KV cache. We are trading the compute needed to recalculate keys and values on the fly for increased memory requirements.

The standard way to shrink the cache is quantisation: instead of storing each value at 32-bit or 16-bit precision, one rounds it down to 4 bits or 3 bits. Some accuracy is lost in the approximation, but if the loss is not significant for the use case, the trade-off is obviously worth it. The catch is that standard quantisation techniques add 1-2 extra bits of overhead per value as metadata, which partially undermines the compression you’re trying to achieve.
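Where does that overhead come from? A minimal sketch of block-wise uniform quantisation makes it visible. I am assuming one of the simplest schemes here: each block stores its own offset and scale, at 16 bits each. Real libraries differ in the details, and the block values below are made up.

```python
def quantise_block(block, bits):
    """Uniform quantisation of one block, with its own offset and scale."""
    levels = (1 << bits) - 1
    low, high = min(block), max(block)
    scale = (high - low) / levels or 1.0  # avoid dividing by zero on flat blocks
    codes = [round((x - low) / scale) for x in block]
    return codes, low, scale

def dequantise_block(codes, low, scale):
    return [low + c * scale for c in codes]

def effective_bits(bits, block_size, metadata_bits=32):
    # Assumption: one 16-bit scale plus one 16-bit offset stored per block.
    return bits + metadata_bits / block_size

block = [0.12, -0.40, 0.33, 0.05, -0.27, 0.48, -0.09, 0.21]
codes, low, scale = quantise_block(block, bits=4)
restored = dequantise_block(codes, low, scale)
worst_error = max(abs(a - b) for a, b in zip(block, restored))  # at most scale / 2

print(effective_bits(4, block_size=32))  # 5.0
print(effective_bits(4, block_size=16))  # 6.0
```

Under these assumptions, a nominal 4-bit format really costs 5 to 6 bits per value once the per-block metadata is counted, which is the 1-2 bits of overhead mentioned above.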

Enter TurboQuant

But things may be about to change. This week Google announced TurboQuant, a two-stage algorithm.

Stage 1: PolarQuant. This is the main compression step. We currently store vectors using Cartesian coordinates. PolarQuant converts the vector to polar coordinates: a radius and a set of angles. The key observation is that, in high-dimensional transformer key spaces, the distribution of those angles is highly concentrated and predictable. That predictability means you can eliminate the expensive normalisation steps that standard quantisation methods require, and you can do it without any dataset-specific tuning.
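The concentration effect can be seen in a small experiment. To be clear, this is not PolarQuant itself: it is my own toy illustration with random Gaussian vectors standing in for real keys. It splits a vector in two halves and measures the angle formed by the norms of those halves.

```python
import math
import random
import statistics

rng = random.Random(0)

def split_angle(dim):
    """Angle formed by the norms of the two halves of a random Gaussian vector."""
    half = dim // 2
    first = math.sqrt(sum(rng.gauss(0, 1) ** 2 for _ in range(half)))
    second = math.sqrt(sum(rng.gauss(0, 1) ** 2 for _ in range(half)))
    return math.atan2(second, first)  # always in [0, pi/2]

def angle_spread(dim, samples=1000):
    angles = [split_angle(dim) for _ in range(samples)]
    return statistics.mean(angles), statistics.pstdev(angles)

mean_low, spread_low = angle_spread(2)       # angles spread over [0, pi/2]
mean_high, spread_high = angle_spread(512)   # angles cluster around pi/4
```

In 2 dimensions the angle lands anywhere between 0 and pi/2. In 512 dimensions it clusters tightly around pi/4, so a quantiser can spend its few bits on a narrow, known range without first measuring the data.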

Stage 2: QJL (Quantised Johnson-Lindenstrauss). QJL’s job is to correct for the bias introduced by quantisation. It applies a Johnson-Lindenstrauss transform to the residual error, a random projection that approximately preserves distances between high-dimensional points. The result is an unbiased estimator for the inner products, with zero additional memory overhead.
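The idea of a one-bit, unbiased inner-product estimate can be sketched in a few lines. The caveats: this is my own simplified illustration of a sign-quantised random projection, applied directly to a key vector rather than to a quantisation residual, and the dimensions and names are arbitrary. It relies on a standard fact about Gaussian projections: for a random Gaussian row s, the expected value of (s·q)·sign(s·k) is sqrt(2/pi)·(q·k)/|k|.

```python
import math
import random

rng = random.Random(0)
DIM, PROJECTIONS = 32, 4000

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

# Build a unit query and a unit key whose true inner product is 0.6.
query = unit([rng.gauss(0, 1) for _ in range(DIM)])
noise = [rng.gauss(0, 1) for _ in range(DIM)]
overlap = dot(noise, query)
ortho = unit([n - overlap * q for n, q in zip(noise, query)])
key = [0.6 * q + 0.8 * o for q, o in zip(query, ortho)]

# Random Gaussian projection, shared by keys and queries.
projection = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(PROJECTIONS)]

# Stored per key in this toy: one sign bit per projection, plus the key's norm.
key_bits = [1.0 if dot(row, key) >= 0 else -1.0 for row in projection]
key_norm = math.sqrt(dot(key, key))

# The query stays in full precision; only the key side is reduced to signs.
projected_query = [dot(row, query) for row in projection]
estimate = math.sqrt(math.pi / 2) * key_norm / PROJECTIONS * dot(projected_query, key_bits)
exact = dot(query, key)
```

Averaged over many random rows, `estimate` lands close to `exact` even though the key has been reduced to sign bits. That is what "unbiased" means here: the errors are noise around the true value, not a systematic drift.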

The combination achieves 3.5 bits per channel with “absolute quality neutrality” across Gemma, Mistral, and Llama-3.1-8B-Instruct. The headline number: a 6x reduction in KV memory size with no measurable accuracy loss, and on H100 GPUs, 4-bit TurboQuant delivers up to an 8x performance increase over 32-bit unquantised keys.

TurboQuant is data-oblivious: the algorithm works from first principles without seeing the data first. That’s what makes it deployable at inference time on any model without having to explicitly train a quantised version. There is no need for specific training or fine-tuning to reach this compression rate.

What this means for the memory crunch

Last week I was writing about how HBM stacking reduces DRAM bit density, and how the entire supply chain for consumer DRAM is under pressure. TurboQuant suggests that better mathematical optimization might be as important as hardware scaling in solving the AI memory bottleneck.

Source: Hacker News