NOW LET US – AI RAG SaaS Studio TP.HCM
NOW LET US
Digital Product Studio
Back to news
DEV-TOOLS...4 min read

What if AI doesn't need more RAM but better math?

Share
NOW LET US Article – What if AI doesn't need more RAM but better math?

Google's new TurboQuant algorithm achieves a 6x reduction in KV cache memory usage without sacrificing accuracy, potentially easing the global AI memory supply crunch through mathematical optimization rather than hardware expansion.

@adlrocha - What if AI doesn’t need more RAM but better math?

How TurboQuant compresses the KV cache without losing accuracy, and what that could mean for memory stocks

Last week I was writing about the hardware side of the AI memory problem: the HBM density penalty, the EUV bottleneck, and the supply chain pressure squeezing DRAM prices for everyone from data centre operators down to consumer electronics. This week, Google published something that attacks the exact same problem using another approach: not “build more memory”, but “need less of it.”

You guessed it! This post will dive a bit deeper into what TurboQuant is, and what this may imply to the field of AI. What Pied Piper achieved in the Silicon Valley TV Show with their general-purpose lossless compression algorithm, Google may have achieved it for the compression of information represented as vectors in a high-dimensional space.

What is a transformer? And the KV cache?

But before getting into what TurboQuant does, let’s make a brief detour to understand what is this algorithm is actually built to compress, and why it is important for LLMs and the memory problem.

GPT models are what are known as autoregressive: they generate text one token at a time, where each new token is conditioned on everything that came before. You send a prompt, the model reads all of it, picks the most likely next word, appends it, reads everything again, picks the next word, and so on. One token at a time, left to right, until it decides to stop.

The core mechanism that lets the model read everything at each step is called attention. For every token in the sequence, the model computes three vectors: a query, a key, and a value. You can think of these data structures as a bit more complex key-value stores. To generate the next token, the model compares the current query against every previous key, essentially asking “which past tokens are relevant right now?”, and uses the answer to weigh the corresponding values and build up context.

This is implemented through the transformer architecture. Transformer layers are responsible for encoding the input sequences into a meaningful representation, applying the attention mechanism, and decoding into an output representation. All LLMs are architectural variations of this basic cell.

The keys and values for every previous token are recomputed from scratch on every single pass through architecture. If your conversation is N tokens long and you’re generating token N+1, the model recalculates N sets of keys and values it already calculated on the previous step. This is slow and wasteful in terms of the resources.

The obvious fix to this is to cache them. The query, key and values are computed once per token and stored so they can be looked up in subsequent steps instead of being recalculated. This is the KV cache, a running store of QKV tokens from all previous tokens stored in GPU memory.

The problem is that the KV cache grows with every token. With short messages this is trivial as all tokens fit in memory, but a long conversation involves hundreds of thousands of tokens. For a model like Llama 3.1 70B, the KV cache for a single long context can consume more GPU memory than the model weights themselves.

This is one of the key bottlenecks in production inference. Serve more users simultaneously? More KV cache. Support longer contexts? More KV cache. We are trading the compute necessary to compute on-the-fly the QKV values, for increased memory requirements.

By using quantisation instead of storing each value at 32-bit or 16-bit precision, one can round it down to 4 bits or 3 bits. Some accuracy is lost in the approximation, but if it is not significant for the user case, the trade-off is obviously worth it. Standard quantisation techniques add 1-2 extra bits of overhead per value as metadata, which partially undermines the compression you’re trying to achieve.

Enter TurboQuant

But things may be about to change. Google announced this week TurboQuant. TurboQuant is a two-stage algorithm.

Stage 1: PolarQuant. This is the main compression step. We currently store vectors using Cartesian coordinates. PolarQuant converts the vector to polar coordinates: a radius, and an angle. The key observation is that, in high-dimensional transformer key spaces, the angle distribution is highly concentrated and predictable. That predictability means you can eliminate the expensive normalisation steps that standard quantisation methods require, and you can do it without any dataset-specific tuning.

Stage 2: QJL (Quantised Johnson-Lindenstrauss). QJL’s job is to correct for the bias introduced by quantization. It applies a Johnson-Lindenstrauss transform to the residual error, a random projection that preserves distances between high-dimensional points. The result is an unbiased estimator for the inner products, with zero additional memory overhead.

The combination achieves 3.5 bits per channel with “absolute quality neutrality” across Gemma, Mistral, and Llama-3.1-8B-Instruct. The headline number: 6x reduction in KV memory size with no measurable accuracy loss, and on H100 GPUs, 4-bit TurboQuant delivers up to 8x performance increase over 32-bit unquantised keys.

TurboQuant is data-oblivious: the algorithm works from first principles without seeing the data first. That’s what makes it deployable at inference time to any models without having to explicitly train the quantised model. There is no need for specific training and fine-tuning to achieve the most optimal compression rate.

What this means for the memory crunch

Last week I was writing about how HBM stacking reduces DRAM bit density, and how the entire supply chain for consumer DRAM is under pressure. TurboQuant suggests that better mathematical optimization might be as important as hardware scaling in solving the AI memory bottleneck.

© 2026 Now Let Us. All rights reserved.

Source: Hacker News

Advertisement
Ad slot ready: 5887729102

More in this category

NOW LET US Related – Swift at Apple: Migrating the TrueType hinting interpreter

dev-tools

Swift at Apple: Migrating the TrueType hinting interpreter

Apple has rewritten its TrueType hinting interpreter from C to memory-safe Swift for its Fall 2025 OS releases, improving security and boosting performance by an average of 13%.

NOW LET US Related – Where Did Earth Get Its Oceans? Maybe It Made Them Itself

dev-tools

Where Did Earth Get Its Oceans? Maybe It Made Them Itself

For decades, scientists believed Earth's water was delivered by comets or asteroids. However, new research and space missions suggest our planet might have manufactured its own oceans through a mix of magma and hydrogen.

NOW LET US Related – Digital Sovereignty Becomes an Imperative as the US Reads Dutch Emails

dev-tools

Digital Sovereignty Becomes an Imperative as the US Reads Dutch Emails

The reported access of Dutch officials' emails by the U.S. House of Representatives highlights the critical difference between data residency and true digital sovereignty. It underscores why nations must secure legal and operational control over their data, moving beyond mere local storage promises.

NOW LET US Related – Removing 'um' from a recording is harder than it sounds

dev-tools

Removing 'um' from a recording is harder than it sounds

Removing filler words like 'um' and 'uh' from audio recordings is surprisingly difficult due to audio artifacts and AI limitations. The open-source tool 'erm' solves this by combining Whisper with advanced digital signal processing techniques.

NOW LET US Related – If you are asking for human attention, demonstrate human effort

dev-tools

If you are asking for human attention, demonstrate human effort

As AI-generated content floods the workplace, a new etiquette dilemma emerges. This article highlights a crucial principle for modern collaboration: if you want to request human attention, you must first demonstrate human effort.

NOW LET US Related – Raspberry Pi 5 – 16GB RAM

dev-tools

Raspberry Pi 5 – 16GB RAM

The Raspberry Pi 5 features a massive upgrade with a 2.4GHz quad-core processor, up to 16GB of RAM, and in-house silicon for vastly improved I/O performance.

EXPLORE TOPICS

Discover All Categories

Deep dive into the specific technology sectors that matter most to you.