DEV-TOOLSMarch 17, 20261 min read15 views

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Flash-KMeans introduces IO-aware and contention-free optimizations for GPU-based K-Means, achieving up to 200x speedup over industry standards like FAISS. By fusing computations and redesigning memory updates, it transforms K-Means into an efficient online primitive for modern AI workloads.

Computer Science > Distributed, Parallel, and Cluster Computing

Title:Flash-KMeans: Fast and Memory-Efficient Exact K-Means

$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.

Source: Hacker News

More in this category

dev-tools

Linux kernel will support $ORIGIN, sort of

Farid Zakaria shares his journey of proposing a patch to the Linux kernel to support relocatable binaries in Nix, which evolved into a powerful eBPF-based solution for binfmt_misc.

dev-tools

Five US tech giants' hidden debts soar to $1.65T on opaque AI funding

A Nikkei study reveals that off-balance-sheet debts at five major US tech companies have surged eightfold to $1.65 trillion, driven by massive AI investments like data center leases and GPU contracts.

dev-tools

Running Doom on Our Custom CPU and Going Viral

Two developers successfully built a custom CPU from scratch at the logic gate level, deployed it on an FPGA, and optimized its memory architecture to run the classic game DOOM.

dev-tools

Incremental – A library for incremental computations

Incremental is a library designed for building complex computations that update efficiently when inputs change. Inspired by self-adjusting computation research, it is highly useful for large-scale calculations, GUI views, and data synchronization.

dev-tools

Flock Credibility Lost as It Repeatedly Lies to City Councils, Police, & Public

Flock Safety, a prominent provider of automatic license plate readers (ALPR), is facing a severe credibility crisis after being caught repeatedly misleading city councils, police departments, and the public about its surveillance capabilities and data practices.

dev-tools

Human mathematicians are being outcounterexampled

In just a few weeks, advanced AI models like ChatGPT Sol and Claude Fable have disproved long-standing mathematical conjectures by finding precise counterexamples and auto-formalizing them in Lean.