NumKong: 2'000 Mixed Precision Kernels for All

Ash Vardanian has unveiled NumKong, a massive open-source library succeeding SimSIMD with over 2,000 SIMD kernels for mixed-precision numerics. It aims to revolutionize high-performance computing across RISC-V, Intel AMX, and Arm SME architectures.

These are a few lines of celebratory “proud-dad” rumblings and highlights from my largest open-source release to date. I’m killing my SimSIMD project and re-launching under a new name — NumKong — StringZilla’s big brother. Over 2'000 SIMD kernels for mixed precision numerics, spread across 200'000 lines of code & docstrings, in 7 languages. One of the largest collections online — pretty much the same size as OpenBLAS, the default NumPy BLAS (Basic Linear Algebra Subprograms) backend (detailed comparison below).

What’s inside?

RISC-V Vector Extensions, Intel AMX & Arm SME Tiles
From Vectors to Matrices and Higher-rank Tensors
From BFloat16 and Float16 to Float6 — E3M2 & E2M3 on any CPU
Native Int4 & UInt4 Dot Products via Nibble Algebra
Neumaier & Dot2 for higher-than-BLAS precision
Ozaki Scheme for Float64 GEMMs via Float32 Tile Hardware
Haversine & Vincenty for Geospatial — 5'300x faster than GeoPy
Kabsch & Umeyama Mesh Alignment — 200x faster than BioPython
Fused MaxSim for ColBERT — GPU-Free Late Interaction Scoring
WebAssembly SIMD backend for AI Sandboxes, Edge, & Browsers
C 99, C++ 23, Rust, Swift, JavaScript, GoLang, & Python 🐍

All of that tested against in-house 118-bit floating point numbers and heavily profiled for both numerical stability and speed! Here’s a preview of performance numbers for the most boring part — GEMM (General Matrix Multiply)-like batched dot products:

| Input | NumPy + OpenBLAS | PyTorch + MKL | NumKong |
|---|---|---|---|
| Float64 | 65.5 gso/s, 1e-15 err | 68.2 gso/s, 1e-15 err | 8.6 gso/s, 1e-16 err |
| Float32 | 140 gso/s, 9e-7 err | 145 gso/s, 1e-6 err | 37.7 gso/s, 4e-7 err |
| BFloat16 | n/a | 851 gso/s, 1.8% err | 458 gso/s, 3.6% err |
| Float16 | 0.3 gso/s, 0.25% err | 140 gso/s, 0.37% err | 103 gso/s, 0.26% err |
| Float8 | n/a | 0.4 gso/s, 4.6% err | 398 gso/s, 0% err |
| Int8 | 0.4 gso/s, overflow | 50 gso/s, overflow | 1'279 gso/s, 0% err |
| Binary Size | 30 MB | 705 MB | 5 MB |
| Available For | Python | Python, C++ | 7 languages |
| Python Wheels | 72 | 39 | 99 |

Those are single-threaded numbers for Intel Xeon4 CPUs powering mainstream Nvidia DGX-H100 servers — the workhorse of GenAI in 2025/6. NumKong makes different tradeoffs around speed vs accuracy, and a big part of the article is about the strategy of prioritizing one above the other.
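To make the accuracy side of that tradeoff concrete, here is a minimal scalar sketch of Neumaier summation, one of the techniques named in the highlights. It is a textbook reference model in plain Python, not NumKong's SIMD kernel, and the function names `naive_sum` and `neumaier_sum` are illustrative assumptions. The naive loop is written out by hand because Python's built-in `sum` already compensates float additions in recent versions.

```python
import math

def naive_sum(values):
    """Plain left-to-right accumulation with one rounding per addition."""
    total = 0.0
    for x in values:
        total += x
    return total

def neumaier_sum(values):
    """Compensated summation: Neumaier's variant of Kahan's algorithm."""
    total = 0.0
    compensation = 0.0  # running sum of the low-order bits lost so far
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            compensation += (total - t) + x  # low-order bits of x were lost
        else:
            compensation += (x - t) + total  # low-order bits of total were lost
        total = t
    return total + compensation

# Hypothetical input chosen to trigger catastrophic cancellation.
data = [1.0, 1e100, 1.0, -1e100]
print(naive_sum(data))     # 0.0, both small terms are absorbed by 1e100
print(neumaier_sum(data))  # 2.0, matches the exactly rounded math.fsum(data)
```

The extra comparison and additions per element are the price of the better error bound, which is the kind of speed-versus-accuracy choice the table reflects.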

Built for USearch, Released for Everyone

And no, the scale wasn’t the hard part — correctness, portability, and UX were. I started this release 3 years ago, when I first got access to Intel Xeon4 (Sapphire Rapids) CPUs. I opened the #220 Pull Request in 2024… and it’s already 2026. Around 900 commits in, CI finally passed for the first time; another 200 patches later, it was ready to merge 😅 I rewrote it several times to make sure it’s compatible with the evolution of Unum’s USearch — the search engine it was designed to accelerate.

When SimSIMD was first integrated into USearch, it was a little-known open-source project with an ambition. Now, it’s almost equally little-known, but with bindings for 14+ programming languages it powers vector search in ClickHouse, DuckDB, ScyllaDB, TiDB, Yugabyte, MemGraph, government and intelligence agencies, and several frontier AI labs, running on well over 100 Million, possibly over a Billion devices worldwide, from mobile phones to 10 kW mega-servers. I’m using it for the design of the yet-to-be-released USearch v3, but you can already apply it elsewhere, as the Albumentations team does for Image Processing.

Tiled Multiply-Accumulate on Every Chip

RISC-V

The saddest part of this story may just be the state of RISC-V — so let’s start at the bottom and work up.

RISC-V is the open-source Instruction Set Architecture that has made waves online and holds great promise for democratizing chip design. Sadly, even in 2026, it’s still far from usable. Even more disappointing, the driving force behind its growing adoption in CPUs (as opposed to external accelerators) isn’t technical merit — it’s growing geopolitical tensions between the US, China, Europe, Russia, and the rest of the world. Politics aside, what’s the state right now? Where can we find RISC-V cores and which of those have SIMD (Single Instruction, Multiple Data) extensions like RVV (RISC-V Vector) to enable advanced parallel processing?

| Vendor | Availability | Vectors | Notes |
|---|---|---|---|
| Meta | Internal only (MTIA) | Custom | 100K+ chips, 16 data centers |
| Alibaba | Scaleway EU, Alibaba Cloud | 0.7.1 | C910/C920, €16/mo on Scaleway |
| Tenstorrent | Koyeb serverless | SFPI | NOT standard RVV |
| SiFive | Dev boards (X280) | RVV 1.0 | Purchase only |
| QEMU | Local emulation | RVV 1.0 | For development |

Meanwhile, NVIDIA ships 1B+ RISC-V cores per year embedded in GPUs for power management — not for compute, but a telling sign of the ISA’s reach.

No public cloud offers RVV 1.0 hardware yet — Scaleway’s EM-RV1 with T-Head C910 is the only commercial option, stuck on the draft 0.7.1 spec. The gap to 1.0 is painful: binary-incompatible encodings, no fractional LMUL, no tail/mask-agnostic policies, different vsetvl semantics. Worse, there’s no ratified matrix extension yet — unlike Intel AMX or Arm SME, you can’t do 2D tiled matmuls, and BFloat16 requires the “Zvfbfwma” extension that C910 doesn’t have. That said, the instruction set is already humongous — as many have pointed out, “Reduced” doesn’t really belong in “RISC-V”.

RISC-V RVV vocabulary is quite simple compared to AVX-512, but when you look at the combinatorially exploding number of LMUL × datatype × policy combinations, it starts looking quite scary:

• AVX-512 ≈ 5.2k intrinsics
• Arm SVE/SVE2 ≈ 9.2k
• RVV 1.0 ≈ 71.5k

... excluding…

Ash Vardanian (@ashvardanian), February 14, 2026

Despite all that, RVV does have unique strengths worth exploiting:

vlseg/vsseg: segment loads that deinterleave AoS (complex numbers, RGB) directly into registers
viota + vcompress: stream compaction in 3 instructions vs ~10 on AVX-512
vfwredusum: widening reduction without the horizontal shuffle dance
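For readers who have not used these instructions, the first two items can be modelled with a few lines of scalar Python. This is only a sketch of the semantics as the RVV 1.0 specification describes them, not vector code. The helper names are illustrative assumptions, and the `vcompress` model returns only the packed prefix, ignoring whatever the real instruction leaves in the tail elements.

```python
def segment_load(buffer, fields):
    """Model of a segment load: split interleaved records into one list per field."""
    return [buffer[i::fields] for i in range(fields)]

def viota(mask):
    """Model of viota.m: for each element, count the set mask bits strictly before it."""
    counts, seen = [], 0
    for bit in mask:
        counts.append(seen)
        seen += bit
    return counts

def vcompress(values, mask):
    """Model of vcompress.vm: pack the elements whose mask bit is set, in order."""
    return [v for v, bit in zip(values, mask) if bit]

# Array-of-structures RGB pixels become three structure-of-arrays channels.
rgb = [10, 20, 30, 11, 21, 31, 12, 22, 32]
r, g, b = segment_load(rgb, 3)
print(r, g, b)  # [10, 11, 12] [20, 21, 22] [30, 31, 32]

# Stream compaction: keep only the elements selected by the mask.
mask = [1, 0, 1, 1, 0]
print(viota(mask))                       # [0, 1, 1, 2, 3]
print(vcompress([5, 6, 7, 8, 9], mask))  # [5, 7, 8]
```

The `viota` output doubles as the destination index of each selected element, which is why the pair is handy for compaction.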

What does work well: vfwmacc for f16 × f16 → f32 and vwmacc for i8 × i8 → i32 widening dot products. Even more interesting, vfwmacc_vv_f64m2 computes f64 += f32 × f32 in a single instruction with no intermediate rounding — the multiply happens at full Float64 width. Neither x86 nor SVE has an equivalent; on those ISAs you must widen both operands first, then FMA, eating two extra instructions and an intermediate rounding step.
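The effect of that missing intermediate rounding can be shown with a small scalar model. This is a sketch under stated assumptions, not the RVV kernel: Python floats stand in for Float64, `struct` is used to round values to Float32, and the function names are hypothetical. It relies on the fact that the product of two Float32 values has at most 48 significant bits, so it is exact in Float64.

```python
import struct
from fractions import Fraction

def to_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_f32_accumulate(a, b):
    """Multiply and accumulate in Float32, rounding after every operation."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc

def dot_widening(a, b):
    """Model of f64 += f32 * f32: the product is exact, only the add rounds."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

x = to_f32(1.0 + 2.0**-12)         # exactly representable in Float32
print(dot_f32_accumulate([x], [x]))  # 1 + 2**-11: the 2**-24 term is rounded away
print(dot_widening([x], [x]))        # 1 + 2**-11 + 2**-24: nothing is lost

# Over 1000 terms the widening model still matches exact rational arithmetic.
exact = 1000 * Fraction(x) * Fraction(x)
print(Fraction(dot_widening([x] * 1000, [x] * 1000)) == exact)  # True
```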

Here are some of the interesting things that worked on RISC-V better than on other ISAs:

nk_reduce_moments: horizontal accumulations over arbitrarily strided data, where we may be computing the L2 norm of a column in a row-major matrix layout. The RVV kernel may use a combination of __riscv_vlse32_v_f32m1 strided loads and __riscv_vfwmacc_vv_f64m2_tu widening FMA in the hot loop.
nk_reduce_minmax: horizontal reductions tracking positions in arbitrary-size arrays, where the compared objects and the offsets have clearly different widths. The RVV kernel may use vfloat32m1_t for incoming floats and a wider vuint64m2_t to track positions.
nk_rmsd, nk_kabsch, nk_umeyama: kernels for computational geometry with applications in Biology and Chemistry, which leverage RVV’s segmented loads and widening/narrowing logic for tight iterative algorithms like SVD. Details are in the mesh alignment section.
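What the first two reductions compute can be sketched in scalar Python. The helpers `strided_moments` and `strided_minmax` below are hypothetical reference models written for this illustration. They are not NumKong's API, and their signatures are assumptions. The wide accumulators and the separately tracked positions mirror the mixed-width registers described above.

```python
import math

def strided_moments(buffer, start, stride, count):
    """Sum and sum of squares over buffer[start], buffer[start + stride], ..."""
    total, total_sq = 0.0, 0.0  # wide accumulators for narrower inputs
    for i in range(count):
        x = buffer[start + i * stride]
        total += x
        total_sq += x * x
    return total, total_sq

def strided_minmax(buffer, start, stride, count):
    """Minimum and maximum with their logical positions in the strided view."""
    min_value, min_index = math.inf, -1
    max_value, max_index = -math.inf, -1
    for i in range(count):
        x = buffer[start + i * stride]
        if x < min_value:
            min_value, min_index = x, i
        if x > max_value:
            max_value, max_index = x, i
    return min_value, min_index, max_value, max_index

# A 3x2 row-major matrix; column 0 is every 2nd element starting at offset 0.
matrix = [3.0, 1.0,
          4.0, 2.0,
          0.0, 5.0]
total, total_sq = strided_moments(matrix, start=0, stride=2, count=3)
print(math.sqrt(total_sq))              # 5.0, the L2 norm of column 0
print(strided_minmax(matrix, 0, 2, 3))  # (0.0, 2, 4.0, 1)
```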

Thanks to QEMU, I was able to validate all kernels for correctness, but it’s impossible to make throughput claims without hardware. So let’s switch to more practical cases.

Intel AMX and the Future Beyond AVX10.2

Intel was the first to bring GPU-style tensor cores to mainstream CPUs. Advanced Matrix Extensions (or AMX) provide massive 8x 1 KB tile registers (TMMs) and a dedicated TMUL unit for tiled matmuls. Conceptually similar to scheduling NVIDIA’s tensor cores, but much easier to program. Xeon4 supports bf16 × bf16 → f32

Source: Hacker News