A cache-friendly IPv6 LPM with AVX-512 (linearized B+-tree, real BGP benchmarks)

The planb-lpm library implements a high-performance IPv6 Longest Prefix Match algorithm using linearized B+-trees and AVX-512 SIMD instructions, achieving up to 20x speedup over traditional Patricia tries.

IPv6 longest-prefix-match (LPM) using a linearized B+-tree with AVX-512 SIMD and a scalar fallback. Based on the algorithm from:

Zhihao Zhang, Lanzheng Liu, Chen Chen, Huiba Li, Jiwu Shu, Windsor Hsu, Yiming Zhang.

PlanB: Efficient Software IPv6 Lookup with Linearized B+-Tree.NSDI '26. arXiv:2604.14650. Author's reference code: https://github.com/nicexlab/planb.

This repository is an independent, clean-room reimplementation of the algorithm as a portable, MIT-licensed C++17 library with extras that were not part of the reference:

Portable header-only core(include/lpm6.hpp) that compiles without AVX-512 and transparently falls back to a scalar path.Dynamic FIB(lpm6::Dynamic) using the paper's rebuild-and-swap model withstd::atomic<std::shared_ptr>; lookups are wait-free.Python bindingsvia pybind11 (pip install -e .).Correctness tests(tests/test_correctness.cpp) that verify the tree against a brute-force LPM reference on synthetic FIBs.Sample FIB/trace generator(examples/generate_sample.py).

The PlanB paper is a solid engineering idea, but the reference code on GitHub is not something you can drop into another project: it is Linux- and AVX-512-only, has no license, no Python layer, and no dynamic-FIB path. planb-lpm re-expresses the algorithm from the paper as a portable, tested, and permissively licensed library so the technique is actually usable in research, teaching, and production software.

Research / simulation— drop-in fast LPM backend for ns-3 or a custom packet-level simulator, or as a reference to compare newer LPM structures against.Control-plane tooling— SDN controllers and route-monitoring services that need to answer "which prefix covers this address?" over a large FIB without pulling in a full routing stack.Network analysis— IPv6 scanners, traffic classifiers, and log-enrichment pipelines that tag flows with their covering prefix.Teaching— a compact, readable implementation of a modern SIMD lookup structure suitable for networking and computer-architecture courses.

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Options:

| CMake option | Default | Effect | |---|---|---| LPM6_BUILD_PYTHON | OFF | Build the pybind11 extension |

AVX-512 is auto-detected via check_cxx_compiler_flag; on CPUs or compilers without support, the scalar path is used.

python3 examples/generate_sample.py 100000 1000000 42
./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20

Synthetic FIB, 100 000 prefixes. Prefix length distribution is weighted to match the RIPE rrc00-25 table (roughly /48 45%, /32 11%, /40 10%, /44 10%, /36 4%, others), generated byexamples/generate_sample.py. - Synthetic trace, 1 M uniform-random 64-bit addresses (upper half of an IPv6 address, which is all that drives /0../64 LPM). **Each benchmark does one warmup pass (discarded) and 20 timed passes.**We report min / q1 / median / q3 / max + IQR across the 20 runs so variance is visible, not hidden under an average.Pin to a single corewithtaskset -c <N> to cut scheduler noise. The binaries print the CPU they ran on, the affinity-mask width, thecpufreq governor, and the Intel turbo state so numbers are reproducible. (On WSL the/sys probes are unavailable and those lines are silently omitted.)- Hardware: laptop Intel i5-1035G7(Ice Lake, 4C/8T, AVX-512), Ubuntu 24.04 on WSL, GCC 13.3,-O3. This is a consumer CPU; the paper's numbers are from a 24-core Xeon 8331C and a Zen5 Ryzen 9 370HX.

taskset -c 2 ./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20 (20 timed runs + 1 warmup, batch paths use the real public API returning next-hops):

| Path | min | q1 | median | q3 | max | IQR | |---|---|---|---|---|---|---| single | 34 | 45 | 48 | 51 | 62 | 6.6 | batch<1> | 32 | 44 | 47 | 51 | 60 | 7.6 | batch<2> | 36 | 48 | 54 | 58 | 68 | 10.4 | batch<4> | 42 | 60 | 66 | 71 | 84 | 11.3 | batch<8> | 53 | 66 | 73 | 81 | 101 | 13.4 | batch<16> | 43 | 57 | 62 | 66 | 78 | 9.1 | batch<32> | 41 | 70 | 76 | 83 | 94 | 13.4 |

Units: MLPS; medians from a rolling average of four 20-run sweeps. Build time for that FIB is ~25 ms, tree depth 6, 531 k keys across 200 k edges.

Observations:

— the batched entry point adds no overhead at M=1, confirming the dispatch cost is negligible.batch<1> ≈single Sweet spot at M=8: ~1.5× single-core throughput oversingle. M=32 is essentially flat; going further would need larger-batch unrolling and register-allocation work.M=16 dipis real and reproducible across runs — likely register pressure at the#pragma GCC unroll 16 boundary interacting with the 7-level traversal. A known-to-investigate item.

Paper's ablation reports batching worth 3–4.5× on AMD Zen5; we see ~1.5× on Ice Lake. Part of the gap is the paper's lookup_batch_checksum-style fast path (our binary uses the public API that also writes next-hops), part is AVX-512 sustained frequency differences between Ice Lake and Zen5. The paper reports 191–197 MLPS single-core on Xeon and 374–393 MLPS on Zen5 — a real server should land near those numbers.

The bench_naive binary builds a Patricia radix trie (tests/patricia.hpp) on the same FIB and runs the same trace through it. Patricia is a standard path-compressed binary trie — representative of the pointer-chasing LPM structures PlanB is designed to replace.

taskset -c 2 ./build/bench_naive … (same 20-run discipline):

| Implementation | min | median | max | IQR | |---|---|---|---|---| lpm6::Tree | 23 | 50 | 63 | 8.2 | | Patricia trie | 2.4 | 2.4 | 2.6 | 0.06 |

Units: MLPS. Tree-vs-Patricia ≈ 20× on the median, bracketing the upper end of the paper's reported 1.6–14× advantage over PopTrie, CP-Trie, Neurotrie, and HBS. Patricia isn't one of those baselines — it's a conventional-radix-trie stand-in — so the exact multiplier will shift when we benchmark against the paper's actual algorithms (on the roadmap, see plan.md). The binary also prints a linear-scan number on a 50 000-address slice; that is O(N) vs. O(log N) and is only a sanity check, not a comparison.

Same 100 000-prefix workload. footprint_bytes() counts the algorithmic cost (our flat key + next-hop arrays for the tree; one Node struct per Patricia node); RSS delta is the process-level VmRSS change across the build and captures allocator overhead:

| Implementation | Algorithmic | RSS delta | Per-prefix | |---|---|---|---| lpm6::Tree | 4.82 MB | 5.05 MB | ~50 B/prefix | | Patricia trie | 6.04 MB | 9.02 MB | ~90 B/prefix |

At 100 K, the tree's flat cache-aligned layout costs roughly half the RSS of the pointer-chasing Patricia. This does not hold at scale — see FIB-size sweep below: the linearized B+-tree grows in discrete depth steps (9^depth buckets), so its footprint is bounded from below by depth, not prefix count. Once depth transitions 6→7 (around 250 K prefixes) the tree's algorithmic size jumps to ~38 MB and stays there until depth 7 fills up. Patricia scales linearly with prefixes, so at 250 K the tree is actually larger than Patricia. Paper reports 56–92% memory reductions vs PopTrie/CP-Trie/Neurotrie/HBS; those are heavier than plain Patricia, so the paper's advantage likely still holds against them even at our 250 K crossover.

examples/run_sweep.sh drives planb-lpm, bench_update, and bench_naive across five FIB sizes sharing one 1 M-address trace, same 20-run discipline, pinned to core 2. Numbers are medians of the timed runs; throughput units are MLPS:

| FIB | depth | single | batch<8> | batch<32> | rebuild | tree alg. | Patricia alg. | |---|---|---|---|---|---|---|---| | 10 K | 5 | 94 | 165 | 148 | 1.5 ms | 0.53 MB | 0.61 MB | | 100 K | 6 | 46 | 69 | 75 | 17.7 ms | 4.82 MB | 6.04 MB | | 250 K | 7 | 20 | 29 | 41 | 51.8 ms | 38.40 MB | 15.00 MB | | 500 K | 7 | 12 | 18 | 27 | 107.9 ms |

Source: Hacker News