Show HN: Forkrun – NUMA-aware shell parallelizer (50×–400× faster than parallel)

forkrun is a self-tuning, NUMA-aware, drop-in replacement for GNU Parallel and xargs -P that accelerates shell-based data preparation by 50×–400× on modern CPUs. It achieves near-total CPU utilization and scales linearly on multi-socket NUMA architectures.
forkrun achieves:
- 200,000+ batch dispatches/sec (vs ~500 for GNU Parallel)
- ~95–99% CPU utilization across all cores (vs ~6% for GNU Parallel)
- Near-zero cross-socket memory traffic (NUMA-aware “born-local” design)
forkrun is built for high-frequency, low-latency workloads on deep NUMA hardware — a regime where existing tools leave most cores idle due to IPC overhead and cross-socket data migration.
forkrun is distributed as a single bash file with an embedded, self-extracting compiled C extension. There are no external dependencies (no Perl, no Python).
Download and source it directly:
source <(curl -sL https://raw.githubusercontent.com/jkool702/forkrun/main/frun.bash)
(Note: sourcing the script sets up the required C loadable builtins in your shell environment.)
Once sourced, frun acts as a drop-in parallelizer:
frun my_bash_func < inputs.txt # parallelize custom bash functions natively!
cat file_list | frun -k sed 's/old/new/' # pipe-based input, ordered output
frun -k -s sort < records.tsv # stdin-passthrough, ordered output
frun -s -I 'gzip -c >{ID}.gz' < raw_logs # stdin-passthrough, unique output names
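As a rough illustration of the first form above, the sketch below defines a hypothetical worker function, count_chars, and hands it to frun. It assumes frun passes each batch of input lines to the function as positional arguments, much as xargs would; that detail is not spelled out here, so check it against the forkrun documentation. The wrapper run_count_chars is also made up for this example: if frun has not been sourced, it falls back to a plain serial loop so the sketch still runs.

```shell
# Hypothetical worker (not part of forkrun): prints each line with its length.
# Assumption: frun invokes the function with a batch of input lines as "$@".
count_chars() {
  for arg in "$@"; do
    printf '%s\t%d\n' "$arg" "${#arg}"
  done
}

# Feed stdin to the worker: through frun if it is available in this shell,
# otherwise one line at a time in a serial loop (same output, no parallelism).
run_count_chars() {
  if command -v frun >/dev/null 2>&1; then
    frun -k count_chars          # -k keeps output in input order
  else
    while IFS= read -r line; do
      count_chars "$line"
    done
  fi
}

printf '%s\n' alpha be gamma | run_count_chars
```

Writing the worker as a loop over "$@" keeps it correct whether it receives one line or a whole batch per call.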
Verifiable Builds: The embedded C extension is compiled and injected transparently via GitHub Actions. You can trace the git blame of the Base64 blob directly to the public CI workflow run that compiled forkrun_ring.c, so anyone can audit exactly how the shipped binary was produced from the public source.
| Workload | forkrun | GNU Parallel | Speedup | Notes |
|---|---|---|---|---|
| Default (array + fully-quoted args, no-op) | 24 M lines/s | 58 k lines/s | ~415× | forkrun default mode |
| Ordered output (-k, no-op) | 24.5 M lines/s | 57 k lines/s | ~430× | ordering is free in forkrun |
| echo (line args) | 22.6 M lines/s | ~55 k lines/s | ~410× | typical shell command |
| printf '%s\n' (I/O heavy) | 12.8 M lines/s | ~58 k lines/s | ~220× | formatting + output |
| -s stdin passthrough (no-op) | 893 M lines/s | 6.05 M lines/s (--pipe) | ~148× | streaming / splice |
| -b 524288 byte batches (no-op) | 1.54 B lines/s | 6.02 M lines/s (--pipe) | ~256× | kernel-limited |
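The Speedup column is the forkrun throughput divided by the GNU Parallel throughput. A small sketch for re-deriving it from the table, using a hypothetical helper named speedup and the rates written out in lines per second:

```shell
# speedup FORKRUN_RATE PARALLEL_RATE
# Prints the throughput ratio rounded to the nearest integer.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", a / b }'
}

speedup 24000000 58000       # default mode: prints 414 (table: ~415×)
speedup 893000000 6050000    # -s stdin passthrough: prints 148
speedup 1540000000 6020000   # -b 524288 byte batches: prints 256
```

The small differences from the table come from the table's inputs being rounded throughput figures.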
Average CPU utilization across ~400 benchmarks:
- forkrun: 95% (27.1 / 28 cores) — No centralized dispatcher; all 27.1 cores do actual work.
- GNU Parallel: 6% (2.68 / 28 cores) — 1 full core used strictly for dispatching work; 1.68 cores doing actual work.
Traditional tools like GNU Parallel use heavy regex parsing and IPC dispatch loops that bottleneck multi-socket servers. forkrun operates completely differently. The pipeline has four stages, each designed to preserve physical locality:
- Ingest (Born-Local NUMA): Data is splice()'d from stdin into a shared memfd. On multi-socket systems, set_mempolicy(MPOL_BIND) places each chunk's pages on a target NUMA node before any worker touches them.
- Index: Per-node indexers find record boundaries using AVX2/NEON SIMD scanning at memory bandwidth. They dynamically batch based on runtime conditions.
- Claim (Contention-Free): Workers claim batches via a single atomic_fetch_add, with no CAS retry loops, no locks, and no contention.
- Reclaim: A background fallow thread punches holes behind completed work via fallocate(PUNCH_HOLE), bounding memory usage.
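The Reclaim stage can be tried out from the shell. The sketch below is not forkrun's code: it uses the util-linux fallocate command, which issues the same kind of hole-punching fallocate call, on an ordinary temporary file rather than on a memfd. The function name punch_demo and the sizes are chosen for illustration, and GNU-style stat is assumed. Punching a hole releases the underlying storage while leaving the logical file size untouched, which is how a buffer can keep growing at the tail without its consumed head occupying memory.

```shell
# punch_demo FILE: write 1 MiB of data, then deallocate the first 512 KiB.
# Prints the logical size and the allocated block count before and after.
punch_demo() {
  f="$1"
  dd if=/dev/urandom of="$f" bs=1024 count=1024 2>/dev/null
  echo "before: size=$(stat -c %s "$f") blocks=$(stat -c %b "$f")"
  # -p punches a hole; -n keeps the logical file size unchanged.
  if fallocate -p -n -o 0 -l 524288 "$f" 2>/dev/null; then
    echo "after:  size=$(stat -c %s "$f") blocks=$(stat -c %b "$f")"
  else
    echo "hole punching is not supported on this filesystem"
  fi
}

tmp=$(mktemp)
punch_demo "$tmp"
rm -f "$tmp"
```

On a filesystem that supports hole punching, the block count should drop after the call while the size stays at 1048576 bytes.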
Adaptive tuning is fully automatic. A PID-based controller discovers the optimal batch size and continuously adjusts based on input rate and worker starvation.
Roadmap: Priorities include failure isolation, per-batch retries, and resume-after-interruption state saving for cluster/Slurm jobs.
Source: Hacker News









