I imported the full Linux kernel git history into pgit

A deep dive into importing 1.4 million Linux kernel commits into pgit, a SQL-based Git alternative. The post details the hardware setup, OS tuning, and PostgreSQL configurations required to handle one of the world's largest repositories.
TL;DR: Imported the full Linux kernel history into pgit. 1,428,882 commits, 24.4 million file versions, 20 years of development, stored in PostgreSQL with delta compression. Actual data: 2.7 GB (git gc --aggressive gets 1.95 GB). The import took 2 hours on a dedicated server. Then I started asking questions. 7 f-bombs in 1.4 million commit messages (all from 2 people). 665 bug fixes pointing at a single commit. A filesystem that took 13 years to merge. Here's what the Linux kernel looks like as a SQL database.
The import
This post builds on pgit: What If Your Git History Was a SQL Database?. If you haven't read it, start there. Short version: pgit is a Git-like CLI where everything lives in PostgreSQL instead of the filesystem. It uses pg-xpatch for transparent delta compression and makes your entire commit history SQL-queryable. After the pgit post hit the HN front page and got picked up by TLDR, console.dev, and dailydev, I teased that I was importing the Linux kernel. Here's what happened.
The Linux kernel is one of the largest actively developed repositories in the world. 1.4 million commits spanning 20 years, 171,000 files, 38,000 contributors. From what I've found, only a handful of VCS besides git have ever managed a full import of the kernel's history. Fossil (SQLite-based, by the SQLite team) never did. Darcs and Monotone attempted it with severe performance problems. Mercurial can do it. Correct me if I'm wrong on any of this.
pgit handled it.
| Metric | Value |
|---|---|
| Commits | 1,428,882 |
| File versions (file refs) | 24,384,844 |
| Unique blobs | 3,089,589 |
| Unique paths | 171,525 |
| Path groups (delta chains) | 137,600 |
| Import time | 2h 0m 48s |
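As a rough sanity check, the import time can be converted to seconds and divided into the commit and file-version counts from the table. This is plain arithmetic on the published figures; the helper names `to_seconds` and `rate` are made up for illustration and are not part of pgit.

```shell
# Figures copied from the import summary table.
commits=1428882
file_refs=24384844

# Convert hours, minutes, seconds to a total number of seconds.
to_seconds() { echo $(( $1 * 3600 + $2 * 60 + $3 )); }

# Integer throughput: items per second, rounded down.
rate() { echo $(( $1 / $2 )); }

secs=$(to_seconds 2 0 48)
echo "import time:     ${secs} s"
echo "commits/s:       $(rate "$commits" "$secs")"
echo "file versions/s: $(rate "$file_refs" "$secs")"
```

These are averages over the whole run, so they say nothing about how the rate varied as the delta chains grew.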
The import ran on a Hetzner dedicated server in Finland: AMD EPYC 7401P (24 cores / 48 threads), 512 GB DDR4 ECC RAM, 2×1.92 TB SSD in RAID 0. With a 350 GB xpatch content cache, the entire decoded repository fits in memory.
Full server setup, git baseline, and pgit configuration
The server
Hetzner Dedicated "Server Auction" from their Finland datacenter (HEL1):
| Component | Spec |
|---|---|
| CPU | AMD EPYC 7401P (24 cores / 48 threads) |
| RAM | 16×32 GB DDR4 ECC reg. (512 GB total) |
| Storage | 2×Micron SSD SATA 1.92 TB Datacenter (RAID 0) |
| NIC | 1 Gbit Intel I350 |
| Cost | ~€272/month |
OS installation
Hetzner installimage with Ubuntu 24.04 LTS. Two changes from the default config: RAID 0 (SWRAIDLEVEL 0) for maximum throughput (no redundancy needed for ephemeral analysis work), and a simple partition layout:
PART /boot ext3 1024M
PART swap swap 4G
PART / ext4 all
This gives ~3.5 TB usable storage across the two 1.92 TB SSDs.
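A quick way to see where the ~3.5 figure comes from, assuming it is reported in binary units (TiB) the way `df -h` and `lsblk` do: RAID 0 stripes across both drives, so the capacities add, and the vendor's decimal terabytes shrink a little when converted. The sketch below ignores the small `/boot` and swap partitions and filesystem overhead; the function names are illustrative.

```shell
# Drive size as marketed (decimal): 1.92 TB = 1.92 * 10^12 bytes.
drive_bytes=1920000000000
drives=2

# RAID 0 has no redundancy, so raw capacity is the sum of all members.
raid0_bytes() { echo $(( $1 * $2 )); }

# Convert bytes to TiB (2^40 bytes) with two decimals.
to_tib() { LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 ^ 4) }'; }

total=$(raid0_bytes "$drive_bytes" "$drives")
echo "raw RAID 0 capacity: ${total} bytes = $(to_tib "$total") TiB"
```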
OS tuning
After booting into the installed image:
# --- Packages ---
apt update && apt upgrade -y
apt install -y tmux btop htop iotop cpufrequtils numactl git curl wget unzip build-essential ufw linux-tools-common linux-tools-$(uname -r)
# --- CPU governor → performance (all 48 threads) ---
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$cpu"; done
# --- Kernel mitigations off ---
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0"/GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 mitigations=off"/' /etc/default/grub.d/hetzner.cfg
update-grub
# --- sysctl ---
cat >> /etc/sysctl.conf << 'EOF'
vm.swappiness = 1
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
kernel.numa_balancing = 1
EOF
sysctl -p
# --- Disable Transparent Huge Pages ---
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# --- noatime ---
sed -i 's|relatime|noatime|g' /etc/fstab
mount -o remount,noatime /
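Values written directly under `/sys` (the CPU governor and the Transparent Huge Pages switches) do not persist across a reboot, while `mitigations=off` only takes effect after one, so it is worth checking the live state before starting a long import. The following is a minimal verification sketch, not something from the original setup; `active_choice` and `check` are hypothetical helper names.

```shell
# Extract the active option from a sysfs multi-choice string,
# e.g. "always madvise [never]" -> "never".
active_choice() { printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'; }

# check <label> <expected> <actual>: print one status line per setting.
check() {
  if [ "$2" = "$3" ]; then echo "ok   $1 = $3"; else echo "FAIL $1 = '$3' (want $2)"; fi
}

thp=/sys/kernel/mm/transparent_hugepage/enabled
gov=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Each check is skipped when the file is not readable on this machine.
if [ -r "$thp" ]; then check "THP" never "$(active_choice "$(cat "$thp")")"; fi
if [ -r "$gov" ]; then check "governor (cpu0)" performance "$(cat "$gov")"; fi
if [ -r /proc/sys/vm/swappiness ]; then
  check "vm.swappiness" 1 "$(cat /proc/sys/vm/swappiness)"
fi
if [ -r /proc/cmdline ]; then
  if grep -qw 'mitigations=off' /proc/cmdline; then
    echo "ok   mitigations=off on kernel command line"
  else
    echo "FAIL mitigations=off missing (reboot after update-grub?)"
  fi
fi
```

Only `cpu0` is inspected here for brevity; looping over every `cpu*/cpufreq/scaling_governor` file would cover all 48 threads.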
pgit configuration
pgit config --global container.shared_buffers 64GB
pgit config --global container.effective_cache_size 400GB
pgit config --global container.xpatch_cache_size_mb 358400 # 350 GB
pgit config --global container.max_worker_processes 28
pgit config --global import.workers 24
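The numbers above can be cross-checked with a little arithmetic: 350 GB expressed in MB gives the `xpatch_cache_size_mb` value, and the two large allocations can be summed against the 512 GB of RAM. This sketch assumes the xpatch cache is allocated separately from `shared_buffers`, which the post does not state; `effective_cache_size` is left out because in PostgreSQL it is a planner hint rather than an allocation.

```shell
# Convert GB to MB using binary multiples, matching 350 GB -> 358400 MB.
gb_to_mb() { echo $(( $1 * 1024 )); }

total_ram_gb=512
shared_buffers_gb=64
xpatch_cache_gb=350

echo "xpatch_cache_size_mb: $(gb_to_mb "$xpatch_cache_gb")"

# Assumption: both regions are resident at the same time.
reserved_gb=$(( shared_buffers_gb + xpatch_cache_gb ))
headroom_gb=$(( total_ram_gb - reserved_gb ))
echo "reserved: ${reserved_gb} GB, headroom: ${headroom_gb} GB"
```

Under that assumption, the remaining headroom is what is left for the OS page cache, the import workers, and per-connection memory.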
Configuration rationale
| Parameter | Value | Reasoning |
|---|---|---|
| shared_buffers | 64 GB | Dataset ~20 GB on disk |
Source: Hacker News













