Pgit: I Imported the Linux Kernel into PostgreSQL

A software engineer imported all 1.4 million commits and 20 years of Linux kernel history into a PostgreSQL database using pgit, making the entire repository queryable via SQL.

TL;DR: Imported the full Linux kernel history into pgit. 1,428,882 commits, 24.4 million file versions, 20 years of development, stored in PostgreSQL with delta compression. Actual data: 2.7 GB (git gc --aggressive gets 1.95 GB). The import took 2 hours on a dedicated server. Then I started asking questions. 7 f-bombs in 1.4 million commit messages (all from 2 people). 665 bug fixes pointing at a single commit. A filesystem that took 13 years to merge. Here's what the Linux kernel looks like as a SQL database.

The import

This post builds on "pgit: What If Your Git History Was a SQL Database?" If you haven't read it, start there. Short version: pgit is a Git-like CLI where everything lives in PostgreSQL instead of the filesystem. It uses pg-xpatch for transparent delta compression and makes your entire commit history SQL-queryable. After the pgit post hit the HN front page and got picked up by TLDR, console.dev, and dailydev, I teased that I was importing the Linux kernel. Here's what happened.

The Linux kernel is one of the largest actively developed repositories in the world. 1.4 million commits spanning 20 years, 171,000 files, 38,000 contributors. From what I've found, only a handful of VCS besides git have ever managed a full import of the kernel's history. Fossil (SQLite-based, by the SQLite team) never did. Darcs and Monotone attempted it with severe performance problems. Mercurial can do it. Correct me if I'm wrong on any of this.

pgit handled it.

| Metric | Value |
|---|---|
| Commits | 1,428,882 |
| File versions (file refs) | 24,384,844 |
| Unique blobs | 3,089,589 |
| Unique paths | 171,525 |
| Path groups (delta chains) | 137,600 |
| Import time | 2h 0m 48s |
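To put the table in perspective, here is a small sketch that derives a few ratios from it. The input figures are copied from the table; the ratios themselves (and the names used for them) are my own derived values, not numbers reported by pgit.

```python
# Figures copied from the import table above.
commits = 1_428_882
file_versions = 24_384_844
unique_blobs = 3_089_589
unique_paths = 171_525
path_groups = 137_600
import_seconds = 2 * 3600 + 0 * 60 + 48  # 2h 0m 48s


def import_ratios():
    """Derived ratios; these are illustrative, not pgit output."""
    return {
        "file_versions_per_commit": file_versions / commits,
        "file_versions_per_unique_blob": file_versions / unique_blobs,
        "paths_per_group": unique_paths / path_groups,
        "commits_per_second": commits / import_seconds,
    }


if __name__ == "__main__":
    for name, value in import_ratios().items():
        print(f"{name}: {value:.2f}")
```

That works out to roughly 17 file versions per commit and a sustained import rate of about 197 commits per second.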

The import ran on a Hetzner dedicated server in Finland: AMD EPYC 7401P (24 cores / 48 threads), 512 GB DDR4 ECC RAM, 2x1.92 TB SSD in RAID 0. With a 350 GB xpatch content cache, the entire decoded repository fits in memory.

Full server setup, git baseline, and pgit configuration

The server

Hetzner Dedicated "Server Auction" from their Finland datacenter (HEL1):

| Component | Spec |
|---|---|
| CPU | AMD EPYC 7401P (24 cores / 48 threads) |
| RAM | 16x32 GB DDR4 ECC reg. (512 GB total) |
| Storage | 2xMicron SSD SATA 1.92 TB Datacenter (RAID 0) |
| NIC | 1 Gbit Intel I350 |
| Cost | ~€272/month |

OS installation

Hetzner installimage with Ubuntu 24.04 LTS. Two changes from the default config: RAID 0 (SWRAIDLEVEL 0) for maximum throughput (no redundancy needed for ephemeral analysis work), and a simple partition layout:

PART /boot ext3 1024M
PART swap swap 4G
PART / ext4 all

This gives ~3.5 TB usable storage across the two 1.92 TB SSDs.
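The capacity figure can be checked with a quick calculation. This is a sketch under two assumptions of mine: the drives are exactly 1.92 decimal terabytes each, and the ~3.5 figure is what tools report in binary units (TiB). The helper name `raid0_capacity_bytes` is made up for illustration.

```python
TB = 10**12   # decimal terabyte, as used on drive labels
TiB = 2**40   # binary tebibyte, as many tools report sizes


def raid0_capacity_bytes(drive_sizes_bytes):
    """RAID 0 stripes across all members, so usable space is
    the smallest member multiplied by the number of members."""
    return min(drive_sizes_bytes) * len(drive_sizes_bytes)


drives = [1_920_000_000_000, 1_920_000_000_000]  # two 1.92 TB SSDs
raw = raid0_capacity_bytes(drives)

if __name__ == "__main__":
    print(f"{raw / TB:.2f} TB decimal, {raw / TiB:.2f} TiB binary")
```

The array comes to 3.84 TB in decimal units, or about 3.49 TiB, before subtracting the small /boot and swap partitions and filesystem overhead.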

OS tuning

After booting into the installed image, the system was tuned for performance: the CPU governor was set to performance, the kernel's CPU vulnerability mitigations were disabled, and sysctl parameters such as swappiness and the dirty ratios were adjusted. Transparent Huge Pages were disabled to ensure stability for the database workload.
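The exact values used are not listed here, so the following is only a read-only sketch for inspecting these settings on a Linux machine. The paths are the standard procfs/sysfs locations; looking at the kernel command line for the mitigations setting is my assumption about how they were disabled.

```python
from pathlib import Path

# Standard Linux locations for the settings mentioned above.
SETTINGS = {
    "cpu governor (cpu0)": "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
    "vm.swappiness": "/proc/sys/vm/swappiness",
    "vm.dirty_ratio": "/proc/sys/vm/dirty_ratio",
    "vm.dirty_background_ratio": "/proc/sys/vm/dirty_background_ratio",
    "transparent huge pages": "/sys/kernel/mm/transparent_hugepage/enabled",
    "kernel command line": "/proc/cmdline",  # assumption: mitigations set here
}


def read_setting(path):
    """Return the file's contents, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def report(settings=SETTINGS):
    """Map each setting name to its current value (None if unavailable)."""
    return {name: read_setting(path) for name, path in settings.items()}


if __name__ == "__main__":
    for name, value in report().items():
        print(f"{name}: {value if value is not None else 'unavailable'}")
```

The script changes nothing; it only reports what the running system currently uses, which makes it safe to run before and after tuning.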

pgit configuration

PostgreSQL was configured with 64 GB of shared buffers and a 350 GB xpatch cache so that the 24.4 million file versions and their delta chains could be processed in memory. Parallelism was tuned to match the 24-core EPYC processor.
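As a sanity check on that memory budget, here is a minimal sketch. It assumes the shared buffers and the xpatch cache are separate, non-overlapping reservations, which is my assumption rather than something stated about pg-xpatch; the helper name is also made up.

```python
# Figures from the text, in GB.
total_ram_gb = 512
shared_buffers_gb = 64
xpatch_cache_gb = 350


def remaining_ram_gb(total, *reservations):
    """RAM left for the OS, page cache, and per-query memory
    after subtracting fixed reservations."""
    used = sum(reservations)
    if used > total:
        raise ValueError("reservations exceed physical RAM")
    return total - used


headroom_gb = remaining_ram_gb(total_ram_gb, shared_buffers_gb, xpatch_cache_gb)

if __name__ == "__main__":
    print(f"reserved: {shared_buffers_gb + xpatch_cache_gb} GB, headroom: {headroom_gb} GB")
```

Under that assumption the two caches reserve 414 GB, leaving 98 GB of the 512 GB for everything else.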

Source: Hacker News