NOW LET US – AI RAG SaaS Studio TP.HCM

Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m


A massive dataset containing every Hacker News item since 2006 is now available in Parquet format, featuring over 47 million records and 5-minute live updates.

Every Hacker News item since 2006, live-updated every 5 minutes

This dataset contains the complete Hacker News archive: every story, comment, Ask HN, Show HN, job posting, and poll ever submitted to the site. Hacker News is one of the longest-running and most influential technology communities on the internet, operated by Y Combinator since 2007. It has become the de facto gathering place for founders, engineers, researchers, and technologists to share and discuss what matters in technology.

The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,829 items committed. New items are fetched every 5 minutes and committed directly as individual Parquet files through an automated live pipeline, so the dataset stays current with the site itself.

We believe this is one of the most complete and regularly updated mirrors of Hacker News data available on Hugging Face. The data is stored as monthly Parquet files sorted by item ID, making it straightforward to query with DuckDB, load with the datasets library, or process with any tool that reads Parquet.

The dataset is organized as one Parquet file per calendar month, plus 5-minute live files for the current day's activity. Every 5 minutes, new items are fetched from the source and committed as a single Parquet block. At midnight UTC, the entire current month is refetched from the source and written as a single authoritative Parquet file, and that day's individual 5-minute blocks are removed from the today/ directory.

data/
  2006/2006-10.parquet
  ...
  2026/2026-03.parquet
today/
  2026/03/16/00/00.parquet
  ...
stats.csv
stats_today.csv
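
To make the naming scheme in the listing concrete, here is a minimal Python sketch that maps a UTC timestamp to the monthly file and to the 5-minute block that would cover it. The function names are hypothetical, and two details are assumptions read off the listing rather than documented behavior: that today/ sits beside data/ at the top level, and that a block's file name is the minute rounded down to a multiple of 5.

```python
from datetime import datetime, timezone


def monthly_path(ts: datetime) -> str:
    """Path of the monthly file covering ts, e.g. data/2026/2026-03.parquet."""
    ts = ts.astimezone(timezone.utc)
    return f"data/{ts:%Y}/{ts:%Y-%m}.parquet"


def live_block_path(ts: datetime) -> str:
    """Path of the 5-minute block covering ts.

    Assumption: the file is named after the minute floored to a multiple of 5.
    """
    ts = ts.astimezone(timezone.utc)
    minute = ts.minute - ts.minute % 5
    return f"today/{ts:%Y/%m/%d/%H}/{minute:02d}.parquet"


# Pass timezone-aware datetimes; naive ones would be treated as local time.
ts = datetime(2026, 3, 16, 0, 3, tzinfo=timezone.utc)
print(monthly_path(ts))
print(live_block_path(ts))
```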

Along with the Parquet files, we include stats.csv, which tracks every committed month with its item count, ID range, file size, fetch duration, and commit timestamp. This makes it easy to verify completeness and track the pipeline's progress.
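
One way a completeness check against stats.csv might look is sketched below in Python: it lists any calendar month missing between the earliest and latest committed month. The article does not give the CSV's column names, so `year` and `month` are assumptions, as are the function names; the sample rows are made up purely to exercise the code.

```python
import csv
import io


def missing_months(labels):
    """Return the 'YYYY-MM' labels absent between the earliest and latest one."""
    present = set(labels)
    if not present:
        return []
    current, last = min(present), max(present)
    gaps = []
    while current <= last:  # zero-padded labels sort chronologically
        if current not in present:
            gaps.append(current)
        year, month = int(current[:4]), int(current[5:7])
        # Advance one calendar month, rolling over into the next year.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        current = f"{year:04d}-{month:02d}"
    return gaps


def month_labels(csv_text, year_col="year", month_col="month"):
    """Build 'YYYY-MM' labels from CSV text; the column names are assumptions."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [f"{int(r[year_col]):04d}-{int(r[month_col]):02d}" for r in reader]


# Made-up rows for illustration only, not real contents of stats.csv.
sample = "year,month\n2006,10\n2006,11\n2007,1\n"
print(missing_months(month_labels(sample)))  # ['2006-12']
```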

DuckDB can read Parquet files directly from Hugging Face without downloading anything first. This is the fastest way to explore the data. The type column is stored as a small integer: 1 = story, 2 = comment, 3 = poll, 4 = pollopt, 5 = job.
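
When post-processing query results outside SQL, the integer codes can be mapped back to names. A small Python sketch follows; the enum and helper names are hypothetical, while the code-to-name mapping is the one listed above.

```python
from enum import IntEnum


class ItemType(IntEnum):
    """Integer codes used by the type column."""
    STORY = 1
    COMMENT = 2
    POLL = 3
    POLLOPT = 4
    JOB = 5


def type_name(code: int) -> str:
    """Return the lowercase name for a type code; raises ValueError if unknown."""
    return ItemType(code).name.lower()


print(type_name(2))  # comment
```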

Example SQL for top stories:

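-- "by" is quoted because BY is a reserved word in SQL.
SELECT id, title, "by", score, url, time
FROM read_parquet('hf://datasets/open-index/hacker-news/data/*/*.parquet')
-- type = 1 selects stories; the title filter skips items with an empty title.
WHERE type = 1 AND title != ''
ORDER BY score DESC
LIMIT 20;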
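
The glob in the query above matches every monthly file. Because the files are grouped into one directory per year, the same query can be pointed at a single year by replacing the first wildcard. The Python sketch below only builds the SQL string; `top_stories_sql` is a hypothetical helper, not part of the dataset or of DuckDB.

```python
from typing import Optional

BASE = "hf://datasets/open-index/hacker-news/data"


def top_stories_sql(year: Optional[int] = None, limit: int = 20) -> str:
    """Build the top-stories query, optionally restricted to one year's files."""
    if limit < 1:
        raise ValueError("limit must be positive")
    # Layout is data/<year>/<year>-<month>.parquet, so fixing the year
    # directory narrows the glob to that year's monthly files.
    year_dir = "*" if year is None else f"{int(year):04d}"
    return (
        'SELECT id, title, "by", score, url, time\n'
        f"FROM read_parquet('{BASE}/{year_dir}/*.parquet')\n"
        "WHERE type = 1 AND title != ''\n"
        "ORDER BY score DESC\n"
        f"LIMIT {int(limit)};"
    )


print(top_stories_sql(year=2024, limit=10))
```

The returned string can be pasted into the DuckDB shell or passed to whichever DuckDB client is in use.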

Source: Hacker News
