NOW LET US – AI RAG SaaS Studio TP.HCM

Case study: recovery of a corrupted 12 TB multi-device pool


A power failure corrupted a 12 TB Btrfs pool beyond what the standard repair tools could fix. With 14 custom-built C tools, all but about 7.2 MB of 4.59 TB was recovered (over 99.99%), and the effort produced concrete suggestions for btrfs-progs development.

Case study: recovery of a severely corrupted 12 TB multi-device pool, plus constructive gap analysis and reference tool set #1107

Description

Hello, and thanks in advance for reading.

This is not a bug report. It is a case study write-up of a recovery effort on a severely corrupted 12 TB multi-device pool, shared here in case any of the observations are useful to btrfs-progs development. The goal is constructive, not a complaint.

One paragraph summary

A hard power cycle on a 3 device pool (data single, metadata DUP, DM-SMR disks) left the extent tree and free space tree in a state that no native repair path could resolve. A subsequent btrfs check --repair run entered an infinite loop of 46,000+ commits with zero net progress, rotating the 4 backup_roots slots past any pre-crash rollback point. Recovery eventually succeeded through a set of 14 custom C tools built against the internal btrfs-progs API, with a final data loss of about 7.2 MB out of 4.59 TB (0.00016 percent). The pool is now fully operational.
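To make the backup_roots rotation concrete, here is a minimal C model of a four-slot window that is overwritten round-robin on every commit. It is an illustration only: struct backup_window, window_commit and window_contains are hypothetical names, not btrfs-progs API, and the real superblock stores full root pointers rather than a bare generation number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_BACKUP_SLOTS 4 /* the four backup_roots slots described above */

/* Hypothetical model: each slot remembers only the generation last written to it. */
struct backup_window {
    uint64_t slot_gen[NUM_BACKUP_SLOTS]; /* 0 means "never written" */
    unsigned next;                       /* slot the next commit overwrites */
};

void window_init(struct backup_window *w)
{
    memset(w, 0, sizeof(*w));
}

/* Record one commit: overwrite the next slot, then advance round-robin. */
void window_commit(struct backup_window *w, uint64_t generation)
{
    assert(generation > 0); /* 0 is reserved for empty slots in this model */
    w->slot_gen[w->next] = generation;
    w->next = (w->next + 1) % NUM_BACKUP_SLOTS;
}

/* Is a root from this generation still reachable through the window? */
bool window_contains(const struct backup_window *w, uint64_t generation)
{
    for (unsigned i = 0; i < NUM_BACKUP_SLOTS; i++) {
        if (w->slot_gen[i] == generation)
            return true;
    }
    return false;
}
```

In this model a pre-crash generation survives at most three later commits. The fourth overwrites it, so a loop of 46,000+ commits leaves nothing from before the crash to roll back to.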

Full analysis

I wrote the case up in a structured way that covers environment, timeline, root cause classification, the bulletproof safety criterion we derived empirically, and 9 specific areas where a relatively small upstream change would have prevented the need for most of the custom tooling.

The nine proposed improvement areas, in order of expected impact on operators hitting similar cases:

A. Progress detection in btrfs check --repair, so that a loop like the 46,000-commit one aborts with a clear message instead of destroying backup_roots.

B. Symmetric handling of BTRFS_ADD_DELAYED_REF in reinit_extent_tree, matching the existing BTRFS_DROP_DELAYED_REF exemption.

C. Sibling safety precheck in btrfs_del_items rebalance so a drain below LEAF_DATA_SIZE/4 does not trigger push_leaf_left on a stale sharable sibling.

D. Supervised EEXIST handling in alloc_reserved_tree_block with three explicit modes (error, silent, update).

E. A btrfs rescue rebuild-extent-tree subcommand that operates from a pre-scanned ref list, as an alternative to the currently deadlocking --init-extent-tree.

F. A btrfs rescue clean-orphan-inodes subcommand with a built-in dry-run that applies the bulletproof 5-condition check and produces a machine-readable plan.

G. A btrfs rescue fix-bg-accounting subcommand for surgical BLOCK_GROUP_ITEM.used fixes after a bulk extent tree rebuild.

H. Clearer documentation that backup_roots[0..3] is a four-commit sliding window, not a historical backup (widely misunderstood).

I. Documentation of the DIR i_size = sum(namelen * 2) rule, which bit us during orphan dir entry cleanup and is not currently written down in any user-facing place.
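Proposal A could take many forms. One possible shape, sketched here under stated assumptions, is a small stall counter that the repair loop consults after every commit. The struct repair_progress type, both function names, and the idea of a single "outstanding error count" are hypothetical. How btrfs check would actually measure progress is exactly the design question the proposal leaves open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stall detector for a repair loop. */
struct repair_progress {
    uint64_t best_errors;     /* lowest outstanding error count seen so far */
    unsigned stalled_commits; /* consecutive commits without a new low */
    unsigned max_stalled;     /* abort once this many commits made no progress */
};

void progress_init(struct repair_progress *p, uint64_t initial_errors,
                   unsigned max_stalled)
{
    assert(max_stalled > 0);
    p->best_errors = initial_errors;
    p->stalled_commits = 0;
    p->max_stalled = max_stalled;
}

/*
 * Call once after each commit with the current outstanding error count.
 * Returns true when the caller should stop and report a clear error.
 */
bool progress_should_abort(struct repair_progress *p, uint64_t errors_now)
{
    if (errors_now < p->best_errors) {
        /* Net progress: remember the new low and reset the stall counter. */
        p->best_errors = errors_now;
        p->stalled_commits = 0;
        return false;
    }
    p->stalled_commits++;
    return p->stalled_commits >= p->max_stalled;
}
```

A threshold smaller than the four backup_roots slots, for example 3, would stop a stalled run before the window described in item H has fully rotated. That choice is an assumption of this sketch, not something the case study prescribes.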
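The rule in item I is easy to state in code. The helper below is a hypothetical illustration (expected_dir_isize is not a btrfs-progs function) that computes the directory i_size implied by a list of entry names. It assumes names are plain byte strings, so namelen is the byte length. The post does not say why the factor is two. The usual explanation is that each entry is recorded twice, once as a DIR_ITEM and once as a DIR_INDEX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: expected i_size of a directory whose entries have
 * the given names, following DIR i_size = sum(namelen * 2).
 */
uint64_t expected_dir_isize(const char *const *names, size_t count)
{
    uint64_t total = 0;

    assert(names != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        total += (uint64_t)strlen(names[i]) * 2; /* namelen counted twice */
    return total;
}
```

For a directory holding "a", "docs" and "photo.jpg", this gives (1 + 4 + 9) * 2 = 28. Removing the orphan entry "docs" must lower i_size to 20, which is the kind of bookkeeping that is easy to miss during manual cleanup.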

Reference implementation

All 14 custom tools, along with the single-line patch to alloc_reserved_tree_block, are published under GPL-2.0. Every tool defaults to a read-only scan mode; writing requires an explicit --write flag. The README.md explains the execution order used during recovery. I am not proposing these as upstream patches directly. Most of the proposals above are not single-function changes, and getting any of them accepted would require a design discussion with people more familiar than I am with the subsystems involved. Sharing the reference implementation felt more useful than opening nine separate pull requests without context.
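The read-only-by-default convention can be sketched in a few lines of C. This is not code from the published tools: struct tool_opts, parse_tool_opts and device_open_flags are hypothetical names, and the only detail taken from the description above is that writing requires an explicit --write flag. The option parsing uses the standard getopt_long call.

```c
#include <assert.h>
#include <fcntl.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option set for a scan/repair tool. */
struct tool_opts {
    bool write;         /* false unless --write was given */
    const char *device; /* first non-option argument */
};

/* Returns 0 on success, -1 on a usage error. */
int parse_tool_opts(int argc, char **argv, struct tool_opts *out)
{
    static const struct option longopts[] = {
        { "write", no_argument, NULL, 'w' },
        { NULL, 0, NULL, 0 },
    };
    int c;

    assert(out != NULL);
    out->write = false; /* safe default: scan only */
    out->device = NULL;
    optind = 1;         /* restart scanning so the function can be reused */

    while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
        if (c == 'w')
            out->write = true;
        else
            return -1;  /* unknown option */
    }
    if (optind >= argc)
        return -1;      /* no device given */
    out->device = argv[optind];
    return 0;
}

/* The device is opened writable only when the user opted in. */
int device_open_flags(const struct tool_opts *opts)
{
    return opts->write ? O_RDWR : O_RDONLY;
}
```

Deriving the open(2) flags from the parsed options in one place means a scan run cannot modify the device even if a later code path attempts a write, because the file descriptor itself is read-only.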

How I would like this to be received

Please treat this as input, not as a demand. If any single observation or proposal is worth pursuing, I am happy to expand the analysis, provide additional evidence from the session logs, or test any proposed patch against the class of damage we hit. If none of it is useful, no problem, and thanks for the tool set that got us most of the way there.

© 2026 Now Let Us. All rights reserved.

Source: Hacker News
