Case study: recovery of a corrupted 12 TB multi-device pool

A hard power cycle corrupted a 12 TB multi-device Btrfs pool beyond what the standard repair tools could fix. A set of 14 custom C tools recovered all but about 7.2 MB of the 4.59 TB of data, and the effort produced nine concrete suggestions for btrfs-progs development.

Case study: recovery of a severely corrupted 12 TB multi-device pool, plus constructive gap analysis and reference tool set #1107

Description

Hello, and thanks in advance for reading.

This is not a bug report. It is a case study write-up of a recovery effort on a severely corrupted 12 TB multi-device pool, shared here in case any of the observations are useful to btrfs-progs development. The goal is to be constructive, not to complain.

One paragraph summary

A hard power cycle on a three-device pool (data single, metadata DUP, DM-SMR disks) left the extent tree and free space tree in a state that no native repair path could resolve. A subsequent btrfs check --repair run entered an infinite loop of 46,000+ commits with zero net progress, rotating the 4 backup_roots slots past any pre-crash rollback point. Recovery eventually succeeded through a set of 14 custom C tools built against the internal btrfs-progs API, with a final data loss of about 7.2 MB out of 4.59 TB (0.00016 percent). The pool is now fully operational.

Full analysis

I wrote the case up in a structured way that covers environment, timeline, root cause classification, the bulletproof safety criterion we derived empirically, and nine specific areas where a relatively small upstream change would have prevented the need for most of the custom tooling.

The nine proposed improvement areas, in order of expected impact on operators hitting similar cases:

A. Progress detection in btrfs check --repair, so that a loop like the 46,000-commit one we hit aborts with a clear message instead of destroying backup_roots.

B. Symmetric handling of BTRFS_ADD_DELAYED_REF in reinit_extent_tree, matching the existing BTRFS_DROP_DELAYED_REF exemption.

C. Sibling safety precheck in btrfs_del_items rebalance so a drain below LEAF_DATA_SIZE/4 does not trigger push_leaf_left on a stale sharable sibling.

D. Supervised EEXIST handling in alloc_reserved_tree_block with three explicit modes (error, silent, update).

E. A btrfs rescue rebuild-extent-tree subcommand that operates from a pre-scanned ref list, as an alternative to the currently deadlocking --init-extent-tree.

F. A btrfs rescue clean-orphan-inodes subcommand with a built-in dry-run that applies the bulletproof 5-condition check and produces a machine-readable plan.

G. A btrfs rescue fix-bg-accounting subcommand for surgical BLOCK_GROUP_ITEM.used fixes after a bulk extent tree rebuild.

H. Clearer documentation that backup_roots[0..3] is a four-commit sliding window, not a historical backup (a point that is widely misunderstood).

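I. Documentation of the DIR i_size = sum(namelen * 2) rule, which bit us during orphan dir entry cleanup and is not currently written down in any user-facing place. Under this rule, a directory whose only entries are named a and bcd would have i_size (1 + 3) * 2 = 8.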

Reference implementation

All 14 custom tools, along with the single-line patch to alloc_reserved_tree_block, are published under GPL-2.0. Every tool runs in a read-only scan mode by default and has an opt-in --write mode. The README.md explains the execution order used during recovery. I am not proposing these as upstream patches directly. Most of the proposals above are not single-function changes, and getting any of them accepted would require a design discussion with people more familiar than me with the subsystems involved. Sharing the reference implementation felt more useful than opening nine separate pull requests without context.

How I would like this to be received

Please treat this as input, not as a demand. If any single observation or proposal is worth pursuing, I am happy to expand the analysis, provide additional evidence from the session logs, or test any proposed patch against the class of damage we hit. If none of it is useful, no problem, and thanks for the tool set that got us most of the way there.

Source: Hacker News