tag name | bcachefs-2025-06-04 (f05ba39fbb4f0b2fe65f80a620c2d4d04348099a) |
tag date | 2025-06-04 16:48:38 -0400 |
tagged by | Kent Overstreet <kent.overstreet@linux.dev> |
tagged object | commit 3d11125ff6... |
bcachefs updates for 6.16, part 2
- More stack usage improvements (~600 bytes).
- Define CLASS()es for some commonly used types, and convert most
rcu_read_lock() uses to the new lock guards
- New introspection:
- Superblock error counters are now available in sysfs: previously,
they were only visible with 'show-super', which doesn't provide a
live view
- New tracepoint, error_throw(), which is called any time we return an
error and start to unwind
- Repair
- check_fix_ptrs() can now repair btree node roots
- We can now repair when we've somehow ended up with the journal using
a superblock bucket
- Revert some leftovers from the aborted directory i_size feature, and
add repair code: some userspace programs (e.g. sshfs) were getting
confused.
It seems in 6.15 there's a bug where i_nlink on the vfs inode has been
getting incorrectly set to 0, with some unfortunate results;
list_journal analysis showed bch2_inode_rm() being called (by
bch2_evict_inode()) when it clearly should not have been.
- bch2_inode_rm() now runs "should we be deleting this inode?" checks
that were previously only run when deleting unlinked inodes in
recovery.
- check_subvol() was treating a dangling subvol (pointing to a missing
root inode) like a dangling dirent, and deleting it. This was the
really unfortunate one: check_subvol() will now recreate the root
inode if necessary.
This took longer to debug than it should have, and we lost several
filesystems unnecessarily, becuase users have been ignoring the release
notes and blindly running 'fsck -y'. Debugging required reconstructing
what happened through analyzing the journal, when ideally someone would
have noticed 'hey, fsck is asking me if I want to repair this: it
usually doesn't, maybe I should run this in dry run mode and check
what's going on?'.
As a reminder, fsck errors are being marked as autofix once we've
verified, in real world usage, that they're working correctly; blindly
running 'fsck -y' on an experimental filesystem is playing with fire.
Up to this incident we've had an excellent track record of not losing
data, so let's try to learn from this one.
This is a community effort, I wouldn't be able to get this done without
the help of all the people QAing and providing excellent bug reports and
feedback based on real world usage. But please don't ignore advice and
expect me to pick up the pieces.
If an error isn't marked as autofix, and it /is/ happening in the wild,
that's also something I need to know about so we can check it out and
add it to the autofix list if repair looks good. I haven't been getting
those reports, and I should be; since we don't have any sort of
telemetry yet I am absolutely dependent on user reports.
Now I'll be spending the weekend working on new repair code to see if I
can get a filesystem back for a user who didn't have backups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmhAuL0ACgkQE6szbY3K
bnZlCg/+Pu2TgWBbkwrmHgKH9v4K3pwQRREXSj0TlbWQp9bK00zEBrmdEfTZKgUC
q5nAAa6zCs0w/A9TFA7t1W/3+JY28ENhoArKFWemLhFZ2qEEXTZlVHvqyHOyuPBf
Loe+hQO8qgWJm6KO9VMCT1pEupslQLRlhI8GhbPPcxPvYXVjmTne7KCanhjeSEx5
TLaOiMn7jr+qPeLZ7xSMaaUTbH2SASjwl2E9/4kG6VqaTTF2MnPNwrdJI0exjyvs
QRaUvYbwBBTe/ru5ddmJuWj+61awKS87ANg+rkO2FWpOrai2HfgHd6o+zge/IR2Z
/Cfarv1SSd1+0caVaGUAzhnoVhOpY1FU4emJwVvcwnBXeXdGIb/kpaw+Lxm7fr+U
J6EnqgUoBsBWBCWgvUxlNHVeJ6wBdVNtDlTHabaH8RSCJZjgjg2JaSQM/v9VPLNa
6jTy3rhkPo50BJBb/F/AZmrobWXR2MkgID3iPEMcpjEyLaRZvW9FPqMFIxKQrUfB
XGDU4dAu3C+Q9i1KDkFIvIG3e7z9nSmv6np4O57CgrmrmmCpRUz7Yy0yhqNs36/H
WhLh/Pjb9gupdFK0TwFiEEG3wfnmXlde2c8IfrXXzKSKPIZ0T/RnLZapS7i94c2E
DumhLYjNjSCiciQZh4eLK0bKx0NETUG79eLUTz5Gi3Pc02E0pU8=
=ZGDn
-----END PGP SIGNATURE-----