tag name | bcachefs-2025-05-24 (642a3d66bc6d67e1d297fb12f7ac99990bdef752) |
tag date | 2025-05-24 20:18:44 -0400 |
tagged by | Kent Overstreet <kent.overstreet@linux.dev> |
tagged object | commit 9caea9208f... |
bcachefs updates for 6.16
Lots of changes:
- Poisoned extents can now be moved: this lets us handle bitrotted data
without deleting it. For now, reading from poisoned extents only
returns -EIO: in the future we'll have an API for specifying "read
this data even if there were bitflips".
- Incompatible features may now be enabled at runtime, via
"opts/version_upgrade" in sysfs. Toggle it to incompatible, and then
toggle it back - option changes via the sysfs interface are
persistent.
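As a concrete sketch, toggling the option from a shell might look like the following. The UUID and the exact sysfs layout here are assumptions (the tag message only names "opts/version_upgrade"); check /sys/fs/bcachefs/ on your system for the real paths:

```shell
# Hypothetical filesystem UUID - substitute the directory name that
# actually appears under /sys/fs/bcachefs/ for your mounted filesystem.
UUID=00000000-0000-0000-0000-000000000000
OPTS=/sys/fs/bcachefs/$UUID/options

if [ -d "$OPTS" ]; then
    # Toggle to incompatible, then back; since option changes made via
    # sysfs are persistent, this permanently enables incompat features.
    echo incompatible > "$OPTS/version_upgrade"
    echo compatible   > "$OPTS/version_upgrade"
else
    echo "no bcachefs filesystem mounted at $OPTS"
fi
```

The guard keeps the sketch safe to paste on a machine without a mounted bcachefs filesystem.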
- Various changes to support deployable disk images:
- RO mounts now use less memory
- Images may be stripped of alloc info, particularly useful for
slimming them down if they will primarily be mounted RO. Alloc info
will be automatically regenerated on first RW mount, and this is
quite fast.
- Filesystem images generated with 'bcachefs image' will be
automatically resized the first time they're mounted on a larger
device.
The images 'bcachefs image' generates with compression enabled have
been comparable in size to those generated by squashfs and erofs -
but you get a full RW capable filesystem.
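A hypothetical invocation is sketched below. The 'bcachefs image' subcommand is confirmed by the notes above, but the flag names here are assumptions, not the verified CLI; consult `bcachefs image --help` from bcachefs-tools for the actual interface:

```shell
# Build a compressed filesystem image from a source directory for RO
# deployment. Flag names below are illustrative assumptions only.
SRC=./rootfs
IMG=rootfs.img

if command -v bcachefs >/dev/null 2>&1; then
    bcachefs image create --source="$SRC" --compression=zstd "$IMG"
else
    echo "bcachefs-tools not installed; skipping image build"
fi
```

On first mount of such an image on a larger device, the filesystem is resized automatically, per the notes above.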
- Major error message improvements for btree node reads, data reads,
and elsewhere. We now build up a single error message that lists all
the errors encountered, actions taken to repair, and success/failure
of the IO. This extends to other error paths that may kick off other
actions, e.g. scheduling recovery passes: actions we took because of
an error are included in that error message, with grouping/indentation
so we can see what caused what.
- Repair/self healing:
- We can now kick off recovery passes and run them in the background
if we detect errors. Currently, this is just used by code that walks
backpointers; we now also check for missing backpointers at runtime
and run check_extents_to_backpointers if required. The messy 6.14
upgrade left missing backpointers for some users, and this will
correct that automatically instead of requiring a manual fsck - some
users noticed this as copygc spinning and not making progress.
In the future, as more recovery passes come online, we'll be able to
repair and recover from nearly anything - except for unreadable
btree nodes, and that's why you're using replication, of course -
without shutting down the filesystem.
- There's a new recovery pass, for checking the rebalance_work btree,
which tracks extents that rebalance will process later.
- Hardening:
- Close the last known hole in btree iterator/btree locking
assertions: path->should_be_locked paths must stay locked until the
end of the transaction. This shook out a few bugs, including a
performance issue that was causing unnecessary path_upgrade
transaction restarts.
- Performance:
  - Faster snapshot deletion: this is an incompatible feature, as it
    requires new sentinel values, for safety. Snapshot deletion no
    longer has to do a full metadata scan; it now just scans the inodes
    btree: if an extent/dirent/xattr is present for a given snapshot ID,
    we already require that an inode be present with that same snapshot
    ID.
If/when users hit scalability limits again (ridiculously huge
filesystems with lots of inodes, and many sparse snapshots), let me
know - the next step will be to add an index from snapshot ID ->
inode number, which won't be too hard.
  - Faster device removal: the "scan for pointers to this device" no
    longer does a full metadata scan; instead it walks backpointers.
    Like fast snapshot deletion, this is another incompat feature: it
    also requires a new sentinel value, because we don't want to reuse
    these device IDs until after a fsck.
  - We're now coalescing redundant accounting updates prior to
    transaction commit, taking some pressure off the journal. Shortly
    we'll also be doing multiple extent updates per transaction in the
    main write path, which, combined with the previous, should
    drastically cut down on the number of metadata updates we have to
    journal.
  - Stack usage improvements: all allocator state has been moved off
    the stack.
- Debug improvements:
- enumerated refcounts: The debug code previously used for filesystem
write refs is now a small library, and used for other heavily used
refcounts. Different users of a refcount are enumerated, making it
much easier to debug refcount issues.
- Async object debugging: There's a new kconfig option that makes
various async objects (different types of bios, data updates, write
ops, etc.) visible in debugfs, and it should be fast enough to leave
on in production.
- Various sets of assertions no longer require CONFIG_BCACHEFS_DEBUG,
instead they're controlled by module parameters and static keys,
meaning users won't need to compile custom kernels as often to help
debug issues.
- bch2_trans_kmalloc() calls can be tracked (there's a new kconfig
option); with it on you can check the btree_transaction_stats in
debugfs to see the bch2_trans_kmalloc() calls a transaction did when
it used the most memory.
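Assuming debugfs is mounted at its conventional location, inspecting those per-transaction stats could look like this (the per-UUID directory layout is an assumption based on standard debugfs conventions):

```shell
# bcachefs exposes one directory per filesystem under debugfs;
# btree_transaction_stats reports per-transaction statistics, including
# peak bch2_trans_kmalloc() usage when the new kconfig option is on.
DEBUGFS=/sys/kernel/debug/bcachefs

if [ -d "$DEBUGFS" ]; then
    for fs in "$DEBUGFS"/*; do
        cat "$fs/btree_transaction_stats"
    done
else
    echo "debugfs not mounted or bcachefs debug dir absent: $DEBUGFS"
fi
```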
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmgyaC0ACgkQE6szbY3K
bnYcGQ//ZOCe34wjVFub+dNn9os0llaIFaShTC9Baoi+Ly8qmMBkiVR8h0XZWJ6I
Xue8FaPksEDUF+pXSPjI+L/WA2uW/qNm2Q2RxEfxigSMSzUUZvHs/jU3ZkpZ1JQb
l327tun1XNNY2JagcTj09X+VoasLuhQtvBKXM6gAWozXNszLesd1vaFexPsk13bV
GwqSxlfayYt5DwzEf7OCL9CXWfW86qs8snLYAPpv/pyoVNKw+iuPFlhDA1AD1ZMG
s+syQ5R7u5ikcfpYnaakDsn3KhxsX+jLk5PoSHk/6kGy/5BdJ1AUYQEsSNfdcxHy
pxNht12Nuoo2q2qI0gL4oegnz36cndtveCf9vs6K0Vg24ZRylhh8uz3v/ZcAu0Ne
CwFvpxMn5jtIgqh75i9R1/W6aiuKffkE29D4Me5RJxEqoM8yKKhKx6tHHzZftT3a
QSvbgsfBghetfTqcajBvDDN5GQM2Z8pz2iLrIw/EHuAh15hAhzf+7ULHprIh6IDz
m/Px72xrh39CAKI8IdsjD7QLT9a7xN3WKQXbSvFMEPjnJtGL3JGARZfsKB2gL7ZO
551ONexueFkilQmGQfy20VYGF1Mu9mWTUqyVnNaQUMbgKKDcAivy71UyFe/n3GOB
xJyEKTfrJg8Qn+vEJvlhXevVnz5FO/hiOAMIrMPKQq8XT0iNdAA=
=srxl
-----END PGP SIGNATURE-----