tag name | bcachefs-2025-01-20 (25cd90af11778d4633d62de06cf8448a3bc6dff7) |
tag date | 2025-01-20 11:13:06 -0500 |
tagged by | Kent Overstreet <kent.overstreet@linux.dev> |
tagged object | commit 63db187a57... |
bcachefs updates for 6.14-rc1
Lots of scalability work, another big on disk format change. On disk
format version goes from 1.13 to 1.20.
Like 6.11, this is another big and expensive automatic/required on disk
format upgrade. This is planned to be the last big on disk format
upgrade before the experimental label comes off. There will be one more
minor on disk format update for a few things that couldn't make this
release.
Headline improvements:
- Fix mount time regression that some users encountered after the 6.11
disk accounting rewrite.
Accounting keys were encoded little endian (typetag in the low bits),
which didn't anticipate adding accounting keys for every inode: those
keys aren't stored in memory, and we don't want to scan them at mount
time.
- fsck time on large filesystems is improved by multiple orders of
magnitude. Previously, 100TB was about the practical max filesystem
size, where users were reporting fsck times of a day+. With the new
changes (which nearly eliminate backpointers fsck overhead), we fsck'd
a filesystem with 10PB of data in 1.5 hours.
The problematic fsck passes were walking every extent and checking for
missing backpointers, and walking every backpointer to check for
dangling backpointers. As we've been adding more and more runtime self
healing there was no reason to keep around the backpointers -> extents
pass; dangling backpointers are just deleted, and we can do that when
using them - thus, backpointers -> extents is now only run in debug
mode.
extents -> backpointers does need to exist, since missing backpointers
would mean we can't find data to move it (for e.g. copygc, device
evacuate, scrub). But the new on disk format version makes possible a
new strategy where we sum up backpointers within a bucket and check the
sum against the bucket sector counts, and then only scan for missing
backpointers if the counts are off (and then, only for specific
buckets).
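The "sum up backpointers within a bucket" strategy described above can be sketched roughly as follows. This is an illustrative model only, not the kernel code: the function name, the tuple layouts, and the choice to simply ignore backpointers whose generation doesn't match the bucket's current generation are all assumptions made for the example.

```python
from collections import defaultdict


def buckets_needing_scan(bucket_sectors, backpointers):
    """Return the buckets whose backpointer sector sum disagrees with the
    bucket's own sector count (hypothetical helper, not a bcachefs API).

    bucket_sectors: dict mapping bucket -> (gen, sectors)
    backpointers:   iterable of (bucket, bucket_gen, sectors)
    """
    sums = defaultdict(int)
    for bucket, bucket_gen, sectors in backpointers:
        entry = bucket_sectors.get(bucket)
        # Assumption: a backpointer with a non-matching generation is stale
        # and does not count towards the bucket's current contents.
        if entry is not None and entry[0] == bucket_gen:
            sums[bucket] += sectors

    # Only these buckets would need the expensive "scan for missing
    # backpointers" pass; every other bucket is skipped.
    return sorted(
        bucket
        for bucket, (_gen, sectors) in bucket_sectors.items()
        if sums.get(bucket, 0) != sectors
    )


if __name__ == "__main__":
    buckets = {0: (1, 128), 1: (3, 64), 2: (2, 0)}
    bps = [(0, 1, 64), (0, 1, 64), (1, 3, 32), (1, 2, 32)]
    print(buckets_needing_scan(buckets, bps))
```

In this toy input, bucket 0 sums to its full sector count and bucket 2 is empty, so only bucket 1 (where one backpointer carries an old generation) is flagged for a targeted scan.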
Full list of on disk format changes:
- 1.14: backpointer_bucket_gen
Backpointers now have a field for the bucket generation number,
replacing the obsolete bucket_offset field. This is needed for the
new "sum up backpointers within a bucket" code, since backpointers use
the btree write buffer - meaning we will see stale reads, and this
runs online, with the filesystem in full rw mode.
- 1.15: disk_accounting_big_endian
As previously described, fix the endianness of accounting keys so that
accounting keys with the same typetag sort together, and accounting
read can skip types it's not interested in.
- 1.16: reflink_p_may_update_opts
This version indicates that a new reflink pointer field is understood
and may be used; the field indicates whether the reflink pointer has
permission to update IO path options (e.g. compression, replicas) on
the indirect extent it points to.
This completes the rebalance/reflink data path option handling from
the 6.13 pull request.
- 1.17: inode_depth
Add a new inode field, bi_depth, to accelerate the
check_directory_structure fsck path, which checks for loops in the
filesystem hierarchy.
check_inodes and check_dirents check connectivity, so
check_directory_structure only has to check for loops - by walking
back up to the root from every directory.
But a path can't be a loop if it has a counter that increases
monotonically from root to leaf - adding a depth counter means that we
can check for loops with only local (parent -> child) checks. We might
need to occasionally renumber the depth field in fsck if directories
have been moved around, but then future fsck runs will be much faster.
- 1.18: persistent_inode_cursors
Previously, the cursor used for inode allocation was only kept in
memory, which meant that users with large filesystems and lots of
files were reporting that the first create after mounting would take
a while - since it had to scan from the start.
Inode allocation cursors are now persistent, and also include a
generation field (incremented on wraparound, which will only happen if
inode allocation is restricted to 32 bit inodes), so that we don't
have to leave inode_generation keys around after a delete.
The option for 32 bit inode numbers may now also be set on individual
directories, and non-32 bit inode allocations are disallowed from
allocating from the 32 bit part of the inode number space.
- 1.19: autofix_errors
Runtime self healing is now the default.
- 1.20: directory size (from Hongbo)
Directory i_size is now meaningful, and not 0.
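The loop-check argument from the 1.17 inode_depth entry above can be illustrated with a small model. This is a sketch under stated assumptions, not the fsck implementation: directories are plain strings, the `parent` and `depth` dicts stand in for on-disk inodes and their bi_depth field, and both function names are hypothetical.

```python
from collections import deque


def local_depth_violations(parent, depth):
    """Directories whose depth is not strictly greater than their parent's.

    parent: dict mapping directory -> parent directory (root maps to None)
    depth:  dict mapping directory -> depth counter (modelling bi_depth)

    If this returns an empty list, depth strictly increases along every
    parent -> child edge, so no cycle can exist.
    """
    return sorted(
        d for d, p in parent.items()
        if p is not None and depth[d] <= depth[p]
    )


def renumber_depths(parent, root):
    """Recompute depths top-down from the root (the occasional slow path).

    Returns (depth, unreachable); anything never reached from the root,
    such as the members of a loop, ends up in `unreachable`.
    """
    children = {}
    for d, p in parent.items():
        if p is not None:
            children.setdefault(p, []).append(d)

    depth = {root: 0}
    queue = deque([root])
    while queue:
        d = queue.popleft()
        for c in children.get(d, []):
            depth[c] = depth[d] + 1
            queue.append(c)

    return depth, sorted(set(parent) - set(depth))


if __name__ == "__main__":
    parent = {"/": None, "/a": "/", "/a/b": "/a", "/c": "/"}
    depth = {"/": 0, "/a": 1, "/a/b": 2, "/c": 1}
    parent["/c"] = "/a/b"  # directory moved, depth left stale
    print(local_depth_violations(parent, depth))
    depth, unreachable = renumber_depths(parent, "/")
    print(local_depth_violations(parent, depth), unreachable)
```

The point of the design is that the cheap local check is enough on the common path; the full walk is only needed when a stale depth (for example after a rename) makes a local check fail.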
Release notes from the previous 6.13 pull request:
- Self healing work:
Allocator and reflink now run the exact same check/repair code that
fsck does at runtime, where applicable.
The long term goal here is to remove inconsistent() errors (that cause
us to go emergency read only) by lifting fsck code up to normal
runtime paths; we should only go emergency read-only if we detect an
inconsistency that was due to a runtime bug - or truly catastrophic
damage (corrupted btree roots/interior nodes).
- Reflink repair no longer deletes reflink pointers: instead we flip an
error bit and log the error, and they can still be deleted by file
deletion. This means a temporary failure to find an indirect extent
(perhaps repaired later by btree node scan) won't result in
unnecessary data loss.
- Improvements to rebalance data path option handling: we can now
correctly apply changed filesystem-level io path options to pending
rebalance work, and soon we'll be able to apply file-level io path
option changes to indirect extents.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmeOfpsACgkQE6szbY3K
bnaD4RAAlyHcrrvgD/JjGjNMqL4On+UIS92+pUfeLaynPvNEs9qsFxitm4t1EM7b
yWUzMJW3ru39VgWIDYkkZdCZ7MrEzXIEGM0MJYl9re0UitvodIR0SUVkHecz7e7f
8rrXVs0NkwfUCcoHhOLdpsPwTNVBKwteSop9EY9hbofjY5Dvh5kC+D8o+XDrLoDO
ur0utr+UaT8QbOSLBHSKbloYcmYHou+0kHMzOQdPJckmgcOL8llvEGWKawnRIifO
RHL6tEA15rJ/sQnpUSXHoHC4fqNuZHotvitrmU5pHFSOHG/zkRHNoHcMXKdgzau1
ExgzgIa1i+SYk19k5DeyjqkH08nQUX8r8G4uhNttmzu4HP36LqMVFFB3ne7Mca2h
p2cEAkZKBQWyAV3F5auY2Zm/1/KZcu8GesSv7nYnNufhVVbg7PbIJb01QiTaLYRX
1Cp+Yd+rZ2ydj9uQ7uzZoItCnUQy0o/8xBtNUEeZIT0sP7A+9nc4HM/BzuiTX2/w
lyzpHsm67TawbKcY1lmWkryh4rdqUATSnvKQ853wE/W5TsNRGkWILeOjCL0mJA6j
0C4QP+C6DXl8QJw+t4yV/o5ogQP7XDgMBaTcrSfKZm52Gaab+jIwWltonghNHk2g
VkVpD+2+PwVcaZgPG2+H7aMww1Kjlqnh+sXyxAEyEW+HS8ywQl8=
=CTyu
-----END PGP SIGNATURE-----