tag name: bcachefs-2025-03-23 (03fa73885c2b3a85a8050803a6007ef9e189383f)
tag date: 2025-03-23 10:22:36 -0400
tagged by: Kent Overstreet <kent.overstreet@linux.dev>
tagged object: commit b4de3a7a16...
bcachefs updates for 6.15

On disk format is now soft frozen: no more required/automatic format upgrades are anticipated before taking off the experimental label.

Major changes/features since 6.14:

- Scrub

- Blocksize greater than page size support

- A number of "rebalance spinning and doing no work" issues have been fixed; we now check if the write allocation will succeed in bch2_data_update_init(), before kicking off the read.

  There's still more work to do in this area. Later we may want to add another bitset btree, like rebalance_work, to track "extents that rebalance was requested to move but couldn't", e.g. due to destination target having insufficient online devices.

- We can now support scaling well into the petabyte range: latest bcachefs-tools will pick an appropriate bucket size at format time to ensure fsck can run in available memory (e.g. a server with 256GB of RAM and 100PB of storage would want 16MB buckets).

On disk format changes:

- 1.21: cached backpointers (scalability improvement)

  Cached replicas now get backpointers, which means we no longer rely on incrementing bucket generation numbers to invalidate cached data: this lets us get rid of the bucket generation number garbage collection, which had to periodically rescan all extents to recompute bucket oldest_gen.

  Bucket generation numbers are now only used as a consistency check, but they're quite useful for that.

- 1.22: stripe backpointers

  Stripes now have backpointers: erasure coded stripes have their own checksums, separate from the checksums for the extents they contain (and stripe checksums also cover the parity blocks). This is required for implementing scrub for stripes.

- 1.23: stripe lru (scalability improvement)

  Persistent lru for stripes, ordered by "number of empty blocks". This is used by the stripe creation path, which depending on free space may create a new stripe out of a partially empty existing stripe instead of starting a brand new stripe.
  This replaces an in-memory heap, and means we no longer have to read in the stripes btree at startup.

- 1.24: casefolding

  Case insensitive directory support, courtesy of Valve.

  This is an incompatible feature; to enable it, mount with
    -o version_upgrade=incompatible

- 1.25: extent_flags

  Another incompatible feature requiring explicit opt-in to enable.

  This adds a flags entry to extents, and a flag bit that marks extents as poisoned.

  A poisoned extent is an extent that was unreadable due to checksum errors. We can't move such extents without giving them a new checksum, and we may have to move them (e.g. for copygc or device evacuate). We also don't want to delete them: in the future we'll have an API that lets userspace ignore checksum errors and attempt to deal with simple bitrot itself. Marking them as poisoned lets us continue to return the correct error to userspace on normal read calls.

Other changes/features:

- BCH_IOCTL_QUERY_COUNTERS: this is used by the new 'bcachefs fs top' command, which shows a live view of all internal filesystem counters.

- Improved journal pipelining: we can now have 16 journal writes in flight concurrently, up from 4. We're logging significantly more to the journal than we used to with all the recent disk accounting changes and additions, so some users should see a performance increase on some workloads.

- BCH_MEMBER_STATE_failed: previously, we would do no IO at all to devices marked as failed. Now we will attempt to read from them, but only if we have no better options.

- New option, write_error_timeout: devices will be kicked out of the filesystem if all writes have been failing for x number of seconds.

  We now also kick devices out when notified by blk_holder_ops that they've gone offline.

- Device option handling improvements: the discard option should now be working as expected (additionally, in -tools, all device options that can be set at format time can now be set at device add time, i.e. data_allowed, state).

- We now try harder to read data after a checksum error: we'll do additional retries if necessary to a device after it gave us data with a checksum error, with FUA specified so that the entire IO path is retried.

- More self healing work: the full inode <-> dirent consistency checks that are currently run by fsck are now also run every time we do a lookup, meaning we'll be able to correct errors at runtime. Runtime self healing will be flipped on after the new changes have seen more testing; currently they're just checking for consistency.

- KMSAN fixes: our KMSAN builds should be nearly clean now, which will put a massive dent in the syzbot dashboard.
-----BEGIN PGP SIGNATURE-----

iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmfgMy8ACgkQE6szbY3K
bnaWCRAApupF+ZRTDXz27HI0Wl4gK5in+HKRD9uDJESeemDrFEnAuvxeNbjYX4f5
635xi2Z0u18VYOLN4sl+yKxFsnbK0guK5bB0QVHJg0Kgi9gpek14ZVV40hujX1XT
wk9mC/Rj3gLMZ1yuDqaVBm9JndrY8TQnYxr+PRJytuaH32U84s/jjyyC1pDaDzXW
N8UqcJQ21w6po57GIhCgde5/pHgUWF056azwL89mvjyJsKGAynu0mfzEl3oZBCwO
ysSA9QgGGcsyCvPFhaLWqW7lPemWkbyYxOttWueQSfRuuz5gY4tsERyKPjA31o4O
zoj54/ISStLB5PWWNaITpaSufxqd268LLuMCa4U5IooCCkWuD+i5N742OA+Av0Ii
yxuyOrd6cg2/F/LXKUORdRA0niacvyTuy+J/lM/Yjg7tpgSB0bRzZbdym75Z8BhT
BRv3kSK9WBrRpKPeuTP6FXRP2MuW2uelkQl+cO+wpT2t6VBMZmo6qZN34jJ/dHFI
TUxjVHYYPdqohu5FE30g6LBJPqojkgENxRqBqZ0VZxhYi6nrG1WjwlMDiIEHd3lZ
L8K13LfKTwWpRXTKsmKM7zPSM7mpbT4pJGZCAApo6T4P/RxiF0V2Hud93lLYNd9v
RjKMhzuDPCF+uRRwAhnR3qvoOgM2PGc4fmg1V7BLWDRfhOSCXcs=
=C7p4
-----END PGP SIGNATURE-----
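
The bucket size example in the notes above (256GB of RAM, 100PB of storage, 16MB buckets) can be sanity-checked with some rough arithmetic. The sketch below is not the bcachefs-tools heuristic: the function name pick_bucket_size, the 32 bytes of fsck state per bucket, and the 128KiB minimum bucket size are all assumptions made up for illustration, chosen so that the numbers land on the figure quoted in the notes. It only shows the shape of the tradeoff: fewer, larger buckets mean less per-bucket state to hold in memory.

```python
KIB = 2**10
MIB = 2**20
GIB = 2**30
PIB = 2**50


def pick_bucket_size(capacity_bytes, ram_bytes,
                     bytes_per_bucket=32,        # assumed fsck cost per bucket (hypothetical)
                     min_bucket_size=128 * KIB):  # assumed lower bound (hypothetical)
    """Return the smallest power-of-two bucket size, starting from
    min_bucket_size, for which per-bucket state fits in ram_bytes."""
    if capacity_bytes <= 0 or ram_bytes <= 0 or bytes_per_bucket <= 0:
        raise ValueError("capacity, RAM and per-bucket cost must be positive")

    # Largest number of buckets whose state fits in the memory budget.
    max_buckets = ram_bytes // bytes_per_bucket
    if max_buckets == 0:
        raise ValueError("memory budget is smaller than one bucket's state")

    bucket_size = min_bucket_size
    # Ceiling division gives the bucket count at this bucket size;
    # keep doubling the bucket size until that count fits the budget.
    while -(-capacity_bytes // bucket_size) > max_buckets:
        bucket_size *= 2
    return bucket_size


if __name__ == "__main__":
    size = pick_bucket_size(100 * PIB, 256 * GIB)
    print(size // MIB, "MiB buckets")
```

Under these assumed constants, 8MiB buckets on 100PiB would need about 400GiB of per-bucket state, while 16MiB buckets need about 200GiB, which is why the loop stops at 16MiB for a 256GiB machine. A different per-bucket cost or a budget below total RAM would shift the result.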