summaryrefslogtreecommitdiff
path: root/fs/bcachefs/io_read.c
AgeCommit message (Collapse)Author
2025-04-06bcachefs: Read retries are after checksum errors now REQ_FUAKent Overstreet
REQ_FUA means "skip the drive cache", and it can be used with reads to. If there was a checksum error, we want to retry the whole read path, not read it from cache again. Suggested-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06bcachefs: RO mounts now use less memoryKent Overstreet
Defer memory allocations only needed in RW mode until we actually go RW. This is part of improved support for RO images. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06bcachefs: add missing includeKent Overstreet
Hygeine, and fix build in userspace. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06bcachefs: Data move can read from poisoned extentsKent Overstreet
Now, if an extent is poisoned we can move it even if there was a checksum error. We'll have to give it a new checksum, but the poison bit means that userspace will still see the appropriate error when they try to read it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06bcachefs: Poison extents that can't be read due to checksum errorsKent Overstreet
Copygc needs to be able to move extents that have bitrotted. We don't want to delete them - in the future we'll have an API for "read me the data even if there's checksum errors", and in general we don't want to delete anything unless the user asks us to. That will require writing it with a new checksum, which means we can't forget that there was a checksum error so we return the correct error to userspace. Rebalance also wants to skip bad extents; we can now use the poison flag for that. This is currently disabled by default, as we want read fua support so that we can distinguish between transient and permanent errors from the device. It may be enabled with the module parameter: poison_extents_on_checksum_error Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-06bcachefs: Be precise about bch_io_failuresKent Overstreet
If the extent we're reading from changes, due to be being overwritten or moved (possibly partially) - we need to reset bch_io_failures so that we don't accidentally mark a new extent as poisoned prematurely. This means we have to separately track (in the retry path) the extent we previously read from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: Kill btree_iter.transKent Overstreet
This was planned to be done ages ago, now finally completed; there are places where we have quite a few btree_trans objects on the stack, so this reduces stack usage somewhat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: Split up bch_dev.io_refKent Overstreet
We now have separate per device io_refs for read and write access. This fixes a device removal bug where the discard workers were still running while we're removing alloc info for that device. It's also a bit of hardening; we no longer allow writes to devices that are read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-30bcachefs: Log original key being moved in data updatesKent Overstreet
There's something going on with the data move path; log the original key being moved for debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-25bcachefs: Fix silent short reads in data read retry pathKent Overstreet
__bch2_read, before calling __bch2_read_extent(), sets bvec_iter.bi_size to "the size we can read from the current extent" with a swap, and restores it to "the size for the total read" after the read_extent call with another swap. But we neglected to do the restore before the "if (ret) goto err;" - which is a problem if we're retrying those errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: Add missing random.h includesKent Overstreet
Fix build in userspace, and good hygeine. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: __bch2_read() now takes a btree_transKent Overstreet
Next patch will be checking if the extent we're reading from matches the IO failure we saw before marking the failure. For this to work, __bch2_read() needs to take the same transaction context that bch2_rbio_retry() uses to do that check. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-24bcachefs: BCH_READ_data_update -> bch_read_bio.data_updateKent Overstreet
Read flags are codepath dependent and change as they're passed around, while the fields in rbio._state are mostly fixed properties of that particular object. Losing track of BCH_READ_data_update would be bad, and previously it was not obvious if it was always correctly set in the rbio, so this is a safety cleanup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16bcachefs: Checksum errors get additional retriesKent Overstreet
It's possible for checksum errors to be transient - e.g. flakey controller or cable, thus we need additional retries (besides retrying from different replicas) before we can definitely return an error. This is particularly important for the next patch, which will allow the data move path to move extents with checksum errors - we don't want to accidentally introduce bitrot due to a transient error! - bch2_bkey_pick_read_device() is substantially reworked, and bch2_dev_io_failures is expanded to record more information about the type of failure (i.e. number of checksum errors). It now returns an error code that describes more precisely the reason for the failure - checksum error, io error, or offline device, instead of the previous generic "insufficient devices". This is important for the next patches that add poisoning, as we only want to poison extents when we've got real checksum errors (or perhaps IO errors?) - not because a device was offline. - Add a new option and superblock field for the number of checksum retries. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16bcachefs: Print message on successful read retryKent Overstreet
Users have been asking for this, and now that errors are returned to the top level read retry path - we can. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16bcachefs: Return errors to top level bch2_rbio_retry()Kent Overstreet
Next patch will be adding an additional retry loop for checksum errors, so that we can rule out transient errors before marking an extent as poisoned. Prerequisite to this is returning errors to bch2_rbio_retry(); this will also let us add a "successful retry" message. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16bcachefs: BCH_ERR_data_read_buffer_too_smallKent Overstreet
Now that the read path uses proper error codes, we can get rid of the weird rbio->hole signalling to the move path that the read didn't happen. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16bcachefs: Read error message now indicates if it was for an internal moveKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16bcachefs: Fix BCH_ERR_data_read_csum_err_maybe_userspace in retry pathKent Overstreet
When we do a read to a buffer that's mapped into userspace, it's possible to get a spurious checksum error if userspace was modified the buffer at the same time. When we retry those, they have to be bounced before we know definitively whether we're reading corrupt data. But the retry path propagates read flags differently, so needs special handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16bcachefs: Convert read path to standard error codesKent Overstreet
Kill the READ_ERR/READ_RETRY/READ_RETRY_AVOID enums, and add standard error codes that describe precisely which error occured. This is going to be used for the data move path, to move but poison extents with checksum errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-16bcachefs: Debug params for data corruption injectionKent Overstreet
dm-flakey is busted, and this is simpler anyways - this lets us test the checksum error retry ptahs Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bch2_account_io_completion()Kent Overstreet
We need to start accounting successes for every IO, not just failures, so introduce a unified hook for io completion accounting and convert io_read.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Fix read path io_ref handlingKent Overstreet
We were using our device pointer after we'd released our ref to it. Unlikely to be a race that's practical to hit, since actually removing a member device is a whole process besides just taking it offline, but - needs to be fixed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bch2_inum_offset_err_msg_trans() no longer handles transaction ↵Kent Overstreet
restarts we're starting to use error messages with paths in fsck_errors(), where we do not want nested transaction restart handling, so let's prepare for that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Read/move path counter workKent Overstreet
Reorganize counters a bit, grouping related counters together. New counters: - io_read_inline - io_read_hole Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: ScrubKent Overstreet
Add a new data op to walk all data and metadata in a filesystem, checking if it can be read successfully, and on error repairing from another copy if possible. - New helper: bch2_dev_idx_is_online(), so that we can bail out and report to userspace when we're unable to scrub because the device is offline - data_update_opts, which controls the data move path, now understands scrub: data is only read, not written. The read path is responsible for rewriting on read error, as with other reads. - scrub_pred skips data extents that don't have checksums - bch_ioctl_data has a new scrub member, which has a data_types field for data types to check - i.e. all data types, or only metadata. - Add new entries to bch_move_stats so that we can report numbers for corrected and uncorrected errors - Add a new enum to bch_ioctl_data_event for explicitly reporting completion and return code (i.e. device offline) Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bch2_bkey_pick_read_device() can now specify a deviceKent Overstreet
To be used for scrub, where we want the read to come from a specific device. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Internal reads can now correct errorsKent Overstreet
Rework the read path so that BCH_READ_NODECODE reads now also self-heal after a read error and a successful retry - prerequisite for scrub. - __bch2_read_endio() now handles a read that's both BCH_READ_NODECODE and a bounce. Normally, we don't want a BCH_READ_NODECODE read to ever allocate a split bch_read_bio: we want to maintain the relationship between the bch_read_bio and the data_update it's embedded in. But correcting read errors requires allocating a split/bounce rbio that's embedded in a promote_op. We do still have a 1-1 relationship, i.e. we only allocate a single split/bounce if it's a BCH_READ_NODECODE, so things hopefully don't get too crazy. - __bch2_read_extent() now is allowed to allocate the promote_op for rewriting after a failed read, even if it's BCH_READ_NODECODE. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Don't self-heal if a data update is already rewritingKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Don't start promotes from bch2_rbio_free()Kent Overstreet
we don't want to block completion of the read - starting a promote calls into the write path, which will block. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Self healing writes are BCH_WRITE_alloc_nowaitKent Overstreet
If a drive is failing and we're moving data off of it, we can't necessairly depend on capacity/disk reservation calculations to avoid deadlocking/blocking on the allocator. And, we don't want to queue up infinite self healing moves anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Promotes should use BCH_WRITE_only_specified_devsKent Overstreet
Promotes, like most other internal moves, should only go to the specified target and not fall back to allocating from the full filesystem. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Be stricter in bch2_read_retry_nodecode()Kent Overstreet
Now that data_update embeds bch_read_bio, BCH_READ_NODECODE means that the read is embedded in a a data_update - and we can check in the retry path if the extent has changed and bail out. This likely fixes some subtle bugs with read errors and data moves. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: cleanup redundant code around data_update_op initializationKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: promote_op uses embedded bch_read_bioKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: rbio_init() cleanupKent Overstreet
Move more initialization to rbio_init(), to assist in further cleanups. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: rbio_init_fragment()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Rename BCH_WRITE flags fer consistency with other x-macros enumsKent Overstreet
The uppercase/lowercase style is nice for making the namespace explicit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: x-macroize BCH_READ flagsKent Overstreet
Will be adding a bch2_read_bio_to_text(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: kill bch_read_bio.devs_haveKent Overstreet
Dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Don't inc io_(read|write) counters for movesKent Overstreet
This makes 'bcachefs fs top' more useful; we can now see at a glance whether the IO to the device is being done for user reads/writes, or copygc/rebalance. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-13bcachefs: target_congested -> get_random_u32_below()Kent Overstreet
get_random_u32_below() has a better algorithm than bch2_rand_range(), it just didn't exist at the time. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-11bcachefs: Make sure trans is unlocked when submitting read IOKent Overstreet
We were still using the trans after the unlock, leading to this bug in the retry path: 00255 ------------[ cut here ]------------ 00255 kernel BUG at fs/bcachefs/btree_iter.c:3348! 00255 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP 00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 86048768: no device to read from: 00255 u64s 8 type extent 4098:168192:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:8040a368 compress none ec: idx 83 block 1 ptr: 0:302:128 gen 0 00255 bcachefs (0ca38fe8-0a26-41f9-9b5d-6a27796c7803): /fiotest offset 85983232: no device to read from: 00255 u64s 8 type extent 4098:168064:U32_MAX len 128 ver 0: durability: 0 crc: c_size 128 size 128 offset 0 nonce 0 csum crc32c 0:43311336 compress none ec: idx 83 block 1 ptr: 0:302:0 gen 0 00255 Modules linked in: 00255 CPU: 5 UID: 0 PID: 304 Comm: kworker/u70:2 Not tainted 6.14.0-rc6-ktest-g526aae23d67d #16040 00255 Hardware name: linux,dummy-virt (DT) 00255 Workqueue: events_unbound bch2_rbio_retry 00255 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00255 pc : __bch2_trans_get+0x100/0x378 00255 lr : __bch2_trans_get+0xa0/0x378 00255 sp : ffffff80c865b760 00255 x29: ffffff80c865b760 x28: 0000000000000000 x27: ffffff80d76ed880 00255 x26: 0000000000000018 x25: 0000000000000000 x24: ffffff80f4ec3760 00255 x23: ffffff80f4010140 x22: 0000000000000056 x21: ffffff80f4ec0000 00255 x20: ffffff80f4ec3788 x19: ffffff80d75f8000 x18: 00000000ffffffff 00255 x17: 2065707974203820 x16: 7334367520200a3a x15: 0000000000000008 00255 x14: 0000000000000001 x13: 0000000000000100 x12: 0000000000000006 00255 x11: ffffffc080b47a40 x10: 0000000000000000 x9 : ffffffc08038dea8 00255 x8 : ffffff80d75fc018 x7 : 0000000000000000 x6 : 0000000000003788 00255 x5 : 0000000000003760 x4 : ffffff80c922de80 x3 : ffffff80f18f0000 00255 x2 : ffffff80c922de80 x1 : 0000000000000130 x0 : 0000000000000006 00255 Call trace: 00255 __bch2_trans_get+0x100/0x378 (P) 00255 bch2_read_io_err+0x98/0x260 00255 bch2_read_endio+0xb8/0x2d0 00255 __bch2_read_extent+0xce8/0xfe0 00255 __bch2_read+0x2a8/0x978 00255 bch2_rbio_retry+0x188/0x318 00255 process_one_work+0x154/0x390 00255 worker_thread+0x20c/0x3b8 00255 kthread+0xf0/0x1b0 00255 ret_from_fork+0x10/0x20 00255 Code: 6b01001f 54ffff01 79408460 3617fec0 (d4210000) 00255 ---[ end trace 0000000000000000 ]--- 00255 Kernel panic - not syncing: Oops - BUG: Fatal exception 00255 SMP: stopping secondary CPUs 00255 Kernel Offset: disabled 00255 CPU features: 0x000,00000070,00000010,8240500b 00255 Memory Limit: none 00255 ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]--- Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-14bcachefs: Fix self healing on read errorKent Overstreet
We were incorrectly checking if there'd been an io error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: bch2_inum_to_path()Kent Overstreet
Add a function for walking backpointers to find a path from a given inode number, and convert various error messages to use it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: Don't try to en/decrypt when encryption not availableKent Overstreet
If a btree node says it's encrypted, but the superblock never had an encryptino key - whoops, that needs to be handled. Reported-by: syzbot+026f1857b12f5eb3f9e9@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: Don't delete reflink pointers to missing indirect extentsKent Overstreet
To avoid tragic loss in the event of transient errors (i.e., a btree node topology error that was later corrected by btree node scan), we can't delete reflink pointers to correct errors. This adds a new error bit to bch_reflink_p, indicating that it is known to point to a missing indirect extent, and the error has already been reported. Indirect extent lookups now use bch2_lookup_indirect_extent(), which on error reports it as a fsck_err() and sets the error bit, and clears it if necessary on succesful lookup. This also gets rid of the bch2_inconsistent_error() call in __bch2_read_indirect_extent, and in the reflink_p trigger: part of the online self healing project. An on disk format change isn't necessary here: setting the error bit will be interpreted by older versions as pointing to a different index, which will also be missing - which is fine. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: Reserve 8 bits in bch_reflink_pKent Overstreet
Better repair for reflink pointers, as well as propagating new inode options to indirect extents, are going to require a few extra bits bch_reflink_p: so claim a few from the high end of the destination index. Also add some missing bounds checking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-12-21bcachefs: small cleanup for extent ptr bitmasksKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07bcachefs: Fix UAF in __promote_alloc() error pathbcachefs-2024-11-07Kent Overstreet
If we error in data_update_init() after adding to the rhashtable of outstanding promotes, kfree_rcu() is required. Reported-by: Reed Riley <reed@riley.engineer> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>