summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-07-05ovl: support layers on case-folding capable filesystemsAmir Goldstein
Case folding is often applied to subtrees and not on an entire filesystem. Disallowing layers from filesystems that support case folding is over limiting. Replace the rule that case-folding capable are not allowed as layers with a rule that case folded directories are not allowed in a merged directory stack. Should case folding be enabled on an underlying directory while overlayfs is mounted the outcome is generally undefined. Specifically in ovl_lookup(), we check the base underlying directory and fail with -ESTALE and write a warning to kmsg if an underlying directory case folding is enabled. Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/linux-fsdevel/20250520051600.1903319-1-kent.overstreet@linux.dev/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Read retries are after checksum errors now REQ_FUAKent Overstreet
REQ_FUA means "skip the drive cache", and it can be used with reads to. If there was a checksum error, we want to retry the whole read path, not read it from cache again. Suggested-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: read_fua_testKent Overstreet
Add a sysfs attribute for checking whether read fua appears to behave properly on a device. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05block: Allow REQ_FUA|REQ_READKent Overstreet
FUA is also allowed with reads, not just writes. The specified behaviour is: - If the location being read from in the drive cache is dirty, it's flushed - Read is serviced from media, not cache It's documented in the NVME specification, and the nvme driver already passes through REQ_FUA for reads, not just writes, so there's no reason for the block layer to be disallowing it. To validate behaviour, a simple test was run on a variety of hardware that checks latency of repeated reads to the same location (cached reads), random reads (uncached), and FUA reads to the same location. Data: - Samsung consumer SSDs Reads appear to not be cached - Seagate SCSI hard drives (ST20000NM002D) Reads are cached, and FUA reads appear to work correctly Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/ Link: https://lore.kernel.org/linux-bcachefs/26585.34711.506258.318405@quad.stoffel.home/T/#m5fffbc0e1c68cf0479c94b9f4ac1bef297333782 Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: shrinker.to_text() methodsKent Overstreet
This adds shrinker.to_text() methods for our shrinkers and hooks them up to our existing to_text() functions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05mm: shrinker: Add shrinker_to_text() to debugfs interfaceKent Overstreet
Previously, we added shrinker_to_text() and hooked it up to the OOM report - now, the same report is available via debugfs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05mm: Centralize & improve oom reporting in show_mem.cKent Overstreet
This patch: - Changes show_mem() to always report on slab usage - Instead of reporting on all slabs, we only report on top 10 slabs, and in sorted order - Also reports on shrinkers, with the new shrinkers_to_text(). Shrinkers need to be included in OOM/allocation failure reporting because they're responsible for memory reclaim - if a shrinker isn't giving up its memory, we need to know which one and why. More OOM reporting can be moved to show_mem.c and improved, this patch is only a start. New example output on OOM/memory allocation failure: 00177 Mem-Info: 00177 active_anon:13706 inactive_anon:32266 isolated_anon:16 00177 active_file:1653 inactive_file:1822 isolated_file:0 00177 unevictable:0 dirty:0 writeback:0 00177 slab_reclaimable:6242 slab_unreclaimable:11168 00177 mapped:3824 shmem:3 pagetables:1266 bounce:0 00177 kernel_misc_reclaimable:0 00177 free:4362 free_pcp:35 free_cma:0 00177 Node 0 active_anon:54824kB inactive_anon:129064kB active_file:6612kB inactive_file:7288kB unevictable:0kB isolated(anon):64kB isolated(file):0kB mapped:15296kB dirty:0kB writeback:0kB shmem:12kB writeback_tmp:0kB kernel_stack:3392kB pagetables:5064kB all_unreclaimable? no 00177 DMA free:2232kB boost:0kB min:88kB low:108kB high:128kB reserved_highatomic:0KB active_anon:2924kB inactive_anon:6596kB active_file:428kB inactive_file:384kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB 00177 lowmem_reserve[]: 0 426 426 426 00177 DMA32 free:15092kB boost:5836kB min:8432kB low:9080kB high:9728kB reserved_highatomic:0KB active_anon:52196kB inactive_anon:122392kB active_file:6176kB inactive_file:7068kB unevictable:0kB writepending:0kB present:507760kB managed:441816kB mlocked:0kB bounce:0kB free_pcp:72kB local_pcp:0kB free_cma:0kB 00177 lowmem_reserve[]: 0 0 0 0 00177 DMA: 284*4kB (UM) 53*8kB (UM) 21*16kB (U) 11*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2248kB 00177 DMA32: 2765*4kB (UME) 375*8kB (UME) 57*16kB (UM) 5*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15132kB 00177 4656 total pagecache pages 00177 1031 pages in swap cache 00177 Swap cache stats: add 6572399, delete 6572173, find 488603/3286476 00177 Free swap = 509112kB 00177 Total swap = 2097148kB 00177 130938 pages RAM 00177 0 pages HighMem/MovableOnly 00177 16644 pages reserved 00177 Unreclaimable slab info: 00177 9p-fcall-cache total: 8.25 MiB active: 8.25 MiB 00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB 00177 kmalloc-64 total: 2.08 MiB active: 2.07 MiB 00177 task_struct total: 1.95 MiB active: 1.95 MiB 00177 kmalloc-4k total: 1.50 MiB active: 1.50 MiB 00177 signal_cache total: 1.34 MiB active: 1.34 MiB 00177 kmalloc-2k total: 1.16 MiB active: 1.16 MiB 00177 bch_inode_info total: 1.02 MiB active: 922 KiB 00177 perf_event total: 1.02 MiB active: 1.02 MiB 00177 biovec-max total: 992 KiB active: 960 KiB 00177 Shrinkers: 00177 super_cache_scan: objects: 127 00177 super_cache_scan: objects: 106 00177 jbd2_journal_shrink_scan: objects: 32 00177 ext4_es_scan: objects: 32 00177 bch2_btree_cache_scan: objects: 8 00177 nr nodes: 24 00177 nr dirty: 0 00177 cannibalize lock: 0000000000000000 00177 00177 super_cache_scan: objects: 8 00177 super_cache_scan: objects: 1 Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: linux-mm@kvack.org Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05mm: shrinker: Add new stats for .to_text()Kent Overstreet
Add a few new shrinker stats. number of objects requested to free, number of objects freed: Shrinkers won't necessarily free all objects requested for a variety of reasons, but if the two counts are wildly different something is likely amiss. .scan_objects runtime: If one shrinker is taking an excessive amount of time to free objects that will block kswapd from running other shrinkers. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: linux-mm@kvack.org Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05mm: shrinker: Add a .to_text() method for shrinkersKent Overstreet
This adds a new callback method to shrinkers which they can use to describe anything relevant to memory reclaim about their internal state, for example object dirtyness. This patch also adds shrinkers_to_text(), which reports on the top 10 shrinkers - by object count - in sorted order, to be used in OOM reporting. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: linux-mm@kvack.org Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> From david@fromorbit.com Tue Aug 27 23:32:26 2024 > > + if (!mutex_trylock(&shrinker_mutex)) { > > + seq_buf_puts(out, "(couldn't take shrinker lock)"); > > + return; > > + } > > Please don't use the shrinker_mutex like this. There can be tens of > thousands of entries in the shrinker list (because memcgs) and > holding the shrinker_mutex for long running traversals like this is > known to cause latency problems for memcg reaping. If we are at > ENOMEM, the last thing we want to be doing is preventing memcgs from > being reaped. > > > + list_for_each_entry(shrinker, &shrinker_list, list) { > > + struct shrink_control sc = { .gfp_mask = GFP_KERNEL, }; > > This iteration and counting setup is neither node or memcg aware. > For node aware shrinkers, this will only count the items freeable > on node 0, and ignore all the other memory in the system. For memcg > systems, it will also only scan the root memcg and so miss counting > any memory in memcg owned caches. > > IOWs, the shrinker iteration mechanism needs to iterate both by NUMA > node and by memcg. On large machines with multiple nodes and hosting > thousands of memcgs, a total shrinker state iteration is has to walk > a -lot- of structures. > > And example of this is drop_slab() - called from > /proc/sys/vm/drop_caches(). It does this to iterate all the > shrinkers for all the nodes and memcgs in the system: > > static unsigned long drop_slab_node(int nid) > { > unsigned long freed = 0; > struct mem_cgroup *memcg = NULL; > > memcg = mem_cgroup_iter(NULL, NULL, NULL); > do { > freed += shrink_slab(GFP_KERNEL, nid, memcg, 0); > } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); > > return freed; > } > > void drop_slab(void) > { > int nid; > int shift = 0; > unsigned long freed; > > do { > freed = 0; > for_each_online_node(nid) { > if (fatal_signal_pending(current)) > return; > > freed += drop_slab_node(nid); > } > } while ((freed >> shift++) > 1); > } > > Hence any iteration for finding the 10 largest shrinkable caches in > the system needs to do something similar. Only, it needs to iterate > memcgs first and then aggregate object counts across all nodes for > shrinkers that are NUMA aware. > > Because it needs direct access to the shrinkers, it will need to use > the RCU lock + refcount method of traversal because that's the only > safe way to go from memcg to shrinker instance. IOWs, it > needs to mirror the code in shrink_slab/shrink_slab_memcg to obtain > a safe reference to the relevant shrinker so it can call > ->count_objects() and store a refcounted pointer to the shrinker(s) > that will get printed out after the scan is done.... > > Once the shrinker iteration is sorted out, I'll look further at the > rest of the code in this patch... > > -Dave.
2025-07-05seq_buf: seq_buf_human_readable_u64()Kent Overstreet
This adds a seq_buf wrapper for string_get_size(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05lib/crypto/poly1305: Fix arm64's poly1305_blocks_arch()Eric Biggers
For some reason arm64's Poly1305 code got changed to ignore the padbit argument. As a result, the output is incorrect when the message length is not a multiple of 16 (which is not reached with the standard ChaCha20Poly1305, but bcachefs could reach this). Fix this. Fixes: a59e5468a921 ("crypto: arm64/poly1305 - Add block-only interface") Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05dm-flakey: Fix corrupt_bio_byte setup checksKent Overstreet
Fix the error_reads mode - it's incompatible with corrupt_bio_byte, but that's only enabled if corrupt_bio_byte is nonzero. Cc: Benjamin Marzinski <bmarzins@redhat.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@lists.linux.dev Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Debug param for injecting btree node corruption on readKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Fix error message in buffered read pathKent Overstreet
We should print error codes as strings, not numbers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Print errcode when bch2_read_extent() sees errorKent Overstreet
This is not supposed to happen, so we WARN() - make the warning more useful. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Suppress unnecessary inode_i_sectors_wrong fsck errorNikita Ofitserov
It is possible that fsck first miscounted the expected sector count (due to applying other fixes at the same time, for example) and then corrected itself using extents. No need to log an fsck error and write the inode in this case. Signed-off-by: Nikita Ofitserov <himikof@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Add missing bch2_log_msg_start()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Faster checking for missing journal entriesKent Overstreet
Previously, we would do a linear search over journal entry gaps, checking for entries that aren't actually missing beacuse the sequence number was blacklisted. But multi device filesystems can have massive gaps in journal entry sequence numbers, and the linear search then becomes quite painful. Fix this with two new helpers for iterating over the blacklist table: bch2_journal_seq_next_blacklisted() bch2_journal_seq_next_nonblacklisted() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: bch2_journal_entry_missing_range()Kent Overstreet
Factor out a small common helper for userspace (bcachefs list_journal) to use as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Don't lock exec_update_lockAlan Huang
exec_update_lock is used to check permissions, no need here. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: ptr_to_removed_deviceKent Overstreet
Differentiate between pointers to invalid devices and pointers to removed devices in log messages and superblock error counters. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: bch_fs.devs_removedKent Overstreet
Add a bitmask of device slots that have been marked as removed: this will be used in the next patch for differentiating, in error messages and counters, between references to invalid devices and references to removed devices. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: check_key_has_inode() reconstructs more aggressivelyKent Overstreet
For regular files: reconstruct if more than three extents are found found For directories: reconstruct if a single dirent is found. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Before removing dangling dirents, check for contentsKent Overstreet
If we find a dirent pointing to a missing inode, check for dirents/extents assocatiated with that inode number: if present, reconstruct the inode insead of deleting the dirent. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Improve inode_create behaviour on old filesystemsKent Overstreet
On filesystems that predate persistent cursors, inode_generation keys will be present (no longer needed because the cursor tracks the current generation). But the new inode allocation code skipped allocating slots with inode_generation keys, which led to a lot of unnecessary scanning, which this patch fixes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Use bio_add_folio_nofail() for unfailable operationsYouling Tang
Use bio_add_folio_nofail() to replace the unfailable bio_add_folio() operation. Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Simplify bch2_bio_map()Youling Tang
For the part of directly mapping the kernel virtual address, there is no need to increase to bio page-by-page. It can be directly replaced by bio_add_virt_nofail(). For the address part of the vmalloc region, its physical address is discontinuous and needs to be increased page-by-page to bio. The helper function bio_add_vmalloc() can be used to simplify the implementation of bch2_bio_map(). Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Finish error_throw tracepointsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Improved btree node tracepointsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: -o fix_errors may now be used without -o fsckKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Shut up clang warningAlan Huang
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506261026.ZxtJ7yeV-lkp@intel.com/ Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Refactor trans->mem allocationAlan Huang
Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Evict/bypass key cache when deletingKent Overstreet
Fix a performance bug when doing many unlinks. The btree has optimizations to ensure we don't have too many whiteouts to scan in peek() before we find a real key to return, but unflushed key cache deletions break this. To fix this, tweak the existing code for redirecting updates that create a key to the underlying btree so that we can use it for deletions as well. Reported-by: John Schoenick <johns@valvesoftware.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Don't peek key cache unless we have a real keyKent Overstreet
We require that if a key exists in the key cache it also be present in the underlying btree, for cache coherency reasons. So checking the key cache on whiteout is unnecessary. This is part of fixing a major performance bug when doing many unlinks all in a row - we end up scanning through a ton of key cache whiteouts before peek() can return a real key. Reported-by: John Schoenick <johns@valvesoftware.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Improve inode deletionKent Overstreet
Don't delete dirents or extents if it's the wrong type of inode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: bch2_trans_has_updates()Kent Overstreet
new helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Don't memcpy more than neededAlan Huang
buf->u64s is all we used. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Don't log error twice in allocator async repairKent Overstreet
Add a new fsck flag, FSCK_ERR_SILENT, to suppress logging the error in dmesg. Use this for allocator async repair. Also, make sure that we _do_ still log silent error correction in the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Plumb trans_kmalloc ip to trans_log_msgKent Overstreet
the 'ip' parameter to bch2_trans_kmalloc() is used for bump allocator tracing: when we exceed the bump allocator limit, it dumps a list of allocations and what function did them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: add an unlikely() to trans_begin()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: More errcode conversionsKent Overstreet
Convert standard errcodes to private error codes, and return them with bch_err_throw(), for better debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: DEFINE_CLASS()es for dev refcountsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: use scoped_guard() in fast_list.cKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Allow CONFIG_UNICODE=mKent Overstreet
Fix the CONFIG_UNICODE checks - IS_ENABLED() is required when unicode is bulit as a module. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Reduce __bch2_btree_node_alloc() stack usageKent Overstreet
Kill some temporaries on the stack that were completely unnecessary. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: kill darray_u32_has()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: fsck: dir_loop, subvol_loop now autofixKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: Fix bch2_btree_transactions_read() synchronizationKent Overstreet
Since we're accessing btree_trans objects owned by another thread, we need to guard against using pointers to freed key cache entries: we need our own srcu read lock, and we should skip a btree_trans if it didn't hold the srcu lock (and thus it might have pointers to freed key cache entries). 00693 Mem abort info: 00693 ESR = 0x0000000096000005 00693 EC = 0x25: DABT (current EL), IL = 32 bits 00693 SET = 0, FnV = 0 00693 EA = 0, S1PTW = 0 00693 FSC = 0x05: level 1 translation fault 00693 Data abort info: 00693 ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 00693 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 00693 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 00693 user pgtable: 4k pages, 39-bit VAs, pgdp=000000012e650000 00693 [000000008fb96218] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 00693 Internal error: Oops: 0000000096000005 [#1] SMP 00693 Modules linked in: 00693 CPU: 0 UID: 0 PID: 4307 Comm: cat Not tainted 6.16.0-rc2-ktest-g9e15af94fd86 #27578 NONE 00693 Hardware name: linux,dummy-virt (DT) 00693 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00693 pc : six_lock_counts+0x20/0xe8 00693 lr : bch2_btree_bkey_cached_common_to_text+0x38/0x130 00693 sp : ffffff80ca98bb60 00693 x29: ffffff80ca98bb60 x28: 000000008fb96200 x27: 0000000000000007 00693 x26: ffffff80eafd06b8 x25: 0000000000000000 x24: ffffffc080d75a60 00693 x23: ffffff80eafd0000 x22: ffffffc080bdfcc0 x21: ffffff80eafd0210 00693 x20: ffffff80c192ff08 x19: 000000008fb96200 x18: 00000000ffffffff 00693 x17: 0000000000000000 x16: 0000000000000000 x15: 00000000ffffffff 00693 x14: 0000000000000000 x13: ffffff80ceb5a29a x12: 20796220646c6568 00693 x11: 72205d3e303c5b20 x10: 0000000000000020 x9 : ffffffc0805fb6b0 00693 x8 : 0000000000000020 x7 : 0000000000000000 x6 : 0000000000000020 00693 x5 : ffffff80ceb5a29c x4 : 0000000000000001 x3 : 000000000000029c 00693 x2 : 0000000000000000 x1 : ffffff80ef66c000 x0 : 000000008fb96200 00693 Call trace: 00693 six_lock_counts+0x20/0xe8 (P) 00693 bch2_btree_bkey_cached_common_to_text+0x38/0x130 00693 bch2_btree_trans_to_text+0x260/0x2a8 00693 bch2_btree_transactions_read+0xac/0x1e8 00693 full_proxy_read+0x74/0xd8 00693 vfs_read+0x90/0x300 00693 ksys_read+0x6c/0x108 00693 __arm64_sys_read+0x20/0x30 00693 invoke_syscall.constprop.0+0x54/0xe8 00693 do_el0_svc+0x44/0xc8 00693 el0_svc+0x18/0x58 00693 el0t_64_sync_handler+0x104/0x130 00693 el0t_64_sync+0x154/0x158 00693 Code: 910003fd f9423c22 f90017e2 d2800002 (f9400c01) 00693 ---[ end trace 0000000000000000 ]--- Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: btree read retry fixesKent Overstreet
Fix btree node read retries after validate errors: __btree_err() is the wrong place to flag a topology error: that is done by btree_lost_data(). Additionally, some calls to bch2_bkey_pick_read_device() were not updated in the 6.16 rework for improved log messages; we were failing to signal that we still had a retry. Cc: Nikita Ofitserov <himikof@gmail.com> Cc: Alan Huang <mmpgouride@gmail.com> Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: btree node scan no longer uses btree cacheKent Overstreet
Previously, btree node scan used the btree node cache to check if btree nodes were readable, but this is subject to interference from threads scanning different devices trying to read the same node - and more critically, nodes that we already attempted and failed to read before kicking off scan. Instead, we now allocate a 'struct btree' that does not live in the btree node cache, and call bch2_btree_node_read_done() directly. Cc: Nikita Ofitserov <himikof@gmail.com> Reviewed-by: Nikita Ofitserov <himikof@gmail.com> Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>