Age | Commit message | Author
2025-04-30 | pop assert when freeing page after bcachefs shutdown [bcachefs-put-folio-assert] | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-30 | bcachefs: Avoid -Wflex-array-member-not-at-end warnings | Gustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally.

Refactor a couple of structs that contain flexible arrays in the middle by replacing them with unions.

So, with these changes, fix the following warnings:

fs/bcachefs/disk_accounting.c:429:51: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
fs/bcachefs/ec_types.h:8:41: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
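The kind of refactor this commit message describes can be sketched in userspace C. This is only an illustration under stated assumptions, not the code from the patch: `struct hdr`, `union padded_after`, `padded_fill()` and the 16-byte payload size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type that ends in a flexible array member. */
struct hdr {
	uint8_t	nr;
	uint8_t	data[];		/* flexible array member */
};

/*
 * Before (triggers -Wflex-array-member-not-at-end):
 *
 *	struct padded_before {
 *		struct hdr	h;		// flexible array in the middle
 *		uint8_t		pad[16];	// storage meant to back h.data[]
 *	};
 *
 * After: overlay the header and its backing storage in a union, so the
 * struct with the flexible array is no longer followed by another member.
 */
union padded_after {
	struct hdr	h;
	uint8_t		bytes[sizeof(struct hdr) + 16];
};

/* Store nr entries through the flexible array; they land in bytes[]. */
static void padded_fill(union padded_after *p, uint8_t nr)
{
	assert(nr <= 16);
	p->h.nr = nr;
	for (uint8_t i = 0; i < nr; i++)
		p->h.data[i] = i;
}
```

The union keeps the same storage guarantee as the padded struct while leaving the flexible array as the last member of the type that contains it.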
2025-04-30 | bcachefs: lock_graph refactoring | Kent Overstreet
Prep work for the next patch, moving the lock_graph to btree_trans, and off the stack. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-29 | fixup! bcachefs: trace bch2_trans_kmalloc() | Kent Overstreet
2025-04-29 | fixup! bcachefs: Make various async objs visible in debugfs | Kent Overstreet
2025-04-29 | bcachefs: sysfs trigger_recalc_capacity | Kent Overstreet
For bug diagnosis Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-29 | bcachefs: readdir fixes | Kent Overstreet
- Don't call bch2_trans_relock() after dir_emit(); taking a transaction restart here will cause us to emit the same dirent to userspace twice
- Fix incorrect checking of the return value on dir_emit(): "true" means success, keep going, but bch2_dir_emit() needs to return true when we're finished iterating.
https://github.com/koverstreet/bcachefs/issues/867
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
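The two return-value conventions in the second fix are easy to mix up, so here is a small userspace sketch of them. All names (`ctx_sketch`, `emit_sketch()`, `emit_done_sketch()`, `readdir_sketch()`) are hypothetical stand-ins, not the kernel or bcachefs functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dir_context backed by a fixed-size buffer. */
struct ctx_sketch {
	size_t	pos;		/* next directory position to emit */
	size_t	space;		/* remaining buffer slots */
	size_t	emitted;	/* total entries handed to "userspace" */
};

/* Same convention as dir_emit(): true means emitted, keep going. */
static bool emit_sketch(struct ctx_sketch *ctx)
{
	if (!ctx->space)
		return false;
	ctx->space--;
	ctx->emitted++;
	return true;
}

/*
 * Wrapper with the inverted convention from the commit message: returns
 * true when iteration should stop. The position only advances after a
 * successful emit, so no entry is emitted twice or skipped.
 */
static bool emit_done_sketch(struct ctx_sketch *ctx)
{
	bool ok = emit_sketch(ctx);

	if (ok)
		ctx->pos++;
	return !ok;
}

static void readdir_sketch(struct ctx_sketch *ctx, size_t nr_dirents)
{
	while (ctx->pos < nr_dirents)
		if (emit_done_sketch(ctx))
			break;
}
```

Calling `readdir_sketch()` again with a refilled buffer resumes at `ctx->pos`, which is the property the first fix protects: nothing between the emit and the position update may restart the loop.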
2025-04-29 | bcachefs: Don't emit bch_sb_field_members_v1 if not required | Kent Overstreet
In 'bcachefs_metadata_extent_flags', we stopped requiring members_v1 to be present - only that either v1 or v2 is present. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: improve missing journal write device error message | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: Read retries are after checksum errors now REQ_FUA | Kent Overstreet
REQ_FUA means "skip the drive cache", and it can be used with reads too. If there was a checksum error, we want to retry the whole read path, not read it from cache again. Suggested-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: read_fua_test | Kent Overstreet
Add a sysfs attribute for checking whether read fua appears to behave properly on a device. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | block: Allow REQ_FUA|REQ_READ | Kent Overstreet
FUA is also allowed with reads, not just writes. The specified behaviour is:
- If the location being read from in the drive cache is dirty, it's flushed
- Read is serviced from media, not cache

It's documented in the NVME specification, and the nvme driver already passes through REQ_FUA for reads, not just writes, so there's no reason for the block layer to be disallowing it.

To validate behaviour, a simple test was run on a variety of hardware that checks latency of repeated reads to the same location (cached reads), random reads (uncached), and FUA reads to the same location.

Data:
- Samsung consumer SSDs
  Reads appear to not be cached
- Seagate SCSI hard drives (ST20000NM002D)
  Reads are cached, and FUA reads appear to work correctly

Link: https://lore.kernel.org/linux-block/20250311133517.3095878-1-kent.overstreet@linux.dev/
Link: https://lore.kernel.org/linux-bcachefs/26585.34711.506258.318405@quad.stoffel.home/T/#m5fffbc0e1c68cf0479c94b9f4ac1bef297333782
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: shrinker.to_text() methods | Kent Overstreet
This adds shrinker.to_text() methods for our shrinkers and hooks them up to our existing to_text() functions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | mm: shrinker: Add shrinker_to_text() to debugfs interface | Kent Overstreet
Previously, we added shrinker_to_text() and hooked it up to the OOM report - now, the same report is available via debugfs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | mm: Centralize & improve oom reporting in show_mem.c | Kent Overstreet
This patch:
- Changes show_mem() to always report on slab usage
- Instead of reporting on all slabs, we only report on top 10 slabs, and in sorted order
- Also reports on shrinkers, with the new shrinkers_to_text(). Shrinkers need to be included in OOM/allocation failure reporting because they're responsible for memory reclaim - if a shrinker isn't giving up its memory, we need to know which one and why.

More OOM reporting can be moved to show_mem.c and improved, this patch is only a start.

New example output on OOM/memory allocation failure:

00177 Mem-Info:
00177 active_anon:13706 inactive_anon:32266 isolated_anon:16
00177 active_file:1653 inactive_file:1822 isolated_file:0
00177 unevictable:0 dirty:0 writeback:0
00177 slab_reclaimable:6242 slab_unreclaimable:11168
00177 mapped:3824 shmem:3 pagetables:1266 bounce:0
00177 kernel_misc_reclaimable:0
00177 free:4362 free_pcp:35 free_cma:0
00177 Node 0 active_anon:54824kB inactive_anon:129064kB active_file:6612kB inactive_file:7288kB unevictable:0kB isolated(anon):64kB isolated(file):0kB mapped:15296kB dirty:0kB writeback:0kB shmem:12kB writeback_tmp:0kB kernel_stack:3392kB pagetables:5064kB all_unreclaimable? no
00177 DMA free:2232kB boost:0kB min:88kB low:108kB high:128kB reserved_highatomic:0KB active_anon:2924kB inactive_anon:6596kB active_file:428kB inactive_file:384kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 426 426 426
00177 DMA32 free:15092kB boost:5836kB min:8432kB low:9080kB high:9728kB reserved_highatomic:0KB active_anon:52196kB inactive_anon:122392kB active_file:6176kB inactive_file:7068kB unevictable:0kB writepending:0kB present:507760kB managed:441816kB mlocked:0kB bounce:0kB free_pcp:72kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 0 0 0
00177 DMA: 284*4kB (UM) 53*8kB (UM) 21*16kB (U) 11*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2248kB
00177 DMA32: 2765*4kB (UME) 375*8kB (UME) 57*16kB (UM) 5*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15132kB
00177 4656 total pagecache pages
00177 1031 pages in swap cache
00177 Swap cache stats: add 6572399, delete 6572173, find 488603/3286476
00177 Free swap = 509112kB
00177 Total swap = 2097148kB
00177 130938 pages RAM
00177 0 pages HighMem/MovableOnly
00177 16644 pages reserved
00177 Unreclaimable slab info:
00177 9p-fcall-cache total: 8.25 MiB active: 8.25 MiB
00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
00177 kmalloc-64 total: 2.08 MiB active: 2.07 MiB
00177 task_struct total: 1.95 MiB active: 1.95 MiB
00177 kmalloc-4k total: 1.50 MiB active: 1.50 MiB
00177 signal_cache total: 1.34 MiB active: 1.34 MiB
00177 kmalloc-2k total: 1.16 MiB active: 1.16 MiB
00177 bch_inode_info total: 1.02 MiB active: 922 KiB
00177 perf_event total: 1.02 MiB active: 1.02 MiB
00177 biovec-max total: 992 KiB active: 960 KiB
00177 Shrinkers:
00177 super_cache_scan: objects: 127
00177 super_cache_scan: objects: 106
00177 jbd2_journal_shrink_scan: objects: 32
00177 ext4_es_scan: objects: 32
00177 bch2_btree_cache_scan: objects: 8
00177   nr nodes: 24
00177   nr dirty: 0
00177   cannibalize lock: 0000000000000000
00177
00177 super_cache_scan: objects: 8
00177 super_cache_scan: objects: 1

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
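The "top 10 slabs, in sorted order" part of this report can be sketched in userspace C. This is an assumption-laden illustration, not the show_mem.c code: `slab_sketch`, `slab_cmp_desc()` and `report_top_slabs()` are hypothetical, and sorting the whole array with `qsort()` is a simplification chosen for brevity.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-cache summary: name and total size in bytes. */
struct slab_sketch {
	const char	*name;
	unsigned long	bytes;
};

/* qsort() comparator: largest cache first. */
static int slab_cmp_desc(const void *l, const void *r)
{
	const struct slab_sketch *a = l, *b = r;

	return (a->bytes < b->bytes) - (a->bytes > b->bytes);
}

/*
 * Sort the caches in place by size and print at most @top of them.
 * Returns the number of lines printed.
 */
static size_t report_top_slabs(struct slab_sketch *s, size_t nr, size_t top)
{
	size_t n = nr < top ? nr : top;

	assert(s || !nr);
	if (nr)
		qsort(s, nr, sizeof(*s), slab_cmp_desc);

	for (size_t i = 0; i < n; i++)
		printf("%-20s total: %lu KiB\n", s[i].name, s[i].bytes / 1024);
	return n;
}
```

Capping the report keeps the log readable during an allocation failure, when the largest few caches are usually the ones that matter.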
2025-04-28 | mm: shrinker: Add new stats for .to_text() | Kent Overstreet
Add a few new shrinker stats.

number of objects requested to free, number of objects freed:
  Shrinkers won't necessarily free all objects requested for a variety of reasons, but if the two counts are wildly different something is likely amiss.

.scan_objects runtime:
  If one shrinker is taking an excessive amount of time to free objects that will block kswapd from running other shrinkers.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | mm: shrinker: Add a .to_text() method for shrinkers | Kent Overstreet
This adds a new callback method to shrinkers which they can use to describe anything relevant to memory reclaim about their internal state, for example object dirtiness.

This patch also adds shrinkers_to_text(), which reports on the top 10 shrinkers - by object count - in sorted order, to be used in OOM reporting.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: linux-mm@kvack.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

From david@fromorbit.com Tue Aug 27 23:32:26 2024
> > +	if (!mutex_trylock(&shrinker_mutex)) {
> > +		seq_buf_puts(out, "(couldn't take shrinker lock)");
> > +		return;
> > +	}
>
> Please don't use the shrinker_mutex like this. There can be tens of
> thousands of entries in the shrinker list (because memcgs) and
> holding the shrinker_mutex for long running traversals like this is
> known to cause latency problems for memcg reaping. If we are at
> ENOMEM, the last thing we want to be doing is preventing memcgs from
> being reaped.
>
> > +	list_for_each_entry(shrinker, &shrinker_list, list) {
> > +		struct shrink_control sc = { .gfp_mask = GFP_KERNEL, };
>
> This iteration and counting setup is neither node nor memcg aware.
> For node aware shrinkers, this will only count the items freeable
> on node 0, and ignore all the other memory in the system. For memcg
> systems, it will also only scan the root memcg and so miss counting
> any memory in memcg owned caches.
>
> IOWs, the shrinker iteration mechanism needs to iterate both by NUMA
> node and by memcg. On large machines with multiple nodes and hosting
> thousands of memcgs, a total shrinker state iteration has to walk
> a -lot- of structures.
>
> An example of this is drop_slab() - called from
> /proc/sys/vm/drop_caches(). It does this to iterate all the
> shrinkers for all the nodes and memcgs in the system:
>
> static unsigned long drop_slab_node(int nid)
> {
> 	unsigned long freed = 0;
> 	struct mem_cgroup *memcg = NULL;
>
> 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
> 	do {
> 		freed += shrink_slab(GFP_KERNEL, nid, memcg, 0);
> 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
>
> 	return freed;
> }
>
> void drop_slab(void)
> {
> 	int nid;
> 	int shift = 0;
> 	unsigned long freed;
>
> 	do {
> 		freed = 0;
> 		for_each_online_node(nid) {
> 			if (fatal_signal_pending(current))
> 				return;
>
> 			freed += drop_slab_node(nid);
> 		}
> 	} while ((freed >> shift++) > 1);
> }
>
> Hence any iteration for finding the 10 largest shrinkable caches in
> the system needs to do something similar. Only, it needs to iterate
> memcgs first and then aggregate object counts across all nodes for
> shrinkers that are NUMA aware.
>
> Because it needs direct access to the shrinkers, it will need to use
> the RCU lock + refcount method of traversal because that's the only
> safe way to go from memcg to shrinker instance. IOWs, it
> needs to mirror the code in shrink_slab/shrink_slab_memcg to obtain
> a safe reference to the relevant shrinker so it can call
> ->count_objects() and store a refcounted pointer to the shrinker(s)
> that will get printed out after the scan is done....
>
> Once the shrinker iteration is sorted out, I'll look further at the
> rest of the code in this patch...
>
> -Dave.
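To make the callback idea concrete, here is a userspace sketch of a shrinker-like object with an optional `to_text` hook. Everything in it is hypothetical (`shrinker_sketch`, `cache_sketch`, the output format), and it deliberately ignores the NUMA-node and memcg iteration that the review above asks for.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical shrinker with an optional state-description callback. */
struct shrinker_sketch {
	const char	*name;
	unsigned long	(*count_objects)(const struct shrinker_sketch *);
	/* Optional; returns bytes written, not counting the NUL. */
	size_t		(*to_text)(const struct shrinker_sketch *,
				   char *buf, size_t len);
	void		*private;
};

/* Clamp an snprintf() result to what actually fits in the buffer. */
static size_t clamp_written(int n, size_t len)
{
	if (n < 0)
		return 0;
	return (size_t) n < len ? (size_t) n : len - 1;
}

/* Generic report: object count first, then whatever the shrinker adds. */
static size_t shrinker_to_text_sketch(const struct shrinker_sketch *s,
				      char *buf, size_t len)
{
	size_t used;

	assert(buf && len);
	used = clamp_written(snprintf(buf, len, "%s: objects: %lu\n",
				      s->name, s->count_objects(s)), len);
	if (s->to_text && used < len - 1)
		used += s->to_text(s, buf + used, len - used);
	return used;
}

/* Example shrinker: a cache that reports how many nodes are dirty. */
struct cache_sketch {
	unsigned long	nr_nodes, nr_dirty;
};

static unsigned long cache_count(const struct shrinker_sketch *s)
{
	const struct cache_sketch *c = s->private;

	return c->nr_nodes - c->nr_dirty;
}

static size_t cache_to_text(const struct shrinker_sketch *s,
			    char *buf, size_t len)
{
	const struct cache_sketch *c = s->private;

	return clamp_written(snprintf(buf, len,
				      "  nr nodes: %lu\n  nr dirty: %lu\n",
				      c->nr_nodes, c->nr_dirty), len);
}
```

The generic reporter never needs to know what "dirty" means for a given cache; each shrinker describes its own reclaim-relevant state.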
2025-04-28 | seq_buf: seq_buf_human_readable_u64() | Kent Overstreet
This adds a seq_buf wrapper for string_get_size(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
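For readers unfamiliar with this kind of helper, the following userspace sketch shows the general idea of formatting a u64 byte count with binary units. It is not the kernel's string_get_size() and its output format differs in detail; `human_readable_u64_sketch()` is a hypothetical name, and it truncates to two decimals rather than rounding.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format @v as a size with binary (1024-based) units into @buf.
 * Returns what snprintf() returns.
 */
static int human_readable_u64_sketch(char *buf, size_t len, uint64_t v)
{
	static const char * const units[] = {
		"B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB",
	};
	const unsigned nr_units = sizeof(units) / sizeof(units[0]);
	unsigned u = 0;
	uint64_t rem = 0;

	assert(buf && len);

	/* Divide down until the value fits the unit, keeping the remainder. */
	while (v >= 1024 && u + 1 < nr_units) {
		rem = v % 1024;
		v /= 1024;
		u++;
	}

	if (!u)
		return snprintf(buf, len, "%llu B", (unsigned long long) v);

	/* Two decimal places derived from the last remainder, truncated. */
	return snprintf(buf, len, "%llu.%02llu %s",
			(unsigned long long) v,
			(unsigned long long) (rem * 100 / 1024),
			units[u]);
}
```

With this sketch, 8650752 bytes formats as "8.25 MiB", the same style of value that appears in the slab report of the OOM output.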
2025-04-28 | bcachefs: bch2_dev_add() can run on a non-started fs | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_fs_open() now takes a darray | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_trans_update_ip() | Kent Overstreet
Allow btree_insert_entry.ip_allocated to be passed in, so we get better info on where alloc updates are coming from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: Run most explicit recovery passes persistent | Kent Overstreet
If we detect an error that requires running a recovery pass, and we're not in recovery, we won't be able to fix it until the next mount - make sure we're noting in the superblock that it needs to run. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: provide unlocked version of run_explicit_recovery_pass_persistent | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_dirent_to_text() shows casefolded dirents | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: Single err message for btree node reads | Kent Overstreet
Like we just did with the data read path, emit a single error message per btree node read, nicely formatted, with all the actions we took grouped together. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_mark_btree_validate_failure() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_fsck_err_opt() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: Plumb printbuf through bch2_btree_lost_data() | Kent Overstreet
Part of the ongoing project to improve error messages by building them up in printbufs and emitting them all at once, so that we can easily see what events are related in the log. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: kill bch2_run_explicit_recovery_pass_persistent() | Kent Overstreet
No longer has users, so we can kill it and rename bch2_run_explicit_recovery_pass_persistent_locked(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: Remove redundant calls to btree_lost_data() | Kent Overstreet
The btree node read path calls this before returning the read error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_btree_lost_data() now handles snapshots tree | Kent Overstreet
We have a consolidated place for "this btree lost data, run this repair", so use it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: Kill redundant error message in topology repair | Kent Overstreet
The btree node read path already logs btree node read errors, this isn't needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: Emit a single log message on data read error | Kent Overstreet
Instead of emitting a message immediately when we get an error in the read path, and then another at the end if we successfully retry - emit one single log message before returning from bch2_rbio_retry(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_io_failures_to_text() | Kent Overstreet
Pretty printer for bch_io_failures, to be used for better read error messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: print_string_as_lines: avoid printing empty line | Kent Overstreet
If the final line in the message to be printed is blank, don't print it. This happens with indented printbufs - after a newline we emit spaces up to the indent level. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
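The behaviour described here can be illustrated with a small userspace sketch. `print_lines_sketch()` is a hypothetical function, not the bcachefs print_string_as_lines(); it treats a final line made up only of spaces as indentation residue and drops it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Print @s one line at a time to @out, skipping a final line that is
 * empty or contains only spaces. Returns the number of lines printed.
 */
static unsigned print_lines_sketch(const char *s, FILE *out)
{
	unsigned printed = 0;

	assert(s && out);

	while (*s) {
		size_t n = strcspn(s, "\n");	/* length of this line */
		bool last = s[n] == '\0';	/* no newline follows */

		/* Trailing indent after the last newline: don't print it. */
		if (last && strspn(s, " ") == n)
			break;

		fprintf(out, "%.*s\n", (int) n, s);
		printed++;

		s += n;
		if (*s == '\n')
			s++;
	}
	return printed;
}
```

For an input such as "a\n  b\n  ", the sketch prints two lines and drops the trailing two-space line that an indented buffer leaves behind after its last newline. Blank lines in the middle of the message are still printed.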
2025-04-28 | bcachefs: Make various async objs visible in debugfs | Kent Overstreet
Add async objs list for
- promote_op
- bch_read_bio
- btree_read_bio
- btree_write_bio

This gets us introspection on in-flight async ops, and because under the hood it uses fast_lists (percpu slot buffer on top of a radix tree), it'll be fast enough to enable in production.

This will be very helpful for debugging "something got stuck" issues, which have been cropping up from time to time (in the CI, especially with folio writeback).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: Async object debugging | Kent Overstreet
Debugging infrastructure for async objs: this lets us easily create fast_lists for various object types so they'll be visible in debugfs. Add new object types to the BCH_ASYNC_OBJS_TYPES() enum, and drop a pretty-printer wrapper in async_objs.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: fast_list | Kent Overstreet
A fast "list" data structure, which is actually a radix tree, with an IDA for slot allocation and a percpu buffer on top of that. Items cannot be added or moved to the head or tail, only added at some (arbitrary) position and removed. The advantage is that adding, removing and iterating are generally lockless, only hitting the lock in ida when the percpu buffer is full or empty. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
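The interface shape described here (add at an arbitrary slot, remove by slot, iterate over occupied slots) can be sketched in single-threaded userspace C. This is only a model under strong assumptions: a fixed array stands in for the radix tree, a stack of free indices stands in for the IDA plus percpu buffer, and there is no concurrency at all. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FAST_LIST_SKETCH_SLOTS	8

struct fast_list_sketch {
	void	*slots[FAST_LIST_SKETCH_SLOTS];	/* stands in for the radix tree */
	int	free[FAST_LIST_SKETCH_SLOTS];	/* stack of free slot indices */
	int	nr_free;
};

static void fast_list_sketch_init(struct fast_list_sketch *l)
{
	for (int i = 0; i < FAST_LIST_SKETCH_SLOTS; i++) {
		l->slots[i] = NULL;
		/* Push indices so that slot 0 is handed out first. */
		l->free[i] = FAST_LIST_SKETCH_SLOTS - 1 - i;
	}
	l->nr_free = FAST_LIST_SKETCH_SLOTS;
}

/* Add @item at some free slot; returns the slot index, or -1 if full. */
static int fast_list_sketch_add(struct fast_list_sketch *l, void *item)
{
	int idx;

	assert(item);
	if (!l->nr_free)
		return -1;
	idx = l->free[--l->nr_free];
	l->slots[idx] = item;
	return idx;
}

/* Remove the item at @idx and make the slot available again. */
static void fast_list_sketch_remove(struct fast_list_sketch *l, int idx)
{
	assert(idx >= 0 && idx < FAST_LIST_SKETCH_SLOTS && l->slots[idx]);
	l->slots[idx] = NULL;
	l->free[l->nr_free++] = idx;
}

/* Iteration visits occupied slots in index order, not insertion order. */
static size_t fast_list_sketch_count(const struct fast_list_sketch *l)
{
	size_t nr = 0;

	for (int i = 0; i < FAST_LIST_SKETCH_SLOTS; i++)
		nr += l->slots[i] != NULL;
	return nr;
}
```

Because callers keep the slot index returned by add, removal needs no search, which is what makes giving up head/tail ordering worthwhile.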
2025-04-28 | bcachefs: bch2_read_bio_to_text | Kent Overstreet
Pretty printer for struct bch_read_bio. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_bio_to_text() | Kent Overstreet
Pretty printer for struct bio, to be used for async object debugging. This is pretty minimal, we'll add more to it as we discover what we need. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch_dev.io_ref -> enumerated_ref | Kent Overstreet
Convert device IO refs to enumerated_refs, for easier debugging of refcount issues. Simple conversion: enumerate all users and convert to the new helpers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch_fs.writes -> enumerated_refs | Kent Overstreet
Drop the single-purpose write ref code in bcachefs.h, and convert to enumerated refs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: enumerated_ref.c | Kent Overstreet
Factor out the debug code for rw filesystem refs into a small library.

In release mode an enumerated ref is a normal percpu refcount, but in debug mode all enumerated users of the ref get their own atomic_long_t ref - making it much easier to chase down refcount usage bugs for when a refcount has many users.

For debugging, we have enumerated_ref_to_text(), which prints the current value of each different user.

Additionally, in debug mode enumerated_ref_stop() has a 10 second timeout, after which it will dump outstanding refcounts.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
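The debug-mode idea (one counter per enumerated user instead of a single shared count) can be sketched with C11 atomics in userspace. The user names, the `_sketch` functions and the output format are all hypothetical; the release-mode percpu refcount and the stop timeout are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical set of ref users; real code would enumerate its own. */
enum ref_user_sketch {
	REF_USER_read,
	REF_USER_write,
	REF_USER_journal,
	REF_USER_NR,
};

static const char * const ref_user_names_sketch[] = {
	"read", "write", "journal",
};

/* Debug-mode layout: every user gets its own counter. */
struct enumerated_ref_sketch {
	atomic_long	refs[REF_USER_NR];
};

static void enumerated_ref_get_sketch(struct enumerated_ref_sketch *r,
				      enum ref_user_sketch u)
{
	atomic_fetch_add(&r->refs[u], 1);
}

static void enumerated_ref_put_sketch(struct enumerated_ref_sketch *r,
				      enum ref_user_sketch u)
{
	long old = atomic_fetch_sub(&r->refs[u], 1);

	/* A put without a matching get by the same user is a bug. */
	assert(old > 0);
	(void) old;
}

/* Print one "user: count" line per user; returns bytes written. */
static size_t enumerated_ref_to_text_sketch(struct enumerated_ref_sketch *r,
					    char *buf, size_t len)
{
	size_t used = 0;

	assert(buf && len);
	buf[0] = '\0';

	for (int i = 0; i < REF_USER_NR && used < len - 1; i++) {
		int n = snprintf(buf + used, len - used, "%s: %ld\n",
				 ref_user_names_sketch[i],
				 atomic_load(&r->refs[i]));
		if (n < 0)
			break;
		used += (size_t) n < len - used ? (size_t) n : len - used - 1;
	}
	return used;
}
```

When a ref fails to drain, a dump like this points directly at the user that still holds it, rather than reporting only a nonzero total.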
2025-04-28 | bcachefs: for_each_rw_member_rcu() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: __bch2_fs_read_write() no longer depends on io_ref | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: for_each_online_member_rcu() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: recalc_capacity() no longer depends on io_ref | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_target_to_text() no longer depends on io_ref | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: bch2_check_rebalance_work() | Kent Overstreet
Add a pass for checking the rebalance_work btree. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-28 | bcachefs: Kill dead code | Alan Huang
Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>