Age | Commit message | Author
2023-11-12 | bcachefs: fix missing commit (bcachefs-tracepoints) | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-12 | bcachefs: trace_propagate_key_to_snapshot_leaves() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-11 | bcachefs: Improve transaction_commit trace event | Kent Overstreet
Now it includes the updates being done. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-11 | bcachefs: Improve trans_restart_too_many_iters() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-10 | bcachefs: Unwritten journal buffers are always dirty | Kent Overstreet
Ensure that journal bufs that haven't been written can't be reclaimed from the journal pin fifo, and can thus have new pins taken. Prep work for changing the btree write buffer to pull keys from the journal directly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-10 | bcachefs: Don't flush journal after replay | Kent Overstreet
The flush_all_pins() after journal replay was unnecessary, and trying to completely flush the journal while RW is not a great idea - it's not guaranteed to terminate if other threads keep adding things to the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-10 | bcachefs: Make journal replay more efficient | Kent Overstreet
Journal replay now first attempts to replay keys in sorted order, similar to how the btree write buffer flush path works. Any keys that can not be replayed due to journal deadlock are then left for later and replayed in journal order, unpinning journal entries as we go. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
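The two-pass replay described above can be sketched as follows. This is only an illustration under stated assumptions: `struct jkey`, `replay_keys()`, `replay_fn` and `demo_replay()` are hypothetical names, not bcachefs identifiers, and "would deadlock on the journal" is modelled as a callback that simply fails when it is not allowed to block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified journal key: not the bcachefs struct. */
struct jkey {
	uint64_t pos;	/* btree position the key sorts by */
	uint64_t seq;	/* journal sequence number it was logged at */
	bool	 done;	/* already replayed? */
};

static int cmp_pos(const void *l, const void *r)
{
	const struct jkey *a = l, *b = r;
	return (a->pos > b->pos) - (a->pos < b->pos);
}

static int cmp_seq(const void *l, const void *r)
{
	const struct jkey *a = l, *b = r;
	return (a->seq > b->seq) - (a->seq < b->seq);
}

/* Returns true if the key was replayed; may fail when !may_block. */
typedef bool (*replay_fn)(const struct jkey *k, bool may_block);

/* Returns the number of keys that had to be deferred to the second pass. */
static size_t replay_keys(struct jkey *keys, size_t nr, replay_fn replay)
{
	size_t i, deferred = 0;

	assert(keys || !nr);

	/* Pass 1: btree (sorted) order, never blocking on the journal. */
	qsort(keys, nr, sizeof(*keys), cmp_pos);
	for (i = 0; i < nr; i++) {
		keys[i].done = replay(&keys[i], false);
		deferred += !keys[i].done;
	}

	/* Pass 2: leftovers in journal order, blocking allowed. */
	qsort(keys, nr, sizeof(*keys), cmp_seq);
	for (i = 0; i < nr; i++)
		if (!keys[i].done)
			keys[i].done = replay(&keys[i], true);

	return deferred;
}

/* Demo callback (assumption): odd positions fail unless blocking is allowed. */
static bool demo_replay(const struct jkey *k, bool may_block)
{
	return may_block || !(k->pos & 1);
}
```

Replaying the leftovers in sequence order is what would let a real implementation unpin journal entries as it goes; the sketch does not model pins.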
2023-11-10 | bcachefs: Go rw before journal replay | Kent Overstreet
This gets us slightly nicer log messages. Also, this slightly clarifies synchronization of c->journal_keys; after we go RW it's in use by multiple threads (so that the btree iterator code can overlay keys from the journal); so it has to be prepped before that point. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-10 | bcachefs: Kill BTREE_UPDATE_PREJOURNAL | Kent Overstreet
With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can now switch the btree write buffer to use it for flushing. This has the advantage that transaction commits don't need to take a journal reservation at all. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-10 | bcachefs: BTREE_INSERT_JOURNAL_REPLAY now "don't init trans->journal_res" | Kent Overstreet
This slightly changes how trans->journal_res works, in preparation for changing the btree write buffer flush path to use it. Now, BTREE_INSERT_JOURNAL_REPLAY means "don't take a journal reservation; trans->journal_res.seq already refers to the journal sequence number to pin". Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-10 | bcachefs: Journal pins must always have a flush_fn | Kent Overstreet
flush_fn is how we identify journal pins in debugfs - this is a debugging aid. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-10 | bcachefs: track_event_change() | Kent Overstreet
This introduces a new helper for connecting time_stats to state changes, e.g. when taking journal reservations is blocked for some reason. We use this to track separately the different reasons the journal might be blocked - e.g. space in the journal full, or the journal pin fifo full. Also do some cleanup and improvements on the time stats code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-08 | bcachefs: Kill journal pre-reservations | Kent Overstreet
This deletes the complicated and somewhat expensive journal pre-reservation machinery in favor of just using journal watermarks: when the journal is more than half full, we run journal reclaim more aggressively, and when the journal is more than 3/4 full we only allow journal reclaim to get new journal reservations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
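A minimal sketch of the watermark policy described above, assuming journal fullness can be summarised as a used/total pair. All names here (`struct journal_space`, `WATERMARK_normal`, `WATERMARK_reclaim`, and the three helpers) are hypothetical and chosen for illustration; they are not taken from the bcachefs source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: these are not the bcachefs identifiers. */
enum watermark {
	WATERMARK_normal,	/* ordinary callers */
	WATERMARK_reclaim,	/* journal reclaim itself */
};

struct journal_space {
	uint64_t total;	/* journal capacity, in arbitrary units */
	uint64_t used;	/* units currently in use */
};

/* More than half full: reclaim should run more aggressively. */
static bool journal_reclaim_aggressive(const struct journal_space *j)
{
	assert(j->used <= j->total);
	return j->used * 2 > j->total;
}

/* More than 3/4 full: only reclaim may take new reservations. */
static enum watermark journal_min_watermark(const struct journal_space *j)
{
	assert(j->used <= j->total);
	return j->used * 4 > j->total * 3 ? WATERMARK_reclaim : WATERMARK_normal;
}

/* A caller gets a reservation only if it is at or above the current watermark. */
static bool journal_res_allowed(const struct journal_space *j, enum watermark caller)
{
	return caller >= journal_min_watermark(j);
}
```

The appeal of this shape is that the decision is a pure function of current fullness, with no separate pre-reservation state to keep in sync.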
2023-11-08 | bcachefs: Guard against insufficient devices to create stripes | Kent Overstreet
We can't create stripes if we don't have enough devices - this manifested as an integer underflow bug later. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-08 | bcachefs: Factor out darray resize slowpath | Kent Overstreet
Move the slowpath (actually growing the darray) to an out-of-line function; also, add some helpers for the upcoming btree write buffer rewrite. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
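The fast-path/slow-path split described above can be illustrated with a small dynamic array. This is a sketch under assumptions: `struct int_array` and its helpers are hypothetical stand-ins rather than the bcachefs darray API, and `__attribute__((noinline))` is the GCC/Clang spelling used here in place of the kernel's `noinline` macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a darray of ints. */
struct int_array {
	int	*data;
	size_t	 nr;	/* elements in use */
	size_t	 size;	/* elements allocated */
};

/* Slow path: actually grow the allocation. Kept out of line. */
static __attribute__((noinline)) int int_array_grow(struct int_array *a, size_t more)
{
	size_t new_size = a->size ? a->size : 8;
	int *p;

	while (new_size < a->nr + more)
		new_size *= 2;

	p = realloc(a->data, new_size * sizeof(*p));
	if (!p)
		return -1;

	a->data = p;
	a->size = new_size;
	return 0;
}

/* Fast path: a single comparison in the common case. */
static inline int int_array_make_room(struct int_array *a, size_t more)
{
	assert(a->nr <= a->size);
	return a->nr + more <= a->size ? 0 : int_array_grow(a, more);
}

static inline int int_array_push(struct int_array *a, int v)
{
	int ret = int_array_make_room(a, 1);

	if (!ret)
		a->data[a->nr++] = v;
	return ret;
}
```

Keeping the growth logic out of line means each inlined push site carries only the capacity check and the store.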
2023-11-08 | bcachefs: Include average write size in sysfs journal_debug | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-08 | bcachefs: Add an assertion in bch2_journal_pin_set() | Kent Overstreet
Previously, bch2_journal_pin_set() would silently ignore a request to pin a journal sequence number that was no longer dirty, because it was used internally by bch2_journal_pin_copy() which could race with the src pin being flushed. Split these apart so that we can properly assert that @seq is a currently dirty journal sequence number - this is almost always a bug. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-08 | bcachefs: Clear k->needs_whiteout earlier in commit path | Kent Overstreet
The upcoming btree write buffer rework is going to use the journal itself as the first stage of the write buffer; this is a cleanup to make sure k->needs_whiteout is initialized before keys hit the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-08 | bcachefs: Avoid dropping/retaking write locks in bch2_btree_write_buffer_flush_one() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-08 | bcachefs: Fix null ptr deref in bch2_backpointer_get_node() | Kent Overstreet
bch2_btree_iter_peek_node() can return a NULL ptr (when the tree is shorter than the search depth); handle this with an early return. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: https://lore.kernel.org/linux-bcachefs/5fc3c28b-c232-4ec7-b0ac-4ef220ddf976@moroto.mountain/T/
2023-11-08 | bcachefs: Check for nonce offset inconsistency in data_update path | Kent Overstreet
We've been seeing a rare nonce offset inconsistency that doesn't show up in tests: this adds some extra verification code to the data update path that prints out more relevant info when it occurs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-07 | bcachefs: Make sure to drop/retake btree locks before reclaim | Kent Overstreet
We really don't want to be invoking memory reclaim with btree locks held: even aside from (solvable, but tricky) recursion issues, it can cause painful-to-diagnose performance edge cases. This fixes a recently reported issue in btree_key_can_insert_cached(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reported-by: Mateusz Guzik <mjguzik@gmail.com> Fixes: https://lore.kernel.org/linux-bcachefs/CAGudoHEsb_hGRMeWeXh+UF6po0qQuuq_NKSEo+s1sEb6bDLjpA@mail.gmail.com/T/
2023-11-06 | bcachefs: btree_trans->write_locked | Kent Overstreet
As prep work for the next patch to fix a key cache reclaim issue, we need to start tracking whether we're currently holding write locks - so that we can release and retake them before calling into memory reclaim. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-06 | bcachefs: Run btree key cache shrinker less aggressively | Kent Overstreet
The btree key cache maintains lists of items that have been freed, but can't yet be reclaimed because a bch2_trans_relock() call might find them - we're waiting for SRCU readers to release. Previously, we wouldn't count these items against the number we're attempting to scan for, which would mean we'd evict more live key cache entries - doing quite a bit of potentially unnecessary work. With recent work to make sure we don't hold SRCU locks for too long, it should be safe to count all the items on the freelists against the number to scan - even if we can't reclaim them yet, we will be able to soon. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
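The accounting change described above comes down to a little arithmetic, sketched here under assumptions. `struct kc_counts` and `live_to_evict()` are hypothetical names used only for illustration; the real shrinker walks lists rather than working from two counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of key cache state; not a bcachefs struct. */
struct kc_counts {
	unsigned long nr_live;		/* live, evictable entries */
	unsigned long nr_freed_pending;	/* freed, waiting on SRCU readers */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * How many live entries a scan of nr_to_scan objects ends up evicting.
 * With count_pending set, freed-but-not-yet-reclaimable items use up
 * part of the scan budget first, so fewer live entries are evicted.
 */
static unsigned long live_to_evict(const struct kc_counts *c,
				   unsigned long nr_to_scan,
				   bool count_pending)
{
	unsigned long scanned;

	assert(c);
	scanned = count_pending ? min_ul(nr_to_scan, c->nr_freed_pending) : 0;
	return min_ul(nr_to_scan - scanned, c->nr_live);
}
```

With 100 live entries, 30 pending frees and a request to scan 50, the old accounting would evict 50 live entries and the new accounting 20.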
2023-11-06 | bcachefs: Fix multiple -Warray-bounds warnings | Gustavo A. R. Silva
Transform zero-length array `entries` into a proper flexible-array member in `struct journal_seq_blacklist_table`; and fix the following -Warray-bounds warnings:
fs/bcachefs/journal_seq_blacklist.c:148:26: warning: array subscript idx is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:150:30: warning: array subscript idx is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:154:27: warning: array subscript idx is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:176:27: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:177:27: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:297:34: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:298:34: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
fs/bcachefs/journal_seq_blacklist.c:300:31: warning: array subscript i is outside array bounds of 'struct journal_seq_blacklist_table_entry[0]' [-Warray-bounds=]
This results in no differences in binary output. This helps with the ongoing efforts to globally enable -Warray-bounds.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
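For readers unfamiliar with the distinction, here is a small userspace sketch of a C99 flexible-array member. `struct table`, `struct entry` and the helpers are simplified, hypothetical stand-ins, not the actual `struct journal_seq_blacklist_table` layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry type, for illustration only. */
struct entry {
	uint64_t start;
	uint64_t end;
};

struct table {
	size_t		nr;
	/*
	 * Flexible-array member: the compiler knows the array's length is
	 * determined at allocation time. The older GNU idiom, entries[0],
	 * declares a zero-length array, so any subscript looks out of
	 * bounds to -Warray-bounds.
	 */
	struct entry	entries[];
};

/* Allocate a table with room for nr entries, zero-initialised. */
static struct table *table_alloc(size_t nr)
{
	struct table *t = calloc(1, sizeof(*t) + nr * sizeof(t->entries[0]));

	if (t)
		t->nr = nr;
	return t;
}

/* Bounds-checked accessor. */
static struct entry *table_get(struct table *t, size_t i)
{
	assert(i < t->nr);
	return &t->entries[i];
}
```

In both idioms the array contributes no elements to `sizeof(struct table)`, which is consistent with the commit's note that the change makes no difference to the binary output.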
2023-11-06 | bcachefs: Use DECLARE_FLEX_ARRAY() helper and fix multiple -Warray-bounds warnings | Gustavo A. R. Silva
Transform zero-length array `s` into a proper flexible-array member in `struct snapshot_table` via the DECLARE_FLEX_ARRAY() helper; and fix tons of the following -Warray-bounds warnings:
fs/bcachefs/snapshot.h:36:21: warning: array subscript <unknown> is outside array bounds of 'struct snapshot_t[0]' [-Warray-bounds=]
fs/bcachefs/snapshot.h:36:21: warning: array subscript <unknown> is outside array bounds of 'struct snapshot_t[0]' [-Warray-bounds=]
fs/bcachefs/snapshot.c:135:70: warning: array subscript <unknown> is outside array bounds of 'struct snapshot_t[0]' [-Warray-bounds=]
fs/bcachefs/snapshot.h:36:21: warning: array subscript <unknown> is outside array bounds of 'struct snapshot_t[0]' [-Warray-bounds=]
fs/bcachefs/snapshot.h:36:21: warning: array subscript <unknown> is outside array bounds of 'struct snapshot_t[0]' [-Warray-bounds=]
fs/bcachefs/snapshot.h:36:21: warning: array subscript <unknown> is outside array bounds of 'struct snapshot_t[0]' [-Warray-bounds=]
This helps with the ongoing efforts to globally enable -Warray-bounds.
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-06 | bcachefs: Split out btree_key_cache_types.h | Kent Overstreet
More consistent organization. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-11-05 | bcachefs: make bch2_target_to_text_sb static | Jiapeng Chong
bch2_target_to_text_sb() is not used outside disk_groups.c, so make it static. This fixes: fs/bcachefs/disk_groups.c:583:6: warning: no previous prototype for ‘bch2_target_to_text_sb’. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7144 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-11-05 | bcachefs: six locks: Simplify optimistic spinning | Kent Overstreet
osq lock maintainers don't want it to be used outside of kernel/locking/ - but, we can do better. Since we have lock handoff signalled via waitlist entries, there's no reason for optimistic spinning to have to look at the lock at all - aside from checking lock-owner; we can just spin looking at our waitlist entry. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 | powerpc: Export kvm_guest static key, for bcachefs six locks | Kent Overstreet
bcachefs's six locks need kvm_guest, via owner_on_cpu() -> vcpu_is_preempted() -> is_kvm_guest(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: linuxppc-dev@lists.ozlabs.org Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
2023-11-05 | kbuild: Allow gcov to be enabled on the command line | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 | rhashtable: Better error message on allocation failure | Kent Overstreet
Memory allocation failures print backtraces by default, but when we're running out of a rhashtable worker the backtrace is useless - it doesn't tell us which hashtable the allocation failure was for. This adds a dedicated warning that prints out functions from the rhashtable params, which will be a bit more useful. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Thomas Graf <tgraf@suug.ch> Cc: Herbert Xu <herbert@gondor.apana.org.au>
2023-11-05 | Update issue templates | Daniel Hill
2023-11-05 | fs/aio: obey min_nr when doing wakeups | Kent Overstreet
I've been observing workloads where IPIs due to wakeups in aio_complete() are ~15% of total CPU time in the profile. Most of those wakeups are unnecessary when completion batching is in use in io_getevents(). This plumbs min_nr through via the wait entry, so that aio_complete() can avoid doing unnecessary wakeups. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: linux-aio@kvack.org Cc: linux-fsdevel@vger.kernel.org
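The idea of carrying min_nr in the wait entry can be sketched in single-threaded form as below. This is an illustration under assumptions, not the fs/aio.c implementation: `struct waiter`, `struct ring` and `ring_complete()` are hypothetical, and a real version would need locking or atomics and an actual wakeup primitive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wait entry: records how many events the waiter needs. */
struct waiter {
	unsigned	min_nr;
	bool		woken;
};

/* Hypothetical completion ring with at most one waiter. */
struct ring {
	unsigned	avail;		/* completed events not yet reaped */
	unsigned	nr_wakeups;	/* wakeups issued, for illustration */
	struct waiter	*w;
};

/* Post one completion; wake the waiter only once min_nr is satisfied. */
static void ring_complete(struct ring *r)
{
	assert(r);

	r->avail++;

	if (r->w && !r->w->woken && r->avail >= r->w->min_nr) {
		r->w->woken = true;	/* stands in for a real wakeup */
		r->nr_wakeups++;
	}
}
```

A waiter asking for four events is woken once, on the fourth completion, instead of once per completion.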
2023-11-05 | bcachefs: add counters for failed shrinker reclaim | Daniel Hill
These counters should help us debug OOM issues. Signed-off-by: Daniel Hill <daniel@gluo.nz>
2023-11-05 | bcachefs: shrinker.to_text() methods | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 | mm: vmscan: Fix leak in unregister_shrinker() | Kent Overstreet
shrinker->name needs to be freed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 | mm: Add shrinker_to_text() to debugfs interface | Kent Overstreet
Previously, we added shrinker_to_text() and hooked it up to the OOM report - now, the same report is available via debugfs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 | mm: Centralize & improve oom reporting in show_mem.c | Kent Overstreet
This patch:
- Changes show_mem() to always report on slab usage
- Instead of reporting on all slabs, we only report on top 10 slabs, and in sorted order
- Also reports on shrinkers, with the new shrinkers_to_text(). Shrinkers need to be included in OOM/allocation failure reporting because they're responsible for memory reclaim - if a shrinker isn't giving up its memory, we need to know which one and why.

More OOM reporting can be moved to show_mem.c and improved, this patch is only a start.

New example output on OOM/memory allocation failure:

00177 Mem-Info:
00177 active_anon:13706 inactive_anon:32266 isolated_anon:16
00177 active_file:1653 inactive_file:1822 isolated_file:0
00177 unevictable:0 dirty:0 writeback:0
00177 slab_reclaimable:6242 slab_unreclaimable:11168
00177 mapped:3824 shmem:3 pagetables:1266 bounce:0
00177 kernel_misc_reclaimable:0
00177 free:4362 free_pcp:35 free_cma:0
00177 Node 0 active_anon:54824kB inactive_anon:129064kB active_file:6612kB inactive_file:7288kB unevictable:0kB isolated(anon):64kB isolated(file):0kB mapped:15296kB dirty:0kB writeback:0kB shmem:12kB writeback_tmp:0kB kernel_stack:3392kB pagetables:5064kB all_unreclaimable? no
00177 DMA free:2232kB boost:0kB min:88kB low:108kB high:128kB reserved_highatomic:0KB active_anon:2924kB inactive_anon:6596kB active_file:428kB inactive_file:384kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 426 426 426
00177 DMA32 free:15092kB boost:5836kB min:8432kB low:9080kB high:9728kB reserved_highatomic:0KB active_anon:52196kB inactive_anon:122392kB active_file:6176kB inactive_file:7068kB unevictable:0kB writepending:0kB present:507760kB managed:441816kB mlocked:0kB bounce:0kB free_pcp:72kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 0 0 0
00177 DMA: 284*4kB (UM) 53*8kB (UM) 21*16kB (U) 11*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2248kB
00177 DMA32: 2765*4kB (UME) 375*8kB (UME) 57*16kB (UM) 5*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15132kB
00177 4656 total pagecache pages
00177 1031 pages in swap cache
00177 Swap cache stats: add 6572399, delete 6572173, find 488603/3286476
00177 Free swap = 509112kB
00177 Total swap = 2097148kB
00177 130938 pages RAM
00177 0 pages HighMem/MovableOnly
00177 16644 pages reserved
00177 Unreclaimable slab info:
00177 9p-fcall-cache total: 8.25 MiB active: 8.25 MiB
00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
00177 kmalloc-64 total: 2.08 MiB active: 2.07 MiB
00177 task_struct total: 1.95 MiB active: 1.95 MiB
00177 kmalloc-4k total: 1.50 MiB active: 1.50 MiB
00177 signal_cache total: 1.34 MiB active: 1.34 MiB
00177 kmalloc-2k total: 1.16 MiB active: 1.16 MiB
00177 bch_inode_info total: 1.02 MiB active: 922 KiB
00177 perf_event total: 1.02 MiB active: 1.02 MiB
00177 biovec-max total: 992 KiB active: 960 KiB
00177 Shrinkers:
00177 super_cache_scan: objects: 127
00177 super_cache_scan: objects: 106
00177 jbd2_journal_shrink_scan: objects: 32
00177 ext4_es_scan: objects: 32
00177 bch2_btree_cache_scan: objects: 8
00177 nr nodes: 24
00177 nr dirty: 0
00177 cannibalize lock: 0000000000000000
00177
00177 super_cache_scan: objects: 8
00177 super_cache_scan: objects: 1

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 | mm: Count requests to free & nr freed per shrinker | Kent Overstreet
The next step in this patch series for improving debugging of shrinker related issues: keep counts of the number of objects we request to free vs. actually freed, and print them in shrinker_to_text(). Shrinkers won't necessarily free all objects requested for a variety of reasons, but if the two counts are wildly different something is likely amiss. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-11-05 | mm: Add a .to_text() method for shrinkers | Kent Overstreet
This adds a new callback method to shrinkers which they can use to describe anything relevant to memory reclaim about their internal state, for example object dirtiness. This uses the new printbufs to output to heap allocated strings, so that the .to_text() methods can be used both for messages logged to the console, and also sysfs/debugfs. This patch also adds shrinkers_to_text(), which reports on the top 10 shrinkers - by object count - in sorted order, to be used in OOM reporting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
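The "top N by object count, in sorted order" report can be sketched in userspace as follows. Everything here is an assumption made for illustration: `struct shrinker_info` and `shrinkers_report()` are hypothetical, a plain character buffer with snprintf() stands in for printbufs, and per-shrinker `.to_text()` callbacks are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-shrinker summary; not the kernel's struct shrinker. */
struct shrinker_info {
	const char	*name;
	unsigned long	 objects;
};

/* Sort by object count, largest first. */
static int cmp_objects_desc(const void *l, const void *r)
{
	const struct shrinker_info *a = l, *b = r;

	return (a->objects < b->objects) - (a->objects > b->objects);
}

/*
 * Sort the array in place and write one line per shrinker for the
 * top_n largest into buf. Returns the number of entries written.
 */
static size_t shrinkers_report(struct shrinker_info *s, size_t nr, size_t top_n,
			       char *buf, size_t len)
{
	size_t i, off = 0, n = nr < top_n ? nr : top_n;

	assert(buf && len > 0);
	buf[0] = '\0';

	qsort(s, nr, sizeof(*s), cmp_objects_desc);

	for (i = 0; i < n && off < len; i++) {
		int w = snprintf(buf + off, len - off, "%s: objects: %lu\n",
				 s[i].name, s[i].objects);
		if (w < 0)
			break;
		off += (size_t) w;	/* if truncated, off >= len ends the loop */
	}
	return i;
}
```

Capping the report at a fixed N keeps an OOM message bounded in size regardless of how many shrinkers are registered.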
2023-11-05 | seq_buf: seq_buf_human_readable_u64() | Kent Overstreet
This adds a seq_buf wrapper for string_get_size(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 | xfs: add nodataio mount option to skip all data I/O | Brian Foster
When mounted with nodataio, add the NOSUBMIT iomap flag to all data mappings passed into the iomap layer. This causes iomap to skip all data I/O submission and thus facilitates metadata only performance testing. For experimental use only. Only tested insofar as fsstress runs for a few minutes without blowing up. Signed-off-by: Brian Foster <bfoster@redhat.com>
2023-11-05 | iomap: add nosubmit flag to skip data I/O on iomap mapping | Brian Foster
Implement a quick and dirty hack to skip data I/O submission on a specified mapping. The iomap layer will still perform every step up through constructing the bio as if it will be submitted, but instead invokes completion on the bio directly from submit context. The purpose of this is to facilitate filesystem metadata performance testing without the overhead of actual data I/O. Note that this may be dangerous in current form in that folios are not explicitly zeroed where they otherwise wouldn't be, so whatever previous data exists in a folio prior to being added to a read bio is mapped into pagecache for the file. Signed-off-by: Brian Foster <bfoster@redhat.com>
2023-11-05 | vfs: inode cache conversion to hash-bl | Dave Chinner
Because scalability of the global inode_hash_lock really, really sucks.

32-way concurrent create on a couple of different filesystems before:

- 52.13% 0.04% [kernel] [k] ext4_create
- 52.09% ext4_create
- 41.03% __ext4_new_inode
- 29.92% insert_inode_locked
- 25.35% _raw_spin_lock
- do_raw_spin_lock
- 24.97% __pv_queued_spin_lock_slowpath

- 72.33% 0.02% [kernel] [k] do_filp_open
- 72.31% do_filp_open
- 72.28% path_openat
- 57.03% bch2_create
- 56.46% __bch2_create
- 40.43% inode_insert5
- 36.07% _raw_spin_lock
- do_raw_spin_lock
35.86% __pv_queued_spin_lock_slowpath
4.02% find_inode

Convert the inode hash table to a RCU-aware hash-bl table just like the dentry cache. Note that we need to store a pointer to the hlist_bl_head the inode has been added to in the inode so that when it comes to unhash the inode we know what list to lock. We need to do this because the hash value that is used to hash the inode is generated from the inode itself - filesystems can provide this themselves so we have to either store the hash or the head pointer in the inode to be able to find the right list head for removal...

Same workload after:

Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org
2023-11-05 | hlist-bl: add hlist_bl_fake() | Dave Chinner
In preparation for switching the VFS inode cache over to hlist_bl lists, we need to be able to fake a list node that looks like it is hashed for correct operation of filesystems that don't directly use the VFS inode cache. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2023-11-05 | vfs: factor out inode hash head calculation | Dave Chinner
In preparation for changing the inode hash table implementation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org
2023-11-05 | mean and variance: Promote to lib/math | Kent Overstreet
This promotes mean_and_variance from bcachefs to lib/math, so it can be used by other things. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-11-05 | Increase MAX_LOCK_DEPTH, bcachefs BTREE_ITER_MAX (do not upstream) | Kent Overstreet
2023-11-05 | bcachefs: Switch to lockdep_set_no_check_recursion() | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>