Age | Commit message | Author |
|
A few fsck paths weren't using BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE -
oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We now print out the full previous extent we're overlapping with, to aid in
debugging and searching through the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This is prep work for consolidating with JOURNAL_WATERMARK.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Upcoming rebalance_work btree will require extent triggers to be
BTREE_TRIGGER_WANTS_OLD_AND_NEW - so to reduce potential confusion,
let's just make all triggers BTREE_TRIGGER_WANTS_OLD_AND_NEW.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When we return errors outside of bcachefs, we need to return a standard
error code - fix this for BCH_ERR_fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds an option to specify whether btree write buffer updates go on
to the head or the tail of the list of pending updates.
When triggers are updating the backpointers btree, we need deletions to run
before updates. When an extent is being updated in place, we need the
updates from deleting the old version of the extent to happen before the
updates adding backpointers for the new version of the extent, otherwise
we'll incorrectly delete our new backpointers.
This is contrary to what we need for alloc info/reflink btree updates,
where we need inserts to happen before overwrites: for alloc info, if an
extent is being moved we need to process the insert of the new version
of the extent before the deletion of the old extent - otherwise an
indirect extent or bucket could end up being marked as not in use
because the new reference hasn't been processed yet.
To solve this, add an explicit ordering parameter to
bch2_trans_update_buffered().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
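The head/tail distinction can be sketched as a plain pending-update list. This is an illustrative userspace sketch only: the names (`wb_queue`, `WB_HEAD`, `WB_TAIL`, `struct wb_update`) are made up, not the real `bch2_trans_update_buffered()` interface.

```c
#include <assert.h>
#include <stddef.h>

enum wb_order { WB_TAIL, WB_HEAD };

struct wb_update {
	const char *msg;
	struct wb_update *next;
};

struct wb_list {
	struct wb_update *first;
};

/* Head-ordered updates (e.g. backpointer deletions for the old extent)
 * are flushed before everything already queued; tail-ordered updates
 * (e.g. alloc/reflink inserts) run in FIFO order. */
static void wb_queue(struct wb_list *l, struct wb_update *u, enum wb_order order)
{
	if (order == WB_HEAD || !l->first) {
		u->next = l->first;
		l->first = u;
	} else {
		struct wb_update *t = l->first;

		while (t->next)
			t = t->next;
		u->next = NULL;
		t->next = u;
	}
}
```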
|
|
bch2_bucket_backpointer_mod() doesn't need to update the alloc key, so
we can exit the alloc iter earlier.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The uuid_t the kernel defines is different from uuid_t as defined by
libuuid, which is a problem because we need to use libuuid when building
in userspace.
Switch references to uuid_t to __uuid_t to fix this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
|
|
Similar to previous fixes, we can't incur page faults while holding
btree locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
generic_file_splice_read() is going away
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bio_release_pages() handles the BIO_NO_PAGE_REF check.
Also, iterating over/releasing _folios_ was incorrect, we need to match
how bio_iov_iter_get_pages() got the refs - single pages or folios.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
iov_iter is a union type that can now iterate over _many_ different
sources of pages; we can't treat them all like they point to an iov.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
uuid_le is being removed, and it wasn't even correct for bcachefs since
we weren't printing uuids as little endian.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
dir_emit() can fault (taking mmap_lock); thus we can't be holding btree
locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Memory allocation failures print backtraces by default, but when we're
running out of a rhashtable worker the backtrace is useless - it doesn't
tell us which hashtable the allocation failure was for.
This adds a dedicated warning that prints out functions from the
rhashtable params, which will be a bit more useful.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
|
|
|
|
I've been observing workloads where IPIs due to wakeups in
aio_complete() are ~15% of total CPU time in the profile. Most of those
wakeups are unnecessary when completion batching is in use in
io_getevents().
This plumbs min_nr through via the wait entry, so that aio_complete()
can avoid doing unnecessary wakeups.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: linux-aio@kvack.org
Cc: linux-fsdevel@vger.kernel.org
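The wakeup-avoidance idea above can be sketched in a few lines. Illustrative only, not the real aio code; the type and field names here are assumptions:

```c
#include <assert.h>
#include <stddef.h>

/* The waiter's min_nr rides along in its wait entry, so the completion
 * path can skip the wakeup (and the IPI it costs) until enough events
 * have accumulated to satisfy the waiter's batch. */
struct wait_entry {
	unsigned min_nr;	/* batch size from io_getevents() */
};

struct ring {
	unsigned nr_events;	/* completions available */
	unsigned nr_wakeups;	/* wakeups actually issued */
	struct wait_entry *waiter;
};

static void complete_one(struct ring *r)
{
	r->nr_events++;
	/* Only wake once the waiter's batch can be satisfied: */
	if (r->waiter && r->nr_events >= r->waiter->min_nr)
		r->nr_wakeups++;	/* would be wake_up() here */
}
```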
|
|
Originally, we used kmap() instead of kmap_atomic() for reading events
out of the completion ringbuffer because we're using copy_to_user(),
which can fault.
Now that kmap_local() is a thing, use that instead.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: linux-aio@kvack.org
Cc: linux-fsdevel@vger.kernel.org
|
|
These counters should help us debug OOM issues.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This patch:
- Changes show_mem() to always report on slab usage
- Instead of reporting on all slabs, reports only on the top 10 slabs,
  in sorted order
- Also reports on shrinkers, with the new shrinkers_to_text().
Shrinkers need to be included in OOM/allocation failure reporting
because they're responsible for memory reclaim - if a shrinker isn't
giving up its memory, we need to know which one and why.
More OOM reporting can be moved to show_mem.c and improved, this patch
is only a start.
New example output on OOM/memory allocation failure:
00177 Mem-Info:
00177 active_anon:13706 inactive_anon:32266 isolated_anon:16
00177 active_file:1653 inactive_file:1822 isolated_file:0
00177 unevictable:0 dirty:0 writeback:0
00177 slab_reclaimable:6242 slab_unreclaimable:11168
00177 mapped:3824 shmem:3 pagetables:1266 bounce:0
00177 kernel_misc_reclaimable:0
00177 free:4362 free_pcp:35 free_cma:0
00177 Node 0 active_anon:54824kB inactive_anon:129064kB active_file:6612kB inactive_file:7288kB unevictable:0kB isolated(anon):64kB isolated(file):0kB mapped:15296kB dirty:0kB writeback:0kB shmem:12kB writeback_tmp:0kB kernel_stack:3392kB pagetables:5064kB all_unreclaimable? no
00177 DMA free:2232kB boost:0kB min:88kB low:108kB high:128kB reserved_highatomic:0KB active_anon:2924kB inactive_anon:6596kB active_file:428kB inactive_file:384kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 426 426 426
00177 DMA32 free:15092kB boost:5836kB min:8432kB low:9080kB high:9728kB reserved_highatomic:0KB active_anon:52196kB inactive_anon:122392kB active_file:6176kB inactive_file:7068kB unevictable:0kB writepending:0kB present:507760kB managed:441816kB mlocked:0kB bounce:0kB free_pcp:72kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 0 0 0
00177 DMA: 284*4kB (UM) 53*8kB (UM) 21*16kB (U) 11*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2248kB
00177 DMA32: 2765*4kB (UME) 375*8kB (UME) 57*16kB (UM) 5*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15132kB
00177 4656 total pagecache pages
00177 1031 pages in swap cache
00177 Swap cache stats: add 6572399, delete 6572173, find 488603/3286476
00177 Free swap = 509112kB
00177 Total swap = 2097148kB
00177 130938 pages RAM
00177 0 pages HighMem/MovableOnly
00177 16644 pages reserved
00177 Unreclaimable slab info:
00177 9p-fcall-cache total: 8.25 MiB active: 8.25 MiB
00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
00177 kmalloc-64 total: 2.08 MiB active: 2.07 MiB
00177 task_struct total: 1.95 MiB active: 1.95 MiB
00177 kmalloc-4k total: 1.50 MiB active: 1.50 MiB
00177 signal_cache total: 1.34 MiB active: 1.34 MiB
00177 kmalloc-2k total: 1.16 MiB active: 1.16 MiB
00177 bch_inode_info total: 1.02 MiB active: 922 KiB
00177 perf_event total: 1.02 MiB active: 1.02 MiB
00177 biovec-max total: 992 KiB active: 960 KiB
00177 Shrinkers:
00177 super_cache_scan: objects: 127
00177 super_cache_scan: objects: 106
00177 jbd2_journal_shrink_scan: objects: 32
00177 ext4_es_scan: objects: 32
00177 bch2_btree_cache_scan: objects: 8
00177 nr nodes: 24
00177 nr dirty: 0
00177 cannibalize lock: 0000000000000000
00177
00177 super_cache_scan: objects: 8
00177 super_cache_scan: objects: 1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
show_mem.c is really mm specific, and the next patch in the series is
going to require mm/slab.h, so let's move it before doing more work on
it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
The next step in this patch series for improving debugging of shrinker
related issues: keep counts of the number of objects we request to free
vs. actually freed, and print them in shrinker_to_text().
Shrinkers won't necessarily free all objects requested for a variety of
reasons, but if the two counts are wildly different something is likely
amiss.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This adds a new callback method to shrinkers which they can use to
describe anything relevant to memory reclaim about their internal state,
for example object dirtiness.
This uses the new printbufs to output to heap allocated strings, so that
the .to_text() methods can be used both for messages logged to the
console, and also sysfs/debugfs.
This patch also adds shrinkers_to_text(), which reports on the top 10
shrinkers - by object count - in sorted order, to be used in OOM
reporting.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds a seq_buf wrapper for string_get_size().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
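The gist of such a wrapper, as a userspace sketch: append a human-readable size to a buffer, the way string_get_size() formats one into a seq_buf. The function name below is made up, not the kernel API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static void buf_append_size(char *buf, size_t cap, unsigned long long bytes)
{
	static const char *units[] = { "B", "KiB", "MiB", "GiB", "TiB" };
	unsigned long long rem = 0;
	unsigned i = 0;
	size_t len = strlen(buf);

	/* Divide down to the largest unit, keeping the remainder for
	 * two decimal places: */
	while (bytes >= 1024 && i < 4) {
		rem = bytes % 1024;
		bytes /= 1024;
		i++;
	}
	snprintf(buf + len, cap - len, "%llu.%02llu %s",
		 bytes, (rem * 100) / 1024, units[i]);
}
```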
|
|
When mounted with nodataio, add the NOSUBMIT iomap flag to all data
mappings passed into the iomap layer. This causes iomap to skip all
data I/O submission and thus facilitates metadata only performance
testing.
For experimental use only. Only tested insofar as fsstress runs for
a few minutes without blowing up.
Signed-off-by: Brian Foster <bfoster@redhat.com>
|
|
Implement a quick and dirty hack to skip data I/O submission on a
specified mapping. The iomap layer will still perform every step up
through constructing the bio as if it will be submitted, but instead
invokes completion on the bio directly from submit context. The
purpose of this is to facilitate filesystem metadata performance
testing without the overhead of actual data I/O.
Note that this may be dangerous in its current form in that folios are
not explicitly zeroed where they otherwise wouldn't be, so whatever
previous data exists in a folio prior to being added to a read bio
is mapped into pagecache for the file.
Signed-off-by: Brian Foster <bfoster@redhat.com>
|
|
Because scalability of the global inode_hash_lock really, really
sucks.
32-way concurrent create on a couple of different filesystems
before:
- 52.13% 0.04% [kernel] [k] ext4_create
- 52.09% ext4_create
- 41.03% __ext4_new_inode
- 29.92% insert_inode_locked
- 25.35% _raw_spin_lock
- do_raw_spin_lock
- 24.97% __pv_queued_spin_lock_slowpath
- 72.33% 0.02% [kernel] [k] do_filp_open
- 72.31% do_filp_open
- 72.28% path_openat
- 57.03% bch2_create
- 56.46% __bch2_create
- 40.43% inode_insert5
- 36.07% _raw_spin_lock
- do_raw_spin_lock
35.86% __pv_queued_spin_lock_slowpath
4.02% find_inode
Convert the inode hash table to a RCU-aware hash-bl table just like
the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode, so that when it
comes time to unhash the inode we know which list to lock. We need to
do this because the hash value that is used to hash the inode is
generated from the inode itself - filesystems can provide this
themselves so we have to either store the hash or the head pointer
in the inode to be able to find the right list head for removal...
Same workload after:
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
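The "remember the bucket head" trick can be sketched in userspace C. Illustrative only, not the real hlist_bl API: the hash comes from the inode/filesystem, so we stash the bucket pointer at insert time and removal never recomputes the hash.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BUCKETS 8

struct hnode { struct hnode *next; };
struct hhead { struct hnode *first; };

struct fake_inode {
	unsigned long hashval;
	struct hhead *bucket;	/* which list head we were chained on */
	struct hnode node;
};

static struct hhead table[NR_BUCKETS];

static void ihash_insert(struct fake_inode *i)
{
	struct hhead *h = &table[i->hashval % NR_BUCKETS];

	i->node.next = h->first;
	h->first = &i->node;
	i->bucket = h;		/* saved for removal (and for locking) */
}

static void ihash_remove(struct fake_inode *i)
{
	struct hnode **p = &i->bucket->first;	/* no rehash needed */

	while (*p != &i->node)
		p = &(*p)->next;
	*p = i->node.next;
	i->bucket = NULL;
}
```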
|
|
In preparation for switching the VFS inode cache over to hlist_bl
lists, we need to be able to fake a list node that looks like it is
hashed, for correct operation of filesystems that don't directly use
the VFS inode cache.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
|
|
In preparation for changing the inode hash table implementation.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
|
|
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add two new helpers for printing error messages with __func__ and
bch2_err_str():
- bch_err_fn
- bch_err_msg
Also kill the old error strings in the recovery path, which were causing
us to incorrectly report memory allocation failures - they're not needed
anymore.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
As with the previous patch, we generally can't hold btree locks while
copying to userspace, as that may incur a page fault and require
mmap_lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can't be holding btree_trans_lock while copying to user space, which
might incur a page fault. To fix this, convert it to a seqmutex so we
can unlock/relock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
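The unlock/relock dance works like this, as a minimal single-threaded sketch (the real seqmutex wraps an actual mutex; these names and fields are illustrative): unlock hands back a sequence number, and relock only succeeds if nobody took the lock in between, i.e. our iteration state is still valid.

```c
#include <assert.h>
#include <stdbool.h>

struct seqmutex {
	bool locked;
	unsigned seq;
};

static void seqmutex_lock(struct seqmutex *m)
{
	m->locked = true;
	m->seq++;	/* every acquisition bumps the sequence */
}

static unsigned seqmutex_unlock(struct seqmutex *m)
{
	m->locked = false;
	return m->seq;	/* caller keeps this to attempt a relock */
}

/* Re-take the lock only if the sequence hasn't moved on: */
static bool seqmutex_relock(struct seqmutex *m, unsigned seq)
{
	if (m->locked || m->seq != seq)
		return false;
	m->locked = true;
	return true;
}
```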
|
|
This fixes a deadlock:
01305 WARNING: possible circular locking dependency detected
01305 6.3.0-ktest-gf4de9bee61af #5305 Tainted: G W
01305 ------------------------------------------------------
01305 cat/14658 is trying to acquire lock:
01305 ffffffc00982f460 (fs_reclaim){+.+.}-{0:0}, at: __kmem_cache_alloc_node+0x48/0x278
01305
01305 but task is already holding lock:
01305 ffffff8011aaf040 (&lock->wait_lock){+.+.}-{2:2}, at: bch2_check_for_deadlock+0x4b8/0xa58
01305
01305 which lock already depends on the new lock.
01305
01305
01305 the existing dependency chain (in reverse order) is:
01305
01305 -> #2 (&lock->wait_lock){+.+.}-{2:2}:
01305 _raw_spin_lock+0x54/0x70
01305 __six_lock_wakeup+0x40/0x1b0
01305 six_unlock_ip+0xe8/0x248
01305 bch2_btree_key_cache_scan+0x720/0x940
01305 shrink_slab.constprop.0+0x284/0x770
01305 shrink_node+0x390/0x828
01305 balance_pgdat+0x390/0x6d0
01305 kswapd+0x2e4/0x718
01305 kthread+0x184/0x1a8
01305 ret_from_fork+0x10/0x20
01305
01305 -> #1 (&c->lock#2){+.+.}-{3:3}:
01305 __mutex_lock+0x104/0x14a0
01305 mutex_lock_nested+0x30/0x40
01305 bch2_btree_key_cache_scan+0x5c/0x940
01305 shrink_slab.constprop.0+0x284/0x770
01305 shrink_node+0x390/0x828
01305 balance_pgdat+0x390/0x6d0
01305 kswapd+0x2e4/0x718
01305 kthread+0x184/0x1a8
01305 ret_from_fork+0x10/0x20
01305
01305 -> #0 (fs_reclaim){+.+.}-{0:0}:
01305 __lock_acquire+0x19d0/0x2930
01305 lock_acquire+0x1dc/0x458
01305 fs_reclaim_acquire+0x9c/0xe0
01305 __kmem_cache_alloc_node+0x48/0x278
01305 __kmalloc_node_track_caller+0x5c/0x278
01305 krealloc+0x94/0x180
01305 bch2_printbuf_make_room.part.0+0xac/0x118
01305 bch2_prt_printf+0x150/0x1e8
01305 bch2_btree_bkey_cached_common_to_text+0x170/0x298
01305 bch2_btree_trans_to_text+0x244/0x348
01305 print_cycle+0x7c/0xb0
01305 break_cycle+0x254/0x528
01305 bch2_check_for_deadlock+0x59c/0xa58
01305 bch2_btree_deadlock_read+0x174/0x200
01305 full_proxy_read+0x94/0xf0
01305 vfs_read+0x15c/0x3a8
01305 ksys_read+0xb8/0x148
01305 __arm64_sys_read+0x48/0x60
01305 invoke_syscall.constprop.0+0x64/0x138
01305 do_el0_svc+0x84/0x138
01305 el0_svc+0x34/0x80
01305 el0t_64_sync_handler+0xb0/0xb8
01305 el0t_64_sync+0x14c/0x150
01305
01305 other info that might help us debug this:
01305
01305 Chain exists of:
01305 fs_reclaim --> &c->lock#2 --> &lock->wait_lock
01305
01305 Possible unsafe locking scenario:
01305
01305 CPU0 CPU1
01305 ---- ----
01305 lock(&lock->wait_lock);
01305 lock(&c->lock#2);
01305 lock(&lock->wait_lock);
01305 lock(fs_reclaim);
01305
01305 *** DEADLOCK ***
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We weren't correctly checking the freespace btree - it's an extents
btree, which means we need to iterate over each bucket in a freespace
extent.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
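The shape of the fix, with illustrative types (not the real bcachefs key format): a key in the freespace btree covers a *range* of buckets, so a check has to walk every bucket in [start, end), not just look at the key's position.

```c
#include <assert.h>

struct freespace_extent {
	unsigned long start, end;	/* bucket range, end exclusive */
};

/* Count buckets the freespace btree claims are free but aren't,
 * per a caller-supplied per-bucket free map: */
static unsigned long check_freespace(const struct freespace_extent *e,
				     const unsigned char *bucket_free)
{
	unsigned long mismatches = 0;

	for (unsigned long b = e->start; b < e->end; b++)
		if (!bucket_free[b])
			mismatches++;
	return mismatches;
}
```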
|
|
This fixes a spurious assert in the btree node read path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The calculation for number of nodes to allocate in
bch2_btree_update_start() was incorrect - this fixes a BUG_ON() on the
small nodes test.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds a new helper for getting a pointer's durability irrespective
of the device state, and uses it in the data update path.
This fixes a bug where we do a data update but request 0 replicas to be
allocated, because the replica being rewritten is on a device marked as
failed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
- We may need to drop btree locks before taking the writepoint_lock, as
is done in other places.
- We should be using open_bucket_free_unused(), so that we don't waste
space.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Since we currently don't have a good fault injection library,
bch2_btree_insert_node() was randomly injecting faults based on
local_clock().
At the very least this should have been a debug mode only thing, but
this is a brittle method so let's just delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree write buffer flush is only invoked from contexts that already hold
a write ref, and checking if we're still RW could cause us to fail to
completely flush the write buffer when shutting down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
On Mon, 29 May 2023, Mikulas Patocka wrote:
> The oops happens in set_btree_iter_dontneed and it is caused by the fact
> that iter->path is NULL. The code in try_alloc_bucket is buggy because it
> sets "struct btree_iter iter = { NULL };" and then jumps to the "err"
> label that tries to dereference values in "iter".
Here I'm sending a patch for it.
From: Mikulas Patocka <mpatocka@redhat.com>
The function try_alloc_bucket sets the variable "iter" to NULL and then
(on various error conditions) jumps to the label "err". On the "err"
label, it calls "set_btree_iter_dontneed" that tries to dereference
"iter->trans" and "iter->path".
So, we get an oops on error condition.
This patch fixes the crash by testing that iter.trans and iter.path are
non-zero before calling set_btree_iter_dontneed.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
|