2023-06-25  bcachefs: fsck needs BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE  [tag: bcachefs-v6.3]  (Kent Overstreet)
A few fsck paths weren't using BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE - oops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-24  bcachefs: Improve error message for overlapping extents  (Kent Overstreet)
We now print out the full previous extent we're overlapping with, to aid in debugging and searching through the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-24  bcachefs: Fix check_pos_snapshot_overwritten()  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-24  bcachefs: Rename enum alloc_reserve -> bch_watermark  (Kent Overstreet)
This is prep work for consolidating with JOURNAL_WATERMARK. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-24  bcachefs: All triggers are BTREE_TRIGGER_WANTS_OLD_AND_NEW  (Kent Overstreet)
Upcoming rebalance_work btree will require extent triggers to be BTREE_TRIGGER_WANTS_OLD_AND_NEW - so to reduce potential confusion, let's just make all triggers BTREE_TRIGGER_WANTS_OLD_AND_NEW. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-24  bcachefs: BCH_ERR_fsck -> EINVAL  (Kent Overstreet)
When we return errors outside of bcachefs, we need to return a standard error code - fix this for BCH_ERR_fsck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-24  bcachefs: btree write buffer update ordering  (Kent Overstreet)
This adds an option to specify whether btree write buffer updates go on to the head or the tail of the list of pending updates.

When triggers are updating the backpointers btree, we need deletions to run before updates: when an extent is being updated in place, the updates from deleting the old version of the extent need to happen before the updates adding backpointers for the new version of the extent, otherwise we'll incorrectly delete our new backpointers.

This is contrary to what we need for alloc info/reflink btree updates, where we need inserts to happen before overwrites: for alloc info, if an extent is being moved we need to process the insert of the new version of the extent before the deletion of the old extent - otherwise an indirect extent or bucket could end up being marked as not in use because the new reference hasn't been processed yet.

To solve this, add an explicit ordering parameter to bch2_trans_update_buffered().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
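In outline, the new parameter might look like the following - a minimal sketch only; the enum and parameter names here are illustrative, not necessarily what is in the tree:

	/* whether a buffered update runs before or after already-pending ones */
	enum btree_write_buffered_pos {
		BTREE_WB_HEAD,	/* run before updates already queued */
		BTREE_WB_TAIL,	/* run after them */
	};

	int bch2_trans_update_buffered(struct btree_trans *trans,
				       enum btree_id btree, struct bkey_i *k,
				       enum btree_write_buffered_pos pos);

Backpointer deletions would then be queued at the head, while alloc info/reflink inserts go at the tail.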
2023-06-24  bcachefs: bch2_trans_mark_pointer() refactoring  (Kent Overstreet)
bch2_bucket_backpointer_mod() doesn't need to update the alloc key, we can exit the alloc iter earlier. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-24  bcachefs: uuid_t -> __uuid_t  (Kent Overstreet)
The uuid_t the kernel defines is different from uuid_t as defined by libuuid, which is a problem because we need to use libuuid when building in userspace. Switch references to uuid_t to __uuid_t to fix this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-22  fixup! bcachefs: Switch to uuid_t instead of uuid_le  (Kent Overstreet)
2023-06-22  bcachefs: Fix more lockdep splats in debug.c  (Kent Overstreet)
Similar to previous fixes, we can't incur page faults while holding btree locks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-22  bcachefs: generic_file_splice_read() -> filemap_splice_read()  (Kent Overstreet)
generic_file_splice_read() is going away Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-22  bcachefs: fs-io.c: dio write path: use bio_release_pages()  (Kent Overstreet)
bio_release_pages() handles the BIO_NO_PAGE_REF check. Also, iterating over/releasing _folios_ was incorrect, we need to match how bio_iov_iter_get_pages() got the refs - single pages or folios. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
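A sketch of what the cleanup boils down to (the wrapper name here is hypothetical; bio_release_pages() is the real block-layer helper):

	#include <linux/bio.h>

	static void dio_write_release_pages(struct bio *bio)
	{
		/* bio_release_pages() checks BIO_NO_PAGE_REF itself and drops
		 * refs exactly the way bio_iov_iter_get_pages() acquired them,
		 * whether that was per-page or per-folio */
		bio_release_pages(bio, false);
	}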
2023-06-22  bcachefs: fix bch2_dio_write_copy_iov() to check iov_iter type  (Kent Overstreet)
iov_iter is a union type that can now iterate over _many_ different sources of pages, we can't treat them all like they point to an iov. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
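A sketch of the check, assuming the standard iter_is_iovec() helper and a dio->iter field (the surrounding function is abbreviated):

	static noinline int bch2_dio_write_copy_iov(struct dio_write *dio)
	{
		/* an iov_iter may be backed by bvecs, kvecs, an xarray, etc.;
		 * only a genuinely iovec-backed iter can be copied as an
		 * array of struct iovec */
		if (!iter_is_iovec(&dio->iter))
			return -1;

		/* ... duplicate the iovec array into dio-owned memory ... */
		return 0;
	}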
2023-06-22  bcachefs: Switch to uuid_t instead of uuid_le  (Kent Overstreet)
uuid_le is being removed, and it wasn't even correct for bcachefs since we weren't printing uuids as little endian. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-21  bcachefs: Fix lockdep splat in bch2_readdir  (Kent Overstreet)
dir_emit() can fault (taking mmap_lock); thus we can't be holding btree locks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-21  kbuild: Allow gcov to be enabled on the command line  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-21  rhashtable: Better error message on allocation failure  (Kent Overstreet)
Memory allocation failures print backtraces by default, but when we're running out of a rhashtable worker the backtrace is useless - it doesn't tell us which hashtable the allocation failure was for. This adds a dedicated warning that prints out functions from the rhashtable params, which will be a bit more useful.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
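Something along these lines - a hypothetical sketch, since the exact message and context may differ; %ps prints the symbol name of a function pointer:

	/* identify the failing table by its params' function pointers,
	 * since a bare allocation-failure backtrace from the worker says
	 * nothing about which hashtable was involved */
	if (!new_tbl)
		pr_warn("rhashtable: bucket table allocation failed (hashfn %ps, obj_hashfn %ps)\n",
			ht->p.hashfn, ht->p.obj_hashfn);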
2023-06-21  Update issue templates  (Daniel Hill)
2023-06-21  fs/aio: obey min_nr when doing wakeups  (Kent Overstreet)
I've been observing workloads where IPIs due to wakeups in aio_complete() are ~15% of total CPU time in the profile. Most of those wakeups are unnecessary when completion batching is in use in io_getevents(). This plumbs min_nr through via the wait entry, so that aio_complete() can avoid doing unnecessary wakeups.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: linux-aio@kvack.org
Cc: linux-fsdevel@vger.kernel.org
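Roughly, the idea is a custom wait entry carrying the caller's min_nr so that aio_complete() can skip waiters whose batch isn't ready yet; a sketch, with names assumed:

	struct aio_waiter {
		struct wait_queue_entry	w;
		size_t			min_nr;
	};

	/* in aio_complete(), with `avail` = number of events now available: */
	struct aio_waiter *curr, *next;
	unsigned long flags;

	spin_lock_irqsave(&ctx->wait.lock, flags);
	list_for_each_entry_safe(curr, next, &ctx->wait.head, w.entry)
		if (avail >= curr->min_nr) {
			/* enough events for this waiter's batch: wake it */
			list_del_init_careful(&curr->w.entry);
			wake_up_process(curr->w.private);
		}
	spin_unlock_irqrestore(&ctx->wait.lock, flags);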
2023-06-21  fs/aio: Use kmap_local() instead of kmap()  (Kent Overstreet)
Originally, we used kmap() instead of kmap_atomic() for reading events out of the completion ringbuffer, because we're using copy_to_user(), which can fault. Now that kmap_local() is a thing, use that instead.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: linux-aio@kvack.org
Cc: linux-fsdevel@vger.kernel.org
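The conversion follows the usual pattern: kmap_local mappings, unlike kmap_atomic() ones, may be held across operations that can fault. A self-contained sketch (helper name assumed):

	#include <linux/highmem.h>
	#include <linux/uaccess.h>

	/* copy one completion event to userspace via a local mapping */
	static long read_one_event(struct page *page, unsigned offset,
				   struct io_event __user *uevent)
	{
		struct io_event *ev = kmap_local_page(page);
		/* copy_to_user() may fault: legal under kmap_local_page(),
		 * but it was not under kmap_atomic() */
		long ret = copy_to_user(uevent, (char *)ev + offset, sizeof(*ev));

		kunmap_local(ev);
		return ret;
	}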
2023-06-21  bcachefs: add counters for failed shrinker reclaim  (Daniel Hill)
These counters should help us debug OOM issues. Signed-off-by: Daniel Hill <daniel@gluo.nz>
2023-06-21  bcachefs: shrinker.to_text() methods  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-21  mm: Centralize & improve oom reporting in show_mem.c  (Kent Overstreet)
This patch:
- Changes show_mem() to always report on slab usage
- Instead of reporting on all slabs, we only report on the top 10 slabs, and in sorted order
- Also reports on shrinkers, with the new shrinkers_to_text()

Shrinkers need to be included in OOM/allocation failure reporting because they're responsible for memory reclaim - if a shrinker isn't giving up its memory, we need to know which one and why.

More OOM reporting can be moved to show_mem.c and improved; this patch is only a start.

New example output on OOM/memory allocation failure:

00177 Mem-Info:
00177 active_anon:13706 inactive_anon:32266 isolated_anon:16
00177 active_file:1653 inactive_file:1822 isolated_file:0
00177 unevictable:0 dirty:0 writeback:0
00177 slab_reclaimable:6242 slab_unreclaimable:11168
00177 mapped:3824 shmem:3 pagetables:1266 bounce:0
00177 kernel_misc_reclaimable:0
00177 free:4362 free_pcp:35 free_cma:0
00177 Node 0 active_anon:54824kB inactive_anon:129064kB active_file:6612kB inactive_file:7288kB unevictable:0kB isolated(anon):64kB isolated(file):0kB mapped:15296kB dirty:0kB writeback:0kB shmem:12kB writeback_tmp:0kB kernel_stack:3392kB pagetables:5064kB all_unreclaimable? no
00177 DMA free:2232kB boost:0kB min:88kB low:108kB high:128kB reserved_highatomic:0KB active_anon:2924kB inactive_anon:6596kB active_file:428kB inactive_file:384kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 426 426 426
00177 DMA32 free:15092kB boost:5836kB min:8432kB low:9080kB high:9728kB reserved_highatomic:0KB active_anon:52196kB inactive_anon:122392kB active_file:6176kB inactive_file:7068kB unevictable:0kB writepending:0kB present:507760kB managed:441816kB mlocked:0kB bounce:0kB free_pcp:72kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 0 0 0
00177 DMA: 284*4kB (UM) 53*8kB (UM) 21*16kB (U) 11*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2248kB
00177 DMA32: 2765*4kB (UME) 375*8kB (UME) 57*16kB (UM) 5*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15132kB
00177 4656 total pagecache pages
00177 1031 pages in swap cache
00177 Swap cache stats: add 6572399, delete 6572173, find 488603/3286476
00177 Free swap = 509112kB
00177 Total swap = 2097148kB
00177 130938 pages RAM
00177 0 pages HighMem/MovableOnly
00177 16644 pages reserved
00177 Unreclaimable slab info:
00177 9p-fcall-cache    total: 8.25 MiB active: 8.25 MiB
00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
00177 kmalloc-64        total: 2.08 MiB active: 2.07 MiB
00177 task_struct       total: 1.95 MiB active: 1.95 MiB
00177 kmalloc-4k        total: 1.50 MiB active: 1.50 MiB
00177 signal_cache      total: 1.34 MiB active: 1.34 MiB
00177 kmalloc-2k        total: 1.16 MiB active: 1.16 MiB
00177 bch_inode_info    total: 1.02 MiB active: 922 KiB
00177 perf_event        total: 1.02 MiB active: 1.02 MiB
00177 biovec-max        total: 992 KiB active: 960 KiB
00177 Shrinkers:
00177 super_cache_scan: objects: 127
00177 super_cache_scan: objects: 106
00177 jbd2_journal_shrink_scan: objects: 32
00177 ext4_es_scan: objects: 32
00177 bch2_btree_cache_scan: objects: 8
00177   nr nodes:         24
00177   nr dirty:         0
00177   cannibalize lock: 0000000000000000
00177
00177 super_cache_scan: objects: 8
00177 super_cache_scan: objects: 1

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-21  mm: Move lib/show_mem.c to mm/  (Kent Overstreet)
show_mem.c is really mm specific, and the next patch in the series is going to require mm/slab.h, so let's move it before doing more work on it. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-21  mm: Count requests to free & nr freed per shrinker  (Kent Overstreet)
The next step in this patch series for improving debugging of shrinker related issues: keep counts of number of objects we request to free vs. actually freed, and prints them in shrinker_to_text(). Shrinkers won't necessarily free all objects requested for a variety of reasons, but if the two counts are wildly different something is likely amiss. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
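The bookkeeping reduces to something like this in do_shrink_slab(), around each ->scan_objects() call (counter names hypothetical):

	unsigned long freed;

	/* record what we asked for vs. what the shrinker actually gave back */
	atomic_long_add(nr_to_scan, &shrinker->objects_requested_to_free);
	freed = shrinker->scan_objects(shrinker, shrinkctl);
	if (freed != SHRINK_STOP)
		atomic_long_add(freed, &shrinker->objects_freed);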
2023-06-21  mm: Add a .to_text() method for shrinkers  (Kent Overstreet)
This adds a new callback method to shrinkers which they can use to describe anything relevant to memory reclaim about their internal state, for example object dirtiness.

This uses the new printbufs to output to heap-allocated strings, so that the .to_text() methods can be used both for messages logged to the console and for sysfs/debugfs.

This patch also adds shrinkers_to_text(), which reports on the top 10 shrinkers - by object count - in sorted order, to be used in OOM reporting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
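A sketch of the callback plus an illustrative implementation (the printbuf API follows the style this series introduces; the struct and field names below are made up):

	/* new member of struct shrinker: */
	void (*to_text)(struct printbuf *, struct shrinker *);

	/* example implementation for a hypothetical cache */
	static void mycache_shrinker_to_text(struct printbuf *out, struct shrinker *s)
	{
		struct mycache *c = container_of(s, struct mycache, shrinker);

		prt_printf(out, "nr nodes:\t%u\n", c->nr_nodes);
		prt_printf(out, "nr dirty:\t%u\n", c->nr_dirty);
	}

This is the kind of per-shrinker detail visible in the OOM report above (nr nodes, nr dirty, cannibalize lock).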
2023-06-21  seq_buf: seq_buf_human_readable_u64()  (Kent Overstreet)
This adds a seq_buf wrapper for string_get_size(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
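Probably little more than the following - a sketch, assuming string_get_size() from linux/string_helpers.h and base-2 units:

	void seq_buf_human_readable_u64(struct seq_buf *s, u64 v)
	{
		char buf[32];

		/* e.g. 1048576 -> "1.00 MiB" */
		string_get_size(v, 1, STRING_UNITS_2, buf, sizeof(buf));
		seq_buf_printf(s, "%s", buf);
	}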
2023-06-21  xfs: add nodataio mount option to skip all data I/O  (Brian Foster)
When mounted with nodataio, add the NOSUBMIT iomap flag to all data mappings passed into the iomap layer. This causes iomap to skip all data I/O submission and thus facilitates metadata only performance testing. For experimental use only. Only tested insofar as fsstress runs for a few minutes without blowing up. Signed-off-by: Brian Foster <bfoster@redhat.com>
2023-06-21  iomap: add nosubmit flag to skip data I/O on iomap mapping  (Brian Foster)
Implement a quick and dirty hack to skip data I/O submission on a specified mapping. The iomap layer will still perform every step up through constructing the bio as if it were to be submitted, but instead invokes completion on the bio directly from submit context. The purpose of this is to facilitate filesystem metadata performance testing without the overhead of actual data I/O.

Note that this may be dangerous in its current form, in that folios are not explicitly zeroed where they otherwise would be, so whatever previous data exists in a folio prior to being added to a read bio is mapped into pagecache for the file.

Signed-off-by: Brian Foster <bfoster@redhat.com>
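A sketch of the submit-side hook (the flag name follows the description above; the helper name is assumed):

	static void iomap_submit_or_skip(const struct iomap *iomap, struct bio *bio)
	{
		if (iomap->flags & IOMAP_F_NOSUBMIT) {
			/* pretend the I/O happened: complete the bio from
			 * submit context. Folios are NOT zeroed, hence the
			 * stale-data caveat above. */
			bio->bi_status = BLK_STS_OK;
			bio_endio(bio);
		} else {
			submit_bio(bio);
		}
	}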
2023-06-21  vfs: inode cache conversion to hash-bl  (Dave Chinner)
Because scalability of the global inode_hash_lock really, really sucks.

32-way concurrent create on a couple of different filesystems before:

-   52.13%  0.04%  [kernel]  [k] ext4_create
   - 52.09% ext4_create
      - 41.03% __ext4_new_inode
         - 29.92% insert_inode_locked
            - 25.35% _raw_spin_lock
               - do_raw_spin_lock
                  - 24.97% __pv_queued_spin_lock_slowpath

-   72.33%  0.02%  [kernel]  [k] do_filp_open
   - 72.31% do_filp_open
      - 72.28% path_openat
         - 57.03% bch2_create
            - 56.46% __bch2_create
               - 40.43% inode_insert5
                  - 36.07% _raw_spin_lock
                     - do_raw_spin_lock
                          35.86% __pv_queued_spin_lock_slowpath
                    4.02% find_inode

Convert the inode hash table to an RCU-aware hash-bl table just like the dentry cache. Note that we need to store a pointer to the hlist_bl_head the inode has been added to in the inode, so that when it comes to unhash the inode we know what list to lock. We need to do this because the hash value that is used to hash the inode is generated from the inode itself - filesystems can provide this themselves, so we have to store either the hash or the head pointer in the inode to be able to find the right list head for removal...

Same workload after:

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
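The unhash side shows why the head pointer has to be stored; a sketch, with the field names assumed:

	static void __remove_inode_hash(struct inode *inode)
	{
		/* the hash value came from the filesystem, so we can't
		 * recompute it here - use the head recorded at insert time */
		struct hlist_bl_head *b = inode->i_hash_head;

		hlist_bl_lock(b);
		hlist_bl_del_init(&inode->i_hash);
		inode->i_hash_head = NULL;
		hlist_bl_unlock(b);
	}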
2023-06-21  hlist-bl: add hlist_bl_fake()  (Dave Chinner)
In preparation for switching the VFS inode cache over to hlist_bl lists, we need to be able to fake a list node that looks like it is hashed, for correct operation of filesystems that don't directly use the VFS inode cache. Signed-off-by: Dave Chinner <dchinner@redhat.com>
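By analogy with the existing hlist_fake()/hlist_unhashed() helpers, a sketch (the predicate name is an assumption):

	/* make a node look hashed by pointing it at itself */
	static inline void hlist_bl_fake(struct hlist_bl_node *n)
	{
		n->pprev = &n->next;
	}

	static inline bool hlist_bl_fake_hashed(const struct hlist_bl_node *n)
	{
		return n->pprev == &n->next;
	}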
2023-06-21  vfs: factor out inode hash head calculation  (Dave Chinner)
In preparation for changing the inode hash table implementation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org
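Presumably the factored-out helper is just this (name assumed; hash() being the existing static helper in fs/inode.c):

	static struct hlist_head *i_hash_head(struct super_block *sb,
					      unsigned long hashval)
	{
		return inode_hashtable + hash(sb, hashval);
	}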
2023-06-21  Increase MAX_LOCK_DEPTH, bcachefs BTREE_ITER_MAX (do not upstream)  (Kent Overstreet)
2023-06-21  bcachefs: Check for ERR_PTR() from filemap_lock_folio()  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: New error message helpers  (Kent Overstreet)
Add two new helpers for printing error messages with __func__ and bch2_err_str():
- bch_err_fn
- bch_err_msg

Also kill the old error strings in the recovery path, which were causing us to incorrectly report memory allocation failures - they're not needed anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
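Sketched as macros over the existing bch_err() (the exact format strings may differ):

	#define bch_err_fn(_c, _ret)						\
		bch_err(_c, "%s(): error %s", __func__, bch2_err_str(_ret))

	#define bch_err_msg(_c, _ret, _msg)					\
		bch_err(_c, "%s(): %s: error %s", __func__, _msg,	\
			bch2_err_str(_ret))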
2023-06-20  bcachefs: fiemap: Fix a lockdep splat  (Kent Overstreet)
As with the previous patch, we generally can't hold btree locks while copying to userspace, as that may incur a page fault and require mmap_lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: seqmutex; fix a lockdep splat  (Kent Overstreet)
We can't be holding btree_trans_lock while copying to user space, which might incur a page fault. To fix this, convert it to a seqmutex so we can unlock/relock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
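The core idea, as a sketch (the real header may differ in detail): pair the mutex with a sequence number so a holder can drop it across a faultable copy and detect whether it must restart its iteration:

	#include <linux/mutex.h>

	struct seqmutex {
		struct mutex	lock;
		u32		seq;
	};

	static inline u32 seqmutex_unlock(struct seqmutex *s)
	{
		u32 seq = ++s->seq;	/* bump before dropping the lock */

		mutex_unlock(&s->lock);
		return seq;
	}

	/* returns false if the protected list may have changed meanwhile */
	static inline bool seqmutex_relock(struct seqmutex *s, u32 seq)
	{
		if (s->seq != seq || !mutex_trylock(&s->lock))
			return false;
		if (s->seq != seq) {		/* re-check under the lock */
			mutex_unlock(&s->lock);
			return false;
		}
		return true;
	}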
2023-06-20  bcachefs: Don't call lock_graph_descend() with wait lock held  (Kent Overstreet)
This fixes a deadlock:

01305 WARNING: possible circular locking dependency detected
01305 6.3.0-ktest-gf4de9bee61af #5305 Tainted: G W
01305 ------------------------------------------------------
01305 cat/14658 is trying to acquire lock:
01305 ffffffc00982f460 (fs_reclaim){+.+.}-{0:0}, at: __kmem_cache_alloc_node+0x48/0x278
01305
01305 but task is already holding lock:
01305 ffffff8011aaf040 (&lock->wait_lock){+.+.}-{2:2}, at: bch2_check_for_deadlock+0x4b8/0xa58
01305
01305 which lock already depends on the new lock.
01305
01305
01305 the existing dependency chain (in reverse order) is:
01305
01305 -> #2 (&lock->wait_lock){+.+.}-{2:2}:
01305        _raw_spin_lock+0x54/0x70
01305        __six_lock_wakeup+0x40/0x1b0
01305        six_unlock_ip+0xe8/0x248
01305        bch2_btree_key_cache_scan+0x720/0x940
01305        shrink_slab.constprop.0+0x284/0x770
01305        shrink_node+0x390/0x828
01305        balance_pgdat+0x390/0x6d0
01305        kswapd+0x2e4/0x718
01305        kthread+0x184/0x1a8
01305        ret_from_fork+0x10/0x20
01305
01305 -> #1 (&c->lock#2){+.+.}-{3:3}:
01305        __mutex_lock+0x104/0x14a0
01305        mutex_lock_nested+0x30/0x40
01305        bch2_btree_key_cache_scan+0x5c/0x940
01305        shrink_slab.constprop.0+0x284/0x770
01305        shrink_node+0x390/0x828
01305        balance_pgdat+0x390/0x6d0
01305        kswapd+0x2e4/0x718
01305        kthread+0x184/0x1a8
01305        ret_from_fork+0x10/0x20
01305
01305 -> #0 (fs_reclaim){+.+.}-{0:0}:
01305        __lock_acquire+0x19d0/0x2930
01305        lock_acquire+0x1dc/0x458
01305        fs_reclaim_acquire+0x9c/0xe0
01305        __kmem_cache_alloc_node+0x48/0x278
01305        __kmalloc_node_track_caller+0x5c/0x278
01305        krealloc+0x94/0x180
01305        bch2_printbuf_make_room.part.0+0xac/0x118
01305        bch2_prt_printf+0x150/0x1e8
01305        bch2_btree_bkey_cached_common_to_text+0x170/0x298
01305        bch2_btree_trans_to_text+0x244/0x348
01305        print_cycle+0x7c/0xb0
01305        break_cycle+0x254/0x528
01305        bch2_check_for_deadlock+0x59c/0xa58
01305        bch2_btree_deadlock_read+0x174/0x200
01305        full_proxy_read+0x94/0xf0
01305        vfs_read+0x15c/0x3a8
01305        ksys_read+0xb8/0x148
01305        __arm64_sys_read+0x48/0x60
01305        invoke_syscall.constprop.0+0x64/0x138
01305        do_el0_svc+0x84/0x138
01305        el0_svc+0x34/0x80
01305        el0t_64_sync_handler+0xb0/0xb8
01305        el0t_64_sync+0x14c/0x150
01305
01305 other info that might help us debug this:
01305
01305 Chain exists of:
01305   fs_reclaim --> &c->lock#2 --> &lock->wait_lock
01305
01305  Possible unsafe locking scenario:
01305
01305        CPU0                    CPU1
01305        ----                    ----
01305   lock(&lock->wait_lock);
01305                                lock(&c->lock#2);
01305                                lock(&lock->wait_lock);
01305   lock(fs_reclaim);
01305
01305  *** DEADLOCK ***

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: Fix bch2_check_discard_freespace_key()  (Kent Overstreet)
We weren't correctly checking the freespace btree - it's an extents btree, which means we need to iterate over each bucket in a freespace extent. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: bch2_trans_unlock_noassert()  (Kent Overstreet)
This fixes a spurious assert in the btree node read path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: Fix bch2_btree_update_start()  (Kent Overstreet)
The calculation for number of nodes to allocate in bch2_btree_update_start() was incorrect - this fixes a BUG_ON() on the small nodes test. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: bch2_extent_ptr_desired_durability()  (Kent Overstreet)
This adds a new helper for getting a pointer's durability irrespective of the device state, and uses it in the data update path. This fixes a bug where we do a data update but request 0 replicas to be allocated, because the replica being rewritten is on a device marked as failed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: snapshot_to_text() includes snapshot tree  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: Fix try_decrease_writepoints()  (Kent Overstreet)
- We may need to drop btree locks before taking the writepoint_lock, as is done in other places.
- We should be using open_bucket_free_unused(), so that we don't waste space.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: Delete weird hacky transaction restart injection  (Kent Overstreet)
Since we currently don't have a good fault injection library, bch2_btree_insert_node() was randomly injecting faults based on local_clock(). At the very least this should have been a debug-mode-only thing, but this is a brittle method, so let's just delete it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: Write buffer flush needs BTREE_INSERT_NOCHECK_RW  (Kent Overstreet)
btree write buffer flush is only invoked from contexts that already hold a write ref, and checking if we're still RW could cause us to fail to completely flush the write buffer when shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: New assertions when marking filesystem clean  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: ec: Fix a lost wakeup  (Kent Overstreet)
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20  bcachefs: fix NULL pointer dereference in try_alloc_bucket  (Mikulas Patocka)
On Mon, 29 May 2023, Mikulas Patocka wrote:
> The oops happens in set_btree_iter_dontneed and it is caused by the fact
> that iter->path is NULL. The code in try_alloc_bucket is buggy because it
> sets "struct btree_iter iter = { NULL };" and then jumps to the "err"
> label that tries to dereference values in "iter".

Here I'm sending a patch for it.

From: Mikulas Patocka <mpatocka@redhat.com>

The function try_alloc_bucket sets the variable "iter" to NULL and then (on various error conditions) jumps to the label "err". At the "err" label, it calls "set_btree_iter_dontneed", which tries to dereference "iter->trans" and "iter->path" - so we get an oops on those error conditions. This patch fixes the crash by testing that iter.trans and iter.path are non-NULL before calling set_btree_iter_dontneed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
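Per the description, the fix amounts to guarding the error path of try_alloc_bucket() (sketch):

	err:
		/* iter may never have been initialized on some error paths */
		if (iter.trans && iter.path)
			set_btree_iter_dontneed(&iter);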