2024-02-15  xfs: add nodataio mount option to skip all data I/O  [xfs_no_data_io]  (Brian Foster)

When mounted with nodataio, add the NOSUBMIT iomap flag to all data
mappings passed into the iomap layer. This causes iomap to skip all data
I/O submission and thus facilitates metadata-only performance testing.

For experimental use only. Only tested insofar as fsstress runs for a
few minutes without blowing up.

Signed-off-by: Brian Foster <bfoster@redhat.com>

2024-02-15  iomap: add nosubmit flag to skip data I/O on iomap mapping  (Brian Foster)

Implement a quick and dirty hack to skip data I/O submission on a
specified mapping. The iomap layer will still perform every step up
through constructing the bio as if it will be submitted, but instead
invokes completion on the bio directly from submit context.

The purpose of this is to facilitate filesystem metadata performance
testing without the overhead of actual data I/O.

Note that this may be dangerous in its current form in that folios are
not explicitly zeroed where they otherwise would be, so whatever
previous data exists in a folio prior to being added to a read bio is
mapped into pagecache for the file.

Signed-off-by: Brian Foster <bfoster@redhat.com>

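A minimal sketch of the submit-side short-circuit this describes. The
IOMAP_F_NOSUBMIT flag name comes from the patch subject; the helper name
and its placement in the submission path are assumptions:

    #include <linux/bio.h>
    #include <linux/iomap.h>

    static void iomap_submit_bio(const struct iomap *iomap, struct bio *bio)
    {
        if (iomap->flags & IOMAP_F_NOSUBMIT) {
            /*
             * Skip real I/O: complete the bio directly from submit
             * context, as if it had already succeeded.
             */
            bio->bi_status = BLK_STS_OK;
            bio_endio(bio);
            return;
        }
        submit_bio(bio);
    }
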
2024-02-15  Increase MAX_LOCK_DEPTH, bcachefs BTREE_ITER_MAX (do not upstream)  (Kent Overstreet)

2024-02-15  bcachefs: Check for btree locks held on transaction init  (Kent Overstreet)

Ideally we would disallow multiple btree_trans being initialized within
the same process - and hopefully we will at some point, the stack usage
is excessive - but for now there are a couple of places where we do this:

 - transaction commit error path -> journal reclaim
 - btree key cache flush
 - move data path -> do_pending_writes -> write path -> bucket
   allocation (in the near future, when bucket allocation switches to
   using a freespace btree)

In order to avoid deadlocking, the first btree_trans must have been
unlocked with bch2_trans_unlock() before using the second btree_trans -
this patch adds an assertion to bch2_trans_init() that verifies that
this has been done when lockdep is enabled.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

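A sketch of what the assertion might look like, assuming the
lock_class_is_held() helper added later in this series; the lock class
key name is assumed for illustration:

    #include <linux/lockdep.h>

    /* key name assumed; lock_class_is_held() is from the lockdep patch
     * below */
    extern struct lock_class_key bch2_btree_node_lock_key;

    static inline void bch2_trans_assert_no_btree_locks_held(void)
    {
    #ifdef CONFIG_LOCKDEP
        WARN_ON_ONCE(lock_class_is_held(&bch2_btree_node_lock_key));
    #endif
    }
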
2024-02-15  bcachefs: Use lock_class_is_held() for btree locking asserts  (Kent Overstreet)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Assert that btree node locks aren't being leaked  (Kent Overstreet)

This asserts (when lockdep is enabled) that btree locks aren't held
when exiting a btree_trans.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

2024-02-15  locking/lockdep: lockdep_set_no_check_recursion()  (Kent Overstreet)

This adds a method to tell lockdep not to check lock ordering within a
lock class - but to still check lock ordering w.r.t. other lock types.

This is for bcachefs, where for btree node locks we have our own
deadlock avoidance strategy w.r.t. other btree node locks (cycle
detection), but we still want lockdep to check lock ordering w.r.t.
other lock types.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>

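Sketch of intended usage on a btree node lock, assuming the helper takes
a lockdep_map like the other lockdep_set_*() helpers, and assuming the
six-lock dep_map field access:

    /* assumed signature:
     * void lockdep_set_no_check_recursion(struct lockdep_map *map); */
    static void btree_lock_init(struct btree *b)
    {
        six_lock_init(&b->c.lock);
        /* trust bcachefs's own cycle detection within this class... */
        lockdep_set_no_check_recursion(&b->c.lock.dep_map);
        /* ...but keep ordering checks against every other lock class */
    }
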
2024-02-15  locking/lockdep: lock_class_is_held()  (Kent Overstreet)

This patch adds lock_class_is_held(), which can be used to assert that
a particular type of lock is not held.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>

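One plausible shape for the helper (a sketch, not the actual patch):
walk the current task's held-lock stack and compare class keys. The
real implementation would also need to guard against concurrent
updates (e.g. by disabling IRQs around the walk):

    #ifdef CONFIG_LOCKDEP
    /* returns true if the current task holds any lock from @key's class */
    int lock_class_is_held(struct lock_class_key *key)
    {
        int i;

        for (i = 0; i < current->lockdep_depth; i++) {
            struct held_lock *hlock = current->held_locks + i;

            if (hlock->instance->key == key)
                return 1;
        }
        return 0;
    }
    #endif
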
2024-02-15  hlist-bl: introduce nested locking for dm-snap  (Dave Chinner)

Testing with lockdep enabled threw this warning from generic/081 in
fstests:

[ 2369.724151] ============================================
[ 2369.725805] WARNING: possible recursive locking detected
[ 2369.727125] 6.7.0-rc2-dgc+ #1952 Not tainted
[ 2369.728647] --------------------------------------------
[ 2369.730197] systemd-udevd/389493 is trying to acquire lock:
[ 2369.732378] ffff888116a1a320 (&(et->table + i)->lock){+.+.}-{2:2}, at: snapshot_map+0x13e/0x7f0
[ 2369.736197] but task is already holding lock:
[ 2369.738657] ffff8881098a4fd0 (&(et->table + i)->lock){+.+.}-{2:2}, at: snapshot_map+0x136/0x7f0
[ 2369.742118] other info that might help us debug this:
[ 2369.744403]  Possible unsafe locking scenario:
[ 2369.746814]        CPU0
[ 2369.747675]        ----
[ 2369.748496]   lock(&(et->table + i)->lock);
[ 2369.749877]   lock(&(et->table + i)->lock);
[ 2369.751241]  *** DEADLOCK ***
[ 2369.753173]  May be due to missing lock nesting notation
[ 2369.754963] 4 locks held by systemd-udevd/389493:
[ 2369.756124]  #0: ffff88811b3a1f48 (mapping.invalidate_lock#2){++++}-{3:3}, at: page_cache_ra_unbounded+0x69/0x190
[ 2369.758516]  #1: ffff888121ceff10 (&md->io_barrier){.+.+}-{0:0}, at: dm_get_live_table+0x52/0xd0
[ 2369.760888]  #2: ffff888110240078 (&s->lock#2){++++}-{3:3}, at: snapshot_map+0x12e/0x7f0
[ 2369.763254]  #3: ffff8881098a4fd0 (&(et->table + i)->lock){+.+.}-{2:2}, at: snapshot_map+0x136/0x7f0
[ 2369.765896] stack backtrace:
[ 2369.767429] CPU: 3 PID: 389493 Comm: systemd-udevd Not tainted 6.7.0-rc2-dgc+ #1952
[ 2369.770203] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 2369.773771] Call Trace:
[ 2369.774657]  <TASK>
[ 2369.775494]  dump_stack_lvl+0x5c/0xc0
[ 2369.776765]  dump_stack+0x10/0x20
[ 2369.778031]  print_deadlock_bug+0x220/0x2f0
[ 2369.779568]  __lock_acquire+0x1255/0x2180
[ 2369.781013]  lock_acquire+0xb9/0x2c0
[ 2369.782456]  ? snapshot_map+0x13e/0x7f0
[ 2369.783927]  ? snapshot_map+0x136/0x7f0
[ 2369.785240]  _raw_spin_lock+0x34/0x70
[ 2369.786413]  ? snapshot_map+0x13e/0x7f0
[ 2369.787482]  snapshot_map+0x13e/0x7f0
[ 2369.788462]  ? lockdep_init_map_type+0x75/0x250
[ 2369.789650]  __map_bio+0x1d7/0x200
[ 2369.790364]  dm_submit_bio+0x17d/0x570
[ 2369.791387]  __submit_bio+0x4a/0x80
[ 2369.792215]  submit_bio_noacct_nocheck+0x108/0x350
[ 2369.793357]  submit_bio_noacct+0x115/0x450
[ 2369.794334]  submit_bio+0x43/0x60
[ 2369.795112]  mpage_readahead+0xf1/0x130
[ 2369.796037]  ? blkdev_write_begin+0x30/0x30
[ 2369.797007]  blkdev_readahead+0x15/0x20
[ 2369.797893]  read_pages+0x5c/0x230
[ 2369.798703]  page_cache_ra_unbounded+0x143/0x190
[ 2369.799810]  force_page_cache_ra+0x9a/0xc0
[ 2369.800754]  page_cache_sync_ra+0x2e/0x50
[ 2369.801704]  filemap_get_pages+0x112/0x630
[ 2369.802696]  ? __lock_acquire+0x413/0x2180
[ 2369.803663]  filemap_read+0xfc/0x3a0
[ 2369.804527]  ? __might_sleep+0x42/0x70
[ 2369.805443]  blkdev_read_iter+0x6d/0x150
[ 2369.806370]  vfs_read+0x1a6/0x2d0
[ 2369.807148]  ksys_read+0x71/0xf0
[ 2369.807936]  __x64_sys_read+0x19/0x20
[ 2369.808810]  do_syscall_64+0x3c/0xe0
[ 2369.809746]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
[ 2369.810914] RIP: 0033:0x7f9f14dbb03d

Turns out that dm-snap holds two hash-bl locks at the same time, so we
need nesting semantics to ensure lockdep understands what is going on.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

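What the fix likely amounts to, as a sketch against dm-snap's existing
exception-table lock pair; hlist_bl_lock_nested() is the assumed new
helper, mirroring spin_lock_nested():

    /* dm_exception_table_lock holds the two chain heads dm-snap locks
     * together; annotate the second acquisition as a nested level */
    static void dm_exception_table_lock_both(struct dm_exception_table_lock *lock)
    {
        hlist_bl_lock(lock->complete_slot);
        hlist_bl_lock_nested(lock->pending_slot, SINGLE_DEPTH_NESTING);
    }
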
2024-02-15  list_bl: don't use bit locks for PREEMPT_RT or lockdep  (Dave Chinner)

hash-bl nests spinlocks inside the bit locks. This causes problems for
CONFIG_PREEMPT_RT, which converts spin locks to sleeping locks, and
we're not allowed to sleep while holding a spinning lock. Further,
lockdep does not support bit locks, so we lose lockdep coverage of the
inode hash table with the hash-bl conversion.

To enable these configs to work, add an external per-chain spinlock to
hlist_bl_head and add helpers to use this instead of the bit spinlock
when PREEMPT_RT or lockdep are enabled. This converts all users of
hlist-bl to use the external spinlock in these situations, so we also
gain lockdep coverage of things like the dentry cache hash table with
this change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

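A sketch of the dual-mode head this describes; the config gates follow
the commit message, the exact field layout and gate spelling are
assumptions:

    #include <linux/spinlock.h>
    #include <linux/bit_spinlock.h>

    struct hlist_bl_head {
        struct hlist_bl_node *first;
    #if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_LOCKDEP)
        spinlock_t lock;        /* external per-chain lock */
    #endif
    };

    static inline void hlist_bl_lock(struct hlist_bl_head *b)
    {
    #if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_LOCKDEP)
        spin_lock(&b->lock);    /* sleepable on RT, visible to lockdep */
    #else
        bit_spin_lock(0, (unsigned long *)b);   /* original bit lock */
    #endif
    }
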
2024-02-15  vfs: inode cache conversion to hash-bl  (Dave Chinner)

Scalability of the global inode_hash_lock really sucks for filesystems
that use the vfs inode cache (i.e. everything but XFS).

Profiles of a 32-way concurrent sharded directory walk (no contended
directories) on a couple of different filesystems. All numbers from a
6.7-rc4 kernel.

Bcachefs:

  - 98.78% vfs_statx
     - 97.74% filename_lookup
        - 97.70% path_lookupat
           - 97.54% walk_component
              - 97.06% lookup_slow
                 - 97.03% __lookup_slow
                    - 96.21% bch2_lookup
                       - 91.87% bch2_vfs_inode_get
                          - 84.10% iget5_locked
                             - 44.09% ilookup5
                                - 43.50% _raw_spin_lock
                                   - 43.49% do_raw_spin_lock
                                        42.75% __pv_queued_spin_lock_slowpath
                             - 39.06% inode_insert5
                                - 38.46% _raw_spin_lock
                                   - 38.46% do_raw_spin_lock
                                        37.51% __pv_queued_spin_lock_slowpath

ext4:

  - 93.75% vfs_statx
     - 92.39% filename_lookup
        - 92.34% path_lookupat
           - 92.09% walk_component
              - 91.48% lookup_slow
                 - 91.43% __lookup_slow
                    - 90.18% ext4_lookup
                       - 84.84% __ext4_iget
                          - 83.67% iget_locked
                             - 81.24% _raw_spin_lock
                                - 81.23% do_raw_spin_lock
                                   - 78.90% __pv_queued_spin_lock_slowpath

Both bcachefs and ext4 demonstrate poor scaling at >=8 threads on
concurrent lookup or create workloads.

Hence convert the inode hash table to a RCU-aware hash-bl table just
like the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when it
comes to unhash the inode we know what list to lock. We need to do this
because, unlike the dentry cache, the hash value that is used to hash
the inode is not generated from the inode itself. i.e. filesystems can
provide this themselves so we have to either store the hashval or the
hlist head pointer in the inode to be able to find the right list head
for removal...

Concurrent walk of 400k files per thread with varying thread count in
seconds is as follows. Perfect scaling is an unchanged walk time as
thread count increases.

                    ext4                  bcachefs
    threads    vanilla  patched      vanilla  patched
       2         7.923    7.358        8.003    7.276
       4         8.152    7.530        9.097    8.506
       8        13.090    7.871       11.752   10.015
      16        24.602    9.540       24.614   13.989
      32        49.536   19.314       49.179   25.982

The big wins here are at >= 8 threads, with both filesystems now being
limited by internal filesystem algorithms, not the VFS inode cache
scalability.

Ext4 contention moves to the buffer cache on directory block lookups:

  - 66.45% 0.44% [kernel] [k] __ext4_read_dirblock
     - 66.01% __ext4_read_dirblock
        - 66.01% ext4_bread
           - ext4_getblk
              - 64.77% bdev_getblk
                 - 64.69% __find_get_block
                    - 63.01% _raw_spin_lock
                       - 62.96% do_raw_spin_lock
                            59.21% __pv_queued_spin_lock_slowpath

bcachefs contention moves to internal btree traversal locks:

  - 95.37% __lookup_slow
     - 93.95% bch2_lookup
        - 82.57% bch2_vfs_inode_get
           - 65.44% bch2_inode_find_by_inum_trans
              - 65.41% bch2_inode_peek_nowarn
                 - 64.60% bch2_btree_iter_peek_slot
                    - 64.55% bch2_btree_path_traverse_one
                       - bch2_btree_path_traverse_cached
                          - 63.02% bch2_btree_path_traverse_cached_slowpath
                             - 56.60% mutex_lock
                                - 55.29% __mutex_lock_slowpath
                                   - 55.25% __mutex_lock
                                        50.29% osq_lock
                                        1.84% __raw_callee_save___kvm_vcpu_is_preempted
                                        0.54% mutex_spin_on_owner

Signed-off-by: Dave Chinner <dchinner@redhat.com>

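The removal problem described above, as a sketch: since the hash value
is filesystem-provided, the inode caches a pointer to its chain head at
insert time. Field and helper names here are assumed for illustration:

    /* new fields on struct inode (names assumed) */
    struct inode {
        /* ... existing fields ... */
        struct hlist_bl_node  i_hash;
        struct hlist_bl_head *i_hash_head;  /* set when hashed */
    };

    static void __remove_inode_hash(struct inode *inode)
    {
        struct hlist_bl_head *b = inode->i_hash_head;

        hlist_bl_lock(b);       /* the right chain, without rehashing */
        hlist_bl_del_init(&inode->i_hash);
        inode->i_hash_head = NULL;
        hlist_bl_unlock(b);
    }
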
2024-02-15  hlist-bl: add hlist_bl_fake()  (Dave Chinner)

In preparation for switching the VFS inode cache over to hlist_bl
lists, we need to be able to fake a list node that looks like it is
hashed, for correct operation of filesystems that don't directly use
the VFS inode cache.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

2024-02-15  vfs: factor out inode hash head calculation  (Dave Chinner)

In preparation for changing the inode hash table implementation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

2024-02-15  lib/dlock-list: Make sibling CPUs share the same linked list  (Waiman Long)

The dlock list needs one list for each of the CPUs available. However,
sibling CPUs share the L2 and probably L1 caches too. As a result,
there is not much to gain in terms of avoiding cacheline contention,
while separate per-sibling lists increase the cacheline footprint of
the L1/L2 caches.

This patch makes all the sibling CPUs share the same list, thus
reducing the number of lists that need to be maintained in each dlock
list without having any noticeable impact on performance. It also
improves dlock list iteration performance as fewer lists need to be
iterated.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>

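A sketch of the sharing rule (the real patch would precompute a compact
CPU-to-index mapping at allocation time; this shows only the idea):

    #include <linux/cpumask.h>
    #include <linux/topology.h>

    /* all SMT siblings of a core map to the core's first CPU, so they
     * share one list and one spinlock */
    static int dlock_list_cpu(int cpu)
    {
        return cpumask_first(topology_sibling_cpumask(cpu));
    }
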
2024-02-15  vfs: Remove unnecessary list_for_each_entry_safe() variants  (Jan Kara)

evict_inodes() and invalidate_inodes() use list_for_each_entry_safe()
to iterate the sb->s_inodes list. However, since we use the i_lru list
entry for our local temporary list of inodes to destroy, the inode is
guaranteed to stay in the sb->s_inodes list while we hold
sb->s_inode_list_lock. So there is no real need for the safe iteration
variant and we can use list_for_each_entry() just fine.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Waiman Long <longman@redhat.com>

2024-02-15  lib/dlock-list: Distributed and lock-protected lists  (Waiman Long)

Linked lists are used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries in the same linked list, it
can create a performance bottleneck.

This patch introduces a new list API that provides a set of distributed
lists (one per CPU), each of which is protected by its own spinlock. To
the callers, however, the set of lists acts like a single consolidated
list. This allows list entry insertion and deletion operations to
happen in parallel instead of being serialized with a global list and
lock.

List entry insertion is strictly per-cpu. List deletion, however, can
happen on a cpu other than the one that did the insertion, so we still
need a lock to protect the list. Because of that, there may still be a
small amount of contention when deletion is being done.

A new header file include/linux/dlock-list.h will be added with the
associated dlock_list_head and dlock_list_node structures. The
following functions are provided to manage the per-cpu list:

 1. int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
 2. void free_dlock_list_heads(struct dlock_list_heads *dlist)
 3. void dlock_list_add(struct dlock_list_node *node,
                        struct dlock_list_heads *dlist)
 4. void dlock_list_del(struct dlock_list_node *node)

Iteration of all the list entries within a dlock list array is done by
calling either the dlist_for_each_entry() or dlist_for_each_entry_safe()
macros. They correspond to the list_for_each_entry() and
list_for_each_entry_safe() macros respectively. The iteration states
are kept in a dlock_list_iter structure that is passed to the iteration
macros.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>

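A usage sketch based on the API listed above; the iterator-init helper
name and the exact macro argument order are assumptions:

    #include <linux/dlock-list.h>
    #include <linux/printk.h>

    struct item {
        int val;
        struct dlock_list_node node;
    };

    static int dlock_list_demo(void)
    {
        struct dlock_list_heads heads;
        struct dlock_list_iter iter;
        struct item a = { .val = 42 }, *pos;
        int ret = alloc_dlock_list_heads(&heads);

        if (ret)
            return ret;

        dlock_list_add(&a.node, &heads);    /* lands on this CPU's sublist */

        init_dlock_list_iter(&iter, &heads);    /* name assumed */
        dlist_for_each_entry(pos, &iter, node)
            pr_info("val=%d\n", pos->val);  /* walks every per-CPU sublist */

        dlock_list_del(&a.node);
        free_dlock_list_heads(&heads);
        return 0;
    }
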
2024-02-15  bcachefs: kthread_should_stop() now checks if we're a kthread  (Kent Overstreet)

There's no need for us to check current->flags & PF_KTHREAD anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  kthread: kthread_should_stop() checks if we're a kthread  (Kent Overstreet)

bcachefs has a fair amount of code that may or may not be running from
a kthread (it may instead be called by a userspace ioctl); having
kthread_should_stop() check if we're a kthread enables a fair bit of
cleanup and makes it safer to use.

Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Mike Christie <michael.christie@oracle.com>
Cc: Zqiang <qiang1.zhang@intel.com>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

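A schematic sketch of the described change (to_kthread() and the
KTHREAD_SHOULD_STOP bit are private to kernel/kthread.c, so this is
only illustrative):

    bool kthread_should_stop(void)
    {
        if (!(current->flags & PF_KTHREAD))
            return false;   /* userspace caller: never asked to stop */
        return test_bit(KTHREAD_SHOULD_STOP, &to_kthread(current)->flags);
    }
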
2024-02-15  bcachefs: bch_degraded_opts  (Kent Overstreet)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: BCH_DATA_unstriped  (Kent Overstreet)

Add a new pseudo data type to track buckets that are members of a
stripe, but have unstriped data in them.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: bch_alloc->stripe_sectors  (Kent Overstreet)

Add a separate counter to bch_alloc_v4 for the amount of striped data;
this lets us separately track striped and unstriped data in a bucket,
which lets us see when erasure coding has failed to update extents with
stripe pointers, and also find buckets to continue updating if we crash
midway through creating a new stripe.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Kill BTREE_ITER_WITH_UPDATES XXX  (Kent Overstreet)

occasional stale dirty pointers?

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Enable key merging for indirect extents  (Kent Overstreet)

The .key_merge method wasn't hooked up - noticed by sparse.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  lockdep: lockdep_set_notrack_class()  (Kent Overstreet)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  lockdep debug  (Kent Overstreet)

2024-02-15  bcachefs: Add lockdep support for btree node locks  (Kent Overstreet)

This adds lockdep tracking for held btree locks with a single dep_map
in btree_trans, i.e. tracking all held btree locks as one object.

This is more practical and more useful than having lockdep track held
btree locks individually, because:

 - we can take more locks than lockdep can track (unbounded, now that
   we have dynamically resizable btree paths)
 - there's no lock ordering between btree locks for lockdep to track
   (we do cycle detection)
 - and this makes it easy to teach lockdep that btree locks are not
   safe to hold while invoking memory reclaim.

The last rule is one that lockdep would never learn, because we only do
trylock() from within shrinkers - but we very much do not want to be
invoking memory reclaim while holding btree node locks.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: repair bad data after successful retry WIP  (Kent Overstreet)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Fix btree node merging on write buffer btrees  (Kent Overstreet)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  rhashtable: Better error message on allocation failure  (Kent Overstreet)

Memory allocation failures print backtraces by default, but when the
failure happens out of a rhashtable worker the backtrace is useless -
it doesn't tell us which hashtable the allocation failure was for.

This adds a dedicated warning that prints out functions from the
rhashtable params, which will be a bit more useful.

Cc: Thomas Graf <tgraf@suug.ch>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: better srcu warning  (Kent Overstreet)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: omit alignment attribute on big endian struct bkey  (Thomas Bertschinger)

This is needed for building Rust bindings on big endian architectures
like s390x. Currently this is only done in userspace, but it might
happen in-kernel in the future.

When creating a Rust binding for struct bkey, the "packed" attribute is
needed to get a type with the correct member offsets in the big endian
case. However, rustc does not allow types to have both a "packed" and
"align" attribute. Thus, in order to get a Rust type compatible with
the C type, we must omit the "aligned" attribute in C.

This does not affect the struct's size or member offsets, only its
toplevel alignment, which should be an acceptable impact.

The little endian version can have the "align" attribute because the
"packed" attr is redundant, and rust-bindgen will omit the "packed"
attr when an "align" attr is present and it can do so without changing
a type's layout.

Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

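An illustration of the attribute rule (not the real bkey layout - a
stand-in struct showing the endian-dependent split):

    #include <linux/types.h>
    #include <asm/byteorder.h>

    struct demo_key {
        __u64 hi;
        __u8  flags;
        __u8  pad[7];
    }
    #ifdef __LITTLE_ENDIAN
    __attribute__((packed, aligned(8)));    /* packed is redundant here */
    #else
    __attribute__((packed));    /* rustc can't express packed + align */
    #endif
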
2024-02-15  bcachefs: bch2_trigger_alloc() handles state changes better  (Kent Overstreet)

bch2_trigger_alloc() kicks off certain tasks on bucket state changes;
e.g. triggering the bucket discard worker and the invalidate worker.

We've observed the discard worker running too often - most runs it
doesn't do any work, according to the tracepoint - so clearly, we're
kicking it off too often.

This adds an explicit statechange() macro to make these checks more
precise.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

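A sketch of the macro's likely shape, where old and new are the pre-
and post-transition alloc states (the exact expression in the patch may
differ):

    /* fire only on a false -> true edge of a bucket-state expression */
    #define statechange(expr)   (!(old->expr) && (new->expr))

    /* e.g. only kick the discard worker when a bucket becomes discardable: */
    if (statechange(data_type == BCH_DATA_need_discard))
        bch2_do_discards(c);
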
2024-02-15  bcachefs: skip invisible entries in empty subvolume checking  (Guoyu Ou)

When we are checking whether a subvolume is empty in the specified
snapshot, entries that do not belong to this subvolume should be
skipped. This fixes the following case:

    $ bcachefs subvolume create ./sub
    $ cd sub
    $ bcachefs subvolume create ./sub2
    $ bcachefs subvolume snapshot . ./snap
    $ ls -a snap
    .  ..
    $ rmdir snap
    rmdir: failed to remove 'snap': Directory not empty

As Kent suggested, we pass 0 in may_delete_deleted_inode() to ignore
subvols in the subvol we are checking, because inode.bi_subvol is only
set on subvolume roots, and we can't go through every inode in the
subvolume and change bi_subvol when taking a snapshot. It makes the
check less strict, but that's ok, the rest of fsck will still catch it.

Signed-off-by: Guoyu Ou <benogy@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: fix iov_iter count underflow on sub-block dio read  (Brian Foster)

bch2_direct_IO_read() checks the request offset and size for sector
alignment and then falls through to a couple of calculations to shrink
the size of the request based on the inode size. The problem is that
these checks round up to the fs block size, which runs the risk of
underflowing iter->count if the block size happens to be large enough.
This is triggered by fstest generic/361 with a 4k block size, which
subsequently leads to a crash.

To avoid this crash, check that the shorten length doesn't exceed the
overall length of the iter.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

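The guard being described, as a fragment with approximate variable
names from bch2_direct_IO_read() (ret being the i_size-derived read
limit):

    /* don't let the block-size round-up exceed the request itself */
    shorten = iov_iter_count(iter) - round_up(ret, block_bytes(c));
    if (shorten >= iter->count)
        shorten = 0;
    iter->count -= shorten;
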
2024-02-15  thread_with_file: Fix missing va_end()  (Kent Overstreet)

Fixes: https://lore.kernel.org/linux-bcachefs/202402131603.E953E2CF@keescook/T/#u
Reported-by: Coverity Scan
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  thread_with_file: allow ioctls against these files  (Darrick J. Wong)

Make it so that a thread_with_stdio user can handle ioctls against the
file descriptor.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  thread_with_file: create ops structure for thread_with_stdio  (Darrick J. Wong)

Create an ops structure so we can add more file-based functionality in
the next few patches.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  thread_with_file: fix various printf problems  (Darrick J. Wong)

Experimentally fix some problems with stdio_redirect_vprintf by
creating a MOO variant with which we can experiment. We can't do a
GFP_KERNEL allocation while holding the spinlock, and I don't like how
the printf function can silently truncate the output if memory
allocation fails.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

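The general pattern such a fix needs, as a fragment; the printbuf
helpers here are hypothetical, only the allocate-outside-the-lock-and-
retry shape is the point:

    again:
        spin_lock(&stdio->output_lock);
        if (!printbuf_fits(&stdio->output_buf, len)) {  /* hypothetical */
            spin_unlock(&stdio->output_lock);
            /* GFP_KERNEL may sleep - only safe outside the spinlock */
            if (printbuf_grow(&stdio->output_buf, len, GFP_KERNEL))
                return -ENOMEM; /* report, don't silently truncate */
            goto again;         /* re-check the size under the lock */
        }
        len = vscnprintf(printbuf_pos(&stdio->output_buf), len, fmt, args);
        spin_unlock(&stdio->output_lock);
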
2024-02-15  thread_with_file: allow creation of readonly files  (Darrick J. Wong)

Create a new run_thread_with_stdout function that opens a file in
O_RDONLY mode so that the kernel can write things to userspace but
userspace cannot write to the kernel. This will be used to convey xfs
health event information to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: bch2_print_opts()  (Kent Overstreet)

Make sure early error messages get redirected, for
kernel-fsck-from-userland.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Improve error messages in device remove path  (Kent Overstreet)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Use kvzalloc() when dynamically allocating btree paths  (Kent Overstreet)

This silences a mm/page_alloc.c warning about allocating more than a
page with GFP_NOFAIL - and there's no reason for this to not have a
vmalloc fallback anyway.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Track iter->ip_allocated at bch2_trans_copy_iter()  (Kent Overstreet)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Save key_cache_path in peek_slot()  (Kent Overstreet)

When bch2_btree_iter_peek_slot() clones the iterator to search for the
next key, and then discovers that the key from the cloned iterator is
the key we want to return - we also want to save the
iter->key_cache_path, for the update path.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Pin btree cache in ram for random access in fsck  (Kent Overstreet)

Various phases of fsck involve checking references from one btree to
another: this means doing a sequential scan of one btree, and then
mostly random access into the second.

This is particularly painful for checking extents <-> backpointers; we
can prefetch btree node access on the sequential scan, but not on the
random access portion, and this is particularly painful on spinning
rust, where we'd like to keep the pipeline fairly full of btree node
reads so that the elevator can reduce seeking.

This patch implements prefetching and pinning of the portion of the
btree that we'll be doing random access to. We already calculate how
much of the random access btree will fit in memory, so it's a fairly
straightforward change.

This will put more pressure on system memory usage, so we introduce a
new option, fsck_memory_usage_percent, which is the percentage of total
system ram that fsck is allowed to pin.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-15  bcachefs: Check for subvolume children when deleting subvolumes  (Kent Overstreet)

Recursively destroying subvolumes isn't allowed yet.

Fixes: https://github.com/koverstreet/bcachefs/issues/634
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-13  bcachefs: BTREE_ID_subvolume_children  (Kent Overstreet)

Add a btree to record parent -> child subvolume relationships,
according to the filesystem hierarchy.

The subvolume_children btree is a bitset btree: if a bit is set at pos
p, that means p.offset is a child of subvolume p.inode.

This will be used for efficiently listing subvolumes, as well as
recursive deletion.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

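A sketch of the encoding: parent in pos.inode, child in pos.offset, key
presence acting as the set bit. The helper comes from the
bch2_btree_bit_mod() patch below; the wrapper name is assumed:

    /* mark (or clear) child as a member of parent in the bitset btree */
    static int bch2_subvolume_child_set(struct btree_trans *trans,
                                        u32 parent, u32 child, bool set)
    {
        return bch2_btree_bit_mod(trans, BTREE_ID_subvolume_children,
                                  POS(parent, child), set);
    }
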
2024-02-13  bcachefs: bch_subvolume::fs_path_parent  (Kent Overstreet)

Record the filesystem path hierarchy for subvolumes in bch_subvolume.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

2024-02-13  bcachefs: bch2_btree_bit_mod()  (Kent Overstreet)

Provide a non-write-buffer version of bch2_btree_bit_mod_buffered(),
for the subvolume children btree.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>