summaryrefslogtreecommitdiff
path: root/fs/bcachefs/btree_cache.c
AgeCommit message (Collapse)Author
2023-06-21bcachefs: add counters for failed shrinker reclaimDaniel Hill
These counters should help us debug OOM issues. Signed-off-by: Daniel Hill <daniel@gluo.nz>
2023-06-21bcachefs: shrinker.to_text() methodsKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: bch2_trans_unlock_noassert()Kent Overstreet
This fixes a spurious assert in the btree node read path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Ensure bch2_btree_node_get() calls relock() after unlock()Kent Overstreet
Fix a bug where bch2_btree_node_get() might call bch2_trans_unlock() (in fill) without calling bch2_trans_relock(); this is a bug when it's done in the core btree code. Also, twea bch2_btree_node_mem_alloc() to drop btree locks before doing a blocking memory allocation. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20six locks: Kill six_lock_state unionKent Overstreet
As suggested by Linus, this drops the six_lock_state union in favor of raw bitmasks. On the one hand, bitfields give more type-level structure to the code. However, a significant amount of the code was working with six_lock_state as a u64/atomic64_t, and the conversions from the bitfields to the u64 were deemed a bit too out-there. More significantly, because bitfield order is poorly defined (#ifdef __LITTLE_ENDIAN_BITFIELD can be used, but is gross), incrementing the sequence number would overflow into the rest of the bitfield if the compiler didn't put the sequence number at the high end of the word. The new code is a bit saner when we're on an architecture without real atomic64_t support - all accesses to lock->state now go through atomic64_*() operations. On architectures with real atomic64_t support, we additionally use atomic bit ops for setting/clearing individual bits. Text size: 7467 bytes -> 4649 bytes - compilers still suck at bitfields. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20six locks: Kill six_lock_pcpu_(alloc|free)Kent Overstreet
six_lock_pcpu_alloc() is an unsafe interface: it's not safe to allocate or free the percpu reader count on an existing lock that's in use, the only safe time to allocate percpu readers is when the lock is first being initialized. This patch adds a flags parameter to six_lock_init(), and instead of six_lock_pcpu_free() we now expose six_lock_exit(), which does the same thing but is less likely to be misused. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Clear btree_node_just_written() when node reused or evictedKent Overstreet
This fixes the following bug: Journal reclaim attempts to flush a node, but races with the node being evicted from the btree node cache; when we lock the node, the data buffers have already been freed. We don't evict a node that's dirty, so calling btree_node_write() is fine - it's a noop - except that the btree_node_just_written bit causes bch2_btree_post_write_cleanup() to run (resorting the node), which then causes a null ptr deref. 00078 Unable to handle kernel NULL pointer dereference at virtual address 000000000000009e 00078 Mem abort info: 00078 ESR = 0x0000000096000005 00078 EC = 0x25: DABT (current EL), IL = 32 bits 00078 SET = 0, FnV = 0 00078 EA = 0, S1PTW = 0 00078 FSC = 0x05: level 1 translation fault 00078 Data abort info: 00078 ISV = 0, ISS = 0x00000005 00078 CM = 0, WnR = 0 00078 user pgtable: 4k pages, 39-bit VAs, pgdp=000000007ed64000 00078 [000000000000009e] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 00078 Internal error: Oops: 0000000096000005 [#1] SMP 00078 Modules linked in: 00078 CPU: 75 PID: 1170 Comm: stress-ng-utime Not tainted 6.3.0-ktest-g5ef5b466e77e #2078 00078 Hardware name: linux,dummy-virt (DT) 00078 pstate: 80001005 (Nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00078 pc : btree_node_sort+0xc4/0x568 00078 lr : bch2_btree_post_write_cleanup+0x6c/0x1c0 00078 sp : ffffff803e30b350 00078 x29: ffffff803e30b350 x28: 0000000000000001 x27: ffffff80076e52a8 00078 x26: 0000000000000002 x25: 0000000000000000 x24: ffffffc00912e000 00078 x23: ffffff80076e52a8 x22: 0000000000000000 x21: ffffff80076e52bc 00078 x20: ffffff80076e5200 x19: 0000000000000000 x18: 0000000000000000 00078 x17: fffffffff8000000 x16: 0000000008000000 x15: 0000000008000000 00078 x14: 0000000000000002 x13: 0000000000000000 x12: 00000000000000a0 00078 x11: ffffff803e30b400 x10: ffffff803e30b408 x9 : 0000000000000001 00078 x8 : 0000000000000000 x7 : ffffff803e480000 x6 : 00000000000000a0 00078 x5 : 0000000000000088 x4 : 0000000000000000 x3 : 0000000000000010 00078 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80076e52a8 00078 Call trace: 00078 btree_node_sort+0xc4/0x568 00078 bch2_btree_post_write_cleanup+0x6c/0x1c0 00078 bch2_btree_node_write+0x108/0x148 00078 __btree_node_flush+0x104/0x160 00078 bch2_btree_node_flush0+0x1c/0x30 00078 journal_flush_pins.constprop.0+0x184/0x2d0 00078 __bch2_journal_reclaim+0x4d4/0x508 00078 bch2_journal_reclaim+0x1c/0x30 00078 __bch2_journal_preres_get+0x244/0x268 00078 bch2_trans_journal_preres_get_cold+0xa4/0x180 00078 __bch2_trans_commit+0x61c/0x1bb0 00078 bch2_setattr_nonsize+0x254/0x318 00078 bch2_setattr+0x5c/0x78 00078 notify_change+0x2bc/0x408 00078 vfs_utimes+0x11c/0x218 00078 do_utimes+0x84/0x140 00078 __arm64_sys_utimensat+0x68/0xa8 00078 invoke_syscall.constprop.0+0x54/0xf0 00078 do_el0_svc+0x48/0xd8 00078 el0_svc+0x14/0x48 00078 el0t_64_sync_handler+0xb0/0xb8 00078 el0t_64_sync+0x14c/0x150 00078 Code: 8b050265 910020c6 8b060266 910060ac (79402cad) 00078 ---[ end trace 0000000000000000 ]--- Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Private error codes: ENOMEMKent Overstreet
This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: bch2_btree_node_to_text() const correctnessKent Overstreet
This is for the Rust interface - Rust cares more about const than C does. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Centralize btree node lock initializationKent Overstreet
This fixes some confusion in the lockdep code due to initializing btree node/key cache locks with the same lockdep key, but different names. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Plumb btree_trans through btree cache codeKent Overstreet
Soon, __bch2_btree_node_write() is going to require a btree_trans: zoned device support is going to require a new allocation for every btree node write. This is a bit of prep work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Use six_lock_ip()Kent Overstreet
This uses the new _ip() interface to six locks and hooks it up to btree_path->ip_allocated, when available. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Convert EAGAIN errors to private error codesKent Overstreet
More error code cleanup, for better error messages and debugability. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: New bpos_cmp(), bkey_cmp() replacementsKent Overstreet
This patch introduces - bpos_eq() - bpos_lt() - bpos_le() - bpos_gt() - bpos_ge() and equivalent replacements for bkey_cmp(). Looking at the generated assembly these could probably be improved further, but we already see a significant code size improvement with this patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Don't set accessed bit on btree node fillKent Overstreet
Btree nodes shouldn't have their accessed bit set when entering the btree cache by being read in from disk - this fixes linear scans thrashing the cache. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Split out __bch2_btree_node_get()Kent Overstreet
Standard splitting out of the slow path from the fast path of a function. We may follow this up in another patch with inlining the fast path into btree_iter.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Fix a race with b->write_typeKent Overstreet
b->write_type needs to be set atomically with setting the btree_node_need_write flag, so move it into b->flags. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: More style fixesKent Overstreet
Fixes for various checkpatch errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Improved btree write statisticsKent Overstreet
This replaces sysfs btree_avg_write_size with btree_write_stats, which now breaks out statistics by the source of the btree write. Btree writes that are too small are a source of inefficiency, and excessive btree resort overhead - this will let us see what's causing them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Assorted checkpatch fixesKent Overstreet
checkpatch.pl gives lots of warnings that we don't want - suggested ignore list: ASSIGN_IN_IF UNSPECIFIED_INT - bcachefs coding style prefers single token type names NEW_TYPEDEFS - typedefs are occasionally good FUNCTION_ARGUMENTS - we prefer to look at functions in .c files (hopefully with docbook documentation), not .h file prototypes MULTISTATEMENT_MACRO_USE_DO_WHILE - we have _many_ x-macros and other macros where we can't do this Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: improve behaviour of btree_cache_scan()Daniel Hill
Appending new nodes to the end of the list means we're more likely to evict old entries when btree_cache_scan() is started. Signed-off-by: Daniel Hill <daniel@gluo.nz>
2023-06-20bcachefs: bch2_btree_cache_scan() improvementKent Overstreet
We're still seeing OOM issues caused by the btree node cache shrinker not sufficiently freeing memory: thus, this patch changes the shrinker to not exit if __GFP_FS was not supplied. Instead, tweak btree node memory allocation so that we never invoke memory reclaim while holding the btree node cache lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Delete old deadlock avoidance codeKent Overstreet
This deletes our old lock ordering based deadlock avoidance code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: New locking functionsKent Overstreet
In the future, with the new deadlock cycle detector, we won't be using bare six_lock_* anymore: lock wait entries will all be embedded in btree_trans, and we will need a btree_trans context whenever locking a btree node. This patch plumbs a btree_trans to the few places that need it, and adds two new locking functions - btree_node_lock_nopath, which may fail returning a transaction restart, and - btree_node_lock_nopath_nofail, to be used in places where we know we cannot deadlock (i.e. because we're holding no other locks). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Add persistent counters for all tracepointsKent Overstreet
Also, do some reorganizing/renaming, convert atomic counters in bch_fs to persistent counters, and add a few missing counters. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Switch btree locking code to struct btree_bkey_cached_commonKent Overstreet
This is just some type safety cleanup. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: btree_locking.cKent Overstreet
Start to centralize some of the locking code in a new file; more locking code will be moving here in the future. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Tracepoint improvementsKent Overstreet
Our types are exported to the tracepoint code, so it's not necessary to break things out individually when passing them to tracepoints - we can also call other functions from TP_fast_assign(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: EINTR -> BCH_ERR_transaction_restartKent Overstreet
Now that we have error codes, with subtypes, we can switch to our own error code for transaction restarts - and even better, a distinct error code for each transaction restart reason: clearer code and better debugging. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: lock time stats prep work.Daniel Hill
We need the caller name and a place to store our results, btree_trans provides this. Signed-off-by: Daniel Hill <daniel@gluo.nz>
2023-06-20bcachefs: Printbuf reworkKent Overstreet
This converts bcachefs to the modern printbuf interface/implementation, synced with the version to be submitted upstream. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Improve btree_bad_header()Kent Overstreet
In the future printbufs will be mempool-ified, so we shouldn't be using more than one at a time if we don't have to. This also fixes an extra trailing newline. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Don't normalize to pages in btree cache shrinkerKent Overstreet
This behavior dates from the early, early days of bcache, and upon further delving appears to not make any sense. The shrinker only works in terms of 'objects' of unknown size; normalizing to pages only had the effect of changing the batch size, which we could do directly - if we wanted; we probably don't. Normalizing to pages meant our batch size was very small, which seems to have been keeping us from doing as much shrinking as we should be under heavy memory pressure; this patch appears to alleviate some OOMs we've been seeing. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Fix usage of six lock's percpu modeKent Overstreet
Six locks have a percpu mode, which we use for interior btree nodes, as well as btree key cache keys for the subvolumes btree. We've been switching locks back and forth between percpu and non percpu mode as needed, but it turns out this is racy - when we're reusing an existing node, other threads could be attempting to lock it while we're switching it between modes. This patch fixes this by never switching 'struct btree' between the two modes, and instead segragating them between two different freed lists. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Refactor bch2_btree_node_mem_alloc()Kent Overstreet
This is prep work for the next patch, which is going to fix our usage of the percpu mode of six locks by never switching struct btree between the two modes - which means we need separate freed lists. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Make bch2_btree_cache_scan() try harderKent Overstreet
Previously, when bch2_btree_cache_scan() attempted to reclaim a node but failed (because trylock failed, because it was dirty, etc.), it would count that against the number of nodes it was scanning and attempting to free. This patch changes that behaviour, so that now we only count nodes that we then don't free if they have the accessed bit (which we also clear). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Fix race leading to btree node write getting stuckKent Overstreet
Checking btree_node_may_write() isn't atomic with the other btree flags, dirty and need_write in particular. There was a rare race where we'd unblock a node from writing while __btree_node_flush() was setting need_write, and no thread would notice that the node was now both able to write and needed to be written. Fix this by adding btree node flags for will_make_reachable and write_blocked that can be checked in the cmpxchg loop in __bch2_btree_node_write. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Improve btree_node_write_if_need()Kent Overstreet
btree_node_write_if_need() kicks off a btree node write only if need_write is set; this makes the locking easier to reason about by moving the check into the cmpxchg loop in __bch2_btree_node_write(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Use x-macros for btree node flagsKent Overstreet
This is for adding an array of strings for btree node flag names. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Kill BCH_FS_HOLD_BTREE_WRITESKent Overstreet
This was just dead code. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Heap allocate printbufsKent Overstreet
This patch changes printbufs dynamically allocate and reallocate a buffer as needed. Stack usage has become a bit of a problem, and a major cause of that has been static size string buffers on the stack. The most involved part of this refactoring is that printbufs must now be exited with printbuf_exit(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Fix failure to allocate btree node in cacheKent Overstreet
The error code when we fail to allocate a node in the btree node cache doesn't make it to bch2_btree_path_traverse_all(). Instead, we need to stash a flag in btree_trans so we know we have to take the cannibalize lock. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Tracepoint improvementsKent Overstreet
This improves the transaction restart tracepoints - adding distinct tracepoints for all the locations and reasons a transaction might have been restarted, and ensures that there's a tracepoint for every transaction restart. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Switch to __func__for recording where btree_trans was initializedKent Overstreet
Symbol decoding, via %ps, isn't supported in userspace - this will also be faster when we're using trans->fn in the fast path, as with the new BCH_JSET_ENTRY_log journal messages. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Add a tracepoint for the btree cache shrinkerKent Overstreet
This is to help with diagnosing why the btree node can doesn't seem to be shrinking - we've had issues in the past with granularity/batch size, since btree nodes are so big. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Optimize memory accesses in bch2_btree_node_get()Kent Overstreet
This puts a load behind some branches before where it's used, so that it can execute in parallel with other loads. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: Fix infinite loop in bch2_btree_cache_scan()Kent Overstreet
When attempting to free btree nodes, we might not be able to free all the nodes that were requested. But the code was looping until it had freed _all_ the nodes requested, when it should have only been attempting to free nr nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Improve btree_node_mem_ptr optimizationKent Overstreet
This patch checks b->hash_val before attempting to lock the node in the btree, which makes it more equivalent to the "lookup in hash table" path - and potentially avoids an unnecessary transaction restart if btree_node_mem_ptr(k) no longer points to the node we want. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-06-20bcachefs: btree_pathKent Overstreet
This splits btree_iter into two components: btree_iter is now the externally visible componont, and it points to a btree_path which is now reference counted. This means we no longer have to clone iterators up front if they might be mutated - btree_path can be shared by multiple iterators, and cloned if an iterator would mutate a shared btree_path. This will help us use iterators more efficiently, as well as slimming down the main long lived state in btree_trans, and significantly cleans up the logic for iterator lifetimes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-06-20bcachefs: Add an assertion for removing btree nodes from cacheKent Overstreet
Chasing a bug that has something to do with the btree node cache. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>