bcachefs.git

Age	Commit message	Author
2022-10-03	bcachefs: Improve btree_deadlock debugfs output	Kent Overstreet
	This changes bch2_check_for_deadlock() to print the longest chains it finds - when we have a deadlock because the cycle detector isn't finding something, this will let us see what it's missing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2022-10-03	bcachefs: Print deadlock cycle in debugfs	Kent Overstreet
	In the event that we're not finished debugging the cycle detector, this adds a new file to debugfs that shows what the cycle detector finds, if anything. By comparing this with btree_transactions, which shows held locks for every btree_transaction, we'll be able to determine if it's the cycle detector that's buggy or something else. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2022-10-03	bcachefs: Deadlock cycle detector	Kent Overstreet
	We've outgrown our own deadlock avoidance strategy.
	The btree iterator API provides an interface where the user doesn't need to concern themselves with lock ordering - different btree iterators can be traversed in any order. Without special care, this will lead to deadlocks.
	Our previous strategy was to define a lock ordering internally, and whenever we attempt to take a lock and trylock() fails, we'd check if the current btree transaction is holding any locks that cause a lock ordering violation. If so, we'd issue a transaction restart, and then bch2_trans_begin() would re-traverse all previously used iterators, but in the correct order.
	That approach had some issues, though.
	- Sometimes we'd issue transaction restarts unnecessarily, when no deadlock would have actually occurred. Lock ordering restarts have become our primary cause of transaction restarts, on some workloads totalling 20% of actual transaction commits.
	- To avoid deadlock or livelock, we'd often have to take intent locks when we only wanted a read lock: with the lock ordering approach, it is actually illegal to hold _any_ read lock while blocking on an intent lock, and this has been causing us unnecessary lock contention.
	- It was getting fragile - the various lock ordering rules are not trivial, and we'd been seeing occasional livelock issues related to this machinery.
	So, since bcachefs is already a relational database masquerading as a filesystem, we're stealing the next traditional database technique and switching to a cycle detector for avoiding deadlocks.
	When we block taking a btree lock, after adding ourself to the waitlist but before sleeping, we do a DFS of btree transactions waiting on other btree transactions, starting with the current transaction and walking our held locks, and transactions blocking on our held locks.
	If we find a cycle, we emit a transaction restart. Occasionally (e.g. the btree split path) we cannot allow the lock() operation to fail, so if necessary we'll tell another transaction that it has to fail.
	Result: trans_restart_would_deadlock events are reduced by a factor of 10 to 100, and we'll be able to delete a whole bunch of grotty, fragile code.
	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
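	The DFS described in this message can be illustrated with a small user-space model. This is only a sketch, not the kernel implementation: the LockGraph class, its method names, and the transaction and lock labels are all hypothetical, and it models a single lock type without the read/intent distinction.

```python
from collections import defaultdict


class LockGraph:
    """Toy wait-for graph: which transaction holds or waits on which lock."""

    def __init__(self):
        self.held = defaultdict(set)     # transaction -> locks it holds
        self.waiters = defaultdict(set)  # lock -> transactions blocked on it

    def hold(self, txn, lock):
        self.held[txn].add(lock)

    def wait(self, txn, lock):
        # Called after a failed trylock, before the transaction sleeps.
        self.waiters[lock].add(txn)

    def find_cycle(self, start):
        """Return the chain of transactions leading back to start, or None."""
        path = [start]
        visited = {start}

        def dfs(txn):
            # Walk the locks txn holds, then the transactions blocked on them.
            for lock in sorted(self.held.get(txn, ())):
                for waiter in sorted(self.waiters.get(lock, ())):
                    if waiter == start:
                        return True  # walked back to ourselves: deadlock
                    if waiter in visited:
                        continue
                    visited.add(waiter)
                    path.append(waiter)
                    if dfs(waiter):
                        return True
                    path.pop()
            return False

        return path if dfs(start) else None


graph = LockGraph()
graph.hold("trans_a", "node_1")
graph.hold("trans_b", "node_2")
graph.wait("trans_b", "node_1")  # b is blocked on a lock held by a
graph.wait("trans_a", "node_2")  # a has just joined the waitlist for b's lock

cycle = graph.find_cycle("trans_a")
if cycle is not None:
    print("would deadlock, restart:", " -> ".join(cycle))
```

	Because the current transaction is already on the waitlist when the search runs, reaching it again through some other transaction's held locks is exactly the cycle condition; a restart is only needed in that case, rather than on every lock ordering violation.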
2022-10-03	bcachefs: Track maximum transaction memory	Kent Overstreet
	This patch
	- tracks maximum bch2_trans_kmalloc() memory used in btree_transaction_stats
	- makes it available in debugfs
	- switches bch2_trans_init() to using that for the amount of memory to preallocate, instead of the parameter passed in
	This drastically reduces transaction restarts, and means we no longer need to track this in the source code.
	Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
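	As a rough illustration of the idea (remember the largest allocation seen for each transaction function, then preallocate that much next time), here is a minimal sketch. The TransactionStats class and its method names are hypothetical stand-ins, not the kernel's btree_transaction_stats structure.

```python
class TransactionStats:
    """Remembers the peak memory each transaction function has needed."""

    def __init__(self):
        self.max_mem = {}  # transaction function name -> peak bytes observed

    def record(self, fn, bytes_used):
        # Keep only the high-water mark for this transaction function.
        self.max_mem[fn] = max(self.max_mem.get(fn, 0), bytes_used)

    def prealloc_size(self, fn):
        # First run of a function has no history, so preallocate nothing.
        return self.max_mem.get(fn, 0)


stats = TransactionStats()
stats.record("example_trans_fn", 512)
stats.record("example_trans_fn", 2048)
stats.record("example_trans_fn", 1024)
print(stats.prealloc_size("example_trans_fn"))
```

	Sizing the initial buffer from observed behaviour means a later allocation inside the transaction rarely has to grow the buffer, which is the situation that would otherwise force a restart.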
2022-10-03	bcachefs: Debugfs cleanup	Kent Overstreet
	This improves flush_buf() so that it always returns nonzero when we're done reading and ready to return to userspace, and so that it returns the value we want to return to userspace (number of bytes read, if there wasn't an error). In the future we'll be better abstracting this mechanism and pulling it out of bcachefs, and using it to replace seq_file. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	lib/printbuf: Tabstop improvements	Kent Overstreet
	- Add a flag, has_indent_or_tabstops, that is set if indent level or tabstops are set.
	- Tabstops can no longer be set by modifying the tabstop array directly; instead, new functions are provided:
	    printbuf_tabstop_push() - add a new tabstop, n spaces after previous tabstop
	    printbuf_tabstop_pop() - remove previous tabstop
	    printbuf_tabstops_reset() - remove all tabstops
	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
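	The push/pop/reset semantics can be modelled in a few lines. This is a Python analogue for illustration only: the TabStops class and its attribute names are hypothetical, and the real functions operate on a C struct printbuf.

```python
class TabStops:
    """Toy model of relative tabstops plus the combined indent/tabstop flag."""

    def __init__(self):
        self.indent = 0
        self.stops = []  # absolute column of each tabstop

    def push(self, n):
        # A new tabstop sits n columns after the previous one (or column 0).
        prev = self.stops[-1] if self.stops else 0
        self.stops.append(prev + n)

    def pop(self):
        # Remove the most recently added tabstop.
        self.stops.pop()

    def reset(self):
        self.stops.clear()

    @property
    def has_indent_or_tabstops(self):
        return self.indent > 0 or bool(self.stops)


tabs = TabStops()
tabs.push(8)
tabs.push(12)
columns_after_push = list(tabs.stops)
tabs.pop()
columns_after_pop = list(tabs.stops)
tabs.reset()
print(columns_after_push, columns_after_pop, tabs.has_indent_or_tabstops)
```

	Expressing each tabstop relative to the previous one lets nested printers add a column without knowing where the caller's columns are.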
2022-10-03	bcachefs: Print last line in debugfs/btree_transaction_stats	Kent Overstreet
	We need to turn the flush_buf() thing into a proper API, to replace seq_file. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Track the maximum btree_paths ever allocated by each transaction	Kent Overstreet
	We need a way to check if the machinery for handling btree_paths within a transaction is behaving reasonably, as it often has not been - we've had bugs with transaction path overflows caused by duplicate paths and plenty of other things. This patch tracks, per transaction fn, the most btree paths ever allocated by that transaction and makes it available in debugfs. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Rename lock_held_stats -> btree_transaction_stats	Kent Overstreet
	Going to be adding more things to this in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Convert debugfs code to for_each_btree_key2()	Kent Overstreet
	This fixes a bug where we were leaking a transaction restart error to userspace. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: added lock held time stats	Daniel Hill
	We now record the length of time btree locks are held and expose this in debugfs. Enabled via CONFIG_BCACHEFS_LOCK_TIME_STATS. Signed-off-by: Daniel Hill <daniel@gluo.nz>
2022-10-03	bcachefs: Move 'btree_transactions' debug to debugfs	Kent Overstreet
	This moves btree_transactions from sysfs to debugfs, and makes it more verbose: now we also include the backtrace of each task, since we generally need this for debugging deadlocks. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Convert to lib/printbuf.c	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Fix usage of six lock's percpu mode	Kent Overstreet
	Six locks have a percpu mode, which we use for interior btree nodes, as well as btree key cache keys for the subvolumes btree. We've been switching locks back and forth between percpu and non-percpu mode as needed, but it turns out this is racy - when we're reusing an existing node, other threads could be attempting to lock it while we're switching it between modes. This patch fixes this by never switching 'struct btree' between the two modes, and instead segregating them between two different freed lists. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Start moving debug info from sysfs to debugfs	Kent Overstreet
	In sysfs, files can only output at most PAGE_SIZE. This is a problem for debug info that needs to list an arbitrary number of items, and because of this limit some of our debug info has been terser and harder to read than we'd like. This patch moves info about journal pins and cached btree nodes to debugfs, and greatly expands and improves the output we return. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Heap allocate printbufs	Kent Overstreet
	This patch changes printbufs to dynamically allocate and reallocate a buffer as needed. Stack usage has become a bit of a problem, and a major cause of that has been static size string buffers on the stack. The most involved part of this refactoring is that printbufs must now be exited with printbuf_exit(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Fix debugfs -bfloat-failed	Kent Overstreet
	It wasn't updated for snapshots - it's iterating across keys in all snapshots, so needs to be specifying BTREE_ITER_ALL_SNAPSHOTS. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Add missing bch2_trans_iter_exit() call	Kent Overstreet
	This fixes a bug where the filesystem goes read only when reading from debugfs. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: for_each_btree_node() now returns errors directly	Kent Overstreet
	This changes for_each_btree_node() to work like for_each_btree_key(), and to that end bch2_btree_iter_peek_node() and next_node() also return error ptrs. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Fix initialization of bch_write_op.nonce	Kent Overstreet
	If an extent ends up with a replica that is encrypted and a replica that isn't encrypted (due to the user changing options), and then copygc/rebalance moves one of the replicas by reading from the unencrypted replica, we had a bug where we wouldn't correctly initialize op->nonce - for each crc field in an extent, crc.offset + crc.nonce must be equal. This patch fixes that by moving op.nonce initialization to bch2_migrate_write_init. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Really don't hold btree locks while btree IOs are in flight	Kent Overstreet
	This is something we've attempted to stick to for quite some time, as it helps guarantee filesystem latency - but there are a few remaining paths that this patch fixes. We also add asserts that we're not holding btree locks when waiting on btree reads or writes. This is also necessary for an upcoming patch to update btree pointers after every btree write - since the btree write completion path will now be doing btree operations. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Split out SPOS_MAX	Kent Overstreet
	Internal btree code really wants a POS_MAX with all fields ~0; external code more likely wants the snapshot field to be 0, because when we're passing it to bch2_trans_get_iter() it's used for the snapshot we're operating in, which should be 0 for most btrees that don't use snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Update bch2_btree_verify()	Kent Overstreet
	bch2_btree_verify() verifies that the btree node on disk matches what we have in memory. This patch changes it to verify every replica, and also fixes it for interior btree nodes - there's a mem_ptr field which is used as a scratch space and needs to be zeroed out for comparing with what's on disk. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Eliminate more PAGE_SIZE uses	Kent Overstreet
	In userspace, we don't really have a well-defined PAGE_SIZE and shouldn't be relying on it. This is some more incremental work to remove references to it. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Start using bpos.snapshot field	Kent Overstreet
	This patch starts treating the bpos.snapshot field like part of the key in the btree code:
	* bpos_successor() and bpos_predecessor() now include the snapshot field
	* Keys in btrees that will be using snapshots (extents, inodes, dirents and xattrs) now always have their snapshot field set to U32_MAX
	The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that determines whether we're iterating over keys in all snapshots or not - internally, this controls whether bkey_(successor|predecessor) increment/decrement the snapshot field, or only the higher bits of the key.
	We add a new member to struct btree_iter, iter->snapshot: when BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always equal iter->snapshot, which will be 0 for btrees that don't use snapshots, and always U32_MAX for btrees that will use snapshots (until we enable snapshot creation).
	This patch also introduces a new metadata version number, and compat code for reading from/writing to older versions - this isn't a forced upgrade (yet).
	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Split out bpos_cmp() and bkey_cmp()	Kent Overstreet
	With snapshots, we're going to need to differentiate between comparisons that should and shouldn't include the snapshot field. bpos_cmp is now the comparison function that does include the snapshot field, used by core btree code. Upper level filesystem code generally does _not_ want to compare against the snapshot field - that code wants keys to compare as equal even when one of them is in an ancestor snapshot. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
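	The difference between the two comparisons can be shown with a small model. This is a sketch under assumptions: the Bpos tuple, its field names and their comparison order (inode, then offset, then snapshot) are chosen here for illustration and are not taken from this commit.

```python
from typing import NamedTuple


class Bpos(NamedTuple):
    """Toy btree position; field order is an assumption for this sketch."""
    inode: int
    offset: int
    snapshot: int


def _cmp(a, b):
    # Three-way comparison returning -1, 0 or 1, like a C cmp function.
    return (a > b) - (a < b)


def bpos_cmp(l, r):
    # Core btree code: the snapshot field is part of the key.
    return _cmp((l.inode, l.offset, l.snapshot), (r.inode, r.offset, r.snapshot))


def bkey_cmp(l, r):
    # Upper level filesystem code: ignore the snapshot field.
    return _cmp((l.inode, l.offset), (r.inode, r.offset))


in_ancestor = Bpos(inode=4096, offset=8, snapshot=1)
in_child = Bpos(inode=4096, offset=8, snapshot=2)

print(bpos_cmp(in_ancestor, in_child), bkey_cmp(in_ancestor, in_child))
```

	Two keys at the same inode and offset but in different snapshots are distinct positions to the btree code, while filesystem code that only cares about the file position sees them as equal.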
2022-10-03	bcachefs: Replace bch2_btree_iter_next() calls with bch2_btree_iter_advance	Kent Overstreet
	The way btree iterators work internally has been changing, particularly with the iter->real_pos changes, and bch2_btree_iter_next() is no longer hyper optimized - it's just advance followed by peek, so it's more efficient to just call advance where we're not using the return value of bch2_btree_iter_next(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Require all btree iterators to be freed	Kent Overstreet
	We keep running into occasional bugs with btree transaction iterators overflowing - this will make those bugs more visible. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Include device in btree IO error messages	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Drop sysfs interface to debug parameters	Kent Overstreet
	It's not used much anymore; the module parameter interface is better. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Change bch2_dump_bset() to also print key values	Kent Overstreet
	Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: btree_bkey_cached_common	Kent Overstreet
	This is prep work for the btree key cache: btree iterators will point to either struct btree, or a new struct bkey_cached. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2022-10-03	bcachefs: Initial commit	Kent Overstreet
	Forked from drivers/md/bcache, now a full blown COW multi device filesystem with a long list of features - https://bcachefs.org Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>