Add a new helper for allocating a new slot in a btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
It's safe to call bch2_trans_update with a k/v pair where the value
hasn't been filled out, as long as the key part has been and the value
is filled out by transaction commit time.
This patch folds the bch2_trans_update() call into bch2_bkey_make_mut(),
eliminating a bit of boilerplate.
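A minimal sketch of what this looks like at a call site (signatures
here are approximations, not the verbatim patch):

  /* before: make a mutable copy, then queue the update explicitly */
  new = bch2_bkey_make_mut(trans, k);
  ret = PTR_ERR_OR_ZERO(new);
  if (ret)
          return ret;
  /* ... mutate new->v ... */
  ret = bch2_trans_update(trans, &iter, new, 0);

  /* after: bch2_bkey_make_mut() takes the iterator and queues the
   * update itself; the value may be filled out any time before the
   * transaction commits: */
  new = bch2_bkey_make_mut(trans, &iter, k);
  ret = PTR_ERR_OR_ZERO(new);
  if (ret)
          return ret;
  /* ... mutate new->v ... */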
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Brian Foster <bfoster@redhat.com>
|
|
When a device is removed from a bcachefs volume, the associated
content is removed from the various btrees. The alloc tree uses the
key cache, so when keys are removed the deletes exist in cache for a
period of time until reclaim comes along and flushes outstanding
updates.
When a device is re-added to the bcachefs volume, the add process
re-adds some of these previously deleted keys. When marking device
superblock locations on device add, the keys will likely refer to
some of the same alloc keys that were just removed. The memory
triggers for these key updates are responsible for further updates,
such as bch2_mark_alloc() calling into bch2_dev_usage_update() to
update per-device usage accounting.
When a new key is added to the key cache, the trans update path also
flushes the key to the backing btree so that tree walks remain
coherent.
With all of this context, if a device is removed and re-added
quickly enough such that some key deletes from the remove are still
pending a key cache flush, the trans update path can view this as
addition of a new key because the old key in the insert entry refers
to a deleted key. However, the deleted cached key was not filled in
due to the absence of a btree key; rather, it reflects an explicit
deletion of an existing key that occurred during device removal.
The trans update path adds a new update to flush the key and tags
the original (cached) update to skip running the memory triggers.
This results in running triggers on the non-cached update instead,
which in turn will perform accounting updates based on incoherent
values. For example, bch2_dev_usage_update() subtracts the old
alloc key dirty sector count in the non-cached btree key from the
newly initialized (i.e. zeroed) per device counters, leading to
underflow and accounting corruption.
There are at least a few ways to avoid this problem, the simplest of
which may be to run triggers against the cached update rather than
the non-cached update. If the key only needs to be flushed when the
key is not present in the tree, however, then this still performs an
unnecessary update. We could potentially use the cached key dirty
state to determine whether the delete is a dirty, cached update vs.
a clean cache fill, but this may require transmitting key cache
dirty state across layers, which adds complexity and seems to be of
limited value. Instead, update flush_new_cached_update() to handle
this by simply checking for the key in the btree and only performing
the flush when a backing key is not present.
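A sketch of the resulting check in flush_new_cached_update() (helper
names and details are approximations of the patch, not verbatim):

  struct btree_path *btree_path;
  struct bkey u;
  struct bkey_s_c k;

  btree_path = bch2_path_get(trans, path->btree_id, path->pos,
                             1, 0, BTREE_ITER_INTENT, _THIS_IP_);
  ret = bch2_btree_path_traverse(trans, btree_path, 0);
  if (ret)
          goto out;

  /*
   * The old key in the insert entry may refer to an existing key in
   * the btree that was deleted in cache but not yet flushed: peek the
   * backing btree and only queue the flush update when no backing key
   * is present, so triggers don't run against incoherent values.
   */
  k = bch2_btree_path_peek_slot(btree_path, &u);
  if (!bkey_deleted(k.k))
          goto out;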
Signed-off-by: Brian Foster <bfoster@redhat.com>
|
|
We don't store backpointers in alloc keys anymore, since we gained the
btree write buffer.
This patch drops support for backpointers in alloc keys, and revs the on
disk format version so that we know a fsck is required.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If we block on journal reservation attempting to log journal
messages during recovery, particularly for the first message(s)
before we start doing actual work, chances are the filesystem ends
up deadlocked.
Allow logged messages to use reserved journal space to mitigate this
problem. In the worst case where no space is available whatsoever,
this at least allows the fs to recognize that the journal is stuck
and fail the mount gracefully.
Signed-off-by: Brian Foster <bfoster@redhat.com>
|
|
This could return a transaction restart; we need to check for that.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Brian Foster <bfoster@redhat.com>
|
|
Transaction hooks aren't supposed to run unless we know the transaction
is going to commit successfully: this fixes a bug with attempting to
delete a subvolume multiple times.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.
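The pattern, roughly (the subtype name here is illustrative):

  buf = kmalloc(size, GFP_KERNEL);
  if (!buf)
          return -BCH_ERR_ENOMEM_trans_kmalloc;  /* instead of -ENOMEM */

  /* callers that only care about the class still match on it: */
  if (bch2_err_matches(ret, ENOMEM))
          return ret;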
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
fstest generic/388 occasionally reproduces corruptions where an
inode has extents beyond i_size. This is a deliberate crash and
recovery test, and the post crash+recovery characteristics are
usually the same: the inode exists on disk in an early (i.e. just
allocated) state based on the journal sequence number associated
with the inode. Subsequent inode updates exist in the journal at
higher sequence numbers, but the inode hadn't been written back
before the associated crash and the post-crash recovery processes a
set of journal sequence numbers that doesn't include updates to the
inode. In fact, the sequence with the most recent inode key update
always happens to be the sequence just before the front of the
journal processed by recovery.
This last bit is a significant hint that the problem relates to an
on-disk journal update of the front of the journal. The root cause
of this problem is basically that the inode is updated (multiple
times) in-core and in the key cache, each time bumping the key cache
sequence number used to control the cache flush. The cache flush
skips one or more times, bumping the associated key cache journal
pin to the key cache seq value. This has a side effect of holding
the inode in memory a bit longer than normal, which helps exacerbate
this problem, but is also unsafe in certain cases where the key
cache seq may have been updated by a transaction commit that didn't
journal the associated key.
For example, consider an inode that has been allocated, updated
several times in the key cache, journaled, but not yet written back.
At this stage, everything should be consistent if the fs happens to
crash because the latest update has been journaled. Now consider a key
update via bch2_extent_update_i_size_sectors() that uses the
BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode
state, it can have the side effect of bumping ck->seq in
bch2_btree_insert_key_cached(). In turn, if a subsequent key cache
flush skips due to seq not matching the former, the ck->journal pin
is updated to ck->seq even though the most recent key update was not
journaled. If this pin happens to reside at the front (tail) of the
journal, this means a subsequent journal write can update last_seq
to a value beyond that which includes the most recent update to the
inode. If this occurs and the fs happens to crash before the inode
happens to flush, recovery will see the latest last_seq, fail to
recover the inode and leave the inode in the inconsistent state
described above.
To avoid this problem, skip the key cache seq update on NOJOURNAL
commits, except on initial pin add. Pass the insert entry directly
to bch2_btree_insert_key_cached() to make the associated flag
available and be consistent with btree_insert_key_leaf().
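A sketch of the resulting logic in bch2_btree_insert_key_cached()
(abridged, not the verbatim patch):

  if (!test_bit(BKEY_CACHED_DIRTY, &ck->flags)) {
          /* initial pin add: always take the seq */
          set_bit(BKEY_CACHED_DIRTY, &ck->flags);
          ck->seq = trans->journal_res.seq;
          bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
                               &ck->journal,
                               bch2_btree_key_cache_journal_flush);
  } else if (!(insert_entry->flags & BTREE_UPDATE_NOJOURNAL)) {
          /* only journaled updates may bump ck->seq, so the journal
           * pin can never advance past the last journaled update */
          ck->seq = trans->journal_res.seq;
  }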
Signed-off-by: Brian Foster <bfoster@redhat.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree & level are passed to trans_mark - for backpointers -
bch2_mark_key() should take them as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Cached btrees should be doing cached updates by default: this fixes a
bug in the migrate tool.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When fully overwriting an existing extent, we may need to generate a
whiteout - not just if the extent being overwritten was in an older
snapshot, but also if it was overwriting an extent in an older snapshot.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds a new helper to delete some redundant code in
bch2_trans_update_extent().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes a very rare race in our assertion, with needs_whiteout being
modified in the btree key.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This patch is prep work for the following patch.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
|
|
This adds a new method of doing btree updates - a straight write buffer,
implemented as a flat fixed size array.
This is only useful when we don't need to read from the btree in order
to do the update, and when reading is infrequent - perfect for the LRU
btree.
This will make LRU btree updates fast enough that we'll be able to use
it for persistently indexing buckets by fragmentation, which will be a
massive boost to copygc performance.
Changes:
- A new btree_insert_type enum, for btree_insert_entries. Specifies
btree, btree key cache, or btree write buffer.
- bch2_trans_update_buffered(): updates via the btree write buffer
don't need a btree path, so we need a new update path; see the
sketch after this list.
- Transaction commit path changes:
The update to the btree write buffer both mutates global state and can
fail if there isn't currently room. Therefore we do all write buffer
updates in the transaction at once, and if that fails we have to
revert the filesystem usage counter changes.
If there isn't room we flush the write buffer in the transaction
commit error path and retry.
- A new persistent option, for specifying the number of entries in the
write buffer.
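A sketch of what an update through the write buffer looks like for a
caller (the exact signature is an approximation):

  /* no btree path or iterator is needed - the key goes into the flat
   * buffer and is flushed to the LRU btree later, in sorted batches: */
  ret = bch2_trans_update_buffered(trans, BTREE_ID_lru, &lru_k->k_i);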
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Recursive transaction commits are occasionally necessary - in
particular, for the upcoming btree write buffer's flush path.
This avoids bugs due to trans->flags being accidentally mutated
mid-commit, which can cause c->writes refcount leaks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds a debug mode where we split up the c->writes refcount into
distinct refcounts for every codepath that takes a reference, and adds
sysfs code to print the value of each ref.
This will make it easier to debug shutdown hangs due to refcount leaks.
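The shape of the mechanism, abridged (the ref names here are
illustrative):

  #define BCH_WRITE_REFS() x(trans) x(journal) x(reflink)

  enum bch_write_ref {
  #define x(n) BCH_WRITE_REF_##n,
          BCH_WRITE_REFS()
  #undef x
          BCH_WRITE_REF_NR,
  };

  /* in debug mode each ref is a separate counter, printed by name via
   * sysfs; in normal builds they fold into the single c->writes ref */
  if (bch2_write_ref_tryget(c, BCH_WRITE_REF_trans)) {
          /* ... do work that requires the fs to remain writeable ... */
          bch2_write_ref_put(c, BCH_WRITE_REF_trans);
  }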
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
It's important that in BTREE_ITER_FILTER_SNAPSHOTS mode we always use
peek_upto() and provide an end for the interval we're searching for -
otherwise, when we hit the end of the inode, the next inode may be in a
different subvolume and not have any keys in the current snapshot, and
we'd iterate over arbitrarily many keys before returning one.
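For example, instead of an open-ended peek:

  k = bch2_btree_iter_peek(&iter);

iterating within a single inode should bound the search at the end of
that inode:

  k = bch2_btree_iter_peek_upto(&iter, POS(inum, U64_MAX));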
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This replaces various BUG_ON() assertions with panics that tell us where
the restart was done and the restart type.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This should have been resetting trans->fs_usage_deltas as well.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
More error code cleanup, for better error messages and debuggability.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
More error code improvements - this gets us more useful error messages.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
In debug mode, we now track where btree iterators and paths are
initialized/allocated - helpful in tracking down btree path overflows.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This patch
- Adds a mechanism for queuing up journal entries prior to the journal
being started, which will be used for early journal log messages
- Adds bch2_fs_log_msg() and improves bch2_trans_log_msg(), which now
take format strings. bch2_fs_log_msg() can be used before or after
the journal has been started, and will use the appropriate mechanism;
see the example after this list.
- Deletes the now obsolete bch2_journal_log_msg()
- And adds more log messages to the recovery path - messages for
journal/filesystem started, journal entries being blacklisted, and
journal replay starting/finishing.
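For example, during recovery (the message text here is illustrative):

  bch2_fs_log_msg(c, "starting journal replay, %zu keys", keys->nr);

works whether or not the journal has been started yet: before it has,
the entry is queued up and written out once the journal comes up.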
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This introduces some new conveniences, to help cut down on boilerplate:
- bch2_trans_kmalloc_nomemzero() - a performance optimization
- bch2_bkey_make_mut()
- bch2_bkey_get_mut()
- bch2_bkey_get_mut_typed() - see the sketch after this list
- bch2_bkey_alloc()
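For instance, fetching a typed key for update becomes (a sketch; the
exact signature and fields are approximations):

  struct bkey_i_alloc_v4 *a =
          bch2_bkey_get_mut_typed(trans, &iter, alloc_v4);
  int ret = PTR_ERR_OR_ZERO(a);
  if (ret)
          return ret;

  a->v.io_time[READ] = atomic64_read(&c->io_clock[READ].now);
  ret = bch2_trans_update(trans, &iter, &a->k_i, 0);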
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, we were journalling extra_journal_entries (which is used for
new btree roots, and irreversibly mutates system state) before calling
bch2_trans_fs_usage_apply(), which can fail - whoops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When we started stashing the key being overwritten in
btree_insert_entry, this introduced a typical iterator invalidation
problem, triggered by btree node splits or resorts.
Previously, we dealt with this by unconditionally re-validating those
stashed pointers in the transaction commit path. This patch gets rid of
that by doing it only when needed, in bch2_trans_node_add() or
bch2_trans_node_reinit_iter().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Replace with standard bcachefs-private error codes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When we start using the key cache for inodes again, it'll be possible
for bch2_btree_path_peek_slot() to return a key in a different snapshot
with a key cache path.
This isn't what we want when triggers are checking what they're
overwriting, so introduce a new helper for the commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This patch introduces
- bpos_eq()
- bpos_lt()
- bpos_le()
- bpos_gt()
- bpos_ge()
and equivalent replacements for bkey_cmp().
Looking at the generated assembly, these could probably be improved
further, but we already see a significant code size improvement with
this patch.
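The new helpers are simple componentwise comparisons, roughly:

  static __always_inline bool bpos_eq(struct bpos l, struct bpos r)
  {
          return !((l.inode    ^ r.inode) |
                   (l.offset   ^ r.offset) |
                   (l.snapshot ^ r.snapshot));
  }

  static __always_inline bool bpos_lt(struct bpos l, struct bpos r)
  {
          return  l.inode    != r.inode    ? l.inode    < r.inode :
                  l.offset   != r.offset   ? l.offset   < r.offset :
                  l.snapshot != r.snapshot ? l.snapshot < r.snapshot
                                           : false;
  }

Unlike bkey_cmp()/bpos_cmp(), these don't have to materialize a
three-way -1/0/+1 result, which is where the code size win comes from.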
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This drops some unneeded references to JOURNAL_REPLAY_DONE in c->flags:
we're already mirroring it in btree_trans, we just weren't using it
consistently.
We may want to do this with more flags:
btree_iter.c: unsigned nr = test_bit(BCH_FS_STARTED, &c->flags)
btree_update_leaf.c: if (unlikely(!test_bit(BCH_FS_MAY_GO_RW, &c->flags))) {
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This just removes a redundant comparison - there's more work we could do
here to remove some redundant copying.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Convert some non-critical asserts in long-stable code to debug asserts.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
b->write_type needs to be set atomically with setting the
btree_node_need_write flag, so move it into b->flags.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
- factor out more slowpath code into non-inline function
- use bch2_print_string_as_lines(), so our error message doesn't get
truncated
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
- Don't call into bch2_encrypt_bio() when we're not encrypting
- Pull slowpath out of trans_lock_write()
- Make sure bch2_trans_journal_res_get() gets inlined.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This replaces sysfs btree_avg_write_size with btree_write_stats, which
now breaks out statistics by the source of the btree write.
Btree writes that are too small are a source of inefficiency, and
excessive btree resort overhead - this will let us see what's causing
them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This shouldn't be needed anymore, since we don't rely on the pointer
validity that this was guarding against anymore - we get a new good
reference and save it right after this function.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This separates out the slowpath of bch2_trans_update_by_path_trace()
into a new non-inlined helper.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Now that we have an error path plumbed through, there's no need to be
using bch2_btree_node_lock_write_nofail().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch2_btree_delete_range_trans() was using btree_trans_too_many_iters()
to avoid path overflow, but this was buggy here (and also
btree_trans_too_many_iters() is suspect in general).
btree_trans_too_many_iters() only returns true when we're close to the
maximum number of paths - within 8 - but extent insert/delete assumes
that it can use more paths than that.
Instead, we need to call bch2_trans_begin() on every loop iteration.
Since we don't want to call bch2_trans_begin() (restarting the outer
transaction) if the call was a no-op - if we had no work to do - we have
to structure things a bit oddly.
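The resulting loop shape, roughly (abridged, not the verbatim patch):

  while ((k = bch2_btree_iter_peek_upto(&iter, end)).k) {
          ret = bkey_err(k);
          if (ret)
                  goto err;

          /* ... queue the delete, bch2_trans_commit() ... */
  err:
          /*
           * bch2_trans_begin() sits at the bottom of the loop so it
           * runs after every commit - keeping path use bounded - but
           * not when the whole call turns out to be a no-op:
           */
          bch2_trans_begin(trans);

          if (bch2_err_matches(ret, BCH_ERR_transaction_restart))
                  ret = 0;
          if (ret)
                  break;
  }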
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This moves the slowpath of check_pos_snapshot_overwritten() to a
separate function, and inlines the fast path - helping performance on
btrees that don't use snapshots and for users that aren't using
snapshots.
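A sketch of the split (the fast-path condition is an approximation):

  static noinline
  int __check_pos_snapshot_overwritten(struct btree_trans *trans,
                                       enum btree_id id,
                                       struct bpos pos)
  {
          /* slowpath: walk keys at this pos in other snapshots */
  }

  static inline
  int check_pos_snapshot_overwritten(struct btree_trans *trans,
                                     enum btree_id id,
                                     struct bpos pos)
  {
          if (!btree_type_has_snapshots(id) ||
              !snapshot_t(trans->c, pos.snapshot)->children[0])
                  return 0;

          return __check_pos_snapshot_overwritten(trans, id, pos);
  }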
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Before we had the deadlock cycle detector, we didn't want to be holding
read locks when taking intent locks, because blocking on an intent lock
while holding a read lock was a lock ordering violation that could
cause a deadlock.
With the cycle detector this is no longer an issue, so this code can be
deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This deletes our old lock ordering based deadlock avoidance code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Continuing the saga of introducing private dedicated error codes for
each error path, this patch converts ENOSPC to error codes that are
subtypes of ENOSPC. We've recently had a test failure where we got
-ENOSPC where we shouldn't have, and didn't have enough information to
tell where it came from, so this patch will solve that problem.
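As with the earlier ENOMEM conversion, the subtypes remain matchable
as a class (the subtype name here is illustrative):

  if (!ob)
          return -BCH_ERR_ENOSPC_bucket_alloc;

  if (bch2_err_matches(ret, ENOSPC))
          return ret;     /* any flavour of ENOSPC */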
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|