Age | Commit message | Author |
|
This introduces two new macros for iterating through the btree, with
transaction restart handling
- for_each_btree_key2()
- for_each_btree_key_commit()
Every iteration is now in an implicit transaction, and - as with
lockrestart_do() and commit_do() - returning -EINTR will cause the
transaction to be restarted, at the same key.
This patch converts a bunch of code that was open coding this to these
new macros, saving a substantial amount of code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
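
The restart-at-the-same-key behaviour described above can be illustrated with a small userspace sketch. This is not the bcachefs macro implementation: `for_each_key_restartable()`, `key_fn`, `struct walk_state` and `visit_key()` are hypothetical names invented for the example, and a plain array stands in for the btree. The only point shown is that a callback returning -EINTR is retried on the same key rather than skipped.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: 0 on success, -EINTR to request a restart,
 * any other negative errno to abort the walk. */
typedef int (*key_fn)(int key, void *ctx);

/*
 * Toy stand-in for the iteration macros: walk keys[0..nr) and, when the
 * callback returns -EINTR, call it again on the same key instead of
 * advancing.
 */
static int for_each_key_restartable(const int *keys, size_t nr,
                                    key_fn fn, void *ctx)
{
    size_t i = 0;

    assert(fn != NULL);

    while (i < nr) {
        int ret = fn(keys[i], ctx);

        if (ret == -EINTR)
            continue;       /* "transaction restart": same key again */
        if (ret)
            return ret;     /* real error: stop iterating */
        i++;
    }
    return 0;
}

/* Example state: counts calls and restarts, sums the keys processed. */
struct walk_state {
    int calls;
    int restarts;
    int sum;
};

/* Example callback: fails once with -EINTR on key 2, then succeeds. */
static int visit_key(int key, void *ctx)
{
    struct walk_state *s = ctx;

    s->calls++;
    if (key == 2 && s->restarts == 0) {
        s->restarts++;
        return -EINTR;
    }
    s->sum += key;
    return 0;
}
```

Because the retry is handled in one place, callers written in this style do not need their own open-coded restart loops, which is presumably where the code savings mentioned in the commit come from.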
|
|
When we find an extent past an inode's i_size, we need to do the
deletion in the inode's snapshot (which will emit a whiteout if
necessary); and we also need to note that we now have a key at that
position and snapshot, so that we don't go into an infinite loop.
Also, switch to walking inodes in reverse order, oldest snapshot to
newest, so that we emit the fewest whiteouts possible.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Fsck now checks for keys in different snapshot IDs that are now
redundant due to other snapshots being deleted - it needs to for its own
algorithms to not get confused.
When it detects this it should re-run the post snapshot deletion cleanup
- this patch does that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
|
|
- Bunch of refactoring, and move some code out of
bch2_snapshots_start() and into bch2_snapshots_check(), for consistency
with the rest of fsck
- Interior snapshot nodes no longer point to a subvolume; this is so we
don't end up with dangling subvol references when deleting or require
scanning the full snapshots btree.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This makes the snapshots_seen data structure fsck private and improves
it; we now also track the equivalence class for each snapshot id we've
seen, which means we can detect when snapshot deletion hasn't finished
or run correctly (which will otherwise confuse fsck).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
fsck doesn't want to run while we're cleaning up deleted snapshots - if
that work needs to be done, we want it to have finished before fsck
runs, otherwise fsck will get confused when it finds multiple keys in
the same snapshot ID equivalence class (i.e. the mechanism that
snapshot deletion uses for cleaning up redundant keys).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We should never see an inode marked as unlinked that's a subvolume root
(or a directory) in fsck, but even if we do it's not correct for fsck to
delete the subvolume: subvolumes are owned by dirents, and if we find a
dangling subvolume (not marked as unlinked) we want fsck to reattach it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
snapshots_seen is becoming private to fsck, and snapshot_id_list is
actually what the data update path needs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Snapshots being deleted won't in general have a corresponding subvolume:
this fixes a spurious fsck error where we'd complain about a snapshot
pointing to a missing subvolume - but the subvolume had been deleted,
and the snapshot was pending deletion as well.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Better/more descriptive naming, and prep for adding
nested_lockrestart_do() and nested_commit_do().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
These messages log the updates we're doing in bch2_check_fix_ptrs(),
which is useful when debugging but not usually needed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
There's no need to print fsck errors for errors that are expected and
that the user has already opted to repair.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
A lock ordering transaction restart can rarely happen in
bch2_btree_path_traverse_all() due to btree_key_cache_fill() creating
new paths at a lower lock order than the current path being traversed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Dominique Martinet <asmadeus@codewreck.org>
|
|
This is to aid in adding mempools, in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Dominique Martinet <asmadeus@codewreck.org>
|
|
An upcoming patch is going to require passing the client through
p9_req_put() -> p9_req_free(), but that's awkward with the kref
indirection - so this patch switches to using refcount_t directly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Dominique Martinet <asmadeus@codewreck.org>
|
|
This isn't done very often, but it is legitimate
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
bch2_check_alloc_key() was failing to check buckets that didn't have
alloc keys yet (because they'd never been used) - they still need to be
added to the freespace btree.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
- In check_alloc_key(), previously we were re-initializing iterators
for the need_discard and freespace btrees for every alloc key we
checked. But this was causing us to redo lookups into the journal
keys every time, since those lookups are cached in struct btree_iter.
This initializes the iterators in bch2_check_alloc_info and passes
them into check_alloc_key().
- Make the looping more consistent/efficient in bch2_check_alloc_info()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This runs before we go rw for journal replay, but after we're allowed to
go rw. It might be time to consider killing BTREE_INSERT_LAZY_RW,
though.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We need to track these down, so let's make them noisier.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We'd like to make these errors fatal and more noisy, but first we need
to silence the ones that aren't actually errors.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
- invalidate_one_bucket() now returns 1 when we don't have any buckets
on this device to invalidate, ensuring we don't spin
- the tracepoint invocation is moved to after the transaction commit,
and we now include the number of cached sectors in the tracepoint
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
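
A minimal sketch of the "return 1 so the caller stops" convention from the first bullet, under stated assumptions: `toy_invalidate_one()` and `toy_invalidate()` are hypothetical userspace stand-ins, and a simple counter replaces the real per-device bucket state.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for invalidate_one_bucket(): returns 0 after
 * invalidating one bucket, or 1 when there is nothing left to invalidate.
 */
static int toy_invalidate_one(unsigned *cached_buckets)
{
    assert(cached_buckets);

    if (!*cached_buckets)
        return 1;           /* nothing to do: tell the caller to stop */
    --*cached_buckets;
    return 0;
}

/* Invalidate up to 'want' buckets; returns how many were invalidated. */
static unsigned toy_invalidate(unsigned *cached_buckets, unsigned want)
{
    unsigned done = 0;

    while (done < want) {
        if (toy_invalidate_one(cached_buckets))
            break;          /* without this exit the loop would spin */
        done++;
    }
    return done;
}
```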
|
|
This switches that assertion to a bch2_trans_inconsistent() call, as it
should be.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We can rebuild alloc info if these btree roots are missing - no need to
bail out and say the filesystem is unrecoverable
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
If a btree node is unreadable, it's the topology repair that fixes that
and it's kicked off by btree_gc, so btree_gc needs to touch every node
and verify that they can be read.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
__dev_available() now calculates available buckets correctly. Previously
it would almost always return 0 when we have cached data.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
|
|
If we were at the end of the node, when breaking out of the loop we'd
pop the assertion on line 446 when cur wasn't NULL.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
-o verbose is very useful, and we're starting to use it more for runtime
debug statements - making it possible to enable at runtime is a
no-brainer.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
If we fail to queue the work item because it's already in process, we
need to drop the ref we just took.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
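
The ref-handling rule above can be sketched in userspace as follows. The `toy_*` names are hypothetical; the only real behaviour relied on is that the kernel's queue_work() returns false when the work item was already queued, which `toy_queue_work()` imitates.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_work {
    bool pending;           /* already queued and not yet run */
};

struct toy_obj {
    int refs;
    struct toy_work work;
};

/* Imitates queue_work(): false if the item was already pending,
 * true if this call queued it. */
static bool toy_queue_work(struct toy_work *w)
{
    if (w->pending)
        return false;
    w->pending = true;
    return true;
}

/* Take a ref for the work item; give it back if nothing was queued. */
static void toy_kick(struct toy_obj *o)
{
    o->refs++;
    if (!toy_queue_work(&o->work))
        o->refs--;          /* not queued: nobody else will drop this ref */
}

/* The work function drops the ref that was taken when it was queued. */
static void toy_work_fn(struct toy_obj *o)
{
    assert(o->refs > 0);
    o->work.pending = false;
    o->refs--;
}
```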
|
|
There's no point reading an extent in order to move it if the write is
going to fail because we're shutting down. This patch changes the move
path so that moving_io now owns a ref on c->writes - as a bonus,
rebalance and copygc will now notice that we're shutting down and exit
quicker.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This tweaks the copygc path to keep the same moving_ctxt across multiple
evacuate bucket calls, meaning we can pipeline across buckets - should
be a nice performance boost.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
- add bch2_moving_ctxt_(init|exit)
- split out __bch2_evacuate_bucket(), which takes an existing
moving_ctxt, this will be used for improving copygc performance by
pipelining across multiple buckets
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Originally, the btree key cache code would always allocate new entries
by reusing from the recently-freed list, if that list wasn't empty. But
that behaviour was dropped, for lock contention reasons.
But it seems that entries stranded on the freed list have been
contributing to some of our oom issues, because long running btree
transactions will prevent them from being freed.
This patch re-adds allocating from the freed list, but it also adds
percpu buffers to solve the lock contention issues - and the new percpu
freed lists will improve the evict paths, too.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
|
|
This adds a new option, move_bytes_in_flight, for configuring the amount
of IO in flight by copygc/rebalance - users with many devices in their
filesystem will want to increase this.
In the future we should be smarter about this, but this is an easy
improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We have a hardcoded maximum on number of pointers in an extent that's
used by some other data structures - notably bch_devs_list - but we
weren't actually checking for it. Oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
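
To illustrate why the missing check matters, here is a small sketch with a fixed-size device list. Everything in it is hypothetical: `TOY_PTRS_MAX` is an assumed limit chosen for the example, and `toy_devs_list` only loosely mirrors the idea of bch_devs_list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PTRS_MAX 16     /* assumed limit, for illustration only */

struct toy_devs_list {
    unsigned nr;
    unsigned char devs[TOY_PTRS_MAX];
};

/* Reject extents with more pointers than the fixed-size list can hold. */
static int toy_extent_validate(size_t nr_ptrs)
{
    return nr_ptrs > TOY_PTRS_MAX ? -EINVAL : 0;
}

/* Copy the devices of an extent's pointers into the fixed-size list. */
static int toy_devs_list_fill(struct toy_devs_list *l,
                              const unsigned char *devs, size_t nr_ptrs)
{
    int ret = toy_extent_validate(nr_ptrs);

    if (ret)
        return ret;         /* without the check, the loop could overflow */

    l->nr = 0;
    for (size_t i = 0; i < nr_ptrs; i++) {
        assert(l->nr < TOY_PTRS_MAX);
        l->devs[l->nr++] = devs[i];
    }
    return 0;
}
```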
|
|
If we're trying to get a ref and the refcount has been killed, it means
we're doing an emergency shutdown - we always want tryget_live().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
move_ratelimit() now has a bool that specifies whether we want to
wait for copygc to finish.
When copygc is running, we're probably low on free buckets; instead
of consuming the remaining buckets, we want to wait for copygc to
finish.
This should help with performance and with runaway bucket fragmentation.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
|
|
This patch significantly cleans up and simplifies the data_update
interface. Instead of only being able to specify a single pointer by
device to rewrite, we're now able to specify any or all of the pointers
in the original extent to be rewritten, as a bitmask.
data_cmd is no more: the various pred functions now just return true if
the extent should be moved/updated. All the data_update path does is
rewrite existing replicas, or add new ones.
This fixes a bug with background compression on replicated
filesystems, where rebalance -> data_update would incorrectly drop the
wrong old replica, and keep trying to recompress an extent pointer,
each time failing to drop the right replica. Oops.
Now, the data update path doesn't look at the io options to decide which
pointers to keep and which to drop - it only goes off of the
data_update_options passed to it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
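
The "pointers to rewrite as a bitmask" idea can be sketched like this. It is a toy model, not the data_update code: `struct toy_ptr`, `toy_ptrs_on_dev()` and `toy_rewrite()` are hypothetical, and bit i of the mask is simply assumed to refer to the i-th pointer of the extent.

```c
#include <assert.h>
#include <stddef.h>

struct toy_ptr {
    unsigned dev;           /* device this replica lives on */
};

/* Build a mask selecting every pointer that lives on 'dev'. */
static unsigned toy_ptrs_on_dev(const struct toy_ptr *ptrs, size_t nr,
                                unsigned dev)
{
    unsigned mask = 0;

    assert(nr <= 32);       /* mask is one bit per pointer */

    for (size_t i = 0; i < nr; i++)
        if (ptrs[i].dev == dev)
            mask |= 1U << i;
    return mask;
}

/*
 * Rewrite exactly the selected pointers to 'new_dev'; the rest are left
 * alone. Returns the number of pointers rewritten.
 */
static size_t toy_rewrite(struct toy_ptr *ptrs, size_t nr,
                          unsigned rewrite_ptrs, unsigned new_dev)
{
    size_t n = 0;

    assert(nr <= 32);

    for (size_t i = 0; i < nr; i++)
        if (rewrite_ptrs & (1U << i)) {
            ptrs[i].dev = new_dev;
            n++;
        }
    return n;
}
```

Since the mask names pointers explicitly, the rewrite step never has to guess which replica the caller meant, which is the class of mistake the commit describes.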
|
|
We're seeing checksum errors in the bch2_rechecksum_bio() path - give it
a better error message to help track this down.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This moves btree_transactions from sysfs to debugfs, and makes it more
verbose: now we also include the backtrace of each task, since we
generally need this for debugging deadlocks.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
When inserting a key type that's not valid for a given btree, we should
print out which btree we were inserting into.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We were only allowing 4 devices in a dev_list, not 16.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This improves the "copygc requested to run but no buckets found" message
to show the device that requires copygc to be run on - we'll definitely
need to improve this more.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This is the start of reorganizing the data IO paths. The plan is to also
break apart io.c into data_read.c and data_write.c, and migrate_write
will be renamed to the data_update path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Previously, dev_buckets_available() only counted buckets that are
eligible to be allocated right now - i.e. buckets that don't have cached
data, or need discard, or need gc gens, etc.
But most users of this function want to know how many buckets are
eligible to be allocated from without moving data around - copygc,
allocator striping, which means we should be including cached data
buckets etc.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
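
The difference between the two notions of "available" can be sketched as below. The struct fields, the `reserved` parameter and the exact arithmetic are assumptions made for illustration; the real accounting in dev_buckets_available() is not shown in this log.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-device bucket counts, by state. */
struct toy_dev_usage {
    uint64_t free;
    uint64_t cached;
    uint64_t need_discard;
    uint64_t need_gc_gens;
};

/* Old-style answer: only buckets that can be allocated right now. */
static uint64_t toy_buckets_allocatable_now(const struct toy_dev_usage *u)
{
    assert(u);
    return u->free;
}

/*
 * New-style answer: buckets that can be allocated from without moving
 * data around, so cached, need_discard and need_gc_gens buckets count
 * too. Clamped at zero once the (assumed) reserve is subtracted.
 */
static uint64_t toy_buckets_available(const struct toy_dev_usage *u,
                                      uint64_t reserved)
{
    uint64_t total;

    assert(u);
    total = u->free + u->cached + u->need_discard + u->need_gc_gens;
    return total > reserved ? total - reserved : 0;
}
```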
|
|
Previously, copygc needed to walk the entire extents & reflink btrees to
find extents that needed to be moved.
Now that we have backpointers, this patch implements
bch2_evacuate_bucket() in the move code, which copygc now uses for
evacuating mostly empty buckets.
Also, thanks to the new backpointers code, copygc can now move btree
nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|