Age | Commit message (Collapse) | Author |
|
This patch significantly cleans up and simplifies the data_update
interface. Instead of only being able to specify a single pointer by
device to rewrite, we're now able to specify any or all of the pointers
in the original extent to be rewrited, as a bitmask.
data_cmd is no more: the various pred functions now just return true if
the extent should be moved/updated. All the data_update path does is
rewrite existing replicas, or add new ones.
This fixes a bug where with background compression on replicated
filesystems, where rebalance -> data_update would incorrectly drop the
wrong old replica, and keep trying to recompress an extent pointer and
each time failing to drop the right replica. Oops.
Now, the data update path doesn't look at the io options to decide which
pointers to keep and which to drop - it only goes off of the
data_update_options passed to it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
When inserting a key type that's not valid for a given btree, we should
print out which btree we were inserting into.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We were only allowing 4 devices in a dev_list, not 16.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This improves the "copygc requested to run but no buckets found" to show
the device that requires copygc to be run on - we'll definitely need to
improve this more.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This is the start of reorganizing the data IO paths. The plan is to also
break apart io.c into data_read.c and data_write.c, and migrate_write
will be renamed to the data_update path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Previously, dev_buckets_available() only counted buckets that are
eligible to be allocated right now - i.e. buckets that don't have cached
data, or need discard, or need gc gens, etc.
But most users of this function want to know how many buckets are
eligible to be allocated from without moving data around - copygc,
allocator striping, which means we should be including cached data
buckets etc.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Previously, copygc needed to walk the entire extents & reflink btrees to
find extents that needed to be moved.
Now that we have backpointers, this patch implements
bch2_evacuate_bucket() in the move code, which copygc now uses for
evacuating mostly empty buckets.
Also, thanks to the new backpointers code, copygc can now move btree
nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This patch adds backpointers: we now have a reverse index from device
and offset on that device (specifically, offset within a bucket) back to
btree nodes and (non cached) data extents.
The first 40 backpointers within a bucket are stored in the alloc key;
after that backpointers spill over to the next backpointers btree. This
is to help avoid performance regressions from additional btree updates
on large streaming workloads.
This patch adds all the code for creating, checking and repairing
backpointers. The next patch in the series is going to use backpointers
for copygc - finally getting rid of the need to scan all extents to do
copygc.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
With backpointers, alloc keys have gotten bigger, so we're needing more
memory here.
We're probably going to need to go with something more sophisticated
than a bump allocator, but - let's see if we can avoid doing that just
yet.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Repair now checks if overlapping extents exist in the same snapshot
and calls update_trans_update_extent to do the repair work.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
|
|
Like the previous patch for bucket invalidates, add another counter for
a core allocator path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
b->written wasn't being reset to 0 in the btree node read retry path,
causing decrypting & validation of previously read bsets to not be
re-run - ouch.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Like bch2_do_discards(), we should check if this needs to be done when
going rw.
Also, add some sysfs code for debugging bucket invalidation.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Printbufs recently switched to using string_get_size() for printing
integers in human readable units. This updates __bch2_strtoh() to parse
numbers printed by string_get_size() - we now have to handle floating
point numbers, and new unit suffixes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
bch2_dev_freespace_init() was using __bch2_trans_do() incorrectly, and
calling bch2_bucket_do_index() with a stale alloc key.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This implements a new printbuf version of d_path()/mangle_path(), which
will replace the seq_buf version.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We were forgetting to clear the read_in_flight flag - oops. This also
fixes it to not call bch2_fatal_error() before topology repair has had a
chance to do its thing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We had a bug where btree_and_journal_iter would return the same key
twice - after deleting it (perhaps because it was present in both the
btree and the journal?)
This reworks btree_and_journal_iter to track the current position, much
like btree_paths, which makes the logic considerably simpler and more
robust.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
|
|
cmd_list_journal wasn't correctly listing the most recent journal
entries as blacklisted - because in the recovery path when just reading
the journal, we were failing to add those to the blacklist table.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Lately we've been doing a lot of debugging by looking at the journal to
see what was changed, and by what code path. This patch adds a new
journal entry type for recording overwrites, so that we don't have to
search backwards through the journal to see what was being overwritten
in order to work out what the triggers were supposed to be doing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This takes copying the payload out of bch2_journal_add_entry(), which
means we can use it for journal_transaction_name() - also prep work for
journalling overwrites.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This is the last piece for btree key cache coherency:
We already have:
- btree iterator code checks the key cache when iterating over a cached
btree
- update path ensures that updates go to the key cache when updating a
cached btree
But for iterating over a cached btree to work, we need to ensure that if
a key exists in the key cache, it also exists in the btree - otherwise
the iterator code will skip past it and not check the key cache.
This patch implements that last piece: on a key cache update, if
creating a new key, we now also update the underlying btree.
This fixes a device removal bug, where deleting alloc info wasn't
correctly deleting all keys associated with a given device. It also
means we should be able to re-enable the key cache for inodes.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
bch2_opt_parse() was failing to generate error messages in error path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
When do_encrypt() was passed a vmalloc address and the buffer spanned
more than a single page, we were encrypting/decrypting completely
different pages than the ones intended.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
|
|
Factor out a new helper.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
One of the init calls had a ; instead of a ?:, and errors after that got
dropped - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Right now, we print an error message on btree node read error, and we
print that we're retrying, but we don't explicitly say if the retry
succeeded - this makes things a little clearer.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We want this option always enabled - it costs practically nothing and
it's an essential debugging tool. It's enabled by default, but old
filesystems are still out there that need debugging, so let's just force
it on and kill the option.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Previously, on every btree_iter_peek() operation we were searching the
journal keys, doing a full binary search - which was slow.
This patch fixes that by saving our position in the journal keys, so
that we only do a full binary search when moving our position backwards
or a large jump forwards.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This logging improvement helps see when the previous fsck pass has
completed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Daniel Hill <daniel@gluo.nz>
|
|
flush_dcache_page() is not a noop on arm, but we were using
virt_to_page() instead of vmalloc_to_page() for an address on the kernel
stack - vmalloc memory, leading to an oops in flush_dcache_page().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
The only difference key_type_logon and key_type_user is that
key_type_logon keys can't be read by userspace.
However, userspace has actually been adding keys to both the logon and
user keychains, because userspace fsck requires the keychain interface -
so we might as well just use user and drop the logon keychain.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
- Drop old unneeded parameter for whether we're in initial GC - which
was from when btree updates had to be done differently before we
went RW.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Per Dave Chinner and the xfs folks, .writepage is no longer needed, and
it's better not to define it if .writepages is the intended path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Rust FFI lacks support for unnamed structs and unions. The space
saved in bch_option is not enough to be significant.
Signed-off-by: Brett Holman <bholman.devel@gmail.com>
|
|
This is pretty expensive, and we've tested sufficiently with it now that
it doesn't need to be on by default.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
When merging extents, we have to check that we won't overflow size
fields in any CRC entries - but the check for this was wrong, because in
the loop it was in we weren't keeping a pointer to the (packed, encoded)
CRC field.
Fix this by moving it to its own loop.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Bkeys have gotten a lot bigger since this code was written and now are
often formatted across multiple lines - while the reason a bkey is
invalid will still be short and fit on a single line. This patch prints
the error bfore the bkey, making it a bit more readable.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
journal_iters_fix() was incorrectly rewinding iterators past keys they
had already returned, leading to those keys being double counted in the
bch2_gc() path - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
btree updates before going RW are expensive if they're in random order,
since they use the list of keys for journal replay to insert, which is
just a gap buffer.
This patch improves the bucket invalidate path so that if
bch2_check_lrus() hasn't finished it only prints warnings instead of
doing an emergency shutdown, which means we can now set BCH_FS_MAY_GO_RW
before bch2_check_lrus().
Also, the filesystem state bits are reorganized a bit.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This adds a new superblock field for persisting counters
and adds a sysfs interface in counters/ exposing these counters.
The superblock field is ignored by older versions letting us avoid
an on disk version bump.
Each sysfs file outputs a counter that tracks since filesystem
creation and a counter for the current mount session.
Signed-off-by: Daniel Hill <daniel@gluo.nz>
|
|
Delete some obsolete tracepoints, organize alloc tracepoints better,
make a few tracepoints more consistent.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We shouldn't kick journal reclaim unnecessarily, it's got its own timer
for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Can't take btree node locks while holding btree_reserve_cache_lock - it
would be nice if we could check this with lockdep.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|