Age | Commit message | Author |
|
MAP_END_KEY was created for cases (like cache_lookup_fn) where, even if we
_don't_ find the key we're looking for, we still need to do something with that
btree node.
This has since come up in various places (inode/dirent creation in particular),
but it was only just now that I realized that in all these cases what we really
want to do is iterate over the _keyspace_, not just the keys that happen to be
present.
So, that's what MAP_HOLES does now - the code that implements it is gross, but
it _considerably_ simplifies all the users.
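To illustrate the difference, here's a minimal userspace sketch (the types,
names and hole synthesis are purely illustrative - this is not the bcache
iterator API): with MAP_HOLES the map fn also gets called for the gaps between
the keys that are present, so it sees the whole keyspace.

#include <stdio.h>

/* Illustrative model only, not the bcache btree iterator. */
struct extent { unsigned start, end; };		/* [start, end) */

typedef void (*map_fn)(const struct extent *e, int is_hole);

/*
 * Walk the whole keyspace [0, keyspace_end), synthesizing hole extents for
 * the gaps between the keys that are actually present.
 */
static void map_keyspace(const struct extent *keys, unsigned nr,
			 unsigned keyspace_end, map_fn fn)
{
	unsigned pos = 0;

	for (unsigned i = 0; i < nr; i++) {
		if (keys[i].start > pos) {
			struct extent hole = { pos, keys[i].start };
			fn(&hole, 1);		/* MAP_HOLES-style callback */
		}
		fn(&keys[i], 0);
		pos = keys[i].end;
	}

	if (pos < keyspace_end) {
		struct extent hole = { pos, keyspace_end };
		fn(&hole, 1);
	}
}

static void print_fn(const struct extent *e, int is_hole)
{
	printf("%s [%u, %u)\n", is_hole ? "hole" : "key ", e->start, e->end);
}

int main(void)
{
	struct extent keys[] = { { 8, 16 }, { 32, 40 } };

	map_keyspace(keys, 2, 64, print_fn);
	return 0;
}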
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Previously, if a btree map fn invalidated an iterator, it would force
bch_btree_map_keys()/bch_btree_map_nodes() to redo the lookup by returning
-EINTR.
But we're going to rework bch_btree_map_keys() to make sure it passes every key
to the map fn precisely once, so we need to distinguish cases where the map fn
does need to be called again (for which it will still return -EINTR) from cases
where it's done with that key but the iterator was still invalidated.
This also means the map fn doesn't need to know whether bch_btree_insert_node(),
or something else it calls, invalidated the iterator.
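A toy userspace model of the distinction (MAP_RETRY/MAP_CONTINUE and the driver
loop are made up for illustration; only -EINTR comes from the real code):
returning -EINTR means "call me again for this key", while merely invalidating
the iterator means "redo the lookup, but move on to the next key".

#include <errno.h>
#include <stdio.h>

#define MAP_CONTINUE	0		/* done with this key		*/
#define MAP_RETRY	(-EINTR)	/* call again for the same key	*/

static int map_fn(int key, int *iter_valid)
{
	static int prepared;	/* stand-in for "insert not prepared yet" */

	if (!prepared) {
		prepared = 1;
		*iter_valid = 0;	/* we invalidated the iterator...     */
		return MAP_RETRY;	/* ...and still need this key again   */
	}

	printf("handled key %d\n", key);
	prepared = 0;
	*iter_valid = 0;		/* invalidated, but done with the key */
	return MAP_CONTINUE;
}

int main(void)
{
	int keys[] = { 10, 20, 30 };
	int i = 0, iter_valid = 1;

	while (i < 3) {
		int ret = map_fn(keys[i], &iter_valid);

		if (!iter_valid)
			iter_valid = 1;	/* stand-in for redoing the lookup */

		if (ret != MAP_RETRY)
			i++;		/* each key is passed precisely once */
	}
	return 0;
}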
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Bucket allocation in bcache used to be asynchronous, ages and ages ago, but for
a while I was making various asynchronous stuff synchronous.
With moving gc and now tiering becoming more important, and also the addition of
refcounted struct open_bucket to get rid of the bucket refcount, this has turned
out to be impractical - to avoid deadlocks we'd need a ridiculous number of
workqueues.
We can avoid the deadlocks by going back to making various things asynchronous -
the conversion is pretty straightforward.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
If moving GC runs out of buckets to write btree nodes to, it will block
waiting for the allocator to produce more buckets. But the allocator
might be waiting for btree GC to finish, which will not yield any more
buckets until moving GC is done, etc.
To avoid this scenario, we reserve an equal number of buckets for btree
node writes from moving GC as the size of the free_inc list. This
ensures that for every btree node that's re-written from moving GC, a
bucket is returned to free_inc. So if moving GC uses up its entire
reserve, it will free up free_inc, allowing the allocator to write out
prios and gens and re-distribute free_inc among the various free lists.
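The sizing rule boils down to something like the sketch below (the field names
are illustrative, not the actual struct layout):

/*
 * One reserved bucket per free_inc slot: every btree node that moving GC
 * rewrites out of the reserve frees its old bucket back to free_inc, so even
 * if the reserve is completely used up, free_inc fills and the allocator can
 * write prios/gens and refill the free lists.
 */
struct gc_reserve_example {
	unsigned free_inc_size;			/* slots in the free_inc fifo */
	unsigned moving_gc_btree_reserve;	/* buckets for moving GC btree writes */
};

static void size_moving_gc_reserve(struct gc_reserve_example *r)
{
	r->moving_gc_btree_reserve = r->free_inc_size;
}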
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This allows us to have a per-cache device PD controller, which is closer
to the behavior that we want here.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
The previous algorithm always set read_prio to the max, and then decremented
all bucket prios every so many bytes. This required looping over all the
buckets, which could take a while as cache sizes get larger. The new algorithm
instead uses a circular clock, a u16 that's incremented every so often, and
sets a bucket's read_prio to this clock when the bucket is read from. This
doesn't require looping over all the buckets to increment the clock every time.
Occasionally, because data may be in the cache for an arbitrary amount of time,
we may need to loop over all the buckets to rescale the prios and create more
space for new prios.
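A self-contained model of the scheme (constants and names are illustrative,
not the actual bcache code):

#include <stdint.h>
#include <stdio.h>

#define NBUCKETS	8
#define CLOCK_SECTORS	1024	/* bump the clock hand every N sectors read */

struct bucket { uint16_t read_prio; };

static struct bucket buckets[NBUCKETS];
static uint16_t clock_hand;
static unsigned sectors_until_bump = CLOCK_SECTORS;

static void read_from_bucket(unsigned b, unsigned sectors)
{
	buckets[b].read_prio = clock_hand;	/* O(1), no loop over buckets */

	if (sectors >= sectors_until_bump) {
		clock_hand++;
		sectors_until_bump = CLOCK_SECTORS;

		/*
		 * Rare slow path: when the u16 clock is about to wrap we do
		 * loop over all the buckets, rescaling prios to make space.
		 */
		if (clock_hand == UINT16_MAX) {
			for (unsigned i = 0; i < NBUCKETS; i++)
				buckets[i].read_prio /= 2;
			clock_hand /= 2;
		}
	} else {
		sectors_until_bump -= sectors;
	}
}

int main(void)
{
	for (unsigned i = 0; i < 10000; i++)
		read_from_bucket(i % NBUCKETS, 200);

	printf("clock=%u prio[0]=%u\n", (unsigned) clock_hand,
	       (unsigned) buckets[0].read_prio);
	return 0;
}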
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Need to redo readahead.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Generational grouping is a technique that attempts to keep hot and cold data
separate. It essentially just groups data based on age, so that data of a
similar age ends up together. Previous patches sorted the data we are
compacting into generations, stored in GC_GEN. This patch does the actual
writing to different buckets.
To allow for generation sorting, moving gc needed its own sector allocation,
as the standard allocator sticks new writes into whatever bucket pops out of
the lru. gc_alloc_sectors is a new function created to handle this allocation
for moving gc. It takes the GC_GEN and allocates sectors in the appropriate
bucket for that gen. Each cache keeps an array of gc_open_buckets to allocate
sectors from.
To mark the use of the gc sector allocator, data_insert_op received a new
field, moving_gc, to denote such a write.
note: removed write_prio from bch_alloc_sectors because it was only used to
distinguish gc writes from others. This is not needed because gc writes use a
different allocator.
note: this doesn't support multiple cache devices. It will only move ptr[0]
and ignore the rest.
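A rough userspace model of the per-generation allocation (names and sizes are
illustrative, not the actual code):

#include <stdio.h>

#define NR_GC_GENS	4
#define BUCKET_SECTORS	128

struct gc_open_bucket {
	unsigned bucket_nr;		/* which bucket we're filling	*/
	unsigned sectors_free;		/* sectors left in that bucket	*/
};

static struct gc_open_bucket gc_buckets[NR_GC_GENS];
static unsigned next_bucket = 1;

/*
 * Allocate @sectors for data whose GC_GEN is @gen: data of the same
 * generation gets packed into the same bucket, keeping hot and cold apart.
 */
static unsigned gc_alloc_sectors(unsigned gen, unsigned sectors)
{
	struct gc_open_bucket *b = &gc_buckets[gen % NR_GC_GENS];

	if (b->sectors_free < sectors) {
		b->bucket_nr	= next_bucket++;	/* open a fresh bucket */
		b->sectors_free	= BUCKET_SECTORS;
	}

	b->sectors_free -= sectors;
	return b->bucket_nr;
}

int main(void)
{
	printf("gen 0 -> bucket %u\n", gc_alloc_sectors(0, 32));
	printf("gen 3 -> bucket %u\n", gc_alloc_sectors(3, 32));
	printf("gen 0 -> bucket %u\n", gc_alloc_sectors(0, 32));
	return 0;
}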
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This mostly involves a new function to handle the submission of bios
to multiple ptrs in a key. It simply clones the original bio, and submits
it to all the devs a key points to.
The moving_gc write path needed changes to handle moving a single key
while preserving ptrs to other devices, and to handle the possibility of
moving multiple keys. To do this, I detached the allocation paths of
foreground and gc writes. Perhaps later they can be realigned, but for
now this is simplest.
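The submission path is roughly the sketch below (written against a 4.x-era
block layer for illustration, not the exact kernel or the out-of-tree code
this sits on; KEY_PTRS()/PTR_CACHE()/PTR_OFFSET() are the upstream bcache
macros):

static void submit_to_all_ptrs(struct cache_set *c, struct bkey *k,
			       struct bio *orig)
{
	unsigned i;

	for (i = 0; i < KEY_PTRS(k); i++) {
		struct bio *clone = bio_clone_fast(orig, GFP_NOIO, c->bio_split);

		clone->bi_bdev		 = PTR_CACHE(c, k, i)->bdev;
		clone->bi_iter.bi_sector = PTR_OFFSET(k, i);

		/* orig's completion now also waits for this clone: */
		bio_chain(clone, orig);
		generic_make_request(clone);
	}

	/* Drop the submitter's count; orig completes once all clones do. */
	bio_endio(orig);
}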
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
bucket->pin has been a _continual_ source of consternation and bugs; it prevents
buckets from being garbage collected (originally it was for preventing buckets
from being reused while we were reading from them too), but the ownership
semantics were always... hazy, at best. But it's finally gone!
Now, struct open_bucket is the primary mechanism for preventing something from
being garbage collected until a pointer to it ends up in the btree or wherever
garbage collection will eventually find it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
More prep work for killing bucket->pin...
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Prep work for finally getting rid of bucket->pin!
The way we're going to get rid of bucket->pin is by having buckets owned by
something that the garbage collector can find until after we've inserted the new
keys that point to that bucket into the btree.
This adds code for allocating struct open_buckets, and freeing them when their
refcount goes to 0, and reworks bch_alloc_sectors() for the new way of
allocating buckets and to pass the pointer to the struct open_bucket back to
bch_data_insert_start().
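The shape of it is roughly this sketch (struct layout and helper names are
illustrative, locking is elided): the refcount pins the open_bucket on a list
the garbage collector can walk, and the final put - which happens only after
the index update - frees it.

struct open_bucket {
	atomic_t		ref;	/* writers + unfinished index updates */
	struct list_head	list;	/* per-cache-set list, so GC can find us */
	BKEY_PADDED(key);		/* bucket pointer + sectors allocated */
};

static void open_bucket_get(struct open_bucket *b)
{
	atomic_inc(&b->ref);
}

/*
 * Called once the keys pointing into this bucket are in the btree, i.e. once
 * garbage collection has another way of finding the bucket.
 */
static void open_bucket_put(struct open_bucket *b)
{
	if (atomic_dec_and_test(&b->ref)) {
		list_del(&b->list);
		kfree(b);
	}
}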
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
The next patch is going to make struct open_bucket something with a refcount
that gets allocated, and the refs will be dropped after we do the index update
(in bch_data_insert_keys()); we can't drop the ref (i.e. free it) from the
same workqueue we do the allocation from - that would block frees - so add a new
workqueue for bch_data_insert_keys().
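Something along these lines (the workqueue name and the op->work field are
illustrative):

static struct workqueue_struct *bch_index_update_wq;

static int bch_index_update_wq_init(void)
{
	bch_index_update_wq = alloc_workqueue("bch_index_update",
					      WQ_MEM_RECLAIM, 0);
	return bch_index_update_wq ? 0 : -ENOMEM;
}

/*
 * bch_data_insert_keys() then runs from its own workqueue, so dropping
 * open_bucket refs never queues behind the allocation-side work:
 *
 *	queue_work(bch_index_update_wq, &op->work);
 */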
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
bch_keybuf_check_overlapping() is used as an optimization to keep copygc, when
possible, from moving around data that we're about to overwrite (note that it's
not an optimization when used on the writeback keybuf! There it's critical for
cache coherency).
Anyways, it goes with writing data to the cache, so stick it there.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Having it implicitly set by whether or not the key is dirty is terrible,
and not even what we want for other users of bch_data_insert().
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Changed bch_data_insert_start to use a key for header info tracking, so it can
bkey_copy into the new key; also added bch_data_insert_op_init() to ensure
required fields are initialized.
We're ending up with various places that need to write some data to the cache,
but already have the key that it should be inserted with (e.g. copygc, the
upcoming tiering code, potentially various other fun stuff) - so instead of
taking a key, breaking the fields out to set up data_insert_op, and having
data_insert() reassemble them... just use a damn bkey.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Added foreground and gc write tracking to the ewma cache
stats. Also outputs a percentage of foreground writes.
Signed-off-by: Nicholas Swenson <nks@daterainc.com>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Trying to pull more stuff out of bcache.h
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
struct cache_set is way too big, but we can at least _attempt_ to organize it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Now, if you ask for MAP_END_KEY you get passed NULL for the key at the end of
each btree node - this seemed uglier to me at the time than what was done before
this patch, but the old behaviour led to a bunch of bugs in the new inode/dirent
code that's still out of tree. Now, if MAP_END_KEY is misused it'll be caught
right away with a null pointer deref.
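So a map fn used with MAP_END_KEY now looks roughly like this (the helpers are
hypothetical):

static int example_map_fn(struct btree_op *op, struct btree *b, struct bkey *k)
{
	if (!k)				/* end of this btree node */
		return finish_node(op, b);

	return handle_key(op, b, k);
}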
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
The logic for not asking for too much journal space at a time isn't needed
anymore because of the journalling rework in the last patch - for leaf nodes,
bch_btree_insert_keys() only asks for space in the journal for one key at a
time.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Previously, journalling an index update and doing the index update were two
different operations. This was problematic even in the current code, and was
going to be a major issue for future work; basically, any index update that
required traversing and locking the btree before we know what's actually being
done (which currently includes replace operations) couldn't make use of the
journal.
Now, any index update that uses bch_btree_insert_node() gets journalled - i.e.
everything (at least to leaf nodes, for now).
This also means the order that index updates happen in is preserved in the
journal, which was a bug waiting to happen (if it wasn't a bug already).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Prep work for moving where keys are journalled to within the btree insertion
code: This adds bch_journal_write_get() and bch_journal_write_put(), which in
the next patch will be used from the btree code to journal keys immediately
after they've been added to a btree node, while that node is still locked.
This also does some general refactoring, and changes the journalling code to
not require a workqueue for anything important (in particular, the next journal
write, if one needed to go out, previously had to be kicked off from the
system_wq).
This will help to avoid deadlocks when the journalling code is being used from
more interesting contexts.
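The intended usage pattern is roughly the sketch below (the signatures are
assumptions, since the real helpers are out of tree; journal_write_add_keys()
is a hypothetical helper):

static void journal_keys_after_insert(struct cache_set *c, struct btree *b,
				      struct keylist *keys)
{
	struct journal_write *w;

	/*
	 * b is still write-locked here, so keys hit the journal in exactly
	 * the order the index updates were applied.
	 */
	w = bch_journal_write_get(c, bch_keylist_bytes(keys));
	if (w) {
		journal_write_add_keys(w, keys);
		bch_journal_write_put(c, w);	/* may kick off the write */
	}
}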
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Remove unneeded NULL test.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@ expression x; @@
-if (x != NULL)
\(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This reverts commit 77b5a08427e87514c33730afc18cd02c9475e2c3 - the patch was
never mailed out to any mailing lists or the maintainer.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
bio_free_pages() was introduced in commit 1dfa0f68c040
("block: add a helper to free bio bounce buffer pages");
now that it is exported, we can reuse it in other modules.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokenness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.
No intended functional changes in this commit.
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Separate the op from the rq_flag_bits and have bcache
set/get the bio using bio_set_op_attrs/bio_op.
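For example (illustrative only):

static void prep_write_bio(struct bio *bio)
{
	/* Set the op (and any op flags) through the helper... */
	bio_set_op_attrs(bio, REQ_OP_WRITE, 0);

	/* ...and read it back with bio_op() instead of testing bi_rw bits: */
	WARN_ON(bio_op(bio) != REQ_OP_WRITE);
	WARN_ON(!op_is_write(bio_op(bio)));
}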
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
We currently set REQ_WRITE/WRITE for all non-READ IOs
like discard, flush, writesame, etc. In the next patches, where we
no longer set up the op as a bitmap, we will not be able to
detect the direction of an operation like writesame by testing if REQ_WRITE is
set.
This has bcache use the op_is_write helper which will do the right
thing.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
|
|
The bcache driver has always accepted arbitrarily large bios and split
them internally. Now that every driver must accept arbitrarily large
bios this code isn't necessary anymore.
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message]
Signed-off-by: Dongsu Park <dpark@posteo.net>
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not being persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
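For example, a completion handler now looks like this (the handler name is
illustrative):

static void example_endio(struct bio *bio)
{
	/* The errno lives in the bio itself rather than being passed in: */
	if (bio->bi_error)
		pr_err("IO error: %d\n", bio->bi_error);

	bio_put(bio);
}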
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This is horribly confusing; it breaks the flow of the code without
being apparent in the caller.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
|
|
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.
This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. C files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.
v2: fs/fat build failure fixed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Struct bio has a reference count that controls when it can be freed.
Most use cases allocate the bio, which then returns with a single reference to
it, do the IO, and then drop that single reference. We can remove this
atomic_dec_and_test() in the completion path if nobody else is holding a
reference to the bio.
If someone does call bio_get() on the bio, then we flag the bio as now having a
valid reference count, and we must properly honor that count when the bio is
being put.
Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Use generic io stats accounting helper functions (generic_{start,end}_io_acct)
to simplify io stat accounting.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Kent Overstreet <kmo@datera.io>
Signed-off-by: Jens Axboe <axboe@fb.com>
|