path: root/drivers/md/bcache/request.c
Age    Commit message    Author
2017-01-18bcache: MAP_END_KEY -> MAP_HOLESKent Overstreet
MAP_END_KEY was created for cases (like cache_lookup_fn) where if we _don't_ find the key we're looking for, we still need to do something with that btree node. This has since come up in various places (inode/dirent creation in particular), but it was only just now that I realized that in all these cases what we really want to do is iterate over the _keyspace_, not just the keys that happen to be present. So, that's what MAP_HOLES does now - the code that implements it is gross, but it _considerably_ simplifies all the users. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
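For illustration only - the helper names below (bkey_is_hole(), insert_new_dirent_here()) are hypothetical, not the actual bcache-dev API - the practical difference is that a MAP_HOLES map fn sees every slice of the keyspace, so gaps show up as hole keys instead of being silently skipped:

    /* Hypothetical sketch: with MAP_HOLES the map fn is handed the gaps too. */
    static int create_dirent_fn(struct btree_op *op, struct btree *b,
                                struct bkey *k)
    {
            if (bkey_is_hole(k))                    /* hypothetical predicate */
                    return insert_new_dirent_here(op, b, k);

            /* key already present: keep walking the keyspace */
            return MAP_CONTINUE;
    }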
2017-01-18bcache: better iterator invalidationKent Overstreet
Previously, if a btree map fn invalidated an iterator, it would force bch_btree_map_keys()/bch_btree_map_nodes() to redo the lookup by returning -EINTR. But we're going to rework bch_btree_map_keys() to make sure it passes every key to the map fn precisely once, so we need to differentiate cases where the map fn does need to be called again (for which it will still return -EINTR) from cases where it's done with that key but the iterator was still invalidated. This also means the map fn doesn't need to know whether bch_btree_insert_node() or something else it calls invalidated the iterator. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
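A hedged sketch of the resulting contract (MAP_CONTINUE_RESEEK and do_index_update() are illustrative names, not necessarily what the patch adds): the map fn now distinguishes "call me again with this key" from "done with this key, but re-seek the iterator first":

    #define MAP_CONTINUE_RESEEK     2       /* illustrative, not the real constant */

    static int example_map_fn(struct btree_op *op, struct btree *b,
                              struct bkey *k)
    {
            int ret = do_index_update(op, b, k);    /* hypothetical helper */

            if (ret == -EINTR)
                    return -EINTR;              /* redo the lookup, pass me this key again */
            if (ret == -EAGAIN)
                    return MAP_CONTINUE_RESEEK; /* done with k, but the iterator was invalidated */
            return MAP_CONTINUE;                /* done with k, iterator still valid */
    }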
2017-01-18bcache: Better sanity checking in bch_data_insert()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Make bucket allocation asynchronous againKent Overstreet
Bucket allocation in bcache used to be asynchronous, ages and ages ago, but for a while I was making various asynchronous stuff synchronous. With moving gc and now tiering becoming more important, and also the addition of refcounted struct open_bucket to get rid of the bucket refcount, this has turned out to be impractical - to avoid deadlocks we'd need a ridiculous number of workqueues. We can avoid the deadlocks by going back to making various things asynchronous - the conversion is pretty straightforward. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: separate btree node reserve for moving GCSlava Pestov
If moving GC runs out of buckets to write btree nodes to, it will block waiting for the allocator to produce more buckets. But the allocator might be waiting for btree GC to finish, which will not yield any more buckets until moving GC is done, etc. To avoid this scenario, we reserve an equal number of buckets for btree node writes from moving GC as the size of the free_inc list. This ensures that for every btree node that's re-written from moving GC, a bucket is returned to free_inc. So if moving GC uses up its entire reserve, it will free up free_inc, allowing the allocator to write out prios and gens and re-distribute free_inc among the various free lists. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: put moving GC into its own per-cache threadSlava Pestov
This allows us to have a per-cache device PD controller, which is closer to the behavior that we want here. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: better ewma_add() macroKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
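The log doesn't show the macro itself; as a reference point, the usual fixed-point EWMA update that bcache's stats are built on looks roughly like this (shape only - the exact weights and shift handling in the reworked macro may differ):

    /*
     * Fixed-point exponentially weighted moving average: the accumulator is
     * kept scaled by 2^factor so integer math preserves fractional precision,
     * and each new sample is folded in with weight 1/weight.
     */
    #define ewma_add(ewma, val, weight, factor)                     \
    ({                                                              \
            (ewma) *= (weight) - 1;                                 \
            (ewma) += (val) << (factor);                            \
            (ewma) /= (weight);                                     \
            (ewma) >> (factor);                                     \
    })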
2017-01-18bcache: priorities algorithm changed to CLOCKNicholas Swenson
The previous algorithm always set read_prio to max, and then decremented all bucket prios every so many bytes. This required looping over all the buckets, which could take a while as cache sizes get larger. The new algorithm instead uses a circular clock - a u16 that's incremented every so often - and sets read_prio to this clock when a bucket is read from. This doesn't require looping over all the buckets to increment the clock every time. Occasionally, because data may be in the cache for an arbitrary amount of time, we may need to loop over all the buckets to rescale the prios and create more space for new prios. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
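A sketch of the clock scheme described above, with illustrative names (prio_clock, rescale_all_bucket_prios()) that are assumptions rather than the actual fields: reads stamp the bucket with the current clock value, the clock ticks as IO happens, and only the rare wraparound path touches every bucket:

    /* Illustrative only: field and helper names are not the actual ones. */
    static void bucket_mark_read(struct cache_set *c, struct bucket *b)
    {
            b->read_prio = c->prio_clock;   /* stamp with the current clock */
    }

    static void prio_clock_tick(struct cache_set *c)
    {
            if (++c->prio_clock != U16_MAX)
                    return;

            /* Rare slow path: rescale every bucket's prio to reclaim clock */
            /* space, instead of decrementing all prios on every tick.      */
            rescale_all_bucket_prios(c);    /* hypothetical helper */
    }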
2017-01-18bcache: mark extents with 0 ptrs deletedKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Add bch_read()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Store inodes in a btreeKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: multiple btreesKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: New bkey versionKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Add cache_promote() (XXX: unfinished)Kent Overstreet
need to redo readahead Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Treat extra pointers as cached copiesKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Copy data to slower tier in backgroundKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Assign cache devices to different tiersKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Generational garbage collectionNicholas Swenson
Generational grouping is a technique that attempts to keep hot and cold data separate. It essentially groups data by age, so that data of similar age stays together. Previous patches have sorted the data we are compacting into generations, stored in GC_GEN. This patch does the actual writing to different buckets. To allow for generation sorting, moving gc needed its own sector allocation, as the standard allocator sticks new writes into whatever bucket pops out of the LRU. A new function, gc_alloc_sectors, handles this allocation for moving gc: it takes the GC_GEN and allocates sectors in the appropriate bucket for that gen. Each cache keeps an array of gc_open_buckets to allocate sectors from. To mark the use of the gc sector allocator, data_insert_op received a new field, moving_gc, to denote such a write. note: removed write_prio from bch_alloc_sectors because it was only used to distinguish gc writes from others. This is not needed because gc writes use a different allocator. note: this doesn't support multiple cache devices. It will only move ptr[0] and ignore the rest. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
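For illustration (gc_pick_bucket() and the gc_open_buckets array are assumptions based on the description above, not the actual code), the allocation boils down to indexing per-cache open buckets by the key's GC_GEN instead of taking whatever the LRU hands out:

    /* Hypothetical sketch of generation-aware sector allocation. */
    static struct open_bucket *gc_pick_bucket(struct cache *ca, struct bkey *k)
    {
            unsigned gen = GC_GEN(PTR_BUCKET(ca->set, k, 0));

            /* one open bucket per generation, so data of similar age */
            /* ends up physically grouped together                    */
            return ca->gc_open_buckets[gen];        /* hypothetical array */
    }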
2017-01-18bcache: Replicate writes to multiple cache devsNicholas Swenson
This mostly involves a new function to handle the submission of bios to multiple ptrs in a key. It simply clones the original bio and submits it to all the devices a key points to. The moving_gc write path needed changes to handle moving a single key while preserving ptrs to other devices, and to handle the possibility of moving multiple keys. To do this, I detached the allocation paths of foreground and gc writes. Perhaps later they can be realigned, but for now this is simplest. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: bch_extent_normalize()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: bch_extent_pick_ptr()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Kill bucket pinKent Overstreet
bucket->pin has been a _continual_ source of consternation and bugs; it prevents buckets from being garbage collected (originally it was for preventing buckets from being reused while we were reading from them too), but the ownership semantics were always... hazy, at best. But it's finally gone! Now, struct open_bucket is the primary mechanism for preventing something from being garbage collected until a pointer to it ends up in the btree or wherever garbage collection will eventually find it. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: GC now marks open bucketsKent Overstreet
More prep work for killing bucket->pin... Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Add refcount to open_bucketsKent Overstreet
Prep work for finally getting rid of bucket->pin! The way we're going to get rid of bucket pin is by having buckets owned by something that the garbage collector can find, until after we've inserted the new keys that point to that bucket into the btree. This adds code for allocating struct open_buckets and freeing them when their refcount goes to 0, and reworks bch_alloc_sectors() for the new way of allocating buckets and to pass the pointer to the struct open_bucket back to bch_data_insert_start(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
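A minimal sketch of the ownership pattern this sets up - the struct layout and release_open_bucket() are assumptions, not the actual code: the write path gets the open_bucket back from the allocator with a reference held, and the reference is only dropped after the keys pointing into that bucket are in the btree:

    /* Hypothetical sketch of refcounted open buckets. */
    struct open_bucket {
            atomic_t        ref;
            /* bucket, remaining sectors, etc. */
    };

    static void open_bucket_put(struct cache_set *c, struct open_bucket *ob)
    {
            if (atomic_dec_and_test(&ob->ref))
                    release_open_bucket(c, ob);     /* hypothetical: back to the freelist */
    }

    /* bch_alloc_sectors() hands the open_bucket back with a ref held;  */
    /* bch_data_insert_keys() drops it only after the index update, so  */
    /* GC can always find the bucket via the open_bucket until then.    */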
2017-01-18bcache: Add a workqueue for btree insertionsKent Overstreet
The next patch is going to make struct open_bucket something with a refcount that gets allocated, and the refs will be dropped after we do the index update (in bch_data_insert_keys()); we can't drop the ref (i.e. free it) from the same workqueue where we do the allocation - that would block frees - so add a new workqueue for bch_data_insert_keys(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
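Mechanically this is just a dedicated workqueue; a sketch under assumed names (insert_wq, the workqueue string, and insert_work are illustrative):

    /* Hypothetical setup: index insertion gets its own workqueue so that  */
    /* dropping open_bucket refs there can never be queued behind (and     */
    /* blocked by) the allocation work that is waiting for those frees.    */
    c->insert_wq = alloc_workqueue("bcache_insert", WQ_MEM_RECLAIM, 0);
    if (!c->insert_wq)
            return -ENOMEM;

    /* later, from the write path: */
    queue_work(c->insert_wq, &op->insert_work);     /* runs bch_data_insert_keys() */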
2017-01-18bcache: Centralize bch_keybuf_check_overlapping()Kent Overstreet
bch_keybuf_check_overlapping() is used as an optimization to keep copygc, where possible, from moving around data that we're about to overwrite (note that it's not an optimization when used on the writeback keybuf! there it's critical for cache coherency). Anyways, it goes with writing data to the cache, so stick it there. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Make wait explicitly set by data_insert callerNicholas Swenson
Having it implicitly set by whether or not the key is dirty is terrible, and not even what we want for other users of bch_data_insert(). Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: bch_data_insert path refactoringKent Overstreet
Changed bch_data_insert_start to use a key for header info tracking, so it can bkey_copy into the new key; also add bch_data_insert_op_init() to ensure required fields are initialized. We're ending up with various places that need to write some data to the cache but already have the key that it should be inserted with (e.g. copygc, the upcoming tiering code, potentially various other fun stuff) - so instead of taking a key, breaking the fields out to set up data_insert_op, and having data_insert() reassemble them... just use a damn bkey. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: foreground/background write ratio in sysfsNicholas Swenson
Added foreground and gc write tracking to the ewma cache stats. Also outputs a percentage of foreground writes. Signed-off-by: Nicholas Swenson <nks@daterainc.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Split out alloc.hKent Overstreet
Trying to pull more stuff out of bcache.h Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Reorganize struct cache_setKent Overstreet
struct cache_set is way too big, but we can at least _attempt_ to organize it. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: make MAP_END_KEY less error proneKent Overstreet
Now, if you ask for MAP_END_KEY you get passed NULL for the key at the end of each btree node - this seemed uglier to me at the time than what was done before this patch, but the old behaviour led to a bunch of bugs in the new inode/dirent code that's still out of tree. Now, if MAP_END_KEY is misused it'll be caught right away with a null pointer deref. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Drop unneeded bch_keylist_realloc() wrapperKent Overstreet
The logic for not asking for too much journal space at a time isn't needed anymore because of the journalling rework in the last patch - for leaf nodes, bch_btree_insert_keys() only asks for space in the journal for one key at a time. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Journal when we insert keys, not beforeKent Overstreet
Previously, journalling an index update and doing the index update were two different operations. This was problematic even in the current code, and was going to be a major issue for future work; basically, any index update that required traversing and locking the btree before we know what's actually being done (which currently includes replace operations) couldn't make use of the journal. Now, any index update that uses bch_btree_insert_node() gets journalled - i.e. everything (at least to leaf nodes, for now). This also means the order that index updates happen in is preserved in the journal, which was a bug waiting to happen (if it wasn't a bug already). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: Massive journalling reworkKent Overstreet
Prep work for moving where keys are journalled to within the btree insertion code: This adds bch_journal_write_get() and bch_journal_write_put(), which in the next patch will be used from the btree code to journal keys immediately after they've been added to a btree node, while that node is still locked. This also does some general refactoring, and it changes the journalling code to not require a workqueue for anything important (in particular, previously the next journal write, if one needed to go out, had to be kicked off from the system_wq). This will help to avoid deadlocks when the journalling code is being used from more interesting contexts. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18bcache: drop null test before destroy functionsJulia Lawall
Remove unneeded NULL test. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x; @@ -if (x != NULL) \(kmem_cache_destroy\|mempool_destroy\|dma_pool_destroy\)(x); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
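Concretely, the transformation relies on these destroy functions already being no-ops on NULL; the pattern (shown here on bcache's bch_search_cache as a plausible example, not necessarily the exact hunk) is:

    /* before */
    if (bch_search_cache != NULL)
            kmem_cache_destroy(bch_search_cache);

    /* after: kmem_cache_destroy(), mempool_destroy() and dma_pool_destroy() */
    /* all return early when passed NULL, so the guard is redundant          */
    kmem_cache_destroy(bch_search_cache);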
2017-01-18bcache: Split out keybuf code into keybuf.cKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2017-01-18Revert "bcache: don't embed 'return' statements in closure macros"Kent Overstreet
This reverts commit 77b5a08427e87514c33730afc18cd02c9475e2c3 - the patch was never mailed out to any mailing lists or the maintainer. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2016-09-22block: export bio_free_pages to other modulesGuoqing Jiang
bio_free_pages was introduced in commit 1dfa0f68c040 ("block: add a helper to free bio bounce buffer pages"); we can reuse the function in other modules now that it is exported. Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@fb.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Shaohua Li <shli@fb.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
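Typical use in a caller (the endio handler below is illustrative): free the pages attached to a bio whose pages the caller allocated itself, then drop the bio:

    /* Illustrative completion handler for a bio whose pages we allocated. */
    static void example_bounce_endio(struct bio *bio)
    {
            bio_free_pages(bio);    /* frees every page attached to the bio's bvecs */
            bio_put(bio);
    }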
2016-08-07block: rename bio bi_rw to bi_opfJens Axboe
Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokenness linger, rename the member to force old and out-of-tree code to break at compile time instead of at runtime. No intended functional changes in this commit. Signed-off-by: Jens Axboe <axboe@fb.com>
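The rename itself is mechanical; a sketch of an affected call site (handle_flush() is a hypothetical helper, and the flag tested is just an example):

    /* old, now a compile error:          */
    /* if (bio->bi_rw & REQ_PREFLUSH) ... */

    /* new: */
    if (bio->bi_opf & REQ_PREFLUSH)
            handle_flush(bio);              /* hypothetical helper */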
2016-06-07block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSHMike Christie
To avoid confusion between REQ_OP_FLUSH, which is handled by request_fn drivers, and upper layers requesting the block layer perform a flush sequence along with possibly a WRITE, this patch renames REQ_FLUSH to REQ_PREFLUSH. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07bcache: use bio op accessorsMike Christie
Separate the op from the rq_flag_bits and have bcache set/get the bio using bio_set_op_attrs/bio_op. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
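In practice call sites stop or-ing an op into a raw flags word and go through the accessors instead; a sketch (the specific ops and flags are just examples, handle_discard() is hypothetical):

    /* set the op and flags through the accessor rather than writing the raw field */
    bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC);

    /* and read the op back with bio_op() instead of masking flag bits */
    if (bio_op(bio) == REQ_OP_DISCARD)
            handle_discard(bio);            /* hypothetical helper */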
2016-06-07bcache: use op_is_write instead of checking for REQ_WRITEMike Christie
We currently set REQ_WRITE/WRITE for all non-READ IOs like discard, flush, writesame, etc. In the next patches, where we no longer set up the op as a bitmap, we will not be able to detect the direction of an operation like writesame by testing whether REQ_WRITE is set. This has bcache use the op_is_write helper, which will do the right thing. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
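The change is essentially this pattern (account_write() is a hypothetical stand-in for whatever the call site does):

    /* old: relies on REQ_WRITE being set for every non-read IO, which */
    /* stops being true once the op is split out of the flag bits      */
    if (bio->bi_rw & REQ_WRITE)
            account_write(bio);

    /* new: asks the block layer whether the op moves data to the device */
    if (op_is_write(bio_op(bio)))
            account_write(bio);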
2015-11-07block: change ->make_request_fn() and users to return a queue cookieJens Axboe
No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>
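For a driver like bcache the change is mostly a signature update; the shape is roughly this (example_make_request() and handle_bio() are illustrative names):

    /* ->make_request_fn() now returns a blk_qc_t cookie; drivers with */
    /* nothing useful to report return BLK_QC_T_NONE.                  */
    static blk_qc_t example_make_request(struct request_queue *q, struct bio *bio)
    {
            handle_bio(q, bio);             /* hypothetical: existing submission logic */
            return BLK_QC_T_NONE;
    }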
2015-08-13bcache: remove driver private bio splitting codeKent Overstreet
The bcache driver has always accepted arbitrarily large bios and split them internally. Now that every driver must accept arbitrarily large bios, this code isn't necessary anymore. Cc: linux-bcache@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not being persistent when bios are queued up, and of not being passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
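After this change an error is reported by storing it in the bio before completion, roughly:

    /* old: the error was passed to the completion callback */
    /* bio_endio(bio, -EIO);                                 */

    /* new: the error travels with the bio itself, and gets  */
    /* propagated from child to parent when bios are chained */
    bio->bi_error = -EIO;
    bio_endio(bio);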
2015-07-11bcache: don't embed 'return' statements in closure macrosJens Axboe
This is horribly confusing, it breaks the flow of the code without it being apparent in the caller. Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Christoph Hellwig <hch@lst.de>
2015-06-02writeback: separate out include/linux/backing-dev-defs.hTejun Heo
With the planned cgroup writeback support, backing-dev related declarations will be more widely used across block and cgroup; unfortunately, including backing-dev.h from include/linux/blkdev.h makes cyclic include dependency quite likely. This patch separates out backing-dev-defs.h which only has the essential definitions and updates blkdev.h to include it. c files which need access to more backing-dev details now include backing-dev.h directly. This takes backing-dev.h off the common include dependency chain making it a lot easier to use it across block and cgroup. v2: fs/fat build failure fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-05bio: skip atomic inc/dec of ->bi_cnt for most use casesJens Axboe
Struct bio has a reference count that controls when it can be freed. The most common use case is allocating the bio, which then returns with a single reference to it, doing IO, and then dropping that single reference. We can remove the atomic_dec_and_test() in the completion path if nobody else is holding a reference to the bio. If someone does call bio_get() on the bio, we flag the bio as now having a valid reference count, and we must then properly honor the reference count when it's being put. Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24md/bcache: use generic io stats accounting functions to simplify io stat accountingGu Zheng
Use the generic io stats accounting helper functions (generic_{start,end}_io_acct) to simplify io stat accounting. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Kent Overstreet <kmo@datera.io> Signed-off-by: Jens Axboe <axboe@fb.com>
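A sketch of the resulting pattern (the signatures shown match the helpers as originally introduced; later kernels changed them, and d here stands in for the bcache device):

    unsigned long start_time = jiffies;

    /* on submission: account the IO count and sectors against the partition */
    generic_start_io_acct(bio_data_dir(bio), bio_sectors(bio),
                          &d->disk->part0);

    /* on completion: account the IO time elapsed since start_time */
    generic_end_io_acct(bio_data_dir(bio), &d->disk->part0, start_time);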