[[!toc levels=3 startlevel=2]]

# The Programmer's Guide to bcache:

This document is intended to cover the design and core concepts of bcache, and
serve as a guide to programmers trying to become familiar with the bcache
codebase. It'll try to cover how various features are built on top of the core
concepts and how everything fits together.

## Keys, values and indexes:

The core abstraction in bcache is a key value store ("the btree"). To the btree
code, the keys it indexes are just large, fixed size bitstrings/integers, with
opaque, variable sized values.

Aside from some low level machinery, essentially all metadata is stored as
key/value pairs:

 - Extents are one type of key where the value is a list of pointers to the
   actual data.

[...]

...other filesystems that have them - a variable sized chunk of data. From the
point of view of the index, they aren't positions, they're ranges or half open
intervals (note that a 0 size extent doesn't overlap with anything).

Bcache's extents are indexed by inode:offset, and their size is stored in the
size field in struct bkey. The offset and size are both in 512 byte sectors (as
are the pointer offsets). The offset field denotes the _end_ position of the
extent within the file - a key with offset 8 and size 8 points to the data for
sectors 0 through 7.

(This oddity is a result of bcache's btree being designed for extents first,
with non extent keys coming later - it makes searching for extents within a
certain range cleaner when iterating in ascending order.)

Inside the value is a list of one or more pointers - if there's more than one
pointer, they point to replicated data (or possibly one of the copies is on a
faster device and is considered cached).

[...]

...another btree). This does mean that we can't reuse any of the buckets in the
stripe until each of them is empty - but this shouldn't be much trouble; we
just have to teach copygc about it when it's considering which buckets to
evacuate.

#### Copying garbage collection:

#### Tiering:

## Btree internals:

At a high level, bcache's btree is a copy on write b+ tree. The main difference
between bcache's b+ tree and others is that the nodes are very large (256k is
typical) and log structured. Like other COW b+ trees, updating a node may
require recursively rewriting every node up to the root; however, most updates
(to both leaf nodes and interior nodes) can be done with only an append, until
we've written to the full amount of space we originally reserved for the node.

A single btree node log entry is represented as a header and a list of bkeys,
where the bkeys are all contiguous in memory and in sorted order:

    struct bset {
            /* some fields omitted */
            ...
            u16 u64s;
            struct bkey_packed start[0];
    };

Since bkeys are variable length, it's not possible to access keys randomly
without other data structures - we can only iterate sequentially via
bkey_next().
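To make that concrete, here's a minimal sketch of sequential iteration over a
bset. The layout of struct bkey_packed, bset_bkey_last() and process_key() are
simplified stand-ins for illustration (kernel-style u8/u64 typedefs assumed),
not the actual bcache definitions:

    /* Illustrative only - simplified versions of the real structures. */
    struct bkey_packed {
            u8 u64s;        /* size of combined key and value, in u64s */
            u8 format;
            /* key bits follow */
    };

    /* Keys are variable length, so stepping forward is pointer arithmetic: */
    static inline struct bkey_packed *bkey_next(struct bkey_packed *k)
    {
            return (struct bkey_packed *) ((u64 *) k + k->u64s);
    }

    /* One past the last key in the bset: */
    #define bset_bkey_last(i) \
            ((struct bkey_packed *) ((u64 *) (i)->start + (i)->u64s))

    static void walk_bset(struct bset *i)
    {
            struct bkey_packed *k;

            /* No random access - just a linear scan, key by key: */
            for (k = i->start; k != bset_bkey_last(i); k = bkey_next(k))
                    process_key(k);         /* hypothetical callback */
    }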
A btree node thus contains multiple independent bsets that must all be searched
and iterated over on lookup. At any given time there will be zero or more bsets
that have been written out, and a single dirty bset that new keys are being
inserted into.

As the btree is modified, we must maintain the invariant in memory that there
are no duplicates (keys that compare as equal), excluding keys marked as
deleted. When an insertion overwrites an existing key, we mark the existing key
as deleted (by setting k->type = KEY_TYPE_DELETED) - but until the entire node
is rewritten, the old key will still exist on disk. To handle this, when a
btree node is read, the first thing we do is a mergesort of all the bsets it
contains; as part of the mergesort, duplicate keys are found and the older
bkeys are dropped - carefully matching the changes we made in memory when doing
the insertions.

This hopefully explains how the lack of deletion as a primitive is a result of
the way the btree is implemented - it's not possible to delete a key except by
inserting a whiteout, which will be dropped when the btree node eventually
fills up and is rewritten.

Once a bset has been written out, it may also be sorted, in memory, with other
bsets that have also been written out - we do so periodically, so that a given
btree node will have only a few bsets in memory (at most three): the one being
inserted into will be at most 8 or 16k, and the rest roughly form a geometric
progression in size, so that sorting the entire node is relatively infrequent.

This resorting/compacting in memory is one of the main ways bcache is able to
use such large btree nodes efficiently. The other main trick is to take
advantage of the fact that bsets that have been written out are, aside from
resorts, constant; we precompute lookup tables that would be too inefficient to
use if they had to be modified for insertions. The bcache code refers to these
lookup tables as the auxiliary search trees.

### Locking

Bcache doesn't use read/write locks for btree nodes - the locks it uses have
three states: shared, intent and exclusive (SIX locks). Shared and exclusive
correspond to read and write, while intent is sort of in the middle - intent
locks conflict with other intent locks (as write locks do), but they don't
conflict with read locks.

The problem intent locks solve is that with a regular read/write lock, a read
lock can't be upgraded to a write lock - that would lead to deadlock when
multiple threads with read locks tried to upgrade. With a complicated enough
data structure, updates will need to hold write locks for exclusion with other
updates for much longer than the part where they do the actual modification
that needs exclusion from readers.

For example, consider the case of a btree split. The update starts at a leaf
node and discovers it has to do a split. But before starting the split, it has
to acquire a write lock on the parent node, primarily to avoid a deadlock with
other splits: it needs at least a read lock on the parent (roughly in order to
lock the path to the child node), but it couldn't then upgrade that read lock
to a write lock in order to update the parent with the pointers to the new
children, because that would deadlock with threads splitting sibling leaf
nodes.

Intent locks solve this problem. When doing a split, it suffices to acquire an
intent lock on the parent - write locks are only ever held while modifying the
in memory btree contents (a much shorter duration than the entire split, which
requires waiting for the new nodes to be written to disk).
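The conflict rules can be summarized as a small compatibility matrix. This is
an illustrative sketch of the semantics just described, not the actual SIX lock
implementation:

    #include <stdbool.h>

    /* Sketch of SIX lock state compatibility - names are hypothetical. */
    enum six_lock_type {
            SIX_LOCK_read,          /* shared */
            SIX_LOCK_intent,        /* excludes other intents, not readers */
            SIX_LOCK_write,         /* exclusive */
    };

    /* compatible[held][wanted]: may 'wanted' be granted while 'held' is held? */
    static const bool compatible[3][3] = {
            /*                     read   intent write */
            [SIX_LOCK_read]   = {  true,  true,  false },
            [SIX_LOCK_intent] = {  true,  false, false },
            [SIX_LOCK_write]  = {  false, false, false },
    };

An intent lock can thus be held for the whole duration of a split while readers
proceed, and converted to a write lock only for the brief in memory
modification.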
Intent locks with only three states do introduce another deadlock, though:

    Thread A                Thread B

    read      | Parent |    intent
    intent    | Child  |    intent

Thread B is splitting the child node: it has allocated new nodes and written
them out, and now needs to take a write lock on the parent in order to add the
pointers to the new nodes (after which it will free the old child).

Thread A just wants to insert into the child node: it has a read lock on the
parent node and has looked up the child node, and now it's waiting (on thread
B) to get an intent lock on the child.

But thread A has blocked thread B from taking the write lock it needs to update
the parent node, and thread B can't drop its intent lock on the child until
after the new nodes are visible and it has freed the child node.

The way this deadlock is handled is by enforcing (in bch_btree_node_get()) that
we drop read locks on parent nodes _before_ taking intent locks on child nodes.
This might cause us to race with the btree node being freed out from under us
before we lock it, so we check for that after grabbing the intent lock and redo
the traversal if necessary.

One other thing worth mentioning is that bcache's btree node locks have
embedded sequence numbers, which are incremented when taking and releasing
write locks (much like seqlocks). This allows us to aggressively drop locks
(because we'll usually be able to retake the lock), and we also use it for a
try_upgrade(): if we discover we need an intent lock (e.g. for a split, or
because the caller is inserting into a leaf node they didn't get an intent lock
for), we'll usually be able to get it without having to unwind and redo the
traversal.

### Journaling

Bcache's journal is foremost an optimization for the btree. COW btrees do not
require journals for on disk consistency - but having a journal allows us to
coalesce random updates across multiple btree nodes and flush the nodes
asynchronously.

The journal is a purely logical log: a list of insertions - bkeys - to reinsert
on recovery, in the same order they're present in the journal. Provided every
index update is journaled (a critical invariant here), redoing those insertions
is an idempotent operation. See bch_journal_replay_key() in journal.c - journal
replay uses the same insert path as everything else and doesn't know anything
about the structure of the btree.

It's critical that keys appear in the journal in the same order as the
insertions happened, so both are done under the btree node's write lock: see
bch_btree_insert_and_journal() in btree.c, which calls bch_bset_insert() at the
top (inserting into the btree node in memory) and bch_journal_add_keys() at the
bottom to journal the same key.

At this point in the btree insertion path, we've already marked any key(s) that
we overlapped with (possibly more than one, for extents) as deleted - i.e. the
btree node is inconsistent, so the insertion must happen before dropping our
write lock.

So we also have some machinery for journal reservations: we reserve some amount
of space in the journal (in units of u64s, as always) midway through the
insertion path (in bch_btree_insert_keys()). The journal reservation machinery
(as well as adding the keys to the journal) is entirely lockless in the
fastpath.
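A lockless reservation like this is typically a compare-and-exchange loop on
the journal buffer's fill level. The sketch below shows the general technique
only - journal_buf and journal_res_get() are hypothetical names, and the real
machinery also has to handle full buffers, sequence numbers and so on:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical, simplified journal buffer: */
    struct journal_buf {
            _Atomic unsigned offset;        /* current fill level, in u64s */
            unsigned capacity;              /* total size, in u64s */
    };

    /*
     * Reserve u64s worth of space without taking a lock: loop on a
     * compare-and-exchange of the fill level until we either claim our
     * slice or discover the buffer is full.
     */
    static bool journal_res_get(struct journal_buf *buf, unsigned u64s,
                                unsigned *res_offset)
    {
            unsigned old = atomic_load(&buf->offset), new;

            do {
                    if (old + u64s > buf->capacity)
                            return false;   /* slowpath: wait for a new buffer */
                    new = old + u64s;
            } while (!atomic_compare_exchange_weak(&buf->offset, &old, new));

            *res_offset = old;      /* caller copies its keys to this offset */
            return true;
    }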
### Sequential consistency

As mentioned in the first section on indexing, bcache's b+ tree provides the
guarantee that the ordering of updates is always preserved - whether to
different nodes or different btrees (i.e. dirents vs. inodes).

The journal helps here as well. Ordering is always preserved in the journal
itself: if a key at time t made it to the journal and was present after unclean
shutdown/recovery, all the keys journalled prior to time t will either be in
the journal, or in btree nodes that were flushed to disk.

The btree by itself does not provide ordering, though - if updates happened to
two different leaf nodes, the leaf nodes could have been flushed in any order -
and importantly, either of them could have been written before the last journal
entry that contained keys for that btree node write.

That is, while typically the journal write will happen before the btree node is
flushed, we don't prevent the btree node from being flushed right away - and we
certainly don't want to: since flushing btree nodes is required both to reclaim
memory and to reclaim space in the journal, the mere thought of the potential
deadlocks is rather horrifying.

Instead, we have a rather neat trick: in struct bset (the common header for a
single btree node write/log entry) we track the most recent sequence number of
all the journal entries the keys in this bset went into.

Then, on recovery, when we're first walking the btree, if we find a bset with a
higher journal sequence number than the most recent journal entry we actually
found - we merely ignore it.

The bset we ignore will likely also contain keys in older journal entries -
however, those journal entries will all be in the set we are replaying, because
they were considered dirty until after the bset was written, and they were
marked as dirty on disk until a journal entry was written after the bset's
write completed - which didn't happen. Thus, ignoring those bsets cannot cause
us to lose anything that won't be replayed.

There is a decent amount of extra machinery required to make this scheme work -
when we find bsets newer than the newest journal entry, we have to blacklist
the journal sequence number they referred to, and we have to mark it as
blacklisted so that on the next recovery we don't think there's a journal entry
missing - but that is the central idea.

## Auxiliary search trees

The code for doing lookups, insertions, etc. within a btree node is relatively
separate from the btree code itself, in bset.c - there's a struct
btree_node_iter separate from struct btree_iter (the btree iterator contains
one btree node iterator per level of the btree).

The bulk of the machinery is the auxiliary search trees - the data structures
for efficiently searching within a bset.

There are two different data structures and lookup paths. For the bset that's
currently being inserted into, we maintain a simple table in an array, with one
entry per cacheline of data in the original bset, that tracks the offset of the
first key in that cacheline. This is enough to do a binary search (followed by
a linear search once we're down to a single cacheline), and it's much cheaper
to keep up to date.
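A sketch of that lookup, reusing the simplified bkey helpers from the iteration
sketch above; the table layout and the names rw_aux_tree, bpos and bkey_cmp_p()
are hypothetical:

    /* One entry per 64 byte cacheline of key data in the bset: */
    struct rw_aux_tree {
            unsigned nr;            /* number of cachelines */
            u16 offset[];           /* offset of first key, in u64s */
    };

    static struct bkey_packed *bset_search(struct bset *i,
                                           struct rw_aux_tree *t,
                                           struct bpos *search)
    {
            unsigned l = 0, r = t->nr;
            struct bkey_packed *k;

            /* Binary search over cachelines, not over individual keys: */
            while (l + 1 < r) {
                    unsigned m = (l + r) / 2;

                    k = (void *) ((u64 *) i->start + t->offset[m]);
                    if (bkey_cmp_p(k, search) < 0)
                            l = m;
                    else
                            r = m;
            }

            /* Down to one cacheline - finish with a linear scan: */
            k = (void *) ((u64 *) i->start + t->offset[l]);
            while (k != bset_bkey_last(i) && bkey_cmp_p(k, search) < 0)
                    k = bkey_next(k);

            return k;               /* first key >= search */
    }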
For the constant bsets, we construct a binary search tree in an array (the same
layout as is used for heaps), where each node corresponds to one cacheline of
data in the original bset, and to the first key within that cacheline (note
that the auxiliary search tree is not full, i.e. not of size (2^n) - 1).
Walking down the auxiliary search tree thus corresponds roughly to doing a
binary search on the original bset - but it has the advantage of much
friendlier memory access patterns, since at every iteration the children of the
current node are adjacent in memory (as are all the grandchildren, and all the
great grandchildren) - meaning that unlike with a binary search, it's possible
to prefetch. (A sketch of this layout appears at the end of this section.)

Then there are a couple of tricks we use to make these nodes as small as
possible:

 - Because each node in the auxiliary search tree corresponds to precisely one
   cacheline, we don't have to store a full pointer to the original key - if we
   can compute, given a node's position in the array/tree, its index in an
   inorder traversal, we only have to store the key's offset within that
   cacheline.

   This is done by to_inorder() in bset.c, and it's mostly just shifts and bit
   operations.

 - Observe that as we're doing the lookup and walking down the tree, we have
   constrained the keys we're going to compare against to lie within a certain
   range [l, r).

   Then l and r will be equal in some number of their high bits (possibly 0);
   the keys we'll be comparing against and our search key will all be equal in
   those same bits - meaning we don't have to compare against, or store, the
   shared high bits.

   We also don't have to store all the low bits, either - we only need enough
   bits to correctly pivot on the key the current node points to (call it m);
   i.e. we need enough bits to tell m apart from the key immediately prior to m
   (call it p). We're not looking for strict equality comparisons here; we're
   going to follow this up with a linear search anyways.

   So a node in the auxiliary search tree (roughly) needs to store the bits
   from where l and r first differed to where m and p first differed - and
   usually that's not going to be very many bits. The full struct bkey has 160
   bit keys, but 16 bit keys in the auxiliary search tree suffice > 99% of the
   time.

   Lastly, since we'd really like these nodes to be fixed size, we just pick a
   size; when constructing the auxiliary search tree, if we weren't able to
   construct a node, we flag it, and the lookup code falls back to comparing
   against the original key. Provided this happens rarely enough, the
   performance impact is negligible.

The auxiliary search trees were an enormous improvement to bcache's performance
when they were introduced - before that, the lookup code was a simple binary
search (eons ago, when keys were still fixed size). On random lookups with a
large btree, the auxiliary search trees are easily over an order of magnitude
faster.
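Here's a hedged sketch of a lookup in that array layout (root at index 1,
children of node j at 2j and 2j+1). aux_node() and aux_node_cmp() stand in for
the abbreviated-key machinery described above, the leaf-to-cacheline mapping is
simplified, and bounds checks are omitted - the real code also has to handle
trees that aren't full:

    /*
     * Walk down a binary search tree stored in an array. The children (and
     * grandchildren) of a node are adjacent in memory, so unlike a regular
     * binary search we can usefully prefetch a level or two ahead.
     */
    static unsigned aux_tree_search(struct bset *i, unsigned nr_nodes,
                                    struct bpos *search)
    {
            unsigned j = 1;

            while (j < nr_nodes) {
                    /* All four grandchildren are contiguous: */
                    __builtin_prefetch(aux_node(i, 4 * j));

                    /* Go right if this node's key is less than the search key: */
                    j = 2 * j + (aux_node_cmp(i, j, search) < 0);
            }

            /*
             * Simplified: map the leaf we ended on back to a cacheline in
             * the original bset, then finish with a linear scan there.
             */
            return j - nr_nodes;
    }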
## Bkey packing

Bkeys have gotten rather large, with 64 bit inodes, 64 bit offsets, a snapshot
field that isn't being used yet, a 32 bit size field that's only used by
extents... That's a lot of wasted bits.

The packing mechanism was added to address this, and as an added bonus it
should allow us to change or add fields to struct bkey (if we ever want to)
without incompatible on disk format changes (forwards and backwards!).

The main constraint is lookup performance - any packing scheme that required an
unpack for comparisons would probably be right out. So if we want to be able to
compare packed keys, the requirement is that packing be order preserving. That
is, we need to define some function pack() such that

    bkey_cmp(l, r) == bkey_cmp(pack(l), pack(r))

The way it works is that for every btree node we define a packed format (and we
recalculate a new packed format when splitting/rewriting nodes); then keys that
are packed in the same format may be compared without unpacking them.

The packed format itself consists of, for each field in struct bkey:

 - the size of that field in the packed key, in bits
 - an offset, to be subtracted before packing and added back when unpacking

Then, provided a key fits into the packed format (its fields are not less than
the field offsets, and after subtracting the offsets its fields fit into the
number of bits in the packed key), packing is order preserving - i.e. packing
is allowed to fail.

Then, in the lookup path, we pack the search key into the btree node's packed
format - even better, this helps the auxiliary search tree code, since it no
longer has to deal with long strings of zeroes between the inode and the
offset.

For extents, keys in the packed format are usually only 8 bytes - vs. 32 bytes
for struct bkey. And with no replication, checksumming or compression, the
value is only 8 bytes - so packing can reduce our metadata size by more than
half.

The other interesting thing about this mechanism is that it makes our key
format self describing - we aren't restricted to just defining a format for
packed keys, we can define one for our in memory representation, struct bkey,
as well (and we do).

Then, as long as we have the format corresponding to a key, we can use it - we
should just be careful not to drop fields it has that we don't understand (by
converting it to our current in memory representation and back).

All that's left to do is to store formats (besides the btree node-local format)
in the superblock. Right now the btree node-local format is defined to be
format 0, and the in memory format - struct bkey itself - is format 1,
BKEY_FORMAT_CURRENT. Changing struct bkey would be as simple as defining a new
format and making sure the new format gets added to existing superblocks when
it's used.
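To make the order preservation requirement concrete, here's a hedged sketch of
packing a single field; field_format and pack_field() are made-up names for
illustration, and the real code packs all of a key's fields into one contiguous
bit stream:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-field packed format: */
    struct field_format {
            uint64_t offset;        /* subtracted before packing */
            unsigned bits;          /* size of the field when packed */
    };

    /* Pack one field; fails (returns false) if the key doesn't fit. */
    static bool pack_field(const struct field_format *f, uint64_t v,
                           uint64_t *packed)
    {
            if (v < f->offset)
                    return false;   /* field is below the format's offset */

            v -= f->offset;
            if (f->bits < 64 && v >= 1ULL << f->bits)
                    return false;   /* doesn't fit in the packed bits */

            *packed = v;
            return true;
    }

Since packing only subtracts a per-field constant, any two keys that both pack
successfully in the same format compare the same way packed as unpacked - which
is exactly the bkey_cmp(l, r) == bkey_cmp(pack(l), pack(r)) requirement above.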