author     Kent Overstreet <kent.overstreet@gmail.com>    2015-06-02 04:26:28 -0700
committer  Kent Overstreet <kent.overstreet@gmail.com>    2015-06-02 04:26:28 -0700
commit     da3a3a3763d766d2799723e10a97629b23d82b25 (patch)
tree       2c6868c4666494ef2575814499daf32cb7295f4f
parent     2ce449ea9afae7f0ae63e22bfc1ad951a175eb62 (diff)
more guide
-rw-r--r--    BcacheGuide.mdwn    339
1 files changed, 332 insertions, 7 deletions
diff --git a/BcacheGuide.mdwn b/BcacheGuide.mdwn
index dd45d91..a123a34 100644
--- a/BcacheGuide.mdwn
+++ b/BcacheGuide.mdwn
@@ -1,20 +1,20 @@
[[!toc levels=3 startlevel=2]]
-# THE PROGRAMMER'S GUIDE TO BCACHE:
+# The Programmer's Guide to bcache:
This document is intended to cover the design and core concepts of bcache, and
serve as a guide to programmers trying to become familiar with the bcache
codebase. It'll try to cover how various features are built on top of the core
concepts and how everything fits together.
-
-## KEYS, VALUES AND INDEXES:
+## Keys, values and indexes:
The core abstraction in bcache is a key value store ("the btree"). To the btree
code the keys it indexes are just large, fixed sized bitstrings/integers, with
opaque variable sized values.
-Essentially aside from some low level machinery is stored as key/value pairs:
+Aside from some low level machinery, essentially all metadata is stored as
+key/value pairs:
- Extents are one type of key where the value is a list of pointers to the
actual data.
@@ -237,10 +237,17 @@ other filesystems that has them - a variable sized chunk of data. From the point
of view of the index, they aren't positions, they're ranges or half open
intervals (note that a 0 size extent doesn't overlap with anything).
-Bcache is also purely copy on write - it never does update in place. This is a
-very fundamental part of the design, and quite useful.
+Bcache's extents are indexed by inode:offset, and their size is stored in the
+size field in struct bkey. The offset and size are both in 512 byte sectors (as
+are the pointer offsets). The offset field denotes the _end_ position of the
+extent within the file - a key with offset 8 and size 8 points to the data for
+sectors 0 through 7.
+
+(This oddity was a result of bcache's btree being designed for extents first,
+and non extent keys coming later - it makes searching for extents within a
+certain range cleaner when iterating in ascending order).
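+
+For example, a quick sketch of how the covered sectors fall out of the offset
+and size fields (illustrative types only, not the real struct bkey layout):
+
+    #include <stdint.h>
+
+    /* Illustrative only - does not match the real struct bkey layout: */
+    struct example_extent {
+            uint64_t        inode;
+            uint64_t        offset;         /* end position, in 512 byte sectors */
+            uint32_t        size;           /* length, in 512 byte sectors */
+    };
+
+    /* First sector covered by the extent: */
+    static inline uint64_t extent_start(const struct example_extent *e)
+    {
+            return e->offset - e->size;
+    }
+
+    /* offset == 8, size == 8 -> covers sectors 0 through 7 */
+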
-Bcache's extents are a list of one or more pointers - if there's more than one
+Inside the value is a list of one or more pointers - if there's more than one
pointer, they point to replicated data (or possibly one of the copies is on a
faster device and is considered cached).
@@ -489,3 +496,321 @@ another btree).
This does mean that we can't reuse any of the buckets in the stripe until each
of them are empty - but this shouldn't be much trouble, we just have to teach
copygc about it when it's considering which buckets to evacuate.
+
+#### Copying garbage collection:
+
+#### Tiering:
+
+## Btree internals:
+
+At a high level, bcache's btree is a copy on write b+ tree. The main difference
+between bcache's b+ tree and others is that the nodes are very large (256k is
+typical) and log structured. Like other COW b+ trees, updating a node may
+require recursively rewriting every node up to the root; however, most updates
+(to both leaf nodes and interior nodes) can be done with only an append, until
+we've written to the full amount of space we originally reserved for the node.
+
+A single btree node log entry is represented as a header and a list of bkeys,
+where the bkeys are all contiguous in memory and in sorted order:
+
+    struct bset {
+            /* some fields omitted */
+            ...
+            u16                     u64s;
+            struct bkey_packed      start[0];
+    };
+
+Since bkeys are variable length, it's not possible to access keys randomly
+without other data structures - we can only iterate sequentially via
+bkey_next().
+
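+As a rough sketch of what that iteration looks like (simplified - the real
+bkey_next() has to deal with packed keys):
+
+    #include <stdint.h>
+
+    /* Simplified stand-in for the real struct bkey_packed: */
+    struct ex_bkey {
+            uint8_t         u64s;           /* size of this key, in u64s */
+            /* rest of the key follows */
+    };
+
+    /* Keys are laid out back to back - step over the current one: */
+    static inline struct ex_bkey *ex_bkey_next(struct ex_bkey *k)
+    {
+            return (struct ex_bkey *) ((uint64_t *) k + k->u64s);
+    }
+
+    #define ex_for_each_key(k, start, end)                          \
+            for ((k) = (start); (k) < (end); (k) = ex_bkey_next(k))
+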
+A btree node thus contains multiple independent bsets that on lookup all must be
+searched and iterated over. At any given time there will be zero or more bsets
+that have been written out, and a single dirty bset that new keys are being
+inserted into.
+
+As the btree is modified we must maintain the invariant in memory that there are
+no duplicates (keys that compare as equal), excluding keys marked as deleted.
+When an insertion overwrites an existing key, we will mark the existing key as
+deleted (by setting k->type = KEY_TYPE_DELETED) - but until the entire node is
+rewritten the old key will still exist on disk. To handle this, when a btree
+node is read, the first thing we do is a mergesort of all the bsets it contains,
+and as part of the mergesort duplicate keys are found and the older bkeys are
+dropped - carefully matching the same changes we made in memory when doing the
+insertions.
+
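+A rough sketch of that duplicate handling, in the merge step of the mergesort
+(heavily simplified - the real code works on struct bkey_packed and whiteouts):
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct ex_key { uint64_t k; };          /* illustrative key type */
+
+    /*
+     * Merge two sorted runs, where 'newer' was written after 'older'.  On
+     * equal keys the older copy is dropped, mirroring the whiteout the
+     * insert path left in memory.  Returns the number of keys emitted.
+     */
+    static size_t merge_drop_dups(const struct ex_key *older, size_t n_older,
+                                  const struct ex_key *newer, size_t n_newer,
+                                  struct ex_key *out)
+    {
+            size_t i = 0, j = 0, o = 0;
+
+            while (i < n_older && j < n_newer) {
+                    if (older[i].k < newer[j].k)
+                            out[o++] = older[i++];
+                    else if (older[i].k > newer[j].k)
+                            out[o++] = newer[j++];
+                    else {
+                            i++;            /* duplicate: drop the older key */
+                            out[o++] = newer[j++];
+                    }
+            }
+            while (i < n_older)
+                    out[o++] = older[i++];
+            while (j < n_newer)
+                    out[o++] = newer[j++];
+
+            return o;
+    }
+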
+This hopefully explains how the lack of deletion as a primitive is a result of
+the way the btree is implemented - it's not possible to delete a key except by
+inserting a whiteout, which will be dropped when the btree node eventually fills
+up and is rewritten.
+
+Once a bset has been written out it may also be sorted, in memory, with other
+bsets that have also been written out - we do so periodically so that a given
+btree node will have only a few (at most three) bsets in memory: the one being
+inserted into will be at most 8 or 16k, and the rest roughly form a geometric
+progression in size, so that sorting the entire node is relatively infrequent.
+
+This resorting/compacting in memory is one of the main ways bcache is able to
+efficiently use such large btree nodes. The other main trick is to take
+advantage of the fact that bsets that have been written out are, aside from
+resorts, constant; we precompute lookup tables that would be too inefficient to
+use if they had to be modified for insertions. The bcache code refers to these
+lookup tables as the auxiliary search trees.
+
+### Locking
+
+Bcache doesn't use read/write locks for btree nodes - the locks it uses have
+three states: shared, intent and exclusive (SIX locks). Shared and exclusive
+correspond to read and write, while intent is sort of in the middle - intent
+locks conflict with other intent locks (like write locks), but they don't
+conflict with read locks.
+
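+In other words, the conflict matrix looks like this (a sketch - the state names
+here are just illustrative):
+
+    #include <stdbool.h>
+
+    enum lock_type { LT_read, LT_intent, LT_write };
+
+    /* Which held lock types block which requested types: */
+    static const bool lock_conflicts[3][3] = {
+            /* held \ requested:  read   intent  write */
+            [LT_read]        = { false,  false,  true  },
+            [LT_intent]      = { false,  true,   true  },
+            [LT_write]       = { true,   true,   true  },
+    };
+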
+The problem intent locks solve is that with a regular read/write lock, a read
+lock can't be upgraded to a write lock - that would lead to deadlock when
+multiple threads with read locks tried to upgrade. With a complicated enough
+data structure, updates would need to hold write locks - to exclude other
+updates - for much longer than the part where they do the actual modification
+that needs to exclude readers.
+
+For example, consider the case of a btree split. The update starts at a leaf
+node, and discovers it has to do a split. But before starting the split it has
+to acquire a write lock on the parent node, primarily to avoid a deadlock with
+other splits: it needs at least a read lock on the parent (roughly in order to
+lock the path to the child node), but it couldn't then upgrade that read lock to
+a write lock in order to update the parent with the pointers to the new children
+because that would deadlock with threads splitting sibling leaf nodes.
+
+Intent locks solve this problem. When doing a split it suffices to acquire an
+intent lock on the parent - write locks are only ever held while modifying the
+in memory btree contents (which is a much shorter duration than the entire split,
+which requires waiting for the new nodes to be written to disk).
+
+Intent locks with only three states do introduce another deadlock, though:
+
+        Thread A             Thread B
+        read     |  Parent  |  intent
+        intent   |  Child   |  intent
+
+Thread B is splitting the child node: it's allocated new nodes and written them
+out, and now needs to take a write lock on the parent in order to add the
+pointers to the new nodes (after which it will free the old child).
+
+Thread A just wants to insert into the child node - it has a read lock on the
+parent node and it's looked up the child node, and now it's waiting (on thread
+B) to get an intent lock on the child.
+
+But thread A has blocked thread B from taking its write lock in order to update
+the parent node, and thread B can't drop its intent lock on the child until
+after the new nodes are visible and it has freed the child node.
+
+The way this deadlock is handled is by enforcing (in bch_btree_node_get()) that
+we drop read locks on parent nodes _before_ taking intent locks on child nodes -
+this might cause us to race and have the btree node freed out from under us before
+we lock it, so we check for that after grabbing the intent lock and redo the
+traversal if necessary.
+
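+Roughly, the shape of that logic (illustrative names, not the real
+bch_btree_node_get()):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    struct bnode;                           /* opaque, illustrative */
+
+    /* Assumed helpers, for illustration only: */
+    struct bnode *child_lookup(struct bnode *parent, uint64_t search);
+    void node_unlock(struct bnode *);
+    void node_lock_intent(struct bnode *);
+    bool node_freed(struct bnode *);
+
+    /* Returns NULL if the caller needs to redo the traversal: */
+    static struct bnode *get_child_locked(struct bnode *parent, uint64_t search)
+    {
+            struct bnode *child = child_lookup(parent, search);
+
+            node_unlock(parent);            /* drop the parent's read lock... */
+            node_lock_intent(child);        /* ...before blocking on the child */
+
+            if (node_freed(child)) {        /* raced with a split/free */
+                    node_unlock(child);
+                    return NULL;
+            }
+
+            return child;
+    }
+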
+One other thing worth mentioning is bcache's btree node locks have embedded
+sequence numbers, which are incremented when taking and releasing write locks
+(much like seqlocks). This allows us to aggressively drop locks (because we'll
+usually be able to retake the lock), and we also use it for a try_upgrade() - if
+we discover we need an intent lock (e.g. for a split, or because the caller is
+inserting into a leaf node they didn't get an intent lock for) we'll usually be
+able to get it without having to unwind and redo the traversal.
+
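+A sketch of how those sequence numbers get used when retaking a lock
+(illustrative names and structure):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    struct ex_lock {
+            uint64_t        seq;            /* bumped on write lock/unlock */
+            /* lock state omitted */
+    };
+
+    void ex_read_lock(struct ex_lock *);    /* assumed helpers */
+    void ex_read_unlock(struct ex_lock *);
+
+    /* Retake a read lock we dropped earlier, iff nothing changed meanwhile: */
+    static bool ex_relock_read(struct ex_lock *l, uint64_t seq_seen)
+    {
+            ex_read_lock(l);
+            if (l->seq == seq_seen)
+                    return true;
+
+            ex_read_unlock(l);              /* node was modified - retraverse */
+            return false;
+    }
+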
+### Journaling
+
+Bcache's journal is foremost an optimization for the btree. COW btrees do not
+require journals for on disk consistency - but having a journal allows us to
+coalesce random updates across multiple btree nodes and flush the nodes
+asynchronously.
+
+The journal is a purely logical log, a list of insertions - bkeys - to reinsert
+on recovery in the same order they're present in the journal. Provided every
+index update is journaled (a critical invariant here), redoing those insertions
+is an idempotent operation. See bch_journal_replay_key() in journal.c - the
+journal uses the same insert path as everything else and doesn't know anything
+about the structure of the btree.
+
+It's critical that keys appear in the journal in the same order as the
+insertions happened, so both are done under the btree node's write lock: see
+bch_btree_insert_and_journal() in btree.c, which calls bch_bset_insert() at the
+top (inserting into the btree node in memory) and bch_journal_add_keys() to
+journal the same key at the bottom.
+
+At this point in the btree insertion path, we've already marked any key(s) that
+we overlapped with (possibly more than one for extents) as deleted - i.e. the
+btree node is inconsistent so the insertion must happen before dropping our
+write lock.
+
+So we also have some machinery for journal reservations: we reserve some amount
+of space in the journal (in units of u64s, as always) midway through the
+insertion path (in bch_btree_insert_keys()). The journal reservation machinery
+(as well as adding the keys to the journal) is entirely lockless in the
+fastpath.
+
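+A sketch of what a cmpxchg based, lockless reservation fastpath looks like
+(illustrative only - not the actual bcache code):
+
+    #include <stdatomic.h>
+    #include <stdbool.h>
+
+    /* Space remaining in the current journal entry, in u64s: */
+    static _Atomic unsigned journal_u64s_remaining;
+
+    static bool journal_res_get(unsigned u64s)
+    {
+            unsigned old = atomic_load(&journal_u64s_remaining);
+
+            do {
+                    if (old < u64s)
+                            return false;   /* slowpath: kick off a journal write */
+            } while (!atomic_compare_exchange_weak(&journal_u64s_remaining,
+                                                   &old, old - u64s));
+
+            return true;
+    }
+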
+### Sequential consistency
+
+As mentioned in the first section on indexing, bcache's b+ tree provides the
+guarantee that ordering of updates is always preserved - whether to different
+nodes or different btrees (i.e. dirents vs. inodes).
+
+The journal helps here as well. Ordering is always preserved in the journal
+itself: if a key at time t made it to the journal and was present after unclean
+shutdown/recovery, all the keys journalled prior to time t will either be in the
+journal, or in btree nodes that were flushed to disk.
+
+The btree by itself does not provide ordering though - if updates happened to
+two different leaf nodes, the leaf nodes could have been flushed in any order -
+and importantly, either of them could have been written before the last journal
+entry that contained keys for that btree node write.
+
+That is, while typically the journal write will happen before the btree node is
+flushed - we don't prevent the btree node from being flushed right away, and we
+certainly don't want to: since flushing btree nodes is required both to reclaim
+memory and to reclaim space in the journal, the mere thought of the
+potential deadlocks is rather horrifying.
+
+Instead, we have a rather neat trick: in struct bset (the common header for a
+single btree node write/log entry) we track the most recent sequence number of
+all the journal entries the keys in this bset went into.
+
+Then, on recovery, when we're first walking the btree, if we find a bset with a
+journal sequence number higher than that of the most recent journal entry we
+actually found, we simply ignore it.
+
+The bset we ignore will likely also contain keys in older journal entries -
+however, those journal entries will all be in the set we are replaying because
+they were considered dirty until after the bset was written, and they were
+marked as dirty on disk until a journal entry was written after the bset's write
+completed - which didn't happen. Thus, ignoring those bsets cannot cause us to
+lose anything that won't be replayed.
+
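+That recovery-time check is essentially just (illustrative names):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* Can we use this bset's keys on recovery? */
+    static bool bset_usable(uint64_t bset_journal_seq,
+                            uint64_t newest_journal_seq_found)
+    {
+            /*
+             * If the bset points at a journal entry we never found, it was
+             * written after the last journal write that completed - skip it.
+             */
+            return bset_journal_seq <= newest_journal_seq_found;
+    }
+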
+There is a decent amount of extra machinery required to make this scheme work:
+when we find bsets newer than the newest journal entry, we have to blacklist
+the journal sequence number they referred to - and persist that blacklist, so
+that on the next recovery we don't think there's a journal entry missing - but
+that is the central idea.
+
+## Auxiliary search trees
+
+The code for doing lookups, insertions, etc. within a btree node is relatively
+separated from the btree code itself, in bset.c - there's a struct
+btree_node_iter separate from struct btree_iter (the btree iterator contains one
+btree node iterator per level of the btree).
+
+The bulk of the machinery is the auxiliary search trees - the data structures
+for efficiently searching within a bset.
+
+There are two different data structures and lookup paths. For the bset that's
+currently being inserted into, we maintain a simple table in an array, with one
+entry per cacheline of data in the original bset, that tracks the offset of the
+first key in that cacheline. This is enough to do a binary search (and then a
+linear search when we're down to a single cacheline), and it's much cheaper to
+keep up to date.
+
+For the const bsets, we construct a binary search tree in an array (same layout
+as is used for heaps) where each node corresponds to one cacheline of data in
+the original bset, and the first key within that cacheline (note that the
+auxiliary search tree is not full, i.e. not of size (2^n) - 1). Walking down the
+auxiliary search tree thus corresponds roughly to doing a binary search on the
+original bset - but it has the advantage of much friendlier memory access
+patterns, since at every iteration the children of the current node are adjacent
+in memory (and all the grandchildren, and all the great grandchildren) - meaning
+unlike with a binary search it's possible to prefetch.
+
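+As a generic sketch of searching a sorted set laid out this way (this is not
+the actual bcache lookup code - it just shows the heap-layout traversal and why
+prefetching works):
+
+    /*
+     * 'tree' holds a sorted set laid out heap style, 1-indexed: the root is
+     * tree[1] and the children of tree[i] are tree[2i] and tree[2i + 1].
+     * Returns the index of the first element >= search, or 0 if none.
+     */
+    static unsigned aux_tree_lower_bound(const int *tree, unsigned nr, int search)
+    {
+            unsigned i = 1;
+
+            while (i <= nr) {
+                    /* children and grandchildren are adjacent - prefetch works: */
+                    __builtin_prefetch(&tree[4 * i]);
+
+                    i = 2 * i + (tree[i] < search);
+            }
+
+            /* undo the right turns taken after the last left turn: */
+            return i >> __builtin_ffs(~i);
+    }
+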
+Then there are a couple tricks we use to make these nodes as small as possible:
+
+ - Because each node in the auxiliary search tree corresponds to precisely one
+ cacheline, we don't have to store a full pointer to the original key - if,
+ given a node's position in the array/tree, we can compute its index in an
+ inorder traversal, we only have to store the key's offset within that
+ cacheline.
+
+ This is done by to_inorder() in bset.c, and it's mostly just shifts and bit
+ operations.
+
+ - Observe that as we're doing the lookup and walking down the tree, we have
+ constrained the keys we're going to compare against to lie within a certain
+ range [l, r).
+
+ Then l and r will be equal in some number of their high bits (possibly 0);
+ the keys we'll be comparing against and our search key will all be equal in
+ the same bits - meaning we don't have to compare against, or store, any bits
+ after that position.
+
+ We also don't have to store all the low bits, either - we need to store
+ enough bits to correctly pivot on the key the current node points to (call it
+ m); i.e. we need to store enough bits to tell m apart from the key
+ immediately prior to m (call it p). We're not looking for strict equality
+ comparisons here, since we're going to follow this up with a linear search
+ anyway.
+
+ So the node in the auxiliary search tree (roughly) needs to store the bits
+ from where l and r first differed to where m and p first differed - and
+ usually that's not going to be very many bits. The full struct bkey has 160
+ bit keys, but 16 bit keys in the auxiliary search tree will suffice > 99% of
+ the time.
+
+ Lastly, since we'd really like these nodes to be fixed size, we just pick a
+ size; then, while constructing the auxiliary search tree, we check whether a
+ given node could actually be constructed and flag it if not; the lookup code
+ will fall back to comparing against the original key. Provided this happens
+ rarely enough,
+ the performance impact will be negligible.
+
+ The auxiliary search trees were an enormous improvement to bcache's
+ performance when they were introduced - before they were introduced the
+ lookup code was a simple binary search (eons ago when keys were still fixed
+ size). On random lookups with a large btree the auxiliary search trees are
+ easily over an order of magnitude faster.
+
+## Bkey packing
+
+Bkeys have gotten rather large, with 64 bit inodes, 64 bit offsets, a snapshot
+field that isn't being used yet, a 32 bit size field that's only used by
+extents... That's a lot of wasted bits.
+
+The packing mechanism was added to address this, and as an added bonus it should
+allow us to change or add fields to struct bkey (if we ever want to) without
+incompatible on disk format changes (forwards and backwards!).
+
+The main constraint is lookup performance - any packing scheme that required an
+unpack for comparisons would probably be right out. So if we want to be able to
+compare packed keys, the requirement is that it be order preserving. That is, we
+need to define some function pack(), such that
+
+ bkey_cmp(l, r) == bkey_cmp(pack(l), pack(r))
+
+The way it works is for every btree node we define a packed format (and we
+recalculate a new packed format when splitting/rewriting nodes); then keys that
+are packed in the same format may be compared without unpacking them.
+
+The packed format itself consists of, for each field in struct bkey:
+
+ - the size of that field in the packed key, in bits
+ - an offset, to be subtracted before packing and when unpacking
+
+Then, provided keys fit into the packed format (their fields are not less than
+the corresponding field offset, and after subtracting the offset they fit into
+the number of bits in the packed key), packing is order preserving. Keys that
+don't fit simply can't be packed in that format - i.e. packing is allowed to
+fail.
+
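+A sketch of packing a single field under those rules (illustrative names; the
+real format covers every field of struct bkey):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    struct ex_field_format {
+            uint8_t         bits;           /* field size when packed */
+            uint64_t        offset;         /* subtracted before packing */
+    };
+
+    /* Returns false if the value doesn't fit - packing is allowed to fail: */
+    static bool pack_field(const struct ex_field_format *f,
+                           uint64_t field, uint64_t *packed)
+    {
+            if (field < f->offset)
+                    return false;
+
+            field -= f->offset;
+            if (f->bits < 64 && (field >> f->bits))
+                    return false;
+
+            *packed = field;
+            return true;
+    }
+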
+Then in the lookup path, we pack the search key into the btree node's packed
+format - even better, this helps the auxiliary search tree code since it no
+longer has to deal with long strings of zeroes between the inode and the offset.
+
+For extents, keys in the packed format are usually only 8 bytes - vs. 32 bytes
+for struct bkey. And with no replication, checksumming or compression, the value
+is only 8 bytes - so packing can reduce our metadata size by more than half.
+
+The other interesting thing about this mechanism is that it makes our key format
+self describing - we aren't restricted to just defining a format for packed
+keys, we can define one for our in memory representation, struct bkey, as well
+(and we do).
+
+Then as long as we have the format corresponding to a key, we can use it - we
+should just be careful not to drop fields it has that we don't understand (by
+converting it to our current in memory representation and back).
+
+All that's left to do is to store formats (besides the btree node-local format)
+in the superblock. Right now the btree node local format is defined to be format
+0 and the in memory format - struct bkey itself - is format 1,
+BKEY_FORMAT_CURRENT. Changing struct bkey would be as simple as defining a new
+format, and making sure the new format gets added to existing superblocks when
+it's used.