author     Kent Overstreet <kent.overstreet@gmail.com>  2015-05-30 02:42:27 -0700
committer  Kent Overstreet <kent.overstreet@gmail.com>  2015-05-30 02:42:27 -0700
commit     265733f273762038577a0340657b6c9cd9a62c4f (patch)
tree       384877bbc7f696cb1a10e4714b5fb6960892acf5
parent     445b4770a0b52981b219b55e62258cd2ff5d1a2d (diff)
bcache guide
-rw-r--r--  BcacheGuide.mdwn  364
1 files changed, 364 insertions, 0 deletions
diff --git a/BcacheGuide.mdwn b/BcacheGuide.mdwn
new file mode 100644
index 0000000..e60a33f
--- /dev/null
+++ b/BcacheGuide.mdwn
@@ -0,0 +1,364 @@
+[[!toc startlevel=2]]
+
+# THE PROGRAMMER'S GUIDE TO BCACHE:
+
+This document is intended to cover the design and core concepts of bcache, and
+serve as a guide to programmers trying to become familiar with the bcache
+codebase. It'll try to cover how various features are built on top of the core
+concepts and how everything fits together.
+
+
+## KEYS, VALUES AND INDEXES:
+
+The core abstraction in bcache is a key-value store ("the btree"). To the btree
+code, the keys it indexes are just large, fixed-size bitstrings/integers, with
+opaque, variable-sized values.
+
+Essentially everything aside from some low level machinery is stored as
+key/value pairs:
+
+ - Extents are one type of key where the value is a list of pointers to the
+ actual data.
+
+ - Inodes are stored indexed by inode number, where the value is the inode
+ struct.
+
+ - Dirents and xattrs are stored with one key per dirent/xattr.
+
+### Keys (bkey and bpos)
+
+We have separate structs for a search key (struct bpos), and the outer container
+that holds a key and a value (struct bkey).
+
+Here are their definitions:
+
+ struct bpos {
+ u64 inode;
+ u64 offset;
+ u32 snapshot;
+ };
+
+ struct bkey {
+ u8 u64s;
+ u8 format;
+ u8 type;
+
+ u8 pad;
+ u32 version;
+ u32 size;
+ struct bpos p;
+ };
+
+The high bits of the search key are the inode field and the low bits are the
+snapshot field (note that on little endian the order in the struct will actually
+be reversed).
+
+Not all code uses all fields of the search key (and snapshot is currently
+unused), and some code uses the fields in different ways; it's up to the user of
+the btree code to use the fields however they see fit. However, the inode field
+generally corresponds to an inode number (of a file in a filesystem), and for
+extents the offset field corresponds to the offset within that file.
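+
+For illustration, here's what a search key for an extent lookup might look
+like (inode_nr and file_sector are hypothetical variables; as described later,
+the extent code measures offsets in 512-byte sectors):
+
+    struct bpos search = {
+            .inode  = inode_nr,     /* inode number of the file */
+            .offset = file_sector,  /* offset within that file, in sectors */
+    };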
+
+Some notes on the bkey fields:
+
+ - The u64s field is the size of the combined key and value, in units of u64s.
+ It's not generally used directly: use bkey_val_u64s() or bkey_val_bytes() for
+ a higher level interface.
+
+ - The format field is internal to the btree implementation. It's for packed
+   bkeys - bkeys are not usually stored in the btree in the same format as the
+   in-memory struct shown above. Only very low level btree code need be
+   concerned with this, however.
+
+ - The type field denotes the type of the value. For example, the directory
+ entry code defines two different value types, bch_dirent (type BCH_DIRENT,
+ 128) and bch_dirent_whiteout (BCH_DIRENT_WHITEOUT, 129).
+
+ - The size field is only used by extents (it's 0 for all other keys).
+
+### Values:
+
+Values are stored inline with bkeys. Each value type (for all the types that
+have nonempty values) will have a struct definition, and then macros generate a
+variety of accessor functions and wrapper types.
+
+Also, while values are _stored_ inline with the keys, due to packing they aren't
+typically passed around that way; when accessing a key stored in the btree the
+btree code will unpack the key and return a wrapper type that contains pointers
+to the unpacked key and the value.
+
+The scheme for the wrapper types is:
+
+    bkey_i      key with inline value
+    bkey_s      key with split value (wrapper with pointers to key and value)
+    bkey_s_c    constant key with split value
+
+The code has these generic wrapper types for keys with values of any type, and
+each value type gets corresponding wrapper types (generated by macros) whose
+value pointer has the correct type.
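+
+Roughly speaking, the split wrappers are a pair of pointers and the inline
+wrapper is a key directly followed by its value - something like the following
+(an illustrative sketch, not the exact definitions):
+
+    /* key with inline value: the value immediately follows the key in memory */
+    struct bkey_i {
+            struct bkey     k;
+            struct bch_val  v;
+    };
+
+    /* constant key with split value: separate pointers to key and value */
+    struct bkey_s_c {
+            const struct bkey       *k;
+            const struct bch_val    *v;
+    };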
+
+For example, here's the struct definition for extended attributes:
+
+ struct bch_xattr {
+ struct bch_val v;
+ u8 x_type;
+ u8 x_name_len;
+ u16 x_val_len;
+ u8 x_name[];
+ };
+
+The wrapper types will be bkey_i_xattr, bkey_s_xattr, and bkey_s_c_xattr. When
+the xattr code is doing a lookup in the btree, the btree iterator will return
+(via e.g. btree_iter_peek()) a key of type bkey_s_c, and then the xattr code
+will use bkey_s_c_to_xattr(k) (after switching on the type field) to get to the
+xattr itself.
+
+Code should always switch() on or otherwise check the bkey type field before
+converting to a particular type; the accessors all assert that the type field
+is the expected one, so you'll get a BUG_ON() if you don't.
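+
+As a fragment (assuming an initialized iterator, and assuming BCH_XATTR as the
+xattr value type's constant, by analogy with BCH_DIRENT above), the pattern
+looks roughly like this:
+
+    struct bkey_s_c k = bch_btree_iter_peek(&iter);
+
+    switch (k.k->type) {
+    case BCH_XATTR: {
+            /* safe: the type was checked, so the accessor's assert passes */
+            struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);
+
+            printk("xattr name length %u\n", (unsigned) xattr.v->x_name_len);
+            break;
+    }
+    default:
+            /* some other key type - not an xattr */
+            break;
+    }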
+
+### Indexing:
+
+Keys are indexed in a small number of btrees: one btree for extents, another for
+inodes, another for dirents, etc.
+
+We can add new btree types for new features without on disk format changes -
+many planned features will be added this way (e.g. for Reed-Solomon/RAID6,
+we'll be adding another btree to index stripes by physical device and LBA.
+More on this later).
+
+Some important properties/invariants:
+
+ - There are no duplicate keys; insertions implicitly overwrite existing keys.
+
+ Related to this, deletion is not exposed as a primitive: instead, deletion is
+ done by inserting a key of type KEY_TYPE_DELETED. (The btree code internally
+ handles overwrites by setting the old key to KEY_TYPE_DELETED, which will
+ then be deleted when the relevant btree node is compacted. Old deleted keys
+ are never visible outside the btree code).
+
+ - Ordering of insertions/updates is _always_ preserved, across unclean
+ shutdowns and without any need for flushes.
+
+ This is important for the filesystem code. Creating a file or a new hardlink
+ requires two operations:
+
+ - create new inode/increment inode refcount
+ - create new dirent
+
+ Bcache doesn't (currently) have transactions, but by doing these operations
+ in the correct order we can guarantee that after an unclean shutdown we'll
+ never have dirents pointing to nonexistent inodes - we might leak inode
+ references, but it's straightforward to garbage collect those at runtime.
+
+Lookups/insertions are done via the btree iterator, defined in btree.h:
+
+ struct btree_iter;
+
+ void bch_btree_iter_init(struct btree_iter *,
+ struct cache_set *,
+ enum btree_id,
+ struct bpos);
+ int bch_btree_iter_unlock(struct btree_iter *);
+
+ struct bkey_s_c bch_btree_iter_peek(struct btree_iter *);
+ void bch_btree_iter_advance_pos(struct btree_iter *);
+
+ - A "cache_set" corresponds to a single filesystem/volume - the name comes
+   from bcache's starting point as a block cache: a cache turned into a
+   cache_set when it gained the ability to manage multiple devices.
+
+ - The btree_id is one of BTREE_ID_EXTENTS, BTREE_ID_INODES, etc.
+
+ - The bpos argument is the position to start iterating from.
+
+bch_btree_iter_unlock() unlocks any btree nodes the iterator has locked; if
+there was an error reading in a btree node it'll be returned here.
+
+bch_btree_iter_peek() returns the next key at or after the iterator's current
+position.
+
+bch_btree_iter_advance_pos() advances the iterator's position to immediately
+after the last key returned.
+
+Lookup isn't provided as a primitive because most usage is geared more towards
+iterating from a given position, but we now have enough to implement it:
+
+    int lookup(struct cache_set *c, struct bpos search)
+    {
+            struct btree_iter iter;
+            struct bkey_s_c k;
+            int ret = 0;
+
+            bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, search);
+
+            k = bch_btree_iter_peek(&iter);
+            if (!k.k || bkey_cmp(k.k->p, search))
+                    ret = -ENOENT;
+
+            bch_btree_iter_unlock(&iter);
+            return ret;
+    }
+
+If we want to iterate over every key in the btree (or just from a given point),
+there's the convenience macro for_each_btree_key():
+
+ struct btree_iter iter;
+ struct bkey_s_c k;
+
+ for_each_btree_key(&iter, c, BTREE_ID_EXAMPLE, POS_MIN, k)
+ printk("got %llu:%llu\n", k.k->p.inode, k.k->p.offset);
+
+ bch_btree_iter_unlock(&iter);
+
+Insertion is most often done with bch_btree_insert_at(), which takes an iterator
+and inserts at the iterator's current position. This is often used in
+conjunction with bch_btree_iter_peek_with_holes(), which returns a key for
+every valid position (of type KEY_TYPE_DELETED if nothing was found there).
+
+This is highly useful for the inode and dirent code. For example, to create a
+new inode, the inode code can search for an empty position and then use
+bch_btree_insert_at() when it finds one, and the btree node's lock will guard
+against races with other inode creations.
+
+Note that it might not be possible to do the insert without dropping locks,
+e.g. if a split was required; the BTREE_INSERT_ATOMIC flag indicates that the
+insert shouldn't be done in this case and bch_btree_insert_at() will return
+-EINTR instead, and the caller will loop and retry the operation.
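+
+A rough sketch of that retry pattern (the real bch_btree_insert_at() takes
+more arguments than shown here, and new_key stands in for whatever key the
+caller has built up):
+
+    struct bkey_s_c k;
+    struct bkey_i new_key;      /* the key to insert - set up elsewhere */
+    int ret = 0;
+
+    do {
+            k = bch_btree_iter_peek_with_holes(&iter);
+            if (!k.k)
+                    break;
+
+            /* check that k really is the position/slot we want to fill */
+
+            ret = bch_btree_insert_at(&iter, &new_key, BTREE_INSERT_ATOMIC);
+            /* -EINTR: locks were dropped (e.g. for a split), loop and retry */
+    } while (ret == -EINTR);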
+
+The extents code also uses a similar mechanism to implement a cmpxchg()-like
+operation, which is used by things like copygc that move data around in the
+background - the index update will only succeed if the original key was still
+present, which guards against races with foreground writes.
+
+
+## Extents, pointers, data:
+
+Bcache is extent based, not block based; its extents are much like extents in
+other filesystems that have them - a variable sized chunk of data. From the
+point of view of the index, they aren't positions, they're ranges or half-open
+intervals (note that a 0 size extent doesn't overlap with anything).
+
+Bcache's extents are a list of one or more pointers - if there's more than one
+pointer, they point to replicated data (or possibly one of the copies is on a
+faster device and is considered cached).
+
+Here's a simplified version (in the latest version the extent format has gotten
+considerably more complicated in order to incorporate checksums and compression
+information, but the concept is still the same):
+
+ struct bch_extent_ptr {
+ __u64 offset:48,
+ dev:8,
+ gen:8;
+ };
+
+ struct bch_extent {
+ struct bch_val v;
+
+            struct bch_extent_ptr ptr[0];
+ };
+
+ - The dev field is the index of one of the devices in the cache set (i.e. the
+   volume/filesystem).
+
+ - The offset field is the offset of the start of the pointed-to data, counting
+ from the start of the device in units of 512 byte sectors.
+
+ - "gen" is a generation number that must match the generation number of the
+   bucket the pointer points into for the pointer to be considered valid (not
+   stale) - see the illustration below.
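+
+For illustration, here's how the fields of a single pointer decode (this
+helper is made up for this guide; the real code has its own accessors):
+
+    static void example_print_ptr(const struct bch_extent_ptr *ptr)
+    {
+            /* offset is in 512-byte sectors from the start of the device */
+            u64 byte_offset = (u64) ptr->offset << 9;
+
+            printk("dev %u, sector %llu (byte %llu), gen %u\n",
+                   (unsigned) ptr->dev, (unsigned long long) ptr->offset,
+                   (unsigned long long) byte_offset, (unsigned) ptr->gen);
+    }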
+
+### Buckets and generations:
+
+This mechanism comes from bcache's origins as just a block cache.
+
+The devices bcache manages are divided up into fixed sized buckets (typically
+anywhere from 128k to 2M). The core of the allocator works in terms of buckets
+(there's a sector allocator on top, similar to how the slab allocator is built
+on top of the page allocator in Linux).
+
+When a bucket is allocated, it is written to once sequentially: then, it is
+never written to again until the entire bucket is reused. When a bucket is
+reused, its generation number is incremented (and the new generation number
+persisted before discarding it or writing to it again). If the bucket still
+contained any cached data, incrementing the generation number is the mechanism
+that invalidates any still live pointers pointing into that bucket (in
+programming language terminology, bcache's pointers are weak pointers).
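+
+The staleness rule itself is just a comparison (bucket_gen here is a stand-in
+for however the bucket's current generation is looked up - the real code keeps
+per-bucket generation numbers in memory for each device):
+
+    static inline bool example_ptr_stale(u8 bucket_gen,
+                                         const struct bch_extent_ptr *ptr)
+    {
+            /* the bucket has been reused since this pointer was written */
+            return ptr->gen != bucket_gen;
+    }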
+
+This has a number of implications:
+
+ - Invalidating clean cached data is very cheap - there's no cost to keeping a
+ device full of clean cached data.
+
+ - We don't persist fine grained allocation information: we only persist the
+ current generation number of each bucket, and at runtime we maintain in
+ memory counts of the number of live dirty and cached sectors in each bucket -
+ these counts are updated on the fly as the index is updated and old extents
+ are overwritten and new ones added.
+
+ This is a performance tradeoff - it's a fairly significant performance win at
+ runtime but it costs us at startup time. Eventually, we'll probably implement
+ a reserve that we can allocate out of at startup so we can do the initial
+ mark and sweep in the background.
+
+ This does mean that at startup we have to walk the extents btree once to
+ repopulate these counts before we can do any allocation.
+
+ - If there's a lot of internal fragmentation, we do require copying garbage
+ collection to compact data - rewriting it into new buckets.
+
+ - Since the generation number is only 8 bits, we do have to guard against
+ wraparound - this isn't really a performance issue since wraparound requires
+ rewriting the same bucket many times (and incoming writes will be distributed
+ across many buckets). We do occasionally have to walk the extents to update
+ the "oldest known generation number" for every bucket, but the mark and sweep
+ GC code runs concurrently with everything else, except for the allocator if
+   the freelist becomes completely drained before it finishes (the part of GC
+   that updates per-bucket sector counts mostly isn't required anymore and may
+   be removed at some point).
+
+
+### IO path:
+
+By this point, the IO path should be fairly straightforward.
+
+The core read path starts in bch_read(), in io.c. The algorithm goes roughly
+like this (a sketch in code follows the list):
+
+ - Iterate over the extents btree (with for_each_btree_key_with_holes()),
+ starting from the first sector in the request.
+
+ - Check the type of the returned key:
+
+ - Hole? (KEY_TYPE_DELETED) - just return zeroes
+
+ - Extent? Check for a pointer we can read from.
+
+ - If they're all stale, it was a cached extent (i.e. we were caching
+ another block device), handle it like a hole.
+
+ - If the relevant device is missing/not online, return an error.
+
+ - Ok, we have a pointer we can read from. If the extent is smaller than
+ the request, split the request (with bio_split()), and issue the request
+ to the appropriate device:sector.
+
+ - Iterate until the entire request has been handled.
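+
+A sketch of that loop, using the iterator interface from earlier in this
+document (inode_nr and first_sector are hypothetical, BCH_EXTENT is assumed as
+the extent key type, and the pointer/bio handling is only described in
+comments - this is not the actual io.c code):
+
+    struct btree_iter iter;
+    struct bkey_s_c k;
+    struct bpos start = { .inode = inode_nr, .offset = first_sector };
+
+    for_each_btree_key_with_holes(&iter, c, BTREE_ID_EXTENTS, start, k) {
+            switch (k.k->type) {
+            case KEY_TYPE_DELETED:
+                    /* hole: return zeroes for this part of the request */
+                    break;
+            case BCH_EXTENT:
+                    /*
+                     * find a non-stale pointer; if they're all stale treat it
+                     * as a hole, if the device is missing return an error,
+                     * otherwise split the request if needed and issue the
+                     * read to that device:sector
+                     */
+                    break;
+            }
+
+            /* (tracking how much of the request remains is elided here) */
+    }
+
+    bch_btree_iter_unlock(&iter);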
+
+The write path is harder to follow because it's highly asynchronous, but the
+basic algorithm is fairly simple there too:
+
+ - Set up a key that'll represent the data we're going to write (not the
+ pointers yet).
+
+ - Allocate some space to write to (with bch_alloc_sectors()); add the
+ pointer(s) to the space we allocated to the key we set up previously.
+
+ - Were we able to allocate as much space as we wanted? If not, split both the
+ key and the request.
+
+ - Issue the write to the space we just allocated
+
+ - Loop until we've allocated all the space we want, building up a list of keys
+ to insert.
+
+ - Finally, after the data write(s) complete, insert the keys we created into
+ the extents btree.