author     Kent Overstreet <kent.overstreet@gmail.com>  2015-05-30 02:42:27 -0700
committer  Kent Overstreet <kent.overstreet@gmail.com>  2015-05-30 02:42:27 -0700
commit     265733f273762038577a0340657b6c9cd9a62c4f (patch)
tree       384877bbc7f696cb1a10e4714b5fb6960892acf5
parent     445b4770a0b52981b219b55e62258cd2ff5d1a2d (diff)
bcache guide
-rw-r--r--  BcacheGuide.mdwn  364
1 files changed, 364 insertions, 0 deletions
diff --git a/BcacheGuide.mdwn b/BcacheGuide.mdwn
new file mode 100644
index 0000000..e60a33f
--- /dev/null
+++ b/BcacheGuide.mdwn
@@ -0,0 +1,364 @@
+[[!toc startlevel=2]]
+
+# THE PROGRAMMER'S GUIDE TO BCACHE:
+
+This document is intended to cover the design and core concepts of bcache, and
+serve as a guide to programmers trying to become familiar with the bcache
+codebase. It'll try to cover how various features are built on top of the core
+concepts and how everything fits together.
+
+
+## KEYS, VALUES AND INDEXES:
+
+The core abstraction in bcache is a key-value store ("the btree"). To the btree
+code, the keys it indexes are just large, fixed-size bitstrings/integers, with
+opaque, variable-sized values.
+
+Essentially everything aside from some low level machinery is stored as
+key/value pairs:
+
+ - Extents are one type of key where the value is a list of pointers to the
+ actual data.
+
+ - Inodes are stored indexed by inode number, where the value is the inode
+ struct.
+
+ - Dirents and xattrs are stored with one key per dirent/xattr.
+
+### Keys (bkey and bpos)
+
+We have separate structs for a search key (struct bpos), and the outer container
+that holds a key and a value (struct bkey).
+
+Here are their definitions:
+
+ struct bpos {
+ u64 inode;
+ u64 offset;
+ u32 snapshot;
+ };
+
+ struct bkey {
+ u8 u64s;
+ u8 format;
+ u8 type;
+
+ u8 pad;
+ u32 version;
+ u32 size;
+ struct bpos p;
+ };
+
+The high bits of the search key are the inode field and the low bits are the
+snapshot field (note that on little endian the order in the struct will actually
+be reversed).
+
+Not all code uses all fields of the search key (and snapshot is currently
+unused), and some code uses the fields in different ways; it's up to the user of
+the btree code to use the fields however they see fit. However, the inode field
+generally corresponds to an inode number (of a file in a filesystem), and for
+extents the offset field corresponds to the offset within that file.
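+
+For illustration, here's what a search key for an extent lookup might look
+like (inode_nr and file_sector are hypothetical variables; as described later,
+the extent code measures offsets in 512-byte sectors):
+
+    struct bpos search = {
+            .inode  = inode_nr,     /* inode number of the file */
+            .offset = file_sector,  /* offset within that file, in sectors */
+    };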
+
+Some notes on the bkey fields:
+
+ - The u64s field is the size of the combined key and value, in units of u64s.
+ It's not generally used directly: use bkey_val_u64s() or bkey_val_bytes() for
+ a higher level interface.
+
+ - The format field is internal to the btree implementation. It's for packed
+   bkeys - bkeys are not usually stored in the btree in the same format as the
+   in-memory struct shown above. Only very low level btree code need be
+   concerned with this, however.
+
+ - The type field denotes the type of the value. For example, the directory
+ entry code defines two different value types, bch_dirent (type BCH_DIRENT,
+ 128) and bch_dirent_whiteout (BCH_DIRENT_WHITEOUT, 129).
+
+ - The size field is only used by extents (it's 0 for all other keys).
+
+### Values:
+
+Values are stored inline with bkeys. Each value type (for all the types that
+have nonempty values) will have a struct definition, and then macros generate a
+variety of accessor functions and wrapper types.
+
+Also, while values are _stored_ inline with the keys, due to packing they aren't
+typically passed around that way; when accessing a key stored in the btree the
+btree code will unpack the key and return a wrapper type that contains pointers
+to the unpacked key and the value.
+
+The scheme for the wrapper types is:
+
+    bkey_i      key with inline value
+    bkey_s      key with split value (wrapper with pointers to key and value)
+    bkey_s_c    constant key with split value
+
+The code has these generic wrapper types for keys with values of any type, and
+each value type gets corresponding wrapper types (generated by macros) whose
+value pointer has the correct type.
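+
+Roughly speaking, the split wrappers are a pair of pointers and the inline
+wrapper is a key directly followed by its value - something like the following
+(an illustrative sketch, not the exact definitions):
+
+    /* key with inline value: the value immediately follows the key in memory */
+    struct bkey_i {
+            struct bkey     k;
+            struct bch_val  v;
+    };
+
+    /* constant key with split value: separate pointers to key and value */
+    struct bkey_s_c {
+            const struct bkey       *k;
+            const struct bch_val    *v;
+    };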
+
+For example, here's the struct definition for extended attributes:
+
+ struct bch_xattr {
+ struct bch_val v;
+ u8 x_type;
+ u8 x_name_len;
+ u16 x_val_len;
+ u8 x_name[];
+ };
+
+The wrapper types will be bkey_i_xattr, bkey_s_xattr, and bkey_s_c_xattr. When
+the xattr code is doing a lookup in the btree, the btree iterator will return
+(via e.g. btree_iter_peek()) a key of type bkey_s_c, and then the xattr code
+will use bkey_s_c_to_xattr(k) (after switching on the type field) to get to the
+xattr itself.
+
+Code should always switch() on or otherwise check the bkey type field before
+converting to a particular type; the accessors all assert that the type field
+is the expected one, so you'll get a BUG_ON() if you don't.
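+
+As a fragment (assuming an initialized iterator, and assuming BCH_XATTR as the
+xattr value type's constant, by analogy with BCH_DIRENT above), the pattern
+looks roughly like this:
+
+    struct bkey_s_c k = bch_btree_iter_peek(&iter);
+
+    switch (k.k->type) {
+    case BCH_XATTR: {
+            /* safe: the type was checked, so the accessor's assert passes */
+            struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);
+
+            printk("xattr name length %u\n", (unsigned) xattr.v->x_name_len);
+            break;
+    }
+    default:
+            /* some other key type - not an xattr */
+            break;
+    }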
+
+### Indexing:
+
+Keys are indexed in a small number of btrees: one btree for extents, another for
+inodes, another for dirents, etc.
+
+We can add new btree types for new features without on disk format changes -
+many planned features will be added this way (e.g. for Reed-Solomon/RAID6,
+we'll be adding another btree to index stripes by physical device and LBA.
+More on this later).
+
+Some important properties/invariants:
+
+ - There are no duplicate keys; insertions implicitly overwrite existing keys.
+
+ Related to this, deletion is not exposed as a primitive: instead, deletion is
+ done by inserting a key of type KEY_TYPE_DELETED. (The btree code internally
+ handles overwrites by setting the old key to KEY_TYPE_DELETED, which will
+ then be deleted when the relevant btree node is compacted. Old deleted keys
+ are never visible outside the btree code).
+
+ - Ordering of insertions/updates is _always_ preserved, across unclean
+ shutdowns and without any need for flushes.
+
+ This is important for the filesystem code. Creating a file or a new hardlink
+ requires two operations:
+
+ - create new inode/increment inode refcount
+ - create new dirent
+
+ Bcache doesn't (currently) have transactions, but by doing these operations
+ in the correct order we can guarantee that after an unclean shutdown we'll
+ never have dirents pointing to nonexistent inodes - we might leak inode
+ references, but it's straightforward to garbage collect those at runtime.
+
+Lookups/insertions are done via the btree iterator, defined in btree.h:
+
+ struct btree_iter;
+
+ void bch_btree_iter_init(struct btree_iter *,
+ struct cache_set *,
+ enum btree_id,
+ struct bpos);
+ int bch_btree_iter_unlock(struct btree_iter *);
+
+ struct bkey_s_c bch_btree_iter_peek(struct btree_iter *);
+ void bch_btree_iter_advance_pos(struct btree_iter *);
+
+ - A "cache_set" corresponds to a single filesystem/volume - the name comes
+   from bcache's starting point as a block cache: a cache turned into a
+   cache_set when it gained the ability to manage multiple devices.
+
+ - The btree_id is one of BTREE_ID_EXTENTS, BTREE_ID_INODES, etc.
+
+ - The bpos argument is the position to start iterating from.
+
+bch_btree_iter_unlock() unlocks any btree nodes the iterator has locked; if
+there was an error reading in a btree node it'll be returned here.
+
+bch_btree_iter_peek() returns the next key at or after the iterator's current
+position.
+
+bch_btree_iter_advance_pos() advances the iterator's position to immediately
+after the last key returned.
+
+Lookup isn't provided as a primitive because most usage is geared more towards
+iterating from a given position, but we now have enough to implement it:
+
+    int lookup(struct cache_set *c, struct bpos search)
+    {
+            struct btree_iter iter;
+            struct bkey_s_c k;
+            int ret = 0;
+
+            bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, search);
+
+            k = bch_btree_iter_peek(&iter);
+            if (!k.k || bkey_cmp(k.k->p, search))
+                    ret = -ENOENT;
+
+            bch_btree_iter_unlock(&iter);
+            return ret;
+    }
+
+If we want to iterate over every key in the btree (or just from a given point),
+there's the convenience macro for_each_btree_key():
+
+ struct btree_iter iter;
+ struct bkey_s_c k;
+
+ for_each_btree_key(&iter, c, BTREE_ID_EXAMPLE, POS_MIN, k)
+ printk("got %llu:%llu\n", k.k->p.inode, k.k->p.offset);
+
+ bch_btree_iter_unlock(&iter);
+
+Insertion is most often done with bch_btree_insert_at(), which takes an iterator
+and inserts at the iterator's current position. This is often used in
+conjunction with bch_btree_iter_peek_with_holes(), which returns a key for
+every valid position (of type KEY_TYPE_DELETED if nothing was found there).
+
+This is highly useful for the inode and dirent code. For example, to create a
+new inode, the inode code can search for an empty position and then use
+bch_btree_insert_at() when it finds one, and the btree node's lock will guard
+against races with other inode creations.
+
+Note that it might not be possible to do the insert without dropping locks,
+e.g. if a split was required; the BTREE_INSERT_ATOMIC flag indicates that the
+insert shouldn't be done in this case and bch_btree_insert_at() will return
+-EINTR instead, and the caller will loop and retry the operation.
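+
+A rough sketch of that retry pattern (the real bch_btree_insert_at() takes
+more arguments than shown here, and new_key stands in for whatever key the
+caller has built up):
+
+    struct bkey_s_c k;
+    struct bkey_i new_key;      /* the key to insert - set up elsewhere */
+    int ret = 0;
+
+    do {
+            k = bch_btree_iter_peek_with_holes(&iter);
+            if (!k.k)
+                    break;
+
+            /* check that k really is the position/slot we want to fill */
+
+            ret = bch_btree_insert_at(&iter, &new_key, BTREE_INSERT_ATOMIC);
+            /* -EINTR: locks were dropped (e.g. for a split), loop and retry */
+    } while (ret == -EINTR);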
+
+The extents code also uses a similar mechanism to implement a cmpxchg()-like
+operation, which is used by things like copygc that move data around in the
+background - the index update will only succeed if the original key was still
+present, which guards against races with foreground writes.
+
+
+## Extents, pointers, data:
+
+Bcache is extent based, not block based; its extents are much like extents in
+other filesystems that have them - a variable sized chunk of data. From the
+point of view of the index, they aren't positions, they're ranges or half-open
+intervals (note that a 0 size extent doesn't overlap with anything).
+
+Bcache's extents are a list of one or more pointers - if there's more than one
+pointer, they point to replicated data (or possibly one of the copies is on a
+faster device and is considered cached).
+
+Here's a simplified version (in the latest version the extent format has gotten
+considerably more complicated in order to incorporate checksums and compression
+information, but the concept is still the same):
+
+ struct bch_extent_ptr {
+ __u64 offset:48,
+ dev:8,
+ gen:8;
+ };
+
+ struct bch_extent {
+ struct bch_val v;
+
+            struct bch_extent_ptr ptr[0];
+ };
+
+ - The dev field is the index of one of the devices in the cache set (i.e. the
+   volume/filesystem).
+
+ - The offset field is the offset of the start of the pointed-to data, counting
+ from the start of the device in units of 512 byte sectors.
+
+ - "gen" is a generation number that must match the generation number of the
+   bucket the pointer points into for the pointer to be considered valid (not
+   stale) - see the illustration below.
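+
+For illustration, here's how the fields of a single pointer decode (this
+helper is made up for this guide; the real code has its own accessors):
+
+    static void example_print_ptr(const struct bch_extent_ptr *ptr)
+    {
+            /* offset is in 512-byte sectors from the start of the device */
+            u64 byte_offset = (u64) ptr->offset << 9;
+
+            printk("dev %u, sector %llu (byte %llu), gen %u\n",
+                   (unsigned) ptr->dev, (unsigned long long) ptr->offset,
+                   (unsigned long long) byte_offset, (unsigned) ptr->gen);
+    }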
+
+### Buckets and generations:
+
+This mechanism comes from bcache's origins as just a block cache.
+
+The devices bcache manages are divided up into fixed sized buckets (typically
+anywhere from 128k to 2M). The core of the allocator works in terms of buckets
+(there's a sector allocator on top, similar to how the slab allocator is built
+on top of the page allocator in Linux).
+
+When a bucket is allocated, it is written to once sequentially: then, it is
+never written to again until the entire bucket is reused. When a bucket is
+reused, its generation number is incremented (and the new generation number
+persisted before discarding it or writing to it again). If the bucket still
+contained any cached data, incrementing the generation number is the mechanism
+that invalidates any still live pointers pointing into that bucket (in
+programming language terminology, bcache's pointers are weak pointers).
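+
+The staleness rule itself is just a comparison (bucket_gen here is a stand-in
+for however the bucket's current generation is looked up - the real code keeps
+per-bucket generation numbers in memory for each device):
+
+    static inline bool example_ptr_stale(u8 bucket_gen,
+                                         const struct bch_extent_ptr *ptr)
+    {
+            /* the bucket has been reused since this pointer was written */
+            return ptr->gen != bucket_gen;
+    }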
+
+This has a number of implications:
+
+ - Invalidating clean cached data is very cheap - there's no cost to keeping a
+ device full of clean cached data.
+
+ - We don't persist fine grained allocation information: we only persist the
+ current generation number of each bucket, and at runtime we maintain in
+ memory counts of the number of live dirty and cached sectors in each bucket -
+ these counts are updated on the fly as the index is updated and old extents
+ are overwritten and new ones added.
+
+ This is a performance tradeoff - it's a fairly significant performance win at
+ runtime but it costs us at startup time. Eventually, we'll probably implement
+ a reserve that we can allocate out of at startup so we can do the initial
+ mark and sweep in the background.
+
+ This does mean that at startup we have to walk the extents btree once to
+ repopulate these counts before we can do any allocation.
+
+ - If there's a lot of internal fragmentation, we do require copying garbage
+ collection to compact data - rewriting it into new buckets.
+
+ - Since the generation number is only 8 bits, we do have to guard against
+ wraparound - this isn't really a performance issue since wraparound requires
+ rewriting the same bucket many times (and incoming writes will be distributed
+ across many buckets). We do occasionally have to walk the extents to update
+ the "oldest known generation number" for every bucket, but the mark and sweep
+ GC code runs concurrently with everything else, except for the allocator if
+   the freelist becomes completely drained before it finishes (the part of GC
+   that updates per-bucket sector counts mostly isn't required anymore and may
+   be removed at some point).
+
+
+### IO path:
+
+By this point, the IO path should be fairly straightforward.
+
+The core read path starts in bch_read(), in io.c. The algorithm goes roughly
+like this (a sketch in code follows the list):
+
+ - Iterate over the extents btree (with for_each_btree_key_with_holes()),
+ starting from the first sector in the request.
+
+ - Check the type of the returned key:
+
+ - Hole? (KEY_TYPE_DELETED) - just return zeroes
+
+ - Extent? Check for a pointer we can read from.
+
+ - If they're all stale, it was a cached extent (i.e. we were caching
+ another block device), handle it like a hole.
+
+ - If the relevant device is missing/not online, return an error.
+
+ - Ok, we have a pointer we can read from. If the extent is smaller than
+ the request, split the request (with bio_split()), and issue the request
+ to the appropriate device:sector.
+
+ - Iterate until the entire request has been handled.
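+
+A sketch of that loop, using the iterator interface from earlier in this
+document (inode_nr and first_sector are hypothetical, BCH_EXTENT is assumed as
+the extent key type, and the pointer/bio handling is only described in
+comments - this is not the actual io.c code):
+
+    struct btree_iter iter;
+    struct bkey_s_c k;
+    struct bpos start = { .inode = inode_nr, .offset = first_sector };
+
+    for_each_btree_key_with_holes(&iter, c, BTREE_ID_EXTENTS, start, k) {
+            switch (k.k->type) {
+            case KEY_TYPE_DELETED:
+                    /* hole: return zeroes for this part of the request */
+                    break;
+            case BCH_EXTENT:
+                    /*
+                     * find a non-stale pointer; if they're all stale treat it
+                     * as a hole, if the device is missing return an error,
+                     * otherwise split the request if needed and issue the
+                     * read to that device:sector
+                     */
+                    break;
+            }
+
+            /* (tracking how much of the request remains is elided here) */
+    }
+
+    bch_btree_iter_unlock(&iter);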
+
+The write path is harder to follow because it's highly asynchronous, but the
+basic algorithm is fairly simple there too:
+
+ - Set up a key that'll represent the data we're going to write (not the
+ pointers yet).
+
+ - Allocate some space to write to (with bch_alloc_sectors()); add the
+ pointer(s) to the space we allocated to the key we set up previously.
+
+ - Were we able to allocate as much space as we wanted? If not, split both the
+ key and the request.
+
+ - Issue the write to the space we just allocated
+
+ - Loop until we've allocated all the space we want, building up a list of keys
+ to insert.
+
+ - Finally, after the data write(s) complete, insert the keys we created into
+ the extents btree.