commit 265733f273762038577a0340657b6c9cd9a62c4f
Author:    Kent Overstreet <kent.overstreet@gmail.com>  2015-05-30 02:42:27 -0700
Committer: Kent Overstreet <kent.overstreet@gmail.com>  2015-05-30 02:42:27 -0700
tree   384877bbc7f696cb1a10e4714b5fb6960892acf5
parent 445b4770a0b52981b219b55e62258cd2ff5d1a2d

    bcache guide

 BcacheGuide.mdwn | 364 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 364 insertions(+), 0 deletions(-)
diff --git a/BcacheGuide.mdwn b/BcacheGuide.mdwn
new file mode 100644
index 0000000..e60a33f

[[!toc startlevel=2]]

# THE PROGRAMMER'S GUIDE TO BCACHE:

This document is intended to cover the design and core concepts of bcache, and
serve as a guide to programmers trying to become familiar with the bcache
codebase. It'll try to cover how various features are built on top of the core
concepts and how everything fits together.


## KEYS, VALUES AND INDEXES:

The core abstraction in bcache is a key value store ("the btree"). To the btree
code, the keys it indexes are just large, fixed size bitstrings/integers, with
opaque variable sized values.

Essentially everything aside from some low level machinery is stored as
key/value pairs:

 - Extents are one type of key, where the value is a list of pointers to the
   actual data.

 - Inodes are stored indexed by inode number, where the value is the inode
   struct.

 - Dirents and xattrs are stored with one key per dirent/xattr.

### Keys (bkey and bpos)

We have separate structs for a search key (struct bpos) and the outer container
that holds a key and a value (struct bkey).

Here are their definitions:

    struct bpos {
            u64             inode;
            u64             offset;
            u32             snapshot;
    };

    struct bkey {
            u8              u64s;
            u8              format;
            u8              type;

            u8              pad;
            u32             version;
            u32             size;
            struct bpos     p;
    };

The high bits of the search key are the inode field and the low bits are the
snapshot field (note that on little endian, the order of the fields in the
struct will actually be reversed).

Not all code uses all fields of the search key (and snapshot is currently
unused), and some code uses the fields in different ways; it's up to the user of
the btree code to use the fields however they see fit. However, the inode field
generally corresponds to an inode number (of a file in a filesystem), and for
extents the offset field corresponds to the offset within that file.
Some notes on the bkey fields:

 - The u64s field is the size of the combined key and value, in units of u64s.
   It's not generally used directly: use bkey_val_u64s() or bkey_val_bytes() for
   a higher level interface.

 - The format field is internal to the btree implementation. It's for packed
   bkeys - bkeys are not usually stored in the btree in the in-memory format.
   Only very low level btree code need be concerned with this, however.

 - The type field denotes the type of the value. For example, the directory
   entry code defines two different value types, bch_dirent (type BCH_DIRENT,
   128) and bch_dirent_whiteout (BCH_DIRENT_WHITEOUT, 129).

 - The size field is only used by extents (it's 0 for all other keys).

### Values:

Values are stored inline with bkeys. Each value type (for all the types that
have nonempty values) will have a struct definition, and then macros generate a
variety of accessor functions and wrapper types.

Also, while values are _stored_ inline with the keys, due to packing they aren't
typically passed around that way; when accessing a key stored in the btree, the
btree code will unpack the key and return a wrapper type that contains pointers
to the unpacked key and the value.

The scheme for the wrapper types is:

    bkey_i      key with inline value
    bkey_s      key with split value (wrapper with pointers to key and value)
    bkey_s_c    constant key with split value

The code has the wrapper types above for generic keys with values, and value
types have corresponding wrapper types generated with pointers to the value of
the correct type.

For example, here's the struct definition for extended attributes:

    struct bch_xattr {
            struct bch_val  v;
            u8              x_type;
            u8              x_name_len;
            u16             x_val_len;
            u8              x_name[];
    };

The wrapper types will be bkey_i_xattr, bkey_s_xattr, and bkey_s_c_xattr. When
the xattr code is doing a lookup in the btree, the btree iterator will return
(via e.g. btree_iter_peek()) a key of type bkey_s_c, and then the xattr code
will use bkey_s_c_to_xattr(k) (after switching on the type field) to get at the
xattr itself.

Code should always switch() on or otherwise check the bkey type field before
converting to its particular type; the accessors all assert that the type field
is the expected one, so you'll get a BUG_ON() if you don't.

### Indexing:

Keys are indexed in a small number of btrees: one btree for extents, another for
inodes, another for dirents, etc.

We can add new btree types for new features without on disk format changes -
many planned features will be added this way (e.g. for Reed-Solomon/RAID6, we'll
be adding another btree to index stripes by physical device and LBA; more on
this later).

Some important properties/invariants:

 - There are no duplicate keys; insertions implicitly overwrite existing keys.

   Related to this, deletion is not exposed as a primitive: instead, deletion is
   done by inserting a key of type KEY_TYPE_DELETED. (The btree code internally
   handles overwrites by setting the old key to KEY_TYPE_DELETED, which will
   then be deleted when the relevant btree node is compacted; old deleted keys
   are never visible outside the btree code.)

 - Ordering of insertions/updates is _always_ preserved, across unclean
   shutdowns and without any need for flushes.

   This is important for the filesystem code. Creating a file or a new hardlink
   requires two operations:

    - create the new inode/increment the inode refcount
    - create the new dirent

   Bcache doesn't (currently) have transactions, but by doing these operations
   in the correct order we can guarantee that after an unclean shutdown we'll
   never have dirents pointing to nonexistent inodes - we might leak inode
   references, but it's straightforward to garbage collect those at runtime.
Lookups/insertions are done via the btree iterator, defined in btree.h:

    struct btree_iter;

    void bch_btree_iter_init(struct btree_iter *,
                             struct cache_set *,
                             enum btree_id,
                             struct bpos);
    int bch_btree_iter_unlock(struct btree_iter *);

    struct bkey_s_c bch_btree_iter_peek(struct btree_iter *);
    void bch_btree_iter_advance_pos(struct btree_iter *);

 - A "cache_set" corresponds to a single filesystem/volume - the name comes
   from bcache's starting point as a block cache; a cache was turned into a
   cache_set when it gained the ability to manage multiple devices.

 - The btree_id is one of BTREE_ID_EXTENTS, BTREE_ID_INODES, etc.

 - The bpos argument is the position to start iterating from.

bch_btree_iter_unlock() unlocks any btree nodes the iterator has locked; if
there was an error reading in a btree node, it'll be returned here.

bch_btree_iter_peek() returns the next key at or after the iterator's current
position.

bch_btree_iter_advance_pos() advances the iterator's position to immediately
after the last key returned.
Lookup isn't provided as a primitive because most usage is geared more towards
iterating from a given position, but we now have enough to implement it:

    int lookup(struct bpos search)
    {
            struct btree_iter iter;
            struct bkey_s_c k;
            int ret = 0;

            bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, search);

            k = bch_btree_iter_peek(&iter);
            if (!k.k || bkey_cmp(k.k->p, search))
                    ret = -ENOENT;

            bch_btree_iter_unlock(&iter);
            return ret;
    }

If we want to iterate over every key in the btree (or just from a given point),
there's the convenience macro for_each_btree_key():

    struct btree_iter iter;
    struct bkey_s_c k;

    for_each_btree_key(&iter, c, BTREE_ID_EXAMPLE, POS_MIN, k)
            printk("got %llu:%llu\n", k.k->p.inode, k.k->p.offset);

    bch_btree_iter_unlock(&iter);

Insertion is most often done with bch_btree_insert_at(), which takes an iterator
and inserts at the iterator's current position. This is often used in
conjunction with bch_btree_iter_peek_with_holes(), which returns a key
representing every valid position (of type KEY_TYPE_DELETED if nothing was
found).

This is highly useful for the inode and dirent code. For example, to create a
new inode, the inode code can search for an empty position and then use
bch_btree_insert_at() when it finds one, and the btree node's lock will guard
against races with other inode creations.

Note that it might not be possible to do the insert without dropping locks,
e.g. if a split was required; the BTREE_INSERT_ATOMIC flag indicates that the
insert shouldn't be done in this case, and bch_btree_insert_at() will return
-EINTR instead; the caller then loops and retries the operation.

The extents code also uses a similar mechanism to implement a cmpxchg() like
operation, which is used by things like copygc that move data around in the
background - the index update will only succeed if the original key was still
present, which guards against races with foreground writes.
## Extents, pointers, data:

Bcache is extent based, not block based; its extents are much like extents in
other filesystems that have them - a variable sized chunk of data. From the
point of view of the index, they aren't positions, they're ranges or half open
intervals (note that a 0 size extent doesn't overlap with anything).

Bcache's extents are a list of one or more pointers - if there's more than one
pointer, they point to replicated data (or possibly one of the copies is on a
faster device and is considered cached).

Here's a simplified version (in the latest version the extent format has gotten
considerably more complicated in order to incorporate checksums and compression
information, but the concept is still the same):

    struct bch_extent_ptr {
            __u64                   offset:48,
                                    dev:8,
                                    gen:8;
    };

    struct bch_extent {
            struct bch_val          v;

            struct bch_extent_ptr   ptr[0];
    };

 - The dev field is the index of one of the devices in the cache set (i.e. the
   volume/filesystem).

 - The offset field is the offset of the start of the pointed-to data, counting
   from the start of the device in units of 512 byte sectors.

 - gen is a generation number that must match the generation number of the
   bucket the pointer points into for the pointer to be considered valid (not
   stale).

### Buckets and generations:

This mechanism comes from bcache's origins as just a block cache.

The devices bcache manages are divided up into fixed size buckets (typically
anywhere from 128k to 2M). The core of the allocator works in terms of buckets
(there's a sector allocator on top, similar to how the slab allocator is built
on top of the page allocator in Linux).

When a bucket is allocated, it is written to once sequentially: then, it is
never written to again until the entire bucket is reused. When a bucket is
reused, its generation number is incremented (and the new generation number is
persisted before discarding it or writing to it again).
If the bucket still
contained any cached data, incrementing the generation number is the mechanism
that invalidates any still live pointers pointing into that bucket (in
programming language terminology, bcache's pointers are weak pointers).

This has a number of implications:

 - Invalidating clean cached data is very cheap - there's no cost to keeping a
   device full of clean cached data.

 - We don't persist fine grained allocation information: we only persist the
   current generation number of each bucket, and at runtime we maintain in
   memory counts of the number of live dirty and cached sectors in each bucket.
   These counts are updated on the fly as the index is updated and old extents
   are overwritten and new ones added.

   This is a performance tradeoff - it's a fairly significant performance win
   at runtime, but it costs us at startup time. Eventually, we'll probably
   implement a reserve that we can allocate out of at startup so we can do the
   initial mark and sweep in the background.

   This does mean that at startup we have to walk the extents btree once to
   repopulate these counts before we can do any allocation.

 - If there's a lot of internal fragmentation, we do require copying garbage
   collection to compact data - rewriting it into new buckets.

 - Since the generation number is only 8 bits, we do have to guard against
   wraparound - this isn't really a performance issue, since wraparound
   requires rewriting the same bucket many times (and incoming writes will be
   distributed across many buckets). We do occasionally have to walk the
   extents to update the "oldest known generation number" for every bucket, but
   the mark and sweep GC code runs concurrently with everything else, except
   for the allocator if the freelist becomes completely drained before it
   finishes. (The part of GC that updates per bucket sector counts mostly isn't
   required anymore; it may be removed at some point.)
### IO path:

By this point, the IO path should be fairly straightforward.

The core read path starts in bch_read(), in io.c. The algorithm goes like:

 - Iterate over the extents btree (with for_each_btree_key_with_holes()),
   starting from the first sector in the request.

 - Check the type of the returned key:

    - Hole? (KEY_TYPE_DELETED) - just return zeroes.

    - Extent? Check for a pointer we can read from.

       - If they're all stale, it was a cached extent (i.e. we were caching
         another block device); handle it like a hole.

       - If the relevant device is missing/not online, return an error.

       - Ok, we have a pointer we can read from. If the extent is smaller than
         the request, split the request (with bio_split()), and issue the
         request to the appropriate device:sector.

 - Iterate until the entire request has been handled.

The write path is harder to follow because it's highly asynchronous, but the
basic algorithm is fairly simple there too:

 - Set up a key that'll represent the data we're going to write (not the
   pointers yet).

 - Allocate some space to write to (with bch_alloc_sectors()); add the
   pointer(s) to the space we allocated to the key we set up previously.

 - Were we able to allocate as much space as we wanted? If not, split both the
   key and the request.

 - Issue the write to the space we just allocated.

 - Loop until we've allocated all the space we want, building up a list of keys
   to insert.

 - Finally, after the data write(s) complete, insert the keys we created into
   the extents btree.