author     Kent Overstreet <kent.overstreet@gmail.com>   2015-06-19 02:45:58 -0700
committer  Kent Overstreet <kent.overstreet@gmail.com>   2015-06-19 02:45:58 -0700
commit     1ec9612af747c98dcecabaa829d3e42c24f72e9b
tree       7bc555d14225b0385d5900eac48acf149ed4f309
parent     acced30e87ef6a736f0677ecd472b5654e80f2d8
guide
-rw-r--r--  BcacheGuide.mdwn | 137
1 file changed, 126 insertions, 11 deletions
diff --git a/BcacheGuide.mdwn b/BcacheGuide.mdwn
index f626a86..f3f6be7 100644
--- a/BcacheGuide.mdwn
+++ b/BcacheGuide.mdwn
@@ -4,12 +4,12 @@
 This document is intended to cover the design and core concepts of bcache, and
 serve as a guide to programmers trying to become familiar with the bcache
-codebase. It'll try to cover how various features are built ontop of the core
+codebase. It'll try to cover how various features are built on top of the core
 concepts and how everything fits together.

 [[!toc levels=3 startlevel=2]]

-## Keys, values and indexes:
+## Introduction

 The core abstraction in bcache is a key value store ("the btree"). To the btree
 code the keys it indexes are just large, fixed sized bitstrings/integers, with
@@ -26,6 +26,29 @@
 key/value pairs:

 - Dirents and xattrs are stored with one key per dirent/xattr.

+The bulk of the complexity in bcache is in the implementation of the btree;
+posix filesystem semantics are implemented directly on top of it, with the
+btree doing all the heavy lifting with respect to e.g. locking and on disk
+consistency.
+
+In general, bcache eschews heavyweight transactions in favor of ordering
+filesystem updates. In this, bcachefs's implementation is in spirit not unlike
+softupdates - however, since bcachefs is working with logical index updates
+and not physical disk writes, it works out to be drastically simpler. The
+btree guarantees that the ordering of all index updates is preserved, without
+ever needing flushes - see the section on
+[[sequential consistency|BcacheGuide/#index10h3]].
+
+This approach is used for creating/deleting files and links, and for logical
+atomicity of appends/truncates. For those unfamiliar with softupdates, this
+means that when creating files, we create the inode and then create the
+dirent, and when creating a hardlink we increment i_nlinks and then create the
+new dirent - and the order is reversed on deletion.
+On unclean shutdown we might leak inode references - but this is easily
+handled with a garbage collection pass.
+
+We do have some support for transactional updates, however - primarily for
+rename().
+
 ### Keys (bkey and bpos)

 We have separate structs for a search key (`struct bpos`), and the outer
@@ -117,7 +140,7 @@
 Code should always `switch()` off of or check the bkey type field before
 converting to their particular type; the accessors all assert that the type
 field is the expected one, so you'll get a `BUG_ON()` if you don't.

-### Indexing:
+### Btree interface:

 Keys are indexed in a small number of btrees: one btree for extents, another
 for inodes, another for dirents, etc.
@@ -146,10 +169,10 @@
 Some important properties/invariants:

   - create new inode/increment inode refcount
   - create new dirent

-  Bcache doesn't (currently) have transactions, but by doing these operations
-  in the correct order we can guarantee that after an unclean shutdown we'll
-  never have dirents pointing to nonexistent inodes - we might leak inode
-  references, but it's straightforward to garbage collect those at runtime.
+  By doing these operations in the correct order we can guarantee that after
+  an unclean shutdown we'll never have dirents pointing to nonexistent inodes
+  - we might leak inode references, but it's straightforward to garbage
+  collect those at runtime.

 Lookups/insertions are done via the btree iterator, defined in btree.h:
@@ -212,6 +235,8 @@
 there's the convenience macro `for_each_btree_key()`:

         bch_btree_iter_unlock(&iter);

+#### Updates:
+
 Insertion is most often done with `bch_btree_insert_at()`, which takes an
 iterator and inserts at the iterator's current position.
 This is often used in conjunction with `bch_btree_iter_peek_with_holes()`,
 which returns a key
@@ -233,7 +258,61 @@
 operation which is used by things like copygc that move data around in the
 background - the index update will only succeed if the original key was
 present, which guards against races with foreground writes.

-## Extents, pointers, data:
+#### Linked iterators and multiple atomic updates:
+
+These mechanisms are quite new, and the interfaces will likely change and
+become more refined in the future.
+
+Two or more iterators (up to a small fixed number) can be linked with
+`bch_btree_iter_link()` - this causes them to share locks, so that locks that
+would ordinarily conflict (e.g. holding a read lock on a leaf node with one
+iterator while updating the same node via another iterator) won't deadlock.
+
+This makes it possible to get a consistent view of multiple positions in the
+btree (or different btrees) at the same time.
+
+Furthermore, updates via one iterator will not invalidate linked iterators -
+but be careful: a /pointer/ returned via e.g. `bch_btree_iter_peek()` will
+still be invalidated by an update (via `bch_btree_insert_at()`). But the other
+iterator will be kept internally consistent, and it won't have to drop locks
+or redo any traversals to continue iterating.
+
+Then, updates to multiple (linked) iterators may be performed with
+`bch_btree_insert_at_multi()`: it takes a list of iterators, and keys to
+insert at each iterator's position. Either all of the updates will be
+performed without dropping any locks, or else none of them (the
+`BTREE_INSERT_ATOMIC` flag is implicit here, as it makes no sense to use
+`bch_btree_insert_at_multi()` when it's not required). Additionally, the
+updates will all use the same journal reservation, so the update will be
+atomic on disk as well.
+
+There are some locking considerations that are not yet hidden/abstracted
+away:
+
+ - It is not permissible to hold a read lock on any btree node while taking
+   an intent lock on another. This is because of a deadlock it would cause
+   (an even more obscure version of the one documented in the section on
+   [[locking|BcacheGuide/#index8h3]], which I will hopefully document in the
+   future).
+
+   This is easy to work around - just traverse the iterator taking intent
+   locks first (with peek(), or `bch_btree_iter_traverse()` directly if more
+   convenient), and if looping, unlock the iterator holding read locks before
+   re-traversing the intent lock iterator. Since the iterator code will first
+   attempt to relock nodes (checking the remembered sequence number against
+   the lock's current sequence number), there's no real cost to unlocking the
+   read iterator.
+
+ - If linked iterators are only being used to take read locks there are no
+   ordering requirements, since read locks do not conflict with read locks -
+   however, when taking intent locks the iterator pointing to the lower
+   position must be traversed first.
+
+   I haven't yet come up with a satisfactory way of having the iterator code
+   handle this - currently it's up to the code using linked iterators. See
+   `bch_dirent_rename()` for an example.
+
+## Extents, pointers, and data:

 Bcache is extent based, not block based; its extents are much like extents in
 other filesystems that have them - a variable sized chunk of data. From the
 point
@@ -332,7 +411,15 @@
 This has a number of implications:

 ### IO path:

-By this point, the IO path should be fairly straightforward.
+XXX: describe what consumes bch_read() and bch_write()
+
+The main entry points to the IO code are:
+
+ - `bch_read()`
+
+ - `bch_read_extent()`
+
+ - `bch_write()`

 The core read path starts in `bch_read()`, in io.c.
 The algorithm goes like:
@@ -500,9 +587,9 @@
 This does mean that we can't reuse any of the buckets in the stripe until
 each of them is empty - but this shouldn't be much trouble, we just have to
 teach copygc about it when it's considering which buckets to evacuate.

-#### Copying garbage collection:
+### Copying garbage collection:

-#### Tiering:
+### Tiering:

 ### Allocation:
@@ -821,3 +908,31 @@
 in the superblock.

 Right now the btree node local format is defined to be format
 `BKEY_FORMAT_CURRENT`. Changing struct bkey would be as simple as defining a
 new format, and making sure the new format gets added to existing superblocks
 when it's used.
+
+## Error handling
+
+This section assumes we have no fallback via replication to try (i.e. retry
+the write to another device) - or we have, and that's failed: we're at the
+point where we're forced to return errors up the stack.
+
+Error handling of data writes is nothing special - we can return an error for
+just the particular IO and be done with it.
+
+Error handling for metadata IO is a more interesting topic. Fundamentally, if
+a metadata write fails (be it journal or btree), we can't return errors for
+some subset of the outstanding operations - we have to stop everything.
+
+This isn't peculiar to bcache - other journaling filesystems have the same
+behaviour. There are a few reasons; partly it comes down to the fact that by
+the time we do the actual IO, the index update (dirent creation, inode update)
+has long since completed, so we can't return an error and unwind, and we can't
+realistically backtrack from the contents of a journal write or a btree node
+write to the in memory state that's now inconsistent.
+
+So we have to consider the entire in memory state of the filesystem
+inconsistent with what's on disk, and the only thing we can really do is just
+emergency remount RO.
+
+### Journal IO errors
+
+### Btree IO errors