author		Kent Overstreet <kent.overstreet@gmail.com>	2015-06-19 02:45:58 -0700
committer	Kent Overstreet <kent.overstreet@gmail.com>	2015-06-19 02:45:58 -0700
commit		1ec9612af747c98dcecabaa829d3e42c24f72e9b (patch)
tree		7bc555d14225b0385d5900eac48acf149ed4f309
parent		acced30e87ef6a736f0677ecd472b5654e80f2d8 (diff)
guide
-rw-r--r--	BcacheGuide.mdwn	137
1 files changed, 126 insertions, 11 deletions
diff --git a/BcacheGuide.mdwn b/BcacheGuide.mdwn
index f626a86..f3f6be7 100644
--- a/BcacheGuide.mdwn
+++ b/BcacheGuide.mdwn
@@ -4,12 +4,12 @@
This document is intended to cover the design and core concepts of bcache, and
serve as a guide to programmers trying to become familiar with the bcache
-codebase. It'll try to cover how various features are built ontop of the core
+codebase. It'll try to cover how various features are built on top of the core
concepts and how everything fits together.
[[!toc levels=3 startlevel=2]]
-## Keys, values and indexes:
+## Introduction
The core abstraction in bcache is a key value store ("the btree"). To the btree
code the keys it indexes are just large, fixed sized bitstrings/integers, with
@@ -26,6 +26,29 @@ key/value pairs:
- Dirents and xattrs are stored with one key per dirent/xattr.
+The bulk of the complexity in bcache is in the implementation of the btree;
+POSIX filesystem semantics are implemented directly on top of it, with the
+btree doing all the heavy lifting with respect to e.g. locking and on-disk
+consistency.
+
+In general, bcache eschews heavyweight transactions in favor of ordering
+filesystem updates. In this respect, bcachefs's implementation is in spirit not
+unlike softupdates - however, since bcachefs works with logical index updates
+rather than physical disk writes, it works out to be drastically simpler. The
+btree guarantees that the ordering of all index updates is preserved, without
+ever needing flushes - see the section on
+[[sequential consistency|BcacheGuide/#index10h3]].
+
+This approach is used for creating/deleting files and links, and for
+logical atomicity of appends/truncates. For those unfamiliar with softupdates,
+this means that when creating files, we create the inode and then create the
+dirent, and when creating a hardlink we increment i_nlinks and then create the
+new dirent - and the order is reversed on deletion. On unclean shutdown we might
+leak inode references - but this is easily handled with a garbage collection
+pass.
+
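+As a rough sketch of this ordering (the helper names below are hypothetical
+stand-ins, not the actual bcache functions - in reality both steps are btree
+index updates):
+
+    /* Illustrative only: creating a file or a hardlink. */
+    int create_inode_or_bump_nlinks(void);  /* insert inode key / bump i_nlinks  */
+    int create_dirent(void);                /* insert dirent key for that inode  */
+
+    int example_create(void)
+    {
+            int ret = create_inode_or_bump_nlinks();
+            if (ret)
+                    return ret;
+
+            /*
+             * If we crash here, we leak an inode reference - harmless, and
+             * cleaned up by the garbage collection pass. Doing the steps in
+             * the reverse order could leave a dirent pointing at a
+             * nonexistent inode. Deletion reverses the order: remove the
+             * dirent first, then drop the inode reference.
+             */
+            return create_dirent();
+    }
+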
+We do have some support for transactional updates, however - primarily for
+rename().
+
### Keys (bkey and bpos)
We have separate structs for a search key (`struct bpos`), and the outer
Code should always `switch()` off of or check the bkey type field before
converting a key to its particular type; the accessors all assert that the type
field is the expected one, so you'll get a `BUG_ON()` if you don't.
-### Indexing:
+### Btree interface:
Keys are indexed in a small number of btrees: one btree for extents, another for
inodes, another for dirents, etc.
@@ -146,10 +169,10 @@ Some important properties/invariants:
- create new inode/increment inode refcount
- create new dirent
- Bcache doesn't (currently) have transactions, but by doing these operations
- in the correct order we can guarantee that after an unclean shutdown we'll
- never have dirents pointing to nonexistent inodes - we might leak inode
- references, but it's straightforward to garbage collect those at runtime.
+ By doing these operations in the correct order we can guarantee that after an
+ unclean shutdown we'll never have dirents pointing to nonexistent inodes - we
+ might leak inode references, but it's straightforward to garbage collect those
+ at runtime.
Lookups/insertions are done via the btree iterator, defined in btree.h:
@@ -212,6 +235,8 @@ there's the convenience macro `for_each_btree_key()`:
bch_btree_iter_unlock(&iter);
+#### Updates:
+
Insertion is most often done with `bch_btree_insert_at()`, which takes an
iterator and inserts at the iterator's current position. This is often used in
conjunction with `bch_btree_iter_peek_with_holes()`, which returns a key
@@ -233,7 +258,61 @@ operation which is used by things like copygc that move data around in the
background - the index update will only succeed if the original key was present,
which guards against races with foreground writes.
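+
+A rough sketch of the peek-then-insert pattern - the iterator setup call, the
+btree ID and the exact `bch_btree_insert_at()` argument list are simplified
+assumptions here, not the real signatures:
+
+    /* Illustrative only: update the key (or hole) at a given position. */
+    struct btree_iter iter;
+    struct bkey_s_c k;
+    int ret;
+
+    bch_btree_iter_init(&iter, BTREE_ID_INODES, pos);
+
+    k = bch_btree_iter_peek_with_holes(&iter);
+    /* ... build the new key to insert, based on k (which may be a hole) ... */
+
+    ret = bch_btree_insert_at(&iter /* , new key, flags - simplified */);
+    bch_btree_iter_unlock(&iter);
+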
-## Extents, pointers, data:
+#### Linked iterators and multiple atomic updates:
+
+These mechanisms are quite new, and the interfaces will likely change and become
+more refined in the future:
+
+Two or more iterators (up to a small fixed number) can be linked with
+`bch_btree_iter_link()` - this causes them to share locks, so that taking locks
+on the same node that would ordinarily conflict (e.g. holding a read lock on a
+leaf node with one iterator while updating the same node via another iterator)
+won't deadlock.
+
+This makes it possible to get a consistent view of multiple positions in the
+btree (or different btrees) at the same time.
+
+Furthermore, updates via one iterator will not invalidate linked iterators - but
+be careful: a /pointer/ returned via e.g. `bch_btree_iter_peek()` will still be
+invalidated by an update (via `bch_btree_insert_at()`). The linked iterator
+itself, however, is kept internally consistent, and it won't have to drop locks
+or redo any traversals to continue iterating.
+
+Then, updates to multiple (linked) iterators may be performed with
+`bch_btree_insert_at_multi()`: it takes a list of iterators, and keys to insert
+at each iterator's position. Either all of the updates will be performed without
+dropping any locks, or else none of them (the `BTREE_INSERT_ATOMIC` flag is
+implicit here, as it makes no sense to use `bch_btree_insert_at_multi()` when
+it's not required). Additionally, the updates will all use the same journal
+reservation, so the update will be atomic on disk as well.
+
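+As a rough sketch of a two-position atomic update (the iterator setup calls,
+the btree ID name and the argument lists are simplified assumptions here - see
+`bch_dirent_rename()` for the real thing):
+
+    /* Illustrative only: atomically update two positions in the dirents btree. */
+    struct btree_iter src_iter, dst_iter;
+    int ret;
+
+    bch_btree_iter_init(&src_iter, BTREE_ID_DIRENTS, src_pos);
+    bch_btree_iter_init(&dst_iter, BTREE_ID_DIRENTS, dst_pos);
+    bch_btree_iter_link(&src_iter, &dst_iter);      /* share locks */
+
+    /* ... traverse/peek both iterators and construct the two new keys ... */
+
+    /*
+     * Both updates happen without dropping locks, or neither does, and they
+     * share a single journal reservation, so the update is also atomic on
+     * disk (exact argument list simplified):
+     */
+    ret = bch_btree_insert_at_multi(/* (iter, key) pairs ... */);
+
+    bch_btree_iter_unlock(&src_iter);
+    bch_btree_iter_unlock(&dst_iter);
+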
+There are some locking considerations that are not yet hidden/abstracted away:
+
+ - It is not permissible to hold a read lock on any btree node while taking an
+   intent lock on another. This is because of a deadlock it would cause (an
+   even more obscure version of the one documented in the section on
+   [[locking|BcacheGuide/#index8h3]], which I will hopefully document in the
+   future).
+
+   This is easy to work around - just traverse the iterator taking intent locks
+   first (with peek(), or `bch_btree_iter_traverse()` directly if more
+   convenient), and, if looping, unlock the iterator holding read locks before
+   re-traversing the intent lock iterator. Since the iterator code will first
+   attempt to relock nodes (checking the remembered sequence number against the
+   lock's current sequence number), there's no real cost to unlocking the read
+   iterator. A sketch of this pattern follows this list.
+
+ - If linked iterators are only being used to take read locks there are no
+ ordering requirements, since read locks do not conflict with read locks -
+ however, when taking intent locks the iterator pointing to the lower position
+ must be traversed first.
+
+ I haven't yet come up with a satisfactory way of having the iterator code
+ handle this - currently it's up to the code using linked iterators. See
+ `bch_dirent_rename()` for an example.
+
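+As a minimal sketch of the lock-ordering workaround from the first point above
+(the function names appear above, but the retry condition and the exact
+signatures are approximations):
+
+    /* Illustrative only: one intent-lock iterator, one read-lock iterator. */
+    struct btree_iter intent_iter, read_iter;
+    struct bkey_s_c k;
+    int ret;
+
+    /* ... initialize both iterators ... */
+
+    do {
+            /*
+             * Traverse the intent-lock iterator first - we must not be
+             * holding any read locks while taking intent locks.
+             */
+            ret = bch_btree_iter_traverse(&intent_iter);
+            if (ret)
+                    break;
+
+            k = bch_btree_iter_peek(&read_iter);
+            /* ... use k to build the update ... */
+
+            ret = bch_btree_insert_at(&intent_iter /* , ... */);
+
+            /*
+             * If the insert had to drop locks and we need to retry, unlock
+             * the read iterator before re-traversing the intent iterator;
+             * relocking is cheap because the iterator just rechecks the
+             * nodes' lock sequence numbers.
+             */
+            if (ret == -EINTR)
+                    bch_btree_iter_unlock(&read_iter);
+    } while (ret == -EINTR);
+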
+## Extents, pointers, and data:
Bcache is extent based, not block based; its extents are much like extents in
other filesystems that have them - a variable sized chunk of data. From the point
@@ -332,7 +411,15 @@ This has a number of implications:
### IO path:
-By this point, the IO path should be fairly straightforward.
+XXX: describe what consumes bch_read() and bch_write()
+
+The main entry points to the IO code are
+
+ - `bch_read()`
+
+ - `bch_read_extent()`
+
+ - `bch_write()`
The core read path starts in `bch_read()`, in io.c. The algorithm goes like:
This does mean that we can't reuse any of the buckets in the stripe until each
of them is empty - but this shouldn't be much trouble, we just have to teach
copygc about it when it's considering which buckets to evacuate.
-#### Copying garbage collection:
+### Copying garbage collection:
-#### Tiering:
+### Tiering:
### Allocation:
@@ -821,3 +908,31 @@ in the superblock. Right now the btree node local format is defined to be format
`BKEY_FORMAT_CURRENT`. Changing struct bkey would be as simple as defining a new
format, and making sure the new format gets added to existing superblocks when
it's used.
+
+## Error handling
+
+This section assumes we have no fallback via replication to try (i.e. retrying
+the write on another device) - or that we did, and it failed: we're at the point
+where we're forced to return errors up the stack.
+
+Error handling of data writes is nothing special - we can return an error for
+just the particular IO and be done with it.
+
+Error handling for metadata IO is a more interesting topic. Fundamentally, if a
+metadata write fails (be it journal or btree), we can't return errors for just
+some subset of the outstanding operations - we have to stop everything.
+
+This isn't peculiar to bcache - other journaling filesystems have the same
+behaviour. There are a few reasons; partly it comes down to the fact that by the
+time we do the actual IO, the index update (dirent creation, inode update) has
+long since completed, so we can't return an error and unwind, and we can't
+realistically backtrack from the contents of a journal write or a btree node
+write to the in-memory state that's now inconsistent.
+
+So we have to consider the entire in-memory state of the filesystem inconsistent
+with what's on disk, and the only thing we can really do is emergency remount
+RO.
+
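+As a sketch of the consequence (the struct and function names here are
+hypothetical placeholders, just to show the shape of the logic):
+
+    /* Illustrative only: completion path of a metadata (journal/btree) write. */
+    static void example_metadata_write_done(struct cache_set *c, int error)
+    {
+            if (!error)
+                    return;
+
+            /*
+             * We can't unwind the index updates this write was meant to
+             * persist, so the only safe response is to stop all further
+             * writes and flip the whole filesystem to read-only.
+             */
+            example_emergency_read_only(c);
+    }
+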
+### Journal IO errors
+
+### Btree IO errors