authorKent Overstreet <kent.overstreet@gmail.com>2015-06-02 05:34:33 -0700
committerKent Overstreet <kent.overstreet@gmail.com>2015-06-02 05:34:33 -0700
commit0b86c0668ea8267e65681ff801051d53dfd8e0e4 (patch)
treeca6c07f433f6b22e34c093dee0caaa8b7d11eb2a
parent6b9eed94aa0dc1e257ea66b9ca4a5c9cb4f116aa (diff)
css
-rw-r--r--BcacheGuide.css21
-rw-r--r--BcacheGuide.mdwn248
-rw-r--r--local.css11
3 files changed, 148 insertions, 132 deletions
diff --git a/BcacheGuide.css b/BcacheGuide.css
new file mode 100644
index 0000000..fde3b78
--- /dev/null
+++ b/BcacheGuide.css
@@ -0,0 +1,21 @@
+
+body {
+ font-family: "Liberation Sans", Verdana, "Bitstream Vera Sans", sans-serif;
+ color: black;
+ background: white;
+ margin-left: 2em;
+ margin-right: 2em;
+}
+
+h1,h2,h3,h4,h5 {
+ font-weight: 400;
+}
+
+code {
+ tab-size: 8
+}
+
+#pagebody {
+ max-width: 35em;
+ font-size: small;
+}
diff --git a/BcacheGuide.mdwn b/BcacheGuide.mdwn
index a123a34..a2865be 100644
--- a/BcacheGuide.mdwn
+++ b/BcacheGuide.mdwn
@@ -1,4 +1,4 @@
-[[!toc levels=3 startlevel=2]]
+[[!meta stylesheet=BcacheGuide]]
# The Programmer's Guide to bcache:
@@ -7,6 +7,8 @@ serve as a guide to programmers trying to become familiar with the bcache
codebase. It'll try to cover how various features are built on top of the core
concepts and how everything fits together.
+[[!toc levels=3 startlevel=2]]
+
## Keys, values and indexes:
The core abstraction in bcache is a key value store ("the btree"). To the btree
@@ -26,27 +28,27 @@ key/value pairs:
### Keys (bkey and bpos)
-We have separate structs for a search key (struct bpos), and the outer container
-that holds a key and a value (struct bkey).
+We have separate structs for a search key (`struct bpos`), and the outer
+container that holds a key and a value (`struct bkey`).
Here are their definitions:
- struct bpos {
- u64 inode;
- u64 offset;
- u32 snapshot;
- };
-
- struct bkey {
- u8 u64s;
- u8 format;
- u8 type;
-
- u8 pad;
- u32 version;
- u32 size;
- struct bpos p;
- };
+ struct bpos {
+ u64 inode;
+ u64 offset;
+ u32 snapshot;
+ };
+
+ struct bkey {
+ u8 u64s;
+ u8 format;
+ u8 type;
+
+ u8 pad;
+ u32 version;
+ u32 size;
+ struct bpos p;
+ };
The high bits of the search key are the inode field and the low bits are the
snapshot field (note that on little endian the order in the struct will actually
@@ -61,16 +63,16 @@ extents the offset field corresponds to the offset within that file.
Some notes on the bkey fields:
- The u64s field is the size of the combined key and value, in units of u64s.
- It's not generally used directly: use bkey_val_u64s() or bkey_val_bytes() for
- a higher level interface.
+ It's not generally used directly: use `bkey_val_u64s()` or `bkey_val_bytes()`
+ for a higher level interface.
- The format field is internal to the btree implementation. It's for packed
bkeys - bkeys are not usually stored in the btree in the in-memory format. Only very
low level btree code need be concerned with this, however.
- The type field denotes the type of the value. For example, the directory
- entry code defines two different value types, bch_dirent (type BCH_DIRENT,
- 128) and bch_dirent_whiteout (BCH_DIRENT_WHITEOUT, 129).
+ entry code defines two different value types, `bch_dirent` (type
+ `BCH_DIRENT`, 128) and `bch_dirent_whiteout` (`BCH_DIRENT_WHITEOUT`, 129).
- The size field is only used by extents (0 for all other keys)
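+
+To make the layout concrete, here's roughly what the higher level helpers
+compute - a sketch for illustration, not necessarily the exact definitions in
+the headers:
+
+    /* number of u64s taken up by struct bkey itself */
+    #define BKEY_U64s	(sizeof(struct bkey) / sizeof(u64))
+
+    /* size of the value: everything in the key/value pair past the bkey */
+    static inline unsigned bkey_val_u64s(const struct bkey *k)
+    {
+            return k->u64s - BKEY_U64s;
+    }
+
+    static inline unsigned bkey_val_bytes(const struct bkey *k)
+    {
+            return bkey_val_u64s(k) * sizeof(u64);
+    }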
@@ -86,9 +88,9 @@ btree code will unpack the key and return a wrapper type that contains pointers
to the unpacked key and the value.
The scheme for the wrapper types is:
- bkey_i key with inline value
- bkey_s key with split value (wrapper with pointers to key and value)
- bkey_s_c constent key with split value
+ bkey_i key with inline value
+ bkey_s key with split value (wrapper with pointers to key and value)
+ bkey_s_c constant key with split value
The code has the wrapper types above for generic keys with values, and value
types have the corresponding wrapper types generated with pointers to the value
@@ -96,23 +98,23 @@ of the correct type.
For example, here's the struct definition for extended attributes:
- struct bch_xattr {
- struct bch_val v;
- u8 x_type;
- u8 x_name_len;
- u16 x_val_len;
- u8 x_name[];
- };
-
-The wrapper types will be bkey_i_xattr, bkey_s_xattr, and bkey_s_c_xattr. When
-the xattr code is doing a lookup in the btree, the btree iterator will return
-(via e.g. btree_iter_peek()) a key of type bkey_s_c, and then the xattr code
-will use bkey_s_c_to_xattr(k) (after switching on the type field) to get to the
-xattr itself.
-
-Code should always switch() off of or check the bkey type field before
+ struct bch_xattr {
+ struct bch_val v;
+ u8 x_type;
+ u8 x_name_len;
+ u16 x_val_len;
+ u8 x_name[];
+ };
+
+The wrapper types will be `bkey_i_xattr`, `bkey_s_xattr`, and `bkey_s_c_xattr`.
+When the xattr code is doing a lookup in the btree, the btree iterator will
+return (via e.g. `btree_iter_peek()`) a key of type `bkey_s_c`, and then the
+xattr code will use `bkey_s_c_to_xattr(k)` (after switching on the type field)
+to get to the xattr itself.
+
+Code should always `switch()` off of or check the bkey type field before
converting to their particular type; the accessors all assert that the type
-field is the expected one, so you'll get a BUG_ON() if you don't.
+field is the expected one, so you'll get a `BUG_ON()` if you don't.
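+
+For instance, a sketch of how code consuming xattr keys from the iterator
+might look - `BCH_XATTR` is assumed here as the name of the xattr value type
+(by analogy with `BCH_DIRENT`), and the `.k`/`.v` members are the key/value
+pointers carried by the wrapper types:
+
+    struct bkey_s_c k = bch_btree_iter_peek(&iter);
+
+    switch (k.k->type) {
+    case BCH_XATTR: {
+            /* safe: bkey_s_c_to_xattr() asserts k.k->type is the xattr type */
+            struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);
+
+            pr_info("xattr %.*s\n",
+                    (int) xattr.v->x_name_len, xattr.v->x_name);
+            break;
+    }
+    default:
+            /* deleted key, whiteout, unknown type... */
+            break;
+    }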
### Indexing:
@@ -129,10 +131,10 @@ Some important properties/invariants:
- There are no duplicate keys; insertions implicitly overwrite existing keys.
Related to this, deletion is not exposed as a primitive: instead, deletion is
- done by inserting a key of type KEY_TYPE_DELETED. (The btree code internally
- handles overwrites by setting the old key to KEY_TYPE_DELETED, which will
- then be deleted when the relevant btree node is compacted. Old deleted keys
- are never visible outside the btree code).
+ done by inserting a key of type `KEY_TYPE_DELETED`. (The btree code
+ internally handles overwrites by setting the old key to `KEY_TYPE_DELETED`,
+ which will then be deleted when the relevant btree node is compacted. Old
+ deleted keys are never visible outside the btree code).
- Ordering of insertions/updates is _always_ preserved, across unclean
shutdowns and without any need for flushes.
@@ -150,86 +152,86 @@ Some important properties/invariants:
Lookups/insertions are done via the btree iterator, defined in btree.h:
- struct btree_iter;
+ struct btree_iter;
- void bch_btree_iter_init(struct btree_iter *,
- struct cache_set *,
- enum btree_id,
- struct bpos);
- int bch_btree_iter_unlock(struct btree_iter *);
+ void bch_btree_iter_init(struct btree_iter *,
+ struct cache_set *,
+ enum btree_id,
+ struct bpos);
+ int bch_btree_iter_unlock(struct btree_iter *);
- struct bkey_s_c bch_btree_iter_peek(struct btree_iter *);
- void bch_btree_iter_advance_pos(struct btree_iter *);
+ struct bkey_s_c bch_btree_iter_peek(struct btree_iter *);
+ void bch_btree_iter_advance_pos(struct btree_iter *);
- - A "cache_set" corresponds to a single filesystem/volume - the name coming
- from bcache's starting point as a block cache, and turned into a cache_set
+ - A "`cache_set`" corresponds to a single filesystem/volume - the name coming
+ from bcache's starting point as a block cache, and turned into a `cache_set`
instead of just a cache when it gained the ability to manage multiple
devices.
- - The btree_id is one of BTREE_ID_EXTENTS, BTREE_ID_INODES, etc.
+ - The `btree_id` is one of `BTREE_ID_EXTENTS`, `BTREE_ID_INODES`, etc.
- The bpos argument is the position to start iterating from.
-bch_btree_iter_unlock() unlocks any btree nodes the iterator has locked; if
+`bch_btree_iter_unlock()` unlocks any btree nodes the iterator has locked; if
there was an error reading in a btree node it'll be returned here.
-bch_btree_iter_peek() returns the next key after the iterator's current
+`bch_btree_iter_peek()` returns the next key after the iterator's current
position.
-bch_btree_iter_advance_pos() advance's the iterator's position to immediately
+`bch_btree_iter_advance_pos()` advances the iterator's position to immediately
after the last key returned.
Lookup isn't provided as a primitive because most usage is geared more towards
iterating from a given position, but we now have enough to implement it:
- int lookup(struct bpos search)
- {
- struct btree_iter iter;
- struct bkey_s_c k;
- int ret = 0;
+ int lookup(struct bpos search)
+ {
+ struct btree_iter iter;
+ struct bkey_s_c k;
+ int ret = 0;
- bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, search);
+ bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, search);
- k = bch_btree_iter_peek(&iter);
- if (!k.k || bkey_cmp(k.k->p, search)
- ret = -EEXIST;
-
- bch_btree_iter_unlock(&iter);
- return ret;
- }
+ k = bch_btree_iter_peek(&iter);
+ if (!k.k || bkey_cmp(k.k->p, search))
+ ret = -ENOENT;
+
+ bch_btree_iter_unlock(&iter);
+ return ret;
+ }
If we want to iterate over every key in the btree (or just from a given point),
-there's the convenience macro for_each_btree_key():
+there's the convenience macro `for_each_btree_key()`:
- struct btree_iter iter;
- struct bkey_s_c k;
+ struct btree_iter iter;
+ struct bkey_s_c k;
- for_each_btree_key(&iter, c, BTREE_ID_EXAMPLE, POS_MIN, k)
- printk("got %llu:%llu\n", k.k->p.inode, k.k->p.offset);
+ for_each_btree_key(&iter, c, BTREE_ID_EXAMPLE, POS_MIN, k)
+ printk("got %llu:%llu\n", k.k->p.inode, k.k->p.offset);
- bch_btree_iter_unlock(&iter);
+ bch_btree_iter_unlock(&iter);
-Insertion is most often done with bch_btree_insert_at(), which takes an iterator
-and inserts at the iterator's current position. This is often used in
-conjunction with bch_btree_iter_peek_with_holes(), which returns a key
-representing every valid position (of KEY_TYPE_DELETED if nothing was found).
+Insertion is most often done with `bch_btree_insert_at()`, which takes an
+iterator and inserts at the iterator's current position. This is often used in
+conjunction with `bch_btree_iter_peek_with_holes()`, which returns a key
+representing every valid position (synthesizing one of `KEY_TYPE_DELETED` if
+nothing was found).
This is highly useful for the inode and dirent code. For example, to create a
new inode, the inode code can search for an empty position and then use
-bch_btree_insert_at() when it finds one, and the btree node's lock will guard
+`bch_btree_insert_at()` when it finds one, and the btree node's lock will guard
against races with other inode creations.
-Note that it might not be possible to do the insert without dropping locks,
-e.g. if a split was required; the BTREE_INSERT_ATOMIC flag indicates that the
-insert shouldn't be done in this case and bch_btree_insert_at() will return
--EINTR instead, and the caller will loop and retry the operation.
+Note that it might not be possible to do the insert without dropping locks, e.g.
+if a split was required; the `BTREE_INSERT_ATOMIC` flag indicates that the
+insert shouldn't be done in this case and `bch_btree_insert_at()` will return
+`-EINTR` instead, and the caller will loop and retry the operation.
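+
+The retry convention looks roughly like this - a sketch only: `new_key` stands
+in for whatever `bkey_i` is being inserted, and the real
+`bch_btree_insert_at()` takes more arguments (journal reservation etc.) than
+shown here:
+
+    int ret;
+
+    do {
+            /* with BTREE_INSERT_ATOMIC, fails with -EINTR instead of
+             * dropping locks partway through the update */
+            ret = bch_btree_insert_at(&iter, new_key, BTREE_INSERT_ATOMIC);
+    } while (ret == -EINTR);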
-The extents code also uses a similar mechanism to implement a cmpxchg() like
+The extents code also uses a similar mechanism to implement a cmpxchg-like
operation which is used by things like copygc that move data around in the
background - the index update will only succeed if the original key was present,
which guards against races with foreground writes.
-
## Extents, pointers, data:
Bcache is extent based, not block based; its extents are much like extents in
@@ -255,17 +257,17 @@ Here's a simplified version (in the latest version the extent format has gotten
considerably more complicated in order to incorporate checksums and compression
information, but the concept is still the same):
- struct bch_extent_ptr {
- __u64 offset:48,
- dev:8,
- gen:8;
- };
+ struct bch_extent_ptr {
+ __u64 offset:48,
+ dev:8,
+ gen:8;
+ };
- struct bch_extent {
- struct bch_val v;
+ struct bch_extent {
+ struct bch_val v;
- struct bch_extent_ptr ptr[0]
- };
+ struct bch_extent_ptr ptr[0];
+ };
- The device field is the index of one of the devices in the cache set (i.e.
volume/filesystem).
@@ -348,8 +350,8 @@ The core read path starts in bch_read(), in io.c. The algorithm goes like:
- If the relevant device is missing/not online, return an error.
- Ok, we have a pointer we can read from. If the extent is smaller than
- the request, split the request (with bio_split()), and issue the request
- to the appropriate device:sector.
+ the request, split the request (with bio_split()), and issue the request
+ to the appropriate device:sector.
- Iterate until entire request has been handled.
@@ -409,15 +411,15 @@ make sure the dead parts of the extent aren't overwritten.
So, for compressed or checksummed extents here's the additional information
we'll need to store in struct bch_extent:
- struct bch_extent_crc32 {
- __u32 type:1,
- offset:7,
- compressed_size:8,
- uncompressed_size:8,
- csum_type:4,
- compression_type:4;
- __u32 csum;
- };
+ struct bch_extent_crc32 {
+ __u32 type:1,
+ offset:7,
+ compressed_size:8,
+ uncompressed_size:8,
+ csum_type:4,
+ compression_type:4;
+ __u32 csum;
+ };
Now, when trimming an extent, instead of modifying the pointer offset to point
to the start of the new live region (as we still do for non
@@ -501,6 +503,10 @@ copygc about it when it's considering which buckets to evacuate.
#### Tiering:
+### Allocation:
+
+### Multiple devices:
+
## Btree internals:
At a high level, bcache's btree is a copy on write b+ tree. The main difference
@@ -513,12 +519,12 @@ we've written to the full amount of space we originally reserved for the node.
A single btree node log entry is represented as a header and a list of bkeys,
where the bkeys are all contiguous in memory and in sorted order:
- struct bset {
- /* some fields ommitted */
- ...
- u16 u64s;
- struct bkey_packed start[0];
- };
+ struct bset {
+ /* some fields omitted */
+ ...
+ u16 u64s;
+ struct bkey_packed start[0];
+ };
Since bkeys are variable length, it's not possible to access keys randomly
without other data structures - only iterate sequentially via bkey_next().
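+
+Given a `struct bset *i`, sequential iteration thus looks roughly like this (a
+sketch; `process()` is a stand-in for whatever the caller does with each key):
+
+    struct bkey_packed *k   = i->start;
+    struct bkey_packed *end = (void *) ((u64 *) i->start + i->u64s);
+
+    for (; k != end; k = bkey_next(k))
+            /* bkey_next() advances by the key's own size */
+            process(k);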
@@ -586,9 +592,9 @@ which requires waiting for the new nodes to be written to disk).
Intent locks with only three states do introduce another deadlock, though:
- Thread A Thread B
- read | Parent | intent
- intent | Child | intent
+ Thread A Thread B
+ read | Parent | intent
+ intent | Child | intent
Thread B is splitting the child node: it's allocated new nodes and written them
out, and now needs to take a write lock on the parent in order to add the
@@ -775,7 +781,7 @@ unpack for comparisons would probably be right out. So if we want to be able to
compare packed keys, the requirement is that it be order preserving. That is, we
need to define some function pack(), such that
- bkey_cmp(l, r) == bkey_cmp(pack(l), pack(r))
+ bkey_cmp(l, r) == bkey_cmp(pack(l), pack(r))
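+
+For intuition: subtracting a constant per-node base from a field and storing
+the result in fewer bits is order preserving, as long as every key in the node
+fits. A hypothetical single-field illustration (not the actual packed format):
+
+    /* keys in this node all have offsets in [base, base + 2^32) */
+    static u32 pack_offset(u64 offset, u64 base)
+    {
+            /* l.offset < r.offset implies pack(l) < pack(r), since both
+             * sides are shifted down by the same per-node constant */
+            return offset - base;
+    }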
The way it works is for every btree node we define a packed format (and we
recalculate a new packed format when splitting/rewriting nodes); then keys that
@@ -811,6 +817,6 @@ converting it to our current in memory representation and back).
All that's left to do is to store formats (besides the btree node-local format)
in the superblock. Right now the btree node local format is defined to be format
0 and the in memory format - struct bkey itself - is format 1,
-BKEY_FORMAT_CURRENT. Changing struct bkey would be as simple as defining a new
+`BKEY_FORMAT_CURRENT`. Changing struct bkey would be as simple as defining a new
format, and making sure the new format gets added to existing superblocks when
it's used.
diff --git a/local.css b/local.css
deleted file mode 100644
index 677a4e2..0000000
--- a/local.css
+++ /dev/null
@@ -1,11 +0,0 @@
-
-body {
- margin-top: 5em;
- margin-bottom: 6em;
- padding: 0;
- font-family: Georgia;
- *font-size: small;
- color: black;
- background: white;
- font-size: 1.01em;
-}