[[!meta stylesheet=BcacheGuide]]

# The Programmer's Guide to bcache:

This document is intended to cover the design and core concepts of bcache, and
serve as a guide to programmers trying to become familiar with the bcache
codebase. It'll try to cover how various features are built on top of the core
concepts and how everything fits together.

[[!toc levels=3 startlevel=2]]

## Keys, values and indexes:

The core abstraction in bcache is a key value store ("the btree"). To the btree
code, the keys it indexes are just large, fixed size bitstrings/integers, with
opaque, variable sized values.

Aside from some low level machinery, essentially all metadata is stored as
key/value pairs:

 - Extents are one type of key where the value is a list of pointers to the
   actual data.

 - Inodes are stored indexed by inode number, where the value is the inode
   struct.

 - Dirents and xattrs are stored with one key per dirent/xattr.

### Keys (bkey and bpos) 

We have separate structs for a search key (`struct bpos`), and the outer
container that holds a key and a value (`struct bkey`).

Here are their definitions:

    struct bpos {
            u64         inode;
            u64         offset;
            u32         snapshot;
    };
    
    struct bkey {
            u8          u64s;
            u8          format;
            u8          type;
    
            u8          pad;
            u32         version;
            u32         size;
            struct bpos p;
    };

The high bits of the search key are the inode field and the low bits are the
snapshot field (note that on little endian the order in the struct will actually
be reversed).
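
In other words, positions sort by inode first, then offset, then snapshot. A
minimal sketch of that comparison (a hypothetical helper for illustration - the
real code compares keys, often in packed form, as described later):

    static int bpos_cmp(struct bpos l, struct bpos r)
    {
            /* inode is most significant, snapshot least significant: */
            if (l.inode != r.inode)
                    return l.inode < r.inode ? -1 : 1;
            if (l.offset != r.offset)
                    return l.offset < r.offset ? -1 : 1;
            if (l.snapshot != r.snapshot)
                    return l.snapshot < r.snapshot ? -1 : 1;
            return 0;
    }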

Not all code uses all fields of the search key (and snapshot is currently
unused), and some code uses the fields in different ways; it's up to the user of
the btree code to use the fields however they see fit. However, the inode field
generally corresponds to an inode number (of a file in a filesystem), and for
extents the offset field corresponds to the offset within that file.

Some notes on the bkey fields:

 - The u64s field is the size of the combined key and value, in units of u64s.
   It's not generally used directly: use `bkey_val_u64s()` or `bkey_val_bytes()`
   for a higher level interface.

 - The format field is internal to the btree implementation. It's for packed
   bkeys - bkeys are not usually stored in the btree in the in memory format.
   Only very low level btree code needs to be concerned with this, however.

 - The type field denotes the type of the value. For example, the directory
   entry code defines two different value types, `bch_dirent` (type
   `BCH_DIRENT`, 128) and `bch_dirent_whiteout` (`BCH_DIRENT_WHITEOUT`, 129). 

 - The size field is only used by extents (it's 0 for all other keys).

### Values:

Values are stored inline with bkeys. Each value type (for all the types that
have nonempty values) will have a struct definition, and then macros generate a
variety of accessor functions and wrapper types.

Also, while values are _stored_ inline with the keys, due to packing they aren't
typically passed around that way; when accessing a key stored in the btree the
btree code will unpack the key and return a wrapper type that contains pointers
to the unpacked key and the value.

The scheme for the wrapper types is:

        bkey_i          key with inline value
        bkey_s          key with split value (wrapper with pointers to key and value)
        bkey_s_c        const key with split value (read only)

The code has the wrapper types above for generic keys with values, and value
types have the corresponding wrapper types generated with pointers to the value
of the correct type.

For example, here's the struct definition for extended attributes:

    struct bch_xattr {
            struct bch_val      v;
            u8                  x_type;
            u8                  x_name_len;
            u16                 x_val_len;
            u8                  x_name[];
    };

The wrapper types will be `bkey_i_xattr`, `bkey_s_xattr`, and `bkey_s_c_xattr`.
When the xattr code is doing a lookup in the btree, the btree iterator will
return (via e.g. `btree_iter_peek()`) a key of type `bkey_s_c`, and then the
xattr code will use `bkey_s_c_to_xattr(k)` (after switching on the type field)
to get to the xattr itself.

Code should always `switch()` on, or otherwise check, the bkey type field before
converting to a particular value type; the accessors all assert that the type
field is the expected one, so you'll get a `BUG_ON()` if you don't.
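
As a sketch of that pattern (assuming the xattr value type constant is named
`BCH_XATTR`, following the dirent naming convention above, and that the
generated wrapper exposes the value via a `.v` member):

    struct bkey_s_c k = bch_btree_iter_peek(&iter);

    if (k.k && k.k->type == BCH_XATTR) {
            struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);

            /* xattr.v now points at the struct bch_xattr value: */
            printk("xattr name length %u\n", xattr.v->x_name_len);
    }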

### Indexing:

Keys are indexed in a small number of btrees: one btree for extents, another for
inodes, another for dirents, etc.

We can add new btree types for new features without on disk format changes -
many planned features will be added this way (e.g. for Reed-Solomon/RAID6, we'll
be adding another btree to index stripes by physical device and LBA; more on
this later).

Some important properties/invariants:

 - There are no duplicate keys; insertions implicitly overwrite existing keys.

   Related to this, deletion is not exposed as a primitive: instead, deletion is
   done by inserting a key of type `KEY_TYPE_DELETED`. (The btree code
   internally handles overwrites by setting the old key to `KEY_TYPE_DELETED`,
   which will then be deleted when the relevant btree node is compacted. Old
   deleted keys are never visible outside the btree code).

 - Ordering of insertions/updates is _always_ preserved, across unclean
   shutdowns and without any need for flushes.

   This is important for the filesystem code. Creating a file or a new hardlink
   requires two operations:

     - create new inode/increment inode refcount
     - create new dirent

   Bcache doesn't (currently) have transactions, but by doing these operations
   in the correct order we can guarantee that after an unclean shutdown we'll
   never have dirents pointing to nonexistent inodes - we might leak inode
   references, but it's straightforward to garbage collect those at runtime.

Lookups/insertions are done via the btree iterator, defined in btree.h:

    struct btree_iter;

    void bch_btree_iter_init(struct btree_iter *,
                             struct cache_set *,
                             enum btree_id,
                             struct bpos);
    int bch_btree_iter_unlock(struct btree_iter *);

    struct bkey_s_c bch_btree_iter_peek(struct btree_iter *);
    void bch_btree_iter_advance_pos(struct btree_iter *);

 - A "`cache_set`" corresponds to a single filesystem/volume - the name coming
   from bcache's starting point as a block cache, and turned into a `cache_set`
   instead of just a cache when it gained the ability to manage multiple
   devices.

 - The `btree_id` is one of `BTREE_ID_EXTENTS`, `BTREE_ID_INODES`, etc.

 - The bpos argument is the position to start iterating from.

`bch_btree_iter_unlock()` unlocks any btree nodes the iterator has locked; if
there was an error reading in a btree node it'll be returned here.

`bch_btree_iter_peek()` returns the next key after the iterator's current
position.

`bch_btree_iter_advance_pos()` advances the iterator's position to immediately
after the last key returned.

Lookup isn't provided as a primitive because most usage is geared more towards
iterating from a given position, but we now have enough to implement it:

    int lookup(struct cache_set *c, struct bpos search)
    {
            struct btree_iter iter;
            struct bkey_s_c k;
            int ret = 0;

            bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, search);

            /* peek returns the first key at or after the search position: */
            k = bch_btree_iter_peek(&iter);
            if (!k.k || bkey_cmp(k.k->p, search))
                    ret = -ENOENT;

            bch_btree_iter_unlock(&iter);
            return ret;
    }

If we want to iterate over every key in the btree (or just from a given point),
there's the convenience macro `for_each_btree_key()`:

    struct btree_iter iter;
    struct bkey_s_c k;

    for_each_btree_key(&iter, c, BTREE_ID_EXAMPLE, POS_MIN, k)
            printk("got %llu:%llu\n", k.k->p.inode, k.k->p.offset);

    bch_btree_iter_unlock(&iter);

Insertion is most often done with `bch_btree_insert_at()`, which takes an
iterator and inserts at the iterator's current position. This is often used in
conjunction with `bch_btree_iter_peek_with_holes()`, which returns a key
representing every valid position (synthesizing one of `KEY_TYPE_DELETED` if
nothing was found).

This is highly useful for the inode and dirent code. For example, to create a
new inode, the inode code can search for an empty position and then use
`bch_btree_insert_at()` when it finds one, and the btree node's lock will guard
against races with other inode creations.

Note that it might not be possible to do the insert without dropping locks, e.g.
if a split was required; the `BTREE_INSERT_ATOMIC` flag indicates that the
insert shouldn't be done in this case and `bch_btree_insert_at()` will return
`-EINTR` instead, and the caller will loop and retry the operation.
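
A rough sketch of that retry pattern (the real `bch_btree_insert_at()` takes
more arguments than shown here - treat this signature as a simplified
assumption):

    do {
            k = bch_btree_iter_peek_with_holes(&iter);
            if (!k.k)
                    break;

            /* ... inspect k, decide what to insert at this position ... */

            /* simplified call - the real signature has more arguments: */
            ret = bch_btree_insert_at(&iter, &new_key, BTREE_INSERT_ATOMIC);

            /*
             * -EINTR: the insert would have required dropping locks
             * (e.g. for a node split), so nothing was inserted - just
             * retry:
             */
    } while (ret == -EINTR);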

The extents code also uses a similar mechanism to implement a cmpxchg like
operation which is used by things like copygc that move data around in the
background - the index update will only succeed if the original key was present,
which guards against races with foreground writes.

## Extents, pointers, data:

Bcache is extent based, not block based; its extents are much like extents in
other filesystems that have them - a variable sized chunk of data. From the
point of view of the index they aren't positions, they're ranges - half open
intervals (note that a zero size extent doesn't overlap with anything).

Bcache's extents are indexed by inode:offset, and their size is stored in the
size field in struct bkey. The offset and size are both in 512 byte sectors (as
are the pointer offsets). The offset field denotes the _end_ position of the
extent within the file - a key with offset 8 and size 8 points to the data for
sectors 0 through 7.

(This oddity was a result of bcache's btree being designed for extents first,
and non extent keys coming later - it makes searching for extents within a
certain range cleaner when iterating in ascending order).
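
So, given a `struct bkey` for an extent, the start and end of the range it
covers (in 512 byte sectors) fall out as follows - a trivial sketch, the real
code has small helpers along these lines:

    /* the key's offset field is the _end_ of the extent: */
    static u64 extent_end(const struct bkey *k)
    {
            return k->p.offset;
    }

    static u64 extent_start(const struct bkey *k)
    {
            return k->p.offset - k->size;
    }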

Inside the value is a list of one or more pointers - if there's more than one
pointer, they point to replicated data (or possibly one of the copies is on a
faster device and is considered cached).

Here's a simplified version (in the latest version the extent format has gotten
considerably more complicated in order to incorporate checksums and compression
information, but the concept is still the same):

    struct bch_extent_ptr {
            __u64                   offset:48,
                                    dev:8,
                                    gen:8;
    };

    struct bch_extent {
            struct bch_val          v;

            struct bch_extent_ptr   ptr[0];
    };

 - The device field is the index of one of the devices in the cache set (i.e.
   volume/filesystem).

 - The offset field is the offset of the start of the pointed-to data, counting
   from the start of the device in units of 512 byte sectors.

 - "gen" is a generation number that must match the generation number of the
   bucket the pointer points into for the pointer to be consider valid (not
   stale).

### Buckets and generations:

This mechanism comes from bcache's origins as just a block cache.

The devices bcache manages are divided up into fixed sized buckets (typically
anywhere from 128k to 2M). The core of the allocator works in terms of buckets
(there's a sector allocator on top, similar to how the slab allocator is built
on top of the page allocator in Linux).

When a bucket is allocated, it is written to once sequentially: then, it is
never written to again until the entire bucket is reused. When a bucket is
reused, its generation number is incremented (and the new generation number
persisted before discarding it or writing to it again). If the bucket still
contained any cached data, incrementing the generation number is the mechanism
that invalidates any still live pointers pointing into that bucket (in
programming language terminology, bcache's pointers are weak pointers).
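
A sketch of what the staleness check amounts to (the device struct and its
fields here are illustrative assumptions, not the real data structures):

    /*
     * Illustration only: assume each device keeps an in memory array of
     * per bucket generation numbers, indexed by bucket number.
     */
    static bool ptr_stale(const struct cache *ca,
                          const struct bch_extent_ptr *ptr)
    {
            size_t bucket = ptr->offset / ca->bucket_sectors;

            /* stale if the bucket has been reused since the pointer was made: */
            return ca->bucket_gens[bucket] != ptr->gen;
    }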

This has a number of implications:

 - Invalidating clean cached data is very cheap - there's no cost to keeping a
   device full of clean cached data.

 - We don't persist fine grained allocation information: we only persist the
   current generation number of each bucket, and at runtime we maintain in
   memory counts of the number of live dirty and cached sectors in each bucket -
   these counts are updated on the fly as the index is updated and old extents
   are overwritten and new ones added.

   This is a performance tradeoff - it's a fairly significant performance win at
   runtime but it costs us at startup time. Eventually, we'll probably implement
   a reserve that we can allocate out of at startup so we can do the initial
   mark and sweep in the background.

   This does mean that at startup we have to walk the extents btree once to
   repopulate these counts before we can do any allocation.

 - If there's a lot of internal fragmentation, we do require copying garbage
   collection to compact data - rewriting it into new buckets.

 - Since the generation number is only 8 bits, we do have to guard against
   wraparound - this isn't really a performance issue since wraparound requires
   rewriting the same bucket many times (and incoming writes will be distributed
   across many buckets). We do occasionally have to walk the extents to update
   the "oldest known generation number" for every bucket, but the mark and sweep
   GC code runs concurrently with everything else, except for the allocator if
   the freelist becomes completely drained before it finishes (the part of GC
   that updates per bucket sector counts mostly isn't required anymore, it may
   be removed at some point).


### IO path:

By this point, the IO path should be fairly straightforward.

The core read path starts in bch_read(), in io.c. The algorithm goes like:

 - Iterate over the extents btree (with for_each_btree_key_with_holes()),
   starting from the first sector in the request.

 - Check the type of the returned key:

   - Hole? (KEY_TYPE_DELETED) - just return zeroes

   - Extent? Check for a pointer we can read from.

      - If they're all stale, it was a cached extent (i.e. we were caching
        another block device), handle it like a hole.

      - If the relevant device is missing/not online, return an error.

      - Ok, we have a pointer we can read from. If the extent is smaller than
        the request, split the request (with bio_split()), and issue the request
        to the appropriate device:sector.

      - Iterate until entire request has been handled.

The write path is harder to follow because it's highly asynchronous, but the
basic algorithm is fairly simple there too:

 - Set up a key that'll represent the data we're going to write (not the
   pointers yet).

 - Allocate some space to write to (with bch_alloc_sectors()); add the
   pointer(s) to the space we allocated to the key we set up previously.

 - Were we able to allocate as much space as we wanted? If not, split both the
   key and the request.

 - Issue the write to the space we just allocated

 - Loop until we've allocated all the space we want, building up a list of keys
   to insert.

 - Finally, after the data write(s) complete, insert the keys we created into
   the extents btree.

#### Checksumming/compression:

As mentioned previously, bch_extent got more complicated with data checksumming
and compression. 

For data checksumming, what we want to do is store the checksum with the key and
pointers, not with the data - if you store the checksum with the data, verifying
the checksum only tells you that you got the checksum that goes with that data;
it doesn't tell you if you got the data you actually wanted - perhaps the write
never happened and you're reading old data. There are a number of subtle ways
data can be corrupted that naively putting the checksum with the data won't
protect against, and while there are workarounds for most of them it's
fundamentally not the ideal approach.

So we want to store the checksum with the extent, but now we've got a different
problem - partially overwritten extents. When an extent is partially
overwritten, we don't have a checksum for the portion that's now live and we
can't compute it without reading that data in, which would probably not be
ideal.

So we need to remember what the original extent was, so that later when that
data is read we can read in the entire original extent, checksum that, and then
return only the part that was live.

Compression leads to a similar issue: if the extent is compressed, we can't read
just part of it and decompress it - we have to be able to read the entire
extent.

One simplifying factor is that since buckets are reused all at once, as long as
any part of the original extent is live the rest of it will still be there - we
don't have to complicate our accounting and have two notions of live data to
make sure the dead parts of the extent aren't overwritten.

So, for compressed or checksummed extents here's the additional information
we'll need to store in struct bch_extent:

    struct bch_extent_crc32 {
            __u32                   type:1,
                                    offset:7,
                                    compressed_size:8,
                                    uncompressed_size:8,
                                    csum_type:4,
                                    compression_type:4;
            __u32                   csum;
    };

Now, when trimming an extent, instead of modifying the pointer offset to point
to the start of the new live region (as we still do for non
checksummed/compressed extents) we add to the offset field in the extent_crc
struct - representing the offset into the extent for the currently live portion.
The compressed_size and uncompressed_size fields store the original sizes of the
extent, while the size field in the key continues to represent the size of the
live portion of the extent.
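
A worked example may help: say a 32 sector extent is written and the first 12
sectors are later overwritten (no compression, sizes in sectors; illustrative
values, not dumped from real metadata):

    /* original extent, covering sectors 0..31 of the file: */
    key.p.offset          = 32    /* end position */
    key.size              = 32
    crc.offset            = 0
    crc.compressed_size   = 32
    crc.uncompressed_size = 32

    /* after the first 12 sectors are overwritten: */
    key.p.offset          = 32    /* end position unchanged */
    key.size              = 20    /* sectors 12..31 of the file still live */
    crc.offset            = 12    /* live data starts 12 sectors into the
                                     original (still on disk) extent */
    crc.compressed_size   = 32    /* original sizes, needed to read and
                                     checksum the whole extent */
    crc.uncompressed_size = 32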

Another complicating factor is data migration, from copying garbage collection
and tiering - in general, when they need to move some data we can't expect them
to rewrite every replica of an extent at once (since copygc is needed for making
forward progress, that would lead to nasty deadlocks).

But if we're moving one of the replicas of an extent (or making another copy, as
in the case of promoting to a faster tier) we don't want to have to copy the
dead portions too - we'd never be able to get rid of the dead portions! Thus,
we need to handle extents where different pointers are in different formats.

Having one bch_extent_crc32 field per pointer (or bch_extent_crc64, for the case
where the user is using 64 bit checksums) would make our metadata excessively
large, unfortunately - if they're doing two or three way replication that would
roughly double the size of the extents btree, or triple it for 64 bit checksums.

So the new format struct bch_extent is a list of mixed pointers and extent_crc
fields: each extent_crc field corresponds to the pointers that come after it,
until the next extent_crc field. See include/uapi/linux/bcache.h for the gory
details, and extents.h for the various functions and macros for iterating over
and working with extent pointers and crc fields.


#### RAID5/6/Erasure coding:

This part hasn't been implemented yet - but here's the rough design for what
will hopefully be implemented in the near future:

Background: In conventional RAID, people have long noted the "RAID hole" and
wanted to solve it by moving RAID into the filesystem - this is what ZFS does.

The hole is that it's impossible for block layer raid to atomically write a
stripe: while writing a stripe, there's inevitably going to be a window where
the P and Q blocks are inconsistent and recovery is impossible. The best that
they can do is order the writes so that the data blocks are written first, then
the P/Q blocks after, and then on unclean shutdown rebuild all the P/Q blocks
(with perhaps a bitmap to avoid having to resync the whole device, as md does).

The approach that ZFS takes is to fragment incoming writes into (variable sized)
stripes, so that it can write out the entire stripe all at once and never have a
pointer in its index pointing to an incomplete/inconsistent stripe.

This works, but fragmenting IO is painful for performance. Not only are writes
fragmented, but reads of the same data are fragmented too; and since the latency
of a read has to be the latency of its slowest fragment, this drives your median
latency towards your tail latency. Worse, on disks you're doing a lot more seeks
than before; and even on flash it's not ideal, because the fragmenting happens
above all the scheduling - both on the host and in the device - so you're still
subject to the tail latency effect.

What we want is to be able to erasure encode unrelated data together - while
avoiding update in place. This is problematic to do for foreground writes, but
if we restrict ourselves to erasure encoding data in the background - perhaps
even as it's being copied from the fast tier (flash) to the slow tier (disk) -
the problem is much easier.

The idea is to still replicate foreground writes, but keep track of the buckets
that contain replicated data. Then, a background job can take e.g. 5 buckets
that contain unrelated data, allocate two new buckets, and then compute the p/q
for the original five buckets and write them to the two new buckets. Then, for
every extent that points into those first five buckets (either with an index
scan, or by remembering keys from recent foreground writes), it'll update each
extent, dropping the extra replicas and adding pointers to the p/q buckets -
with a bit set in each pointer telling the read path that using them requires a
reconstruct read (which will entail looking up the stripe mapping in another
btree).

This does mean that we can't reuse any of the buckets in the stripe until each
of them are empty - but this shouldn't be much trouble, we just have to teach
copygc about it when it's considering which buckets to evacuate.

#### Copying garbage collection:

#### Tiering:

### Allocation:

### Multiple devices:

## Btree internals:

At a high level, bcache's btree is a copy on write b+ tree. The main difference
between bcache's b+ tree and others is that the nodes are very large (256k is
typical) and log structured. Like other COW b+ trees, updating a node may
require recursively rewriting every node up to the root; however, most updates
(to both leaf nodes and interior nodes) can be done with only an append, until
we've written to the full amount of space we originally reserved for the node.

A single btree node log entry is represented as a header and a list of bkeys,
where the bkeys are all contiguous in memory and in sorted order:

        struct bset {
                /* some fields omitted */
                ...
                u16                     u64s;
                struct bkey_packed      start[0];
        };

Since bkeys are variable length, it's not possible to access keys randomly
without other data structures - we can only iterate sequentially via bkey_next().
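
Iterating over a single bset therefore looks roughly like this, where `i` points
to the `struct bset` (a sketch - the real code has helpers for computing the end
of the bset):

    struct bkey_packed *k   = i->start;
    struct bkey_packed *end = (void *) ((u64 *) i->start + i->u64s);

    for (; k != end; k = bkey_next(k)) {
            /* keys are visited in sorted order within the bset */
    }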

A btree node thus contains multiple independent bsets that on lookup all must be
searched and iterated over. At any given time there will be zero or more bsets
that have been written out, and a single dirty bset that new keys are being
inserted into.

As the btree is modified we must maintain the invariant in memory that there are
no duplicates (keys that compare as equal), excluding keys marked as deleted.
When an insertion overwrites an existing key, we will mark the existing key as
deleted (by setting k->type = KEY_TYPE_DELETED) - but until the entire node is
rewritten the old key will still exist on disk. To handle this, when a btree
node is read, the first thing we do is a mergesort of all the bsets it contains,
and as part of the mergesort duplicate keys are found and the older bkeys are
dropped - carefully matching the same changes we made in memory when doing the
insertions.

This hopefully explains how the lack of deletion as a primitive is a result of
the way the btree is implemented - it's not possible to delete a key except by
inserting a whiteout, which will be dropped when the btree node eventually fills
up and is rewritten.

Once a bset has been written out it may also be sorted, in memory, with other
bsets that have also been written out - we do so periodically so that a given
btree node will have only a few (at most three) bsets in memory: the one being
inserted into will be at most 8 or 16k, and the rest roughly form a geometric
progression in size, so that sorting the entire node is relatively infrequent.

This resorting/compacting in memory is one of the main ways bcache is able to
efficiently use such large btree nodes. The other main trick is to take
advantage of the fact that bsets that have been written out are, aside from
resorts, constant; we precompute lookup tables that would be too inefficient to
use if they had to be modified for insertions. The bcache code refers to these
lookup tables as the auxiliary search trees.

### Locking

Bcache doesn't use read/write locks for btree nodes - the locks it uses have
three states: shared, intent and exclusive (SIX locks). Shared and exclusive
correspond to read and write, while intent is sort of in the middle - intent
locks conflict with other intent locks (like write locks), but they don't
conflict with read locks.
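
To summarize which states conflict ("yes" means the second lock has to wait):

            held:           shared      intent      exclusive
        want shared         no          no          yes
        want intent         no          yes         yes
        want exclusive      yes         yes         yes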

The problem intent locks solve is that with a regular read/write lock, a read
lock can't be upgraded to a write lock - that would lead to deadlock when
multiple threads with read locks tried to upgrade. With a complicated enough
data structure, updates will need to hold write locks for exclusion with other
updates for much longer than the part where they do the actual modification that
needs exclusion from readers.

For example, consider the case of a btree split. The update starts at a leaf
node, and discovers it has to do a split. But before starting the split it has
to acquire a write lock on the parent node, primarily to avoid a deadlock with
other splits: it needs at least a read lock on the parent (roughly in order to
lock the path to the child node), but it couldn't then upgrade that read lock to
a write lock in order to update the parent with the pointers to the new children
because that would deadlock with threads splitting sibling leaf nodes.

Intent locks solve this problem. When doing a split it suffices to acquire an
intent lock on the parent - write locks are only ever held modifying the in
memory btree contents (which is a much shorter duration than the entire split,
which requires waiting for the new nodes to be written to disk).

Intent locks with only three states do introduce another deadlock, though:

        Thread A                        Thread B
        read            | Parent |      intent
        intent          | Child  |      intent

Thread B is splitting the child node: it's allocated new nodes and written them
out, and now needs to take a write lock on the parent in order to add the
pointers to the new nodes (after which it will free the old child).

Thread A just wants to insert into the child node - it has a read lock on the
parent node and it's looked up the child node, and now it's waiting (on thread
B) to get an intent lock on the child.

But thread A has blocked thread B from taking its write lock in order to update
the parent node, and thread B can't drop its intent lock on the child until
after the new nodes are visible and it has freed the child node.

The way this deadlock is handled is by enforcing (in bch_btree_node_get()) that
we drop read locks on parent nodes _before_ taking intent locks on child nodes -
this means we might race and have the btree node freed out from under us before
we lock it, so we check for that after grabbing the intent lock and redo the
traversal if necessary.

One other thing worth mentioning is bcache's btree node locks have embedded
sequence numbers, which are incremented when taking and releasing write locks
(much like seqlocks). This allows us to aggressively drop locks (because we'll
usually be able to retake the lock), and we also use it for a try_upgrade() - if
we discover we need an intent lock (e.g. for a split, or because the caller is
inserting into a leaf node they didn't get an intent lock for) we'll usually be
able to get it without having to unwind and redo the traversal.

### Journaling

Bcache's journal is foremost an optimization for the btree. COW btrees do not
require journals for on disk consistency - but having a journal allows us to
coalesce random updates across multiple btree nodes and flush the nodes
asynchronously.

The journal is a purely logical log, a list of insertions - bkeys - to reinsert
on recovery in the same order they're present in the journal. Provided every
index update is journaled (a critical invariant here), redoing those insertions
is an idempotent operation. See bch_journal_replay_key() in journal.c - the
journal uses the same insert path as everything else and doesn't know anything
about the structure of the btree.

It's critical that keys appear in the journal in the same order as the
insertions happened, so both are done under the btree node's write lock: see
bch_btree_insert_and_journal() in btree.c, which calls bch_bset_insert() at the
top (inserting into the btree node in memory) and bch_journal_add_keys() to
journal the same key at the bottom.

At this point in the btree insertion path, we've already marked any key(s) that
we overlapped with (possibly more than one for extents) as deleted - i.e. the
btree node is inconsistent so the insertion must happen before dropping our
write lock.

So we also have some machinery for journal reservations: we reserve some amount
of space in the journal (in units of u64s, as always) midway through the
insertion path (in bch_btree_insert_keys()). The journal reservation machinery
(as well as adding the keys to the journal) is entirely lockless in the
fastpath.

### Sequential consistency

As mentioned in the first section on indexing, bcache's b+ tree provides the
guarantee that ordering of updates is always preserved - whether to different
nodes or different btrees (i.e. dirents vs. inodes).

The journal helps here as well. Ordering is always preserved in the journal
itself: if a key at time t made it to the journal and was present after unclean
shutdown/recovery, all the keys journalled prior to time t will either be in the
journal, or in btree nodes that were flushed to disk.

The btree by itself does not provide ordering though - if updates happened to
two different leaf nodes, the leaf nodes could have been flushed in any order -
and importantly, either of them could have been written before the last journal
entry that contained keys for that btree node write.

That is, while typically the journal write will happen before the btree node is
flushed - we don't prevent the btree node from being flushed right away, and we
certainly don't want to: since flushing btree nodes is required both to reclaim
memory and to reclaim space in the journal, just the mere thought of the
potential deadlocks is rather horrifying.

Instead, we have a rather neat trick: in struct bset (the common header for a
single btree node write/log entry) we track the most recent sequence number of
all the journal entries the keys in this bset went into.

Then on recovery, when we're first walking the btree, if we find a bset with a
higher journal sequence number than the most recent journal entry we actually
found, we simply ignore it.
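
In pseudocode, the check during the initial btree walk is roughly as follows
(field names illustrative):

    /*
     * newest_seq: sequence number of the newest journal entry actually
     * found on disk during journal read
     */
    if (bset->journal_seq > newest_seq) {
            /*
             * This write raced ahead of the journal - skip it; everything
             * live in it is covered by the journal entries we do have:
             */
            continue;
    }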

The bset we ignore will likely also contain keys in older journal entries -
however, those journal entries will all be in the set we are replaying because
they were considered dirty until after the bset was written, and they were
marked as dirty on disk until a journal entry was written after the bset's write
completed - which didn't happen. Thus, ignoring those bsets cannot cause us to
lose anything that won't be replayed.

There is a decent amount of extra machinery required to make this scheme work -
when we find bsets newer than the newest journal entry we have to blacklist the
journal sequence number they referred to - and we have to mark it as
blacklisted, so that on the next recovery we don't think there's a journal entry
missing - but that is the central idea.

## Auxiliary search trees

The code for doing lookups, insertions, etc. within a btree node is relatively
separated from the btree code itself, in bset.c - there's a struct
btree_node_iter separate from struct btree_iter (the btree iterator contains one
btree node iterator per level of the btree).

The bulk of the machinery is the auxiliary search trees - the data structures
for efficiently searching within a bset.

There are two different data structures and lookup paths. For the bset that's
currently being inserted into, we maintain a simple table in an array, with one
entry per cacheline of data in the original bset, that tracks the offset of the
first key in that cacheline. This is enough to do a binary search (and then a
linear search when we're down to a single cacheline), and it's much cheaper to
keep up to date.

For the const bsets, we construct a binary search tree in an array (same layout
as is used for heaps) where each node corresponds to one cacheline of data in
the original bset, and the first key within that cacheline (note that the
auxiliary search tree is not full, i.e. not of size (2^n) - 1). Walking down the
auxiliary search tree thus corresponds roughly to doing a binary search on the
original bset - but it has the advantage of much friendlier memory access
patterns, since at every iteration the children of the current node are adjacent
in memory (and all the grandchildren, and all the great grandchildren) - meaning
unlike with a binary search it's possible to prefetch.
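
The descent itself is just array index arithmetic, which is what makes
prefetching the descendants practical (a sketch with illustrative names - the
real lookup compares against the compressed keys described below):

    unsigned j = 1;                 /* the root lives at index 1 */

    while (j < t->size) {
            /* children of node j are at 2j and 2j + 1, adjacent in memory: */
            prefetch(&t->nodes[2 * j]);

            /* go to the left child (+ 0) or the right child (+ 1): */
            j = 2 * j + search_key_cmp(search, &t->nodes[j]);
    }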

Then there are a couple tricks we use to make these nodes as small as possible:

 - Because each node in the auxiliary search tree corresponds to precisely one
   cacheline, we don't have to store a full pointer to the original key - if,
   given a node's position in the array/tree, we can compute its index in an
   inorder traversal, we only have to store the key's offset within that
   cacheline.

   This is done by to_inorder() in bset.c, and it's mostly just shifts and bit
   operations.

 - Observe that as we're doing the lookup and walking down the tree, we have
   constrained the keys we're going to compare against to lie within a certain
   range [l, r).

   Then l and r will be equal in some number of their high bits (possibly 0);
   the keys we'll be comparing against and our search key will all be equal in
   the same bits - meaning we don't have to compare against, or store, any bits
   after that position.

   We also don't have to store all the low bits, either - we need to store
   enough bits to correctly pivot on the key the current node points to (call it
   m); i.e. we need to store enough bits to tell m apart from the key
   immediately prior to m (call it p). We're not looking for strict equality
   comparisons here - we're going to follow this up with a linear search anyway.

   So the node in the auxiliary search tree (roughly) needs to store the bits
   from where l and r first differed to where m and p first differed - and
   usually that's not going to be very many bits. The full struct bkey has 160
   bit keys, but 16 bit keys in the auxiliary search tree will suffice > 99% of
   the time.

   Lastly, since we'd really like these nodes to be fixed size - we just pick a
   size and then when we're constructing the auxiliary search tree check if we
   weren't able to construct a node, and flag it; the lookup code will fall back
   to comparing against the original key. Provided this happens rarely enough,
   the performance impact will be negligible.

   The auxiliary search trees were an enormous improvement to bcache's
   performance when they were introduced - before they were introduced the
   lookup code was a simple binary search (eons ago when keys were still fixed
   size). On random lookups with a large btree the auxiliary search trees are
   easily over an order of magnitude faster.

## Bkey packing

Bkeys have gotten rather large, with 64 bit inodes, 64 bit offsets, a snapshot
field that isn't being used yet, a 32 bit size field that's only used by
extents... That's a lot of wasted bits.

The packing mechanism was added to address this, and as an added bonus it should
allow us to change or add fields to struct bkey (if we ever want to) without
incompatible on disk format changes (forwards and backwards!).

The main constraint is lookup performance - any packing scheme that required an
unpack for comparisons would probably be right out. So if we want to be able to
compare packed keys, the requirement is that it be order preserving. That is, we
need to define some function pack(), such that

        bkey_cmp(l, r) == bkey_cmp(pack(l), pack(r))

The way it works is for every btree node we define a packed format (and we
recalculate a new packed format when splitting/rewriting nodes); then keys that
are packed in the same format may be compared without unpacking them.

The packed format itself consists of, for each field in struct bkey

 - the size of that field in the packed key, in bits
 - an offset, to be subtracted before packing and when unpacking

Then, provided keys fit into the packed format (their fields are not less than
the corresponding field offsets, and after subtracting the offsets they fit into
the number of bits in the packed format), packing is order preserving. Keys that
don't fit are simply left unpacked - i.e. packing is allowed to fail.
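
A sketch of what packing a single field involves under this scheme (an
illustrative helper, not the real interface):

    /*
     * Returns false if the value can't be represented in the packed
     * format, in which case the whole key is left unpacked:
     */
    static bool pack_field(u64 *out, u64 in, u64 field_offset, unsigned bits)
    {
            if (in < field_offset)
                    return false;

            in -= field_offset;

            if (bits < 64 && in >= 1ULL << bits)
                    return false;

            *out = in;
            return true;
    }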

Then in the lookup path, we pack the search key into the btree node's packed
format - even better, this helps the auxiliary search tree code since it no
longer has to deal with long strings of zeroes between the inode and the offset.

For extents, keys in the packed format are usually only 8 bytes - vs. 32 bytes
for struct bkey. And with no replication, checksumming or compression, the value
is only 8 bytes - so packing can reduce our metadata size by more than half.

The other interesting thing about this mechanism is that it makes our key format
self describing - we aren't restricted to just defining a format for packed
keys, we can define one for our in memory representation, struct bkey, as well
(and we do).

Then as long as we have the format corresponding to a key, we can use it - we
should just be careful not to drop fields it has that we don't understand (by
converting it to our current in memory representation and back).

All that's left to do is to store formats (besides the btree node-local format)
in the superblock. Right now the btree node local format is defined to be format
0 and the in memory format - struct bkey itself - is format 1,
`BKEY_FORMAT_CURRENT`. Changing struct bkey would be as simple as defining a new
format, and making sure the new format gets added to existing superblocks when
it's used.