[[!meta stylesheet=BcacheGuide rel="stylesheet" title="doc"]]

# The Programmer's Guide to bcache:

This document is intended to cover the design and core concepts of bcache, and
serve as a guide to programmers trying to become familiar with the bcache
codebase. It'll try to cover how various features are built on top of the core
concepts and how everything fits together.

[[!toc levels=3 startlevel=2]]

## Introduction

The core abstraction in bcache is a key value store ("the btree"). To the btree
code the keys it indexes are just large, fixed sized bitstrings/integers, with
opaque variable sized values.

Aside from some low level machinery, essentially all metadata is stored as
key/value pairs:

 - Extents are one type of key where the value is a list of pointers to the
   actual data.

 - Inodes are stored indexed by inode number, where the value is the inode
   struct.

 - Dirents and xattrs are stored with one key per dirent/xattr.

The bulk of the complexity in bcache is in the implementation of the btree;
posix filesystem semantics are implemented directly on top of it, with the btree
doing all the heavy lifting with respect to e.g. locking and on disk
consistency.

In general, bcache eschews heavyweight transactions in favor of ordering
filesystem updates. In this, bcachefs's implementation is in spirit not unlike
softupdates - however, since bcachefs is working with logical index updates and
not physical disk writes, it works out to be drastically simpler. The btree
guarantees that the ordering of all index updates is preserved, without ever
needing flushes - see the section on [[sequential consistency|BcacheGuide/#index10h3]].

This approach is used for creating/deleting files and links, and for
logical atomicity of appends/truncates. For those unfamiliar with softupdates,
this means that when creating files, we create the inode and then create the
dirent, and when creating a hardlink we increment i_nlinks and then create the
new dirent - and the order is reversed on deletion. On unclean shutdown we might
leak inode references - but this is easily handled with a garbage collection
pass.

We do have some support for transactional updates, however - primarily for
rename().

### Keys (bkey and bpos) 

We have separate structs for a search key (`struct bpos`), and the outer
container that holds a key and a value (`struct bkey`).

Here are their definitions:

        struct bpos {
                u64         inode;
                u64         offset;
                u32         snapshot;
        };
        
        struct bkey {
                u8          u64s;
                u8          format;
                u8          type;
        
                u8          pad;
                u32         version;
                u32         size;
                struct bpos p;
        };

The high bits of the search key are the inode field and the low bits are the
snapshot field (note that on little endian the order in the struct will actually
be reversed).

Not all code uses all fields of the search key (and snapshot is currently
unused), and some code uses the fields in different ways; it's up to the user of
the btree code to use the fields however they see fit. However, the inode field
generally corresponds to an inode number (of a file in a filesystem), and for
extents the offset field corresponds to the offset within that file.

Some notes on the bkey fields:

 - The u64s field is the size of the combined key and value, in units of u64s.
   It's not generally used directly: use `bkey_val_u64s()` or `bkey_val_bytes()`
   for a higher level interface.

 - The format field is internal to the btree implementation. It's for packed
   bkeys - bkeys are not usually stored in the btree in their in memory format
   (struct bkey). Only very low level btree code needs to be concerned with
   this, however.

 - The type field denotes the type of the value. For example, the directory
   entry code defines two different value types, `bch_dirent` (type
   `BCH_DIRENT`, 128) and `bch_dirent_whiteout` (`BCH_DIRENT_WHITEOUT`, 129). 

 - The size field is only used by extents (it's 0 for all other keys).

### Values:

Values are stored inline with bkeys. Each value type (for all the types that
have nonempty values) will have a struct definition, and then macros generate a
variety of accessor functions and wrapper types.

Also, while values are _stored_ inline with the keys, due to packing they aren't
typically passed around that way; when accessing a key stored in the btree the
btree code will unpack the key and return a wrapper type that contains pointers
to the unpacked key and the value.

The scheme for the wrapper types is:

        bkey_i          key with inline value
        bkey_s          key with split value (wrapper with pointers to key and value)
        bkey_s_c        const key with split value (read only)

The code has the wrapper types above for generic keys with values, and value
types have the corresponding wrapper types generated with pointers to the value
of the correct type.

For example, here's the struct definition for extended attributes:

        struct bch_xattr {
                struct bch_val      v;
                u8                  x_type;
                u8                  x_name_len;
                u16                 x_val_len;
                u8                  x_name[];
        };

The wrapper types will be `bkey_i_xattr`, `bkey_s_xattr`, and `bkey_s_c_xattr`.
When the xattr code is doing a lookup in the btree, the btree iterator will
return (via e.g. `btree_iter_peek()`) a key of type `bkey_s_c`, and then the
xattr code will use `bkey_s_c_to_xattr(k)` (after switching on the type field)
to get to the xattr itself.

Code should always `switch()` off of or check the bkey type field before
converting to their particular type; the accessors all assert that the type
field is the expected one, so you'll get a `BUG_ON()` if you don't.
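
For illustration, here's roughly what that pattern looks like for xattrs - a
minimal sketch, assuming a value type constant named `BCH_XATTR` (the real name
may differ; only `bkey_s_c_to_xattr()` and the type-switch idiom are taken from
the text above):

        struct bkey_s_c k = bch_btree_iter_peek(&iter);

        switch (k.k->type) {
        case BCH_XATTR: {
                /* safe: we've checked the type, so the accessor won't BUG() */
                struct bkey_s_c_xattr xattr = bkey_s_c_to_xattr(k);

                pr_info("xattr with name length %u\n",
                        (unsigned) xattr.v->x_name_len);
                break;
        }
        case KEY_TYPE_DELETED:
                /* nothing stored at this position */
                break;
        default:
                /* unknown or unhandled value type */
                break;
        }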

### Btree interface:

Keys are indexed in a small number of btrees: one btree for extents, another for
inodes, another for dirents, etc.

We can add new btree types for new features without on disk format changes -
many planned features will be added this way (e.g. for Reed-Solomon/RAID6, we'll
be adding another btree to index stripes by physical device and LBA. More on
this later).

Some important properties/invariants:

 - There are no duplicate keys; insertions implicitly overwrite existing keys.

   Related to this, deletion is not exposed as a primitive: instead, deletion is
   done by inserting a key of type `KEY_TYPE_DELETED`. (The btree code
   internally handles overwrites by setting the old key to `KEY_TYPE_DELETED`,
   which will then be deleted when the relevant btree node is compacted. Old
   deleted keys are never visible outside the btree code).

 - Ordering of insertions/updates is _always_ preserved, across unclean
   shutdowns and without any need for flushes.

   This is important for the filesystem code. Creating a file or a new hardlink
   requires two operations:

     - create new inode/increment inode refcount
     - create new dirent

   By doing these operations in the correct order we can guarantee that after an
   unclean shutdown we'll never have dirents pointing to nonexistent inodes - we
   might leak inode references, but it's straightforward to garbage collect those
   at runtime.

Lookups/insertions are done via the btree iterator, defined in btree.h:

        struct btree_iter;

        void bch_btree_iter_init(struct btree_iter *,
                                 struct cache_set *,
                                 enum btree_id,
                                 struct bpos);
        int bch_btree_iter_unlock(struct btree_iter *);

        struct bkey_s_c bch_btree_iter_peek(struct btree_iter *);
        void bch_btree_iter_advance_pos(struct btree_iter *);

 - A "`cache_set`" corresponds to a single filesystem/volume - the name coming
   from bcache's starting point as a block cache, and turned into a `cache_set`
   instead of just a cache when it gained the ability to manage multiple
   devices.

 - The `btree_id` is one of `BTREE_ID_EXTENTS`, `BTREE_ID_INODES`, etc.

 - The bpos argument is the position to start iterating from.

`bch_btree_iter_unlock()` unlocks any btree nodes the iterator has locked; if
there was an error reading in a btree node it'll be returned here.

`bch_btree_iter_peek()` returns the next key after the iterator's current
position.

`bch_btree_iter_advance_pos()` advances the iterator's position to immediately
after the last key returned.

Lookup isn't provided as a primitive because most usage is geared more towards
iterating from a given position, but we now have enough to implement it:

        int lookup(struct cache_set *c, struct bpos search)
        {
                struct btree_iter iter;
                struct bkey_s_c k;
                int ret = 0;

                bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, search);

                k = bch_btree_iter_peek(&iter);
                if (!k.k || bkey_cmp(k.k->p, search))
                        ret = -ENOENT;

                bch_btree_iter_unlock(&iter);
                return ret;
        }

If we want to iterate over every key in the btree (or just from a given point),
there's the convenience macro `for_each_btree_key()`:

        struct btree_iter iter;
        struct bkey_s_c k;

        for_each_btree_key(&iter, c, BTREE_ID_EXAMPLE, POS_MIN, k)
                printk("got %llu:%llu\n", k.k->p.inode, k.k->p.offset);

        bch_btree_iter_unlock(&iter);

#### Updates:

Insertion is most often done with `bch_btree_insert_at()`, which takes an
iterator and inserts at the iterator's current position. This is often used in
conjunction with `bch_btree_iter_peek_with_holes()`, which returns a key
representing every valid position (synthesizing one of `KEY_TYPE_DELETED` if
nothing was found).

This is highly useful for the inode and dirent code. For example, to create a
new inode, the inode code can search for an empty position and then use
`bch_btree_insert_at()` when it finds one, and the btree node's lock will guard
against races with other inode creations.

Note that it might not be possible to do the insert without dropping locks, e.g.
if a split was required; the `BTREE_INSERT_ATOMIC` flag indicates that the
insert shouldn't be done in this case and `bch_btree_insert_at()` will return
`-EINTR` instead, and the caller will loop and retry the operation.
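
Here's a sketch of that retry pattern, combined with searching for an empty
slot as described above. This is illustrative only: the exact signature of
`bch_btree_insert_at()` and the layout of `struct bkey_i` are simplified, and
`create_at_empty_slot()`/`BTREE_ID_EXAMPLE` are made-up names:

        int create_at_empty_slot(struct cache_set *c, struct bkey_i *new,
                                 struct bpos start)
        {
                struct btree_iter iter;
                struct bkey_s_c k;
                int ret;

                bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, start);

                while (1) {
                        /* synthesizes a KEY_TYPE_DELETED key for holes: */
                        k = bch_btree_iter_peek_with_holes(&iter);
                        if (!k.k) {
                                ret = -ENOSPC;
                                break;
                        }

                        if (k.k->type != KEY_TYPE_DELETED) {
                                /* slot occupied - try the next position */
                                bch_btree_iter_advance_pos(&iter);
                                continue;
                        }

                        new->k.p = k.k->p;

                        /* simplified call - the real function takes more arguments */
                        ret = bch_btree_insert_at(&iter, new, BTREE_INSERT_ATOMIC);
                        if (ret != -EINTR)
                                break;
                        /* -EINTR: locks were dropped (e.g. a split was needed) - retry */
                }

                bch_btree_iter_unlock(&iter);
                return ret;
        }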

The extents code also uses a similar mechanism to implement a cmpxchg like
operation which is used by things like copygc that move data around in the
background - the index update will only succeed if the original key was present,
which guards against races with foreground writes.

#### Linked iterators and multiple atomic updates:

These mechanisms are quite new, and the interfaces will likely change and become
more refined in the future:

Two or more iterators (up to a small fixed number) can be linked with
`bch_btree_iter_link()` - this causes them to share locks, so that taking locks
on the same node that would ordinarily conflict (e.g. holding a read lock on a
leaf node with one iterator while updating the same node via another iterator)
won't deadlock.

This makes it possible to get a consistent view of multiple positions in the
btree (or different btrees) at the same time.

Furthermore, updates via one iterator will not invalidate linked iterators - but
be careful, a /pointer/ returned via e.g. `bch_btree_iter_peek()` will still be
invalidated by an update (via `bch_btree_insert_at()`). But the other iterator
will be kept internally consistent, and it won't have to drop locks or redo any
traversals to continue iterating.

Then, updates to multiple (linked) iterators may be performed with
`bch_btree_insert_at_multi()`: it takes a list of iterators, and keys to insert
at each iterator's position. Either all of the updates will be performed without
dropping any locks, or else none of them (the `BTREE_INSERT_ATOMIC` flag is
implicit here, as it makes no sense to use `bch_btree_insert_at_multi()` when
it's not required). Additionally, the updates will all use the same journal
reservation, so the update will be atomic on disk as well.

There are some locking considerations that are not yet hidden/abstracted away:

 - It is not permissible to hold a read lock on any btree node while taking an
   intent lock on another. This is because of a deadlock it would cause (an even
   more obscure version of the one documented in the section on
   [[locking|BcacheGuide/#index8h3]], which I will hopefully document in the
   future).

   This is easy to work around - just traverse the iterator taking intent locks
   first (with peek(), or `bch_btree_iter_traverse()` directly if more
   convenient), and if looping unlock the iterator holding read locks before
   re-traversing the intent lock iterator. Since the iterator code will first
   attempt to relock nodes (checking the remembered sequence number against the
   lock's current sequence number), there's no real cost to unlocking the read
   iterator.

 - If linked iterators are only being used to take read locks there are no
   ordering requirements, since read locks do not conflict with read locks -
   however, when taking intent locks the iterator pointing to the lower position
   must be traversed first.

   I haven't yet come up with a satisfactory way of having the iterator code
   handle this - currently it's up to the code using linked iterators. See
   `bch_dirent_rename()` for an example.

## Extents, pointers, and data:

Bcache is extent based, not block based; its extents are much like extents in
other filesystems that have them - a variable sized chunk of data. From the
point of view of the index, they aren't positions, they're ranges or half open
intervals (note that a 0 size extent doesn't overlap with anything).

Bcache's extents are indexed by inode:offset, and their size is stored in the
size field in struct bkey. The offset and size are both in 512 byte sectors (as
are the pointer offsets). The offset field denotes the _end_ position of the
extent within the file - a key with offset 8 and size 8 points to the data for
sectors 0 through 7.

(This oddity was a result of bcache's btree being designed for extents first,
and non extent keys coming later - it makes searching for extents within a
certain range cleaner when iterating in ascending order).

Inside the value is a list of one or more pointers - if there's more than one
pointer, they point to replicated data (or possibly one of the copies is on a
faster device and is considered cached).

Here's a simplified version (in the latest version the extent format has gotten
considerably more complicated in order to incorporate checksums and compression
information, but the concept is still the same):

        struct bch_extent_ptr {
                __u64                   offset:48,
                                        dev:8,
                                        gen:8;
        };

        struct bch_extent {
                struct bch_val          v;

                struct bch_extent_ptr   ptr[0];
        };

 - The device field is the index of one of the devices in the cache set (i.e.
   volume/filesystem).

 - The offset field is the offset of the start of the pointed-to data, counting
   from the start of the device in units of 512 byte sectors.

 - "gen" is a generation number that must match the generation number of the
   bucket the pointer points into for the pointer to be consider valid (not
   stale).

### Buckets and generations:

This mechanism comes from bcache's origins as just a block cache.

The devices bcache manages are divided up into fixed sized buckets (typically
anywhere from 128k to 2M). The core of the allocator works in terms of buckets
(there's a sector allocator on top, similar to how the slab allocator is built
on top of the page allocator in Linux).

When a bucket is allocated, it is written to once sequentially: then, it is
never written to again until the entire bucket is reused. When a bucket is
reused, its generation number is incremented (and the new generation number
persisted before discarding it or writing to it again). If the bucket still
contained any cached data, incrementing the generation number is the mechanism
that invalidates any still live pointers pointing into that bucket (in
programming language terminology, bcache's pointers are weak pointers).

This has a number of implications:

 - Invalidating clean cached data is very cheap - there's no cost to keeping a
   device full of clean cached data.

 - We don't persist fine grained allocation information: we only persist the
   current generation number of each bucket, and at runtime we maintain in
   memory counts of the number of live dirty and cached sectors in each bucket -
   these counts are updated on the fly as the index is updated and old extents
   are overwritten and new ones added.

   This is a performance tradeoff - it's a fairly significant performance win at
   runtime but it costs us at startup time. Eventually, we'll probably implement
   a reserve that we can allocate out of at startup so we can do the initial
   mark and sweep in the background.

   This does mean that at startup we have to walk the extents btree once to
   repopulate these counts before we can do any allocation.

 - If there's a lot of internal fragmentation, we do require copying garbage
   collection to compact data - rewriting it into new buckets.

 - Since the generation number is only 8 bits, we do have to guard against
   wraparound - this isn't really a performance issue since wraparound requires
   rewriting the same bucket many times (and incoming writes will be distributed
   across many buckets). We do occasionally have to walk the extents to update
   the "oldest known generation number" for every bucket, but the mark and sweep
   GC code runs concurrently with everything else, except for the allocator if
   the freelist becomes completely drained before it finishes (the part of GC
   that updates per bucket sector counts mostly isn't required anymore, it may
   be removed at some point).
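
To make the weak pointer semantics concrete, here's a minimal sketch of the
staleness check the read path performs. The `struct bucket`/`struct cache`
layouts and the helpers `sector_to_bucket()`/`ptr_stale()` are illustrative
names, not the exact ones in the code:

        struct bucket {
                u8      gen;    /* current generation; bumped when the bucket is reused */
                /* in memory dirty/cached sector counts omitted */
        };

        struct cache {
                struct bucket   *buckets;
                size_t          bucket_size;    /* in 512 byte sectors */
        };

        static size_t sector_to_bucket(const struct cache *ca, u64 sector)
        {
                return sector / ca->bucket_size;
        }

        /* a pointer is stale if its bucket has been reused since it was created */
        static bool ptr_stale(const struct cache *ca,
                              const struct bch_extent_ptr *ptr)
        {
                return ca->buckets[sector_to_bucket(ca, ptr->offset)].gen != ptr->gen;
        }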


### IO path:

XXX: describe what consumes bch_read() and bch_write()

The main entry points to the IO code are

 - `bch_read()`

 - `bch_read_extent()`

 - `bch_write()`

The core read path starts in `bch_read()`, in io.c. The algorithm goes like:

 - Iterate over the extents btree (with `for_each_btree_key_with_holes()`),
   starting from the first sector in the request.

 - Check the type of the returned key:

   - Hole? (`KEY_TYPE_DELETED`) - just return zeroes

   - Extent? Check for a pointer we can read from.

      - If they're all stale, it was a cached extent (i.e. we were caching
        another block device), handle it like a hole.

      - If the relevant device is missing/not online, return an error.

      - Ok, we have a pointer we can read from. If the extent is smaller than
        the request, split the request (with `bio_split()`), and issue the
        request to the appropriate device:sector.

      - Iterate until the entire request has been handled.

The write path is harder to follow because it's highly asynchronous, but the
basic algorithm is fairly simple there too:

 - Set up a key that'll represent the data we're going to write (not the
   pointers yet).

 - Allocate some space to write to (with `bch_alloc_sectors()`); add the
   pointer(s) to the space we allocated to the key we set up previously.

 - Were we able to allocate as much space as we wanted? If not, split both the
   key and the request.

 - Issue the write to the space we just allocated.

 - Loop until we've allocated all the space we want, building up a list of keys
   to insert.

 - Finally, after the data write(s) complete, insert the keys we created into
   the extents btree.

#### Checksumming/compression:

As mentioned previously, bch_extent got more complicated with data checksumming
and compression. 

For data checksumming, what we want to do is store the checksum with the key and
pointers, not the data - if you store the checksum with the data, verifying the
checksum only tells you that you got the checksum that goes with that data; it
doesn't tell you whether you got the data you actually wanted - perhaps the
write never happened and you're reading old data. There are a number of subtle
ways data can be corrupted that naively putting the checksum with the data won't
protect against, and there are workarounds for most of them, but it's
fundamentally not the ideal approach.

So we want to store the checksum with the extent, but now we've got a different
problem - partially overwritten extents. When an extent is partially
overwritten, we don't have a checksum for the portion that's now live and we
can't compute it without reading that data in, which would probably not be
ideal.

So we need to remember what the original extent was, so that later when that
data is read we can read in the entire original extent, checksum that, and then
return only the part that was live.

Compression leads to a similar issue: if the extent is compressed we can't read
just part of it and decompress it, we have to be able to read the entire
extent.

One simplifying factor is that since buckets are reused all at once, as long as
any part of the original extent is live the rest of it will still be there - we
don't have to complicate our accounting and have two notions of live data to
make sure the dead parts of the extent aren't overwritten.

So, for compressed or checksummed extents here's the additional information
we'll need to store in struct bch_extent:

        struct bch_extent_crc32 {
                __u32                   type:1,
                                        offset:7,
                                        compressed_size:8,
                                        uncompressed_size:8,
                                        csum_type:4,
                                        compression_type:4;
                __u32                   csum;
        };

Now, when trimming an extent, instead of modifying the pointer offset to point
to the start of the new live region (as we still do for non
checksummed/compressed extents) we add to the offset field in the extent_crc
struct - representing the offset into the extent for the currently live portion.
The compressed_size and uncompressed_size fields store the original sizes of the
extent, while the size field in the key continues to represent the size of the
live portion of the extent.
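
As a worked example (with made-up numbers): say we wrote a 128 sector
uncompressed extent to sectors 0-127 of a file, checksummed as a single unit,
and later the first 48 sectors were overwritten. The surviving key then
describes sectors 48-127, while the crc field still describes the original 128
sectors:

        struct bkey k = {
                .p.offset       = 128,  /* end position of the live region in the file */
                .size           = 80,   /* live sectors: 128 - 48                      */
        };

        struct bch_extent_crc32 crc = {
                .offset                 = 48,   /* start of live data within the extent */
                .compressed_size        = 128,  /* on disk size of the original extent  */
                .uncompressed_size      = 128,  /* original logical size                */
        };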

Another complicating factor is data migration, from copying garbage collection
and tiering - in general, when they need to move some data we can't expect them
to rewrite every replica of an extent at once (since copygc is needed for making
forward progress, that would lead to nasty deadlocks).

But if we're moving one of the replicas of an extent (or making another copy, as
in the case of promoting to a faster tier) we don't want to have to copy the
dead portions too - we'd never be able to get rid of the dead portions! Thus,
we need to handle extents where different pointers are in different formats.

Having one bch_extent_crc32 field per pointer (or bch_extent_crc64, for the case
where the user is using 64 bit checksums) would make our metadata excessively
large, unfortunately - if they're doing two or three way replication that would
roughly double the size of the extents btree, or triple it for 64 bit checksums.

So the new format struct bch_extent is a list of mixed pointers and extent_crc
fields: each extent_crc field corresponds to the pointers that come after it,
until the next extent_crc field. See include/uapi/linux/bcache.h for the gory
details, and extents.h for the various functions and macros for iterating over
and working with extent pointers and crc fields.


#### RAID5/6/Erasure coding:

This part hasn't been implemented yet - but here's the rough design for what
will hopefully be implemented in the near future:

Background: In conventional RAID, people have long noted the "RAID hole" and
wanted to solve it by moving RAID into the filesystem - this is what ZFS does.

The hole is that it's impossible for block layer raid to atomically write a
stripe: while writing a stripe, there's inevitably going to be a window where
the P and Q blocks are inconsistent and recovery is impossible. The best that
they can do is order the writes so that the data blocks are written first, then
the P/Q blocks after, and then on unclean shutdown rebuild all the P/Q blocks
(with perhaps a bitmap to avoid having to resync the whole device, as md does).

The approach that ZFS takes is to fragment incoming writes into (variable sized)
stripes, so that it can write out the entire stripe all at once and never have a
pointer in its index pointing to an incomplete/inconsistent stripe.

This works, but fragmenting IO is painful for performance. Not only are writes
fragmented, but reads of the same data are fragmented too; and since the latency
of a read is the latency of its slowest fragment, this drives your median
latency towards your tail latency. Worse, on disks you're spending a lot more
time seeking than before; even on flash it's not ideal, because the fragmenting
happens above all the scheduling - both on the host and on the device - so
you're still subject to the tail latency effect.

What we want is to be able to erasure encode unrelated data together - while
avoiding update in place. This is problematic to do for foreground writes, but
if we restrict ourselves to erasure encoding data in the background - perhaps
even as it's being copied from the fast tier (flash) to the slow tier (disk) -
the problem is much easier.

The idea is to still replicate foreground writes, but keep track of the buckets
that contain replicated data. Then, a background job can take e.g. 5 buckets
that contain unrelated data, allocate two new buckets, and then compute the p/q
for the original five buckets and write them to the two new buckets. Then, for
every extent that points into those first five buckets (either with an index
scan, or by remembering keys from recent foreground writes), it'll update each
extent dropping their extra replicas and adding pointers to the p/q buckets -
with a bit set in each pointer telling the read path that to use those pointers
it has to do a reconstruct read (which will entail looking up the stripe mapping
in another btree).

This does mean that we can't reuse any of the buckets in the stripe until each
of them are empty - but this shouldn't be much trouble, we just have to teach
copygc about it when it's considering which buckets to evacuate.

### Copying garbage collection:

### Tiering:

### Allocation:

### Multiple devices:

## Btree internals:

At a high level, bcache's btree is a copy on write b+ tree. The main difference
between bcache's b+ tree and others is the nodes are very large (256k is
typical) and log structured. Like other COW b+ trees, updating a node may
require recursively rewriting every node up to the root; however, most updates
(to both leaf nodes and interior nodes) can be done with only an append, until
we've written to the full amount of space we originally reserved for the node.

A single btree node log entry is represented as a header and a list of bkeys,
where the bkeys are all contiguous in memory and in sorted order:

        struct bset {
                /* some fields omitted */
                ...
                u16                     u64s;
                struct bkey_packed      start[0];
        };

Since bkeys are variable length, it's not possible to access keys randomly
without additional data structures - we can only iterate sequentially via
`bkey_next()`.
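
For illustration, walking all the keys in a bset looks roughly like this. The
definition of `bkey_next()` here is illustrative, assuming a packed bkey, like
struct bkey, starts with a `u64s` field giving its total length; `walk_bset()`
is a made-up name:

        /* illustrative: like struct bkey, a packed key starts with its length */
        struct bkey_packed {
                u8      u64s;           /* key + value size, in u64s */
                u8      format;
                u8      type;
        };

        #define bkey_next(k)                                            \
                ((struct bkey_packed *) ((u64 *) (k) + (k)->u64s))

        static void walk_bset(struct bset *i)
        {
                struct bkey_packed *k   = i->start;
                struct bkey_packed *end = (struct bkey_packed *)
                        ((u64 *) i->start + i->u64s);

                for (; k != end; k = bkey_next(k)) {
                        /* process k - keys are contiguous and in sorted order */
                }
        }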

A btree node thus contains multiple independent bsets that on lookup all must be
searched and iterated over. At any given time there will be zero or more bsets
that have been written out, and a single dirty bset that new keys are being
inserted into.

As the btree is modified we must maintain the invariant in memory that there are
no duplicates (keys that compare as equal), excluding keys marked as deleted.
When an insertion overwrites an existing key, we will mark the existing key as
deleted (by setting `k->type = KEY_TYPE_DELETED`) - but until the entire node is
rewritten the old key will still exist on disk. To handle this, when a btree
node is read, the first thing we do is a mergesort of all the bsets it contains,
and as part of the mergesort duplicate keys are found and the older bkeys are
dropped - carefully matching the same changes we made in memory when doing the
insertions.

This hopefully explains how the lack of deletion as a primitive is a result of
the way the btree is implemented - it's not possible to delete a key except by
inserting a whiteout, which will be dropped when the btree node eventually fills
up and is rewritten.

Once a bset has been written out it may also be sorted, in memory, with other
bsets that have also been written out - we do so periodically so that a given
btree node will have only a few (at most three) bsets in memory: the one being
inserted into will be at most 8 or 16k, and the rest roughly form a geometric
progression in size, so that sorting the entire node is relatively infrequent.

This resorting/compacting in memory is one of the main ways bcache is able to
efficiently use such large btree nodes. The other main trick is to take
advantage of the fact that bsets that have been written out are, aside from
resorts, constant; we precompute lookup tables that would be too inefficient to
use if they had to be modified for insertions. The bcache code refers to these
lookup tables as the auxiliary search trees.

### Locking

Bcache doesn't use read/write locks for btree nodes - the locks it uses have
three states: shared, intent and exclusive (SIX locks). Shared and exclusive
correspond to read and write, while intent is sort of in the middle - intent
locks conflict with other intent locks (like write locks), but they don't
conflict with read locks.

The problem intent locks solve is that with a regular read/write lock, a read
lock can't be upgraded to a write lock - that would lead to deadlock when
multiple threads with read locks tried to upgrade. With a complicated enough
data structure, updates will need to hold write locks for exclusion with other
updates for much longer than the part where they do the actual modification that
needs exclusion from readers.

For example, consider the case of a btree split. The update starts at a leaf
node, and discovers it has to do a split. But before starting the split it has
to acquire a write lock on the parent node, primarily to avoid a deadlock with
other splits: it needs at least a read lock on the parent (roughly in order to
lock the path to the child node), but it couldn't then upgrade that read lock to
a write lock in order to update the parent with the pointers to the new children
because that would deadlock with threads splitting sibling leaf nodes.

Intent locks solve this problem. When doing a split it suffices to acquire an
intent lock on the parent - write locks are only ever held modifying the in
memory btree contents (which is a much shorter duration than the entire split,
which requires waiting for the new nodes to be written to disk).

Intent locks with only three states do introduce another deadlock, though:

        Thread A                        Thread B
        read            | Parent |      intent
        intent          | Child  |      intent

Thread B is splitting the child node: it's allocated new nodes and written them
out, and now needs to take a write lock on the parent in order to add the
pointers to the new nodes (after which it will free the old child).

Thread A just wants to insert into the child node - it has a read lock on the
parent node and it's looked up the child node, and now it's waiting on thread
B to get an intent lock on the child.

But thread A has blocked thread B from taking its write lock in order to update
the parent node, and thread B can't drop its intent lock on the child until
after the new nodes are visible and it has freed the child node.

The way this deadlock is handled is by enforcing (in `bch_btree_node_get()`)
that we drop read locks on parent nodes _before_ taking intent locks on child
nodes - this might cause us to race and have the btree node freed out from under
us before we lock it, so we check for that after grabbing the intent lock and
redo the traversal if necessary.

One other thing worth mentioning is bcache's btree node locks have embedded
sequence numbers, which are incremented when taking and releasing write locks
(much like seqlocks). This allows us to aggressively drop locks (because we'll
usually be able to retake the lock), and we also use it for a `try_upgrade()` -
if we discover we need an intent lock (e.g. for a split, or because the caller
is inserting into a leaf node they didn't get an intent lock for) we'll usually
be able to get it without having to unwind and redo the traversal.
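
A minimal sketch of the idea (field and helper names here are illustrative, not
the real six lock API): remember the sequence number from when the node was
locked, and a later relock can succeed cheaply only if no write lock was taken
in between:

        struct six_lock_sketch {
                u32     seq;    /* incremented when write locks are taken/released */
                /* actual lock state omitted */
        };

        struct locked_node {
                struct six_lock_sketch  *lock;
                u32                     seq;    /* seq observed when we held the lock */
        };

        static bool node_try_relock(struct locked_node *n)
        {
                if (n->lock->seq != n->seq)
                        return false;   /* a writer intervened - redo the traversal */

                /* retake the lock here (omitted), then recheck seq under the lock */
                return true;
        }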

### Journaling

Bcache's journal is foremost an optimization for the btree. COW btrees do not
require journals for on disk consistency - but having a journal allows us to
coalesce random updates across multiple btree nodes and flush the nodes
asynchronously.

The journal is a purely logical log, a list of insertions - bkeys - to reinsert
on recovery in the same order they're present in the journal. Provided every
index update is journaled (a critical invariant here), redoing those insertions
is an idempotent operation. See `bch_journal_replay_key()` in journal.c - the
journal uses the same insert path as everything else and doesn't know anything
about the structure of the btree.

It's critical that keys appear in the journal in the same order as the
insertions happened, so both are done under the btree node's write lock: see
`bch_btree_insert_and_journal()` in btree.c, which calls `bch_bset_insert()` at
the top (inserting into the btree node in memory) and `bch_journal_add_keys()`
to journal the same key at the bottom.

At this point in the btree insertion path, we've already marked any key(s) that
we overlapped with (possibly more than one for extents) as deleted - i.e. the
btree node is inconsistent so the insertion must happen before dropping our
write lock.

So we also have some machinery for journal reservations: we reserve some amount
of space in the journal (in units of u64s, as always) midway through the
insertion path (in `bch_btree_insert_keys()`). The journal reservation machinery
(as well as adding the keys to the journal) is entirely lockless in the
fastpath.
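
The fastpath can be sketched as a single compare-and-swap on a count of the
space remaining in the current journal entry. This is a simplification under
made-up names - the real machinery also handles switching journal entries,
waiting for space, and tracking open references:

        struct journal_sketch {
                atomic_t        u64s_remaining; /* space left in the open entry */
        };

        /* returns 0 on success, -EAGAIN if the caller must take the slow path */
        static int journal_res_get_fast(struct journal_sketch *j, unsigned u64s)
        {
                int old, new;

                do {
                        old = atomic_read(&j->u64s_remaining);
                        if (old < (int) u64s)
                                return -EAGAIN;
                        new = old - u64s;
                } while (atomic_cmpxchg(&j->u64s_remaining, old, new) != old);

                return 0;
        }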

### Sequential consistency

As mentioned in the first section on indexing, bcache's b+ tree provides the
guarantee that ordering of updates is always preserved - whether to different
nodes or different btrees (i.e. dirents vs. inodes).

The journal helps here as well. Ordering is always preserved in the journal
itself: if a key at time t made it to the journal and was present after unclean
shutdown/recovery, all the keys journalled prior to time t will either be in the
journal, or in btree nodes that were flushed to disk.

The btree by itself does not provide ordering though - if updates happened to
two different leaf nodes, the leaf nodes could have been flushed in any order -
and importantly, either of them could have been written before the last journal
entry that contained keys for that btree node write.

That is, while typically the journal write will happen before the btree node is
flushed - we don't prevent the btree node from being flushed right away, and we
certainly don't want to: since flushing btree nodes is required both to reclaim
memory and to reclaim space in the journal, the mere thought of the potential
deadlocks is rather horrifying.

Instead, we have a rather neat trick: in struct bset (the common header for a
single btree node write/log entry) we track the most recent sequence number of
all the journal entries the keys in this bset went into.

Then, on recovery when we're first walking the btree if we find a bset with a
higher journal sequence number than the most recent journal entry we actually
found - we merely ignore it.

The bset we ignore will likely also contain keys in older journal entries -
however, those journal entries will all be in the set we are replaying because
they were considered dirty until after the bset was written, and they were
marked as dirty on disk until a journal entry was written after the bset's write
completed - which didn't happen. Thus, ignoring those bsets cannot cause us to
lose anything that won't be replayed.

There is a decent amount of extra machinery required to make this scheme work -
when we find bsets newer than the newest journal entry we have to blacklist the
journal sequence number they referred to - and we have to mark it as
blacklisted, so that on the next recovery we don't think there's a journal entry
missing - but that is the central idea.

## Auxiliary search trees

The code for doing lookups, insertions, etc. within a btree node is relatively
separated from the btree code itself, in bset.c - there's a struct
btree_node_iter separate from struct btree_iter (the btree iterator contains one
btree node iterator per level of the btree).

The bulk of the machinery is the auxiliary search trees - the data structures
for efficiently searching within a bset.

There are two different data structures and lookup paths. For the bset that's
currently being inserted into, we maintain a simple table in an array, with one
entry per cacheline of data in the original bset, that tracks the offset of the
first key in that cacheline. This is enough to do a binary search (and then a
linear search when we're down to a single cacheline), and it's much cheaper to
keep up to date.

For the const bsets, we construct a binary search tree in an array (same layout
as is used for heaps) where each node corresponds to one cacheline of data in
the original bset, and the first key within that cacheline (note that the
auxiliary search tree is not full, i.e. not of size (2^n) - 1). Walking down the
auxiliary search tree thus corresponds roughly to doing a binary search on the
original bset - but it has the advantage of much friendlier memory access
patterns, since at every iteration the children of the current node are adjacent
in memory (and all the grandchildren, and all the great grandchildren) - meaning
unlike with a binary search it's possible to prefetch.
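
Ignoring for the moment the size tricks described below, the walk itself can be
sketched like so - pretend each node is just a full `struct bpos`, laid out
breadth-first in an array with 1-based indexing so the children of node i are
at 2i and 2i+1 (`eytzinger_search_sketch()` is a made-up name; `prefetch()` is
the kernel's, from `<linux/prefetch.h>`):

        static unsigned eytzinger_search_sketch(const struct bpos *tree,
                                                unsigned size,
                                                struct bpos search)
        {
                unsigned i = 1;

                while (i < size) {
                        /* both children (and their siblings) are adjacent in memory */
                        prefetch(&tree[2 * i]);

                        /* descend right if search >= tree[i], else left */
                        i = 2 * i + (bkey_cmp(search, tree[i]) >= 0);
                }

                /*
                 * i now corresponds (via the inorder mapping described below)
                 * to the cacheline to finish with a linear search:
                 */
                return i;
        }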

Then there are a couple tricks we use to make these nodes as small as possible:

 - Because each node in the auxiliary search tree corresponds to precisely one
   cacheline, we don't have to store a full pointer to the original key - if we
   can compute given a node's position in the array/tree its index in an inorder
   traversal, we only have to store the key's offset within that cacheline.

   This is done by `to_inorder()` in bset.c, and it's mostly just shifts and bit
   operations.

 - Observe that as we're doing the lookup and walking down the tree, we have
   constrained the keys we're going to compare against to lie within a certain
   range [l, r).

   Then l and r will be equal in some number of their high bits (possibly 0);
   the keys we'll be comparing against and our search key will all be equal in
   the same bits - meaning we don't have to compare against, or store, any bits
   after that position.

   We also don't have to store all the low bits, either - we need to store
   enough bits to correctly pivot on the key the current node points to (call it
   m); i.e. we need to store enough bits to tell m apart from the key
   immediately prior to m (call it p). We're not looking for strict equality
   comparisons here, we're going to follow this up with a linear search anyways.

   So the node in the auxiliary search tree (roughly) needs to store the bits
   from where l and r first differed to where m and p first differed - and
   usually that's not going to be very many bits. The full struct bkey has 160
   bit keys, but 16 bit keys in the auxiliary search tree will suffice > 99% of
   the time.

   Lastly, since we'd really like these nodes to be fixed size - we just pick a
   size and then when we're constructing the auxiliary search tree check if we
   weren't able to construct a node, and flag it; the lookup code will fall back
   to comparing against the original key. Provided this happens rarely enough,
   the performance impact will be negligible.

   The auxiliary search trees were an enormous improvement to bcache's
   performance when they were introduced - previously, the lookup code was a
   simple binary search (eons ago, when keys were still fixed size). On random
   lookups with a large btree, the auxiliary search trees are easily over an
   order of magnitude faster.

## Bkey packing

Bkeys have gotten rather large, with 64 bit inodes, 64 bit offsets, a snapshot
field that isn't being used yet, a 32 bit size field that's only used by
extents... That's a lot of wasted bits.

The packing mechanism was added to address this, and as an added bonus it should
allow us to change or add fields to struct bkey (if we ever want to) without
incompatible on disk format changes (forwards and backwards!).

The main constraint is lookup performance - any packing scheme that required an
unpack for comparisons would probably be right out. So if we want to be able to
compare packed keys, the requirement is that it be order preserving. That is, we
need to define some function pack(), such that

        bkey_cmp(l, r) == bkey_cmp(pack(l), pack(r))

The way it works is for every btree node we define a packed format (and we
recalculate a new packed format when splitting/rewriting nodes); then keys that
are packed in the same format may be compared without unpacking them.

The packed format itself consists of, for each field in struct bkey

 - the size of that field in the packed key, in bits
 - an offset, to be subtracted before packing and when unpacking

Then, provided keys fit into the packed format (their key fields are not less
than the field offset, and after subtracting the offset their fields fit into
the number of bits in the packed key), packing is order preserving. Keys that
don't fit simply can't be packed - i.e. packing is allowed to fail.
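
Per field, packing is then just "subtract the offset and check that the result
fits in the allotted bits". A minimal sketch, under made-up names (the real
code also packs adjacent fields together at the bit level):

        struct field_format_sketch {
                u8      bits;           /* width of this field in the packed key */
                u64     field_offset;   /* subtracted when packing               */
        };

        /* returns false if the value doesn't fit - packing is allowed to fail */
        static bool pack_field(const struct field_format_sketch *f,
                               u64 in, u64 *out)
        {
                if (in < f->field_offset)
                        return false;

                in -= f->field_offset;

                if (f->bits < 64 && (in >> f->bits))
                        return false;   /* too big for the packed width */

                *out = in;
                return true;
        }

Since subtracting a constant and rejecting keys that don't fit preserves the
ordering of the keys that remain, comparing two keys packed in the same format
gives the same result as comparing the unpacked keys.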

Then in the lookup path, we pack the search key into the btree node's packed
format - even better, this helps the auxiliary search tree code since it no
longer has to deal with long strings of zeroes between the inode and the offset.

For extents, keys in the packed format are usually only 8 bytes - vs. 32 bytes
for struct bkey. And with no replication, checksumming or compression, the value
is only 8 bytes - so packing can reduce our metadata size by more than half.

The other interesting thing about this mechanism is that it makes our key format
self describing - we aren't restricted to just defining a format for packed
keys, we can define one for our in memory representation, struct bkey, as well
(and we do).

Then as long as we have the format corresponding to a key, we can use it - we
should just be careful not to drop fields it has that we don't understand (by
converting it to our current in memory representation and back).

All that's left to do is to store formats (besides the btree node-local format)
in the superblock. Right now the btree node local format is defined to be format
0 and the in memory format - struct bkey itself - is format 1,
`BKEY_FORMAT_CURRENT`. Changing struct bkey would be as simple as defining a new
format, and making sure the new format gets added to existing superblocks when
it's used.

## Error handling

This section assumes we have no fallback via replication to try (i.e. retry the
write to another device) - or we have, and that's failed: we're at the point
where we're forced to return errors up the stack.

Error handling of data writes is nothing special - we can return an error for
just the particular IO and be done with it.

Error handling for metadata IO is a more interesting topic. Fundamentally, if a
metadata write fails (be it journal or btree), we can't just return errors for
some subset of the outstanding operations - we have to stop everything.

This isn't peculiar to bcache - other journaling filesystems have the same
behaviour. There are a few reasons; partly it comes down to the fact that by the
time we do the actual IO, the index update (dirent creation, inode update) has
long since completed, so we can't return an error and unwind, and we can't
realistically backtrack from the contents of a journal write or a btree node
write to the in memory state that's now inconsistent.

So we have to consider the entire in memory state of the filesystem inconsistent
with what's on disk, and the only thing we can really do is just emergency
remount RO.

### Journal IO errors

### Btree IO errors