author    Kent Overstreet <kent.overstreet@gmail.com>    2015-05-30 03:54:30 -0700
committer Kent Overstreet <kent.overstreet@gmail.com>    2015-05-30 03:54:30 -0700
commit    2ce449ea9afae7f0ae63e22bfc1ad951a175eb62 (patch)
tree      2553ea9f11e8700ff93f7c67628437fa552f40f6
parent    ce9d482c525bcf842d2d27dbdeae9dfc0e05be30 (diff)
more guide
-rw-r--r--  BcacheGuide.mdwn | 129
1 file changed, 128 insertions(+), 1 deletion(-)
diff --git a/BcacheGuide.mdwn b/BcacheGuide.mdwn
index b8d8caa..dd45d91 100644
--- a/BcacheGuide.mdwn
+++ b/BcacheGuide.mdwn
@@ -1,4 +1,4 @@
-[[!toc levels=2 startlevel=2]]
+[[!toc levels=3 startlevel=2]]
# THE PROGRAMMER'S GUIDE TO BCACHE:
@@ -237,6 +237,9 @@ other filesystems that has them - a variable sized chunk of data. From the point
of view of the index, they aren't positions, they're ranges or half open
intervals (note that a 0 size extent doesn't overlap with anything).
+Bcache is also purely copy on write - it never updates data in place. This is a
+fundamental part of the design, and quite useful.
+
Bcache's extents are a list of one or more pointers - if there's more than one
pointer, they point to replicated data (or possibly one of the copies is on a
faster device and is considered cached).
@@ -362,3 +365,127 @@ basic algorithm is fairly simple there too:
- Finally, after the data write(s) complete, insert the keys we created into
the extents btree.
+
+#### Checksumming/compression:
+
+As mentioned previously, bch_extent got more complicated with data checksumming
+and compression.
+
+For data checksumming, what we want to do is store the checksum with the key and
+pointers, not the data - if you store the checksum with the data, verifying the
+checksum only tells you that you got the checksum that goes with that data; it
+doesn't tell you whether you got the data you actually wanted - perhaps the
+write never happened and you're reading old data. There are a number of subtle
+ways data can be corrupted that naively putting the checksum with the data won't
+protect against, and while there are workarounds for most of them, it's
+fundamentally not the ideal approach.
+
+So we want to store the checksum with the extent, but now we've got a different
+problem - partially overwritten extents. When an extent is partially
+overwritten, we don't have a checksum for the portion that's now live and we
+can't compute it without reading that data in, which would probably not be
+ideal.
+
+So we need to remember what the original extent was, so that later when that
+data is read we can read in the entire original extent, checksum that, and then
+return only the part that was live.
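+
+A minimal sketch of that read path, with illustrative names - the struct and
+helpers here aren't the real bcache ones, but the fields correspond to what
+the extent actually records (see bch_extent_crc32 below):
+
+    /*
+     * Sketch: read a checksummed extent that's been partially
+     * overwritten.  We read and checksum the entire original extent,
+     * then copy out only the live region.
+     */
+    struct orig_extent {
+        uint64_t start;        /* where the original extent lives on disk */
+        uint32_t orig_size;    /* size of the entire original extent */
+        uint32_t live_offset;  /* offset of the live region within it */
+        uint32_t live_size;    /* size of the live region */
+        uint32_t csum;         /* checksum of the entire original extent */
+    };
+
+    static int read_live_portion(const struct orig_extent *e, void *buf)
+    {
+        void *full = malloc(e->orig_size);
+        int ret = 0;
+
+        if (!full)
+            return -ENOMEM;
+
+        /* must read the whole original extent, not just the live part */
+        ret = device_read(e->start, full, e->orig_size);
+
+        if (!ret && checksum(full, e->orig_size) != e->csum)
+            ret = -EIO;        /* data didn't match its checksum */
+
+        if (!ret)
+            memcpy(buf, (char *) full + e->live_offset, e->live_size);
+
+        free(full);
+        return ret;
+    }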
+
+Compression leads to a similar issue: if the extent is compressed, we can't
+read just part of it and decompress it - we have to be able to read the entire
+extent.
+
+One simplifying factor is that since buckets are reused all at once, as long as
+any part of the original extent is live the rest of it will still be there - we
+don't have to complicate our accounting and have two notions of live data to
+make sure the dead parts of the extent aren't overwritten.
+
+So, for compressed or checksummed extents here's the additional information
+we'll need to store in struct bch_extent:
+
+ struct bch_extent_crc32 {
+ __u32 type:1,
+ offset:7,
+ compressed_size:8,
+ uncompressed_size:8,
+ csum_type:4,
+ compression_type:4;
+ __u32 csum;
+ };
+
+Now, when trimming an extent, instead of modifying the pointer offset to point
+to the start of the new live region (as we still do for non-checksummed,
+uncompressed extents), we add to the offset field in the extent_crc struct -
+representing the offset into the extent of the currently live portion. The
+compressed_size and uncompressed_size fields store the original sizes of the
+extent, while the size field in the key continues to represent the size of the
+live portion of the extent.
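+
+As a sketch, trimming off the front of an extent might look like this - the
+crc field names follow the struct above, while the helper and bkey names are
+illustrative:
+
+    /*
+     * Sketch: trim @sectors off the front of an extent.  For a plain
+     * extent we bump the pointer's offset to point at the new start of
+     * the data; for a checksummed/compressed extent we leave the
+     * pointer alone and bump crc->offset instead, so the original
+     * extent (and its checksum) stay intact on disk.
+     */
+    static void extent_trim_front(struct bkey *k,
+                                  struct bch_extent_crc32 *crc,
+                                  struct bch_extent_ptr *ptr,
+                                  unsigned sectors)
+    {
+        if (crc)
+            crc->offset += sectors;  /* live region starts later */
+        else
+            ptr->offset += sectors;  /* point at the new start */
+
+        k->size -= sectors;          /* key size = live size */
+    }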
+
+Another complicating factor is data migration, from copying garbage
+collection and tiering - in general, when they need to move some data we can't
+expect them to rewrite every replica of an extent at once (since copygc is
+needed for making forward progress, that would lead to nasty deadlocks).
+
+But if we're moving one of the replicas of an extent (or making another copy, as
+in the case of promoting to a faster tier) we don't want to have to copy the
+dead portions too - we'd never be able to get rid of the dead portions! Thus,
+we need to handle extents where different pointers are in different formats.
+
+Having one bch_extent_crc32 field per pointer (or bch_extent_crc64, for the case
+where the user is using 64 bit checksums) would make our metadata excessively
+large, unfortunately - if they're doing two or three way replication that would
+roughly double the size of the extents btree, or triple it for 64 bit checksums.
+
+So the new format struct bch_extent is a list of mixed pointers and extent_crc
+fields: each extent_crc field corresponds to the pointers that come after it,
+until the next extent_crc field. See include/uapi/linux/bcache.h for the gory
+details, and extents.h for the various functions and macros for iterating over
+and working with extent pointers and crc fields.
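+
+As a sketch, iterating over such an extent might look like the following - the
+union layout and helper names are illustrative stand-ins for the real
+definitions:
+
+    /*
+     * Sketch: walk a bch_extent's entry list.  A crc field applies to
+     * every pointer that follows it, until the next crc field; pointers
+     * seen before any crc field are unchecksummed and uncompressed.
+     */
+    union bch_extent_entry {
+        struct bch_extent_crc32 crc;
+        struct bch_extent_ptr   ptr;
+    };
+
+    static void walk_extent(union bch_extent_entry *entry,
+                            union bch_extent_entry *end)
+    {
+        struct bch_extent_crc32 *crc = NULL;   /* no crc field seen yet */
+
+        for (; entry < end; entry++) {
+            if (extent_entry_is_crc(entry))
+                crc = &entry->crc;             /* applies to later ptrs */
+            else
+                process_ptr(&entry->ptr, crc); /* crc may be NULL */
+        }
+    }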
+
+
+#### RAID5/6/Erasure coding:
+
+This part hasn't been implemented yet - but here's the rough design for what
+will hopefully be implemented in the near future:
+
+Background: In conventional RAID, people have long noted the "RAID hole" and
+wanted to solve it by moving RAID into the filesystem - this is what ZFS does.
+
+The hole is that it's impossible for block layer raid to atomically write a
+stripe: while writing a stripe, there's inevitably going to be a window where
+the P and Q blocks are inconsistent and recovery is impossible. The best that
+they can do is order the writes so that the data blocks are written first, then
+the P/Q blocks after, and then on unclean shutdown rebuild all the P/Q blocks
+(with perhaps a bitmap to avoid having to resync the whole device, as md does).
+
+The approach that ZFS takes is to fragment incoming writes into (variable sized)
+stripes, so that it can write out the entire stripe all at once and never have a
+pointer in its index pointing to an incomplete/inconsistent stripe.
+
+This works, but fragmenting IO is painful for performance. Not only are writes
+fragmented, reads of the same data are fragmented too; and since the latency of
+a read is the latency of its slowest fragment, this drives your median latency
+towards your tail latency. Worse, on disks you're spending a lot more time
+seeking than you were before; even on flash it's not ideal, because the
+fragmenting happens above all the scheduling - both on the host and in the
+device - so you're still subject to the tail latency effect.
+
+What we want is to be able to erasure encode unrelated data together - while
+avoiding update in place. This is problematic to do for foreground writes, but
+if we restrict ourselves to erasure encoding data in the background - perhaps
+even as it's being copied from the fast tier (flash) to the slow tier (disk) -
+the problem is much easier.
+
+The idea is to still replicate foreground writes, but keep track of the buckets
+that contain replicated data. Then, a background job can take e.g. five buckets
+that contain unrelated data, allocate two new buckets, compute the p/q for the
+original five buckets, and write them to the two new buckets. Next, for every
+extent that points into those first five buckets (found either with an index
+scan, or by remembering keys from recent foreground writes), it'll update the
+extent, dropping its extra replicas and adding pointers to the p/q buckets -
+with a bit set in each pointer telling the read path that using those pointers
+requires a reconstruct read (which will entail looking up the stripe mapping in
+another btree).
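+
+In rough pseudocode, that background job might look like the following (all
+names here are hypothetical, since none of this is implemented yet):
+
+    /*
+     * Sketch of the background erasure coding job: take five buckets of
+     * replicated but unrelated data, allocate two new buckets, compute
+     * p/q into them, then update every extent pointing into the stripe.
+     * Hypothetical pseudocode throughout.
+     */
+    static void ec_stripe_create(struct bucket *data[5])
+    {
+        struct bucket *p = alloc_bucket();
+        struct bucket *q = alloc_bucket();
+        struct bkey *k;
+        unsigned i;
+
+        /* compute parity across the unrelated buckets - never any
+         * update in place */
+        compute_raid6_pq(data, 5, p, q);
+
+        for (i = 0; i < 5; i++) {
+            /* index scan, or keys remembered from recent writes */
+            for_each_extent_in_bucket(data[i], k) {
+                extent_drop_extra_replicas(k);
+                extent_add_reconstruct_ptrs(k, p, q);
+                /* the stripe mapping lives in another btree */
+            }
+        }
+    }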
+
+This does mean that we can't reuse any of the buckets in the stripe until all
+of them are empty - but this shouldn't be much trouble; we just have to teach
+copygc about it when it's considering which buckets to evacuate.