author    Kent Overstreet <kent.overstreet@gmail.com>    2015-05-30 03:54:30 -0700
committer Kent Overstreet <kent.overstreet@gmail.com>    2015-05-30 03:54:30 -0700
commit    2ce449ea9afae7f0ae63e22bfc1ad951a175eb62 (patch)
tree      2553ea9f11e8700ff93f7c67628437fa552f40f6
parent    ce9d482c525bcf842d2d27dbdeae9dfc0e05be30 (diff)
more guide
-rw-r--r--  BcacheGuide.mdwn | 129
1 file changed, 128 insertions(+), 1 deletion(-)
diff --git a/BcacheGuide.mdwn b/BcacheGuide.mdwn
index b8d8caa..dd45d91 100644
--- a/BcacheGuide.mdwn
+++ b/BcacheGuide.mdwn
@@ -1,4 +1,4 @@
-[[!toc levels=2 startlevel=2]]
+[[!toc levels=3 startlevel=2]]
# THE PROGRAMMER'S GUIDE TO BCACHE:
@@ -237,6 +237,9 @@ other filesystems that has them - a variable sized chunk of data. From the point
of view of the index, they aren't positions, they're ranges or half open
intervals (note that a 0 size extent doesn't overlap with anything).
+Bcache is also purely copy on write - it never updates data in place. This is a
+fundamental part of the design, and quite useful.
+
Bcache's extents are a list of one or more pointers - if there's more than one
pointer, they point to replicated data (or possibly one of the copies is on a
faster device and is considered cached).
@@ -362,3 +365,127 @@ basic algorithm is fairly simple there too:
- Finally, after the data write(s) complete, insert the keys we created into
the extents btree.
+
+#### Checksumming/compression:
+
+As mentioned previously, bch_extent got more complicated with data checksumming
+and compression.
+
+For data checksumming, what we want to do is store the checksum with the key and
+pointers, not the data - if you store the checksum with the data, verifying the
+checksum only tells you that you got the checksum that goes with that data; it
+doesn't tell you whether you got the data you actually wanted - perhaps the
+write never happened and you're reading old data. There are a number of subtle
+ways data can be corrupted that naively putting the checksum with the data won't
+protect against, and while there are workarounds for most of them, it's
+fundamentally not the ideal approach.
+
+So we want to store the checksum with the extent, but now we've got a different
+problem - partially overwritten extents. When an extent is partially
+overwritten, we don't have a checksum for the portion that's now live and we
+can't compute it without reading that data in, which would probably not be
+ideal.
+
+So we need to remember what the original extent was, so that later when that
+data is read we can read in the entire original extent, checksum that, and then
+return only the part that was live.
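+
+A minimal sketch of that read path, with illustrative names - the struct and
+helpers here aren't the real bcache ones, but the fields correspond to what
+the extent actually records (see bch_extent_crc32 below):
+
+    /*
+     * Sketch: read a checksummed extent that's been partially
+     * overwritten.  We read and checksum the entire original extent,
+     * then copy out only the live region.
+     */
+    struct orig_extent {
+        uint64_t start;        /* where the original extent lives on disk */
+        uint32_t orig_size;    /* size of the entire original extent */
+        uint32_t live_offset;  /* offset of the live region within it */
+        uint32_t live_size;    /* size of the live region */
+        uint32_t csum;         /* checksum of the entire original extent */
+    };
+
+    static int read_live_portion(const struct orig_extent *e, void *buf)
+    {
+        void *full = malloc(e->orig_size);
+        int ret = 0;
+
+        if (!full)
+            return -ENOMEM;
+
+        /* must read the whole original extent, not just the live part */
+        ret = device_read(e->start, full, e->orig_size);
+
+        if (!ret && checksum(full, e->orig_size) != e->csum)
+            ret = -EIO;        /* data didn't match its checksum */
+
+        if (!ret)
+            memcpy(buf, (char *) full + e->live_offset, e->live_size);
+
+        free(full);
+        return ret;
+    }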
+
+Compression leads to a similar issue: if the extent is compressed, we can't
+read just part of it and decompress it - we have to be able to read the entire
+extent.
+
+One simplifying factor is that since buckets are reused all at once, as long as
+any part of the original extent is live the rest of it will still be there - we
+don't have to complicate our accounting and have two notions of live data to
+make sure the dead parts of the extent aren't overwritten.
+
+So, for compressed or checksummed extents here's the additional information
+we'll need to store in struct bch_extent:
+
+ struct bch_extent_crc32 {
+ __u32 type:1,
+ offset:7,
+ compressed_size:8,
+ uncompressed_size:8,
+ csum_type:4,
+ compression_type:4;
+ __u32 csum;
+ };
+
+Now, when trimming an extent, instead of modifying the pointer offset to point
+to the start of the new live region (as we still do for non-checksummed,
+uncompressed extents), we add to the offset field in the extent_crc struct -
+representing the offset into the extent of the currently live portion. The
+compressed_size and uncompressed_size fields store the original sizes of the
+extent, while the size field in the key continues to represent the size of the
+live portion of the extent.
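+
+As a sketch, trimming off the front of an extent might look like this - the
+crc field names follow the struct above, while the helper and bkey names are
+illustrative:
+
+    /*
+     * Sketch: trim @sectors off the front of an extent.  For a plain
+     * extent we bump the pointer's offset to point at the new start of
+     * the data; for a checksummed/compressed extent we leave the
+     * pointer alone and bump crc->offset instead, so the original
+     * extent (and its checksum) stay intact on disk.
+     */
+    static void extent_trim_front(struct bkey *k,
+                                  struct bch_extent_crc32 *crc,
+                                  struct bch_extent_ptr *ptr,
+                                  unsigned sectors)
+    {
+        if (crc)
+            crc->offset += sectors;  /* live region starts later */
+        else
+            ptr->offset += sectors;  /* point at the new start */
+
+        k->size -= sectors;          /* key size = live size */
+    }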
+
+Another complicating factor is data migration, from copying garbage
+collection and tiering - in general, when they need to move some data we can't
+expect them to rewrite every replica of an extent at once (since copygc is
+needed for making forward progress, that would lead to nasty deadlocks).
+
+But if we're moving one of the replicas of an extent (or making another copy, as
+in the case of promoting to a faster tier) we don't want to have to copy the
+dead portions too - we'd never be able to get rid of the dead portions! Thus,
+we need to handle extents where different pointers are in different formats.
+
+Having one bch_extent_crc32 field per pointer (or bch_extent_crc64, for the case
+where the user is using 64 bit checksums) would make our metadata excessively
+large, unfortunately - if they're doing two or three way replication that would
+roughly double the size of the extents btree, or triple it for 64 bit checksums.
+
+So the new format struct bch_extent is a list of mixed pointers and extent_crc
+fields: each extent_crc field corresponds to the pointers that come after it,
+until the next extent_crc field. See include/uapi/linux/bcache.h for the gory
+details, and extents.h for the various functions and macros for iterating over
+and working with extent pointers and crc fields.
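+
+As a sketch, iterating over such an extent might look like the following - the
+union layout and helper names are illustrative stand-ins for the real
+definitions:
+
+    /*
+     * Sketch: walk a bch_extent's entry list.  A crc field applies to
+     * every pointer that follows it, until the next crc field; pointers
+     * seen before any crc field are unchecksummed and uncompressed.
+     */
+    union bch_extent_entry {
+        struct bch_extent_crc32 crc;
+        struct bch_extent_ptr   ptr;
+    };
+
+    static void walk_extent(union bch_extent_entry *entry,
+                            union bch_extent_entry *end)
+    {
+        struct bch_extent_crc32 *crc = NULL;   /* no crc field seen yet */
+
+        for (; entry < end; entry++) {
+            if (extent_entry_is_crc(entry))
+                crc = &entry->crc;             /* applies to later ptrs */
+            else
+                process_ptr(&entry->ptr, crc); /* crc may be NULL */
+        }
+    }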
+
+
+#### RAID5/6/Erasure coding:
+
+This part hasn't been implemented yet - but here's the rough design for what
+will hopefully be implemented in the near future:
+
+Background: In conventional RAID, people have long noted the "RAID hole" and
+wanted to solve it by moving RAID into the filesystem - this is what ZFS does.
+
+The hole is that it's impossible for block layer raid to atomically write a
+stripe: while writing a stripe, there's inevitably going to be a window where
+the P and Q blocks are inconsistent and recovery is impossible. The best that
+they can do is order the writes so that the data blocks are written first, then
+the P/Q blocks after, and then on unclean shutdown rebuild all the P/Q blocks
+(with perhaps a bitmap to avoid having to resync the whole device, as md does).
+
+The approach that ZFS takes is to fragment incoming writes into (variable sized)
+stripes, so that it can write out the entire stripe all at once and never have a
+pointer in its index pointing to an incomplete/inconsistent stripe.
+
+This works, but fragmenting IO is painful for performance. Not only are writes
+fragmented, reads of the same data are fragmented too; and since the latency of
+a read is the latency of its slowest fragment, this drives your median latency
+towards your tail latency. Worse, on disks you're spending a lot more time
+seeking than you were before; even on flash it's not ideal, because the
+fragmenting happens above all the scheduling - both on the host and in the
+device - so you're still subject to the tail latency effect.
+
+What we want is to be able to erasure encode unrelated data together - while
+avoiding update in place. This is problematic to do for foreground writes, but
+if we restrict ourselves to erasure encoding data in the background - perhaps
+even as it's being copied from the fast tier (flash) to the slow tier (disk) -
+the problem is much easier.
+
+The idea is to still replicate foreground writes, but keep track of the buckets
+that contain replicated data. Then, a background job can take e.g. five buckets
+that contain unrelated data, allocate two new buckets, compute the p/q for the
+original five buckets, and write them to the two new buckets. Next, for every
+extent that points into those first five buckets (found either with an index
+scan, or by remembering keys from recent foreground writes), it'll update the
+extent, dropping its extra replicas and adding pointers to the p/q buckets -
+with a bit set in each pointer telling the read path that using those pointers
+requires a reconstruct read (which will entail looking up the stripe mapping in
+another btree).
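+
+In rough pseudocode, that background job might look like the following (all
+names here are hypothetical, since none of this is implemented yet):
+
+    /*
+     * Sketch of the background erasure coding job: take five buckets of
+     * replicated but unrelated data, allocate two new buckets, compute
+     * p/q into them, then update every extent pointing into the stripe.
+     * Hypothetical pseudocode throughout.
+     */
+    static void ec_stripe_create(struct bucket *data[5])
+    {
+        struct bucket *p = alloc_bucket();
+        struct bucket *q = alloc_bucket();
+        struct bkey *k;
+        unsigned i;
+
+        /* compute parity across the unrelated buckets - never any
+         * update in place */
+        compute_raid6_pq(data, 5, p, q);
+
+        for (i = 0; i < 5; i++) {
+            /* index scan, or keys remembered from recent writes */
+            for_each_extent_in_bucket(data[i], k) {
+                extent_drop_extra_replicas(k);
+                extent_add_reconstruct_ptrs(k, p, q);
+                /* the stripe mapping lives in another btree */
+            }
+        }
+    }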
+
+This does mean that we can't reuse any of the buckets in the stripe until all
+of them are empty - but this shouldn't be much trouble; we just have to teach
+copygc about it when it's considering which buckets to evacuate.