author	Kent Overstreet <kent.overstreet@gmail.com>	2016-01-04 04:56:20 -0900
committer	Kent Overstreet <kent.overstreet@gmail.com>	2016-01-04 04:56:20 -0900
commit	0444fae55199547cf871921508885ea933c7bd7d (patch)
tree	db35f5e015f5871b13bcadc1c1d81e5a0dbf586c
parent	6a5996aaaad0b4c0fa13500c601909d1b7f39382 (diff)
http -> https
-rw-r--r--	Design.mdwn	130
-rw-r--r--	index.mdwn	14
2 files changed, 7 insertions, 137 deletions
diff --git a/Design.mdwn b/Design.mdwn
deleted file mode 100644
index f14386d..0000000
--- a/Design.mdwn
+++ /dev/null
@@ -1,130 +0,0 @@
-
-
-### Code documentation
-
-I've been adding new design documentation to the code, which should complement the documentation on this page:
-
-* [[http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/bcache.h|http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/bcache.h]]
-* [[http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/btree.h|http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/btree.h]]
-* [[http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/bset.h|http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/bset.h]]
-* [[http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/alloc.c|http://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/alloc.c]]
-
-### Design Overview
-
-The cache set is the highest level of the bcache hierarchy. A cache set is a collection of one or more caches and one or more storage devices. Each cache contains a single fast device. The cache set may contain a journal, which keeps a sequential log of all outstanding operations on the cache set. The cache set also contains worker threads for garbage collection and dirty page write-back. A cache set can be in either synchronous or asynchronous mode. In synchronous mode, all writes are strictly ordered so that the cache set can be recovered from an unclean shutdown. Each slow storage device attached to the cache set can be in either writethrough or writeback mode. In writethrough mode, writes to the cache set are not acknowledged until they have completed on the slow storage device. In writeback mode, writes are acknowledged as soon as the index and the data have been written to the fast device; background write-back later finds this dirty data and asynchronously writes it to the slow storage device.
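-
-The relationships described above can be sketched roughly as follows (the struct and field names here are purely illustrative and do not correspond to the actual bcache source):
-
-[[!format txt """
-#include <stdbool.h>
-
-/* Illustrative sketch of the cache set hierarchy; not the real bcache structures. */
-struct cache {
-	int	fast_dev;	/* one fast device (SSD) per cache */
-};
-
-struct cached_dev {
-	int	slow_dev;	/* a slow backing storage device */
-	bool	writeback;	/* writeback vs. writethrough mode */
-};
-
-struct cache_set {
-	struct cache		*caches;	/* one or more caches */
-	unsigned		nr_caches;
-	struct cached_dev	*devices;	/* one or more backing devices */
-	unsigned		nr_devices;
-	bool			synchronous;	/* strictly ordered writes if set */
-	bool			has_journal;	/* optional sequential log of operations */
-	/* plus worker threads for garbage collection and dirty write-back */
-};
-"""]]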
-
-Because the fast block devices are expected to be SSDs, bcache partitions the storage on the fast storage devices into buckets. All buckets on all the devices associated with a cache set have the same bucket size. The bucket size is expected to be the native erase block size of the SSDs (or the least common multiple of the erase block sizes of all the SSDs). Within a bucket, bcache writes sectors sequentially and only once. The minimum amount of data written at any one time is determined by the block size of the cache set. The block size is intended to correspond to the optimal write size of the device. When bcache needs to reclaim space on a fast device, it reclaims a bucket and sends the device a discard command covering the entire bucket before it writes sequentially to sectors within the bucket. This allows bcache to work well with SSDs that have very simple controllers. It also means that bcache could be used with raw flash with minor additions to handle ECC, bad block lists and wear leveling.
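-
-As a rough sketch of the allocation pattern this implies (the helper and field names are hypothetical, not the real bcache code): reclaiming a bucket discards the whole erase block once, and subsequent writes simply advance a cursor within the bucket.
-
-[[!format txt """
-#include <stdint.h>
-
-/* Hypothetical helper: discard a range of sectors on the fast device. */
-void discard_range(uint64_t start_sector, unsigned nr_sectors);
-
-struct open_bucket {
-	uint64_t	bucket_start;	/* first sector of the bucket */
-	unsigned	sectors_used;	/* write cursor within the bucket */
-	unsigned	bucket_sectors;	/* bucket size == erase block size */
-};
-
-/* Reclaim: discard the whole erase block, then write sequentially from the start. */
-static void reuse_bucket(struct open_bucket *b)
-{
-	discard_range(b->bucket_start, b->bucket_sectors);
-	b->sectors_used = 0;
-}
-
-/* Allocate @sectors contiguous sectors within the bucket, each written only once. */
-static int64_t alloc_in_bucket(struct open_bucket *b, unsigned sectors)
-{
-	if (b->sectors_used + sectors > b->bucket_sectors)
-		return -1;	/* bucket full - caller pulls a fresh one from the free list */
-
-	uint64_t ret = b->bucket_start + b->sectors_used;
-	b->sectors_used += sectors;
-	return (int64_t) ret;
-}
-"""]]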
-
-Bcache implements a very fast index. On every IO request to the devices in the cache set, bcache must very quickly determine where the requested data is located. This index must be fast and space efficient, both in memory and on the fast storage devices. The data structure used to implement the index is essentially a B+ tree. Unlike a traditional B+ tree, every node can contain more than one key set. Keys are ordered within a key set, but all key sets must be scanned from the newest set towards the oldest when searching for a key. Keys are never removed from a node until the node is either split or reclaimed by the garbage collection mechanism. Like a traditional B+ tree, only leaf nodes reference data. All internal nodes point to other B+ tree nodes.
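-
-A lookup therefore descends through internal nodes until it reaches a leaf, and only there finds keys that reference data. A minimal sketch, with the node accessors assumed rather than taken from the real bcache code:
-
-[[!format txt """
-#include <stdint.h>
-#include <stdbool.h>
-
-struct bkey;		/* key/pointer tuple, defined in the Index section below */
-struct btree_node;
-
-/* Assumed helpers for this sketch. */
-bool node_is_leaf(struct btree_node *n);
-struct btree_node *child_node(struct btree_node *n, uint64_t sector);
-struct bkey *leaf_lookup(struct btree_node *leaf, uint64_t sector);
-
-struct bkey *index_lookup(struct btree_node *root, uint64_t sector)
-{
-	struct btree_node *n = root;
-
-	/* Internal nodes only point at other btree nodes... */
-	while (!node_is_leaf(n))
-		n = child_node(n, sector);
-
-	/* ...only leaf nodes reference cached data. */
-	return leaf_lookup(n, sector);
-}
-"""]]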
-
-When bcache loads, it registers a major device number. Every slow storage device registered with bcache gets a new minor device number associated with that storage device. All accesses to the storage device must now be made through the major, minor pair provided by bcache for that storage device. To prevent anyone from inadvertently referencing the storage device without the cache attached (which could show an inconsistent version of the data), storage devices must be formatted with a special utility before they can be registered with bcache. This utility writes a bcache superblock at the start of the device, which bcache then hides from the version of the device it presents through its special major, minor device number. Once this is done, an ext4 (or any other filesystem) partition can be created through bcache and mounted. Should any attempt be made to mount the storage device without bcache, ext4 will not be able to locate its superblock and will reject the partition.
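-
-The effect is just an offset: the device bcache exposes starts past its own superblock, so every request is shifted before it reaches the raw device. A sketch (the constant below is an assumption for illustration, not the real superblock size):
-
-[[!format txt """
-#include <stdint.h>
-
-#define SB_DATA_OFFSET_SECTORS	16	/* assumed size of the hidden superblock area */
-
-/* Requests made through bcache's major/minor pair land past the hidden superblock,
- * which is why a filesystem on the raw device cannot find its own superblock. */
-static uint64_t remap_to_backing_device(uint64_t sector_on_bcache_dev)
-{
-	return sector_on_bcache_dev + SB_DATA_OFFSET_SECTORS;
-}
-"""]]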
-
-
-### Detailed Design
-
-
-#### Buckets
-
-Allocation of space on fast devices is done through buckets. A bucket is intended to correspond to an erase block on the SSD. At any given time, nearly all buckets are on a heap. The heap allows buckets to be reclaimed based on how frequently they have been accessed. Every bucket has a priority counter. This counter is increased whenever the bucket serves a cache hit. The priority of all the buckets in the heap is decremented periodically. This allows the least frequently used buckets to be quickly reclaimed and put on a short free list as needed. The generation number of the bucket is incremented when the bucket is reclaimed. This allows for very fast bucket reclaiming because there is no immediate need to traverse the entire index removing pointers that reference the reclaimed bucket. Garbage collection is then responsible for removing stale pointers from the index.
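-
-In outline, the per-bucket bookkeeping looks something like the following (a sketch only; the field widths and helper names are illustrative):
-
-[[!format txt """
-#include <stdint.h>
-#include <stddef.h>
-
-struct bucket {
-	uint16_t	prio;	/* bumped on cache hits, decayed periodically */
-	uint8_t		gen;	/* bumped when the bucket is reclaimed */
-};
-
-static void bucket_hit(struct bucket *b)
-{
-	if (b->prio < UINT16_MAX)
-		b->prio++;		/* frequently hit buckets stay off the free list */
-}
-
-static void rescale_priorities(struct bucket *buckets, size_t nr)
-{
-	for (size_t i = 0; i < nr; i++)
-		if (buckets[i].prio)
-			buckets[i].prio--;	/* idle buckets sink toward reclaim */
-}
-
-static void reclaim_bucket(struct bucket *b)
-{
-	/* Incrementing the generation invalidates every index pointer into this
-	 * bucket at once; garbage collection removes those pointers lazily. */
-	b->gen++;
-	b->prio = 0;
-}
-"""]]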
-
-Bcache keeps a short list of open buckets at all times. These buckets are ready to accept new data. Bcache will try to put sequential data into the same bucket, even if the data is separated by other arriving items. Failing that, bcache will try to collect data produced by the same task within a bucket. This strategy is the first stage in merging contiguous or related data sequences. When an open data bucket fills, a new bucket is pulled from the free list.
-
-The fixed layout of a fast device is as follows:
-[[!format txt """
-Sector 0 : 4K blank // Just in case somebody runs fdisk on this device
-Sector 8 : 4K superblock // Contains information such as the bucket size, the root bucket of the index, etc.
-Bucket n+j: Data // Data area managed by the bucket heap
-"""]]
-The superblock points to a list of buckets, which are used as a circular buffer for the journal. The journal header contains pointers to the rest of the metadata - the btree root, the UUIDs, and the bucket priorities/gens.
-
-
-#### Index
-
-The bcache index is a B+ tree in which each node occupies an entire bucket. This produces a very flat tree structure that often requires only a few pointer traversals to determine the presence or absence of any item. The bcache nodes contain key/pointer tuples. A tuple has the following structure:
-
-
-[[!format txt """
-struct bkey {
- struct {
- uint64_t header_flag : 1; // Set for the header record, clear for all others
- uint64_t num_ptrs : 3; // Number of pointers in this tuple
- uint64_t : 23; // Reserved for future use
- uint64_t dirty_flag : 1; // Set if the data is dirty
- uint64_t length : 16; // Number of sectors in this data item (up to 32MB)
- uint64_t virtual_device : 20; // Virtual device (currently corresponds to a real device)
- } header;
- struct {
- uint64_t header_flag : 1; // Clear for all except the header record
- uint64_t sector : 63; // Sector number in the virtual device (max 2^72 bytes per device)
- } key;
- struct {
- uint64_t header_flag : 1; // Clear for all except the header record
- uint64_t physical_device : 11; // Index to a real physical device
- uint64_t offset : 44; // Sector offset to the physical device (max 2^53 bytes per device)
- uint64_t generation : 8; // Generation number of the pointer
- } ptr[];
-};
-"""]]
-This structure is extremely flexible and allows bcache to potentially maintain up to eight cached copies of any particular piece of data. For example, bcache could cache two copies of dirty data on separate SSDs, so that if one device goes bad, a backup copy is still available to provide full consistency. Every pointer has a generation number, which allows buckets to be quickly reclaimed. If the generation number of the pointer does not match the generation number of the bucket, then the cached data has been reclaimed and is no longer valid. At some later point, garbage collection will lazily remove the invalid pointer.
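-
-The staleness check this implies is a simple comparison of generations. A sketch, using the struct bkey above and the illustrative struct bucket from the Buckets section (the lookup helper is assumed):
-
-[[!format txt """
-#include <stdbool.h>
-#include <stdint.h>
-
-/* Assumed helper: map a physical device / offset pair to its bucket metadata. */
-struct bucket *bucket_for_offset(unsigned physical_device, uint64_t offset);
-
-static bool ptr_stale(const struct bkey *k, unsigned i)
-{
-	const struct bucket *b = bucket_for_offset(k->ptr[i].physical_device,
-						   k->ptr[i].offset);
-
-	/* The bucket was reclaimed (and its generation bumped) after this
-	 * pointer was written, so the cached data is gone. */
-	return b->gen != k->ptr[i].generation;
-}
-"""]]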
-
-A key has a sector length, so each key can represent up to 32MB of cache data. Generally, the amount represented will be smaller, since there are limitations on how much sequential data the block devices associated with the cache set will accept. It is important to note that the value in the key's sector field actually indicates the number of the sector at the end of the data. To find the actual starting sector of the data referenced by the key, the length must be subtracted from the sector value. When searching within the index, the search returns the smallest key greater than the search value.
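-
-Expressed as arithmetic (the macro names here are illustrative):
-
-[[!format txt """
-#define KEY_OFFSET(k)	((k)->key.sector)		/* sector just past the cached range */
-#define KEY_SIZE(k)	((k)->header.length)		/* number of sectors in the range */
-#define KEY_START(k)	(KEY_OFFSET(k) - KEY_SIZE(k))	/* first sector of the range */
-
-/* A lookup for sector s wants the smallest key with KEY_OFFSET(k) > s;
- * that key covers s iff KEY_START(k) <= s, otherwise the cache misses. */
-"""]]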
-
-Unlike a textbook B+ tree, there is not one more pointer than there are keys in a node. This is not an issue while searching, because each key records the sector immediately following the range it represents. It does affect insertion: inserting a key in a leaf that extends the range past the last range already in the leaf requires that the new key value be propagated up the tree towards the root.
-
-Another issue that complicates maintaining the index is that a new range can be inserted that overlaps existing ranges in two adjacent leaves. This results in the insertion of the new range in the left leaf, which supersedes the overlapped ranges in that leaf. A special key to invalidate the existing overlapping ranges in the right leaf must also be added (there is no direct deletion from an index node, so a new key to invalidate the required range is appended to the index).
-
-Since a bucket is usually very large, it would be very inefficient to insert new keys into a large, fully sorted index node as it fills up. Bcache solves this problem by partitioning each index node into subsets. Within each subset, the keys are sorted. New keys are always inserted into the last subset. A subset starts with a special subset header which fits into the standard node index array:
-
-
-[[!format txt """
-struct bset {
- uint32_t csum; // Checksum of the subregion
- uint32_t keys; // Number of keys in this subregion
- uint64_t rand; // Used to verify that this is a valid subregion
- struct bkey start[0]; // Start of keys in this subset
-};
-"""]]
-When a subset is first used, a 30 second timer is started. New keys are then inserted into this new subset until the timer expires. When the timer expires, the contents of the subset are written to disk. Since the minimum write size is a block, there is a potential for space to be wasted in the disk version of the index node. This is a small price to pay for ensuring that the written version of the index is relatively current.
-
-When a new subset is prepared to write out to the cache device, the keys are checked to ensure that they are in sorted order. This is an internal sanity check to ensure that the index routines have not corrupted the tree. The subset's checksum is calculated and added to the subset header. To allow tracking of how efficiently the index space is used on the device, the global btree write counter is then incremented and the number of keys in the subset is added to the global keys write counter. A new subset is then set up, ready to start accumulating a new set of keys.
-
-Garbage collection will lazily merge written subsets in the in-memory instance of the index to form longer subsets with sorted keys. This increases the efficiency of future traversals of the index. Garbage collection also detects any keys that need to be removed; only garbage collection removes keys. On insertion, any key matching an existing item in the current subset overwrites the previous key, effectively deleting the old version. Since the index is always searched from the newest subset towards the oldest, any new key will effectively overwrite any other version of the key in a previous subset of the index node. Finally, since every pointer has a generation, stale pointers do not need to be removed immediately. Garbage collection, or the node splitting required to grow the index, finds any stale or redundant keys and removes them, preventing these keys from inordinately consuming valuable memory resources.
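-
-The overwrite semantics fall out of the search order. A sketch of the newest-first search over the sorted subsets of one node (the per-subset binary search is assumed):
-
-[[!format txt """
-#include <stddef.h>
-#include <stdint.h>
-
-/* Assumed helper: binary search one sorted subset for the key covering @sector. */
-struct bkey *bset_search(struct bset *set, uint64_t sector);
-
-struct bkey *node_lookup(struct bset **sets, unsigned nr_sets, uint64_t sector)
-{
-	/* sets[nr_sets - 1] is the newest subset; searching it first means a
-	 * rewritten extent shadows any older version of the same range. */
-	for (unsigned i = nr_sets; i-- > 0;) {
-		struct bkey *k = bset_search(sets[i], sector);
-		if (k)
-			return k;
-	}
-
-	return NULL;
-}
-"""]]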
-
-
-#### Journal
-
-The bcache journal is used to improve performance. It is possible to create cache devices that have no journal, so bcache works properly with or without a journal. If a journal is present, write requests can be completed as soon as the journal update completes. This avoids the longer tail latencies that can result when an index insert requires many writes to the caching device because it causes a B+ tree node split that propagates upwards and adds a new level. With the journal, the index updates can proceed in the background after the write completes.
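-
-The ordering described above amounts to the following sketch (the helpers are hypothetical; the real code is asynchronous and batches journal writes):
-
-[[!format txt """
-struct cache_set;
-struct bkey;
-
-/* Hypothetical helpers for this sketch. */
-void journal_append(struct cache_set *c, struct bkey *k);	/* sequential log write */
-void complete_write(void *request);				/* acknowledge the write */
-void index_insert_async(struct cache_set *c, struct bkey *k);	/* background btree update */
-
-static void cache_write_done(struct cache_set *c, struct bkey *k, void *request)
-{
-	journal_append(c, k);		/* one sequential journal write... */
-	complete_write(request);	/* ...is enough to acknowledge the request */
-	index_insert_async(c, k);	/* any btree node splits happen later */
-}
-"""]]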
-
-Since the index is always kept in a consistent state on the cache device, a power failure should not create problems, assuming that the cache device is well behaved and does not randomly corrupt or lose data as a result of a power loss. If the device does misbehave, even the journal is going to be of little use, since the journal itself may be lost or corrupted. Assuming a well-behaved caching device, a cache set configured to work in synchronous mode should always be recoverable after a power loss.
-
-If a cache set has a journal, the journal is quickly replayed when the cache set is started and any items in the journal not updated in the index are inserted into the index.
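-
-A sketch of that replay loop (the types and the insert helper are assumed; the real journal format differs):
-
-[[!format txt """
-#include <stdint.h>
-
-struct journal_entry {
-	uint64_t	seq;		/* monotonically increasing sequence number */
-	unsigned	nr_keys;
-	struct bkey	keys[];		/* keys logged by this entry */
-};
-
-/* Assumed helper: synchronous insert into the btree index. */
-void index_insert(struct bkey *k);
-
-static void journal_replay(struct journal_entry **entries, unsigned nr,
-			   uint64_t newest_seq_in_index)
-{
-	for (unsigned i = 0; i < nr; i++) {
-		if (entries[i]->seq <= newest_seq_in_index)
-			continue;	/* already reflected in the on-disk index */
-
-		for (unsigned j = 0; j < entries[i]->nr_keys; j++)
-			index_insert(&entries[i]->keys[j]);
-	}
-}
-"""]]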
-
-
-#### Write-back
-
-When write-back is enabled on a cached storage device there is greater potential for reducing seeks and improving performance. In write-through mode, a write is not completed until it has been written to the storage device. In write-back mode, the write is acknowledged when the data is written to the cache device and the journal update finishes. In the future, the journal will contain a strong checksum of the data so that the write will complete when the journal entry finishes (journal replay will stop if the expected data is not available).
-
-In write-back mode, the dirty flag on the key is set to indicate that the data is not available on the storage device. All read accesses to that data must be serviced from the cache. Any new writes to that same location replace the previous version, potentially saving extra seeks and write operations that would have had to take place on the storage device without the cache. At some point, the dirty data must be written to the storage device. During normal operation, this is done by a background write-back process. There are several tunable parameters that control write-back. The first is a write-back delay parameter, which specifies how long to wait after dirty data arrives before starting a write-back pass. The second is a write-back percentage parameter, indicating the minimum percentage of the cache that must be dirty before write-back will start. This allows for more buffering opportunities and increases the effectiveness of each write-back pass (it improves how much data is written back for each movement of the head).
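-
-A sketch of how those two tunables gate a write-back pass (names and units are illustrative):
-
-[[!format txt """
-#include <stdbool.h>
-#include <stdint.h>
-
-struct writeback_state {
-	uint64_t	dirty_sectors;		/* dirty data currently held in the cache */
-	uint64_t	cache_sectors;		/* total cache size */
-	uint64_t	last_dirty_time;	/* when dirty data last arrived (seconds) */
-	unsigned	writeback_delay;	/* seconds to wait after dirty data arrives */
-	unsigned	writeback_percent;	/* minimum percentage of the cache that is dirty */
-};
-
-static bool should_start_writeback(const struct writeback_state *s, uint64_t now)
-{
-	/* Waiting gives new writes a chance to overwrite or merge with the
-	 * dirty data before it is flushed to the slow device. */
-	if (now - s->last_dirty_time < s->writeback_delay)
-		return false;
-
-	return s->dirty_sectors * 100 >= s->cache_sectors * s->writeback_percent;
-}
-"""]]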
-
-When a write-back pass starts, it writes data to the drive in order, from the smallest sector numbers to the highest. This should increase the effective amount of work done for a single sweep of the head across the disk. Foreground read requests can continue to arrive while write-back is in progress; for this reason, write-back requests are sent with lower priority. Despite this, there are obvious opportunities to improve the effectiveness of background write-back by coupling it more closely to the foreground requests. Doing so, however, would complicate the mechanism.
-
-
-#### Notes on space efficiency
-
-The btree isn't pinned in memory except for the ~10 nodes in the lru. The pinned memory is all per bucket metadata; you want your bucket size to be equal to the erase block size so it'll be probably 128k or 512k; 128k with 128 gb of cache gives us 1M buckets.
-
-I've got it divided up into a couple structs (partly to reduce padding; cache behavior should be better too, not that that'll matter when disk io is a factor). On 64 bit the total overhead per bucket should be 33 bytes, if I haven't missed anything. Obviously not as good as yours, but still small enough it shouldn't be an issue. The single biggest thing is the heap, at a pointer and a long per bucket; if anyone really cares, 8 bytes could be saved if they aren't going to have more than 2 billion buckets :)
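-
-For a sanity check of those numbers (33 bytes per bucket is the figure quoted above):
-
-[[!format txt """
-#include <stdio.h>
-#include <stdint.h>
-
-int main(void)
-{
-	uint64_t cache_bytes  = 128ULL << 30;	/* 128 GB of cache */
-	uint64_t bucket_bytes = 128ULL << 10;	/* 128 KB bucket == erase block */
-	uint64_t buckets      = cache_bytes / bucket_bytes;
-	uint64_t metadata     = buckets * 33;	/* ~33 bytes pinned per bucket */
-
-	/* Prints roughly: 1048576 buckets, ~33 MB of pinned per-bucket metadata */
-	printf("%llu buckets, ~%llu MB of pinned per-bucket metadata\n",
-	       (unsigned long long) buckets,
-	       (unsigned long long) (metadata >> 20));
-	return 0;
-}
-"""]]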
-
-The btree size is also important since you want it to be in memory. Keys are 16 bytes and can reference anywhere from one sector to an entire bucket. I don't know of anything that does much IO in sub page sized units, so in practice everything will be at page sized granularity, and since you want to cache random IO and ignore sequential IO, average key size is ideally going to be on the lower end. It does merge keys when possible, and it keeps multiple buckets open for writing simultaneously so it can keep contiguous or related data together, so average key size is going to depend strongly on workload.
-
-So a full cache will give you 4 mb of keys per 1 gb of cache worst case, if they're all 4k. 128 gb of cache gives half a gig of btree.
-
-I haven't checked yet how space efficient the btree is in practice. Since old keys are removed by garbage collection, we don't have any kind of bound on how much of a specific btree bucket is used by non stale keys. However, garbage collection is performed every so often to find data that's been rewritten (there's other constraints on when we have to gc, but this is the one that we hit in practice), so on average not more than a quarter of the keys in the btree should be stale. (Actually, that's not quite true - right now gc doesn't rewrite leaf nodes except to keep generation counters from rolling over, but checking the percentage of stale keys is trivial, I just hadn't thought of it till now :)
-
-Then garbage collection can leave buckets less than half full - I haven't implemented btree merging because I don't think it'll really be needed in practice, mostly empty btree buckets will go away eventually for various reasons. But if it does turn out to be an issue it'll be relatively easy to add to garbage collection.
-
-So at a guess I'd expect the btree to be around half space efficient on average. For disk space, with 16 byte keys this should be fine. For memory usage, this could actually be improved a lot since btree buckets are used sequentially; once we've read one from disk or created a new one we know what pages don't contain anything yet, and those could be freed until they are needed. That'd make memory usage of the btree bounded and excellent.
-
-The real reason for the btree though was locality of disk writes. Making an index that's fast enough on lookups is easy - cutting down random writes to the bare minimum was my goal. If it helps performance elsewhere, that's just gravy :)
-
-Further off, checksums are going to be stored in the btree so that'll add a fixed, possibly configurable overhead.
diff --git a/index.mdwn b/index.mdwn
index 7b58be9..906428a 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -2,14 +2,14 @@
# Bcachefs is up!
-[[http://evilpiepirate.org/git/linux-bcache.git/log/?h=bcache-dev]]
+[[https://evilpiepirate.org/git/linux-bcache.git/log/?h=bcache-dev]]
It's still alpha quality, but the code is up and you can play with it now. Build
a kernel from the bcache-dev branch, and a new bcache-tools from the dev branch
in that repository.
- git clone -b bcache-dev http://evilpiepirate.org/git/linux-bcache.git
- git clone -b dev http://evilpiepirate.org/git/bcache-tools.git
+ git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
+ git clone -b dev https://evilpiepirate.org/git/bcache-tools.git
Then, just run
@@ -43,11 +43,11 @@ It's also designed to be safe. Reliability is critical for anything that does wr
Bcache is designed around the performance characteristics of SSDs. It's designed to minimize write inflation to the greatest extent possible, and never itself does random writes. It turns random writes into sequential writes - first when it writes them to the SSD, and then with writeback caching it can use your SSD to buffer gigabytes of writes and write them all out in order to your hard drive or raid array. If you've got a RAID6, you're probably aware of the painful random write penalty, and the expensive controllers with battery backup people buy to mitigate them. Now, you can use Linux's excellent software RAID and still get fast random writes, even on cheap hardware.
-[[User documentation|http://evilpiepirate.org/git/linux-bcache.git/tree/Documentation/bcache.txt]] - Documentation/bcache.txt in the bcache kernel tree.
+[[User documentation|https://evilpiepirate.org/git/linux-bcache.git/tree/Documentation/bcache.txt]] - Documentation/bcache.txt in the bcache kernel tree.
-[[Troubleshooting performance|http://evilpiepirate.org/git/linux-bcache.git/tree/Documentation/bcache.txt#n126]]
+[[Troubleshooting performance|https://evilpiepirate.org/git/linux-bcache.git/tree/Documentation/bcache.txt#n126]]
-[[Design and code documentation|Design]]
+[[Programmer's guide|BcacheGuide]]
[[FAQ|FAQ]]
@@ -62,7 +62,7 @@ Bcache has been merged into the mainline Linux kernel; for the latest stable bca
For the userspace tools,
- git clone http://evilpiepirate.org/git/bcache-tools.git
+ git clone https://evilpiepirate.org/git/bcache-tools.git
The udev rules, Debian/Ubuntu source package, and [Ubuntu PPA](https://launchpad.net/~g2p/+archive/storage/) are maintained here: