-rw-r--r-- | usenix2019.tex | 86 |
1 files changed, 72 insertions, 14 deletions
diff --git a/usenix2019.tex b/usenix2019.tex
index dd3e891..04ed8db 100644
--- a/usenix2019.tex
+++ b/usenix2019.tex
@@ -158,22 +158,80 @@ rand_mixed: 10.0M with 4 threads in 38 sec, 14806 nsec per iter, 263k per s
 
 \section{Btree design}
 
+The core design elements of bcachefs's btree are fairly old at this point, as
+they originated in the earliest versions of bcache, but they have not been
+described formally before.
+
+Bcachefs's btree is mostly a relatively standard copy-on-write B+ tree, with
+one major difference: btree nodes are large (typically 256k) and log
+structured.
+
+Log structured btree nodes do add a significant amount of complexity to the
+btree code. Nodes are too big for all the keys within a node to be kept as a
+single sorted array, so we have to keep multiple (up to three) sorted arrays
+of keys in memory for a given btree node, and the lookup and iterator code
+has to internally search all three and combine the results. More
+significantly, log structured btree nodes mean that we have some of the
+properties of a conventional update-in-place btree: in a copy-on-write btree,
+the only writes that have any direct effect on visible on-disk metadata are
+writes that update the pointer to the root node. With log structured btree
+nodes, writes that append to btree nodes also make changes visible on disk,
+and they have to be correctly ordered with respect to other metadata writes.
+
+But log structured btree nodes have turned out to be an enormous performance
+win.
+
+First, by having multiple sorted arrays of keys (referred to as bsets within
+the code) per btree node, most keys in a btree node will be in bsets that are
+no longer being modified, and this means we can construct lookup tables that
+would be much too expensive to use if we had to modify them.
+
+This is a major issue, because binary search is a worst case scenario for
+performance on modern processors: prefetching is almost useless, because at
+every iteration there are two possible memory accesses depending on which
+way the previous comparison went, and they'll be at completely different
+locations in memory.
+
+However, if we construct a binary search tree in Eytzinger layout
+(https://arxiv.org/pdf/1509.05053.pdf), prefetching becomes genuinely
+useful, because a given node's two children are adjacent in memory - as are
+all its grandchildren, great-grandchildren, and so on. We then use a couple
+of tricks to make nodes in the auxiliary search tree as small as possible -
+four bytes - so we can fit as many of them as possible onto a cacheline:
+\begin{itemize}
+	\item Each node in the auxiliary search tree corresponds to one
+	cacheline's worth of data in the original set of keys, which means
+	we don't have to store a full pointer to the corresponding original
+	key - just an offset within that cacheline, combined with some math.
+	\item Keys are large integers - 160 bits - but for any given
+	comparison we don't actually need most of those bits. The keys in
+	the range we're currently searching will all have the same high
+	bits, so we don't need to store those. And we don't need to store
+	the low bits: if we compare the pivot with the key immediately prior
+	to it, we only need to store bits up to and including the first bit
+	where those two keys differ (i.e. the key in the auxiliary search
+	tree has to compare greater than the pivot's previous key).
+
+	We need the nodes in the auxiliary search tree to be fixed size,
+	which means that sometimes, when constructing those nodes, the pivot
+	and the prior key won't differ within the bits we can store - to
+	handle that case, we flag that node as failed, and the lookup code
+	falls back to comparing against the original key.
+	But the lookup code parameters have been tuned so that this is a
+	very rare occurrence (statistics on the auxiliary search trees are
+	available in debugfs).
+\end{itemize}
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+All these tricks mean we can fit 16 nodes of the auxiliary search tree onto
+a cacheline, and we can prefetch four levels in advance - which means we can
+pipeline memory accesses reasonably effectively. The auxiliary search tree
+code was, by far, the biggest single performance improvement in the history
+of bcache/bcachefs - around an order of magnitude. It is why we're able to
+hit 1 million random lookups per second on a single core - an extremely
+difficult number for any comparison based data structure to hit, simply
+based on memory latency considerations and the number of memory accesses
+needed per lookup.