-rw-r--r-- | usenix2019.tex | 86 |
1 files changed, 72 insertions, 14 deletions
diff --git a/usenix2019.tex b/usenix2019.tex
index dd3e891..04ed8db 100644
--- a/usenix2019.tex
+++ b/usenix2019.tex
@@ -158,22 +158,80 @@ rand_mixed: 10.0M with 4 threads in 38 sec, 14806 nsec per iter, 263k per s
 
 \section{Btree design}
 
+The core design elements of bcachefs's btree are fairly old at this point, as
+they originated in the earliest versions of bcache, but they have not been
+described formally before.
+
+Bcachefs's btree is mostly a relatively standard copy-on-write B+ tree, with
+one major difference: btree nodes are large (typically 256k) and log
+structured.
+
+Log structured btree nodes do add a significant amount of complexity to the
+btree code. Nodes are too big for all the keys within a node to be kept as a
+single sorted array, so we have to keep multiple (up to three) sorted arrays
+of keys in memory for a given btree node, and the lookup and iterator code
+has to internally search all three and combine the results. More
+significantly, log structured btree nodes mean that we have some of the
+properties of a conventional update-in-place btree: in a copy-on-write btree,
+the only writes that have any direct effect on visible on-disk metadata are
+writes that update the pointer to the root node. With log structured btree
+nodes, writes that append to btree nodes also make changes visible on disk,
+and they have to be correctly ordered with respect to other metadata writes.
+
+But log structured btree nodes have turned out to be an enormous performance
+win.
+
+First, by having multiple sorted arrays of keys (referred to as bsets within
+the code) per btree node, most keys in a btree node will be in bsets that are
+no longer being modified, and this means we can construct lookup tables that
+would be much too expensive to use if we had to modify them.
+
+This is a major issue, because binary search is a worst case scenario for
+performance on modern processors: prefetching is almost useless, because at
+every iteration there are two possible memory accesses depending on which
+way the previous comparison went, and they'll be at completely different
+locations in memory.
+
+However, if we construct a binary search tree in Eytzinger layout
+(https://arxiv.org/pdf/1509.05053.pdf), prefetching becomes genuinely
+useful, because a given node's two children are adjacent in memory - as are
+all its grandchildren, great-grandchildren, and so on. We then use a couple
+of tricks to make nodes in the auxiliary search tree as small as possible -
+four bytes - so we can fit as many of them as possible onto a cacheline:
+\begin{itemize}
+	\item Each node in the auxiliary search tree corresponds to one
+	cacheline's worth of data in the original set of keys, which means
+	we don't have to store a full pointer to the corresponding original
+	key - just an offset within that cacheline, combined with some math.
+	\item Keys are large integers - 160 bits - but for any given
+	comparison we don't actually need most of those bits. The keys in
+	the range we're currently searching will all have the same high
+	bits, so we don't need to store those. And we don't need to store
+	the low bits: if we compare the pivot with the key immediately prior
+	to it, we only need to store bits up to and including the first bit
+	where those two keys differ (i.e. the key in the auxiliary search
+	tree has to compare greater than the pivot's previous key).
+
+	We need the nodes in the auxiliary search tree to be fixed size,
+	which means that sometimes, when constructing those nodes, the pivot
+	and the prior key won't differ within the bits we can store - to
+	handle that case, we flag that node as failed, and the lookup code
+	falls back to comparing against the original key.
+	But the lookup code parameters have been tuned so that this is a
+	very rare occurrence (statistics on the auxiliary search trees are
+	available in debugfs).
+\end{itemize}
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+All these tricks mean we can fit 16 nodes of the auxiliary search tree onto
+a cacheline, and we can prefetch four levels in advance - which means we can
+pipeline memory accesses reasonably effectively. The auxiliary search tree
+code was, by far, the biggest single performance improvement in the history
+of bcache/bcachefs - around an order of magnitude. It is why we're able to
+hit 1 million random lookups per second on a single core - an extremely
+difficult number for any comparison based data structure to hit, simply
+based on memory latency considerations and the number of memory accesses
+needed per lookup.