% TEMPLATE for Usenix papers, specifically to meet requirements of
%  USENIX '05
% originally a template for producing IEEE-format articles using LaTeX.
%   written by Matthew Ward, CS Department, Worcester Polytechnic Institute.
% adapted by David Beazley for his excellent SWIG paper in Proceedings,
%   Tcl 96
% turned into a smartass generic template by De Clarke, with thanks to
%   both the above pioneers
% use at your own risk.  Complaints to /dev/null.
% make it two column with no page numbering, default is 10 point

% Munged by Fred Douglis <douglis@research.att.com> 10/97 to separate
% the .sty file from the LaTeX source template, so that people can
% more easily include the .sty file into an existing document.  Also
% changed to more closely follow the style guidelines as represented
% by the Word sample file. 

% Note that since 2010, USENIX does not require endnotes. If you want
% foot of page notes, don't include the endnotes package in the 
% usepackage command, below.

% This version uses the latex2e styles, not the very ancient 2.09 stuff.
\documentclass[letterpaper,twocolumn,10pt]{article}
% \usepackage{usenix2019,epsfig,endnotes}
\usepackage{usenix2019,epsfig}

\begin{document}

%don't want date printed
\date{}

%make title bold and 14 pt font (Latex default is non-bold, 16 pt)
\title{\Large \bf bcachefs: A next generation filesystem}

%for single author (just remove % characters)
\author{
{\rm Kent Overstreet}\\
\and
{\rm Second Name}\\
Second Institution
% copy the following lines to add more authors
% \and
% {\rm Name}\\
%Name Institution
} % end author

\maketitle

% Use the following at camera-ready time to suppress page numbers.
% Comment it out when you first submit the paper for review.
\thispagestyle{empty}


\subsection*{Abstract}

bcachefs is a new b-tree based, CoW local POSIX filesystem for Linux. Here we
describe the background, status, and an overview of the key design elements.

\section{Introduction}

\subsection{What is bcachefs?}

bcachefs is a full featured POSIX filesystem, with a feature set aimed at
competing with ZFS, btrfs and xfs. Existing features include:

\begin{itemize}
	\item Copy on Write
	\item Full data checksumming
	\item Compression
	\item Encryption (AEAD style encryption using ChaCha20 and Poly1305)
	\item Multi-device
	\item Replication
	\item Online grow/shrink and add/remove of devices, and a completely
		flexible data layout
	\item Tiering/caching (both writethrough and writeback; IO stack is in
		fact policy driven now)
\end{itemize}
Upcoming features:
\begin{itemize}
	\item Erasure coding (Reed-Solomon, with ability to add others)
	\item Snapshotting
\end{itemize}

\subsection{Status}

bcachefs is in use by a small outside community of early adopters and testers,
and will likely soon see production use as well. Upstreaming is hopefully
imminent, as there are no blockers left on the bcachefs side.

\section{Bcachefs filesystem design overview}

The biggest departure in bcachefs from more conventional filesystems is in the
arrangement of metadata. Instead of having most metadata hang off of the inode,
bcachefs has a single btree for each distinct data type - i.e. inodes, extents,
dirents, xattrs, etc.

The result is a filesystem that is constructed almost as if it were implemented
on top of a relational database - bcachefs uses btrees the way a relational
database uses tables, and in the codebase there is a very clear distinction
between the filesystem layer and the key/value store layer.

This is a major simplification compared to existing filesystem codebases. Our
handling of on disk metadata is greatly regularized, and the filesystem code
need not concern itself with atomicity, journalling, or even most locking. In
bcachefs, only the btree update path interacts with the journalling code - the
filesystem code itself is completely unaware of journalling, a major
improvement over other existing filesystems.

The key points of how this is achieved are worth summarizing:

\begin{itemize}
	\item The btree uses an iterator interface. Iterators point to a
		position in some btree and support the usual peek() and next()
		operations, and it is the btree iterators that hold locks on
		btree nodes.
	\item Multiple iterators may be used simultaneously - we have a
		mechanism for linking iterators together so that they share
		locks, and updates through one iterator don't invalidate other
		iterators.
	\item Btree updates are done through iterators - and, since we have
		multiple iterators, the update interface takes a list of
		(iterator, new key) pairs and does all the updates atomically,
		in particular by putting them in the same journal entry.
\end{itemize}
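
As a concrete illustration, the iterator interface can be modeled as in the
following toy sketch, which iterates over a sorted array standing in for a
btree. The names and types here are hypothetical, not the actual bcachefs API:

{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct key { uint64_t k; };

struct btree_iter {
    const struct key *keys; /* sorted array standing in for a btree */
    size_t            nr;
    size_t            pos;  /* current position in the key space */
};

static void iter_init(struct btree_iter *it, const struct key *keys,
                      size_t nr, uint64_t start)
{
    it->keys = keys;
    it->nr   = nr;
    it->pos  = 0;
    while (it->pos < nr && keys[it->pos].k < start)
        it->pos++;
}

/* peek() returns the key at or after the current position, NULL at end */
static const struct key *iter_peek(struct btree_iter *it)
{
    return it->pos < it->nr ? &it->keys[it->pos] : NULL;
}

/* next() advances past the current key */
static void iter_next(struct btree_iter *it)
{
    if (it->pos < it->nr)
        it->pos++;
}

int main(void)
{
    const struct key keys[] = { {1}, {4}, {9} };
    struct btree_iter it;

    iter_init(&it, keys, 3, 2);     /* start iteration at key 2 */
    assert(iter_peek(&it)->k == 4); /* first key >= 2 */
    iter_next(&it);
    assert(iter_peek(&it)->k == 9);
    iter_next(&it);
    assert(iter_peek(&it) == NULL); /* end of keys */
    return 0;
}
\end{verbatim}
}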

Additionally, we have some wrapper code that allocates btree iterators out of a
context, and adds an update to a list of pending updates to be done at commit
time. With this, filesystem operations like create, rename etc. can be
structured almost as if they were written in SQL, wrapped in a BEGIN
TRANSACTION/COMMIT TRANSACTION.
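
The shape of such a transaction might be sketched as follows - updates are
queued on the transaction context and become visible together at commit time.
All names here are hypothetical stand-ins for the real interface:

{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stdint.h>

#define MAX_UPDATES 8

struct update { int btree_id; uint64_t key, val; };

struct btree_trans {
    struct update updates[MAX_UPDATES];
    int           nr_updates;
};

static void trans_begin(struct btree_trans *t) { t->nr_updates = 0; }

static void trans_update(struct btree_trans *t, int btree_id,
                         uint64_t key, uint64_t val)
{
    t->updates[t->nr_updates++] =
        (struct update) { btree_id, key, val };
}

/* All queued updates go in the same journal entry, making them
 * visible atomically; here we just report how many were batched. */
static int trans_commit(struct btree_trans *t)
{
    return t->nr_updates;  /* stand-in for one atomic journal write */
}

int main(void)
{
    struct btree_trans t;

    /* rename: delete old dirent, create new dirent, atomically */
    trans_begin(&t);
    trans_update(&t, /* dirents btree */ 2, 100, 0);
    trans_update(&t, /* dirents btree */ 2, 200, 1);
    assert(trans_commit(&t) == 2);  /* both land in one journal entry */
    return 0;
}
\end{verbatim}
}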

This gives us functionality normally only found in much more complicated
databases - but boiled down to the bare essentials so that it can be implemented
in a reasonable amount of code for an in-kernel filesystem. The btree code is
still only ~13k lines.

\section{Btree performance}

This design does mean btree performance is a more critical variable than in
other filesystems - instead of having one btree per directory, we have a single
global btree holding the dirents for every directory. Fortunately, bcachefs's
btree is, in our testing, one of the fastest ordered key value stores
available:

TODO: get benchmarks of other btrees (leveldb, others?)

\begin{verbatim}
seq_insert:  10.0M with 1 threads in    21 sec,  2016 nsec per iter,  484k per sec
seq_lookup:  10.0M with 1 threads in     0 sec,    13 nsec per iter, 72.1M per sec
seq_delete:  10.0M with 1 threads in    16 sec,  1589 nsec per iter,  614k per sec
rand_insert: 10.0M with 1 threads in    92 sec,  8785 nsec per iter,  111k per sec
seq_delete:  10.0M with 1 threads in    17 sec,  1660 nsec per iter,  588k per sec
rand_insert: 10.0M with 4 threads in    87 sec, 33382 nsec per iter,  117k per sec
rand_lookup: 10.0M with 1 threads in    10 sec,   996 nsec per iter,  979k per sec
rand_lookup: 10.0M with 4 threads in     3 sec,  1233 nsec per iter,  3.0M per sec
rand_mixed:  10.0M with 1 threads in    46 sec,  4388 nsec per iter,  222k per sec
rand_mixed:  10.0M with 4 threads in    38 sec, 14806 nsec per iter,  263k per sec
\end{verbatim}

\section{Btree design}

The core design elements of bcachefs's btree are fairly old at this point, as
they originated in the earliest versions of bcache, but have not been described
formally before.

Bcachefs's btree is mostly a relatively standard copy on write B+ tree, with one
major difference - btree nodes are large (typically 256k), and log structured.

Log structured btree nodes do add a significant amount of complexity to the
btree code. Nodes are too big for all the keys within a node to be kept as a
single sorted array, so we keep multiple (up to three) sorted arrays of keys in
memory for a given btree node, and the lookup and iterator code has to search
all of them internally and combine the results. More significantly, log
structured btree nodes give us some of the properties of a conventional
update-in-place btree: with a pure copy on write btree, the only writes that
have any direct effect on visible on disk metadata are writes that update the
pointer to the root node. With log structured btree nodes, writes that append
to btree nodes also make changes visible on disk, and have to be correctly
ordered with respect to other metadata writes.

But, log structured btree nodes have turned out to be an enormous performance
win:

First, by having multiple sorted arrays of keys (referred to as bsets within the
code) per btree node, most keys in a btree node will be in bsets that are no
longer being modified, and this means we can construct lookup tables that would
be much too expensive to use if we had to modify them.

This is a major issue because binary search is a worst case scenario for
performance on modern processors - prefetching is almost useless because at
every iteration there are two possible memory accesses, depending on which way
the previous comparison went, and they will be at completely different
locations in memory.

However, if we construct a binary search tree in Eytzinger layout
(https://arxiv.org/pdf/1509.05053.pdf), prefetching becomes useful because a
given node's two children are adjacent in memory - as are all its
grandchildren, great grandchildren, and so on. We then use a couple of tricks
to make nodes in the auxiliary search tree as small as possible - four bytes -
so we can fit as many of them as possible onto a cacheline:

\begin{itemize}
	\item Each node in the auxiliary search tree corresponds to one
		cacheline worth of data in the original set of keys, which means
		we don't have to store a full pointer to the original
		corresponding key - just an offset within that cacheline,
		combined with some math.
	\item Keys are large integers - 160 bits - but for any given comparison,
		we don't actually need most of those bits. The keys in the range
		we're currently searching will all have the same high bits - we
		don't need to store those. And we don't need to store the low
		bits: if we compare the pivot with the key immediately prior to
		it, we only need to store bits up to and including the first bit
		where those two keys differ (i.e. the key in the auxiliary
		search tree has to compare greater than the pivot's previous
		key).

		We need the nodes in the auxiliary search tree to be fixed
		size, which means that when constructing a node, the pivot and
		the prior key sometimes won't differ within the bits we can
		store - in that case, we flag the node as failed and the
		lookup code falls back to comparing against the original key.
		The lookup code parameters have been tuned so that this is a
		very rare occurrence (statistics on the auxiliary search trees
		are available in debugfs).
\end{itemize}
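
A minimal sketch of such a search, using the branchless lower-bound
formulation from the paper linked above; function names and the prefetch
distance are illustrative, not the bcachefs implementation:

{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stddef.h>

/* Fill eyt[1..n] from sorted[0..n-1]: an in-order traversal of the
 * implicit tree (children of node i are 2i and 2i+1) visits nodes in
 * sorted order. */
static size_t eytzinger_build(const int *sorted, int *eyt, size_t n,
                              size_t i, size_t pos)
{
    if (i <= n) {
        pos = eytzinger_build(sorted, eyt, n, 2 * i, pos);
        eyt[i] = sorted[pos++];
        pos = eytzinger_build(sorted, eyt, n, 2 * i + 1, pos);
    }
    return pos;
}

/* Returns the index of the first element >= x, or 0 if none */
static size_t eytzinger_search(const int *eyt, size_t n, int x)
{
    size_t i = 1;

    while (i <= n) {
        __builtin_prefetch(&eyt[16 * i]);  /* 4 levels ahead */
        i = 2 * i + (eyt[i] < x);          /* branchless descent */
    }
    /* Trailing 1 bits of i are right turns taken after the last left
     * turn; strip them (and that left turn) to recover the answer. */
    return i >> __builtin_ffsll(~(unsigned long long)i);
}

int main(void)
{
    const int sorted[] = { 1, 2, 3, 5, 8, 13, 21 };
    int eyt[8];

    eytzinger_build(sorted, eyt, 7, 1, 0);
    assert(eyt[eytzinger_search(eyt, 7, 5)] == 5);  /* exact match */
    assert(eyt[eytzinger_search(eyt, 7, 6)] == 8);  /* next key up */
    assert(eytzinger_search(eyt, 7, 22) == 0);      /* past the end */
    return 0;
}
\end{verbatim}
}

Note how the descent touches only indices $i$, $2i$, $2i+1$, \ldots, so the
next several levels of the tree occupy a predictable, contiguous range of
memory - exactly what hardware prefetchers and explicit prefetch instructions
want.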

All these tricks mean we can fit 16 nodes in the auxiliary search tree onto a
cacheline, and we can prefetch four levels in advance - which means we can
pipeline memory accesses reasonably effectively. The auxiliary search tree code
was, by far, the biggest single performance improvement in the history of
bcache/bcachefs - around an order of magnitude. This is why we're able to hit 1
million random lookups per second on a single core - that is an extremely
difficult number to hit for any comparison based data structure, simply based on
memory latency considerations and number of memory accesses needed per lookup.



OLD STUFF STARTS HERE:


\section{What makes bcachefs interesting?}

\begin{itemize}
	\item The design is a major departure from existing filesystems. In
		particular it is constructed more as a layer on top of a generic
		key-value store, with much more consistent handling of metadata
		and on disk data structures.
	\item This simplification is possible because we have a very
		sophisticated and scalable b-tree implementation, which has been
		under development for nearly 10 years now, inherited from
		bcache.
\end{itemize}

\section{Benchmarks}

Btree microbenchmarks:


\section{Design Novelties}

\begin{itemize}
	\item Auxiliary search tree for searching within a btree node

\end{itemize}



\section{bcachefs design features}

\subsection{Filesystem as RDBMS}

In bcachefs, all metadata is organized into a small number of ``btrees'' -
effectively tables:

\begin{itemize}
	\item Inodes table
	\item Extents table
	\item Dirents table
	\item Xattrs table
	\item etc.
\end{itemize}
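
To make the flat namespace concrete, here is a hypothetical sketch of the key
types involved - every key lives in one of these btrees and is ordered by
(inode, offset), so grouping by inode falls out of the key ordering rather
than the on disk structure. Names and fields are illustrative only:

{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stdint.h>

enum btree_id {
    BTREE_ID_INODES,
    BTREE_ID_EXTENTS,
    BTREE_ID_DIRENTS,
    BTREE_ID_XATTRS,
};

struct bpos {
    uint64_t inode;
    uint64_t offset;
};

/* Total order on positions within one btree */
static int bpos_cmp(struct bpos l, struct bpos r)
{
    if (l.inode != r.inode)
        return l.inode < r.inode ? -1 : 1;
    if (l.offset != r.offset)
        return l.offset < r.offset ? -1 : 1;
    return 0;
}

int main(void)
{
    /* all extents of inode 42 sort together: the sharding that
     * other filesystems get structurally, we get from ordering */
    struct bpos a = { .inode = 42, .offset = 0 };
    struct bpos b = { .inode = 42, .offset = 4096 };
    struct bpos c = { .inode = 43, .offset = 0 };

    assert(bpos_cmp(a, b) < 0);
    assert(bpos_cmp(b, c) < 0);
    assert(bpos_cmp(a, a) == 0);
    return 0;
}
\end{verbatim}
}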

This is a major departure from most existing filesystems, where most data
structures hang off of the inodes. That isn't an unreasonable way to do things
- it's an effective way to shard most data structures, and historically those
data structures weren't particularly scalable, so you really wanted that
sharding (e.g. block based filesystems that used the double/triple indirect
block scheme).

But bcachefs's history meant that we started out with a btree implementation
scalable enough that we no longer needed that sharding - bcache uses a single
btree for indexing every cached extent, regardless of the number of volumes
being cached (and a cached volume in bcache is equivalent to a file in
bcachefs).

This means we have a single unified, iterator based API for manipulating
essentially all filesystem data (everything not stored in the superblock), and
that metadata is all in a single flat namespace.

This is a huge win for the complexity of the filesystem code itself. Rename, for
example, is just a couple of operations on keys in the dirents btree, done
atomically in a single update operation - drastically simpler than most other
filesystems.

Even better, we don't need complex logging to make high level filesystem
operations like create, link and rename atomic. Instead, they make heavy use of
the btree's ability to use multiple iterators simultaneously, and then the final
update is done by passing a list of (iterator, new key) pairs to the btree
update path - for atomicity, all that is required is that the btree update path
use the same journal reservation for all the updates to be done.

This makes journalling, and in particular journal replay, drastically simpler
than on other contemporary filesystems. The filesystem code that implements
create/rename etc. approaches what it would look like using a conventional
database - the only complication is that it has to handle transaction/lock
restarts.

\subsection{Fully flexible layout}

The fundamental unit of data in bcachefs is an extent. An extent is a list of
one or more pointers (more than one in the case of replicated data, or data
with a copy on a cache device), along with a checksum if data checksumming is
enabled. There is no structure above or below extents for the purposes of the
IO path, and there are no inherent restrictions on where extents or pointers
can point - data layout is fundamentally policy driven, not baked in anywhere.
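
A hedged sketch of what an extent value might look like in memory - field
names and layout are illustrative, not bcachefs's on disk format:

{\tt \small
\begin{verbatim}
#include <assert.h>
#include <stdint.h>

struct extent_ptr {
    uint32_t dev;     /* which device this copy lives on */
    uint64_t offset;  /* sector offset on that device */
};

struct extent {
    uint32_t          csum;     /* checksum, if enabled */
    unsigned          nr_ptrs;  /* >1 for replicated or cached data */
    struct extent_ptr ptrs[4];
};

int main(void)
{
    /* two replicas of the same data on different devices; nothing
     * in the structure constrains where either pointer may point */
    struct extent e = {
        .csum    = 0xdeadbeef,
        .nr_ptrs = 2,
        .ptrs    = { { .dev = 0, .offset = 1024 },
                     { .dev = 1, .offset = 2048 } },
    };

    assert(e.nr_ptrs == 2);
    assert(e.ptrs[0].dev != e.ptrs[1].dev);
    return 0;
}
\end{verbatim}
}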

This means the core IO path in bcachefs is a relatively small amount of code.
Some of it is fairly tricky since 

All data operations (e.g. scrub, migrate, rereplicate)




A paragraph of text goes here.  Lots of text.  Plenty of interesting
text. \\

% More fascinating text. Features\endnote{Remember to use endnotes, not footnotes!} galore, plethora of promises.\\

\section{This is Another Section}

Some embedded literal typeset code might
look like the following:

{\tt \small
\begin{verbatim}
int wrap_fact(ClientData clientData,
              Tcl_Interp *interp,
              int argc, char *argv[]) {
    int result;
    int arg0;
    if (argc != 2) {
        interp->result = "wrong # args";
        return TCL_ERROR;
    }
    arg0 = atoi(argv[1]);
    result = fact(arg0);
    sprintf(interp->result,"%d",result);
    return TCL_OK;
}
\end{verbatim}
}

Now we're going to cite somebody.  Watch for the cite tag.
Here it comes~\cite{Chaum1981,Diffie1976}.  The tilde character (\~{})
in the source means a non-breaking space.  This way, your reference will
always be attached to the word that preceded it, instead of going to the
next line.

\section{This Section has SubSections}
\subsection{First SubSection}

Here's a typical figure reference.  The figure is centered at the
top of the column.  It's scaled.  It's explicitly placed.  You'll
have to tweak the numbers to get what you want.\\

% you can also use the wonderful epsfig package...
\begin{figure}[t]
\begin{center}
\begin{picture}(300,150)(0,200)
\put(-15,-30){\special{psfile = fig1.ps hscale = 50 vscale = 50}}
\end{picture}\\
\end{center}
\caption{Wonderful Flowchart}
\end{figure}

This text came after the figure, so we'll casually refer to Figure 1
as we go on our merry way.

\subsection{New Subsection}

It can get tricky typesetting Tcl and C code in LaTeX because they share
a lot of mystical feelings about certain magic characters.  You
will have to do a lot of escaping to typeset curly braces and percent
signs, for example, like this:
``The {\tt \%module} directive
sets the name of the initialization function.  This is optional, but is
recommended if building a Tcl 7.5 module.
Everything inside the {\tt \%\{, \%\}}
block is copied directly into the output, allowing the inclusion of
header files and additional C code.'' \\

Sometimes you want to really call attention to a piece of text.  You
can center it in the column like this:
\begin{center}
{\tt \_1008e614\_Vector\_p}
\end{center}
and people will really notice it.\\

\noindent
The noindent at the start of this paragraph makes it clear that it's
a continuation of the preceding text, not a new para in its own right.


Now this is an ingenious way to get a forced space.
{\tt Real~$*$} and {\tt double~$*$} are equivalent. 

Now here is another way to call attention to a line of code, but instead
of centering it, we noindent and bold it.\\

\noindent
{\bf \tt size\_t : fread ptr size nobj stream } \\

And here we have made an indented para like a definition tag (dt)
in HTML.  You don't need a surrounding list macro pair.
\begin{itemize}
\item[]  {\tt fread} reads from {\tt stream} into the array {\tt ptr} at
most {\tt nobj} objects of size {\tt size}.   {\tt fread} returns
the number of objects read. 
\end{itemize}
This concludes the definitions tag.

\subsection{How to Build Your Paper}

You have to run {\tt latex} once to prepare your references for
munging.  Then run {\tt bibtex} to build your bibliography metadata.
Then run {\tt latex} twice to ensure all references have been resolved.
If your source file is called {\tt usenixTemplate.tex} and your {\tt
  bibtex} file is called {\tt usenixTemplate.bib}, here's what you do:
{\tt \small
\begin{verbatim}
latex usenixTemplate
bibtex usenixTemplate
latex usenixTemplate
latex usenixTemplate
\end{verbatim}
}


\subsection{Last SubSection}

Well, it's getting boring isn't it.  This is the last subsection
before we wrap it up.

\section{Acknowledgments}

A polite author always includes acknowledgments.  Thank everyone,
especially those who funded the work. 

\section{Availability}

It's great when this section says that MyWonderfulApp is free software, 
available via anonymous FTP from

\begin{center}
{\tt ftp.site.dom/pub/myname/Wonderful}\\
\end{center}

Also, it's even greater when you can write that information is also 
available on the Wonderful homepage at 

\begin{center}
{\tt http://www.site.dom/\~{}myname/SWIG}
\end{center}

Now we get serious and fill in those references.  Remember you will
have to run latex twice on the document in order to resolve those
cite tags you met earlier.  This is where they get resolved.
We've preserved some real ones in addition to the template-speak.
After the bibliography you are DONE.

{\footnotesize \bibliographystyle{acm}
\bibliography{../common/bibliography}}


% \theendnotes

\end{document}