% USENIX paper template (usenix2019 style). Endnotes have not been
% required since 2010; the endnotes package is therefore omitted.
\documentclass[letterpaper,twocolumn,10pt]{article}
\usepackage{usenix2019,epsfig}

\begin{document}

% don't want date printed
\date{}

% make title bold and 14 pt font (LaTeX default is non-bold, 16 pt)
\title{\Large \bf bcachefs: A next generation filesystem}

\author{
{\rm Kent Overstreet}\\
}

\maketitle

% Use the following at camera-ready time to suppress page numbers.
% Comment it out when you first submit the paper for review.
\thispagestyle{empty}

\subsection*{Abstract}

bcachefs is a new b-tree based, copy on write (CoW) POSIX filesystem for
Linux. Here we describe its background, current status, and an overview of
the key design elements.
\section{Introduction}

\subsection{What is bcachefs?}

bcachefs is a full featured POSIX filesystem for Linux, with a feature set
aimed at competing with ZFS, btrfs and xfs. Existing features include:
\begin{itemize}
\item Copy on write
\item Full data checksumming
\item Compression
\item Encryption (AEAD style encryption using ChaCha20 and Poly1305)
\item Multi-device support
\item Replication
\item Online grow/shrink and add/remove of devices, with a completely
flexible data layout
\item Tiering/caching (both writethrough and writeback; the IO stack is in
fact entirely policy driven now)
\end{itemize}

Upcoming features:
\begin{itemize}
\item Erasure coding (Reed-Solomon, with the ability to add other codes)
\item Snapshotting
\end{itemize}

\subsection{Status}

bcachefs is in outside use by a small community of early adopters and
testers, and likely soon in a production capacity as well. Upstreaming is
hopefully imminent, as there are no blockers left on the bcachefs side.

\section{Bcachefs filesystem design overview}

The biggest departure in bcachefs from more conventional filesystems is the
arrangement of metadata. Instead of having most metadata hang off of the
inode, bcachefs has a separate btree for each distinct metadata type -
inodes, extents, dirents, xattrs, etc.

The result is a filesystem constructed almost as if it were implemented on
top of a relational database: bcachefs uses btrees the way a relational
database uses tables, and in the codebase there is a very clear distinction
between the filesystem layer and the key/value store layer. This is a major
simplification compared to existing filesystem codebases. Our handling of on
disk metadata is greatly regularized, and the filesystem code need not
concern itself with atomicity, journalling, or even most locking. In
bcachefs, only the btree update path interacts with the journalling code -
the actual filesystem code is completely unaware of journalling. This is a
major improvement over other existing filesystems.
The key points of how this is achieved are worth summarizing:
\begin{itemize}
\item The btree uses an iterator interface. Iterators point to a position in
some btree and support the usual peek() and next() operations, and it is the
btree iterators that hold locks on btree nodes.
\item Multiple iterators may be used simultaneously: we have a mechanism for
linking iterators together so that they share locks, and updates through one
iterator do not invalidate other iterators.
\item Btree updates are done through iterators - and, since we can have
multiple iterators, the update interface takes a list of (iterator, new key)
pairs and performs all the updates atomically, in particular by putting them
in the same journal entry.
\end{itemize}

Additionally, we have wrapper code that allocates btree iterators out of a
context, and adds each update to a list of pending updates to be done at
commit time. With this, filesystem operations like create, rename, etc. can
be structured almost as if they were written in SQL, wrapped in {\tt BEGIN
TRANSACTION}/{\tt COMMIT TRANSACTION}. This gives us functionality normally
only found in much more complicated databases - but boiled down to the bare
essentials, so that it can be implemented in a reasonable amount of code for
an in-kernel filesystem: the btree code is still only about 13k lines.

\section{Btree performance}

This design does make btree performance a more critical variable than in
other filesystems - instead of having one btree for each directory, we have
one global btree holding the dirents for every directory. Fortunately,
bcachefs's btree is one of the fastest ordered key/value stores in
existence:

% TODO: get benchmarks of other btrees (leveldb, others?)
\begin{verbatim}
seq_insert:  10.0M with 1 threads in 21 sec, 2016 nsec per iter, 484k per sec
seq_lookup:  10.0M with 1 threads in 0 sec, 13 nsec per iter, 72.1M per sec
seq_delete:  10.0M with 1 threads in 16 sec, 1589 nsec per iter, 614k per sec
rand_insert: 10.0M with 1 threads in 92 sec, 8785 nsec per iter, 111k per sec
seq_delete:  10.0M with 1 threads in 17 sec, 1660 nsec per iter, 588k per sec
rand_insert: 10.0M with 4 threads in 87 sec, 33382 nsec per iter, 117k per sec
rand_lookup: 10.0M with 1 threads in 10 sec, 996 nsec per iter, 979k per sec
rand_lookup: 10.0M with 4 threads in 3 sec, 1233 nsec per iter, 3.0M per sec
rand_mixed:  10.0M with 1 threads in 46 sec, 4388 nsec per iter, 222k per sec
rand_mixed:  10.0M with 4 threads in 38 sec, 14806 nsec per iter, 263k per sec
\end{verbatim}

\section{Btree design}

The core design elements of bcachefs's btree are fairly old at this point -
they originated in the earliest versions of bcache - but they have not been
described formally before. Bcachefs's btree is mostly a relatively standard
copy on write B+ tree, with one major difference: btree nodes are large
(typically 256k), and log structured.

Log structured btree nodes do add a significant amount of complexity to the
btree code. Nodes are too big for all the keys within a node to be kept as a
single sorted array, so we have to keep multiple (up to three) sorted arrays
of keys in memory for a given btree node, and the lookup and iterator code
has to internally search all of them and combine the results.

More significantly, log structured btree nodes mean that we take on some of
the properties of a conventional update-in-place btree: with a pure copy on
write btree, the only writes that have any direct effect on visible on disk
metadata are writes that update the pointer to the root node. With log
structured btree nodes, writes that append to btree nodes also make changes
visible on disk, and have to be correctly ordered with respect to other
metadata writes.
But log structured btree nodes have turned out to be an enormous performance
win.

First, by having multiple sorted arrays of keys (referred to as bsets within
the code) per btree node, most keys in a btree node will be in bsets that
are no longer being modified, which means we can construct lookup tables
that would be much too expensive to use if we had to modify them. This
matters because binary search is a worst case scenario for modern
processors: prefetching is almost useless, since at every iteration there
are two possible memory accesses depending on which way the previous
comparison went, and they will be at completely different locations in
memory. However, if we construct a binary search tree in Eytzinger layout
(https://arxiv.org/pdf/1509.05053.pdf), prefetching becomes useful because a
given node's two children are adjacent in memory - as are all its
grandchildren, great grandchildren, and so on.

Then we use a couple of tricks to make the nodes in this auxiliary search
tree as small as possible - four bytes - so that we can fit as many of them
as possible onto a cacheline:
\begin{itemize}
\item Each node in the auxiliary search tree corresponds to one cacheline's
worth of data in the original set of keys, which means we don't have to
store a full pointer to the corresponding original key - just an offset
within that cacheline, combined with some math.
\item Keys are large integers - 160 bits - but for any given comparison we
don't actually need most of those bits. The keys in the range we're
currently searching will all have the same high bits, so we don't need to
store those. And we don't need to store the low bits: if we compare the
pivot with the key immediately prior to it, we only need to store bits up to
and including the first bit where those two keys differ (i.e. the key in the
auxiliary search tree has to compare greater than the pivot's previous key).
We need the nodes in the auxiliary search tree to be fixed size, which means
that sometimes, when constructing a node, the pivot and the prior key will
not differ within the bits available. To handle that case, we flag the node
as failed and the lookup code falls back to comparing against the original
key. The parameters have been tuned so that this is a very rare occurrence
(statistics on the auxiliary search trees are available in debugfs).
\end{itemize}

All these tricks mean we can fit 16 auxiliary search tree nodes onto a
cacheline, and we can prefetch four levels in advance - which means we can
pipeline memory accesses reasonably effectively. The auxiliary search tree
code was, by far, the biggest single performance improvement in the history
of bcache/bcachefs - around an order of magnitude. It is why we're able to
hit a million random lookups per second on a single core: that is an
extremely difficult number for any comparison based data structure to hit,
simply based on memory latency considerations and the number of memory
accesses needed per lookup.

\section{What makes bcachefs interesting?}
\begin{itemize}
\item The design is a major departure from existing filesystems. In
particular, it is constructed as a layer on top of a generic key-value
store, with much more consistent handling of metadata and on disk data
structures.
\item This simplification is possible because we have a very sophisticated
and scalable b-tree implementation, inherited from bcache, which has been
under development for nearly 10 years now.
\end{itemize}

\section{bcachefs design features}

\subsection{Filesystem as RDBMS}

In bcachefs, all metadata is organized into a small number of btrees -
effectively tables:
\begin{itemize}
\item Inodes table
\item Extents table
\item Dirents table
\item Xattrs table
\item etc.
\end{itemize}

This is a major departure from most existing filesystems, where most data
structures hang off of the inodes. That isn't an unreasonable way to do
things - it's an effective way to shard most data structures, and
historically those data structures weren't particularly scalable, so the
sharding was really needed (e.g. block based filesystems that used the
double/triple indirect block scheme). But bcachefs's history meant that we
started out with a btree implementation scalable enough that we no longer
needed the sharding: bcache uses a single btree for indexing every cached
extent, regardless of the number of volumes being cached (and a cached
volume in bcache is equivalent to a file in bcachefs).

This means we have a single, unified, iterator based API for manipulating
essentially all filesystem data (everything not stored in the superblock),
and that metadata all lives in a single flat namespace. This is a huge win
for the complexity of the filesystem code itself. Rename, for example, is
just a couple of operations on keys in the dirents btree, done atomically in
a single update operation - drastically simpler than in most other
filesystems. Even better, we don't need complex logging to make high level
filesystem operations like create, link and rename atomic.
Instead, these operations make heavy use of the btree's ability to use
multiple iterators simultaneously; the final update is then done by passing
a list of (iterator, new key) pairs to the btree update path, and for
atomicity all that is required is that the btree update path use the same
journal reservation for all the updates. This makes journalling - and in
particular journal replay - drastically simpler than in other contemporary
filesystems. The filesystem code that implements create, rename, etc.
approaches what it would look like written against a conventional database;
the only complication is that it has to handle transaction/lock restarts.

\subsection{Fully flexible layout}

The fundamental unit of data in bcachefs is the extent. An extent is a list
of one or more pointers (more in the case of replicated data, or data with a
copy on the cache device), along with a checksum if data checksumming is
enabled. There is no structure above or below extents for the purposes of
the IO path, and there aren't any inherent restrictions on where extents or
pointers can point - data layout is fundamentally policy driven, not baked
in anywhere. This means the core IO path in bcachefs is a relatively small
amount of code, though some of it is fairly tricky, since all data
operations (e.g. scrub, migrate, rereplicate) are built on this same extent
based path.
\end{document}