author    Kent Overstreet <kent.overstreet@gmail.com>  2018-09-18 15:56:57 -0400
committer Kent Overstreet <kent.overstreet@gmail.com>  2018-09-18 15:56:57 -0400
commit    5df33cbb1f2e80a7205a7778ba6b451b77a4150e (patch)
tree      a8b620f6b93e3120a9e9d3d220cdaf14302f258c
parent    2babf772f07281bf5a3beb778ad924fdc061a125 (diff)
 usenix2019.tex | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 115 insertions(+), 2 deletions(-)
diff --git a/usenix2019.tex b/usenix2019.tex
index 34557a4..dd3e891 100644
--- a/usenix2019.tex
+++ b/usenix2019.tex
@@ -87,6 +87,99 @@ bcachefs is in outside use by a small community of early adopters and testers,
and likely soon in a production capacity as well. Upstreaming is hopefully
imminent as there are no blockers left on the bcachefs side.
+\section{Bcachefs filesystem design overview}
+
+The biggest departure in bcachefs from more conventional filesystems is in the
+arrangement of metadata. Instead of having most metadata hang off of the inode,
+bcachefs has a single btree for each distinct data type - inodes, extents,
+dirents, xattrs, etc.
+
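+To make the arrangement concrete, here is a minimal sketch of the idea in C
+(the names are illustrative, not the actual bcachefs definitions): every key
+type gets its own btree, and all of the btrees are indexed by the same small,
+fixed-size position:
+
+\begin{verbatim}
+#include <stdint.h>
+
+/* Illustrative sketch: one btree per metadata type... */
+enum btree_id {
+    BTREE_ID_INODES,
+    BTREE_ID_EXTENTS,
+    BTREE_ID_DIRENTS,
+    BTREE_ID_XATTRS,
+};
+
+/* ...all indexed by the same (inode number, offset) position. */
+struct bpos {
+    uint64_t inode;
+    uint64_t offset;
+};
+\end{verbatim}
+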
+The result is a filesystem that is constructed almost as if it were implemented
+on top of a relational database - bcachefs uses btrees the way a relational
+database uses tables, and in the codebase there is a very clear distinction
+between the filesystem layer and the key/value store layer.
+
+This is a major simplification compared to existing filesystem codebases. Our
+handling of on-disk metadata is greatly regularized, and the filesystem code
+need not concern itself with atomicity, journalling, or even most locking. In
+bcachefs, only the btree update path interacts with the journalling code - the
+filesystem code itself is completely unaware of journalling, a major
+improvement over other existing filesystems.
+
+The key points of how this is achieved are worth summarizing:
+
+\begin{itemize}
+ \item The btree uses an iterator interface. Iterators point to a
+ position in some btree and support the usual peek() and next()
+ operations, and it is the btree iterators that hold locks on
+ btree nodes.
+ \item Multiple iterators may be used simultaneously - we have a
+ mechanism for linking iterators together so that they share
+ locks, and updates through one iterator don't invalidate other
+ iterators.
+ \item Btree updates are done through iterators - and, since we have
+ multiple iterators, the update interface takes a list of
+ (iterator, new key) pairs and does all the updates atomically,
+ in particular by putting them in the same journal entry.
+\end{itemize}
+
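+A sketch of what this interface might look like follows; the function names and
+types are illustrative stand-ins rather than the exact bcachefs API:
+
+\begin{verbatim}
+#include <stdint.h>
+
+/* The btree ID enum and position struct from the sketch above: */
+enum btree_id { BTREE_ID_INODES, BTREE_ID_EXTENTS,
+                BTREE_ID_DIRENTS, BTREE_ID_XATTRS };
+struct bpos { uint64_t inode, offset; };
+
+struct bkey;           /* a key/value pair in some btree */
+struct btree_iter;     /* holds a position plus the node locks */
+
+/* Illustrative declarations, not the exact bcachefs API: */
+struct btree_iter *btree_iter_init(enum btree_id id, struct bpos pos);
+struct bkey *btree_iter_peek(struct btree_iter *iter);
+struct bkey *btree_iter_next(struct btree_iter *iter);
+
+/* Updates take (iterator, new key) pairs and are committed
+ * atomically, in a single journal entry: */
+struct btree_insert_entry {
+    struct btree_iter *iter;
+    struct bkey       *k;
+};
+int btree_insert_atomic(struct btree_insert_entry *entries, unsigned nr);
+
+/* Example: walk the extents of one inode.  (A real walk would also
+ * stop once the returned keys belong to a different inode.) */
+void walk_extents(uint64_t inum)
+{
+    struct bpos pos = { .inode = inum, .offset = 0 };
+    struct btree_iter *iter = btree_iter_init(BTREE_ID_EXTENTS, pos);
+    struct bkey *k;
+
+    for (k = btree_iter_peek(iter); k; k = btree_iter_next(iter)) {
+        /* process each extent key */
+    }
+}
+\end{verbatim}
+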
+Additionally, we have some wrapper code that allocates btree iterators out of a
+context and queues updates on a list of pending updates to be applied at commit
+time. With this, filesystem operations like create, rename, etc. can be
+structured almost as if they were written in SQL, wrapped in a BEGIN
+TRANSACTION/COMMIT TRANSACTION pair.
+
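+As a rough sketch (again with made-up names rather than the real interface), a
+create could then be structured like this:
+
+\begin{verbatim}
+#include <stdint.h>
+
+/* Btree IDs and positions as in the first sketch: */
+enum btree_id { BTREE_ID_INODES, BTREE_ID_EXTENTS,
+                BTREE_ID_DIRENTS, BTREE_ID_XATTRS };
+struct bpos { uint64_t inode, offset; };
+struct bkey;
+struct btree_iter;
+struct btree_trans;    /* the transaction context described above */
+
+/* Assumed wrapper interface -- illustrative names only: */
+struct btree_iter *trans_get_iter(struct btree_trans *trans,
+                                  enum btree_id id, struct bpos pos);
+void trans_update(struct btree_trans *trans, struct btree_iter *iter,
+                  struct bkey *new_key);
+int trans_commit(struct btree_trans *trans);
+
+/* Hypothetical helpers for building the new keys: */
+struct bkey *new_inode_key(uint64_t inum);
+struct bkey *new_dirent_key(const char *name, uint64_t target_inum);
+uint64_t dirent_name_hash(const char *name);
+
+int example_create(struct btree_trans *trans, uint64_t dir_inum,
+                   const char *name, uint64_t new_inum)
+{
+    /* BEGIN TRANSACTION */
+    struct bpos inode_pos  = { .inode = new_inum,
+                               .offset = 0 };
+    struct bpos dirent_pos = { .inode = dir_inum,
+                               .offset = dirent_name_hash(name) };
+
+    struct btree_iter *inode_iter =
+        trans_get_iter(trans, BTREE_ID_INODES, inode_pos);
+    struct btree_iter *dirent_iter =
+        trans_get_iter(trans, BTREE_ID_DIRENTS, dirent_pos);
+
+    trans_update(trans, inode_iter,  new_inode_key(new_inum));
+    trans_update(trans, dirent_iter, new_dirent_key(name, new_inum));
+
+    /* COMMIT TRANSACTION: both updates land in one journal entry */
+    return trans_commit(trans);
+}
+\end{verbatim}
+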
+This gives us functionality normally only found in much more complicated
+databases - but boiled down to the bare essentials so that it can be implemented
+in a reasonable amount of code for an in-kernel filesystem. The btree code is
+still only ~13k lines.
+
+\section{Btree performance}
+
+This design does mean btree performance is a more critical variable than in
+other filesystems - instead of one btree per directory, there is one global
+btree holding the dirents for every directory. Fortunately, bcachefs's btree is
+one of the fastest ordered key-value stores in existence:
+
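+One plausible way to key such a global dirents btree - a sketch only, not
+necessarily the real key format - is by (parent directory inode, hash of the
+name), so that all of a directory's entries sort together within the shared
+btree:
+
+\begin{verbatim}
+#include <stdint.h>
+
+struct bpos { uint64_t inode, offset; };
+
+/* Toy stand-in for the real dirent name hash (FNV-1a here). */
+static uint64_t dirent_name_hash(const char *name)
+{
+    uint64_t h = 14695981039346656037ULL;
+    while (*name)
+        h = (h ^ (uint8_t) *name++) * 1099511628211ULL;
+    return h;
+}
+
+/* All of a directory's entries share .inode, so they are adjacent
+ * in the one global dirents btree. */
+static struct bpos dirent_pos(uint64_t dir_inum, const char *name)
+{
+    struct bpos pos = { .inode = dir_inum,
+                        .offset = dirent_name_hash(name) };
+    return pos;
+}
+\end{verbatim}
+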
+TODO: get benchmarks of other btrees (leveldb, others?)
+
+\begin{verbatim}
+seq_insert: 10.0M with 1 threads in 21 sec, 2016 nsec per iter, 484k per sec
+seq_lookup: 10.0M with 1 threads in 0 sec, 13 nsec per iter, 72.1M per sec
+seq_delete: 10.0M with 1 threads in 16 sec, 1589 nsec per iter, 614k per sec
+rand_insert: 10.0M with 1 threads in 92 sec, 8785 nsec per iter, 111k per sec
+seq_delete: 10.0M with 1 threads in 17 sec, 1660 nsec per iter, 588k per sec
+rand_insert: 10.0M with 4 threads in 87 sec, 33382 nsec per iter, 117k per sec
+rand_lookup: 10.0M with 1 threads in 10 sec, 996 nsec per iter, 979k per sec
+rand_lookup: 10.0M with 4 threads in 3 sec, 1233 nsec per iter, 3.0M per sec
+rand_mixed: 10.0M with 1 threads in 46 sec, 4388 nsec per iter, 222k per sec
+rand_mixed: 10.0M with 4 threads in 38 sec, 14806 nsec per iter, 263k per sec
+\end{verbatim}
+
+\section{Btree design}
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+OLD STUFF STARTS HERE:
+
+
\section{What makes bcachefs interesting?}
\begin{itemize}
@@ -100,7 +193,12 @@ imminent as there are no blockers left on the bcachefs side.
bcache.
\end{itemize}
-\section{bcachefs design novelties}
+\section{Benchmarks}
+
+Btree microbenchmarks:
+
+
+\section{Design Novelties}
\begin{itemize}
\item Auxiliary search tree for searching within a btree node
@@ -154,9 +252,24 @@ update path - for atomicity, all that is required is that the btree update path
use the same journal reservation for all the updates to be done.
This makes journalling and in particular journal replay drastically simpler than
-on other contemporary filesystems.
+on other contemporary filesystems. The filesystem code that implements
+create/rename etc. approaches what it would look like if written against a
+conventional database - the only complication is that it has to handle
+transaction/lock restarts.
+
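+A sketch of what that restart handling might look like (the function names, and
+the use of -EINTR as the restart signal, are assumptions made for
+illustration):
+
+\begin{verbatim}
+#include <errno.h>
+
+struct btree_trans;
+
+/* Assumed interface -- illustrative names only: */
+void trans_begin(struct btree_trans *trans);
+int  trans_commit(struct btree_trans *trans);
+int  rename_in_trans(struct btree_trans *trans,
+                     unsigned long long src_dir, const char *src_name,
+                     unsigned long long dst_dir, const char *dst_name);
+
+int do_rename(struct btree_trans *trans,
+              unsigned long long src_dir, const char *src_name,
+              unsigned long long dst_dir, const char *dst_name)
+{
+    int ret;
+retry:
+    trans_begin(trans);
+
+    ret = rename_in_trans(trans, src_dir, src_name, dst_dir, dst_name);
+    if (!ret)
+        ret = trans_commit(trans);
+
+    /* If locks had to be dropped, retry the whole operation: */
+    if (ret == -EINTR)
+        goto retry;
+    return ret;
+}
+\end{verbatim}
+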
+\subsection{Fully flexible layout}
+
+The fundamental unit of data in bcachefs is an extent. An extent is a list of
+one or more pointers (more than one in the case of replicated data, or data
+with a copy on the cache device) along with a checksum, if data checksumming is
+enabled. There is no structure above or below extents for the purposes of the
+IO path, and there are no inherent restrictions on where extents or pointers
+can point - data layout is fundamentally policy-driven, not baked in anywhere.
+This means the core IO path in bcachefs is a relatively small amount of code.
+Some of it is fairly tricky since
+All data operations (e.g. scrub, migrate, rereplicate)
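+
+For illustration, an extent value can be pictured roughly as follows; this is a
+simplified sketch, not the actual on-disk encoding:
+
+\begin{verbatim}
+#include <stdint.h>
+
+/* Simplified picture of an extent value: a checksum plus one or
+ * more device pointers.  Not the real on-disk format. */
+struct example_extent_ptr {
+    uint32_t dev;       /* which device the data lives on */
+    uint64_t offset;    /* sector offset within that device */
+};
+
+struct example_extent {
+    uint64_t crc;                        /* checksum, if enabled */
+    unsigned nr_ptrs;                    /* >1 for replicated or cached data */
+    struct example_extent_ptr ptrs[];    /* where the data actually lives */
+};
+\end{verbatim}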