Bcachefs documentation updates

author: Kent Overstreet <kent.overstreet@gmail.com> 2016-08-28 19:04:11 -0800
committer: Kent Overstreet <kent.overstreet@gmail.com> 2016-08-28 19:04:11 -0800
commit: 6dbe3d1f437dc9034f116d31b3fd476e83a273cc (patch)
tree: f308e5f3863061be3eac9aff0ea7d2410155b07e
parent: 8e0e14f5b8e7caba2522e3c8ad15bef6a282329a (diff)
3 files changed, 318 insertions, 36 deletions
diff --git a/Bcachefs.mdwn b/Bcachefs.mdwn
new file mode 100644
index 0000000..530fcf2
--- /dev/null
+++ b/Bcachefs.mdwn
@@ -0,0 +1,234 @@
+
+# Bcachefs
+
+It's a next generation copy on write filesystem for Linux with a long list of
+features - tiering/caching, data checksumming, compression, encryption, multiple
+devices, et cetera.
+
+It's not vaporware - it's a real filesystem you can run on your laptop or server
+today.
+
+We prioritize robustness and reliability over features and hype: we make every
+effort to ensure you won't lose data. It's building on top of a codebase with a
+pedigree - bcache already has a reasonably good track record for reliability
+(particularly considering how young upstream bcache is, in terms of engineer
+man-years). Starting from there, bcachefs development has prioritized
+incremental development, and keeping things stable, and aggressively fixing
+design issues as they are found; the bcachefs codebase is considerably more
+robust and mature than upstream bcache.
+
+Developing a filesystem is also not cheap or quick or easy; we need funding!
+Please chip in on [[Patreon|https://www.patreon.com/bcachefs]] - the Patreon
+page also has more information on the motivation for bcachefs and the state of
+Linux filesystems.
+
+## Why bcachefs?
+
+For existing bcache users, we've got a particularly compelling argument: a block
+layer cache implements the core functionality of a filesystem (allocation,
+reclamation, and mapping from one address space to another). By running a
+filesystem on top of a block cache every IO must traverse two different mapping
+layers - each taking up memory and space for their index, each adding to every
+IO's latency (and tail latency!), and adding more complexity to your IO path -
+in particular, making it that much harder to debug performance issues.
+
+By using a filesystem that was designed for caching from the start (among many
+other things) we're able to collapse the two mapping layers and eliminate a lot
+of redundant complexity - many things also become easier when they're done
+within the context of an (appropriately designed) filesystem, versus the block
+layer - cache coherency, for example, is a tricky problem in bcache but trivial
+in bcachefs. This has real world impact - some of the most pernicious bugs and
+performance issues bcache users have hit have been because of the writeback
+lock, which is needed for cache coherency - that code is all gone in bcachefs.
+
+What if you don't care about caching, what if you just want a filesystem that
+works? Bcachefs is not just targeted at caching - it's meant to be a superior
+replacement for ext4, xfs and btrfs. We intend to deliver a copy on write
+filesystem, with all the features you'd expect from a modern copy on write
+filesystem - but with the performance and robustness to be a very viable
+replacement for ext4 and xfs. We will deliver on that goal.
+
+## Documentation
+
+End user documentation is currently fairly minimal; this would be a very helpful
+area for anyone who wishes to contribute - I would like the bcache man page in
+the bcache-tools repository to be rewritten and expanded.
+
+There is some fairly substantial developer documentation: see [[BcacheGuide]].
+
+## Getting started
+
+Bcachefs is not upstream, and won't be for a while. If you want to try out
+bcachefs now, you'll need to be comfortable with building your own kernel. Also,
+as bcachefs has had many incompatible on disk format changes, you cannot
+currently build a kernel with support for both bcachefs and the existing,
+upstream bcache on disk format (this will change prior to bcachefs going
+upstream).
+
+First, check out the bcache kernel and tools repositories:
+
+    git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
+    git clone -b dev https://evilpiepirate.org/git/bcache-tools.git
+
+Build and install as usual. Then, to format and mount a single device with the
+default options, run:
+
+    bcache format /dev/sda1
+    mount /dev/sda1 /mnt
+
+See `bcache format --help` for more options.
+
+## Status
+
+Bcachefs can currently be considered beta quality. It has a small pool of
+outside users and has been quite stable and reliable so far; there's no reason
+to expect issues as long as you stick to the currently supported feature set.
+Being a new filesystem, backups are still recommended though.
+
+Performance is generally quite good - generally faster than btrfs, and not far
+behind xfs/ext4. There are still performance bugs to be found and optimizations
+we'd like to do, but performance isn't currently the primary focus - the main
+focus is on making sure it's production quality and finishing the core feature
+set.
+
+Normal posix filesystem functionality is all finished - if you're using bcachefs
+as a replacement for ext4 on a desktop, you shouldn't find anything missing. For
+servers, NFS export support is still missing (but coming soon) and we don't yet
+support quotas (probably further off).
+
+Pretty much all the normal posix filesystem stuff is supported (things like
+xattrs, acls, etc. - no quotas yet, though).
+
+The on disk format is not yet set in stone - there will be future breaking
+changes to the on disk format, but we will make every effort to make transitioning
+easy for users (e.g. when there are breaking changes there will be kernel
+branches maintained in parallel that support old and new formats to give users
+time to transition; users won't be left stranded with data they can't access).
+We'll need at least one more breaking change for encryption and possibly
+snapshots, but I'm trying to batch up all the breaking changes as much as
+possible.
+
+### Feature status
+
+ - Full data checksumming
+
+   Fully supported and enabled by default. We do need to implement scrubbing,
+   once we've got replication and can take advantage of it.
+
+ - Compression
+
+   Not _quite_ finished - it's safe to enable, but there's some work left
+   related to copy GC before we can enable free space accounting based on
+   compressed size: right now, enabling compression won't actually let you store
+   any more data in your filesystem than if the data was uncompressed.
+
+ - Tiering
+
+   Works (there are users using it), but recent testing and development has not
+   focused enough on multiple devices to call it supported. In particular, the
+   device add/remove functionality is known to be currently buggy.
+
+ - Multiple devices, replication
+
+   Roughly 80% or 90% implemented, but it's been on the back burner for quite
+   a while in favor of making the core functionality production quality -
+   replication is not currently suitable for outside testing.
+
+ - Encryption
+
+   Implementation is finished, and passes all the tests. The blocker on rolling
+   it out is finishing the design doc and getting outside review (as any
+   changes based on outside review will almost definitely require on disk
+   format changes), as well as finishing up some unrelated on disk format
+   changes (particularly for replication) that I'm batching up with the on disk
+   format changes for encryption.
+
+ - Snapshots
+
+   Snapshot implementation has been started, but snapshots are by far the most
+   complex of the remaining features to implement - it's going to be quite
+   a while before I can dedicate enough time to finishing them, but I'm very much
+   looking forward to showing off what it'll be able to do.
+
+## FAQ
+
+Please ask questions and ask for them to be added here!
+
+## Todo list
+
+### Current priorities:
+
+ * Encryption is pretty much done - current focus is on finishing the design
+   doc, so we can get some review from experienced cryptographers.
+
+ * Compression is almost done: it's quite thoroughly tested, the only remaining
+   issue is a problem with copygc fragmenting existing compressed extents that
+   only breaks accounting.
+
+ * NFS export support is almost done: implementing i_generation correctly
+   required some new transaction machinery, but that's mostly done. What's left
+   is implementing a new kind of reservation of journal space for the new, long
+   running transactions.
+
+### Breaking changes:
+
+ * Need incompatible superblock changes - encryption key used up remaining
+   reserved space. We need:
+    * more flag bits
+    * a feature bits field
+    * bring some structure to the variable length portion, so we can add more
+      crap later - do it like inode optional fields
+
+ * More bits (once we have feature bits) for "has this feature ever been used", e.g.
+   * encryption - if we don't have encrypted data, we don't need to load cyphers
+   * compression - if gzip has never been used, we don't need gzip's crazy huge
+     compression workspace
+
+ * journal format tweaks:
+   * right now btree node roots are added to every journal entry - we really
+     only need to journal them when we change, and with the generic journal pin
+     infrastructure this'll be easy to implement. This is a slight on disk
+     format change - old kernels won't be able to read filesystems from newer
+     kernels, but it's not a breaking change.
+
+   * prio bucket pointers - we also add to every journal entry a pointer to
+     each device's starting prio bucket. This one is more important to fix,
+     because with large numbers of devices we'll be wasting more and more of
+     each journal entry on these prio pointers that mostly aren't changing. We
+     just need to break out this journal entry into one entry per component
+     device (and do like with btree node roots, and change it to only journal
+     when it changes).
+
+ * inode format:
+   We're adding optional fields for i_generation (needed for NFS export
+   support), but if we're doing breaking on disk format changes it'd make more
+   sense for i_dev to be an optional field - (i_dev is for block/char devices).
+
+### Other wishlist items:
+
+ * When we're using compression, we end up wasting a fair amount of space on
+   internal fragmentation because compressed extents get rounded up to the
+   filesystem block size when they're written - usually 4k. It'd be really nice
+   if we could pack them in more efficiently - probably 512 byte sector
+   granularity.
+
+   On the read side this is no big deal to support - we have to bounce
+   compressed extents anyways. The write side is the annoying part. The options
+   are:
+    * Buffer up writes when we don't have full blocks to write? Highly
+      problematic, not going to do this.
+    * Read modify write? Not an option for raw flash, would prefer it to not be
+      our only option
+    * Do data journalling when we don't have a full block to write? Possible
+      solution, we want data journalling anyways
+
+ * Inline extents - good for space efficiency for both small files, and
+   compression when extents happen to compress particularly well.
+
+ * Full data journalling - we're definitely going to want this for when the
+   journal is on an NVRAM device (also need to implement external journalling
+   (easy), and direct journal on NVRAM support (what's involved here?)).
+
+   Would be good to get a simple implementation done and tested so we know what
+   the on disk format is going to be.
+
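A rough sketch of the internal fragmentation described in the compression wishlist item above. It only does the rounding arithmetic: each compressed extent is padded up to the write granularity, and the padding is counted as waste. The extent sizes and the helper names (`round_up`, `wasted_bytes`) are made up for illustration and are not taken from the bcachefs code.

```python
def round_up(size: int, granularity: int) -> int:
    """Round size up to the next multiple of granularity."""
    return -(-size // granularity) * granularity


def wasted_bytes(compressed_sizes, granularity: int) -> int:
    """Total padding added when every extent is rounded up to granularity."""
    return sum(round_up(s, granularity) - s for s in compressed_sizes)


# Hypothetical compressed extent sizes in bytes (not measured data).
sizes = [1300, 4097, 700, 9000]

waste_4k = wasted_bytes(sizes, 4096)   # rounding to a 4k filesystem block
waste_512 = wasted_bytes(sizes, 512)   # rounding to a 512 byte sector

print("padding at 4k granularity: ", waste_4k)
print("padding at 512 granularity:", waste_512)
```

For these made-up sizes the padding at 512 byte granularity is roughly a tenth of the padding at 4k, which is the kind of saving the wishlist item is after; real numbers depend entirely on the distribution of compressed extent sizes.
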
diff --git a/Encryption.mdwn b/Encryption.mdwn
new file mode 100644
index 0000000..17f0a62
--- /dev/null
+++ b/Encryption.mdwn
@@ -0,0 +1,66 @@
+# bcache/bcachefs encryption design:
+
+## Intro:
+
+Bcachefs provides whole-filesystem encryption, using ChaCha20/Poly1305.
+Encryption may be enabled when creating a filesystem, or encryption may be
+enabled on an existing filesystem (TODO: implement interface for enabling
+encryption on an existing filesystem - kernel code exists).
+
+Example:
+
+    $ bcache format --encrypted /dev/sda
+    (Enter passphrase when prompted)
+
+    $ bcache unlock /dev/sda
+    (Enter passphrase again)
+
+Then mount as normal:
+
+    $ mount /dev/sda /mnt
+
+## Goals:
+
+Bcachefs encryption is meant to be a clean slate design that prioritizes
+security and robustness, and is meant to defend against a wider variety of
+adversarial models than is typical in existing filesystem level or block level
+encryption.
+
+## Filesystem vs. directory encryption
+
+We do not currently offer per directory encryption; instead, we take an "encrypt
+everything" approach.
+
+Rationale:
+
+With per directory encryption, it would be nigh impossible to prevent
+potentially sensitive metadata from leaking. For example, file sizes - file
+sizes are required for fsck, so they would have to be stored unencrypted - or
+failing that some complicated way of deferring fsck for that part of the
+filesystem until the key has been provided.
+
+With per directory encryption there would be additional complications around
+filenames, xattrs, extents (and inline extents), etc. - not necessarily
+insurmountable, but they would definitely lead to a more complicated, more
+fragile design.
+
+With whole filesystem encryption, it’s much easier to say what is and isn’t
+encrypted - essentially everything is encrypted.
+
+### Algorithms
+
+By virtue of working within a copy on write filesystem with provisions for ZFS
+style checksums (that is, checksums with the pointers, not the data), we’re
+able to use a modern AEAD style construction. We use ChaCha20 and Poly1305.
+
+Note that ChaCha20 is a stream cypher. This means two things: it’s critical that
+we use a cryptographic MAC (which would be highly desirable anyways), and it’s
+critical that we avoid nonce reuse. Getting nonces right is where most of the
+trickiness is involved in bcachefs’s encryption.
+
+### Master key
+
+The passphrase is run through scrypt in userspace; the resulting key decrypts
+the real encryption key, which is stored in the superblock.
+
+### Metadata
+
+### Data
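
To illustrate why the Algorithms section in Encryption.mdwn above calls nonce reuse critical for a stream cypher, here is a minimal sketch. It does not use ChaCha20: the `keystream` function is a toy stand-in built from SHA-256 in counter mode, and the key, nonce and messages are made up. It demonstrates only the general property that two messages encrypted with the same key and nonce leak the XOR of their plaintexts; it says nothing about how bcachefs actually constructs its nonces, and it leaves out the MAC entirely.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 over key, nonce, counter); NOT ChaCha20."""
    out = b""
    counter = 0
    while len(out) < length:
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "little"))
        out += block.digest()
        counter += 1
    return out[:length]


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Stream cypher encryption: plaintext XOR keystream (also decrypts)."""
    return xor(plaintext, keystream(key, nonce, len(plaintext)))


# Hypothetical key, nonce and messages, for illustration only.
key = b"k" * 32
nonce = b"n" * 12
p1 = b"attack at dawn!!"
p2 = b"retreat at noon!"

c1 = encrypt(key, nonce, p1)
c2 = encrypt(key, nonce, p2)      # same key AND same nonce: the mistake

# The keystream cancels out, so an observer learns p1 XOR p2 ...
leak = xor(c1, c2)
# ... and anyone who knows (or guesses) p1 recovers p2 without the key.
recovered = xor(leak, p1)
print(recovered)

# With a fresh nonce the keystreams differ and nothing cancels.
c3 = encrypt(key, b"m" * 12, p2)
```

In a copy on write filesystem the same logical location gets rewritten many times, so whatever value feeds the nonce has to change on every write; that is the tricky part the section refers to.
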
diff --git a/index.mdwn b/index.mdwn
index 8fe8b94..e7e63fa 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -1,41 +1,23 @@
 [[!toc 3]] 
 
-# Bcachefs is up!
-
-[[https://evilpiepirate.org/git/linux-bcache.git/log/?h=bcache-dev]]
-
-It's still alpha quality, but the code is up and you can play with it now. Build
-a kernel from the bcache-dev branch, and a new bcache-tools from the dev branch
-in that repository.
-
-    git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
-    git clone -b dev https://evilpiepirate.org/git/bcache-tools.git
-
-Then, just run
-
-    bcache format -C /dev/sda1
-    mount /dev/sda1 /mnt
-
-It's fast - many years of performance improvements since what's in the upstream
-kernel, and a long list of features - among them, multiple devices, replication,
-caching and full data checksumming/compression. Snapshotting is coming, too.
-
-Pretty much all the normal posix filesystem stuff is supported (things like
-xattrs, acls, etc. - no quotas yet, though).
-
-Stability: there are no known outstanding bugs if you stick to the single device
-filesystems (no tiering/replication). It's been passing xfstests for months, and
-it is seeing real world usage (I've been using it on my development laptop for
-some months now). It is highly unlikely to eat your data, but that is always a
-possibility with new filesystems - please keep backups.
-
-The on disk format is not yet set in stone, but any future breaking changes will
-come with plenty of warning. We'll need at least one more breaking change for
-encryption and possibly snapshots, but I'm hoping to delay rolling out breaking
-changes until I can do them at once.
-
-For other kernel programmers who might be interested in getting involved, I've
-started a guide to the bcache internals: [[BcacheGuide]]
+# Bcachefs
+
+Bcache is done and stable - but work hasn't stopped: [[Bcachefs]] is the hot new
+thing: a next generation, robust, high performance copy on write filesystem. You
+could think of it as bcache version two, but it might be more accurate to call
+bcache the prototype for what's happening in bcachefs - incrementally developing
+a filesystem was part of the bcache plan since nearly the beginning.
+
+It's proving to be quite stable, and it's gotten to the point where it's
+suitable for careful deployment and a wider user base. Please see the
+[[bcachefs]] page for the current status and instructions on getting started.
+
+Developing a filesystem is also not cheap or quick or easy; we need funding!
+Please chip in on [[Patreon|https://www.patreon.com/bcachefs]] - the Patreon
+page also has more information on the motivation for bcachefs and the state of
+Linux filesystems. If you've been a happy bcache user, your contribution will be
+particularly appreciated - I didn't ask for contributions when I was working on
+bcache, but I am now.
 
 # What is bcache?