author | Kent Overstreet <kent.overstreet@gmail.com> | 2016-08-28 19:04:11 -0800 |
---|---|---|
committer | Kent Overstreet <kent.overstreet@gmail.com> | 2016-08-28 19:04:11 -0800 |
commit | 6dbe3d1f437dc9034f116d31b3fd476e83a273cc | |
tree | f308e5f3863061be3eac9aff0ea7d2410155b07e | |
parent | 8e0e14f5b8e7caba2522e3c8ad15bef6a282329a |
Bcachefs documentation updates
-rw-r--r-- | Bcachefs.mdwn | 234 |
-rw-r--r-- | Encryption.mdwn | 66 |
-rw-r--r-- | index.mdwn | 54 |
3 files changed, 318 insertions, 36 deletions
diff --git a/Bcachefs.mdwn b/Bcachefs.mdwn
new file mode 100644
index 0000000..530fcf2
--- /dev/null
+++ b/Bcachefs.mdwn
@@ -0,0 +1,234 @@
+
+# Bcachefs
+
+It's a next generation copy on write filesystem for Linux with a long list of
+features - tiering/caching, data checksumming, compression, encryption, multiple
+devices, et cetera.
+
+It's not vaporware - it's a real filesystem you can run on your laptop or server
+today.
+
+We prioritize robustness and reliability over features and hype: we make every
+effort to ensure you won't lose data. It's building on top of a codebase with a
+pedigree - bcache already has a reasonably good track record for reliability
+(particularly considering how young upstream bcache is, in terms of engineer
+man-years). Starting from there, bcachefs development has prioritized
+incremental development, and keeping things stable, and aggressively fixing
+design issues as they are found; the bcachefs codebase is considerably more
+robust and mature than upstream bcache.
+
+Developing a filesystem is also not cheap or quick or easy; we need funding!
+Please chip in on [[Patreon|https://www.patreon.com/bcachefs]] - the Patreon
+page also has more information on the motivation for bcachefs and the state of
+Linux filesystems.
+
+## Why bcachefs?
+
+For existing bcache users, we've got a particularly compelling argument: a block
+layer cache implements the core functionality of a filesystem (allocation,
+reclamation, and mapping from one address space to another). By running a
+filesystem on top of a block cache every IO must traverse two different mapping
+layers - each taking up memory and space for their index, each adding to every
+IO's latency (and tail latency!), and adding more complexity to your IO path -
+in particular, making it that much harder to debug performance issues.
+
+By using a filesystem that was designed for caching from the start (among many
+other things) we're able to collapse the two mapping layers and eliminate a lot
+of redundant complexity - many things also become easier when they're done
+within the context of an (appropriately designed) filesystem, versus the block
+layer - cache coherency, for example, is a tricky problem in bcache but trivial
+in bcachefs. This has real world impact - some of the most pernicious bugs and
+performance issues bcache users have hit have been because of the writeback
+lock, which is needed for cache coherency - that code is all gone in bcachefs.
+
+What if you don't care about caching, what if you just want a filesystem that
+works? Bcachefs is not just targeted at caching - it's meant to be a superior
+replacement for ext4, xfs and btrfs. We intend to deliver a copy on write
+filesystem, with all the features you'd expect from a modern copy on write
+filesystem - but with the performance and robustness to be a very viable
+replacement for ext4 and xfs. We will deliver on that goal.
+
+## Documentation
+
+End user documentation is currently fairly minimal; this would be a very helpful
+area for anyone who wishes to contribute - I would like the bcache man page in
+the bcache-tools repository to be rewritten and expanded.
+
+There is some fairly substantial developer documentation: see [[BcacheGuide]].
+
+## Getting started
+
+Bcachefs is not upstream, and won't be for a while. If you want to try out
+bcachefs now, you'll need to be comfortable with building your own kernel. Also,
+as bcachefs has had many incompatible on disk format changes, you cannot
+currently build a kernel with support for both bcachefs and the existing,
+upstream bcache on disk format (this will change prior to bcachefs going
+upstream).
+
+First, check out the bcache kernel and tools repositories:
+
+    git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
+    git clone -b dev https://evilpiepirate.org/git/bcache-tools.git
+
+Build and install as usual. Then, to format and mount a single device with the
+default options, run:
+
+    bcache format /dev/sda1
+    mount /dev/sda1 /mnt
+
+See `bcache format --help` for more options.
+
+## Status
+
+Bcachefs can currently be considered beta quality. It has a small pool of
+outside users and has been quite stable and reliable so far; there's no reason
+to expect issues as long as you stick to the currently supported feature set.
+Being a new filesystem, backups are still recommended though.
+
+Performance is generally quite good - generally faster than btrfs, and not far
+behind xfs/ext4. There are still performance bugs to be found and optimizations
+we'd like to do, but performance isn't currently the primary focus - the main
+focus is on making sure it's production quality and finishing the core feature
+set.
+
+Normal posix filesystem functionality is all finished - if you're using bcachefs
+as a replacement for ext4 on a desktop, you shouldn't find anything missing. For
+servers, NFS export support is still missing (but coming soon) and we don't yet
+support quotas (probably further off).
+
+Pretty much all the normal posix filesystem stuff is supported (things like
+xattrs, acls, etc. - no quotas yet, though).
+
+The on disk format is not yet set in stone - there will be future breaking
+changes to the on disk format, but we will make every effort to make transitioning
+easy for users (e.g. when there are breaking changes there will be kernel
+branches maintained in parallel that support old and new formats to give users
+time to transition, users won't be left stranded with data they can't access).
+We'll need at least one more breaking change for encryption and possibly
+snapshots, but I'm trying to batch up all the breaking changes as much as
+possible.
+
+### Feature status
+
+ - Full data checksumming
+
+   Fully supported and enabled by default. We do need to implement scrubbing,
+   once we've got replication and can take advantage of it.
+
+ - Compression
+
+   Not _quite_ finished - it's safe to enable, but there's some work left
+   related to copy GC before we can enable free space accounting based on
+   compressed size: right now, enabling compression won't actually let you store
+   any more data in your filesystem than if the data was uncompressed.
+
+ - Tiering
+
+   Works (there are users using it), but recent testing and development has not
+   focused enough on multiple devices to call it supported. In particular, the
+   device add/remove functionality is known to be currently buggy.
+
+ - Multiple devices, replication
+
+   Roughly 80% or 90% implemented, but it's been on the back burner for quite
+   a while in favor of making the core functionality production quality -
+   replication is not currently suitable for outside testing.
+
+ - Encryption
+
+   Implementation is finished, and passes all the tests. The blocker on rolling
+   it out is finishing the design doc and getting outside review (as
+   any changes based on outside review will almost definitely require on disk
+   format changes), as well as finishing up some unrelated on disk format
+   changes (particularly for replication) that I'm batching up with the on disk
+   format changes for encryption.
+
+ - Snapshots
+
+   Snapshot implementation has been started, but snapshots are by far the most
+   complex of the remaining features to implement - it's going to be quite
+   a while before I can dedicate enough time to finishing them, but I'm very much
+   looking forward to showing off what it'll be able to do.
+
+## FAQ
+
+Please ask questions and ask for them to be added here!
+
+## Todo list
+
+### Current priorities:
+
+ * Encryption is pretty much done - current focus is on finishing the design
+   doc, so we can get some review from experienced cryptographers.
+
+ * Compression is almost done: it's quite thoroughly tested, the only remaining
+   issue is a problem with copygc fragmenting existing compressed extents that
+   only breaks accounting.
+
+ * NFS export support is almost done: implementing i_generation correctly
+   required some new transaction machinery, but that's mostly done. What's left
+   is implementing a new kind of reservation of journal space for the new, long
+   running transactions.
+
+### Breaking changes:
+
+ * Need incompatible superblock changes - encryption key used up remaining
+   reserved space. We need:
+    * more flag bits
+    * a feature bits field
+    * bring some structure to the variable length portion, so we can add more
+      crap later - do it like inode optional fields
+
+ * More bits (once we have feature bits) for "has this feature ever been used", e.g.
+    * encryption - if we don't have encrypted data, we don't need to load cyphers
+    * compression - if gzip has never been used, we don't need gzip's crazy huge
+      compression workspace
+
+ * journal format tweaks:
+    * right now btree node roots are added to every journal entry - we really
+      only need to journal them when they change, and with the generic journal pin
+      infrastructure this'll be easy to implement. This is a slight on disk
+      format change - old kernels won't be able to read filesystems from newer
+      kernels, but it's not a breaking change
+
+    * prio bucket pointers - We also add to every journal entry a pointer to
+      each device's starting prio bucket. this one is more important to fix,
+      because with large numbers of devices we'll be wasting more and more of
+      each journal entry on these prio pointers that mostly aren't changing. We
+      just need to break out this journal entry into one entry per component
+      device (and do like with btree node roots, and change it to only journal
+      when it changes).
+
+ * inode format:
+   We're adding optional fields for i_generation (needed for NFS export
+   support), but if we're doing breaking on disk format changes it'd make more
+   sense for i_dev to be an optional field - (i_dev is for block/char devices).
+
+### Other wishlist items:
+
+ * When we're using compression, we end up wasting a fair amount of space on
+   internal fragmentation because compressed extents get rounded up to the
+   filesystem block size when they're written - usually 4k. It'd be really nice
+   if we could pack them in more efficiently - probably 512 byte sector
+   granularity.
+
+   On the read side this is no big deal to support - we have to bounce
+   compressed extents anyways. The write side is the annoying part. The options
+   are:
+    * Buffer up writes when we don't have full blocks to write? Highly
+      problematic, not going to do this.
+    * Read modify write? Not an option for raw flash, would prefer it to not be
+      our only option
+    * Do data journalling when we don't have a full block to write? Possible
+      solution, we want data journalling anyways
+
+ * Inline extents - good for space efficiency for both small files, and
+   compression when extents happen to compress particularly well.
+
+ * Full data journalling - we're definitely going to want this for when the
+   journal is on an NVRAM device (also need to implement external journalling
+   (easy), and direct journal on NVRAM support (what's involved here?)).
+
+   Would be good to get a simple implementation done and tested so we know what
+   the on disk format is going to be.
+
diff --git a/Encryption.mdwn b/Encryption.mdwn
new file mode 100644
index 0000000..17f0a62
--- /dev/null
+++ b/Encryption.mdwn
@@ -0,0 +1,66 @@
+# bcache/bcachefs encryption design:
+
+## Intro:
+
+Bcachefs provides whole-filesystem encryption, using ChaCha20/Poly1305.
+Encryption may be enabled when creating a filesystem, or encryption may be
+enabled on an existing filesystem (TODO: implement interface for enabling
+encryption on an existing filesystem - kernel code exists).
+
+Example:
+  $ bcache format --encrypted /dev/sda
+(Enter passphrase when prompted)
+
+  $ bcache unlock /dev/sda
+(Enter passphrase again)
+
+Then mount as normal:
+  $ mount /dev/sda /mnt
+
+## Goals:
+
+Bcachefs encryption is meant to be a clean slate design that prioritizes
+security and robustness, and is meant to defend against a wider variety of
+adversarial models than is typical in existing filesystem level or block level
+encryption.
+
+## Filesystem vs. directory encryption
+
+We do not currently offer per directory encryption; instead, we take an "encrypt
+everything" approach.
+
+Rationale:
+
+With per directory encryption, it would be nigh impossible to prevent
+potentially sensitive metadata from leaking. For example, file sizes - file
+sizes are required for fsck, so they would have to be stored unencrypted - or
+failing that some complicated way of deferring fsck for that part of the
+filesystem until the key has been provided.
+
+With per directory encryption there would be additional complications around
+filenames, xattrs, extents (and inline extents), etc. - not necessarily
+insurmountable, but they would definitely lead to a more complicated, more
+fragile design.
+
+With whole filesystem encryption, it’s much easier to say what is and isn’t
+encrypted - essentially everything is encrypted.
+
+### Algorithms
+
+By virtue of working within a copy on write filesystem with provisions for ZFS
+style checksums (that is, checksums with the pointers, not the data), we’re
+able to use a modern AEAD style construction. We use ChaCha20 and Poly1305.
+
+Note that ChaCha20 is a stream cypher. This means: it’s critical that we use a
+cryptographic MAC (which would be highly desirable anyways). Avoiding nonce
+reuse is critical. Getting nonces right is where most of the trickiness is
+involved in bcachefs’s encryption.
+
+### Master key
+
+Passphrase encrypted by userspace with scrypt, decrypts real encryption key
+stored in superblock.
+
+### Metadata
+
+### Data
diff --git a/index.mdwn b/index.mdwn
--- a/index.mdwn
+++ b/index.mdwn
@@ -1,41 +1,23 @@
 [[!toc 3]]
 
-# Bcachefs is up!
-
-[[https://evilpiepirate.org/git/linux-bcache.git/log/?h=bcache-dev]]
-
-It's still alpha quality, but the code is up and you can play with it now. Build
-a kernel from the bcache-dev branch, and a new bcache-tools from the dev branch
-in that repository.
-
-    git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
-    git clone -b dev https://evilpiepirate.org/git/bcache-tools.git
-
-Then, just run
-
-    bcache format -C /dev/sda1
-    mount /dev/sda1 /mnt
-
-It's fast - many years of performance improvements since what's in the upstream
-kernel, and a long list of features - among them, multiple devices, replication,
-caching and full data checksumming/compression. Snapshotting is coming, too.
-
-Pretty much all the normal posix filesystem stuff is supported (things like
-xattrs, acls, etc. - no quotas yet, though).
-
-Stability: there are no known outstanding bugs if you stick to the single device
-filesystems (no tiering/replication). It's been passing xfstests for months, and
-it is seeing real world usage (I've been using it on my development laptop for
-some months now). It is highly unlikely to eat your data, but that is always a
-possibility with new filesystems - please keep backups.
-
-The on disk format is not yet set in stone, but any future breaking changes will
-come with plenty of warning. We'll need at least one more breaking change for
-encryption and possibly snapshots, but I'm hoping to delay rolling out breaking
-changes until I can do them at once.
-
-For other kernel programmers who might be interested in getting involved, I've
-started a guide to the bcache internals: [[BcacheGuide]]
+# Bcachefs
+
+Bcache is done and stable - but work hasn't stopped, [[Bcachefs]] is the hot new
+thing: a next generation, robust, high performance copy on write filesystem. You
+could think of it as bcache version two, but it might be more accurate to call
+bcache the prototype for what's happening in bcachefs - incrementally developing
+a filesystem was part of the bcache plan since nearly the beginning.
+
+It's proving to be quite stable, and it's gotten to the point where it's
+suitable for careful deployment and a wider user base. Please see the
+[[bcachefs]] page for the current status and instructions on getting started.
+
+Developing a filesystem is also not cheap or quick or easy; we need funding!
+Please chip in on [[Patreon|https://www.patreon.com/bcachefs]] - the Patreon
+page also has more information on the motivation for bcachefs and the state of
+Linux filesystems. If you've been a happy bcache user, your contribution will be
+particularly appreciated - I didn't ask for contributions when I was working on
+bcache, but I am now.
 
 # What is bcache?
 
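
Note on the compression wishlist item in Bcachefs.mdwn: it says compressed extents are rounded up to the filesystem block size (usually 4k) and that 512 byte sector granularity would pack them more tightly. The following is a minimal Python sketch of that arithmetic only. The helper names (`round_up`, `wasted_bytes`) and the sample extent sizes are hypothetical, chosen for illustration, and are not taken from the bcachefs code.

```python
# Illustrative arithmetic only; this is not bcachefs code.
# Helper names and extent sizes are hypothetical.

def round_up(size: int, granularity: int) -> int:
    """Round size up to the next multiple of granularity (both in bytes)."""
    return -(-size // granularity) * granularity


def wasted_bytes(compressed_sizes, granularity: int) -> int:
    """Total padding added when each extent is rounded up to granularity."""
    return sum(round_up(s, granularity) - s for s in compressed_sizes)


# Made-up compressed extent sizes, in bytes.
sizes = [1300, 5000, 9000, 700]

waste_4k = wasted_bytes(sizes, 4096)   # "usually 4k" block size from the text
waste_512 = wasted_bytes(sizes, 512)   # 512 byte sector granularity from the text

print(f"padding at 4096-byte granularity: {waste_4k} bytes")
print(f"padding at 512-byte granularity:  {waste_512} bytes")
```

For these made-up sizes the padding drops from 12672 bytes to 896 bytes. Real savings depend entirely on the distribution of compressed extent sizes, which this sketch does not model.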