bcachefs redirect

author: Kent Overstreet <kent.overstreet@gmail.com> 2018-03-08 16:29:34 -0500
committer: Kent Overstreet <kent.overstreet@gmail.com> 2018-03-08 16:29:34 -0500
commit: a03a6730dc073dd137af6e2bf4b9b9e0ec8418bb (patch)
tree: 8b4c4fc045c7854bcc44735c4c4c34cf9490df49
parent: 47a7d9c9a0d594d2a32ecad0196377aa8c654312 (diff)
1 files changed, 1 insertions, 287 deletions
diff --git a/Bcachefs.mdwn b/Bcachefs.mdwn
index ab63971..8e10cf2 100644
--- a/Bcachefs.mdwn
+++ b/Bcachefs.mdwn
@@ -1,287 +1 @@
-
-# Bcachefs
-
-It's a next generation copy on write filesystem for Linux with a long list of
-features - tiering/caching, data checksumming, compression, encryption, multiple
-devices, et cetera.
-
-It's not vaporware - it's a real filesystem you can run on your laptop or server
-today.
-
-We prioritize robustness and reliability over features and hype: we make every
-effort to ensure you won't lose data. It's building on top of a codebase with a
-pedigree - bcache already has a reasonably good track record for reliability
-(particularly considering how young upstream bcache is, in terms of engineer
-man/years). Starting from there, bcachefs development has prioritized
-incremental development, and keeping things stable, and aggressively fixing
-design issues as they are found; the bcachefs codebase is considerably more
-robust and mature than upstream bcache.
-
-Developing a filesystem is also not cheap or quick or easy; we need funding!
-Please chip in on [[Patreon|https://www.patreon.com/bcachefs]] - the Patreon
-page also has more information on the motivation for bcachefs and the state of
-Linux filesystems, as well as some bcachefs status updates and information on
-development.
-
-If you don't want to use Patreon, I'm also happy to take donations via paypal:
-kent.overstreet@gmail.com.
-
-Join us in the bcache IRC channel, we have a small group of bcachefs users and
-testers there: #bcache on OFTC (irc.oftc.net).
-
-## Why bcachefs?
-
-For existing bcache users, we've got a particularly compelling argument: a block
-layer cache implements the core functionality of a filesystem (allocation,
-reclamation, and mapping from one address space to another). By running a
-filesystem on top of a block cache every IO must traverse two different mapping
-layers - each taking up memory and space for their index, each adding to every
-IO's latency (and tail latency!), and adding more complexity to your IO path -
-in particular, making it that much harder to debug performance issues.
-
-By using a filesystem that was designed for caching from the start (among many
-other things) we're able to collapse the two mapping layers and eliminate a lot
-of redundant complexity - many things also become easier when they're done
-within the context of an (appropriately designed) filesystem, versus the block
-layer - cache coherency, for example, is a tricky problem in bcache but trivial
-in bcachefs. This has real world impact - some of the most pernicious bugs and
-performance issues bcache users have hit have been because of the writeback
-lock, which is needed for cache coherency - that code is all gone in bcachefs.
-
-What if you don't care about caching, what if you just want a filesystem that
-works? Bcachefs is not just targeted at caching - it's meant to be a superior
-replacement for ext4, xfs and btrfs. We intend to deliver a copy on write
-filesystem, with all the features you'd expect from a modern copy on write
-filesystem - but with the performance and robustness to be a very viable
-replacement for ext4 and xfs. We will deliver on that goal.
-
-## Documentation
-
-End user documentation is currently fairly minimal; this would be a very helpful
-area for anyone who wishes to contribute - I would like the bcache man page in
-the bcache-tools repository to be rewritten and expanded.
-
-There is some fairly substantial developer documentation: see [[BcacheGuide]].
-
-## Getting started
-
-Bcachefs is not upstream, and won't be for awhile. If you want to try out
-bcachefs now, you'll need to be comfortable with building your own kernel. Also,
-as bcachefs has had many incompatible on disk format changes, you cannot
-currently build a kernel with support for both bcachefs and the existing,
-upstream bcache on disk format (this will change prior to bcachefs going
-upstream).
-
-First, check out the bcache kernel and tools repositories:
-
-    git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
-    git clone -b dev https://evilpiepirate.org/git/bcache-tools.git
-
-Build and install as usual. Then, to format and mount a single device with the
-default options, run:
-
-    bcache format /dev/sda1
-    mount /dev/sda1 /mnt
-
-See `bcache format --help` for more options.
-
-## Status
-
-Bcachefs can currently be considered beta quality. It has a small pool of
-outside users and has been quite stable and reliable so far; there's no reason
-to expect issues as long as you stick to the currently supported feature set.
-Being a new filesystem, backups are still recommended though.
-
-Performance is generally quite good - generally faster than btrfs, and not far
-behind xfs/ext4. There are still performance bugs to be found and optimizations
-we'd like to do, but performance isn't currently the primary focus - the main
-focus is on making sure it's production quality and finishing the core feature
-set.
-
-Normal posix filesystem functionality is all finished - if you're using bcachefs
-as a replacement for ext4 on a desktop, you shouldn't find anything missing. For
-servers, NFS export support is still missing (but coming soon) and we don't yet
-support quotas (probably further off).
-
-Pretty much all the normal posix filesystem stuff is supported (things like
-xattrs, acls, etc. - no quotas yet, though).
-
-The on disk format is not yet set in stone - there will be future breaking
-changes to the on disk format, but we will make every effort to make transitioning
-easy for users (e.g. when there are breaking changes there will be kernel
-branches maintained in parallel that support old and new formats to give users
-time to transition, users won't be left stranded with data they can't access).
-We'll need at least one more breaking change for encryption and possibly
-snapshots, but I'm trying to batch up all the breaking changes as much as
-possible.
-
-### Feature status
-
- - Full data checksumming
-
-   Fully supported and enabled by default. We do need to implement scrubbing,
-   once we've got replication and can take advantage of it.
-
- - Compression
-
-   Not _quite_ finished - it's safe to enable, but there's some work left
-   related to copy GC before we can enable free space accounting based on
-   compressed size: right now, enabling compression won't actually let you store
-   any more data in your filesystem than if the data was uncompressed
-
- - Tiering
-
-   Works (there are users using it), but recent testing and development has not
-   focused enough on multiple devices to call it supported. In particular, the
-   device add/remove functionality is known to be currently buggy.
-
- - Multiple devices, replication
-
-   Roughly 80% or 90% implemented, but it's been on the back burner for quite
-   awhile in favor of making the core functionality production quality -
-   replication is not currently suitable for outside testing.
-
- - [[Encryption]]
-
-   Implementation is finished, and passes all the tests. The blocker on rolling
-   it out is finishing the design doc and getting outside review (as any
-   changes based on that feedback will almost definitely require on disk
-   format changes), as well as finishing up some unrelated on disk format
-   changes (particularly for replication) that I'm batching up with the on disk
-   format changes for encryption.
-
- - Snapshots
-
-   Snapshot implementation has been started, but snapshots are by far the most
-   complex of the remaining features to implement - it's going to be quite
-   awhile before I can dedicate enough time to finishing them, but I'm very much
-   looking forward to showing off what it'll be able to do.
-
-### Known issues/caveats
-
- - Mount time
-
-   We currently walk all metadata at mount time (multiple times, in fact) - on
-   flash this shouldn't even be noticeable unless your filesystem is very large,
-   but on rotating disk expect mount times to be slow.
-
-   This will be addressed in the future - mount times will likely be the next
-   big push after the next big batch of on disk format changes.
-
- - Fsck
-
-   There is a fsck - it's just in kernel, done at mount time, not in userspace.
-   We shouldn't be missing any checks - we should be able to detect any
-   filesystem inconsistencies. Repair is only implemented for a few
-   inconsistencies, though.
-
-   By default, fsck is run on every mount - mount with -o nofsck if you don't
-   want to run it. Errors are not fixed by default, because I want to make sure
-   I get bug reports if inconsistencies are found - if you do run into fixable
-   errors, mount with -o fix_errors (and send a bug report!).
-
-## FAQ
-
-Please ask questions and ask for them to be added here!
-
-## Todo list
-
-### Current priorities:
-
- * Encryption is pretty much done - just finished the design doc.
-
-   Cryptographers, security experts, etc. please review: [[Encryption]].
-
- * Compression is almost done: it's quite thoroughly tested, the only remaining
-   issue is a problem with copygc fragmenting existing compressed extents that
-   only breaks accounting.
-
- * NFS export support is almost done: implementing i_generation correctly
-   required some new transaction machinery, but that's mostly done. What's left
-   is implementing a new kind of reservation of journal space for the new, long
-   running transactions.
-
-### Breaking changes:
-
- * Need incompatible superblock changes - encryption key used up remaining
-   reserved space. we need:
-    * more flag bits
-    * a feature bits field
-    * bring some structure to the variable length portion, so we can add more
-      crap later - do it like inode optional fields
-    * on clean shutdown, write current journal sequence number to superblock -
-      help guard against corruption or an encrypted filesystem being tampered
-      with
-
- * More bits (once we have feature bits) for "has this feature ever been used", e.g.
-   * encryption - if we don't have encrypted data, we don't need to load cyphers
-   * compression - if gzip has never been used, we don't need gzip's crazy huge
-     compression workspace
-
- * journal format tweaks:
-   * right now btree node roots are added to every journal entry - we really
-     only need to journal them when we change, and with the generic journal pin
-     infrastructure this'll be easy to implement. this is a slight on disk
-     format change - old kernels won't be able to read filesystems from newer
-     kernels, but it's not a breaking change
-
-   * prio bucket pointers - We also add to every journal entry a pointer to
-     each device's starting prio bucket. this one is more important to fix,
-     because with large numbers of devices we'll be wasting more and more of
-     each journal entry on these prio pointers that mostly aren't changing. We
-     just need to break out this journal entry into one entry per component
-     device (and do like with btree node roots, and change it to only journal
-     when it changes).
-
-     when tweaking prio bucket pointers, should add a random sequence field so
-     we can distinguish reading valid prio_sets that aren't the one we actually
-     wanted
-
- * fallocate + compression - calling fallocate is supposed to ensure that a
-   future write call won't return -ENOSPC, regardless of what the file range
-   already contains. We have persistent reservations to support fallocate, but
-   if the file already contains compressed data we currently can't put a
-   persistent reservation where we've already got an extent. We need another
-   type of persistent reservation, that we can add to a normal data extent.
-
- * checksumming stuff:
-    * configurable action for nonfatal IO errors & data checksum errors
-    * RO, continue or threshold
-    * absolute threshold, or moving average threshold (error rate)
-    * when we get a read error/data checksum error, flip a bit in the key - "has
-      seen read error" - so we don't blow through the global limit on one bad
-      extent
-    * global and per device options: per device options take precedence if set,
-      but may be unset
-    * how should configuration handle multiple devices? we probably want to just
-      continue by default in single device mode, but in multi device mode kick
-      it RO
-
-### Other wishlist items:
-
- * When we're using compression, we end up wasting a fair amount of space on
-   internal fragmentation because compressed extents get rounded up to the
-   filesystem block size when they're written - usually 4k. It'd be really nice
-   if we could pack them in more efficiently - probably 512 byte sector
-   granularity.
-
-   On the read side this is no big deal to support - we have to bounce
-   compressed extents anyways. The write side is the annoying part. The options
-   are:
-    * Buffer up writes when we don't have full blocks to write? Highly
-      problematic, not going to do this.
-    * Read modify write? Not an option for raw flash, would prefer it to not be
-      our only option
-    * Do data journalling when we don't have a full block to write? Possible
-      solution, we want data journalling anyways
-
- * Inline extents - good for space efficiency for both small files, and
-   compression when extents happen to compress particularly well.
-
- * Full data journalling - we're definitely going to want this for when the
-   journal is on an NVRAM device (also need to implement external journalling
-   (easy), and direct journal on NVRAM support (what's involved here?)).
-
-   Would be good to get a simple implementation done and tested so we know what
-   the on disk format is going to be.
-
+[[!meta redir="https://bcachefs.org"]]