author Kent Overstreet <kent.overstreet@gmail.com> 2018-03-08 16:29:34 -0500
committer Kent Overstreet <kent.overstreet@gmail.com> 2018-03-08 16:29:34 -0500
commit a03a6730dc073dd137af6e2bf4b9b9e0ec8418bb (patch)
tree 8b4c4fc045c7854bcc44735c4c4c34cf9490df49
parent 47a7d9c9a0d594d2a32ecad0196377aa8c654312 (diff)
bcachefs redirect
-rw-r--r-- Bcachefs.mdwn 288
1 files changed, 1 insertions, 287 deletions
diff --git a/Bcachefs.mdwn b/Bcachefs.mdwn
index ab63971..8e10cf2 100644
--- a/Bcachefs.mdwn
+++ b/Bcachefs.mdwn
@@ -1,287 +1 @@
-
-# Bcachefs
-
-It's a next generation copy on write filesystem for Linux with a long list of
-features - tiering/caching, data checksumming, compression, encryption, multiple
-devices, et cetera.
-
-It's not vaporware - it's a real filesystem you can run on your laptop or server
-today.
-
-We prioritize robustness and reliability over features and hype: we make every
-effort to ensure you won't lose data. It's building on top of a codebase with a
-pedigree - bcache already has a reasonably good track record for reliability
-(particularly considering how young upstream bcache is, in terms of engineer
-man-years). Starting from there, bcachefs development has prioritized
-incremental development, and keeping things stable, and aggressively fixing
-design issues as they are found; the bcachefs codebase is considerably more
-robust and mature than upstream bcache.
-
-Developing a filesystem is also not cheap or quick or easy; we need funding!
-Please chip in on [[Patreon|https://www.patreon.com/bcachefs]] - the Patreon
-page also has more information on the motivation for bcachefs and the state of
-Linux filesystems, as well as some bcachefs status updates and information on
-development.
-
-If you don't want to use Patreon, I'm also happy to take donations via paypal:
-kent.overstreet@gmail.com.
-
-Join us in the bcache IRC channel, we have a small group of bcachefs users and
-testers there: #bcache on OFTC (irc.oftc.net).
-
-## Why bcachefs?
-
-For existing bcache users, we've got a particularly compelling argument: a block
-layer cache implements the core functionality of a filesystem (allocation,
-reclamation, and mapping from one address space to another). By running a
-filesystem on top of a block cache every IO must traverse two different mapping
-layers - each taking up memory and space for their index, each adding to every
-IO's latency (and tail latency!), and adding more complexity to your IO path -
-in particular, making it that much harder to debug performance issues.
-
-By using a filesystem that was designed for caching from the start (among many
-other things) we're able to collapse the two mapping layers and eliminate a lot
-of redundant complexity - many things also become easier when they're done
-within the context of an (appropriately designed) filesystem, versus the block
-layer - cache coherency, for example, is a tricky problem in bcache but trivial
-in bcachefs. This has real world impact - some of the most pernicious bugs and
-performance issues bcache users have hit have been because of the writeback
-lock, which is needed for cache coherency - that code is all gone in bcachefs.
-
-What if you don't care about caching, what if you just want a filesystem that
-works? Bcachefs is not just targeted at caching - it's meant to be a superior
-replacement for ext4, xfs and btrfs. We intend to deliver a copy on write
-filesystem, with all the features you'd expect from a modern copy on write
-filesystem - but with the performance and robustness to be a very viable
-replacement for ext4 and xfs. We will deliver on that goal.
-
-## Documentation
-
-End user documentation is currently fairly minimal; this would be a very helpful
-area for anyone who wishes to contribute - I would like the bcache man page in
-the bcache-tools repository to be rewritten and expanded.
-
-There is some fairly substantial developer documentation: see [[BcacheGuide]].
-
-## Getting started
-
-Bcachefs is not upstream, and won't be for a while. If you want to try out
-bcachefs now, you'll need to be comfortable with building your own kernel. Also,
-as bcachefs has had many incompatible on disk format changes, you cannot
-currently build a kernel with support for both bcachefs and the existing,
-upstream bcache on disk format (this will change prior to bcachefs going
-upstream).
-
-First, check out the bcache kernel and tools repositories:
-
-    git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
-    git clone -b dev https://evilpiepirate.org/git/bcache-tools.git
-
-Build and install as usual. Then, to format and mount a single device with the
-default options, run:
-
-    bcache format /dev/sda1
-    mount /dev/sda1 /mnt
-
-See `bcache format --help` for more options.
-
-## Status
-
-Bcachefs can currently be considered beta quality. It has a small pool of
-outside users and has been quite stable and reliable so far; there's no reason
-to expect issues as long as you stick to the currently supported feature set.
-Being a new filesystem, backups are still recommended though.
-
-Performance is generally quite good - generally faster than btrfs, and not far
-behind xfs/ext4. There are still performance bugs to be found and optimizations
-we'd like to do, but performance isn't currently the primary focus - the main
-focus is on making sure it's production quality and finishing the core feature
-set.
-
-Normal posix filesystem functionality is all finished - if you're using bcachefs
-as a replacement for ext4 on a desktop, you shouldn't find anything missing. For
-servers, NFS export support is still missing (but coming soon) and we don't yet
-support quotas (probably further off).
-
-Pretty much all the normal posix filesystem stuff is supported (things like
-xattrs, acls, etc. - no quotas yet, though).
-
-The on disk format is not yet set in stone - there will be future breaking
-changes to the on disk format, but we will make every effort to make transitioning
-easy for users (e.g. when there are breaking changes there will be kernel
-branches maintained in parallel that support old and new formats to give users
-time to transition, users won't be left stranded with data they can't access).
-We'll need at least one more breaking change for encryption and possibly
-snapshots, but I'm trying to batch up all the breaking changes as much as
-possible.
-
-### Feature status
-
- - Full data checksumming
-
- Fully supported and enabled by default. We do need to implement scrubbing,
- once we've got replication and can take advantage of it.
-
- - Compression
-
- Not _quite_ finished - it's safe to enable, but there's some work left
- related to copy GC before we can enable free space accounting based on
- compressed size: right now, enabling compression won't actually let you store
- any more data in your filesystem than if the data was uncompressed.
-
- - Tiering
-
- Works (there are users using it), but recent testing and development has not
- focused enough on multiple devices to call it supported. In particular, the
- device add/remove functionality is known to be currently buggy.
-
- - Multiple devices, replication
-
- Roughly 80% or 90% implemented, but it's been on the back burner for quite
- a while in favor of making the core functionality production quality -
- replication is not currently suitable for outside testing.
-
- - [[Encryption]]
-
- Implementation is finished, and passes all the tests. The blocker on rolling
- it out is finishing the design doc and getting outside review (as
- any changes based on outside review will almost definitely require on disk
- format changes), as well as finishing up some unrelated on disk format
- changes (particularly for replication) that I'm batching up with the on disk
- format changes for encryption.
-
- - Snapshots
-
- Snapshot implementation has been started, but snapshots are by far the most
- complex of the remaining features to implement - it's going to be quite
- a while before I can dedicate enough time to finishing them, but I'm very much
- looking forward to showing off what it'll be able to do.
-
-### Known issues/caveats
-
- - Mount time
-
- We currently walk all metadata at mount time (multiple times, in fact) - on
- flash this shouldn't even be noticeable unless your filesystem is very large,
- but on rotating disk expect mount times to be slow.
-
- This will be addressed in the future - mount times will likely be the next
- big push after the next big batch of on disk format changes.
-
- - Fsck
-
- There is a fsck - it's just in kernel, done at mount time, not in userspace.
- We shouldn't be missing any checks - we should be able to detect any
- filesystem inconsistencies. Repair is only implemented for a few
- inconsistencies, though.
-
- By default, fsck is run on every mount - mount with -o nofsck if you don't
- want to run it. Errors are not fixed by default, because I want to make sure
- I get bug reports if inconsistencies are found - if you do run into fixable
- errors, mount with -o fix_errors (and send a bug report!).
-
-## FAQ
-
-Please ask questions and ask for them to be added here!
-
-## Todo list
-
-### Current priorities:
-
- * Encryption is pretty much done - just finished the design doc.
-
- Cryptographers, security experts, etc. please review: [[Encryption]].
-
- * Compression is almost done: it's quite thoroughly tested, the only remaining
- issue is a problem with copygc fragmenting existing compressed extents that
- only breaks accounting.
-
- * NFS export support is almost done: implementing i_generation correctly
- required some new transaction machinery, but that's mostly done. What's left
- is implementing a new kind of reservation of journal space for the new, long
- running transactions.
-
-### Breaking changes:
-
- * Need incompatible superblock changes - encryption key used up remaining
- reserved space. We need:
- * more flag bits
- * a feature bits field
- * bring some structure to the variable length portion, so we can add more
- crap later - do it like inode optional fields
- * on clean shutdown, write current journal sequence number to superblock -
- help guard against corruption or an encrypted filesystem being tampered
- with
-
- * More bits (once we have feature bits) for "has this feature ever been used", e.g.
- * encryption - if we don't have encrypted data, we don't need to load ciphers
- * compression - if gzip has never been used, we don't need gzip's crazy huge
- compression workspace
-
- * journal format tweaks:
- * right now btree node roots are added to every journal entry - we really
- only need to journal them when we change, and with the generic journal pin
- infrastructure this'll be easy to implement. This is a slight on disk
- format change - old kernels won't be able to read filesystems from newer
- kernels, but it's not a breaking change
-
- * prio bucket pointers - We also add to every journal entry a pointer to
- each device's starting prio bucket. This one is more important to fix,
- because with large numbers of devices we'll be wasting more and more of
- each journal entry on these prio pointers that mostly aren't changing. We
- just need to break out this journal entry into one entry per component
- device (and do like with btree node roots, and change it to only journal
- when it changes).
-
- when tweaking prio bucket pointers, should add a random sequence field so
- we can distinguish reading valid prio_sets that aren't the one we actually
- wanted
-
- * fallocate + compression - calling fallocate is supposed to ensure that a
- future write call won't return -ENOSPC, regardless of what the file range
- already contains. We have persistent reservations to support fallocate, but
- if the file already contains compressed data we currently can't put a
- persistent reservation where we've already got an extent. We need another
- type of persistent reservation, that we can add to a normal data extent.
-
- * checksumming stuff:
- * configurable action for nonfatal IO errors & data checksum errors
- * RO, continue or threshold
- * absolute threshold, or moving average threshold (error rate)
- * when we get a read error/data checksum error, flip a bit in the key - "has
- seen read error" - so we don't blow through the global limit on one bad
- extent
- * global and per device options: per device options take precedence if set,
- but may be unset
- * how should configuration handle multiple devices? we probably want to just
- continue by default in single device mode, but in multi device mode kick
- it RO
-
-### Other wishlist items:
-
- * When we're using compression, we end up wasting a fair amount of space on
- internal fragmentation because compressed extents get rounded up to the
- filesystem block size when they're written - usually 4k. It'd be really nice
- if we could pack them in more efficiently - probably 512 byte sector
- granularity.
-
- On the read side this is no big deal to support - we have to bounce
- compressed extents anyways. The write side is the annoying part. The options
- are:
- * Buffer up writes when we don't have full blocks to write? Highly
- problematic, not going to do this.
- * Read modify write? Not an option for raw flash, would prefer it to not be
- our only option
- * Do data journalling when we don't have a full block to write? Possible
- solution, we want data journalling anyways
-
- * Inline extents - good for space efficiency for both small files, and
- compression when extents happen to compress particularly well.
-
- * Full data journalling - we're definitely going to want this for when the
- journal is on an NVRAM device (also need to implement external journalling
- (easy), and direct journal on NVRAM support (what's involved here?)).
-
- Would be good to get a simple implementation done and tested so we know what
- the on disk format is going to be.
-
+[[!meta redir="https://bcachefs.org"]]