From a03a6730dc073dd137af6e2bf4b9b9e0ec8418bb Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Thu, 8 Mar 2018 16:29:34 -0500
Subject: bcachefs redirect

---
 Bcachefs.mdwn | 288 +--------------------------------------------------------
 1 file changed, 1 insertion(+), 287 deletions(-)

diff --git a/Bcachefs.mdwn b/Bcachefs.mdwn
index ab63971..8e10cf2 100644
--- a/Bcachefs.mdwn
+++ b/Bcachefs.mdwn
@@ -1,287 +1 @@
-
-# Bcachefs
-
-It's a next generation copy on write filesystem for Linux with a long list of
-features - tiering/caching, data checksumming, compression, encryption, multiple
-devices, et cetera.
-
-It's not vaporware - it's a real filesystem you can run on your laptop or server
-today.
-
-We prioritize robustness and reliability over features and hype: we make every
-effort to ensure you won't lose data. It's building on top of a codebase with a
-pedigree - bcache already has a reasonably good track record for reliability
-(particularly considering how young upstream bcache is, in terms of engineer
-man/years). Starting from there, bcachefs development has prioritized
-incremental development, and keeping things stable, and aggressively fixing
-design issues as they are found; the bcachefs codebase is considerably more
-robust and mature than upstream bcache.
-
-Developing a filesystem is also not cheap or quick or easy; we need funding!
-Please chip in on [[Patreon|https://www.patreon.com/bcachefs]] - the Patreon
-page also has more information on the motivation for bcachefs and the state of
-Linux filesystems, as well as some bcachefs status updates and information on
-development.
-
-If you don't want to use Patreon, I'm also happy to take donations via paypal:
-kent.overstreet@gmail.com.
-
-Join us in the bcache IRC channel, we have a small group of bcachefs users and
-testers there: #bcache on OFTC (irc.oftc.net).
-
-## Why bcachefs?
-
-For existing bcache users, we've got a particularly compelling argument: a block
-layer cache implements the core functionality of a filesystem (allocation,
-reclamation, and mapping from one address space to another). By running a
-filesystem on top of a block cache every IO must traverse two different mapping
-layers - each taking up memory and space for their index, each adding to every
-IO's latency (and tail latency!), and adding more complexity to your IO path -
-in particular, making it that much harder to debug performance issues.
-
-By using a filesystem that was designed for caching from the start (among many
-other things) we're able to collapse the two mapping layers and eliminate a lot
-of redundant complexity - many things also become easier when they're done
-within the context of an (appropriately designed) filesystem, versus the block
-layer - cache coherency, for example, is a tricky problem in bcache but trivial
-in bcachefs. This has real world impact - some of the most pernicious bugs and
-performance issues bcache users have hit have been because of the writeback
-lock, which is needed for cache coherency - that code is all gone in bcachefs.
-
-What if you don't care about caching, what if you just want a filesystem that
-works? Bcachefs is not just targeted at caching - it's meant to be a superior
-replacement for ext4, xfs and btrfs. We intend to deliver a copy on write
-filesystem, with all the features you'd expect from a modern copy on write
-filesystem - but with the performance and robustness to be a very viable
-replacement for ext4 and xfs. We will deliver on that goal.
-
-## Documentation
-
-End user documentation is currently fairly minimal; this would be a very helpful
-area for anyone who wishes to contribute - I would like the bcache man page in
-the bcache-tools repository to be rewritten and expanded.
-
-There is some fairly substantial developer documentation: see [[BcacheGuide]].
-
-## Getting started
-
-Bcachefs is not upstream, and won't be for awhile. If you want to try out
-bcachefs now, you'll need to be comfortable with building your own kernel. Also,
-as bcachefs has had many incompatible on disk format changes, you cannot
-currently build a kernel with support for both bcachefs and the existing,
-upstream bcache on disk format (this will change prior to bcachefs going
-upstream).
-
-First, check out the bcache kernel and tools repositories:
-
-    git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
-    git clone -b dev https://evilpiepirate.org/git/bcache-tools.git
-
-Build and install as usual. Then, to format and mount a single device with the
-default options, run:
-
-    bcache format /dev/sda1
-    mount /dev/sda1 /mnt
-
-See `bcache format --help` for more options.
-
-## Status
-
-Bcachefs can currently be considered beta quality. It has a small pool of
-outside users and has been quite stable and reliable so far; there's no reason
-to expect issues as long as you stick to the currently supported feature set.
-Being a new filesystem, backups are still recommended though.
-
-Performance is generally quite good - generally faster than btrfs, and not far
-behind xfs/ext4. There are still performance bugs to be found and optimizations
-we'd like to do, but performance isn't currently the primary focus - the main
-focus is on making sure it's production quality and finishing the core feature
-set.
-
-Normal posix filesystem functionality is all finished - if you're using bcachefs
-as a replacement for ext4 on a desktop, you shouldn't find anything missing. For
-servers, NFS export support is still missing (but coming soon) and we don't yet
-support quotas (probably further off).
-
-Pretty much all the normal posix filesystem stuff is supported (things like
-xattrs, acls, etc. - no quotas yet, though).
-
-The on disk format is not yet set in stone - there will be future breaking
-changes to the on disk format, but we will make every effort make transitioning
-easy for users (e.g. when there are breaking changes there will be kernel
-branches maintained in parallel that support old and new formats to give users
-time to transition, users won't be left stranded with data they can't access).
-We'll need at least one more breaking change for encryption and possibly
-snapshots, but I'm trying to batch up all the breaking changes as much as
-possible.
-
-### Feature status
-
- - Full data checksumming
-
-   Fully supported and enabled by default. We do need to implement scrubbing,
-   once we've got replication and can take advantage of it.
-
- - Compression
-
-   Not _quite_ finished - it's safe to enable, but there's some work left
-   related to copy GC before we can enable free space accounting based on
-   compressed size: right now, enabling compression won't actually let you store
-   any more data in your filesystem than if the data was uncompressed
-
- - Tiering
-
-   Works (there are users using it), but recent testing and development has not
-   focused enough on multiple devices to call it supported. In particular, the
-   device add/remove functionality is known to be currently buggy.
-
- - Multiple devices, replication
-
-   Roughly 80% or 90% implemented, but it's been on the back burner for quite
-   awhile in favor of making the core functionality production quality -
-   replication is not currently suitable for outside testing.
-
- - [[Encryption]]
-
-   Implementation is finished, and passes all the tests. The blocker on rolling
-   it out is finishing the design doc and getting outside review (as feedback
-   any changes based on outside review will almost definitely require on disk
-   format changes), as well as finishing up some unrelated on disk format
-   changes (particularly for replication) that I'm batching up with the on disk
-   format changes for encryption.
-
- - Snapshots
-
-   Snapshot implementation has been started, but snapshots are by far the most
-   complex of the remaining features to implement - it's going to be quite
-   awhile before I can dedicate enough time to finishing them, but I'm very much
-   looking forward to showing off what it'll be able to do.
-
-### Known issues/caveats
-
- - Mount time
-
-   We currently walk all metadata at mount time (multiple times, in fact) - on
-   flash this shouldn't even be noticeable unless your filesystem is very large,
-   but on rotating disk expect mount times to be slow.
-
-   This will be addressed in the future - mount times will likely be the next
-   big push after the next big batch of on disk format changes.
-
- - Fsck
-
-   There is a fsck - it's just in kernel, done at mount time, not in userspace.
-   We shouldn't be missing any checks - we should be able to detect any
-   filesystem inconsistencies. Repair is only implemented for a few
-   inconsistencies, though.
-
-   By default, fsck is run on every mount - mount with -o nofsck if you don't
-   want to run it. Errors are not fixed by default, because I want to make sure
-   I get bug reports if inconsistencies are found - if you do run into fixable
-   errors, mount with -o fix_errors (and send a bug report!).
-
-## FAQ
-
-Please ask questions and ask for them to be added here!
-
-## Todo list
-
-### Current priorities:
-
- * Encryption is pretty much done - just finished the design doc.
-
-   Cryptographers, security experts, etc. please review: [[Encryption]].
-
- * Compression is almost done: it's quite thoroughly tested, the only remaining
-   issue is a problem with copygc fragmenting existing compressed extents that
-   only breaks accounting.
-
- * NFS export support is almost done: implementing i_generation correctly
-   required some new transaction machinery, but that's mostly done. What's left
-   is implementing a new kind of reservation of journal space for the new, long
-   running transactions.
-
-### Breaking changes:
-
- * Need incompatible superblock changes - encryption key used up remaining
-   reserved space. we need:
-   * more flag bits
-   * a feature bits field
-   * bring some structure to the variable length portion, so we can add more
-     crap later - do it like inode optional fields
-   * on clean shutdown, write current journal sequence number to superblock -
-     help guard against corruption or an encrypted filesystem being tampered
-     with
-
- * More bits (once we have feature bits) for "has this feature ever been used", e.g.
-   * encryption - if we don't have encrypted data, we don't need to load cyphers
-   * compression - if gzip has never been used, we don't need gzip's crazy huge
-     compression workspace
-
- * journal format tweaks:
-   * right now btree node roots are added to every journal entry - we really
-     only need to journal them when we change, and with the generic journal pin
-     infrastructure this'll be easy to implement. this is a slight on disk
-     format change - old kernels won't be able to read filesystems from newer
-     kernels, but it's not a breaking change
-
-   * prio bucket pointers - We also add to every journal entry a pointer to
-     each device's starting prio bucket. this one is more important to fix,
-     because with large numbers of devices we'll be wasting more and more of
-     each journal entry on these prio pointers that mostly aren't changing. We
-     just need to break out this journal entry into one entry per component
-     device (and do like with btree node roots, and change it to only journal
-     when it changes).
-
-     when tweaking prio bucket pointers, should add a random sequence field so
-     we can distinguish reading valid prio_sets that aren't the one we actually
-     wanted
-
- * fallocate + compression - calling fallocate is supposed to ensure that a
-   future write call won't return -ENOSPC, regardless of what the file range
-   already contains. We have persistent reservations to support fallocate, but
-   if the file already contains compressed data we currently can't put a
-   persistent reservation where we've already got an extent. We need another
-   type of persistent reservation, that we can add to a normal data extent.
-
- * checksumming stuff:
-   * configurable action for nonfatal IO errors & data checksum errors
-     * RO, continue or threshold
-     * absolute threshold, or moving average threshold (error rate)
-   * when we get a read error/data checksum error, flip a bit in the key - "has
-     seen read error" - so we don't blow through the global limit on one bad
-     extent
-   * global and per device options: per device options take precedence if set,
-     but may be unset
-   * how should configuration handle multiple devices? we probably want to just
-     continue by default in single device mode, but in multi device mode kick
-     it RO
-
-### Other wishlist items:
-
- * When we're using compression, we end up wasting a fair amount of space on
-   internal fragmentation because compressed extents get rounded up to the
-   filesystem block size when they're written - usually 4k. It'd be really nice
-   if we could pack them in more efficiently - probably 512 byte sector
-   granularity.
-
-   On the read side this is no big deal to support - we have to bounce
-   compressed extents anyways. The write side is the annoying part. The options
-   are:
-   * Buffer up writes when we don't have full blocks to write? Highly
-     problematic, not going to do this.
-   * Read modify write? Not an option for raw flash, would prefer it to not be
-     our only option
-   * Do data journalling when we don't have a full block to write? Possible
-     solution, we want data journalling anyways
-
- * Inline extents - good for space efficiency for both small files, and
-   compression when extents happen to compress particularly well.
-
- * Full data journalling - we're definitely going to want this for when the
-   journal is on an NVRAM device (also need to implement external journalling
-   (easy), and direct journal on NVRAM support (what's involved here?)).
-
-   Would be good to get a simple implementation done and tested so we know what
-   the on disk format is going to be.
-
+[[!meta redir="https://bcachefs.org"]]
-- 
cgit v1.2.3