From 0c82c32ab1a66329d56337c183db3e8397e06025 Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Fri, 9 Apr 2021 00:42:41 -0400
Subject: roadmap

---
 Roadmap.mdwn | 245 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 sidebar.mdwn |   1 +
 2 files changed, 246 insertions(+)
 create mode 100644 Roadmap.mdwn

diff --git a/Roadmap.mdwn b/Roadmap.mdwn
new file mode 100644
index 0000000..b61cc9a
--- /dev/null
+++ b/Roadmap.mdwn
@@ -0,0 +1,245 @@

# bcachefs status, roadmap:

## Performance:

### Core btree:

The core btree code is heavily optimized, and I don't expect much in the way of
significant gains without novel data structures. Btree performance is
particularly important in bcachefs because we do not shard btrees on a per-inode
basis like most filesystems. Instead, we have one global extents btree, one
global dirents btree, et cetera. Historically, this came about because bcachefs
started as a block cache, where we needed a single index fast enough to index an
entire SSD's worth of blocks, and for persistent data structures btrees have a
number of advantages over hash tables (they can do extents, and when access
patterns have locality they preserve that locality). Using singleton btrees is
a major simplification when writing a filesystem, but it does mean that
everything is leaning rather heavily on the performance and scalability of the
btree implementation.

Bcachefs btrees are a hybrid compacting data structure - they share some
aspects with COW btrees, but btree nodes are internally log structured. This
means that btree nodes are big (default 256k) and have a very high fanout. We
keep multiple sorted sets of keys within a btree node (up to 3),
compacting/sorting when the last set - the one that updates go to - becomes too
large.

Lookups within a btree node use eytzinger layout search trees with summary keys
that fit in 4 bytes. Eytzinger layout means the descendants of any given node
are adjacent in memory, so they can be effectively prefetched, and 4 byte nodes
mean 16 of them fit in a cacheline, which means we can prefetch 4 levels ahead -
this lets us pipeline effectively enough to get good lookup performance even
when our memory accesses are not cached.
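
As an illustration of the technique, here's a minimal sketch of a branchless
lower-bound search over an eytzinger-ordered array - the names and details are
illustrative, not the kernel's actual implementation:

    #include <stdint.h>

    /*
     * Eytzinger (BFS) layout, 1-indexed: node i's children live at 2i and
     * 2i+1, so each subtree is contiguous in memory and easy to prefetch.
     * tree[1..nr] holds the keys; tree[0] is unused.
     * Returns the eytzinger index of the first element >= search, 0 if none.
     */
    static unsigned eytzinger_lower_bound(const uint32_t *tree, unsigned nr,
                                          uint32_t search)
    {
        unsigned i = 1;

        while (i <= nr) {
            /*
             * 16 four-byte nodes per cacheline -> prefetch 4 levels
             * ahead (in real code you'd pad the allocation so this
             * address is always valid):
             */
            __builtin_prefetch(&tree[i << 4]);
            i = (i << 1) + (tree[i] < search);
        }

        /*
         * i now encodes the path taken; shifting off the trailing 1 bits
         * (the final "go right" steps) recovers the last node where we
         * went left, i.e. the lower bound:
         */
        return i >> __builtin_ffs(~i);
    }

The prefetch hint is the point of the exercise: with a breadth-first layout,
the next several levels of the search live at a predictable address, so memory
latency overlaps with the descent.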
On my Ryzen 3900X, single threaded lookups across 16M keys go at 1.3M/second,
and scale almost linearly - with 12 threads it's about 12M/second. We can do 16M
random inserts at 670k per second single threaded; with 12 threads we achieve
2.4M random inserts per second.

Btree locking is very well optimized. We have external btree iterators which
make it easy to aggressively drop and retake locks, giving the filesystem as a
whole very good latency.

Minor scalability todo item: btree node splits/merges still take intent locks
all the way up to the root, when they should be checking how far up they
actually need to go. This hasn't shown up yet because bcachefs btree nodes are
big and splits/merges are infrequent.

### IO path:

The read path looks to be in excellent shape - I see around 350k iops doing
async 4k random reads, single threaded, on a somewhat pokey Xeon server. We're
nowhere near where we should be on 4k random writes - I've been getting in the
neighborhood of 70k iops lately, single threaded, and it's unclear where the
bottleneck is - we don't seem to be CPU bound. Given the btree performance,
this shouldn't be a core architectural issue - work needs to be done to
identify where we're losing performance.

Filesystem metadata operation performance (e.g. fsmark): on most benchmarks, we
seem to be approaching or sometimes matching XFS on performance. There are
exceptions: multithreaded rm -rf of empty files is currently roughly 4x slower
than XFS - we bottleneck on journal reclaim, which is flushing inodes from the
btree key cache back to the inodes btree.

## Scalability:

There are still some places where we rely too much on walking/scanning
metadata.

### Allocator:

For allocation purposes, we divide up each device into buckets (default 512k,
with up to 8M currently supported). There is an in-memory array that represents
these buckets, and the allocator periodically scans this array to fill up
freelists - this is currently a pain point, and eliminating this scanning is a
high priority item. Secondarily, we want to get rid of the in-memory array
entirely - buckets are primarily represented in the alloc btree. Almost all of
the work to get rid of it has been done; fixing the allocator to not rescan for
empty buckets is the last major item.

### Copygc:

Because allocation is bucket based, we're subject to internal fragmentation
that we need copygc to deal with, and when copygc needs to run it has to scan
the bucket arrays to determine which buckets to evacuate, and then it has to
scan the extents and reflink btrees for the extents to be moved.

This is something we'd like to improve, but it is not a major pain point -
being a filesystem, we're better positioned to avoid mixing unrelated writes,
which tends to be the main cause of write amplification in SSDs. Improving this
will require adding a backpointer index, which will necessarily add overhead to
the write path - thus, I intend to defer it until after we've diagnosed our
current lack of performance in the write path and made it as fast as we think
it can go. We may also want backpointers to be an optional feature.

### Rebalance:

Various filesystem features may want data to be moved around or rewritten in
the background: e.g. using a device as a writeback cache, or using the
`background_compression` option - excluding copygc, these are all handled by
the rebalance thread. Rebalance currently has to scan the entire extents and
reflink btrees to find work items - and this is more of a pain point than
copygc, as this is a normal filesystem feature. We should add a secondary index
for extents that rebalance will need to move or rewrite - this will be a "pay
if you're using it" feature, so the performance impact should be less of a
concern than with copygc.

Some thought and attention should be given to what else we might want to use
rebalance for in the future.

There are also current bug reports of rebalance reading data but not doing any
work; while work is being done on this subsystem we should think about
improving visibility into what it's doing.

### Disk space accounting:

Currently, disk space accounting is primarily persisted in the journal (on
clean shutdown, it's also stored in the superblock) - and every disk space
counter is added to every journal entry. As the number of devices in a
filesystem increases this overhead grows, but where it will really become an
issue is adding per-snapshot disk space accounting. We really need these
counters to be stored in btree keys, but this will be a nontrivial change, as
we use a complicated system of percpu counters for disk space accounting, with
multiple sets of counters for each open journal entry, rolling them up just
prior to each journal write. My plan is to extend the btree key cache so that
it can support a similar mechanism.
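
In outline, that scheme looks something like the following - a simplified
sketch in C, with made-up names and counter layout, not the actual kernel code:

    #include <stdint.h>

    #define NR_JOURNAL_BUFS  2  /* journal entries open at once (illustrative) */
    #define NR_COUNTERS      4  /* e.g. btree, user data, cached, parity */

    /*
     * One of these per CPU: updates only touch the local CPU's copy, so
     * the write path never contends on a shared cacheline.
     */
    struct disk_accounting_pcpu {
        int64_t d[NR_JOURNAL_BUFS][NR_COUNTERS];
    };

    /* Transactions add deltas to the counter set belonging to the journal
     * entry they're part of: */
    static inline void acct_add(struct disk_accounting_pcpu *acct,
                                unsigned journal_idx, unsigned counter,
                                int64_t v)
    {
        acct->d[journal_idx][counter] += v;
    }

    /* Just before journal entry @idx is written, every CPU's deltas for
     * that entry are folded into the totals persisted with it: */
    static void acct_fold(struct disk_accounting_pcpu *pcpu, unsigned nr_cpus,
                          unsigned idx, int64_t *totals)
    {
        for (unsigned cpu = 0; cpu < nr_cpus; cpu++)
            for (unsigned i = 0; i < NR_COUNTERS; i++) {
                totals[i] += pcpu[cpu].d[idx][i];
                pcpu[cpu].d[idx][i] = 0;
            }
    }

The appeal of moving counters into btree keys via the key cache is that this
same percpu buffering trick could still apply, without every counter riding
along in every journal entry.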
Also, we want disk space accounting to be broken out by compression type for
compressed data, with both compressed and uncompressed totals kept - right now
there's no good way for users to find out the compression ratio they're
getting.

### Number of devices in a filesystem:

Currently, we're limited to a maximum of 64 devices in a filesystem. It would
be desirable, and not much work, to lift that limit - it's not encoded in any
on disk data structures, just in in-memory bitmasks. After that we'll support
256 devices, and to go past that we'll need another, larger `bch_extent_ptr`
type and a new superblock section for larger device index fields.

`bch_extent_ptr` uses 44 bits for the offset field (in units of 512 byte
sectors), giving us a current maximum device size of 8 petabytes. We already
have code for different types of checksum fields within an extent for different
size checksums, so adding support for a new, larger extent ptr type will be
fairly trivial.
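
As a quick check on that figure:

    2^44 sectors * 512 bytes/sector = 2^44 * 2^9 = 2^53 bytes = 8 PiB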
### Device management:

We need a concept of "failure domains", and perhaps something more, for
managing large numbers of devices with bcachefs to make sense. Currently, disks
can be assigned labels that are essentially paths, delimited by periods, e.g.

    controller1.ssd.ssd1
    controller1.ssd.ssd2
    controller1.hdd.hdd1
    controller2.hdd.hdd1
    controller2.hdd.hdd2

This lets disks be organized into a hierarchy, and referring to any part of the
hierarchy will include all the disks underneath that point: e.g. controller1
would refer to the first three disks in the example.

This is insufficient, though. If a user has a filesystem with 60 drives, all
alike, we need a way to specify that some disks are on the same controller and
that multiple replicas should not be stored on the same controller, and we also
need a way to constrain how replicated writes are laid out. Today, in a
60-device filesystem with 3x replication, each allocation is unconstrained,
which means that the failure of _any_ three devices could lead to loss of data.

A feature that addresses these issues still needs to be designed.

### Fsck:

Fsck performance is good, but for supporting huge filesystems we are going to
need online fsck.

Fsck in bcachefs is divided into two main components: there is `btree_gc`,
which walks the btree and regenerates allocation information (and also checks
btree topology), and there is fsck proper, which checks higher level filesystem
invariants (e.g. extents/dirents/xattrs must belong to an inode, and filesystem
connectivity).

`btree_gc` is so named because bcachefs originally (when it was bcache) not
only did not have persistent allocation information, it also didn't keep bucket
sector counts up to date when adding/removing extents from the btree - it
relied entirely on periodically rescanning the btree.

This capability was kept for a very long time, partly as an aid to testing and
debugging while up-to-date, then persistent, then transactionally persistent
allocation information was developed.

The capability to run `btree_gc` at runtime has been disabled for roughly the
past year - it was a bit too much to deal with while getting everything else
stabilized - but the code has been retained, and there haven't been any real
architectural changes that would preclude making it a supported feature again.
This is something we'd like to do after bcachefs is upstreamed, when more
manpower is available.

Additionally, fsck itself is part of the kernel bcachefs codebase - the
userspace fsck tool uses a shim layer to run the kernel bcachefs codebase in
userspace, and fsck uses the normal btree iterator interface that the rest of
the filesystem uses. The in-kernel fsck code can be run at mount time by
mounting with the `-o fsck` option - the userspace fsck tool essentially just
runs the same code in userspace.

This means that there's nothing preventing us from running the current fsck
implementation at runtime, while the filesystem is in use, and there's actually
not much work that needs to be done for this to work properly and reliably: we
need to add locking for the parts of fsck that check nonlocal invariants and
can't rely solely on btree iterators for locking.

Since backpointers from inodes to dirents were recently added, checking
directory connectivity is about to become drastically simpler, and shouldn't
need any external data structures or locking against the rest of the
filesystem. The only fsck passes that appear to require new locking are:

* Counting of extents/dirents to validate `i_sectors` and subdirectory counts.

* Checking of `i_nlink` for inodes with hardlinks - but since inode
  backpointers have now been added, this pass only needs to consider inodes
  that actually have hardlinks.

So that's cool.

### Snapshots:

I expect bcachefs to scale well into the millions of snapshots. There will need
to be some additional work to make this happen, but it shouldn't be much.

* We have a function, `bch2_snapshot_is_ancestor()`, that needs to be fast -
  it's called by the btree iterator code (in `BTREE_ITER_FILTER_SNAPSHOTS`
  mode) for every key it looks at. The current implementation is linear in the
  depth of the snapshot tree. We'll probably need to improve this - perhaps by
  caching, for each snapshot ID, a bit vector of ancestor IDs.

* Too many overwrites in different snapshots at the same part of the keyspace
  will slow down lookups - we haven't seen this come up yet, but it certainly
  will at some point.

  When this happens, it won't be for most files in the filesystem, in anything
  like typical usage - most files are created, written to once, and then
  overwritten with a rename; it will be for the subset of files that are long
  lived and continuously overwritten - i.e. databases.

  To deal with this we'll need to implement per-inode counters so that we can
  detect the situation - and then, when we detect too many overwrites, we can
  allocate a new inode number internally and move all keys that aren't visible
  in the main subvolume to that inode number.

diff --git a/sidebar.mdwn b/sidebar.mdwn
index 06cf820..ffac3c1 100644
--- a/sidebar.mdwn
+++ b/sidebar.mdwn
@@ -18,5 +18,6 @@
   * [[Snapshots]]
   * [[Allocator]]
   * [[Fsck]]
+  * [[Roadmap]]
 * Other technical docs
   * [[StablePages]]
--
cgit v1.2.3