From 0c82c32ab1a66329d56337c183db3e8397e06025 Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Fri, 9 Apr 2021 00:42:41 -0400
Subject: roadmap

---
 Roadmap.mdwn | 245 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 sidebar.mdwn |   1 +
 2 files changed, 246 insertions(+)
 create mode 100644 Roadmap.mdwn

diff --git a/Roadmap.mdwn b/Roadmap.mdwn
new file mode 100644
index 0000000..b61cc9a
--- /dev/null
+++ b/Roadmap.mdwn
@@ -0,0 +1,245 @@

# bcachefs status, roadmap:

## Performance:

### Core btree:

The core btree code is heavily optimized, and I don't expect much in the way of
significant gains without novel data structures. Btree performance is
particularly important in bcachefs because we do not shard btrees on a per-inode
basis like most filesystems. Instead, we have one global extents btree, one
global dirents btree, et cetera. Historically, this came about because bcachefs
started as a block cache, where we needed a single index fast enough to index an
entire SSD's worth of blocks, and for persistent data structures btrees have a
number of advantages over hash tables (they can do extents, and when access
patterns have locality they preserve that locality). Using singleton btrees is
a major simplification when writing a filesystem, but it does mean that
everything is leaning rather heavily on the performance and scalability of the
btree implementation.

Bcachefs btrees are a hybrid compacting data structure - they share some
aspects with COW btrees, but btree nodes are internally log structured. This
means that btree nodes are big (default 256k) and have a very high fanout. We
keep multiple sorted sets of keys within a btree node (up to 3),
compacting/sorting when the last set - the one that updates go to - becomes too
large.

Lookups within a btree node use eytzinger layout search trees with summary keys
that fit in 4 bytes. Eytzinger layout means the descendants of any given node
are adjacent in memory, so they can be effectively prefetched, and 4 byte nodes
mean 16 of them fit in a cacheline, which means we can prefetch 4 levels ahead -
this lets us pipeline effectively enough to get good lookup performance even
when our memory accesses are not cached.
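
As an illustration of the technique, here's a minimal sketch of a branchless
lower-bound search over an eytzinger-ordered array - the names and details are
illustrative, not the kernel's actual implementation:

    #include <stdint.h>

    /*
     * Eytzinger (BFS) layout, 1-indexed: node i's children live at 2i and
     * 2i+1, so each subtree is contiguous in memory and easy to prefetch.
     * tree[1..nr] holds the keys; tree[0] is unused.
     * Returns the eytzinger index of the first element >= search, 0 if none.
     */
    static unsigned eytzinger_lower_bound(const uint32_t *tree, unsigned nr,
                                          uint32_t search)
    {
        unsigned i = 1;

        while (i <= nr) {
            /*
             * 16 four-byte nodes per cacheline -> prefetch 4 levels
             * ahead (in real code you'd pad the allocation so this
             * address is always valid):
             */
            __builtin_prefetch(&tree[i << 4]);
            i = (i << 1) + (tree[i] < search);
        }

        /*
         * i now encodes the path taken; shifting off the trailing 1 bits
         * (the final "go right" steps) recovers the last node where we
         * went left, i.e. the lower bound:
         */
        return i >> __builtin_ffs(~i);
    }

The prefetch hint is the point of the exercise: with a breadth-first layout,
the next several levels of the search live at a predictable address, so memory
latency overlaps with the descent.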
On my Ryzen 3900X, single threaded lookups across 16M keys go at 1.3M/second,
and scale almost linearly - with 12 threads it's about 12M/second. We can do 16M
random inserts at 670k per second single threaded; with 12 threads we achieve
2.4M random inserts per second.

Btree locking is very well optimized. We have external btree iterators which
make it easy to aggressively drop and retake locks, giving the filesystem as a
whole very good latency.

Minor scalability todo item: btree node splits/merges still take intent locks
all the way up to the root, when they should be checking how far up they
actually need to go. This hasn't shown up yet because bcachefs btree nodes are
big and splits/merges are infrequent.

### IO path:

The read path looks to be in excellent shape - I see around 350k iops doing
async 4k random reads, single threaded, on a somewhat pokey Xeon server. We're
nowhere near where we should be on 4k random writes - I've been getting in the
neighborhood of 70k iops lately, single threaded, and it's unclear where the
bottleneck is - we don't seem to be CPU bound. Given the btree performance,
this shouldn't be a core architectural issue - work needs to be done to
identify where we're losing performance.

Filesystem metadata operation performance (e.g. fsmark): on most benchmarks, we
seem to be approaching or sometimes matching XFS on performance. There are
exceptions: multithreaded rm -rf of empty files is currently roughly 4x slower
than XFS - we bottleneck on journal reclaim, which is flushing inodes from the
btree key cache back to the inodes btree.

## Scalability:

There are still some places where we rely too much on walking/scanning
metadata.

### Allocator:

For allocation purposes, we divide up each device into buckets (default 512k,
with up to 8M currently supported). There is an in-memory array that represents
these buckets, and the allocator periodically scans this array to fill up
freelists - this is currently a pain point, and eliminating this scanning is a
high priority item. Secondarily, we want to get rid of the in-memory array
entirely - buckets are primarily represented in the alloc btree. Almost all of
the work to get rid of it has been done; fixing the allocator to not rescan for
empty buckets is the last major item.

### Copygc:

Because allocation is bucket based, we're subject to internal fragmentation
that we need copygc to deal with, and when copygc needs to run it has to scan
the bucket arrays to determine which buckets to evacuate, and then it has to
scan the extents and reflink btrees for the extents to be moved.

This is something we'd like to improve, but it is not a major pain point -
being a filesystem, we're better positioned to avoid mixing unrelated writes,
which tends to be the main cause of write amplification in SSDs. Improving this
will require adding a backpointer index, which will necessarily add overhead to
the write path - thus, I intend to defer it until after we've diagnosed our
current lack of performance in the write path and made it as fast as we think
it can go. We may also want backpointers to be an optional feature.

### Rebalance:

Various filesystem features may want data to be moved around or rewritten in
the background: e.g. using a device as a writeback cache, or using the
`background_compression` option - excluding copygc, these are all handled by
the rebalance thread. Rebalance currently has to scan the entire extents and
reflink btrees to find work items - and this is more of a pain point than
copygc, as this is a normal filesystem feature. We should add a secondary index
for extents that rebalance will need to move or rewrite - this will be a "pay
if you're using it" feature, so the performance impact should be less of a
concern than with copygc.

Some thought and attention should be given to what else we might want to use
rebalance for in the future.

There are also current bug reports of rebalance reading data but not doing any
work; while work is being done on this subsystem we should think about
improving visibility into what it's doing.

### Disk space accounting:

Currently, disk space accounting is primarily persisted in the journal (on
clean shutdown, it's also stored in the superblock) - and every disk space
counter is added to every journal entry. As the number of devices in a
filesystem increases this overhead grows, but where it will really become an
issue is adding per-snapshot disk space accounting. We really need these
counters to be stored in btree keys, but this will be a nontrivial change, as
we use a complicated system of percpu counters for disk space accounting, with
multiple sets of counters for each open journal entry, rolling them up just
prior to each journal write. My plan is to extend the btree key cache so that
it can support a similar mechanism.
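
In outline, that scheme looks something like the following - a simplified
sketch in C, with made-up names and counter layout, not the actual kernel code:

    #include <stdint.h>

    #define NR_JOURNAL_BUFS  2  /* journal entries open at once (illustrative) */
    #define NR_COUNTERS      4  /* e.g. btree, user data, cached, parity */

    /*
     * One of these per CPU: updates only touch the local CPU's copy, so
     * the write path never contends on a shared cacheline.
     */
    struct disk_accounting_pcpu {
        int64_t d[NR_JOURNAL_BUFS][NR_COUNTERS];
    };

    /* Transactions add deltas to the counter set belonging to the journal
     * entry they're part of: */
    static inline void acct_add(struct disk_accounting_pcpu *acct,
                                unsigned journal_idx, unsigned counter,
                                int64_t v)
    {
        acct->d[journal_idx][counter] += v;
    }

    /* Just before journal entry @idx is written, every CPU's deltas for
     * that entry are folded into the totals persisted with it: */
    static void acct_fold(struct disk_accounting_pcpu *pcpu, unsigned nr_cpus,
                          unsigned idx, int64_t *totals)
    {
        for (unsigned cpu = 0; cpu < nr_cpus; cpu++)
            for (unsigned i = 0; i < NR_COUNTERS; i++) {
                totals[i] += pcpu[cpu].d[idx][i];
                pcpu[cpu].d[idx][i] = 0;
            }
    }

The appeal of moving counters into btree keys via the key cache is that this
same percpu buffering trick could still apply, without every counter riding
along in every journal entry.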
Also, we want disk space accounting to be broken out by compression type for
compressed data, with both compressed and uncompressed totals kept - right now
there's no good way for users to find out the compression ratio they're
getting.

### Number of devices in a filesystem:

Currently, we're limited to a maximum of 64 devices in a filesystem. It would
be desirable, and not much work, to lift that limit - it's not encoded in any
on disk data structures, just in in-memory bitmasks. After that we'll support
256 devices, and to go past that we'll need another, larger `bch_extent_ptr`
type and a new superblock section for larger device index fields.

`bch_extent_ptr` uses 44 bits for the offset field (in units of 512 byte
sectors), giving us a current maximum device size of 8 petabytes. We already
have code for different types of checksum fields within an extent for different
size checksums, so adding support for a new, larger extent ptr type will be
fairly trivial.
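
As a quick check on that figure:

    2^44 sectors * 512 bytes/sector = 2^44 * 2^9 = 2^53 bytes = 8 PiB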
### Device management:

We need a concept of "failure domains", and perhaps something more, for
managing large numbers of devices with bcachefs to make sense. Currently, disks
can be assigned labels that are essentially paths, delimited by periods, e.g.

    controller1.ssd.ssd1
    controller1.ssd.ssd2
    controller1.hdd.hdd1
    controller2.hdd.hdd1
    controller2.hdd.hdd2

This lets disks be organized into a hierarchy, and referring to any part of the
hierarchy will include all the disks underneath that point: e.g. controller1
would refer to the first three disks in the example.

This is insufficient, though. If a user has a filesystem with 60 drives, all
alike, we need a way to specify that some disks are on the same controller and
that multiple replicas should not be stored on the same controller, and we also
need a way to constrain how replicated writes are laid out. Today, in a
60-device filesystem with 3x replication, each allocation is unconstrained,
which means that the failure of _any_ three devices could lead to loss of data.

A feature that addresses these issues still needs to be designed.

### Fsck:

Fsck performance is good, but for supporting huge filesystems we are going to
need online fsck.

Fsck in bcachefs is divided into two main components: there is `btree_gc`,
which walks the btree and regenerates allocation information (and also checks
btree topology), and there is fsck proper, which checks higher level filesystem
invariants (e.g. extents/dirents/xattrs must belong to an inode, and filesystem
connectivity).

`btree_gc` is so named because bcachefs originally (when it was bcache) not
only did not have persistent allocation information, it also didn't keep bucket
sector counts up to date when adding/removing extents from the btree - it
relied entirely on periodically rescanning the btree.

This capability was kept for a very long time, partly as an aid to testing and
debugging while up-to-date, then persistent, then transactionally persistent
allocation information was developed.

The capability to run `btree_gc` at runtime has been disabled for roughly the
past year - it was a bit too much to deal with while getting everything else
stabilized - but the code has been retained, and there haven't been any real
architectural changes that would preclude making it a supported feature again.
This is something we'd like to do after bcachefs is upstreamed, when more
manpower is available.

Additionally, fsck itself is part of the kernel bcachefs codebase - the
userspace fsck tool uses a shim layer to run the kernel bcachefs codebase in
userspace, and fsck uses the normal btree iterator interface that the rest of
the filesystem uses. The in-kernel fsck code can be run at mount time by
mounting with the `-o fsck` option - the userspace fsck tool essentially just
runs the same code in userspace.

This means that there's nothing preventing us from running the current fsck
implementation at runtime, while the filesystem is in use, and there's actually
not much work that needs to be done for this to work properly and reliably: we
need to add locking for the parts of fsck that check nonlocal invariants and
can't rely solely on btree iterators for locking.

Since backpointers from inodes to dirents were recently added, checking
directory connectivity is about to become drastically simpler, and shouldn't
need any external data structures or locking against the rest of the
filesystem. The only fsck passes that appear to require new locking are:

* Counting of extents/dirents to validate `i_sectors` and subdirectory counts.

* Checking of `i_nlink` for inodes with hardlinks - but since inode
  backpointers have now been added, this pass only needs to consider inodes
  that actually have hardlinks.

So that's cool.

### Snapshots:

I expect bcachefs to scale well into the millions of snapshots. There will need
to be some additional work to make this happen, but it shouldn't be much.

* We have a function, `bch2_snapshot_is_ancestor()`, that needs to be fast -
  it's called by the btree iterator code (in `BTREE_ITER_FILTER_SNAPSHOTS`
  mode) for every key it looks at. The current implementation is linear in the
  depth of the snapshot tree. We'll probably need to improve this - perhaps by
  caching, for each snapshot ID, a bit vector of ancestor IDs.

* Too many overwrites in different snapshots at the same part of the keyspace
  will slow down lookups - we haven't seen this come up yet, but it certainly
  will at some point.

  When this happens, it won't be for most files in the filesystem, in anything
  like typical usage - most files are created, written to once, and then
  overwritten with a rename; it will be for the subset of files that are long
  lived and continuously overwritten - i.e. databases.

  To deal with this we'll need to implement per-inode counters so that we can
  detect the situation - and then, when we detect too many overwrites, we can
  allocate a new inode number internally and move all keys that aren't visible
  in the main subvolume to that inode number.

diff --git a/sidebar.mdwn b/sidebar.mdwn
index 06cf820..ffac3c1 100644
--- a/sidebar.mdwn
+++ b/sidebar.mdwn
@@ -18,5 +18,6 @@
   * [[Snapshots]]
   * [[Allocator]]
   * [[Fsck]]
+  * [[Roadmap]]
 * Other technical docs
   * [[StablePages]]
--
cgit v1.2.3