bcache/bcachefs todo list:

* Implement epochs, and get rid of the read retry path

  The idea is a refcount on allocator cycles - "epochs". If something (e.g. copygc) is buffering extents where GC can't find them (for recalculating the oldest gen), it can take a ref on the current epoch; current epoch - oldest existing epoch = how much more out of date pointer gens can be for keys GC can't find - so the allocator can wait if this gets too big.

  For foreground reads, the trick is that the read takes a (percpu) refcount on the current epoch and releases it when the read completes; then when the allocator wants to start another cycle, it bumps the epoch and waits for all the references from foreground reads on the previous epoch to go to 0. (A toy sketch of the refcounting scheme appears further down.)

* Lockless btree lookups

  The idea is to use the technique from seqlocks - instead of taking a read lock on a btree node, we check a sequence number for that node, do the lookup, then check the sequence number again: if they don't match, we raced with a writer and have to retry. This will let us do lookups without writing to any shared cachelines, which should be a fairly significant performance win - right now the root node's lock is a bottleneck on multithreaded workloads simply because all lookups have to write to that cacheline, just to take a read lock on it.

  We already have the sequence number, from the SIX locks. The main thing that has to be done is auditing and modifying the lookup code in bset.c to handle reading garbage and gracefully return an error indicating the lookup raced with a writer. This hopefully won't be too difficult, since we don't really have pointers to deal with. (A rough sketch of the read-side retry pattern also appears further down.)

* Scalability to large cache sets

  Mark and sweep GC is an issue - it runs concurrently with everything _except_ the allocator invalidating buckets. So as long as GC can finish before the freelist is used up, everything is fine - if not, all writes will stall until GC finishes.

  Additionally, we _almost_ don't need mark and sweep anymore. It would be nice to rip it out entirely (which Slava was working on). This is going to expose subtle races, though - leaks of sector counts that aren't an issue today because mark and sweep eventually fixes them.

* bcachefs: add a mount option to disable journal_push_seq(), so userspace is never waiting on metadata to be synced. Instead, metadata will be persisted after journal_delay_ms (and ordering will be preserved as usual). The idea here is that a lot of the time users really don't care about losing the past 10-100 ms of work (client machines, build servers) and would prefer the performance improvement on fsync-heavy workloads.

* Discards - right now we only issue discards right before reusing a bucket, which makes sense when the device is expected to be entirely full of cached data - but when you're using it as a fs it makes more sense to issue discards as buckets get emptied out. We should figure out where to hook this in.

* bcachefs - journal unused inode hint, or otherwise improve inode allocation
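To make the epoch idea concrete, here is a toy sketch in plain C11 - not bcachefs code, and every name in it is made up. Readers pin the epoch that was current when they started; the allocator opens a new epoch and then waits for the previous one to drain. The real version would use percpu counters so foreground reads never write to a shared cacheline, and would sleep rather than spin:

    #include <stdatomic.h>

    #define NR_EPOCHS 4    /* ring of refcounts; only a couple of epochs are ever live */

    struct epochs {
        atomic_ulong current;              /* the currently open epoch */
        atomic_ulong refs[NR_EPOCHS];      /* readers pinning each epoch */
    };

    /* read side: pin the current epoch at the start of a foreground read */
    static unsigned long epoch_get(struct epochs *e)
    {
        unsigned long epoch;

        while (1) {
            epoch = atomic_load(&e->current);
            atomic_fetch_add(&e->refs[epoch % NR_EPOCHS], 1);

            /* recheck: if the allocator advanced underneath us, back off and retry */
            if (atomic_load(&e->current) == epoch)
                return epoch;

            atomic_fetch_sub(&e->refs[epoch % NR_EPOCHS], 1);
        }
    }

    /* read side: drop the ref when the read completes */
    static void epoch_put(struct epochs *e, unsigned long epoch)
    {
        atomic_fetch_sub(&e->refs[epoch % NR_EPOCHS], 1);
    }

    /* allocator side: bump the epoch, then wait for the previous one to drain */
    static void epoch_advance(struct epochs *e)
    {
        unsigned long prev = atomic_fetch_add(&e->current, 1);

        while (atomic_load(&e->refs[prev % NR_EPOCHS]))
            ;    /* the real thing would sleep/wait here, not spin */
    }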
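Here is a rough sketch of the seqlock-style read path for the "Lockless btree lookups" item, again as standalone C11 rather than actual bcachefs code: the writer bumps the node's sequence number to an odd value before modifying the node and bumps it again afterwards, and the reader only trusts a result if the sequence number was even and unchanged across the lookup. In the real code the sequence number comes from the SIX lock and the search would be the bset.c lookup code; the linear scan below is just a stand-in, and the sketch glosses over memory-model details (the data reads are formally racy) that real code has to get right:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_KEYS 64

    struct node {
        atomic_uint seq;            /* odd while a writer is modifying the node */
        size_t      nr;
        int         keys[MAX_KEYS]; /* stand-in for the real bsets */
    };

    /*
     * Writer side (not shown): bump seq before modifying the node (making it
     * odd) and again afterwards, with the appropriate barriers.
     */
    static bool lookup_lockless(struct node *n, int search, size_t *pos)
    {
        unsigned seq;
        bool found;

        do {
            /* wait out any in-progress writer (odd sequence number) */
            do {
                seq = atomic_load_explicit(&n->seq, memory_order_acquire);
            } while (seq & 1);

            /*
             * This search can observe a half-updated node, so it must
             * tolerate garbage - no crashes, no unbounded loops - and its
             * result is only trusted once seq is known to be unchanged.
             */
            found = false;
            for (size_t i = 0; i < n->nr && i < MAX_KEYS; i++) {
                if (n->keys[i] == search) {
                    *pos = i;
                    found = true;
                    break;
                }
            }

            atomic_thread_fence(memory_order_acquire);
            /* sequence number changed: we raced with a writer, retry */
        } while (atomic_load_explicit(&n->seq, memory_order_relaxed) != seq);

        return found;
    }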
* Idea for improving the situation with btree nodes + large buckets

  This is probably only needed for running on raw flash, but might be worth doing even without raw flash - it'd reduce internal fragmentation on disk from the btree when we're using large buckets.

  The current situation is that each btree node is a contiguous chunk of disk space, used as a log - when the log fills up, the btree node is full and we split or rewrite it. Each btree node is, on disk, its own individual log. When the btree node size is smaller than the bucket size we allocate multiple btree nodes per bucket - but this means that the entire bucket is not written to sequentially (the different btree nodes within the bucket will be appended to in different orders), and the entire bucket won't be reused until every btree node within the bucket has been freed.

  Idea: have a second journal for the btree. The different bsets (log entries) for a particular btree node would no longer be contiguous on disk - whenever any btree node write happens, it goes to the tail of the journal. This would not affect anything about how btree nodes work in memory - only the on-disk layout would change. We'd have to maintain a map of where the bsets currently live on disk for any given btree node. At runtime this map would presumably just be pinned in memory - it wouldn't be very big - but then we'd need provisions for regenerating it at startup (either just scan the entire btree journal, or maintain the map in the primary journal).
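A rough sketch of what that per-node map might look like - purely illustrative, none of these names are real bcachefs structures:

    #include <stdint.h>

    /* where one bset (log entry) for a btree node currently lives on disk */
    struct bset_location {
        uint64_t offset;          /* sector offset within the btree journal */
        uint32_t sectors;         /* size of the bset on disk */
        uint32_t journal_seq;     /* which btree journal entry wrote it */
    };

    /* one entry per btree node: the locations of its bsets, oldest to newest */
    struct btree_node_locations {
        uint64_t             node_id;
        unsigned             nr;
        struct bset_location bsets[16];
    };

    /*
     * The whole map stays pinned in memory (one small entry per btree node);
     * at startup it's either rebuilt by scanning the btree journal or loaded
     * from a copy maintained in the primary journal.
     */

Each btree node write appended to the tail of the btree journal would add a new bset location to that node's entry in the map.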