bcache/bcachefs todo list:
* Implement epochs, and get rid of read retry path
  The idea is a refcount on allocator cycles - "epochs": if something (e.g.
  copygc) is buffering extents where GC can't find them (for recalculating the
  oldest gen), it can take a ref on the current epoch. Current epoch minus
  oldest existing epoch then bounds how far out of date pointer gens can be
  for keys gc can't find - the allocator can wait if this difference gets too
  big.
  For foreground reads, the trick is that the read takes a (percpu) refcount
  on the current epoch, and releases it when the read completes; then when the
  allocator wants to start another cycle, it bumps the epoch, then waits for
  all the references from foreground reads to go to 0 on the previous epoch.
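  The epoch scheme might be sketched roughly as follows. This is a minimal
  single-threaded model, with plain atomics standing in for the percpu
  refcounts, and epoch_advance() reporting the old epoch's refcount instead of
  sleeping; all names here (epoch_get(), epoch_put(), epoch_advance(),
  NR_EPOCHS) are hypothetical:

  ```c
  #include <assert.h>
  #include <stdatomic.h>

  #define NR_EPOCHS 4	/* ring of refcounts; epochs in flight must fit */

  static atomic_uint current_epoch;
  static atomic_uint epoch_refs[NR_EPOCHS];

  /* Foreground read: pin the current epoch for the duration of the read. */
  static unsigned epoch_get(void)
  {
  	unsigned e = atomic_load(&current_epoch);

  	atomic_fetch_add(&epoch_refs[e % NR_EPOCHS], 1);
  	return e;
  }

  static void epoch_put(unsigned e)
  {
  	atomic_fetch_sub(&epoch_refs[e % NR_EPOCHS], 1);
  }

  /*
   * Allocator: start a new cycle by bumping the epoch; readers that pinned
   * the previous epoch still hold it. Here we just return the previous
   * epoch's refcount - the real allocator would wait for it to drain to 0.
   */
  static unsigned epoch_advance(void)
  {
  	unsigned prev = atomic_fetch_add(&current_epoch, 1);

  	return atomic_load(&epoch_refs[prev % NR_EPOCHS]);
  }
  ```

  Something like copygc would hold a ref across the whole extent move, not
  just a single read, which is what makes the epoch spread grow.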
* Lockless btree lookups
  The idea is to use the technique from seqlocks - instead of taking a read
  lock on a btree node, we'll check a sequence number for that node, do the
  lookup, then check the sequence number again: if they don't match we raced
  and we have to retry.
This will let us do lookups without writing to any shared cachelines, which
should be a fairly significant performance win - right now the root node's
lock is a bottleneck on multithreaded workloads simply because all lookups
have to write to that cacheline, just to take a read lock on it.
  We already have the sequence number, from SIX locks. The main thing that has
  to be done is auditing the lookup code in bset.c and modifying it to handle
  reading garbage, gracefully returning an error indicating the lookup raced
  with a writer. This hopefully won't be too difficult, since we don't really
  have pointers to deal with.
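  The optimistic read described above is the classic seqlock pattern; a
  minimal sketch (hypothetical names, a toy node in place of a real bset):

  ```c
  #include <assert.h>
  #include <stdatomic.h>

  struct btree_node {
  	atomic_uint	seq;	/* odd while a writer is in the node */
  	int		keys[8];
  };

  static void node_write_lock(struct btree_node *b)
  {
  	atomic_fetch_add(&b->seq, 1);	/* seq now odd */
  }

  static void node_write_unlock(struct btree_node *b)
  {
  	atomic_fetch_add(&b->seq, 1);	/* seq even again */
  }

  /* Returns 0 on success, -1 if we raced with a writer and must retry. */
  static int node_lookup(struct btree_node *b, unsigned idx, int *ret)
  {
  	unsigned seq = atomic_load(&b->seq);

  	if (seq & 1)
  		return -1;		/* writer currently in the node */

  	*ret = b->keys[idx];		/* may observe a torn write... */

  	if (atomic_load(&b->seq) != seq)
  		return -1;		/* ...which we detect here and retry */
  	return 0;
  }
  ```

  Callers loop on -1; the bset.c audit is what guarantees the "may observe a
  torn write" step can't crash or loop forever before the recheck catches it.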
* Scalability to large cache sets:
Mark and sweep GC is an issue - it runs concurrently with everything _except_
the allocator invalidating buckets. So as long as gc can finish before the
freelist is used up, everything is fine - if not, all writes are going to
stall until gc finishes.
  Additionally, we _almost_ don't need mark and sweep anymore. It would be
  nice to rip it out entirely (which Slava was working on). This is going to
  expose subtle races though - leaks of sector counts that aren't an issue
  today because mark and sweep eventually fixes them.
* bcachefs: add a mount option to disable journal_push_seq() - so userspace is
never waiting on metadata to be synced. Instead, metadata will be persisted
after journal_delay_ms (and ordering will be preserved as usual).
  The idea here is that a lot of the time users really don't care about losing
  the past 10-100 ms of work (client machines, build servers) and would prefer
  the performance improvement on fsync-heavy workloads.
* Discards - right now we only issue discards right before reusing a bucket,
which makes sense when the device is expected to be entirely full of cached
data - but when you're using it as a fs it makes more sense to issue discards
as buckets get emptied out. should figure out where to hook this in.
* bcachefs - journal unused inode hint, or otherwise improve inode allocation
* Idea for improving the situation with btree nodes + large buckets:
This is probably only needed for running on raw flash, but might be worth
doing even without raw flash - it'd reduce internal fragmentation on disk
from the btree when we're using large buckets:
The current situation is that each btree node is a contiguous chunk of disk
space, used as a log - when the log fills up, the btree node is full and we
split or rewrite it. Each btree node is, on disk, its own individual log.
  When the btree node size is smaller than the bucket size we allocate
  multiple btree nodes per bucket - but this means the entire bucket is not
  written to sequentially (the different btree nodes within the bucket will be
  appended to in different orders), and the entire bucket can't be reused
  until every btree node within it has been freed.
Idea: have a second journal for the btree. The different bsets (log entries)
for a particular btree node would no longer be contiguous on disk - whenever
any btree node write happens, it goes to the tail of the journal. This would
not affect anything about how btree nodes work in memory - only the on disk
layout would change.
We'd have to maintain a map of where the bsets currently live on disk for any
given btree node. At runtime this map would presumably just be pinned in
memory - it wouldn't be very big, but then we'd have to have provisions for
regenerating it at startup (either just scan the entire btree journal, or
maintain this map in the primary journal).
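  The per-node map described above might look something like this - a sketch
  only, with hypothetical names and sizes (MAX_BSETS_PER_NODE, bset_loc,
  node_map_append()):

  ```c
  #include <assert.h>

  #define MAX_BSETS_PER_NODE 16

  struct bset_loc {
  	unsigned long long	journal_offset;	/* sector in btree journal */
  	unsigned		sectors;	/* length of this bset */
  };

  struct btree_node_map {
  	unsigned		nr;		/* bsets written so far */
  	struct bset_loc		bsets[MAX_BSETS_PER_NODE];
  };

  /* A btree node write appends one bset at the btree journal's tail. */
  static int node_map_append(struct btree_node_map *m,
  			   unsigned long long tail, unsigned sectors)
  {
  	if (m->nr == MAX_BSETS_PER_NODE)
  		return -1;	/* node is full: split or rewrite it */

  	m->bsets[m->nr++] = (struct bset_loc) {
  		.journal_offset	= tail,
  		.sectors	= sectors,
  	};
  	return 0;
  }

  /* Reading the node back means reading every bset, in order. */
  static unsigned long long node_map_total_sectors(struct btree_node_map *m)
  {
  	unsigned long long ret = 0;

  	for (unsigned i = 0; i < m->nr; i++)
  		ret += m->bsets[i].sectors;
  	return ret;
  }
  ```

  The "node is full" case is where the existing split/rewrite path would kick
  in, so the in-memory btree code wouldn't need to change, as noted above.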