bcache/bcachefs todo list:
* Implement epochs, and get rid of read retry path
  The idea is a refcount on allocator cycles - "epochs": if something (e.g.
  copygc) is buffering extents where GC can't find them (for recalculating the
  oldest gen), it can take a ref on the current epoch. Current epoch minus
  oldest existing epoch then bounds how far out of date pointer gens can be
  for keys gc can't find - the allocator can wait if this difference gets too
  big.
  For foreground reads, the trick is that the read takes a (percpu) refcount
  on the current epoch, and releases it when the read completes; then when the
  allocator wants to start another cycle, it bumps the epoch, then waits for
  all the references from foreground reads to go to 0 on the previous epoch.
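  The epoch scheme might be sketched roughly as follows. This is a minimal
  single-threaded model, with plain atomics standing in for the percpu
  refcounts, and epoch_advance() reporting the old epoch's refcount instead of
  sleeping; all names here (epoch_get(), epoch_put(), epoch_advance(),
  NR_EPOCHS) are hypothetical:

  ```c
  #include <assert.h>
  #include <stdatomic.h>

  #define NR_EPOCHS 4	/* ring of refcounts; epochs in flight must fit */

  static atomic_uint current_epoch;
  static atomic_uint epoch_refs[NR_EPOCHS];

  /* Foreground read: pin the current epoch for the duration of the read. */
  static unsigned epoch_get(void)
  {
  	unsigned e = atomic_load(&current_epoch);

  	atomic_fetch_add(&epoch_refs[e % NR_EPOCHS], 1);
  	return e;
  }

  static void epoch_put(unsigned e)
  {
  	atomic_fetch_sub(&epoch_refs[e % NR_EPOCHS], 1);
  }

  /*
   * Allocator: start a new cycle by bumping the epoch; readers that pinned
   * the previous epoch still hold it. Here we just return the previous
   * epoch's refcount - the real allocator would wait for it to drain to 0.
   */
  static unsigned epoch_advance(void)
  {
  	unsigned prev = atomic_fetch_add(&current_epoch, 1);

  	return atomic_load(&epoch_refs[prev % NR_EPOCHS]);
  }
  ```

  Something like copygc would hold a ref across the whole extent move, not
  just a single read, which is what makes the epoch spread grow.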
* Lockless btree lookups
  The idea is to use the technique from seqlocks - instead of taking a read
  lock on a btree node, we'll check a sequence number for that node, do the
  lookup, then check the sequence number again: if they don't match we raced
  and we have to retry.
This will let us do lookups without writing to any shared cachelines, which
should be a fairly significant performance win - right now the root node's
lock is a bottleneck on multithreaded workloads simply because all lookups
have to write to that cacheline, just to take a read lock on it.
  We already have the sequence number, from SIX locks. The main thing that has
  to be done is auditing the lookup code in bset.c and modifying it to handle
  reading garbage, gracefully returning an error indicating the lookup raced
  with a writer. This hopefully won't be too difficult, since we don't really
  have pointers to deal with.
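  The optimistic read described above is the classic seqlock pattern; a
  minimal sketch (hypothetical names, a toy node in place of a real bset):

  ```c
  #include <assert.h>
  #include <stdatomic.h>

  struct btree_node {
  	atomic_uint	seq;	/* odd while a writer is in the node */
  	int		keys[8];
  };

  static void node_write_lock(struct btree_node *b)
  {
  	atomic_fetch_add(&b->seq, 1);	/* seq now odd */
  }

  static void node_write_unlock(struct btree_node *b)
  {
  	atomic_fetch_add(&b->seq, 1);	/* seq even again */
  }

  /* Returns 0 on success, -1 if we raced with a writer and must retry. */
  static int node_lookup(struct btree_node *b, unsigned idx, int *ret)
  {
  	unsigned seq = atomic_load(&b->seq);

  	if (seq & 1)
  		return -1;		/* writer currently in the node */

  	*ret = b->keys[idx];		/* may observe a torn write... */

  	if (atomic_load(&b->seq) != seq)
  		return -1;		/* ...which we detect here and retry */
  	return 0;
  }
  ```

  Callers loop on -1; the bset.c audit is what guarantees the "may observe a
  torn write" step can't crash or loop forever before the recheck catches it.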
* Scalability to large cache sets:
Mark and sweep GC is an issue - it runs concurrently with everything _except_
the allocator invalidating buckets. So as long as gc can finish before the
freelist is used up, everything is fine - if not, all writes are going to
stall until gc finishes.
  Additionally, we _almost_ don't need mark and sweep anymore. It would be
  nice to rip it out entirely (which Slava was working on). This is going to
  expose subtle races though - leaks of sector counts that aren't an issue
  today because mark and sweep eventually fixes them.
* bcachefs: add a mount option to disable journal_push_seq() - so userspace is
never waiting on metadata to be synced. Instead, metadata will be persisted
after journal_delay_ms (and ordering will be preserved as usual).
  The idea here is that a lot of the time users really don't care about losing
  the past 10-100 ms of work (client machines, build servers) and would prefer
  the performance improvement on fsync-heavy workloads.
* Discards - right now we only issue discards right before reusing a bucket,
which makes sense when the device is expected to be entirely full of cached
data - but when you're using it as a fs it makes more sense to issue discards
as buckets get emptied out. should figure out where to hook this in.
* bcachefs - journal unused inode hint, or otherwise improve inode allocation
* Idea for improving the situation with btree nodes + large buckets:
This is probably only needed for running on raw flash, but might be worth
doing even without raw flash - it'd reduce internal fragmentation on disk
from the btree when we're using large buckets:
The current situation is that each btree node is a contiguous chunk of disk
space, used as a log - when the log fills up, the btree node is full and we
split or rewrite it. Each btree node is, on disk, its own individual log.
  When the btree node size is smaller than the bucket size we allocate
  multiple btree nodes per bucket - but this means the entire bucket is not
  written to sequentially (the different btree nodes within the bucket will be
  appended to in different orders), and the entire bucket can't be reused
  until every btree node within it has been freed.
Idea: have a second journal for the btree. The different bsets (log entries)
for a particular btree node would no longer be contiguous on disk - whenever
any btree node write happens, it goes to the tail of the journal. This would
not affect anything about how btree nodes work in memory - only the on disk
layout would change.
We'd have to maintain a map of where the bsets currently live on disk for any
given btree node. At runtime this map would presumably just be pinned in
memory - it wouldn't be very big, but then we'd have to have provisions for
regenerating it at startup (either just scan the entire btree journal, or
maintain this map in the primary journal).
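  The per-node map described above might look something like this - a sketch
  only, with hypothetical names and sizes (MAX_BSETS_PER_NODE, bset_loc,
  node_map_append()):

  ```c
  #include <assert.h>

  #define MAX_BSETS_PER_NODE 16

  struct bset_loc {
  	unsigned long long	journal_offset;	/* sector in btree journal */
  	unsigned		sectors;	/* length of this bset */
  };

  struct btree_node_map {
  	unsigned		nr;		/* bsets written so far */
  	struct bset_loc		bsets[MAX_BSETS_PER_NODE];
  };

  /* A btree node write appends one bset at the btree journal's tail. */
  static int node_map_append(struct btree_node_map *m,
  			   unsigned long long tail, unsigned sectors)
  {
  	if (m->nr == MAX_BSETS_PER_NODE)
  		return -1;	/* node is full: split or rewrite it */

  	m->bsets[m->nr++] = (struct bset_loc) {
  		.journal_offset	= tail,
  		.sectors	= sectors,
  	};
  	return 0;
  }

  /* Reading the node back means reading every bset, in order. */
  static unsigned long long node_map_total_sectors(struct btree_node_map *m)
  {
  	unsigned long long ret = 0;

  	for (unsigned i = 0; i < m->nr; i++)
  		ret += m->bsets[i].sectors;
  	return ret;
  }
  ```

  The "node is full" case is where the existing split/rewrite path would kick
  in, so the in-memory btree code wouldn't need to change, as noted above.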