# bcachefs status, roadmap:

## Performance:

### Core btree:

The core btree code is heavily optimized, and I don't expect much in the way of
significant gains without novel data structures. Btree performance is
particularly important in bcachefs because we do not shard btrees on a per-inode
basis like most filesystems. Instead we have one global extents btree, one
global dirents btree, et cetera. Historically, this came about because bcachefs
started as a block cache, where we needed a single index that would be fast
enough to index an entire SSD's worth of blocks, and for persistent data
structures btrees have a number of advantages over hash tables (they can do
extents, and when access patterns have locality they preserve that locality).
Using singleton btrees is a major simplification when writing a filesystem, but
it does mean that everything leans rather heavily on the performance and
scalability of the btree implementation.

Bcachefs btrees are a hybrid compacting data structure - they share some
aspects with COW btrees, but btree nodes internally are log structured. This
means that btree nodes are big (default 256k) and have a very high fanout. We
keep multiple sorted sets of keys within a btree node (up to 3),
compacting/sorting when the last set - the one that updates go to - becomes too
large.
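The multiple-sorted-sets scheme can be sketched roughly as follows. Everything
here (`struct kset`, `node_compact`, bare `unsigned` keys) is illustrative
rather than bcachefs's actual structures, and `qsort()` stands in for what
would really be a merge of the already-sorted sets:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SETS 3

/* Hypothetical sketch of a log-structured btree node: up to MAX_SETS
 * independently sorted key sets; inserts are appended to the last set,
 * and when it grows too large the sets are compacted into one. */
struct kset {
	unsigned	*keys;
	size_t		nr;
};

struct btree_node {
	struct kset	sets[MAX_SETS];
	size_t		nr_sets;
};

static int cmp_key(const void *a, const void *b)
{
	unsigned x = *(const unsigned *) a, y = *(const unsigned *) b;

	return x < y ? -1 : x > y;
}

/* Merge all sets back into a single sorted set (a full compaction): */
static void node_compact(struct btree_node *n)
{
	size_t total = 0, off = 0, i;
	unsigned *merged;

	for (i = 0; i < n->nr_sets; i++)
		total += n->sets[i].nr;

	merged = malloc(total * sizeof(*merged));
	for (i = 0; i < n->nr_sets; i++) {
		memcpy(merged + off, n->sets[i].keys,
		       n->sets[i].nr * sizeof(*merged));
		off += n->sets[i].nr;
		free(n->sets[i].keys);
	}
	/* Simplification: real code would merge, not re-sort. */
	qsort(merged, total, sizeof(*merged), cmp_key);

	n->sets[0] = (struct kset) { merged, total };
	n->nr_sets = 1;
}
```

The payoff of this layout is that most inserts only touch the small last set,
while lookups search each sorted set independently.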

Lookups within a btree node use eytzinger layout search trees with summary keys
that fit in 4 bytes. Eytzinger layout means the descendants of any given node
are adjacent in memory, so they can be effectively prefetched, and 4 byte nodes
mean 16 of them fit on a cacheline, which means we can prefetch 4 levels ahead -
this lets us pipeline effectively enough to get good lookup performance even
when our memory accesses are not cached.
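To illustrate why this layout works so well, here is a minimal branchless
lookup over a 1-indexed eytzinger array, where node `i`'s children sit at `2i`
and `2i+1`. This is a sketch of the general technique with a hypothetical
`eytzinger_search` helper, not bcachefs's actual search code:

```c
#include <assert.h>
#include <stddef.h>

/* Branchless lower-bound search over an eytzinger (BFS-order) layout
 * array, 1-indexed; tree[0] is unused. Illustrative sketch only. */
static size_t eytzinger_search(const unsigned *tree, size_t n, unsigned key)
{
	size_t i = 1;

	/* Descend: go right when tree[i] < key, left otherwise. Because
	 * each node's subtree occupies a contiguous region per level,
	 * the next few levels can be prefetched ahead of the search. */
	while (i <= n)
		i = 2 * i + (tree[i] < key);

	/* The trailing 1 bits of i record right-turns taken after
	 * passing the answer; shifting them (plus one 0) off yields the
	 * index of the first element >= key, or 0 if none exists. */
	return i >> (__builtin_ctzl(~i) + 1);
}
```

For example, for the sorted keys 1..7 the eytzinger layout is
`{4, 2, 6, 1, 3, 5, 7}`; searching for 5 descends 4 → 6 → 5 and lands on
index 6.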

On my Ryzen 3900X, single threaded lookups across 16M keys go at 1.3M/second,
and scale almost linearly - with 12 threads it's about 12M/second. We can do
16M random inserts at 670k per second single threaded; with 12 threads we
achieve 2.4M random inserts per second.

Btree locking is very well optimized. We have external btree iterators which
make it easy to aggressively drop and retake locks, giving the filesystem as a
whole very good latency.

Minor scalability todo item: btree node splits/merges still take intent locks
all the way up to the root, they should be checking how far up they need to go.
This hasn't shown up yet because bcachefs btree nodes are big and splits/merges
are infrequent.

### IO path:

The read path looks to be in excellent shape - I see around 350k iops doing
async 4k random reads, single threaded, on a somewhat pokey Xeon server. We're
nowhere near where we should be on 4k random writes - I've been getting in the
neighborhood of 70k iops lately, single threaded, and it's unclear where the
bottleneck is - we don't seem to be cpu bound. Considering the btree
performance, this shouldn't be a core architectural issue - work needs to be
done to identify where we're losing performance.

Filesystem metadata operation performance (e.g. fsmark): on most benchmarks, we
seem to be approaching or sometimes matching XFS on performance. There are
exceptions: multithreaded rm -rf of empty files is currently roughly 4x slower
than XFS - we bottleneck on journal reclaim, which is flushing inodes from the
btree key cache back to the inodes btree.

## Scalability:

There are still some places where we rely on walking/scanning metadata too much.

### Allocator:

For allocation purposes, we divide up the device into buckets (default 512k,
currently supporting up to 8M). There is an in-memory array that represents
these buckets, and the allocator periodically scans this array to fill up
freelists - this is currently a pain point, and eliminating this scanning is a
high priority item. Secondarily, we want to get rid of this in-memory array -
buckets are primarily represented in the alloc btree. Almost all of the work to
get rid of this has been done; fixing the allocator to not rescan for empty
buckets is the last major item.
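The scanning pattern being eliminated looks roughly like this; all names and
types here are hypothetical, not the actual bcachefs allocator:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: one entry per bucket in the in-memory array. */
struct bucket {
	uint32_t sectors_used;
};

/* Refill a freelist by scanning the bucket array for empty buckets.
 * The scan is O(nr_buckets) - cost grows with device size rather than
 * with the amount of allocation being done, which is the pain point. */
static size_t fill_freelist(const struct bucket *buckets, size_t nr_buckets,
			    size_t *freelist, size_t freelist_size)
{
	size_t nr_free = 0, i;

	for (i = 0; i < nr_buckets && nr_free < freelist_size; i++)
		if (!buckets[i].sectors_used)
			freelist[nr_free++] = i;

	return nr_free;
}
```

Keeping free buckets indexed in the alloc btree instead lets the allocator
find them directly, without the periodic full scan.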

### Copygc:

Because allocation is bucket based, we're subject to internal fragmentation that
we need copygc to deal with, and when copygc needs to run it has to scan the
bucket arrays to determine which buckets to evacuate, and then it has to scan
the extents and reflink btrees for extents to be moved.

This is something we'd like to improve, but it is not a major pain point: being
a filesystem, we're better positioned to avoid mixing unrelated writes, which
tends to be the main cause of write amplification in SSDs. Improving this will
require adding a backpointer index, which will necessarily add overhead to the
write path - thus, I intend to defer this until after we've diagnosed our
current lack of performance in the write path and worked to make it as fast as
we think it can go. We may also want backpointers to be an optional feature.

### Rebalance:

Various filesystem features may want data to be moved around or rewritten in the
background: e.g. using a device as a writeback cache, or using the
`background_compression` option - excluding copygc, these are all handled by the
rebalance thread. Rebalance currently has to scan the entire extents and reflink
btrees to find work items - and this is more of a pain point than copygc, as
this is a normal filesystem feature. We should add a secondary index for extents
that rebalance will need to move or rewrite - this will be a "pay if you're
using it" feature so the performance impact should be less of a concern than
with copygc.

Some thought and attention should be given to what else we might want to use
rebalance for in the future.

There are also current bug reports of rebalance reading data and not doing any
work; while work is being done on this subsystem we should think about improving
visibility as to what it's doing.

### Disk space accounting:

Currently, disk space accounting is primarily persisted in the journal (on clean
shutdown, it's also stored in the superblock) - and each disk space counter is
added to every journal entry. As the number of devices in a filesystem increases
this overhead grows, but where this will really become an issue is adding
per-snapshot disk space accounting. We really need these counters to be stored
in btree keys, but this will be a nontrivial change as we use a complicated
system of percpu counters for disk space accounting, with multiple sets of
counters for each open journal entry - rolling them up just prior to each
journal write. My plan is to extend the btree key cache so that it can support a
similar mechanism.

Also, we want disk space accounting to be broken down by compression type for
compressed data, with both compressed and uncompressed totals kept - right now
there's no good way for users to find out the compression ratio they're getting.

### Number of devices in a filesystem:

Currently, we're limited to a maximum of 64 devices in a filesystem. It would be
desirable and not much work to lift that limit - it's not encoded in any on disk
data structures, just in memory bitmasks. After that we'll support 256 devices,
and to go past that we'll need another, larger `bch_extent_ptr` type and a new
superblock section for larger device index fields.

`bch_extent_ptr` uses 44 bits for the offset field (in units of 512 byte
sectors), giving us a current maximum device size of 8 petabytes. We already
have code for different types of checksum fields within an extent for different
size checksums, so adding support for a new, larger extent ptr type will be
fairly trivial.
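The 8 petabyte figure follows directly from the field width: 2^44 sectors times
2^9 bytes per sector is 2^53 bytes. A quick check, with `max_device_bytes` as a
hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/* Maximum addressable device size given bch_extent_ptr's 44-bit offset
 * field, counted in 512-byte sectors: 2^44 * 2^9 = 2^53 bytes = 8 PiB. */
static uint64_t max_device_bytes(void)
{
	uint64_t max_sectors = 1ULL << 44;	/* 44-bit offset field */

	return max_sectors << 9;		/* 512 bytes per sector */
}
```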

### Device management:

We need a concept of "failure domains", and perhaps something more, for managing
large numbers of devices with bcachefs to make sense. Currently, disks can be
assigned labels that are essentially paths, delimited by periods, e.g.

  controller1.ssd.ssd1
  controller1.ssd.ssd2
  controller1.hdd.hdd1
  controller2.hdd.hdd1
  controller2.hdd.hdd2

This lets disks be organized into a hierarchy, and referring to any part of the
hierarchy will include all the disks underneath that point: e.g. controller1
would refer to the first three disks in the example.
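The matching rule amounts to a dot-delimited prefix test, sketched below;
`label_matches` is a hypothetical helper for illustration, not an actual
bcachefs function:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* A target refers to a disk if it equals the disk's label, or is a
 * dot-delimited prefix of it. The boundary check prevents, say,
 * "controller1" from matching "controller10.ssd.ssd1". */
static bool label_matches(const char *disk_label, const char *target)
{
	size_t n = strlen(target);

	return !strncmp(disk_label, target, n) &&
	       (disk_label[n] == '\0' || disk_label[n] == '.');
}
```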

This is insufficient, though. If a user has a filesystem with 60 drives, all
alike, we need a way to specify that some disks are on the same controller and
that multiple replicas should not be stored on the same controller, and we also
need a way to constrain how replicated writes are laid out. Today, in a
60-device filesystem with 3x replication, each allocation is unconstrained,
which means that the failure of _any_ three devices would lead to loss of data.

A feature that addresses these issues still needs to be designed.

### Fsck

Fsck performance is good, but for supporting huge filesystems we are going to
need online fsck.

Fsck in bcachefs is divided into two main components: there is `btree_gc`,
which walks the btree and regenerates allocation information (and also checks
btree topology), and there is fsck proper, which checks higher level filesystem
invariants (e.g. extents/dirents/xattrs must belong to an inode, and filesystem
connectivity).

`btree_gc` is so named because bcachefs originally (when it was bcache) not
only did not have persistent allocation information, it also didn't keep bucket
sector counts up to date when adding/removing extents from the btree - it
relied entirely on periodically rescanning the btree.

This capability was kept for a very long time, partly as an aid to testing and
debugging while up-to-date, then persistent, then transactionally persistent
allocation information was being developed. The capability to run `btree_gc`
at runtime has been disabled for
roughly the past year or so - it was a bit too much to deal with while getting
everything else stabilized - but the code has been retained and there haven't
been any real architectural changes that would preclude making it a supported
feature again - this is something that we'd like to do, after bcachefs is
upstreamed and when more manpower is available.

Additionally, fsck itself is part of the kernel bcachefs codebase - the
userspace fsck tool uses a shim layer to run the kernel bcachefs codebase in
userspace, and fsck uses the normal btree iterators interface that the rest of
the filesystem uses. The in-kernel fsck code can be run at mount time by
mounting with the `-o fsck` option - this is essentially what the userspace
fsck tool does, just running the same code in userspace.

This means that there's nothing preventing us from running the current fsck
implementation at runtime, while the filesystem is in use, and there's actually
not much work that needs to be done for this to work properly and reliably: we
need to add locking for the parts of fsck that are checking nonlocal invariants and
can't rely solely on btree iterators for locking.

Since backpointers from inodes to dirents were recently added, the checking of
directory connectivity is about to become drastically simpler and shouldn't need
any external data structures or locking with the rest of the filesystem. The
only fsck passes that appear to require new locking are:

* Counting of extents/dirents to validate `i_sectors` and subdirectory counts.

* Checking of `i_nlink` for inodes with hardlinks - but since inode backpointers
  have now been added, this pass now only needs to consider inodes that actually
  have hardlinks.

So that's cool.

### Snapshots:

I expect bcachefs to scale well into the millions of snapshots. There will need
to be some additional work to make this happen, but it shouldn't be much.

* We have a function, `bch2_snapshot_is_ancestor()`, that needs to be fast -
  it's called by the btree iterator code (in `BTREE_ITER_FILTER_SNAPSHOTS` mode)
  for every key it looks at. The current implementation is linear in the depth
  of the snapshot tree. We'll probably need to improve this - perhaps by caching
  a bit vector, for each snapshot ID, of ancestor IDs.

* Too many overwrites in different snapshots at the same part of the keyspace
  will slow down lookups - we haven't seen this come up yet, but it certainly
  will at some point.

  When this happens, it won't be for most files in the filesystem, in anything
  like typical usage - most files are created, written to once, and then
  overwritten with a rename; it will be for a subset of files that are long
  lived and continuously being overwritten - i.e. databases.

  To deal with this we'll need to implement per-inode counters so that we can
  detect this situation - and then when we detect too many overwrites, we can
  allocate a new inode number internally, and move all keys that aren't visible
  in the main subvolume to that inode number.
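The cached-bit-vector idea for `bch2_snapshot_is_ancestor()` could look
roughly like the sketch below; every name and size here is illustrative,
not bcachefs's actual implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_SNAPSHOTS 256	/* kept tiny for illustration */

/* Hypothetical per-snapshot state: a bit vector of ancestor IDs, so the
 * ancestor check is a single bit test instead of a walk that is linear
 * in the depth of the snapshot tree. Snapshot IDs start at 1 here; a
 * parent of 0 means no parent. */
struct snapshot_t {
	uint64_t ancestors[MAX_SNAPSHOTS / 64];
};

static struct snapshot_t snapshots[MAX_SNAPSHOTS];

/* On snapshot creation: inherit the parent's ancestor set, then add the
 * parent itself. */
static void snapshot_created(uint32_t id, uint32_t parent)
{
	if (parent) {
		memcpy(snapshots[id].ancestors, snapshots[parent].ancestors,
		       sizeof(snapshots[id].ancestors));
		snapshots[id].ancestors[parent / 64] |= 1ULL << (parent % 64);
	}
}

static bool snapshot_is_ancestor(uint32_t id, uint32_t ancestor)
{
	return id == ancestor ||
	       ((snapshots[id].ancestors[ancestor / 64] >>
		 (ancestor % 64)) & 1);
}
```

At millions of snapshots a full bit vector per snapshot gets large, so a real
implementation would presumably cache these vectors only for hot snapshot IDs.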