# Bcachefs

Bcachefs is a next generation copy on write filesystem for Linux with a long
list of features - tiering/caching, data checksumming, compression, encryption,
multiple devices, et cetera.

It's not vaporware - it's a real filesystem you can run on your laptop or server
today.

We prioritize robustness and reliability over features and hype: we make every
effort to ensure you won't lose data. Bcachefs builds on a codebase with a
pedigree - bcache already has a reasonably good track record for reliability
(particularly considering how young upstream bcache is, in terms of engineer
man-years). Starting from there, bcachefs development has prioritized
incremental development, keeping things stable, and aggressively fixing design
issues as they are found; the bcachefs codebase is considerably more robust and
mature than upstream bcache.

Developing a filesystem is also not cheap or quick or easy; we need funding!
Please chip in on [[Patreon|https://www.patreon.com/bcachefs]] - the Patreon
page also has more information on the motivation for bcachefs and the state of
Linux filesystems, as well as some bcachefs status updates and information on
development.

If you don't want to use Patreon, I'm also happy to take donations via PayPal:
kent.overstreet@gmail.com.

Join us in the bcache IRC channel - we have a small group of bcachefs users and
testers there: #bcache on OFTC (irc.oftc.net).

## Why bcachefs?

For existing bcache users, we've got a particularly compelling argument: a block
layer cache implements the core functionality of a filesystem (allocation,
reclamation, and mapping from one address space to another). When you run a
filesystem on top of a block cache, every IO must traverse two different mapping
layers - each taking up memory and space for its index, each adding to every
IO's latency (and tail latency!), and each adding more complexity to your IO
path - in particular, making it that much harder to debug performance issues.

By using a filesystem that was designed for caching from the start (among many
other things) we're able to collapse the two mapping layers and eliminate a lot
of redundant complexity - many things also become easier when they're done
within the context of an (appropriately designed) filesystem, versus the block
layer - cache coherency, for example, is a tricky problem in bcache but trivial
in bcachefs. This has real world impact - some of the most pernicious bugs and
performance issues bcache users have hit have been because of the writeback
lock, which is needed for cache coherency - that code is all gone in bcachefs.

What if you don't care about caching - what if you just want a filesystem that
works? Bcachefs is not just targeted at caching - it's meant to be a superior
replacement for ext4, xfs and btrfs. We intend to deliver a copy on write
filesystem, with all the features you'd expect from a modern copy on write
filesystem - but with the performance and robustness to be a very viable
replacement for ext4 and xfs. We will deliver on that goal.

## Documentation

End user documentation is currently fairly minimal; this would be a very helpful
area for anyone who wishes to contribute - I would like the bcache man page in
the bcache-tools repository to be rewritten and expanded.

There is some fairly substantial developer documentation: see [[BcacheGuide]].

## Getting started

Bcachefs is not upstream, and won't be for a while. If you want to try out
bcachefs now, you'll need to be comfortable with building your own kernel. Also,
as bcachefs has had many incompatible on disk format changes, you cannot
currently build a kernel with support for both bcachefs and the existing,
upstream bcache on disk format (this will change prior to bcachefs going
upstream).

First, check out the bcache kernel and tools repositories:

    git clone -b bcache-dev https://evilpiepirate.org/git/linux-bcache.git
    git clone -b dev https://evilpiepirate.org/git/bcache-tools.git
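
Build and install both as usual. For reference, that step looks roughly like the
following - the kernel config options to enable and the exact install targets
depend on your setup, so treat this as a sketch rather than exact instructions:

    cd linux-bcache
    make olddefconfig                  # start from your current kernel config
    make menuconfig                    # enable the bcache/bcachefs options
    make -j$(nproc)
    sudo make modules_install install

    cd ../bcache-tools
    make
    sudo make install                  # assuming the usual Makefile install target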

Then, to format and mount a single device with the default options, run:

    bcache format /dev/sda1
    mount /dev/sda1 /mnt

See `bcache format --help` for more options.

## Status

Bcachefs can currently be considered beta quality. It has a small pool of
outside users and has been quite stable and reliable so far; there's no reason
to expect issues as long as you stick to the currently supported feature set.
That said, it's still a new filesystem, so backups are recommended.

Performance is generally quite good - typically faster than btrfs, and not far
behind xfs/ext4. There are still performance bugs to be found and optimizations
we'd like to do, but performance isn't currently the primary focus - the main
focus is on making sure it's production quality and finishing the core feature
set.

Normal POSIX filesystem functionality is all finished - xattrs, ACLs and the
rest are supported, so if you're using bcachefs as a replacement for ext4 on a
desktop, you shouldn't find anything missing. For servers, NFS export support is
still missing (but coming soon), and we don't yet support quotas (probably
further off).

The on disk format is not yet set in stone - there will be future breaking
changes to the on disk format, but we will make every effort to make
transitioning easy for users (e.g. when there are breaking changes, there will
be kernel branches maintained in parallel that support the old and new formats,
to give users time to transition; users won't be left stranded with data they
can't access). We'll need at least one more breaking change for encryption and
possibly snapshots, but I'm trying to batch up all the breaking changes as much
as possible.

### Feature status

 - Full data checksumming

   Fully supported and enabled by default. We do need to implement scrubbing,
   once we've got replication and can take advantage of it.

 - Compression

   Not _quite_ finished - it's safe to enable, but there's some work left
   related to copy GC before we can enable free space accounting based on
   compressed size: right now, enabling compression won't actually let you store
   any more data in your filesystem than if the data were uncompressed.

 - Tiering

   Works (there are users using it), but recent testing and development has not
   focused enough on multiple devices to call it supported. In particular, the
   device add/remove functionality is currently known to be buggy.

 - Multiple devices, replication

   Roughly 80-90% implemented, but it's been on the back burner for quite a
   while in favor of making the core functionality production quality -
   replication is not currently suitable for outside testing.

 - [[Encryption]]

   Implementation is finished, and passes all the tests. The blocker on rolling
   it out is finishing the design doc and getting outside review (as any changes
   based on feedback from outside review will almost definitely require on disk
   format changes), as well as finishing up some unrelated on disk format
   changes (particularly for replication) that I'm batching up with the on disk
   format changes for encryption.

 - Snapshots

   Snapshot implementation has been started, but snapshots are by far the most
   complex of the remaining features to implement - it's going to be quite a
   while before I can dedicate enough time to finishing them, but I'm very much
   looking forward to showing off what they'll be able to do.

### Known issues/caveats

 - Mount time

   We currently walk all metadata at mount time (multiple times, in fact) - on
   flash this shouldn't even be noticeable unless your filesystem is very large,
   but on rotating disks expect mount times to be slow.

   This will be addressed in the future - mount times will likely be the next
   big push after the next big batch of on disk format changes.

 - Fsck

   There is an fsck - it's just done in the kernel at mount time, not in
   userspace. We shouldn't be missing any checks - we should be able to detect
   any filesystem inconsistency. Repair is only implemented for a few
   inconsistencies, though.

   By default, fsck is run on every mount - mount with -o nofsck if you don't
   want to run it. Errors are not fixed by default, because I want to make sure
   I get bug reports if inconsistencies are found - if you do run into fixable
   errors, mount with -o fix_errors (and send a bug report!).
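
For example, reusing the device and mountpoint from the Getting started section:

    # mount without running the mount-time fsck
    mount -o nofsck /dev/sda1 /mnt

    # mount and let fsck repair any fixable inconsistencies it finds
    mount -o fix_errors /dev/sda1 /mnt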

## FAQ

Please ask questions and ask for them to be added here!

## Todo list

### Current priorities:

 * Encryption is pretty much done - just finished the design doc.

   Cryptographers, security experts, etc. please review: [[Encryption]].

 * Compression is almost done: it's quite thoroughly tested, and the only
   remaining issue is a problem with copygc fragmenting existing compressed
   extents, which only breaks accounting.

 * NFS export support is almost done: implementing i_generation correctly
   required some new transaction machinery, but that's mostly done. What's left
   is implementing a new kind of reservation of journal space for the new, long
   running transactions.

### Breaking changes:

 * Need incompatible superblock changes - the encryption key used up the
   remaining reserved space. We need:
    * more flag bits
    * a feature bits field
    * bring some structure to the variable length portion, so we can add more
      crap later - do it like inode optional fields
    * on clean shutdown, write the current journal sequence number to the
      superblock - this helps guard against corruption or an encrypted
      filesystem being tampered with

 * More bits (once we have feature bits) for "has this feature ever been used", e.g.
   * encryption - if we don't have encrypted data, we don't need to load ciphers
   * compression - if gzip has never been used, we don't need gzip's crazy huge
     compression workspace

 * journal format tweaks:
   * right now btree node roots are added to every journal entry - we really
     only need to journal them when they change, and with the generic journal
     pin infrastructure this'll be easy to implement. This is a slight on disk
     format change - old kernels won't be able to read filesystems from newer
     kernels, but it's not a breaking change

   * prio bucket pointers - we also add to every journal entry a pointer to
     each device's starting prio bucket. This one is more important to fix,
     because with large numbers of devices we'll be wasting more and more of
     each journal entry on these prio pointers that mostly aren't changing. We
     just need to break this journal entry out into one entry per component
     device (and, as with btree node roots, change it to only journal when it
     changes).

     When tweaking prio bucket pointers, we should also add a random sequence
     field so we can recognize valid prio_sets we read that aren't the one we
     actually wanted.

 * fallocate + compression - calling fallocate is supposed to ensure that a
   future write call won't return -ENOSPC, regardless of what the file range
   already contains (see the illustration after this list). We have persistent
   reservations to support fallocate, but if the file already contains
   compressed data we currently can't put a persistent reservation where we've
   already got an extent. We need another type of persistent reservation that
   we can add to a normal data extent.

 * checksumming stuff:
    * configurable action for nonfatal IO errors & data checksum errors
    * RO, continue or threshold
    * absolute threshold, or moving average threshold (error rate)
    * when we get a read error/data checksum error, flip a bit in the key - "has
      seen read error" - so we don't blow through the global limit on one bad
      extent
    * global and per device options: per device options take precedence if set,
      but may be unset
    * how should configuration handle multiple devices? we probably want to just
      continue by default in single device mode, but in multi device mode kick
      it RO
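
To illustrate the fallocate guarantee mentioned above, here's a generic example
using util-linux's fallocate(1) (the path and sizes are made up):

    # preallocate 1GiB of space for /mnt/bigfile
    fallocate -l 1G /mnt/bigfile

    # writes within the preallocated range are then expected not to fail with
    # -ENOSPC, even if the range already held (compressed) data
    dd if=/dev/zero of=/mnt/bigfile bs=1M count=1024 conv=notrunc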

### Other wishlist items:

 * When we're using compression, we end up wasting a fair amount of space on
   internal fragmentation because compressed extents get rounded up to the
   filesystem block size when they're written - usually 4k. It'd be really nice
   if we could pack them in more efficiently - probably 512 byte sector
   granularity.

   On the read side this is no big deal to support - we have to bounce
   compressed extents anyways. The write side is the annoying part. The options
   are:
    * Buffer up writes when we don't have full blocks to write? Highly
      problematic, not going to do this.
    * Read modify write? Not an option for raw flash, would prefer it to not be
      our only option
    * Do data journalling when we don't have a full block to write? Possible
      solution, we want data journalling anyways

 * Inline extents - good for space efficiency, both for small files and for
   compression when extents happen to compress particularly well.

 * Full data journalling - we're definitely going to want this for when the
   journal is on an NVRAM device (also need to implement external journalling
   (easy), and direct journal on NVRAM support (what's involved here?)).

   Would be good to get a simple implementation done and tested so we know what
   the on disk format is going to be.