# bcache/bcachefs encryption design:

This document is intended for review by cryptographers and other experienced
implementers of cryptography code, before the design is frozen. Everything
described in this document has been implemented and tested.

The code may be found at
[[https://evilpiepirate.org/git/linux-bcache.git/log/?h=bcache-encryption]]

Feedback and comments should be sent to Kent Overstreet,
kent.overstreet@gmail.com.

Main bcachefs page: [[Bcachefs]]

## Intro:

Bcachefs provides whole-filesystem encryption, using ChaCha20/Poly1305.
Encryption may be enabled when creating a filesystem, or enabled later on an
existing filesystem (TODO: implement the interface for enabling encryption on
an existing filesystem - the kernel code exists).

Example:

    $ bcache format --encrypted /dev/sda
(Enter passphrase when prompted)

    $ bcache unlock /dev/sda
(Enter passphrase again)

Then mount as normal:

    $ mount /dev/sda /mnt

## Goals:

Bcachefs encryption is meant to be a clean slate design that prioritizes
security and robustness, and is meant to defend against a wider variety of
adversarial models than is typical in existing filesystem level or block level
encryption.

In particular, the goal is to be secure even when the attacker controls the
storage device itself, and can see reads and writes as they happen and return
arbitrary data from read requests.

## Filesystem vs. directory encryption

We do not currently offer per directory encryption; instead, we take an "encrypt
everything" approach.

Rationale:

With per directory encryption, preventing metadata from leaking is highly
problematic. For example, file sizes: they are required for fsck, so they would
either have to be stored unencrypted, or we would need some complicated way of
deferring fsck for that part of the filesystem until the key has been provided.
There would be additional complications around filenames, xattrs, extents (and
inline extents), etc. - not necessarily insurmountable, but they would
definitely lead to a more complicated, more fragile design.

With whole filesystem encryption, it’s much easier to say what is and isn’t
encrypted, we can encrypt at the granularity of an entire metadata write (a
journal entry, a btree node) and it's much easier to show that the contents -
everything after the header for that particular metadata write - will not leak.

### Algorithms

By virtue of working within a copy on write filesystem with provisions for ZFS
style checksums (that is, checksums with the pointers, not the data), we’re
able to use a modern AEAD style construction. We use ChaCha20 and Poly1305. We
use the cyphers directly instead of going through the kernel's AEAD library,
but we follow the approach of [[RFC 7539|https://tools.ietf.org/html/rfc7539]]
fairly closely.
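
For reference, here is a minimal userspace sketch of that style of
construction, using libsodium's ChaCha20 and Poly1305 primitives purely for
illustration - this is not the kernel implementation, and it omits the
associated data and length block that the full RFC 7539 AEAD also MACs:

    /*
     * Illustrative only: derive the Poly1305 one time key from ChaCha20
     * block 0, encrypt the payload with the keystream starting at block 1,
     * then MAC the ciphertext - the same shape as the RFC 7539 construction.
     */
    #include <sodium.h>

    static int encrypt_and_mac(unsigned char mac[16],
                               unsigned char *ciphertext,
                               const unsigned char *plaintext, size_t len,
                               const unsigned char nonce[12],
                               const unsigned char key[32])
    {
            unsigned char poly_key[32];

            /* Poly1305 one time key = first 32 bytes of keystream block 0: */
            if (crypto_stream_chacha20_ietf(poly_key, sizeof(poly_key), nonce, key))
                    return -1;

            /* Encrypt with the keystream starting at block counter 1: */
            if (crypto_stream_chacha20_ietf_xor_ic(ciphertext, plaintext, len,
                                                   nonce, 1, key))
                    return -1;

            /* MAC the ciphertext with the one time key: */
            crypto_onetimeauth_poly1305(mac, ciphertext, len, poly_key);

            sodium_memzero(poly_key, sizeof(poly_key));
            return 0;
    }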

Note that ChaCha20 is a stream cypher. This means it is critical that we use a
cryptographic MAC (which would be highly desirable anyway), and that we avoid
nonce reuse. Getting nonces right is where most of the trickiness in bcachefs's
encryption lies.

The current algorithm choices are not hard coded. Bcachefs already has
selectable checksum types, and every individual data and metadata write has a
field that describes the checksum algorithm that was used. On disk, encrypted
data is represented as a new checksum type - so we now have [none, crc32c,
crc64, chacha20/poly1305] as possible methods for data to be
checksummed/encrypted. If in the future we add new encryption algorithms, users
will be able to switch to the new algorithm on existing encrypted filesystems;
new data will be written with the new algorithm and old data will be read with
the old algorithm until it is rewritten.
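
Conceptually, the on disk checksum type field is just a small enum along these
lines (the names and values here are illustrative, not the exact constants from
the code):

    /* Illustrative sketch - not the exact on disk constants: */
    enum checksum_type {
            CSUM_NONE               = 0,
            CSUM_CRC32C             = 1,
            CSUM_CRC64              = 2,
            CSUM_CHACHA20_POLY1305  = 3,    /* encrypted + MACed */
    };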

### Key derivation, master key

Userspace tooling takes the user's passphrase and derives an encryption key with
scrypt. This key is made available to the kernel (via the Linux kernel's keyring
service) prior to mounting the filesystem.
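
A rough sketch of that userspace step (passphrase -> scrypt -> kernel keyring),
using libsodium's scrypt binding and the add_key() syscall; the cost parameters
and the key description string here are illustrative, not the actual tooling
code:

    #include <sodium.h>
    #include <keyutils.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative sketch of the unlock path: */
    static int unlock_fs(const char *passphrase,
                         const unsigned char salt[crypto_pwhash_scryptsalsa208sha256_SALTBYTES],
                         const char *fs_uuid)
    {
            unsigned char key[32];          /* 256 bit ChaCha20 key */
            char description[64];
            key_serial_t id;

            /* Derive the encryption key from the passphrase with scrypt: */
            if (crypto_pwhash_scryptsalsa208sha256(key, sizeof(key),
                            passphrase, strlen(passphrase), salt,
                            crypto_pwhash_scryptsalsa208sha256_OPSLIMIT_INTERACTIVE,
                            crypto_pwhash_scryptsalsa208sha256_MEMLIMIT_INTERACTIVE))
                    return -1;

            /* Hand the key to the kernel via the keyring, keyed by fs UUID: */
            snprintf(description, sizeof(description), "bcache:%s", fs_uuid);
            id = add_key("logon", description, key, sizeof(key),
                         KEY_SPEC_USER_KEYRING);

            sodium_memzero(key, sizeof(key));
            return id < 0 ? -1 : 0;
    }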

On filesystem mount, the userspace provided key is used to decrypt the master
key, which is stored in the superblock - also with ChaCha20. The master key is
prefixed with an 8 byte header before encryption, so that we can tell whether
the correct key was supplied.

TODO: Add a field to the superblock specifying the key derivation function, so
that we can transition to newer KDFs later (e.g. Argon2) or specify cost
parameters.

### Metadata

Except for the superblock, no metadata in bcache/bcachefs is updated in place -
everything is more or less log structured. Only the superblock is stored
unencrypted; other metadata is stored with an unencrypted header and encrypted
contents.

The superblock contains:

 * Label and UUIDs identifying the filesystem
 * A list of component devices (for multi-device filesystems), and information
   on their size, geometry, status (active/failed), last used timestamp
 * Filesystem options
 * The location of the journal

For the rest of the metadata, the unencrypted portion contains:

 * 128 bit checksum/MAC field
 * Magic number - identifies a given structure as btree/journal/allocation
   information, for that filesystem
 * Version number (of on disk format), flags (including checksum/encryption
   type).
 * Sequence numbers: journal entries have an ascending 64 bit sequence number,
   btree node entries have a random 64 bit sequence number identifying them as
   belonging to that node. Btree nodes also have a field containing the sequence
   number of the most recent journal entry they contain updates from; this is
   stored unencrypted so it can be used as part of the nonce.
 * Size of the btree node entry/journal entry, in u64s
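
Conceptually, the unencrypted portion looks something like this (an
illustrative sketch only - the journal and btree node headers each have their
own struct in the real on disk format, with different layouts):

    /* Illustrative sketch, not the real on disk structs: */
    #include <linux/types.h>

    struct metadata_header {
            __le64          csum[2];        /* 128 bit checksum/MAC */
            __le64          magic;          /* btree/journal/alloc, per fs */
            __le16          version;        /* on disk format version */
            __le16          flags;          /* incl. checksum/encryption type */
            __le64          seq;            /* journal seq number, or the btree
                                               node's random 64 bit id */
            __le64          journal_seq;    /* btree nodes: newest journal seq
                                               this node contains updates from */
            __le32          u64s;           /* size of the entry, in u64s */
            /* encrypted contents follow */
    };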

Btree node layout information is encrypted: an attacker could tell that a given
location on disk is a btree node, but the parts of the header that indicate
which range of the keyspace the node covers, which btree it belongs to
(extents/dirents/xattrs/etc.), and which level of the btree it is at are all
encrypted.

#### Metadata nonces

 * Journal entries use their sequence number - which is unique for a given
   filesystem. When metadata is being replicated and we're doing multiple
   journal writes with the same sequence number - and thus nonce - we really are
   writing the same data (we only checksum once, not once per write).

 * Btree nodes concatenate a few things for the nonce (see the sketch after
   this list):
   - A 64 bit random integer, which is generated per btree node (but btree nodes
     are log structured, so entries within a given btree node share the same
     integer).
   - A journal sequence number. For btree node writes done at around the same
     point in time, this field can be identical in unrelated btree node writes -
     but only for btree node writes done relatively close in time, so the
     journal sequence number plus the per node random integer should provide
     more than sufficient entropy.
   - And lastly the offset within the btree node, so that btree node entries
     sharing the same random integer are guaranteed a different nonce.

 * Allocation information (struct prio_set):
   bcache/bcachefs doesn't have allocation information persisted like other
   filesystems, but this is our closest equivalent - this structure mainly
   stores generation numbers that correspond to extent pointers.

   Allocation information uses a dedicated randomly generated 96 bit nonce
   field.
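
A sketch of the btree node case, purely for illustration - how these inputs are
actually packed/truncated down to the cipher nonce is not shown here and
differs in the real code:

    #include <stdint.h>

    struct nonce { uint32_t d[3]; };        /* 96 bit nonce */

    /*
     * Illustrative only: the btree node nonce is a deterministic function
     * of the per node random id, the journal sequence number and the
     * offset of the entry within the node.
     */
    static struct nonce btree_node_nonce(uint64_t node_random_id,
                                         uint64_t journal_seq,
                                         uint32_t offset_in_node)
    {
            return (struct nonce) {{
                    (uint32_t) node_random_id ^ (uint32_t) (node_random_id >> 32),
                    (uint32_t) journal_seq,
                    offset_in_node,
            }};
    }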

### Data

Data writes have no unencrypted header: checksums/MACs, nonces, etc. are stored
with the pointers, ZFS style.

Bcache/bcachefs is extent based, not block based: pointers point to variable
sized chunks of data, and we store one checksum/MAC per extent, not per block: a
checksum or MAC might cover up to 64k (extents that aren't checksummed or
compressed may be larger). Nonces are thus also per extent, not per block.

By default, for data extents the Poly1305 MAC is truncated to 80 bits, for space
efficiency reasons. Optionally the full 128 bit MACs may be stored, at the cost
of increasing the size of extents by 8 bytes (with 80 bit MACs, an extent with a
single replica will typically be 32 bytes, or 40 bytes with 128 bit MACs).

This should be completely safe for the vast majority of use cases. Most uses of
cryptographic MACs are in networked applications, where an attacker may be able
to send an unlimited number of forged messages: in that environment, a 64 bit
MAC is clearly insufficient - if an attacker is able to send 2^32 forgery
attempts (not a huge number these days), the probability of success is 1 / 2^32
- which is not considered a remotely safe margin by cryptographers.

However, with a filesystem, even in the case of a completely compromised device
(say an attacker has compromised the firmware on the disk, and is able to return
whatever they want when we read a sector) - if the MAC doesn't match (because
the attacker is attempting to forge data), we consider the device to be failing
and very shortly we're going to stop using it - we won't keep rereading data
that appears to be corrupt indefinitely. So the attacker gets only a very small
number (on the order of 10) of attempts to forge a particular extent. In the
very worst case, if we're trying very hard to migrate data off a device that
appears to be bad, the attacker might get ~10 attempts multiplied by the number
of extents on the device - but the number of forgery attempts is clearly
bounded.
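
To put rough numbers on that: a single forgery attempt against an 80 bit MAC
succeeds with probability about 1 / 2^80, so even with on the order of 10
attempts per extent the chance of forging any one extent is roughly

    10 / 2^80, i.e. about 2^-77

which remains negligible even multiplied by a large number of extents.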

If the user is in an environment where transient failures/corruption are
expected and should be tolerated instead of assuming the device is bad (e.g. the
disks are accessed over the network, and the network path is known to corrupt
data), 128 bit MACs should be used (and in the future we may enforce that if the
maximum number of read retries is set to more than a small number, 128 bit MACs
must be used if encryption is in use).

#### Extent nonces

We don't wish to simply add a random 96 bit nonce to every extent - that would
inflate our metadata size by a very significant amount. Instead, keys (of which
extents are a subset) have a 96 bit version number field; when encryption is
enabled, we ensure that version numbers are enabled and every new extent gets a
new, unique version number.

However, extents may be partially overwritten or split, and then to defragment
we may have to rewrite those partially overwritten extents elsewhere. We cannot
simply pick a new version number when we rewrite an extent - that would break
the semantics that other uses of version numbers depend on.

When we rewrite an extent, we only write the currently live portions of the
extent - we don't rewrite the parts that were overwritten. That means we can't
simply write it out with the same nonce as the original extent: we would be
encrypting different data under the same keystream.

If we concatenated the version number with the offset within the file, and the
extent's current size - that would work, except that it would break fcollapse(),
which moves extents to a new position within a file. We are forced to add some
additional state to extents.

We could add a small counter that is incremented every time the size of an
extent is reduced (and the data it points to changes); we can easily bound the
size of the counter we need by the maximum size of a checksummed extent. But
this approach fails when extents are split.

What can work is if we add a field for "offset from the start of the original
extent to the start of the current extent" - updating that field whenever we
trim the front of an extent.

If we have that, then we could simply skip ahead in the keystream to where the
currently live data lived in the original extent - there's no problem with nonce
reuse if you're encrypting exactly the same data. Except - that fails with
compression, since if we take an extent, drop the first 4k, and compress it,
that won't give the same data as if we compress it and then drop the first 4k of
the compressed data.

The approach almost works though, if we take that offset and use it as part of
our nonce: what we want to do is construct a function that will output the same
nonce iff two extents (fragments of the same original extent) really are the
same data.

Offset into the original extent works in the absence of compression - two
fragments with the same offset but different sizes will be equal in their common
prefix, ignoring compression. We can handle compression if we also include both
the current size, and the current compression function - offset + current size
uniquely determines the uncompressed data, so, offset + current size +
compression function will uniquely determine the compressed output.
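
As a sketch, the inputs that determine an extent fragment's nonce are then
(purely illustrative - the real code packs these into the cipher nonce in its
own way):

    #include <stdint.h>

    /*
     * Illustrative only: two fragments of the same original extent get the
     * same nonce iff all of these match - i.e. iff they encrypt to exactly
     * the same bytes.
     */
    struct extent_nonce_inputs {
            uint64_t        version_lo;     /* 96 bit version number of the */
            uint32_t        version_hi;     /*   original extent */
            uint32_t        offset;         /* offset of this fragment from the
                                               start of the original extent */
            uint32_t        size;           /* current size of the fragment */
            uint8_t         compression;    /* current compression type */
    };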

#### Nonce reuse on startup

After recovery, we must ensure we don't reuse existing version numbers - we must
ensure that newly allocated version numbers are strictly greater than any
version number that has ever been used before.

The problem here is that we use the version number to write the data before
adding the extent with that version number to the btree: after unclean shutdown,
there will have been version numbers used to write data for which we have no
record in the btree.

The rigorous solution to this is to add a field (likely to the journal header)
that indicates version numbers smaller than that field may have been used.
However, we don't do that yet - it's not completely trivial since it'll add
another potential dependency in the IO path that needs some analysis.

The current solution implemented by the code is to scan every existing version
number (as part of an existing pass), and set the next version number to
allocate to be 64k greater than the highest existing version number that was
found.
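
A sketch of that recovery behaviour (illustrative; version numbers are actually
96 bits on disk, and the scan is folded into an existing pass):

    #include <stdint.h>

    #define VERSION_SLOP    (64 * 1024)

    static uint64_t next_version;   /* next version number to allocate */

    /* Called for every version number found while walking existing extents: */
    static void version_found(uint64_t v)
    {
            if (next_version < v + VERSION_SLOP)
                    next_version = v + VERSION_SLOP;
    }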