summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorBrian Foster <bfoster@redhat.com>2023-05-01 07:09:33 -0400
committerKent Overstreet <kent.overstreet@linux.dev>2023-05-12 19:42:33 -0400
commitc335c304e13faf861f786f0c7aaebb031a5d4947 (patch)
treebd993e4f68d9f0c120cdefb8fa546f549af63da0
parentf1de5fe05168eee89e59a86cdddbd494d82f4957 (diff)
bcachefs: fix accounting corruption race between reclaim and dev add
When a device is removed from a bcachefs volume, the associated content is removed from the various btrees. The alloc tree uses the key cache, so when keys are removed the deletes exist in cache for a period of time until reclaim comes along and flushes outstanding updates. When a device is re-added to the bcachefs volume, the add process re-adds some of these previously deleted keys. When marking device superblock locations on device add, the keys will likely refer to some of the same alloc keys that were just removed. The memory triggers for these key updates are responsible for further updates, such as bch2_mark_alloc() calling into bch2_dev_usage_update() to update per-device usage accounting. When a new key is added to key cache, the trans update path also flushes the key to the backing btree for coherency reasons for tree walks. With all of this context, if a device is removed and re-added quickly enough such that some key deletes from the remove are still pending a key cache flush, the trans update path can view this as addition of a new key because the old key in the insert entry refers to a deleted key. However the deleted cached key has not been filled by absence of a btree key, but rather refers to an explicit deletion of an existing key that occurred during device removal. The trans update path adds a new update to flush the key and tags the original (cached) update to skip running the memory triggers. This results in running triggers on the non-cached update instead, which in turn will perform accounting updates based on incoherent values. For example, bch2_dev_usage_update() subtracts the the old alloc key dirty sector count in the non-cached btree key from the newly initialized (i.e. zeroed) per device counters, leading to underflow and accounting corruption. There are at least a few ways to avoid this problem, the simplest of which may be to run triggers against the cached update rather than the non-cached update. If the key only needs to be flushed when the key is not present in the tree, however, then this still performs an unnecessary update. We could potentially use the cached key dirty state to determine whether the delete is a dirty, cached update vs. a clean cache fill, but this may require transmitting key cache dirty state across layers, which adds complexity and seems to be of limited value. Instead, update flush_new_cached_update() to handle this by simply checking for the key in the btree and only perform the flush when a backing key is not present. Signed-off-by: Brian Foster <bfoster@redhat.com>
-rw-r--r--fs/bcachefs/btree_update_leaf.c22
1 files changed, 16 insertions, 6 deletions
diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
index b8299914a0ab..be4c5df42be8 100644
--- a/fs/bcachefs/btree_update_leaf.c
+++ b/fs/bcachefs/btree_update_leaf.c
@@ -1501,21 +1501,31 @@ static noinline int flush_new_cached_update(struct btree_trans *trans,
unsigned long ip)
{
struct btree_path *btree_path;
+ struct bkey k;
int ret;
- i->key_cache_already_flushed = true;
- i->flags |= BTREE_TRIGGER_NORUN;
-
btree_path = bch2_path_get(trans, path->btree_id, path->pos, 1, 0,
BTREE_ITER_INTENT, _THIS_IP_);
-
ret = bch2_btree_path_traverse(trans, btree_path, 0);
if (ret)
- goto err;
+ goto out;
+
+ /*
+ * The old key in the insert entry might actually refer to an existing
+ * key in the btree that has been deleted from cache and not yet
+ * flushed. Check for this and skip the flush so we don't run triggers
+ * against a stale key.
+ */
+ bch2_btree_path_peek_slot_exact(btree_path, &k);
+ if (!bkey_deleted(&k))
+ goto out;
+
+ i->key_cache_already_flushed = true;
+ i->flags |= BTREE_TRIGGER_NORUN;
btree_path_set_should_be_locked(btree_path);
ret = bch2_trans_update_by_path_trace(trans, btree_path, i->k, flags, ip);
-err:
+out:
bch2_path_put(trans, btree_path, true);
return ret;
}