Age | Commit message | Author
2023-03-17 | foo [bcachefs-btree-write-buffer-via-journal] | Kent Overstreet
2023-03-17 | foo | Kent Overstreet
2023-03-17 | bcachefs: btree write buffer: Do write buffer updates via journal | Kent Overstreet
Instead of appending to the write buffer in the transaction commit path, remember that we also have everything we need in the journal: This adds a new journal entry type, BCH_JSET_ENTRY_buffered_keys, for keys that needed to be added to the write buffer. Before doing a journal write, in our compaction pass, we find those journal entries and add them to the write buffer, and write them out with the normal BCH_JSET_ENTRY_btree_keys type. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
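The compaction step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct entry`, `struct write_buffer`, the `ENTRY_*` names and `compact_before_write()` are hypothetical stand-ins, not the real bcachefs journal structures or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; the real bcachefs types differ. */
enum entry_type { ENTRY_btree_keys, ENTRY_buffered_keys };

struct entry {
	enum entry_type type;
	unsigned long   key;
};

struct write_buffer {
	unsigned long keys[16];
	size_t        nr;
};

/*
 * Compaction pass run before a journal write: copy the key of every
 * buffered entry into the write buffer, then retag the entry as a normal
 * btree_keys entry so it is written out with the ordinary type.
 * Returns the number of entries moved, or -1 if the write buffer is full.
 */
int compact_before_write(struct entry *entries, size_t nr,
			 struct write_buffer *wb)
{
	int moved = 0;

	for (size_t i = 0; i < nr; i++) {
		if (entries[i].type != ENTRY_buffered_keys)
			continue;
		if (wb->nr == sizeof(wb->keys) / sizeof(wb->keys[0]))
			return -1;
		wb->keys[wb->nr++] = entries[i].key;
		entries[i].type = ENTRY_btree_keys;
		moved++;
	}
	return moved;
}
```

The point of the sketch is that the commit path only tags the entry; the copy into the write buffer is deferred to the pass that already walks the journal entries.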
2023-03-17 | bcachefs: Improved transaction restart debugging XXX | Kent Overstreet
In debug mode, on transaction restart we now save the full backtrace. We're chasing a bug where we lose a transaction restart error - this will make it easy to track down.
------------[ cut here ]------------
Feb 17 15:57:47 extravaganza.localdomain 4,2210,1695278469,-;do not call blocking ops when !TASK_RUNNING; state=2 set at [<00000000f728d589>] __six_lock_type_slowpath.constprop.0+0x407/0x7a0
Feb 17 15:57:47 extravaganza.localdomain 4,2211,1695278530,-;WARNING: CPU: 4 PID: 76 at kernel/sched/core.c:9862 __might_sleep+0xd9/0xe0
Feb 17 15:57:47 extravaganza.localdomain 4,2212,1695278578,-;Modules linked in: netconsole nfnetlink snd_seq_dummy snd_hrtimer binfmt_misc zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) intel_rapl_msr intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp snd_hda_codec_realtek kvm_intel snd_hda_codec_generic mei_pxp snd_hda_codec_hdmi mei_hdcp snd_hda_intel kvm snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_seq_midi eeepc_wmi rapl snd_hda_core snd_seq_midi_event wmi_bmof intel_cstate joydev serio_raw snd_hwdep snd_rawmidi snd_pcm ee1004 input_leds snd_seq snd_seq_device snd_timer snd mei_me soundcore mei mac_hid acpi_pad msr parport_pc ppdev lp parport pstore_blk ramoops reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 overlay isofs nls_iso8859_1 dm_mirror dm_region_hash dm_log amdgpu iommu_v2 gpu_sched drm_buddy hid_logitech_hidpp hid_logitech_dj hid_generic uas usbhid hid usb_storage nouveau radeon i2c_algo_bit drm_ttm_helper ttm
Feb 17 15:57:47 extravaganza.localdomain 4,2213,1695278892,c; drm_display_helper crct10dif_pclmul crc32_pclmul cec polyval_clmulni rc_core polyval_generic ghash_clmulni_intel drm_kms_helper syscopyarea nvme sysfillrect sha512_ssse3 sysimgblt aesni_intel fb_sys_fops crypto_simd mfd_aaeon cryptd i2c_i801 nvme_core psmouse drm e1000e asus_wmi xhci_pci ledtrig_audio i2c_smbus sparse_keymap nvme_common ahci xhci_pci_renesas libahci platform_profile mxm_wmi video wmi
Feb 17 15:57:47 extravaganza.localdomain 4,2214,1695279104,-;CPU: 4 PID: 76 Comm: kworker/4:1 Tainted: P W O 6.1.12+bcachefs.git20230214.375685a54-1-debug #1
Feb 17 15:57:47 extravaganza.localdomain 4,2215,1695279151,-;Hardware name: System manufacturer System Product Name/Z170 PRO GAMING, BIOS 1904 07/05/2016
Feb 17 15:57:47 extravaganza.localdomain 4,2216,1695279192,-;Workqueue: bcachefs_btree_io btree_node_write_work
Feb 17 15:57:47 extravaganza.localdomain 4,2217,1695279243,-;RIP: 0010:__might_sleep+0xd9/0xe0
Feb 17 15:57:47 extravaganza.localdomain 4,2218,1695279279,-;Code: a0 14 00 00 4c 89 ff 48 89 4d d0 e8 61 b9 43 00 49 8b 95 a0 14 00 00 48 8b 4d d0 44 89 f6 48 c7 c7 a0 65 6a a7 e8 a2 bf 6e 01 <0f> 0b eb 88 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48
Feb 17 15:57:47 extravaganza.localdomain 4,2219,1695279331,-;RSP: 0018:ffffc900005b6f40 EFLAGS: 00010246
Feb 17 15:57:47 extravaganza.localdomain 4,2220,1695279377,-;RAX: 0000000000000000 RBX: ffffffffa7f2a392 RCX: 0000000000000000
Feb 17 15:57:47 extravaganza.localdomain 4,2221,1695279418,-;RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Feb 17 15:57:47 extravaganza.localdomain 4,2222,1695279464,-;RBP: ffffc900005b6f70 R08: 0000000000000000 R09: 0000000000000000
Feb 17 15:57:47 extravaganza.localdomain 4,2223,1695279497,-;R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000112
Feb 17 15:57:47 extravaganza.localdomain 4,2224,1695279538,-;R13: ffff888102c30000 R14: 0000000000000002 R15: ffff888102c314a0
Feb 17 15:57:47 extravaganza.localdomain 4,2225,1695279584,-;FS: 0000000000000000(0000) GS:ffff8887d4e00000(0000) knlGS:0000000000000000
Feb 17 15:57:47 extravaganza.localdomain 4,2226,1695279627,-;CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 17 15:57:47 extravaganza.localdomain 4,2227,1695279668,-;CR2: 00005572cc5e70d8 CR3: 0000000114978005 CR4: 00000000003706e0
Feb 17 15:57:47 extravaganza.localdomain 4,2228,1695279708,-;DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 17 15:57:47 extravaganza.localdomain 4,2229,1695279753,-;DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 17 15:57:47 extravaganza.localdomain 4,2230,1695279801,-;Call Trace:
Feb 17 15:57:47 extravaganza.localdomain 4,2231,1695279845,-; <TASK>
Feb 17 15:57:47 extravaganza.localdomain 4,2232,1695279888,-; ? __six_lock_type_slowpath.constprop.0+0x407/0x7a0
Feb 17 15:57:47 extravaganza.localdomain 4,2233,1695279928,-; ? bch2_save_backtrace+0x5b/0x210
Feb 17 15:57:47 extravaganza.localdomain 4,2234,1695279980,-; __kmem_cache_alloc_node+0x290/0x2f0
Feb 17 15:57:47 extravaganza.localdomain 4,2235,1695280030,-; ? load_balance+0x904/0xc20
Feb 17 15:57:47 extravaganza.localdomain 4,2236,1695280074,-; ? bch2_save_backtrace+0x5b/0x210
Feb 17 15:57:47 extravaganza.localdomain 4,2237,1695280122,-; ? bch2_save_backtrace+0x5b/0x210
Feb 17 15:57:47 extravaganza.localdomain 4,2238,1695280172,-; __kmalloc_node_track_caller+0x51/0xf0
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-17 | bcachefs: btree write buffer: inline sort function | Kent Overstreet
lib/sort.c is expensive here, due to the indirect function calls for the comparison function: this adds an inlined version Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
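To illustrate why an inlined sort avoids the cost mentioned above: a generic sort such as the one in lib/sort.c calls its comparison through a function pointer for every comparison, whereas a type-specific sort lets the compiler inline the comparison. The sketch below is an assumption-laden illustration, not the code from this commit: `struct wb_key`, `wb_key_cmp()` and `wb_sort()` are hypothetical names, and insertion sort is used only to keep the example short (the commit message does not say which algorithm the inlined version uses).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical write buffer key: ordered by btree id, then position. */
struct wb_key {
	unsigned      btree;
	unsigned long pos;
};

/* Comparison is static inline, so the sort loop calls it directly. */
static inline int wb_key_cmp(const struct wb_key *l, const struct wb_key *r)
{
	if (l->btree != r->btree)
		return l->btree < r->btree ? -1 : 1;
	if (l->pos != r->pos)
		return l->pos < r->pos ? -1 : 1;
	return 0;
}

/*
 * Type-specific insertion sort: no function pointer, no generic swap
 * callback, so the compiler can inline wb_key_cmp() into the loop.
 */
static inline void wb_sort(struct wb_key *k, size_t nr)
{
	for (size_t i = 1; i < nr; i++) {
		struct wb_key tmp = k[i];
		size_t j = i;

		while (j > 0 && wb_key_cmp(&k[j - 1], &tmp) > 0) {
			k[j] = k[j - 1];
			j--;
		}
		k[j] = tmp;
	}
}
```

Calling `wb_sort(keys, nr)` on an array of `struct wb_key` orders it by btree id and then by position.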
2023-03-16 | bcachefs: Fix try_decrease_writepoints() | Kent Overstreet
- We may need to drop btree locks before taking the writepoint_lock, as is done in other places.
- We should be using open_bucket_free_unused(), so that we don't waste space.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-16 | bcachefs: Add an assert in inode_write for -ENOENT | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-16 | bcachefs: Fix bch2_evict_subvolume_inodes() | Kent Overstreet
This fixes a bug in bch2_evict_subvolume_inodes(): d_mark_dontcache() doesn't handle the case where i_count is already 0, we need to grab and put the inode in order for it to be dropped. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-16 | bcachefs: Improve error handling in bch2_ioctl_subvolume_destroy() | Kent Overstreet
Pure style fixes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-16 | bcachefs: Fix for 'missing subvolume' error | Kent Overstreet
Subvolumes, including their root inodes, get deleted asynchronously after an unlink. But we still need to ensure that we tell the VFS the inode has been deleted, otherwise VFS writeback could fire after asynchronous deletion has finished, and try to write to an inode/subvolume that no longer exists. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-16 | bcachefs: Don't run transaction hooks multiple times | Kent Overstreet
Transaction hooks aren't supposed to run unless we know the transaction is going to commit successfully: this fixes a bug with attempting to delete a subvolume multiple times. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
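The rule stated above can be sketched as a commit function that bails out on a restart before it touches the hooks, so a caller's retry loop runs each hook once. Everything here is a hypothetical simplification: `struct trans`, `trans_commit()`, `count_hook()` and the `restarts_left` counter (which merely simulates commit attempts that must be retried) are not bcachefs APIs.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*trans_hook_fn)(void *arg);

/* Hypothetical, simplified transaction with a single commit hook. */
struct trans {
	trans_hook_fn hook;
	void         *hook_arg;
	int           restarts_left; /* simulated attempts that must restart */
};

/* Example hook: counts how many times it has been run. */
void count_hook(void *arg)
{
	++*(int *)arg;
}

/*
 * Returns 0 on success, -1 if the caller must restart the transaction.
 * Hooks run only on the path that is known to commit.
 */
int trans_commit(struct trans *t)
{
	if (t->restarts_left > 0) {
		t->restarts_left--;
		return -1; /* restart: hooks have not run */
	}
	if (t->hook)
		t->hook(t->hook_arg);
	return 0;
}
```

With this ordering, a loop such as `while (trans_commit(&t) != 0) ;` runs the hook exactly once no matter how many restarts occur.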
2023-03-15 | bcachefs: Add a fallback when journal_keys doesn't fit in ram | Kent Overstreet
We may end up in a situation where allocating the buffer for the sorted journal_keys fails - but it would likely succeed, post compaction where we drop duplicates. We've had reports of this allocation failing, so this adds a slowpath to do the compaction incrementally. This is only a band-aid fix; we need to look at limiting the number of keys in the journal based on the amount of system RAM. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
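A rough sketch of the incremental idea: rather than allocating room for every key up front, keys are appended to a bounded array, and when it fills up the array is sorted and deduplicated in place, keeping only the newest version of each key. The names (`struct jkey`, `jkeys_compact()`, `jkeys_add()`) and the use of userspace `qsort()` are assumptions made for illustration; the actual slowpath in this commit may be structured differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical journal key: a position plus the journal seq it came from. */
struct jkey {
	unsigned long pos;
	unsigned long seq;
};

/* Order by position, then by journal sequence number (oldest first). */
static int jkey_cmp(const void *l, const void *r)
{
	const struct jkey *a = l, *b = r;

	if (a->pos != b->pos)
		return a->pos < b->pos ? -1 : 1;
	if (a->seq != b->seq)
		return a->seq < b->seq ? -1 : 1;
	return 0;
}

/* Sort, then keep only the highest-seq key per position. Returns new count. */
size_t jkeys_compact(struct jkey *k, size_t nr)
{
	size_t dst = 0;

	qsort(k, nr, sizeof(*k), jkey_cmp);
	for (size_t src = 0; src < nr; src++) {
		if (src + 1 < nr && k[src + 1].pos == k[src].pos)
			continue; /* a newer version of this key follows */
		k[dst++] = k[src];
	}
	return dst;
}

/* Append a key, compacting first if the array is full. -1 if still full. */
int jkeys_add(struct jkey *k, size_t *nr, size_t cap, struct jkey key)
{
	if (*nr == cap)
		*nr = jkeys_compact(k, *nr);
	if (*nr == cap)
		return -1;
	k[(*nr)++] = key;
	return 0;
}
```

As the commit message notes, this kind of fallback only helps when duplicates exist; if every key is unique the array still overflows, which is why the key count itself eventually needs to be bounded.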
2023-03-15 | fixup! bcachefs: All held locks must be in a btree path | Kent Overstreet
2023-03-14 | bcachefs: Improve the backpointer to missing extent message | Kent Overstreet
We now print the pos where the backpointer was found in the btree, as well as the exact bucket:bucket_offset of the data, to aid in grepping through logs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Add error message for failing to allocate sorted journal keys | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: New erasure coding shutdown path | Kent Overstreet
This implements a new shutdown path for erasure coding, which is needed for the upcoming BCH_WRITE_WAIT_FOR_EC write path. The process is:
- Cancel new stripes being built up
- Close out/cancel open buckets on write points or the partial list that are for stripes
- Shut down rebalance/copygc
- Then wait for in-flight new stripes to finish
With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill up before they complete; the new ec shutdown path is needed for shutting down copygc/rebalance without deadlocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: bch2_fs_moving_ctxts_to_text() | Kent Overstreet
This also adds bch2_write_op_to_text(): now we can see outstanding moves, useful for debugging shutdown with the upcoming BCH_WRITE_WAIT_FOR_EC and likely for other things in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Private error codes: ENOMEM | Kent Overstreet
This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Fix bch2_check_extents_to_backpointers() | Kent Overstreet
In rare cases, bch2_check_extents_to_backpointers() would incorrectly flag an extent as having a missing backpointer when we just needed to flush the btree write buffer - we weren't tracking the last flushed position correctly. This adds a level field to the last_flushed pos, fixing a bug where we'd sometimes fail on a new root node. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Fix an assert in copygc thread shutdown path | Kent Overstreet
We're not supposed to have nested (locked) btree_trans on the stack: this means copygc shutdown needs to exit our btree_trans before exiting the move_ctxt, which calls bch2_write(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: bch2_bucket_is_movable() -> BTREE_ITER_CACHED | Kent Overstreet
BTREE_ITER_CACHED should really be the default for cached btrees - this is an easy mistake to make. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Don't use BTREE_ITER_INTENT in make_extent_indirect() | Kent Overstreet
This is a workaround for a btree path overflow - searching with BTREE_ITER_INTENT periodically saves the iterator position for updates, which eventually overflows. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | fixup! mm: enable page allocation tagging | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | fixup! lib: add allocation tagging support for memory allocation profiling | Kent Overstreet
2023-03-14 | fixup! lib: code tagging module support | Kent Overstreet
2023-03-14 | fixup! mm: enable page allocation tagging | Kent Overstreet
2023-03-14 | TESTING: set required configurations and request some context captures | Suren Baghdasaryan
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2023-03-14 | bcachefs: Fix stripe create error path | Kent Overstreet
If we errored out on a new stripe before fully allocating it, we shouldn't be zeroing out unwritten data. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Mark new snapshots earlier in create path | Kent Overstreet
This fixes a null ptr deref when creating new snapshots: in the BCH_CREATE_SUBVOL path, bch2_create_trans() will look up the subvolume and find the _new_ snapshot that's being created in that same transaction. We have to call bch2_mark_snapshot() earlier so that it's properly initialized, instead of leaving it for transaction commit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Improve bch2_new_stripes_to_text() | Kent Overstreet
Print out the alloc reserve, and format it a bit more nicely. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Kill bch_write_op->btree_update_ready | Kent Overstreet
This changes the write path to not add write ops to the write_point's list of pending work items until they're ready; this means we have to change the lock protecting it to an irq-safe lock, but means bch2_write_point_do_index_updates() no longer has to iterate over the list, which is beneficial with the way the new BCH_WRITE_WAIT_FOR_EC code works. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Simplify stripe_idx_to_delete | Kent Overstreet
This is not technically correct - it's subject to a race if we ever end up with a stripe with all empty blocks (that needs to be deleted) being held open. But the "correct" version was much too inefficient, and soon we'll be adding a stripes LRU. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Fix next_bucket() | Kent Overstreet
This fixes an infinite loop in bch2_get_key_or_real_bucket_hole(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Second layer of refcounting for new stripes | Kent Overstreet
This will be used for move writes, which will be waiting until the stripe is created to do the index update. They need to prevent the stripe from being reclaimed until their index update is done, so we need another refcount that just keeps the stripe open. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> # Conflicts: # fs/bcachefs/ec.c # fs/bcachefs/io.c
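One way to picture the two layers of refcounting described above: one count tracks writes still filling the stripe, and a second count is held by anyone who only needs the stripe to stay open until their index update is done. The sketch below is a guess at the shape of the idea, not the structure used in fs/bcachefs/ec.c; `struct new_stripe`, its fields and the `stripe_*` helpers are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified new-stripe state with two reference counts. */
struct new_stripe {
	int  io_refs;   /* writes still filling the stripe */
	int  open_refs; /* holders that only need the stripe kept open */
	bool created;   /* set once all data has been written */
	bool reclaimed; /* set once the stripe has been released */
};

/* Reclaim only when the stripe is created and nobody holds it open. */
static void stripe_try_reclaim(struct new_stripe *s)
{
	if (s->created && s->open_refs == 0)
		s->reclaimed = true;
}

/* Drop a write reference; the last one marks the stripe as created. */
void stripe_io_put(struct new_stripe *s)
{
	assert(s->io_refs > 0);
	if (--s->io_refs == 0) {
		s->created = true;
		stripe_try_reclaim(s);
	}
}

/* Drop a keep-open reference, e.g. after a move's index update is done. */
void stripe_open_put(struct new_stripe *s)
{
	assert(s->open_refs > 0);
	if (--s->open_refs == 0)
		stripe_try_reclaim(s);
}
```

In this model a move write would take an `open_refs` reference before waiting on the stripe and drop it with `stripe_open_put()` after its index update, so the stripe cannot be reclaimed in between.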
2023-03-14 | bcachefs: ec: fall back to creating new stripes for copygc | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Rework __bch2_data_update_index_update() | Kent Overstreet
This makes some improvements to the logic for adding/removing replicas, as part of the larger erasure coding improvements. We now directly consider number of replicas desired for the given inode, and extent/pointer durability: this ensures that the extent ends up with the desired number of replicas when we're replacing multiple pointers with one that has higher durability (e.g. erasure coded). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-14 | bcachefs: Extent helper improvements | Kent Overstreet
- __bch2_bkey_drop_ptr() -> bch2_bkey_drop_ptr_noerror(), now available outside extents.
- Split bch2_bkey_has_device() and bch2_bkey_has_device_c(), const and non const versions
- bch2_extent_has_ptr() now returns the pointer it found
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: evacuate_bucket() no longer moves cached ptrs | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: evacuate_bucket() no longer calls verify_bucket_evacuated() | Kent Overstreet
The copygc code itself now calls this when all moves from a given bucket are complete. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: Suppress transaction restart err message | Kent Overstreet
This isn't a real error, and doesn't need to be printed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: Rework open bucket partial list allocation | Kent Overstreet
Now, any open_bucket can go on the partial list: allocating from the partial list has been moved to its own dedicated function, open_bucket_add_buckets() -> bucket_alloc_set_partial(). In particular, this means that erasure coded buckets can safely go on the partial list; the new location works with the "allocate an ec bucket first, then the rest" logic. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: don't bump key cache journal seq on nojournal commits | Brian Foster
fstest generic/388 occasionally reproduces corruptions where an inode has extents beyond i_size. This is a deliberate crash and recovery test, and the post crash+recovery characteristics are usually the same: the inode exists on disk in an early (i.e. just allocated) state based on the journal sequence number associated with the inode. Subsequent inode updates exist in the journal at higher sequence numbers, but the inode hadn't been written back before the associated crash and the post-crash recovery processes a set of journal sequence numbers that doesn't include updates to the inode. In fact, the sequence with the most recent inode key update always happens to be the sequence just before the front of the journal processed by recovery. This last bit is a significant hint that the problem relates to an on-disk journal update of the front of the journal.
The root cause of this problem is basically that the inode is updated (multiple times) in-core and in the key cache, each time bumping the key cache sequence number used to control the cache flush. The cache flush skips one or more times, bumping the associated key cache journal pin to the key cache seq value. This has a side effect of holding the inode in memory a bit longer than normal, which helps exacerbate this problem, but is also unsafe in certain cases where the key cache seq may have been updated by a transaction commit that didn't journal the associated key.
For example, consider an inode that has been allocated, updated several times in the key cache, journaled, but not yet written back. At this stage, everything should be consistent if the fs happens to crash because the latest update has been journaled. Now consider a key update via bch2_extent_update_i_size_sectors() that uses the BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode state, it can have the side effect of bumping ck->seq in bch2_btree_insert_key_cached().
In turn, if a subsequent key cache flush skips due to seq not matching the former, the ck->journal pin is updated to ck->seq even though the most recent key update was not journaled. If this pin happens to reside at the front (tail) of the journal, this means a subsequent journal write can update last_seq to a value beyond that which includes the most recent update to the inode. If this occurs and the fs happens to crash before the inode happens to flush, recovery will see the latest last_seq, fail to recover the inode and leave the inode in the inconsistent state described above. To avoid this problem, skip the key cache seq update on NOJOURNAL commits, except on initial pin add. Pass the insert entry directly to bch2_btree_insert_key_cached() to make the associated flag available and be consistent with btree_insert_key_leaf(). Signed-off-by: Brian Foster <bfoster@redhat.com>
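The fix described above ("skip the key cache seq update on NOJOURNAL commits, except on initial pin add") can be sketched as below. This is a simplified model only: `struct cached_key`, `cached_key_update()` and `UPDATE_NOJOURNAL` are hypothetical stand-ins for `ck`, `bch2_btree_insert_key_cached()` and `BTREE_UPDATE_NOJOURNAL`, and the real function does considerably more.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the BTREE_UPDATE_NOJOURNAL flag. */
#define UPDATE_NOJOURNAL 1u

/* Hypothetical, simplified key cache entry. */
struct cached_key {
	unsigned long seq;    /* journal seq the flush pin will be moved to */
	bool          pinned; /* whether the initial journal pin was added */
};

/*
 * Record an update to a cached key. The seq is always set when the pin is
 * first added; afterwards, updates that were not journaled must not bump
 * it, or the pin could advance past the last journaled update.
 */
void cached_key_update(struct cached_key *ck, unsigned long journal_seq,
		       unsigned flags)
{
	if (!ck->pinned) {
		ck->seq = journal_seq;
		ck->pinned = true;
	} else if (!(flags & UPDATE_NOJOURNAL)) {
		ck->seq = journal_seq;
	}
}
```

In this model, a NOJOURNAL update arriving at a later journal seq leaves `ck->seq` pointing at the last update that actually reached the journal, which is what keeps the pin from moving ahead of recoverable state.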
2023-03-13 | bcachefs: When shutting down, flush btree node writes last | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: Verbose on by default when CONFIG_BCACHEFS_DEBUG=y | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | fixup bcachefs: Use for_each_btree_key_upto() more consistently | Kent Overstreet
2023-03-13 | six locks: be more careful about lost wakeups | Kent Overstreet
This is a workaround for a lost wakeup bug we've been seeing - we still need to discover the actual bug. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: Journal resize fixes | Kent Overstreet
- Fix a sleeping-in-atomic bug due to calling bch2_journal_buckets_to_sb() under the journal lock.
- Additionally, now we mark buckets as journal buckets before adding them to the journal in memory and the superblock. This ensures that if we crash part way through we'll never be writing to journal buckets that aren't marked correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: bch2_btree_iter_peek_node_and_restart() | Kent Overstreet
Minor refactoring for the Rust interface. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: bch2_btree_node_ondisk_to_text() | Kent Overstreet
Pulling out a helper from cmd_list.c, as the rest is being rewritten in Rust but we're not ready to rewrite lower-level btree code in Rust. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-03-13 | bcachefs: bch2_btree_node_to_text() const correctness | Kent Overstreet
This is for the Rust interface - Rust cares more about const than C does. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>