TODO:

Pretty printing: Sun May 8 11:34:15 PM EDT 2022

X convert lib/vsprintf.c to printbuf
  this gets us a much saner calling convention, and some new helpers for all
  our existing pretty printers
X refactor existing pretty printers to not use separate buffers on the
  stack; with the new helpers it's not needed
X add %pf(%p) syntax for calling pretty printers
X re-add other printbuf formatting (indenting, tab stops)
X re-add optional heap allocation
- pretty printers need to not be looking at the format string; this should
  all be replaced with normal function arguments

OOM debugging: Fri Apr 22 07:36:08 PM EDT 2022

- OOM report improvements: these depend on the pretty-printer improvements
  X pretty printer improvements: printbuf v3
  - mail out printbuf v4
  - shrinker_to_text: we want a shrinker_to_text() function and a
    shrinker.to_text method for getting information on shrinker internals.
    shrinker_to_text() (vmscan.c) will call shrinker.to_text() (private
    method) to get e.g. lru_list.c internals.
    shrinker_to_text() should be used by _both_ the OOM report and the
    sysfs/debugfs interface that Roman's working on
  - add tracking so we know how many slab objects are allocated vs. how many
    are on the LRU
  - dump the OOM report to the trace buffer, pstore?
  - Top-level division: kernel memory vs. user memory
- Other memory reclaim debugging:
  X idea from the LSF talk: let's add a watchdog - update a timestamp every
    time vmscan.c makes progress on something (a shrinker, or a thing that
    can be freed from); when memory reclaim gets stuck and we OOM, this way
    we can already say _who_ got stuck
  - another thing from the LSF talk: we should be including information on
    memory fragmentation in the OOM report
  - also: we should be tracking memory utilization by page type (from the
    page type hierarchy that Willy has been defining) and including this in
    the show_mem report
    - this gets us slab vs. lru memory, which hocko wanted
  - top-level division: kernel vs.
user memory
- another wishlist item of mine: I want to consolidate the show_mem() report
  and the actual OOM report as much as is possible and reasonable - they're
  pretty similar conditions!
- add top user tasks to the show_mem() report
- ratelimit the show_mem report on allocation failure (not with the standard
  printk_ratelimited!)
- prune down the OOM report, it's currently quite expensive (maybe not print
  all user tasks? need to look at what it's currently doing)

IOCTL v2:

- Registry by name: drivers allocate ioctl numbers at runtime, and userspace
  programs query at startup to get the ioctl number for a given ioctl name.
  This gets us real namespacing, similar to how OpenGL extensions work
- Make ioctl arguments work like normal syscall arguments: probably
  internally passed as an array of u64s
  - arguments must be integers, or things that can be cast to integers
  - if a signed integer, sign extend to s64: this avoids compat issues when
    kernel/userspace disagree on sizeof(long)

Kernel workqueues improvement: Mon Apr 25 01:49:04 AM EDT 2022

- Workqueues should rename themselves to the name of the work function
  they're running

Testing improvements: Tue May 3 02:54:58 PM EDT 2022

- Make ktest better at bailing out and returning an error when a test dies,
  i.e. in a kernel oops
- Create a virtual block device driver that injects random delays, with
  different statistical distributions
- Use dm-writelog to _reorder_ writes
- RCU testing: from Paul

> We discussed the possibility of providing sub-second RCU CPU stall warning
> timeouts, and it turns out that current course and speed should get us
> there in v5.19.
>
> The trick is that v5.19 will likely add a millisecond-scale
> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT Kconfig option that affects only expedited
> grace periods.  However, you can also specify the rcupdate.rcu_expedited=1
> kernel boot parameter, which causes calls to synchronize_rcu() to act
> as if they were calls to synchronize_rcu_expedited().
> This causes your kernel to execute ample quantities of expedited grace
> periods, which in turn allows you to detect RCU read-side critical
> sections exceeding (say) 100 milliseconds.
>
> This patch is in -rcu, along with another that avoids false positives
> due to the fact that expedited grace periods are currently handled
> by non-realtime workqueues.  This other patch provides a
> CONFIG_RCU_EXP_KTHREAD Kconfig option that uses a real-time-priority
> variant of workqueues.

Superblock being modified by another process:

- Add a cookie field to the superblock and generate a new one on every
  mount - that way we'll know definitively if it's another process screwing
  with us

Rebalance: Wed Jun 22 06:09:44 PM EDT 2022

- We need to track the destination target on a per-bucket basis, so that
  copygc/rebalance can do sensible things when one device is full and we
  need to destage data off it
- Need to add a background target to the write point, and make sure all
  writes to a given write point have the same background target

Scrub: Mon Jun 13 07:54:15 PM EDT 2022

- On IO error (not CRC error?), the device might have been unplugged/gone
  away: we need a way of checking if the device is still alive (i.e. if
  other IOs are completing) before declaring a replica dead
- Now that we've got backpointers, we have the option of doing the scrub in
  keyspace order or LBA order - and for rotating disks LBA order would be
  preferred. So we need to be able to tell the scrub path "check these
  replicas", like we've been teaching the data update path "move these
  replicas".
- How is scrub for erasure coding going to work?
  Perhaps we should leave scrub of erasure coded pointers up to the EC code

Rebalance is causing fragmentation with background_compression:

- migrate_index_update isn't checking the background_compression option, so
  it's not keeping the right pointers
- Need to redo the data update path so we're telling it explicitly which
  pointers in the original extent we don't want

Zoned device support (ZNS SSDs!):

Zones just map to buckets; the existing copygc does the work of the FTL.

- initial code is done
  X getting the zone map
  X devices with variable-sized zones
  X superblock in a ring buffer across two buckets
  X hooks for journal code
- normal data IO is easy: open buckets map to appendable zones
- btree is harder: we need to add fragmented extents, and move btree
  allocation to when we append to a btree node - but the hard part is done,
  since we're already updating pointers in the parent node after every write

Journal: Mon Apr 25 02:47:29 AM EDT 2022

- fsync lock contention on the journal lock:
    dbench -t 30 -D /mnt/scratch 256
  compare to ext4; ext4 seems to be fastest
- bcachefs fs usage needs to print information about where the journal is:
  X GB allocated, X GB currently in use on each device
- We need an easy way to specify at format time which devices should be used
  for the journal; currently, mixed SSD/HDD filesystems mount slowly because
  of the journal read on the HDDs
  - also, old filesystems allocated the journal in reverse order - is it
    worth writing a tool to defragment the journal? we shouldn't be doing
    this in new filesystems
- the journal resize command can't yet decrease the amount of journal used
  on a device - can it?
  should be an easy fix

Disk space accounting:

- Needs to be reworked to move accounting from the journal to a new btree
- Need to add compressed disk space accounting / compression ratio
- Want average extent size

Persistent counters: Mon Apr 25 02:29:16 AM EDT 2022

- break out IO stats (amount of IO done, by type) more
  - data writes should be broken out into foreground, copygc, rebalance, etc.
- per-device iodone stats looked wrong at one point, check them manually

Compression: Mon Apr 25 01:35:16 AM EDT 2022

- Multithreaded compression:
  - We currently bottleneck on single-threaded compression in the write
    path. To fix this, we need to:
    - move compression out from under the write point lock
      - we will occasionally have to recompress, if we discover the
        compressed extent is bigger than the amount of space left in the
        write point - but not usually
    - decide where to punt to a workqueue in the write path
      - go over the implications of punting writes to a workqueue:
        currently, we avoid punting to a workqueue in the write path
        whenever possible, because writes need to block to signal
        backpressure from lower down in the stack. However, sometimes (AIO)
        we do _not_ want writes to block if it can be avoided - something to
        think about
- Dictionaries:
  - Add a superblock section for compression dictionary (plural?)
    - Actually no: compression dictionaries are too big for a single btree
      key - they should be files, referred to by inode number
  - can't ever get rid of a compression dictionary that's been used!
  - compression_dictionary option, per filesystem & per file

Option improvements: Mon Apr 25 01:49:08 AM EDT 2022

- Currently, the way options are stored in the superblock is a bit of a
  mess: they're scattered around the flags portion of the main (fixed
  length) superblock section
- Options should be moved to their own variable-length superblock section,
  where they are referred to by an enum from the master option x-macro list
- Currently, an option - after parsing - is always just a single integer,
  but some options might want more structure - e.g. compression, where it
  would be nice to also be able to specify the compression level
  - why not just use the high bits of the compression option for the
    compression level?

Disk groups: Mon Apr 25 12:00:18 AM EDT 2022

- The disk groups we have right now allow a disk to be in just one group,
  which is good for giving us a hierarchy for all devices in a filesystem.
  But we probably want another type of group - a "symlink group" - that
  isn't used for recursive enumeration, but lets disks be in more than one
  group

Inherited attributes: Mon Apr 25 12:02:47 AM EDT 2022

- fsck doesn't check that inode attributes are inherited correctly
- are inherited attributes fully transactional now?
  They should be
  - mv returns -EXDEV if inherited attributes have to change, which tells
    coreutils to fall back to a file-by-file move - but we've never checked
    that all is good here, and we have a report (via Daniel) that it's not
  - No, they're not: setting attributes on a directory is not transactional;
    cmd_attr.c recursively calls the reinherit-attrs ioctl on children
  - This needs to be documented

ACL inheritance: Mon Apr 25 01:07:44 AM EDT 2022

- look into how this works and whether it works correctly on bcachefs

Fri Apr 22 06:54:29 PM EDT 2022

- in nochanges mode, the btree node cannibalize path can get stuck: we need
  to make it check for all nodes being dirty and have it bail out

Userspace kernel shim: Sat May 7 10:50:02 PM EDT 2022

- our shrinker implementation is really stupid: it looks at /proc/meminfo
  and runs shrinkers if the ratio of MemAvailable to MemTotal is too low

Thu Apr 21 10:33:43 PM EDT 2022

- journal_key_search causing mount to use 100% CPU at points:
  btree_iter needs to remember our position in journal_keys

Tue Apr 19 03:16:00 AM EDT 2022

- device remove tests: occasionally get invalid journal entries with
  accounting for the device being removed

- backpointers
  X decide what to do about cached data.
    if cached pointers get backpointers, btree backpointers need the bucket
    gen, because we can't atomically delete all btree backpointers
  X backpointers to btree nodes
  X backpointers for compressed data
  X check & repair code
  - error injection tests

- Fsck
  - fsck passes should be enumerated, with an x-macro
    - add a commandline option to specify which pass(es) to run
  - enumerate inconsistent/fsck errors
    - record error counts in the superblock, error counts when the last
      fsck completed, timestamp of the last error
    - by default, only run fsck passes for errors we've seen
  - memoize fsck errors so transaction restarts don't print them twice
  - Add code to fsck to check for extents that have both encrypted and
    unencrypted replicas
  - fsck should be doing some checking of sb_layouts
  - fsck isn't verifying that pointers match the stripe they point to
    (trans_mark_stripe_ptr is, mark_stripe_ptr isn't)
  - fsck doesn't check that inode attributes are inherited correctly
  - lift the progress bar indicator code out of the data job command, and
    also use it in the fsck code to give each pass a progress bar

Hash tables: Sat May 7 10:55:03 PM EDT 2022

- we're still using jhash in places - jhash is terrible, we should replace
  it with siphash

Inodes: Mon Apr 25 02:53:40 AM EDT 2022

- add another inode field that's like i_sectors, but takes into account
  replication & compression
- We should've used zigzag encoding for the inode timestamp fields - do we
  have to rev the inode format again?
- we always return stat.st_size == 0 for directories; this breaks some
  Gentoo build scripts - should we/can we match what other filesystems are
  doing?
- test more thoroughly with journal_write_delay=0

- extend btree root repair to integrate with topology repair and handle any
  missing interior btree node
  - Still needs to be merged!
  - Topology repair could probably be refactored

X add accounting for the number of buckets that need gc before they can be
  reused

Cache coherency with the btree key cache: Mon Apr 25 02:37:51 AM EDT 2022

- this is now almost complete: we just need to ensure that on insertion of a
  new key (not an overwrite of an existing key), the flush to the btree
  happens in the same transaction (so that code scanning the btree knows the
  key exists and can check the key cache)
- once this happens, we can re-enable the btree key cache for inodes

Data jobs: Mon Apr 25 03:13:12 AM EDT 2022

- Rebalance - now that backpointers is almost done, this is the last major
  CPU hog
  - implement a rebalance_work btree that uses KEY_TYPE_set, where a set key
    indicates an extent at the same pos needs something done by rebalance
    (moved to a background device, compressed)
  - rebalance should limit itself when copygc can't keep up
  - unconditionally start copygc & rebalance when going rw; kill the rw_late
    thing
- Add a way to scan and list/make visible extents that don't match data
  options
- Data job visibility - we have some of this, but it needs to be better
  documented
  - add current rates of keys seen/moved and sectors seen/moved to sysfs
    - data_progress, rolling average over the past few seconds - we have
      data job _progress_ in sysfs, but we also want _rate_
  - progress bars for fsck, and for mount with fsck enabled, would be really
    nice
- Scrub still needs to be implemented
  - Add a scrub option, pass it to the data job path
- Self healing: on read error and successful retry, rewrite the failed
  replica (initiate a scrub operation on that extent?)
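The "rolling average over the past few seconds" rate counter for data job visibility could be a simple time-weighted EWMA fed from the existing progress counters. A minimal userspace sketch, with hypothetical names (rate_ewma, rate_update - none of this is existing bcachefs code):

```c
#include <stdint.h>

/*
 * Hypothetical sketch of a "rate over the past few seconds" counter for
 * data job visibility - not bcachefs code.  Feed it the cumulative counter
 * (e.g. keys moved) and the current time; it smooths the instantaneous
 * rate with a time constant of roughly EWMA_SECONDS.
 */

#define EWMA_SECONDS	8

struct rate_ewma {
	uint64_t	last_count;	/* counter value at last update */
	uint64_t	last_time_ns;	/* time of last update */
	double		rate;		/* smoothed units per second */
};

static void rate_update(struct rate_ewma *r, uint64_t count, uint64_t now_ns)
{
	double dt = (now_ns - r->last_time_ns) / 1e9;

	if (dt > 0) {
		double inst = (count - r->last_count) / dt;
		/* weight the new sample by dt relative to the time constant */
		double w = dt / (dt + EWMA_SECONDS);

		r->rate += w * (inst - r->rate);
		r->last_count	= count;
		r->last_time_ns	= now_ns;
	}
}
```

Weighting by dt makes the smoothing independent of how often sysfs polls the counter; irregular update intervals still decay with the same ~8 s time constant.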
- we're not checking the return value of bch2_inconsistent_error() - it
  returns whether we're continuing or halting; some uses should be switched
  to bch2_fatal_error()
- jset_entry_dev_usage & related shouldn't be packed (or they should also be
  aligned(8) - but this will change sizeof, which we currently use)
- when a drive is yanked, keep track of uncommitted writes (writes since the
  last journal flush) - those will have been silently degraded (corrupted)

Filesystem status:

- need an easy way to tell if the fs is degraded
- how many devices are required to mount - we have this internally, we just
  need to expose it better

Snapshots:

- add an option for mounting a specific subvol
- fsck needs to check that there is always a single subvol dirent for each
  subvol
- ro snapshots - need to mark the page cache as unreserved on snapshot
  creation; snapshot creation is not atomic w.r.t. the page cache
  - it would be really nice for the page cache to be shared between
    snapshots (upstream is working on this for reflink, finally!)
- different st_dev for each subvolume
- disk space accounting
- write more tests
- add mounting by subvolume

NFS:

- figure out where nfs gets the fsid from; seeing weird hangs with nfs4

Superblock:

- sb_layout - write a copy 4k from the end of the device
  - format should be doing this now, but we should have tooling for
    examining/changing the superblock layout
- fsck should be doing some checking of sb_layouts
- encryption: the superblock is not using nonces correctly, which is why it
  doesn't get a MAC when encryption is on
- Superblock writes aren't marked flush/fua - need to analyze this for
  correctness, particularly for superblock writes that are done while we're
  RW and the fs is dirty

- We need to detect when we're blocking on an allocation that will never
  complete - i.e.
  because disks are mismatched in size - so that we can fail the allocation
  and also document clearly what is going on

Core btree code: Mon Apr 25 02:32:36 AM EDT 2022

- generic/176 - lots of transaction restarts
- with transactional interior btree updates, b->will_make_reachable can
  probably be simplified
- can we change bch2_btree_update_start() to not use uninterruptible sleep?
  it's a common place to get stuck in D state
- We need to improve running in low-memory conditions: since we unlock to
  read in btree nodes, nothing is preventing nodes we need in traverse_all()
  from being reclaimed while we read in the next node

Replication:

- Add a notification mechanism for drive failure
- Add an option to automatically rereplicate when a drive dies, if we have
  sufficient space
- replication.ktest recovery: with a 32k bucket size, we go into an infinite
  loop where the do_discards worker is continually discarding buckets and
  updating the needs_discard and freespace btrees - and then those btree
  updates require new allocations and free old nodes, which means the
  do_discards worker has more work to do
- on device removal, we got wedged with all open_buckets allocated, mostly
  for btree: it appears we need a new watermark for
  btree_update_nodes_written/btree_node_update_key

Tiering:

- copygc/rebalance/filesystem capacity doesn't behave reasonably on tiered
  filesystems: because filesystem capacity is sum(all fs devices), rebalance
  will try to shove more data onto backing devices than will fit, and then
  copygc spins because it's trying to free up space on those completely full
  devices. We need to think more about how we calculate/expose filesystem
  capacity.
  to reproduce: stress.ktest

Erasure coding:

- when creating a new stripe that reuses an existing stripe, we don't need
  to read all of the existing stripe unless we get an error reading the
  blocks we want to reuse
- fsck isn't verifying that pointers match the stripe they point to
  (trans_mark_stripe_ptr is, mark_stripe_ptr isn't)
- once backpointers is
  merged, erasure coding needs to be updated to use it instead of its own
  in-memory backpointers
  - and check for pointers that weren't updated with the stripe pointer
    after an unclean shutdown
- currently, updating extents to add the stripe ptr can race with reflink
  and fail

Performance:

- 4 context switches per dio write - what?
- eliminate the indirect function call to bch2_write_index_default
- the bch2_rbio_narrow_crcs() path needs to change to deliver the read
  completion first, then punt to a different workqueue
- io_in_flight semaphore - is it time to nuke that? the block layer has
  writeback throttling now; we should at least make it configurable
- see if we can do something about delete times

Performance visibility:

- need improved lock contention stats
- add time_stats for btree lock held time, broken out by transaction->fn &
  call site

- support idmapped mounts: crib off of xfs commit
  f736d93d76d3e97d6986c6d26c8eaa32536ccc5c - done?
- fix all sparse warnings - make C=2 fs/bcachefs/ throws errors
- Make sure tests fail on data checksum errors
- we're getting more btree node cache eviction than we should be, because a
  single NUMA node can be OOM: it might be worth moving nodes to different
  NUMA nodes. Also, the shrinker needs to be NUMA-aware.
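The zigzag encoding mentioned under inodes (for the timestamp fields) is the standard trick (used e.g. by protobuf varints) of interleaving negative and positive values so that small-magnitude signed values - including slightly-negative timestamps - become small unsigned values that encode compactly. A self-contained sketch:

```c
#include <stdint.h>

/*
 * Zigzag encoding: maps 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
 * so signed values of small magnitude become small unsigned values,
 * which then varint-encode in few bytes.
 */

static inline uint64_t zigzag_encode(int64_t v)
{
	/* arithmetic shift replicates the sign bit into all 64 bits */
	return ((uint64_t) v << 1) ^ (uint64_t) (v >> 63);
}

static inline int64_t zigzag_decode(uint64_t v)
{
	return (int64_t) (v >> 1) ^ -(int64_t) (v & 1);
}
```

With plain two's-complement storage, a timestamp of -1 occupies a maximal-magnitude bit pattern; zigzag keeps it to one byte under a varint scheme, which is presumably the appeal for the inode format.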