TODO:

Pretty printing: Sun May 8 11:34:15 PM EDT 2022

X convert lib/vsprintf.c to printbuf
  this gets us a much saner calling convention, and some new helpers for all
  our existing pretty printers
X refactor existing pretty printers to not use separate buffers on the
  stack; with the new helpers it's not needed
X add %pf(%p) syntax for calling pretty printers
X re-add other printbuf formatting (indenting, tab stops)
X re-add optional heap allocation
- pretty printers need to not be looking at the format string; this should
  all be replaced with normal function arguments

OOM debugging: Fri Apr 22 07:36:08 PM EDT 2022

- OOM report improvements: these depend on the pretty-printer improvements
  X pretty printer improvements: printbuf v3
  - mail out printbuf v4
  - shrinker_to_text: we want a shrinker_to_text() function and a
    shrinker.to_text method for getting information on shrinker internals.
    shrinker_to_text() (vmscan.c) will call shrinker.to_text() (private
    method) to get e.g. lru_list.c internals.
    shrinker_to_text() should be used by _both_ the OOM report and the
    sysfs/debugfs interface that Roman's working on
  - add tracking so we know how many slab objects are allocated vs. how many
    are on the LRU
  - dump the OOM report to the trace buffer, pstore?
  - Top-level division: kernel memory vs. user memory
- Other memory reclaim debugging:
  X idea from the LSF talk: let's add a watchdog - update a timestamp every
    time vmscan.c makes progress on something (a shrinker, or a thing that
    can be freed from); when memory reclaim gets stuck and we OOM, this way
    we can already say _who_ got stuck
  - another thing from the LSF talk: we should be including information on
    memory fragmentation in the OOM report
  - also: we should be tracking memory utilization by page type (from the
    page type hierarchy that Willy has been defining) and including this in
    the show_mem report
    - this gets us slab vs. lru memory, which hocko wanted
  - top-level division: kernel vs.
user memory
- another wishlist item of mine: I want to consolidate the show_mem() report
  and the actual OOM report as much as is possible and reasonable - they're
  pretty similar conditions!
- add top user tasks to the show_mem() report
- ratelimit the show_mem report on allocation failure (not with the standard
  printk_ratelimited!)
- prune down the OOM report, it's currently quite expensive (maybe not print
  all user tasks? need to look at what it's currently doing)

IOCTL v2:

- Registry by name: drivers allocate ioctl numbers at runtime, and userspace
  programs query at startup to get the ioctl number for a given ioctl name.
  This gets us real namespacing, similar to how OpenGL extensions work
- Make ioctl arguments work like normal syscall arguments: probably
  internally passed as an array of u64s
  - arguments must be integers, or things that can be cast to integers
  - if a signed integer, sign extend to s64: this avoids compat issues when
    kernel/userspace disagree on sizeof(long)

Kernel workqueues improvement: Mon Apr 25 01:49:04 AM EDT 2022

- Workqueues should rename themselves to the name of the work function
  they're running

Testing improvements: Tue May 3 02:54:58 PM EDT 2022

- Make ktest better at bailing out and returning an error when a test dies,
  i.e. in a kernel oops
- Create a virtual block device driver that injects random delays, with
  different statistical distributions
- Use dm-writelog to _reorder_ writes
- RCU testing: from Paul

> We discussed the possibility of providing sub-second RCU CPU stall warning
> timeouts, and it turns out that current course and speed should get us
> there in v5.19.
>
> The trick is that v5.19 will likely add a millisecond-scale
> CONFIG_RCU_EXP_CPU_STALL_TIMEOUT Kconfig option that affects only expedited
> grace periods.  However, you can also specify the rcupdate.rcu_expedited=1
> kernel boot parameter, which causes calls to synchronize_rcu() to act
> as if they were calls to synchronize_rcu_expedited().
> This causes your kernel to execute ample quantities of expedited grace
> periods, which in turn allows you to detect RCU read-side critical
> sections exceeding (say) 100 milliseconds.
>
> This patch is in -rcu, along with another that avoids false positives
> due to the fact that expedited grace periods are currently handled
> by non-realtime workqueues.  This other patch provides a
> CONFIG_RCU_EXP_KTHREAD Kconfig option that uses a real-time-priority
> variant of workqueues.

Superblock being modified by another process:

- Add a cookie field to the superblock and generate a new one on every
  mount - that way we'll know definitively if it's another process screwing
  with us

Rebalance: Wed Jun 22 06:09:44 PM EDT 2022

- We need to track the destination target on a per-bucket basis, so that
  copygc/rebalance can do sensible things when one device is full and we
  need to destage data off it
- Need to add a background target to the write point, and make sure all
  writes to a given write point have the same background target

Scrub: Mon Jun 13 07:54:15 PM EDT 2022

- On IO error (not CRC error?), the device might have been unplugged/gone
  away: we need a way of checking if the device is still alive (i.e. if
  other IOs are completing) before declaring a replica dead
- Now that we've got backpointers, we have the option of doing the scrub in
  keyspace order or LBA order - and for rotating disks LBA order would be
  preferred. So we need to be able to tell the scrub path "check these
  replicas", like we've been teaching the data update path "move these
  replicas".
- How is scrub for erasure coding going to work?
  Perhaps we should leave scrub of erasure coded pointers up to the EC code

Rebalance is causing fragmentation with background_compression:

- migrate_index_update isn't checking the background_compression option, so
  it's not keeping the right pointers
- Need to redo the data update path so we're telling it explicitly which
  pointers in the original extent we don't want

Zoned device support (ZNS SSDs!):

Zones just map to buckets; the existing copygc does the work of the FTL.

- initial code is done
  X getting the zone map
  X devices with variable-sized zones
  X superblock in a ring buffer across two buckets
  X hooks for journal code
- normal data IO is easy: open buckets map to appendable zones
- btree is harder: we need to add fragmented extents, and move btree
  allocation to when we append to a btree node - but the hard part is done,
  since we're already updating pointers in the parent node after every write

Journal: Mon Apr 25 02:47:29 AM EDT 2022

- fsync lock contention on the journal lock:
    dbench -t 30 -D /mnt/scratch 256
  compare to ext4; ext4 seems to be fastest
- bcachefs fs usage needs to print information about where the journal is:
  X GB allocated, X GB currently in use on each device
- We need an easy way to specify at format time which devices should be used
  for the journal; currently, mixed SSD/HDD filesystems mount slowly because
  of the journal read on the HDDs
  - also, old filesystems allocated the journal in reverse order - is it
    worth writing a tool to defragment the journal? we shouldn't be doing
    this in new filesystems
- the journal resize command can't yet decrease the amount of journal used
  on a device - can it?
  should be an easy fix

Disk space accounting:

- Needs to be reworked to move accounting from the journal to a new btree
- Need to add compressed disk space accounting / compression ratio
- Want average extent size

Persistent counters: Mon Apr 25 02:29:16 AM EDT 2022

- break out IO stats (amount of IO done, by type) more
  - data writes should be broken out into foreground, copygc, rebalance, etc.
- per-device iodone stats looked wrong at one point, check them manually

Compression: Mon Apr 25 01:35:16 AM EDT 2022

- Multithreaded compression:
  - We currently bottleneck on single-threaded compression in the write
    path. To fix this, we need to:
    - move compression out from under the write point lock
      - we will occasionally have to recompress, if we discover the
        compressed extent is bigger than the amount of space left in the
        write point - but not usually
    - decide where to punt to a workqueue in the write path
      - go over the implications of punting writes to a workqueue:
        currently, we avoid punting to a workqueue in the write path
        whenever possible, because writes need to block to signal
        backpressure from lower down in the stack. However, sometimes (AIO)
        we do _not_ want writes to block if it can be avoided - something to
        think about
- Dictionaries:
  - Add a superblock section for compression dictionary (plural?)
    - Actually no: compression dictionaries are too big for a single btree
      key - they should be files, referred to by inode number
  - can't ever get rid of a compression dictionary that's been used!
  - compression_dictionary option, per filesystem & per file

Option improvements: Mon Apr 25 01:49:08 AM EDT 2022

- Currently, the way options are stored in the superblock is a bit of a
  mess: they're scattered around the flags portion of the main (fixed
  length) superblock section
- Options should be moved to their own variable-length superblock section,
  where they are referred to by an enum from the master option x-macro list
- Currently, an option - after parsing - is always just a single integer,
  but some options might want more structure - e.g. compression, where it
  would be nice to also be able to specify the compression level
  - why not just use the high bits of the compression option for the
    compression level?

Disk groups: Mon Apr 25 12:00:18 AM EDT 2022

- The disk groups we have right now allow a disk to be in just one group,
  which is good for giving us a hierarchy for all devices in a filesystem.
  But we probably want another type of group - a "symlink group" - that
  isn't used for recursive enumeration, but lets disks be in more than one
  group

Inherited attributes: Mon Apr 25 12:02:47 AM EDT 2022

- fsck doesn't check that inode attributes are inherited correctly
- are inherited attributes fully transactional now?
  They should be
  - mv returns -EXDEV if inherited attributes have to change, which tells
    coreutils to fall back to a file-by-file move - but we've never checked
    that all is good here, and we have a report (via Daniel) that it's not
  - No, they're not: setting attributes on a directory is not transactional;
    cmd_attr.c recursively calls the reinherit-attrs ioctl on children
  - This needs to be documented

ACL inheritance: Mon Apr 25 01:07:44 AM EDT 2022

- look into how this works and whether it works correctly on bcachefs

Fri Apr 22 06:54:29 PM EDT 2022

- in nochanges mode, the btree node cannibalize path can get stuck: we need
  to make it check for all nodes being dirty and have it bail out

Userspace kernel shim: Sat May 7 10:50:02 PM EDT 2022

- our shrinker implementation is really stupid: it looks at /proc/meminfo
  and runs shrinkers if the ratio of MemAvailable to MemTotal is too low

Thu Apr 21 10:33:43 PM EDT 2022

- journal_key_search causing mount to use 100% CPU at points:
  btree_iter needs to remember our position in journal_keys

Tue Apr 19 03:16:00 AM EDT 2022

- device remove tests: occasionally get invalid journal entries with
  accounting for the device being removed

- backpointers
  X decide what to do about cached data.
    if cached pointers get backpointers, btree backpointers need the bucket
    gen, because we can't atomically delete all btree backpointers
  X backpointers to btree nodes
  X backpointers for compressed data
  X check & repair code
  - error injection tests

- Fsck
  - fsck passes should be enumerated, with an x-macro
    - add a commandline option to specify which pass(es) to run
  - enumerate inconsistent/fsck errors
    - record error counts in the superblock, error counts when the last
      fsck completed, timestamp of the last error
    - by default, only run fsck passes for errors we've seen
  - memoize fsck errors so transaction restarts don't print them twice
  - Add code to fsck to check for extents that have both encrypted and
    unencrypted replicas
  - fsck should be doing some checking of sb_layouts
  - fsck isn't verifying that pointers match the stripe they point to
    (trans_mark_stripe_ptr is, mark_stripe_ptr isn't)
  - fsck doesn't check that inode attributes are inherited correctly
  - lift the progress bar indicator code out of the data job command, and
    also use it in the fsck code to give each pass a progress bar

Hash tables: Sat May 7 10:55:03 PM EDT 2022

- we're still using jhash in places - jhash is terrible, we should replace
  it with siphash

Inodes: Mon Apr 25 02:53:40 AM EDT 2022

- add another inode field that's like i_sectors, but takes into account
  replication & compression
- We should've used zigzag encoding for the inode timestamp fields - do we
  have to rev the inode format again?
- we always return stat.st_size == 0 for directories; this breaks some
  Gentoo build scripts - should we/can we match what other filesystems are
  doing?
- test more thoroughly with journal_write_delay=0

- extend btree root repair to integrate with topology repair and handle any
  missing interior btree node
  - Still needs to be merged!
  - Topology repair could probably be refactored

X add accounting for the number of buckets that need gc before they can be
  reused

Cache coherency with the btree key cache: Mon Apr 25 02:37:51 AM EDT 2022

- this is now almost complete: we just need to ensure that on insertion of a
  new key (not an overwrite of an existing key), the flush to the btree
  happens in the same transaction (so that code scanning the btree knows the
  key exists and can check the key cache)
- once this happens, we can re-enable the btree key cache for inodes

Data jobs: Mon Apr 25 03:13:12 AM EDT 2022

- Rebalance - now that backpointers is almost done, this is the last major
  CPU hog
  - implement a rebalance_work btree that uses KEY_TYPE_set, where a set key
    indicates an extent at the same pos needs something done by rebalance
    (moved to a background device, compressed)
  - rebalance should limit itself when copygc can't keep up
  - unconditionally start copygc & rebalance when going rw; kill the rw_late
    thing
- Add a way to scan and list/make visible extents that don't match data
  options
- Data job visibility - we have some of this, but it needs to be better
  documented
  - add current rates of keys seen/moved and sectors seen/moved to sysfs
    - data_progress, rolling average over the past few seconds - we have
      data job _progress_ in sysfs, but we also want _rate_
  - progress bars for fsck, and for mount with fsck enabled, would be really
    nice
- Scrub still needs to be implemented
  - Add a scrub option, pass it to the data job path
- Self healing: on read error and successful retry, rewrite the failed
  replica (initiate a scrub operation on that extent?)
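The "rolling average over the past few seconds" rate counter for data job visibility could be a simple time-weighted EWMA fed from the existing progress counters. A minimal userspace sketch, with hypothetical names (rate_ewma, rate_update - none of this is existing bcachefs code):

```c
#include <stdint.h>

/*
 * Hypothetical sketch of a "rate over the past few seconds" counter for
 * data job visibility - not bcachefs code.  Feed it the cumulative counter
 * (e.g. keys moved) and the current time; it smooths the instantaneous
 * rate with a time constant of roughly EWMA_SECONDS.
 */

#define EWMA_SECONDS	8

struct rate_ewma {
	uint64_t	last_count;	/* counter value at last update */
	uint64_t	last_time_ns;	/* time of last update */
	double		rate;		/* smoothed units per second */
};

static void rate_update(struct rate_ewma *r, uint64_t count, uint64_t now_ns)
{
	double dt = (now_ns - r->last_time_ns) / 1e9;

	if (dt > 0) {
		double inst = (count - r->last_count) / dt;
		/* weight the new sample by dt relative to the time constant */
		double w = dt / (dt + EWMA_SECONDS);

		r->rate += w * (inst - r->rate);
		r->last_count	= count;
		r->last_time_ns	= now_ns;
	}
}
```

Weighting by dt makes the smoothing independent of how often sysfs polls the counter; irregular update intervals still decay with the same ~8 s time constant.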
- we're not checking the return value of bch2_inconsistent_error() - it
  returns whether we're continuing or halting; some uses should be switched
  to bch2_fatal_error()
- jset_entry_dev_usage & related shouldn't be packed (or they should also be
  aligned(8) - but this will change sizeof, which we currently use)
- when a drive is yanked, keep track of uncommitted writes (writes since the
  last journal flush) - those will have been silently degraded (corrupted)

Filesystem status:

- need an easy way to tell if the fs is degraded
- how many devices are required to mount - we have this internally, we just
  need to expose it better

Snapshots:

- add an option for mounting a specific subvol
- fsck needs to check that there is always a single subvol dirent for each
  subvol
- ro snapshots - need to mark the page cache as unreserved on snapshot
  creation; snapshot creation is not atomic w.r.t. the page cache
  - it would be really nice for the page cache to be shared between
    snapshots (upstream is working on this for reflink, finally!)
- different st_dev for each subvolume
- disk space accounting
- write more tests
- add mounting by subvolume

NFS:

- figure out where nfs gets the fsid from; seeing weird hangs with nfs4

Superblock:

- sb_layout - write a copy 4k from the end of the device
  - format should be doing this now, but we should have tooling for
    examining/changing the superblock layout
- fsck should be doing some checking of sb_layouts
- encryption: the superblock is not using nonces correctly, which is why it
  doesn't get a MAC when encryption is on
- Superblock writes aren't marked flush/fua - need to analyze this for
  correctness, particularly for superblock writes that are done while we're
  RW and the fs is dirty

- We need to detect when we're blocking on an allocation that will never
  complete - i.e.
  because disks are mismatched in size - so that we can fail the allocation
  and also document clearly what is going on

Core btree code: Mon Apr 25 02:32:36 AM EDT 2022

- generic/176 - lots of transaction restarts
- with transactional interior btree updates, b->will_make_reachable can
  probably be simplified
- can we change bch2_btree_update_start() to not use uninterruptible sleep?
  it's a common place to get stuck in D state
- We need to improve running in low-memory conditions: since we unlock to
  read in btree nodes, nothing is preventing nodes we need in traverse_all()
  from being reclaimed while we read in the next node

Replication:

- Add a notification mechanism for drive failure
- Add an option to automatically rereplicate when a drive dies, if we have
  sufficient space
- replication.ktest recovery: with a 32k bucket size, we go into an infinite
  loop where the do_discards worker is continually discarding buckets and
  updating the needs_discard and freespace btrees - and then those btree
  updates require new allocations and free old nodes, which means the
  do_discards worker has more work to do
- on device removal, we got wedged with all open_buckets allocated, mostly
  for btree: it appears we need a new watermark for
  btree_update_nodes_written/btree_node_update_key

Tiering:

- copygc/rebalance/filesystem capacity doesn't behave reasonably on tiered
  filesystems: because filesystem capacity is sum(all fs devices), rebalance
  will try to shove more data onto backing devices than will fit, and then
  copygc spins because it's trying to free up space on those completely full
  devices. We need to think more about how we calculate/expose filesystem
  capacity.
  to reproduce: stress.ktest

Erasure coding:

- when creating a new stripe that reuses an existing stripe, we don't need
  to read all of the existing stripe unless we get an error reading the
  blocks we want to reuse
- fsck isn't verifying that pointers match the stripe they point to
  (trans_mark_stripe_ptr is, mark_stripe_ptr isn't)
- once backpointers is
  merged, erasure coding needs to be updated to use it instead of its own
  in-memory backpointers
  - and check for pointers that weren't updated with the stripe pointer
    after an unclean shutdown
- currently, updating extents to add the stripe ptr can race with reflink
  and fail

Performance:

- 4 context switches per dio write - what?
- eliminate the indirect function call to bch2_write_index_default
- the bch2_rbio_narrow_crcs() path needs to change to deliver the read
  completion first, then punt to a different workqueue
- io_in_flight semaphore - is it time to nuke that? the block layer has
  writeback throttling now; we should at least make it configurable
- see if we can do something about delete times

Performance visibility:

- need improved lock contention stats
- add time_stats for btree lock held time, broken out by transaction->fn &
  call site

- support idmapped mounts: crib off of xfs commit
  f736d93d76d3e97d6986c6d26c8eaa32536ccc5c - done?
- fix all sparse warnings - make C=2 fs/bcachefs/ throws errors
- Make sure tests fail on data checksum errors
- we're getting more btree node cache eviction than we should be, because a
  single NUMA node can be OOM: it might be worth moving nodes to different
  NUMA nodes. Also, the shrinker needs to be NUMA-aware.
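The zigzag encoding mentioned under inodes (for the timestamp fields) is the standard trick (used e.g. by protobuf varints) of interleaving negative and positive values so that small-magnitude signed values - including slightly-negative timestamps - become small unsigned values that encode compactly. A self-contained sketch:

```c
#include <stdint.h>

/*
 * Zigzag encoding: maps 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
 * so signed values of small magnitude become small unsigned values,
 * which then varint-encode in few bytes.
 */

static inline uint64_t zigzag_encode(int64_t v)
{
	/* arithmetic shift replicates the sign bit into all 64 bits */
	return ((uint64_t) v << 1) ^ (uint64_t) (v >> 63);
}

static inline int64_t zigzag_decode(uint64_t v)
{
	return (int64_t) (v >> 1) ^ -(int64_t) (v & 1);
}
```

With plain two's-complement storage, a timestamp of -1 occupies a maximal-magnitude bit pattern; zigzag keeps it to one byte under a varint scheme, which is presumably the appeal for the inode format.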