Age | Commit message (Collapse) | Author |
|
Now that we have a standard ioctl for querying the filesystem UUID,
initialize sb->s_uuid so that it works.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a new generic ioctls for querying the filesystem UUID.
'struct fsuuid' has a length field, so that these can be used for old
weird filesystems without a real UUID (vfat). The length must never
be more than 16 bytes.
This patch adds a generic implementation of FS_IOC_GETFSUUID, which
reads from super_block->s_uuid.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Alexander Viro <viro@zeniv.linux.org.uk
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.or
|
|
Some weird old filesytems have UUID-like things that we wish to expose
as UUIDs, but are smaller; add a length field so that the new
FS_IOC_(GET|SET)UUID ioctls can handle them in generic code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Need to fix this oversight for the new FS_IOC_(GET|SET)UUID ioctls.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
switch the statfs code from something horrible and open coded to the
more standard uuid_to_fsid()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Files within a subvolume cannot be renamed into another subvolume, but
subvolumes themselves were intended to be.
This implements subvolume renaming - we need to ensure that there's only
a single dirent that points to a subvolume key (not multiple versions in
different snapshots), and we need to ensure that dirent.d_parent_subol
and inode.bi_parent_subvol are updated.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add mempool_init_kvmalloc_pool() and mempool_create_kvmalloc_pool(),
which wrap kvmalloc() instead of kmalloc() - kmalloc() with a vmalloc()
fallback.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-mm@kvack.org
|
|
We were failing to set path->uptodate when reaching the end of a btree
node iterator, causing the new prefetch code for backpointers gc to go
into an infinite loop.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Memory allocation failures print backtraces by default, but when we're
running out of a rhashtable worker the backtrace is useless - it doesn't
tell us which hashtable the allocation failure was for.
This adds a dedicated warning that prints out functions from the
rhashtable params, which will be a bit more useful.
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
delete bcache's time stats code, convert to newer version from bcachefs.
example output:
root@moria-kvm:/sys/fs/bcache/bdaedb8c-4554-4dd2-87e4-276e51eb47cc# cat internal/btree_sort_times
count: 6414
since mount recent
duration of events
min: 440 ns
max: 1102 us
total: 674 ms
mean: 105 us 102 us
stddev: 101 us 88 us
time between events
min: 881 ns
max: 3 s
mean: 7 ms 6 ms
stddev: 52 ms 6 ms
Cc: Coly Li <colyli@suse.de>
Cc: linux-bcache@vger.kernel.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Library code from bcachefs for tracking latency measurements.
The main interface is
time_stats_update(stats, start_time);
which collects a new event with an end time of the current time.
It features percpu buffering of input values, making it very low
overhead, and nicely formatted output to printbufs or seq_buf.
Sample output, from the bcache conversion:
root@moria-kvm:/sys/fs/bcache/bdaedb8c-4554-4dd2-87e4-276e51eb47cc# cat internal/btree_sort_times
count: 6414
since mount recent
duration of events
min: 440 ns
max: 1102 us
total: 674 ms
mean: 105 us 102 us
stddev: 101 us 88 us
time between events
min: 881 ns
max: 3 s
mean: 7 ms 6 ms
stddev: 52 ms 6 ms
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
eytzinger trees are a faster alternative to binary search. They're a bit
more expensive to setup, but lookups perform much better assuming the
tree isn't entirely in cache.
Binary search is a worst case scenario for branch prediction and
prefetching, but eytzinger trees have children adjacent in memory and
thus we can prefetch before knowing the result of a comparison.
An eytzinger tree is a binary tree laid out in an array, with the same
geometry as the usual binary heap construction, but used as a search
tree instead.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
Small statistics library, for taking in a series of value and computing
mean, weighted mean, standard deviation and weighted deviation.
The main use case is for statistics on latency measurements.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Daniel Hill <daniel@gluo.nz>
Cc: Darrick J. Wong <djwong@kernel.org>
|
|
When a dirent points to a missing inode, we really should print out the
dirent.
This requires quite a bit of refactoring, but there's some other
benefits: we now do the entire looup (dirent and inode) in a single
btree transaction, and copy to the VFS inode with btree locks still
held, like the create path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Introduce PF_MEMALLOC_* equivalents of some GFP_ flags:
PF_MEMALLOC_NORECLAIM -> GFP_NOWAIT
PF_MEMALLOC_NOWARN -> __GFP_NOWARN
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Our proliferation of memalloc_*_{save,restore} APIs is getting a bit
silly, this adds a generic version and converts the existing
save/restore functions to wrappers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
|
|
btree_and_journal_iter is old code that we want to get rid of, but we're
not ready to yet.
lack of btree node prefetching is, it turns out, a real performance
issue for fsck on spinning rust, so - add it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
we now always have a btree_trans when using a btree_and_journal_iter;
prep work for adding prefetching to btree_and_journal_iter
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Recently a severe performance regression was discovered, which bisected
to
a6548c8b5eb5 bcachefs: Avoid flushing the journal in the discard path
It turns out the old behaviour, which issued excessive journal flushes,
worked around a performance issue where queueing delays would cause the
journal to not be able to write quickly enough and stall.
The journal flushes masked the issue because they periodically flushed
the device write cache, reducing write latency for non flushes.
This patch reworks the journalling code to allow more than one
(non-flush) write to be in flight at a time. With this patch, doing 4k
random writes and an iodepth of 128, we are now able to hit 560k iops to
a Samsung 970 EVO Plus - previously, we were stuck in the ~200k range.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Prep work for having multiple journal writes in flight.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Prep work for having multiple journal writes in flight.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This gives us a way to record the date and time every journal entry was
written - useful for debugging.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Eliminates some error paths - no longer have a hardcoded
BCH_REPLICAS_MAX limit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Drop an unnecessary bch2_subvolume_get_snapshot() call, and drop the __
from the name - this is a normal interface.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Minor renaming for clarity, bit of refactoring.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Most bcachefs workqueues are used for completions, and should be
WQ_HIGHPRI - this helps reduce queuing delays, we want to complete
quickly once we can no longer signal backpressure by blocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For DT_SUBVOL, we now print both parent and child subvol IDs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, any time we failed to get a journal reservation we'd retry,
with the journal lock held; but this isn't necessary given
wait_event()/wake_up() ordering.
This avoids performance cliffs when the journal starts to get backed up
and lock contention shoots up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We don't want journal write completions to be blocked behind btree
transactions - io_complete_wq is used for btree updates after data and
metadata writes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Parent dir is locked by user_path_locked_at() before validating the
required dentry. It should be unlocked if we can not perform the
deletion.
This fixes the problem:
$ bcachefs subvolume delete not-exist-entry
BCH_IOCTL_SUBVOLUME_DESTROY ioctl error: No such file or directory
$ bcachefs subvolume delete not-exist-entry
the second will stuck because the parent dir is locked in the previous
deletion.
Signed-off-by: Guoyu Ou <benogy@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The gcc compiler on paric does support the __int128 type, although the
architecture does not have native 128-bit support.
The effect is, that the bcachefs u128_square() function will pull in the
libgcc __multi3() helper, which breaks the kernel build when bcachefs is
built as module since this function isn't currently exported in
arch/parisc/kernel/parisc_ksyms.c.
The build failure can be seen in the latest debian kernel build at:
https://buildd.debian.org/status/fetch.php?pkg=linux&arch=hppa&ver=6.7.1-1%7Eexp1&stamp=1706132569&raw=0
We prefer to not export that symbol, so fall back to the optional 64-bit
implementation provided by bcachefs and thus avoid usage of __multi3().
Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a new helper, bch2_hash_lookup_in_snapshot(), for when we're not
operating in a subvolume and already have a snapshot ID, and then use it
in lookup_lostfound() -> __lookup_dirent().
This is a bugfix - lookup_lostfound() doesn't take a subvolume ID, we
were passing a nonsense subvolume ID before, and don't have one to pass
since we may be operating in an interior snapshot node that doesn't have
a subvolume ID.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Some (bad) devices can have really terrible discard latency; we don't
want them blocking memory reclaim and causing warnings.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
REQ_OP_FLUSH is only for internal use in the blk-mq and request based
drivers. File systems and other block layer consumers must use
REQ_OP_WRITE | REQ_PREFLUSH as documented in
Documentation/block/writeback_cache_control.rst.
While REQ_OP_FLUSH appears to work for blk-mq drivers it does not
get the proper flush state machine handling, and completely fails
for any bio based drivers, including all the stacking drivers. The
block layer will also get a check in 6.8 to reject this use case
entirely.
[Note: completely untested, but as this never got fixed since the
original bug report in November:
https://bugzilla.kernel.org/show_bug.cgi?id=218184
and the the discussion in December:
https://lore.kernel.org/all/20231221053016.72cqcfg46vxwohcj@moria.home.lan/T/
this seems to be best way to force it]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fixes: e6a2566f7a00 ("bcachefs: Better journal tracepoints")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reported-by: smatch
|
|
|
|
Pull more bcachefs updates from Kent Overstreet:
"Some fixes, Some refactoring, some minor features:
- Assorted prep work for disk space accounting rewrite
- BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
makes our trigger context more explicit
- A few fixes to avoid excessive transaction restarts on
multithreaded workloads: fstests (in addition to ktest tests) are
now checking slowpath counters, and that's shaking out a few bugs
- Assorted tracepoint improvements
- Starting to break up bcachefs_format.h and move on disk types so
they're with the code they belong to; this will make room to start
documenting the on disk format better.
- A few minor fixes"
* tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
bcachefs: Improve inode_to_text()
bcachefs: logged_ops_format.h
bcachefs: reflink_format.h
bcachefs; extents_format.h
bcachefs: ec_format.h
bcachefs: subvolume_format.h
bcachefs: snapshot_format.h
bcachefs: alloc_background_format.h
bcachefs: xattr_format.h
bcachefs: dirent_format.h
bcachefs: inode_format.h
bcachefs; quota_format.h
bcachefs: sb-counters_format.h
bcachefs: counters.c -> sb-counters.c
bcachefs: comment bch_subvolume
bcachefs: bch_snapshot::btime
bcachefs: add missing __GFP_NOWARN
bcachefs: opts->compression can now also be applied in the background
bcachefs: Prep work for variable size btree node buffers
bcachefs: grab s_umount only if snapshotting
...
|