Age | Commit message (Collapse) | Author |
|
It now includes more info - whether the bucket was for metadata or data
- and also call it in the same place as the bucket_alloc_fail
tracepoint.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This deletes our old lock ordering based deadlock avoidance code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We've outgrown our own deadlock avoidance strategy.
The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.
Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.
That approach had some issues, though.
- Sometimes we'd issue transaction restarts unnecessarily, when no
deadlock would have actually occured. Lock ordering restarts have
become our primary cause of transaction restarts, on some workloads
totally 20% of actual transaction commits.
- To avoid deadlock or livelock, we'd often have to take intent locks
when we only wanted a read lock: with the lock ordering approach, it
is actually illegal to hold _any_ read lock while blocking on an intent
lock, and this has been causing us unnecessary lock contention.
- It was getting fragile - the various lock ordering rules are not
trivial, and we'd been seeing occasional livelock issues related to
this machinery.
So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.
When we block taking a btree lock, after adding ourself to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.
If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we can not allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.
Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This is needed by the cycle detector in bcachefs - we need a way to
iterater over waitlist entries while dropping and retaking the waitlist
lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
With the new method where the thread doing the wakeup after unlock takes
the lock on behalf of the thread waiting for the lock, we don't want to
spin calling trylock() anymore - we can instead spin on
wait->lock_acquired, and not have to touch the lock's cacheline.
Also, osq_lock doesn't make much sense for six locks; multiple readers
may be waiting on a single thread to drop the write lock, and dropping
it simplifies the code a bit more.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This allows passing in the wait list entry - to be used for a deadlock
cycle detector.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This brings back an important optimization, to avoid touching the wait
lists an extra time, while preserving the property that a thread is on a
lock waitlist iff it is waiting - it is never removed from the waitlist
until it has the lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This switches to a single list of waiters, instead of separate lists for
read and intent, and switches write locks to also use the wait lists
instead of being handled differently.
Also, removal from the wait list is now done by the process waiting on
the lock, not the process doing the wakeup. This is needed for the new
deadlock cycle detector - we need tasks to stay on the waitlist until
they've successfully acquired the lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This adds a method to tell lockdep not to check lock ordering within a
lock class - but to still check lock ordering w.r.t. other lock types.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Centralizing the transaction restart/tracepoint in
bch2_btree_path_upgrade() lets us improve the tracepoint - now it emits
old and new locks_want.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Didn't have any users, and wasn't a good idea to begin with - delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Also, do some reorganizing/renaming, convert atomic counters in bch_fs
to persistent counters, and add a few missing counters.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
It now includes journal_flags.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
It now prints the error name when the btree node is an error pointer;
also, don't trace failures when the the btree node is
BCH_ERR_no_btree_node_up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This is just some type safety cleanup.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
six_lock_count now counts up whether a write lock held, and this patch
now also correctly counts six_lock->intent_lock_recurse.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This message fires when the data update path races with a foreground
write that overwrote the data that was being moved - this isn't a
concerning event as long as it's not happening too often, so switch it
to a tracepoint.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This is to fix a livelock in the btree split path described by the
previous patch.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
- Add a flag, has_indent_or_tabstops, that is set if indent level or
tabstops are set.
- Tabstops can no longer be set by modifying the tabstop array
directly: instead, the new functions are provided:
printbuf_tabstop_push() - add a new tabstop, n spaces after
previous tabstop
printbuf_tabtstop_pop() - remove previous tabstop
printbuf_tabstops_reset() - remove all tabstops
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This adds a new helper, prt_str_indented(), which handles embedded
control characters by calling prt_newline(), prt_tab(), and
prt_tab_rjust() as needed.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Our types are exported to the tracepoint code, so it's not necessary to
break things out individually when passing them to tracepoints - we can
also call other functions from TP_fast_assign().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
- use strlcpy(), not strncpy()
- add tracepoints for btree_path alloc and free
- give the tracepoint for key cache upgrade fail a proper name
- add a tracepoint for btree_node_upgrade_fail
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
In CONFIG_BCACHEFS_DEBUG mode, we'll now randomly issue transaction
restarts - with a decaying probability based on the number of restarts
we've already had, to ensure that transactions eventually make forward
progress. This should help shake out some bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
All transaction restarts need a tracepoint - this is essential for
debugging
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Instead of overloading standard error codes (EINTR/EAGAIN), and defining
short lists of error codes in multiple places that potentially end up
overlapping & conflicting, we're now going to have one master list of
error codes.
Error codes are defined with an x-macro: thus we also have
bch2_err_str() now.
Also, error codes have a class field. Now, instead of checking for
errors with ==, code should use bch2_err_matches(), which returns true
if the error is equal to or a sub-error of the error class.
This means we can define unique errors for every source location where
an error is generated, which will help improve our error messages.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
We should be printing the number of free buckets, not just the number of
available buckets.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
- invalidate_one_bucket() now returns 1 when we don't have any buckets
on this device to invalidate, ensuring we don't spin
- the tracepoint invocation is moved to after the transaction commit,
and we now include the number of cached sectors in the tracepoint
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
The next step in this patch series for improving debugging of shrinker
related issues: keep counts of number of objects we request to free vs.
actually freed, and prints them in shrinker_to_text().
Shrinkers won't necessarily free all objects requested for a variety of
reasons, but if the two counts are wildly different something is likely
amiss.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This adds a new callback method to shrinkers which they can use to
describe anything relevant to memory reclaim about their internal state,
for example object dirtyness.
This uses the new printbufs to output to heap allocated strings, so that
the .to_text() methods can be used both for messages logged to the
console, and also sysfs/debugfs.
This patch also adds shrinkers_to_text(), which reports on the top 10
shrinkers - by object count - in sorted order, to be used in OOM
reporting.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This implements a new %p format string extensions for passing a pretty
printer and its arguments to printk, which will then be inserted into
the formatted output.
A pretty-printer is a function that takes as its first argument a
pointer to a struct printbuf, and then zero or more additional pointer
arguments - these being the objects to format and print.
The arguments to the pretty-printer function are denoted in the format
string by %p, i.e
%pf() foo0_to_text(struct printbuf *out)
%pf(%p) foo1_to_text(struct printbuf *out, struct foo *)
%pf(%p,%p) foo2_to_text(struct printbuf *out, struct foo *)
We'd also like to eventually support non pointer arguments - in
particular, integers - but this will probably require libffi.
Typechecking is accomplished with the CALL_PP macro, which verifies that
the arguments passed to sprintf match the types of the pp-function
arguments, and passes a struct with a cookie to sprintf so that sprintf
can verify that the CALL_PP() macro was used.
Full example:
static void foo_to_text(struct printbuf *out, struct foo *foo)
{
prt_printf(out, "bar=%u baz=%u", foo->bar, foo->baz);
}
printf("%pf(%p)", CALL_PP(foo_to_text, foo));
The goal is to replace most of our %p format extensions with this
interface, and to move pretty-printers out of the core vsprintf.c code -
this will get us better organization and better discoverability (you'll
be able to cscope to pretty printer calls!), as well as eliminate a lot
of dispatch code in vsprintf.c.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
No longer has any users, so delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This converts the seq_bufs in dynevent_cmd and trace_seq to printbufs.
- read_pos in seq_buf doesn't exist in printbuf, so is added to
trace_seq
- seq_buf_to_user doesn't have a printbuf equivalent, so is inlined
into trace_seq_to_user
- seq_buf_putmem_hex currently swabs bytes on little endian, hardcoded
to 8 byte units. This patch switches it to prt_hex_bytes(), which
does _not_ swab.
Otherwise this is largely a direct conversion, with a few slight
refactorings and cleanups.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This implements a new printbuf version of d_path()/mangle_path(), which
will replace the seq_buf version.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This adds two new-style printbuf helpers for printing simple u64s, and
converts num_to_str() to be a simple wrapper around prt_u64_minwidth().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This adds options to printbuf for specifying whether units should be
printed raw (default) or with human readable units, and for controlling
whether human-readable units should be base 2 (default), or base 10.
This also adds new helpers that obey these options:
- pr_human_readable_u64
- pr_human_readable_s64
These obey printbuf->si_units
- pr_units_u64
- pr_units_s64
These obey both printbuf-human_readable_units and printbuf->si_units
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This patch adds two new features to printbuf for structured formatting:
- Indent level: the indent level, as a number of spaces, may be
increased with pr_indent_add() and decreased with pr_indent_sub().
Subsequent lines, when started with pr_newline() (not "\n", although
that may change) will then be intended according to the current
indent level. This helps with pretty-printers that structure a large
amonut of data across multiple lines and multiple functions.
- Tabstops: Tabstops may be set by assigning to the printbuf->tabstops
array.
Then, pr_tab() may be used to advance to the next tabstop, printing
as many spaces as required - leaving previous output left justified
to the previous tabstop. pr_tab_rjust() advances to the next tabstop
but inserts the spaces just after the previous tabstop - right
justifying the previously-outputted text to the next tabstop.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This makes printbufs optionally heap allocated: a printbuf initialized
with the PRINTBUF initializer will automatically heap allocate and
resize as needed.
Allocations are done with GFP_KERNEL: code should use e.g.
memalloc_nofs_save()/restore() as needed. Since we do not currently have
memalloc_nowait_save()/restore(), in contexts where it is not safe to
block we provide the helpers
printbuf_atomic_inc()
printbuf_atomic_dec()
When the atomic count is nonzero, memory allocations will be done with
GFP_NOWAIT.
On memory allocation failure, output will be truncated. Code that wishes
to check for memory allocation failure (in contexts where we should
return -ENOMEM) should check if printbuf->allocation_failure is set.
Since printbufs are expected to be typically used for log messages and
on a best effort basis, we don't return errors directly.
Other helpers provided by this patch:
- printbuf_make_room(buf, extra)
Reallocates if necessary to make room for @extra bytes (not including
terminating null).
- printbuf_str(buf)
Returns a null terminated string equivalent to the contents of @buf.
If @buf was never allocated (or allocation failed), returns a
constant empty string.
- printbuf_exit(buf)
Releases memory allocated by a printbuf.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
printbuf now needs to know the number of characters that would have been
written if the buffer was too small, like snprintf(); this changes
string_get_size() to return the the return value of snprintf().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This converts most of the hexdump code to printbufs, along with some
significant cleanups and a bit of reorganization. The old non-printbuf
functions are mostly left as wrappers around the new printbuf versions.
Big note: byte swabbing behaviour
Previously, hex_dump_to_buffer() would byteswab the groups of bytes
being printed on little endian machines. This behaviour is... not
standard or typical for a hex dumper, and this behaviour was silently
added/changed without documentation (in 2007).
Given that the hex dumpers are just used for debugging output, nothing
is likely to break, and hopefully by reverting to more standard
behaviour the end result will be _less_ confusion, modulo a few kernel
developers who will certainly be annoyed by their tools changing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This converts vsnprintf() to printbufs: instead of passing around raw
char * pointers for current buf position and end of buf, we have a real
type!
This makes the calling convention for our existing pretty printers a lot
saner and less error prone, plus printbufs add some new helpers that
make the code smaller and more readable, with a lot less crazy pointer
arithmetic.
There are a lot more refactorings to be done: this patch tries to stick
to just converting the calling conventions, as that needs to be done all
at once in order to avoid introducing a ton of wrappers that will just
be deleted.
Thankfully we have good unit tests for printf, and they have been run
and are all passing with this patch.
We have two new exported functions with this patch:
- prt_printf(), which is like snprintf but outputs to a printbuf
- prt_vprintf, like vsnprintf
These are the actual core print routines now - vsnprintf() is a wrapper
around prt_vprintf().
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Like the upcoming vsprintf.c conversion, this converts string_escape_mem
to prt_escaped_string(), which uses and outputs to a printbuf, and makes
string_escape_mem() a smaller wrapper to support existing users.
The new printbuf helpers greatly simplify the code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This adds printbufs: a printbuf points to a char * buffer and knows the
size of the output buffer as well as the current output position.
Future patches will be adding more features to printbuf, but initially
printbufs are targeted at refactoring and improving our existing code in
lib/vsprintf.c - so this initial printbuf patch has the features
required for that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
Delete some obsolete tracepoints, organize alloc tracepoints better,
make a few tracepoints more consistent.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Now that the bucket_alloc_fail tracepoint includes the error code, the
open_bucket_alloc_fail tracepoint is redundant.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This behavior dates from the early, early days of bcache, and upon
further delving appears to not make any sense. The shrinker only works
in terms of 'objects' of unknown size; normalizing to pages only had the
effect of changing the batch size, which we could do directly - if we
wanted; we probably don't. Normalizing to pages meant our batch size was
very small, which seems to have been keeping us from doing as much
shrinking as we should be under heavy memory pressure; this patch
appears to alleviate some OOMs we've been seeing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
- bch2_clear_need_discard() was using bch2_trans_relock() incorrectly,
and always bailing out before doing any work - ouch.
- Add a tracepoint that fires every time bch2_do_discards() runs, and
tells us about the work it did
- When too many buckets aren't able to be discarded because they need a
journal commit, bch2_do_discards now flushes the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
For backpointers, we'll need to delete old backpointers before adding
new backpointers - otherwise we'll run into spurious duplicate
backpointer errors.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|
|
This adds counters for each of the reasons we may skip allocating a
bucket - we're seeing a bug where we loop endlessly trying to allocate
when we should have plenty of buckets available, so hopefully this will
help us track down why.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
|