summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2022-07-22swiotlb: clean up some coding style and minor issuesTianyu Lan
- Fix the used field of struct io_tlb_area wasn't initialized - Set area number to be 0 if input area number parameter is 0 - Use array_size() to calculate io_tlb_area array size - Make parameters of swiotlb_do_find_slots() more reasonable Fixes: 26ffb91fa5e0 ("swiotlb: split up the global swiotlb lock") Signed-off-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2022-07-21bpf: Add support for forcing kfunc args to be trustedKumar Kartikeya Dwivedi
Teach the verifier to detect a new KF_TRUSTED_ARGS kfunc flag, which means each pointer argument must be trusted, which we define as a pointer that is referenced (has non-zero ref_obj_id) and also needs to have its offset unchanged, similar to how release functions expect their argument. This allows a kfunc to receive pointer arguments unchanged from the result of the acquire kfunc. This is required to ensure that kfunc that operate on some object only work on acquired pointers and not normal PTR_TO_BTF_ID with same type which can be obtained by pointer walking. The restrictions applied to release arguments also apply to trusted arguments. This implies that strict type matching (not deducing type by recursively following members at offset) and OBJ_RELEASE offset checks (ensuring they are zero) are used for trusted pointer arguments. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220721134245.2450-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-21bpf: Switch to new kfunc flags infrastructureKumar Kartikeya Dwivedi
Instead of populating multiple sets to indicate some attribute and then researching the same BTF ID in them, prepare a single unified BTF set which indicates whether a kfunc is allowed to be called, and also its attributes if any at the same time. Now, only one call is needed to perform the lookup for both kfunc availability and its attributes. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220721134245.2450-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-21Merge branch 'ctxt.2022.07.05a' into HEADPaul E. McKenney
ctxt.2022.07.05a: Linux-kernel memory model development branch.
2022-07-21Merge branches 'doc.2022.06.21a', 'fixes.2022.07.19a', 'nocb.2022.07.19a', ↵Paul E. McKenney
'poll.2022.07.21a', 'rcu-tasks.2022.06.21a' and 'torture.2022.06.21a' into HEAD doc.2022.06.21a: Documentation updates. fixes.2022.07.19a: Miscellaneous fixes. nocb.2022.07.19a: Callback-offload updates. poll.2022.07.21a: Polled grace-period updates. rcu-tasks.2022.06.21a: Tasks RCU updates. torture.2022.06.21a: Torture-test updates.
2022-07-21rcu: Add irqs-disabled indicator to expedited RCU CPU stall warningsZqiang
If a CPU has interrupts disabled continuously starting before the beginning of a given expedited RCU grace period, that CPU will not execute that grace period's IPI handler. This will in turn mean that the ->cpu_no_qs.b.exp field in that CPU's rcu_data structure will continue to contain the boolean value false. Knowing whether or not a CPU has had interrupts disabled can be helpful when debugging an expedited RCU CPU stall warning, so this commit adds a "D" indicator expedited RCU CPU stall warnings that signifies that the corresponding CPU has had interrupts disabled throughout. This capability was tested as follows: runqemu kvm slirp nographic qemuparams="-m 4096 -smp 4" bootparams= "isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3 rcutree.dump_tree=1 rcutorture.stall_cpu_holdoff=30 rcutorture.stall_cpu=40 rcutorture.stall_cpu_irqsoff=1 rcutorture.stall_cpu_block=0 rcutorture.stall_no_softlockup=1" -d The rcu_torture_stall() function ran on CPU 1, which displays the "D" as expected given the rcutorture.stall_cpu_irqsoff=1 module parameter: ............ rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 1-...D } 26467 jiffies s: 13317 root: 0x1/. rcu: blocking rcu_node structures (internal RCU debug): l=1:0-1:0x2/. Task dump for CPU 1: task:rcu_torture_sta state:R running task stack: 0 pid: 76 ppid: 2 flags:0x00004008 Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21rcu: Diagnose extended sync_rcu_do_polled_gp() loopsPaul E. McKenney
This commit dumps out state when the sync_rcu_do_polled_gp() function loops more than expected. This is a debugging aid. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warningsZqiang
When a normal RCU CPU stall warning is encountered with the panic_on_rcu_stall sysfs variable is set, the system panics only after the stall warning is printed. But when an expedited RCU CPU stall warning is encountered with the panic_on_rcu_stall sysfs variable is set, the system panics first, thus never printing the stall warning. This commit therefore brings the expedited stall warning into line with the normal stall warning by printing first and panicking afterwards. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21rcutorture: Test polled expedited grace-period primitivesPaul E. McKenney
This commit adds tests of start_poll_synchronize_rcu_expedited() and poll_state_synchronize_rcu_expedited(). Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21rcu: Add polled expedited grace-period primitivesPaul E. McKenney
This commit adds expedited grace-period functionality to RCU's polled grace-period API, adding start_poll_synchronize_rcu_expedited() and cond_synchronize_rcu_expedited(), which are similar to the existing start_poll_synchronize_rcu() and cond_synchronize_rcu() functions, respectively. Note that although start_poll_synchronize_rcu_expedited() can be invoked very early, the resulting expedited grace periods are not guaranteed to start until after workqueues are fully initialized. On the other hand, both synchronize_rcu() and synchronize_rcu_expedited() can also be invoked very early, and the resulting grace periods will be taken into account as they occur. [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21rcutorture: Verify that polled GP API sees synchronous grace periodsPaul E. McKenney
This commit causes rcu_torture_writer() to use WARN_ON_ONCE() to check that the cookie returned by the current RCU flavor's ->get_gp_state() function (get_state_synchronize_rcu() for vanilla RCU) causes that flavor's ->poll_gp_state function (poll_state_synchronize_rcu() for vanilla RCU) to unconditionally return true. Note that a pair calls to synchronous grace-period-wait functions are used. This is necessary to account for partially overlapping normal and expedited grace periods aligning in just the wrong way with polled API invocations, which can cause those polled API invocations to ignore one or the other of those partially overlapping grace periods. It is unlikely that this sort of ignored grace period will be a problem in production, but rcutorture can make it happen quite within a few tens of seconds. This commit is in preparation for polled expedited grace periods. [ paulmck: Apply feedback from Frederic Weisbecker. ] Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21rcu: Make Tiny RCU grace periods visible to polled APIsPaul E. McKenney
This commit makes the Tiny RCU implementation of synchronize_rcu() increment the rcu_ctrlblk.gp_seq counter, thus making both synchronize_rcu() and synchronize_rcu_expedited() visible to get_state_synchronize_rcu() and friends. Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21rcu: Make polled grace-period API account for expedited grace periodsPaul E. McKenney
Currently, this code could splat: oldstate = get_state_synchronize_rcu(); synchronize_rcu_expedited(); WARN_ON_ONCE(!poll_state_synchronize_rcu(oldstate)); This situation is counter-intuitive and user-unfriendly. After all, there really was a perfectly valid full grace period right after the call to get_state_synchronize_rcu(), so why shouldn't poll_state_synchronize_rcu() know about it? This commit therefore makes the polled grace-period API aware of expedited grace periods in addition to the normal grace periods that it is already aware of. With this change, the above code is guaranteed not to splat. Please note that the above code can still splat due to counter wrap on the one hand and situations involving partially overlapping normal/expedited grace periods on the other. On 64-bit systems, the second is of course much more likely than the first. It is possible to modify this approach to prevent overlapping grace periods from causing splats, but only at the expense of greatly increasing the probability of counter wrap, as in within milliseconds on 32-bit systems and within minutes on 64-bit systems. This commit is in preparation for polled expedited grace periods. Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21rcu: Switch polled grace-period APIs to ->gp_seq_polledPaul E. McKenney
This commit switches the existing polled grace-period APIs to use a new ->gp_seq_polled counter in the rcu_state structure. An additional ->gp_seq_polled_snap counter in that same structure allows the normal grace period kthread to interact properly with the !SMP !PREEMPT fastpath through synchronize_rcu(). The first of the two to note the end of a given grace period will make knowledge of this transition available to the polled API. This commit is in preparation for polled expedited grace periods. [ paulmck: Fix use of rcu_state.gp_seq_polled to start normal grace period. ] Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Co-developed-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21resource: Introduce alloc_free_mem_region()Dan Williams
The core of devm_request_free_mem_region() is a helper that searches for free space in iomem_resource and performs __request_region_locked() on the result of that search. The policy choices of the implementation conform to what CONFIG_DEVICE_PRIVATE users want which is memory that is immediately marked busy, and a preference to search for the first-fit free range in descending order from the top of the physical address space. CXL has a need for a similar allocator, but with the following tweaks: 1/ Search for free space in ascending order 2/ Search for free space relative to a given CXL window 3/ 'insert' rather than 'request' the new resource given downstream drivers from the CXL Region driver (like the pmem or dax drivers) are responsible for request_mem_region() when they activate the memory range. Rework __request_free_mem_region() into get_free_mem_region() which takes a set of GFR_* (Get Free Region) flags to control the allocation policy (ascending vs descending), and "busy" policy (insert_resource() vs request_region()). As part of the consolidation of the legacy GFR_REQUEST_REGION case with the new default of just inserting a new resource into the free space some minor cleanups like not checking for NULL before calling devres_free() (which does its own check) is included. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/linux-cxl/20220420143406.GY2120790@nvidia.com/ Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/165784333333.1758207.13703329337805274043.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-07-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-21watch-queue: remove spurious double semicolonLinus Torvalds
Sedat Dilek noticed that I had an extraneous semicolon at the end of a line in the previous patch. It's harmless, but unintentional, and while compilers just treat it as an extra empty statement, for all I know some other tooling might warn about it. So clean it up before other people notice too ;) Fixes: 353f7988dd84 ("watchqueue: make sure to serialize 'wqueue->defunct' properly") Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
2022-07-21cxl/acpi: Track CXL resources in iomem_resourceDan Williams
Recall that CXL capable address ranges, on ACPI platforms, are published in the CEDT.CFMWS (CXL Early Discovery Table: CXL Fixed Memory Window Structures). These windows represent both the actively mapped capacity and the potential address space that can be dynamically assigned to a new CXL decode configuration (region / interleave-set). CXL endpoints like DDR DIMMs can be mapped at any physical address including 0 and legacy ranges. There is an expectation and requirement that the /proc/iomem interface and the iomem_resource tree in the kernel reflect the full set of platform address ranges. I.e. that every address range that platform firmware and bus drivers enumerate be reflected as an iomem_resource entry. The hard requirement to do this for CXL arises from the fact that facilities like CONFIG_DEVICE_PRIVATE expect to be able to treat empty iomem_resource ranges as free for software to use as proxy address space. Without CXL publishing its potential address ranges in iomem_resource, the CONFIG_DEVICE_PRIVATE mechanism may inadvertently steal capacity reserved for runtime provisioning of new CXL regions. So, iomem_resource needs to know about both active and potential CXL resource ranges. The active CXL resources might already be reflected in iomem_resource as "System RAM". insert_resource_expand_to_fit() handles re-parenting "System RAM" underneath a CXL window. The "_expand_to_fit()" behavior handles cases where a CXL window is not a strict superset of an existing entry in the iomem_resource tree. The "_expand_to_fit()" behavior is acceptable from the perspective of resource allocation. The expansion happens because a conflicting resource range is already populated, which means the resource boundary expansion does not result in any additional free CXL address space being made available. CXL address space allocation is always bounded by the orginal unexpanded address range. However, the potential for expansion does mean that something like walk_iomem_res_desc(IORES_DESC_CXL...) can only return fuzzy answers on corner case platforms that cause the resource tree to expand a CXL window resource over a range that is not decoded by CXL. This would be an odd platform configuration, but if it becomes a problem in practice the CXL subsytem could just publish an API that returns definitive answers. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/165784325943.1758207.5310344844375305118.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-07-21bpf: Check attach_func_proto more carefully in check_helper_callStanislav Fomichev
Syzkaller found a problem similar to d1a6edecc1fd ("bpf: Check attach_func_proto more carefully in check_return_code") where attach_func_proto might be NULL: RIP: 0010:check_helper_call+0x3dcb/0x8d50 kernel/bpf/verifier.c:7330 do_check kernel/bpf/verifier.c:12302 [inline] do_check_common+0x6e1e/0xb980 kernel/bpf/verifier.c:14610 do_check_main kernel/bpf/verifier.c:14673 [inline] bpf_check+0x661e/0xc520 kernel/bpf/verifier.c:15243 bpf_prog_load+0x11ae/0x1f80 kernel/bpf/syscall.c:2620 With the following reproducer: bpf$BPF_PROG_RAW_TRACEPOINT_LOAD(0x5, &(0x7f0000000780)={0xf, 0x4, &(0x7f0000000040)=@framed={{}, [@call={0x85, 0x0, 0x0, 0xbb}]}, &(0x7f0000000000)='GPL\x00', 0x0, 0x0, 0x0, 0x0, 0x0, '\x00', 0x0, 0x2b, 0xffffffffffffffff, 0x8, 0x0, 0x0, 0x10, 0x0}, 0x80) Let's do the same here, only check attach_func_proto for the prog types where we are certain that attach_func_proto is defined. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Reported-by: syzbot+0f8d989b1fba1addc5e0@syzkaller.appspotmail.com Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220720164729.147544-1-sdf@google.com
2022-07-21sched/core: Fix the bug that task won't enqueue into core tree when update ↵Cruz Zhao
cookie In function sched_core_update_cookie(), a task will enqueue into the core tree only when it enqueued before, that is, if an uncookied task is cookied, it will not enqueue into the core tree until it enqueue again, which will result in unnecessary force idle. Here follows the scenario: CPU x and CPU y are a pair of SMT siblings. 1. Start task a running on CPU x without sleeping, and task b and task c running on CPU y without sleeping. 2. We create a cookie and share it to task a and task b, and then we create another cookie and share it to task c. 3. Simpling core_forceidle_sum of task a and b from /proc/PID/sched And we will find out that core_forceidle_sum of task a takes 30% time of the sampling period, which shouldn't happen as task a and b have the same cookie. Then we migrate task a to CPU x', migrate task b and c to CPU y', where CPU x' and CPU y' are a pair of SMT siblings, and sampling again, we will found out that core_forceidle_sum of task a and b are almost zero. To solve this problem, we enqueue the task into the core tree if it's on rq. Fixes: 6e33cad0af49("sched: Trivial core scheduling cookie management") Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1656403045-100840-2-git-send-email-CruzZhao@linux.alibaba.com
2022-07-21nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()Nicolas Saenz Julienne
dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having called sched_update_tick_dependency() preventing it from re-enabling the tick on systems that no longer have pending SCHED_RT tasks but have multiple runnable SCHED_OTHER tasks: dequeue_task_rt() dequeue_rt_entity() dequeue_rt_stack() dequeue_top_rt_rq() sub_nr_running() // decrements rq->nr_running sched_update_tick_dependency() sched_can_stop_tick() // checks rq->rt.rt_nr_running, ... __dequeue_rt_entity() dec_rt_tasks() // decrements rq->rt.rt_nr_running ... Every other scheduler class performs the operation in the opposite order, and sched_update_tick_dependency() expects the values to be updated as such. So avoid the misbehaviour by inverting the order in which the above operations are performed in the RT scheduler. Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model") Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
2022-07-21sched/deadline: Fix BUG_ON condition for deboosted tasksJuri Lelli
Tasks the are being deboosted from SCHED_DEADLINE might enter enqueue_task_dl() one last time and hit an erroneous BUG_ON condition: since they are not boosted anymore, the if (is_dl_boosted()) branch is not taken, but the else if (!dl_prio) is and inside this one we BUG_ON(!is_dl_boosted), which is of course false (BUG_ON triggered) otherwise we had entered the if branch above. Long story short, the current condition doesn't make sense and always leads to triggering of a BUG. Fix this by only checking enqueue flags, properly: ENQUEUE_REPLENISH has to be present, but additional flags are not a problem. Fixes: 64be6f1f5f71 ("sched/deadline: Don't replenish from a !SCHED_DEADLINE entity") Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220714151908.533052-1-juri.lelli@redhat.com
2022-07-20module: Replace kmap() with kmap_local_page()Fabio M. De Francesco
kmap() is being deprecated in favor of kmap_local_page(). Two main problems with kmap(): (1) It comes with an overhead as mapping space is restricted and protected by a global lock for synchronization and (2) it also requires global TLB invalidation when the kmap’s pool wraps and it might block when the mapping space is fully utilized until a slot becomes available. With kmap_local_page() the mappings are per thread, CPU local, can take page faults, and can be called from any context (including interrupts). Tasks can be preempted and, when scheduled to run again, the kernel virtual addresses are restored and still valid. kmap_local_page() is faster than kmap() in kernels with HIGHMEM enabled. Since the use of kmap_local_page() in module_gzip_decompress() and in module_xz_decompress() is safe (i.e., it does not break the strict rules of use), it should be preferred over kmap(). Therefore, replace kmap() with kmap_local_page(). Tested on a QEMU/KVM x86_32 VM with 4GB RAM, booting kernels with HIGHMEM64GB enabled. Modules compressed with XZ or GZIP decompress properly. Cc: Matthew Wilcox <willy@infradead.com> Suggested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2022-07-20watchqueue: make sure to serialize 'wqueue->defunct' properlyLinus Torvalds
When the pipe is closed, we mark the associated watchqueue defunct by calling watch_queue_clear(). However, while that is protected by the watchqueue lock, new watchqueue entries aren't actually added under that lock at all: they use the pipe->rd_wait.lock instead, and looking up that pipe happens without any locking. The watchqueue code uses the RCU read-side section to make sure that the wqueue entry itself hasn't disappeared, but that does not protect the pipe_info in any way. So make sure to actually hold the wqueue lock when posting watch events, properly serializing against the pipe being torn down. Reported-by: Noam Rathaus <noamr@ssd-disclosure.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-20Merge branch irq/loongarch into irq/irqchip-nextMarc Zyngier
* irq/loongarch: : . : Merge the long awaited IRQ support for the LoongArch architecture. : : From the cover letter: : : "Currently, LoongArch based processors (e.g. Loongson-3A5000) : can only work together with LS7A chipsets. The irq chips in : LoongArch computers include CPUINTC (CPU Core Interrupt : Controller), LIOINTC (Legacy I/O Interrupt Controller), : EIOINTC (Extended I/O Interrupt Controller), PCH-PIC (Main : Interrupt Controller in LS7A chipset), PCH-LPC (LPC Interrupt : Controller in LS7A chipset) and PCH-MSI (MSI Interrupt Controller)." : : Note that this comes with non-official, arch private ACPICA : definitions until the official ACPICA update is realeased. : . irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch irqchip: Add LoongArch CPU interrupt controller support irqchip: Add Loongson Extended I/O interrupt controller support irqchip/loongson-liointc: Add ACPI init support irqchip/loongson-pch-msi: Add ACPI init support irqchip/loongson-pch-pic: Add ACPI init support irqchip: Add Loongson PCH LPC controller support LoongArch: Prepare to support multiple pch-pic and pch-msi irqdomain LoongArch: Use ACPI_GENERIC_GSI for gsi handling genirq/generic_chip: Export irq_unmap_generic_chip ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallback APCI: irq: Add support for multiple GSI domains LoongArch: Provisionally add ACPICA data structures Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-07-20genirq: Use for_each_action_of_desc in actions_show()Paran Lee
Refactor action_show() to use for_each_action_of_desc instead of a similar open-coded loop. Signed-off-by: Paran Lee <p4ranlee@gmail.com> [maz: reword commit message] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220710112614.19410-1-p4ranlee@gmail.com
2022-07-20genirq/generic_chip: Export irq_unmap_generic_chipJianmin Lv
Some irq controllers have to re-implement a private version for irq_generic_chip_ops, because they have a different xlate to translate hwirq. Export irq_unmap_generic_chip to allow reusing in drivers. Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1658314292-35346-5-git-send-email-lvjianmin@loongson.cn
2022-07-19rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is emptyZqiang
Currently, if the 'rcu_nocb_poll' kernel boot parameter is enabled, all rcuog kthreads enter polling mode. However, if all of a given group of rcuo kthreads correspond to CPUs that have been de-offloaded, the corresponding rcuog kthread will nonetheless still wake up periodically, unnecessarily consuming power and perturbing workloads. Fortunately, this situation is easily detected by the fact that the rcuog kthread's CPU's rcu_data structure's ->nocb_head_rdp list is empty. This commit saves power and avoids unnecessarily perturbing workloads by putting an rcuog kthread to sleep during any time period when all of its rcuo kthreads' CPUs are de-offloaded. Co-developed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu/nocb: Add option to opt rcuo kthreads out of RT priorityUladzislau Rezki (Sony)
This commit introduces a RCU_NOCB_CPU_CB_BOOST Kconfig option that prevents rcuo kthreads from running at real-time priority, even in kernels built with RCU_BOOST. This capability is important to devices needing low-latency (as in a few milliseconds) response from expedited RCU grace periods, but which are not running a classic real-time workload. On such devices, permitting the rcuo kthreads to run at real-time priority results in unacceptable latencies imposed on the application tasks, which run as SCHED_OTHER. See for example the following trace output: <snip> <...>-60 [006] d..1 2979.028717: rcu_batch_start: rcu_preempt CBs=34619 bl=270 <snip> If that rcuop kthread were permitted to run at real-time SCHED_FIFO priority, it would monopolize its CPU for hundreds of milliseconds while invoking those 34619 RCU callback functions, which would cause an unacceptably long latency spike for many application stacks on Android platforms. However, some existing real-time workloads require that callback invocation run at SCHED_FIFO priority, for example, those running on systems with heavy SCHED_OTHER background loads. (It is the real-time system's administrator's responsibility to make sure that important real-time tasks run at a higher priority than do RCU's kthreads.) Therefore, this new RCU_NOCB_CPU_CB_BOOST Kconfig option defaults to "y" on kernels built with PREEMPT_RT and defaults to "n" otherwise. The effect is to preserve current behavior for real-time systems, but for other systems to allow expedited RCU grace periods to run with real-time priority while continuing to invoke RCU callbacks as SCHED_OTHER. As you would expect, this RCU_NOCB_CPU_CB_BOOST Kconfig option has no effect except on CPUs with offloaded RCU callbacks. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()Zqiang
Callbacks are invoked in RCU kthreads when calbacks are offloaded (rcu_nocbs boot parameter) or when RCU's softirq handler has been offloaded to rcuc kthreads (use_softirq==0). The current code allows for the rcu_nocbs case but not the use_softirq case. This commit adds support for the use_softirq case. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu/nocb: Add an option to offload all CPUs on bootJoel Fernandes
Systems built with CONFIG_RCU_NOCB_CPU=y but booted without either the rcu_nocbs= or rcu_nohz_full= kernel-boot parameters will not have callback offloading on any of the CPUs, nor can any of the CPUs be switched to enable callback offloading at runtime. Although this is intentional, it would be nice to have a way to offload all the CPUs without having to make random bootloaders specify either the rcu_nocbs= or the rcu_nohz_full= kernel-boot parameters. This commit therefore provides a new CONFIG_RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that switches the default so as to offload callback processing on all of the CPUs. This default can still be overridden using the rcu_nocbs= and rcu_nohz_full= kernel-boot parameters. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Reviewed-by: Uladzislau Rezki <urezki@gmail.com> (In v4.1, fixed issues with CONFIG maze reported by kernel test robot). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() ↵Zqiang
direct call If the rcuog/o[p] kthreads spawn failed, the offloaded rdp needs to be explicitly deoffloaded, otherwise the target rdp is still considered offloaded even though nothing actually handles the callbacks. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking orderZqiang
In case of failure to spawn either rcuog or rcuo[p] kthreads for a given rdp, rcu_nocb_rdp_deoffload() needs to be called with the hotplug lock and the barrier_mutex held. However cpus write lock is already held while calling rcutree_prepare_cpu(). It's not possible to call rcu_nocb_rdp_deoffload() from there with just locking the barrier_mutex or this would result in a locking inversion against rcu_nocb_cpu_deoffload() which holds both locks in the reverse order. Simply solve this with inverting the locking order inside rcu_nocb_cpu_[de]offload(). This will be a pre-requisite to toggle NOCB states toward cpusets anyway. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu/nocb: Add/del rdp to iterate from rcuog itselfFrederic Weisbecker
NOCB rdp's are part of a group whose list is iterated by the corresponding rdp leader. This list is RCU traversed because an rdp can be either added or deleted concurrently. Upon addition, a new iteration to the list after a synchronization point (a pair of LOCK/UNLOCK ->nocb_gp_lock) is forced to make sure: 1) we didn't miss a new element added in the middle of an iteration 2) we didn't ignore a whole subset of the list due to an element being quickly deleted and then re-added. 3) we prevent from probably other surprises... Although this layout is expected to be safe, it doesn't help anybody to sleep well. Simplify instead the nocb state toggling with moving the list modification from the nocb (de-)offloading workqueue to the rcuog kthreads instead. Whenever the rdp leader is expected to (re-)set the SEGCBLIST_KTHREAD_GP flag of a target rdp, the latter is queued so that the leader handles the flag flip along with adding or deleting the target rdp to the list to iterate. This way the list modification and iteration happen from the same kthread and those operations can't race altogether. As a bonus, the flags for each rdp don't need to be checked locklessly before each iteration, which is one less opportunity to produce nightmares. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu/tree: Add comment to describe GP-done condition in fqs loopNeeraj Upadhyay
Add a comment to explain why !rcu_preempt_blocked_readers_cgp() condition is required on root rnp node, for GP completion check in rcu_gp_fqs_loop(). Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()Paul E. McKenney
This commit saves a line of code by initializing the rcu_gp_fqs() function's first_gp_fqs local variable in its declaration. Reported-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19rcu/kvfree: Remove useless monitor_todo flagJoel Fernandes (Google)
monitor_todo is not needed as the work struct already tracks if work is pending. Just use that to know if work is pending using schedule_delayed_work() helper. Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu: Cleanup RCU urgency state for offline CPUZqiang
When a CPU is slow to provide a quiescent state for a given grace period, RCU takes steps to encourage that CPU to get with the quiescent-state program in a more timely fashion. These steps include these flags in the rcu_data structure: 1. ->rcu_urgent_qs, which causes the scheduling-clock interrupt to request an otherwise pointless context switch from the scheduler. 2. ->rcu_need_heavy_qs, which causes both cond_resched() and RCU's context-switch hook to do an immediate momentary quiscent state. 3. ->rcu_need_heavy_qs, which causes the scheduler-clock tick to be enabled even on nohz_full CPUs with only one runnable task. These flags are of course cleared once the corresponding CPU has passed through a quiescent state. Unless that quiescent state is the CPU going offline, which means that when the CPU comes back online, it will needlessly consume additional CPU time and incur additional latency, which constitutes a minor but very real performance bug. This commit therefore adds the call to rcu_disable_urgency_upon_qs() that clears these flags to the CPU-hotplug offlining code path. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu: tiny: Record kvfree_call_rcu() call stack for KASANJohannes Berg
When running KASAN with Tiny RCU (e.g. under ARCH=um, where a working KASAN patch is now available), we don't get any information on the original kfree_rcu() (or similar) caller when a problem is reported, as Tiny RCU doesn't record this. Add the recording, which required pulling kvfree_call_rcu() out of line for the KASAN case since the recording function (kasan_record_aux_stack_noalloc) is neither exported, nor can we include kasan.h into rcutiny.h. without KASAN, the patch has no size impact (ARCH=um kernel): text data bss dec hex filename 6151515 4423154 33148520 43723189 29b29b5 linux 6151515 4423154 33148520 43723189 29b29b5 linux + patch with KASAN, the impact on my build was minimal: text data bss dec hex filename 13915539 7388050 33282304 54585893 340ea25 linux 13911266 7392114 33282304 54585684 340e954 linux + patch -4273 +4064 +-0 -209 Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19locking/csd_lock: Change csdlock_debug from early_param to __setupChen Zhongjin
The csdlock_debug kernel-boot parameter is parsed by the early_param() function csdlock_debug(). If set, csdlock_debug() invokes static_branch_enable() to enable csd_lock_wait feature, which triggers a panic on arm64 for kernels built with CONFIG_SPARSEMEM=y and CONFIG_SPARSEMEM_VMEMMAP=n. With CONFIG_SPARSEMEM_VMEMMAP=n, __nr_to_section is called in static_key_enable() and returns NULL, resulting in a NULL dereference because mem_section is initialized only later in sparse_init(). This is also a problem for powerpc because early_param() functions are invoked earlier than jump_label_init(), also resulting in static_key_enable() failures. These failures cause the warning "static key 'xxx' used before call to jump_label_init()". Thus, early_param is too early for csd_lock_wait to run static_branch_enable(), so changes it to __setup to fix these. Fixes: 8d0968cc6b8f ("locking/csd_lock: Add boot parameter for controlling CSD lock debugging") Cc: stable@vger.kernel.org Reported-by: Chen jingwen <chenjingwen6@huawei.com> Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19srcu: Make expedited RCU grace periods block even less frequentlyNeeraj Upadhyay
The purpose of commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") was to prevent a long series of never-blocking expedited SRCU grace periods from blocking kernel-live-patching (KLP) progress. Although it was successful, it also resulted in excessive boot times on certain embedded workloads running under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive" means increasing the boot time up into the three-to-four minute range. This increase in boot time was due to the more than 6000 back-to-back invocations of synchronize_rcu_expedited() within the KVM host OS, which in turn resulted from qemu's emulation of a long series of MMIO accesses. Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") did not significantly help this particular use case. Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values of non-sleeping per phase counts on a system with preemption enabled, and observed the following boot times: +──────────────────────────+────────────────+ | SRCU_MAX_NODELAY_PHASE | Boot time (s) | +──────────────────────────+────────────────+ | 100 | 30.053 | | 150 | 25.151 | | 200 | 20.704 | | 250 | 15.748 | | 500 | 11.401 | | 1000 | 11.443 | | 10000 | 11.258 | | 1000000 | 11.154 | +──────────────────────────+────────────────+ Analysis on the experiment results show additional improvements with CPU-bound delays approaching one jiffy in duration. This improvement was also seen when number of per-phase iterations were scaled to one jiffy. This commit therefore scales per-grace-period phase number of non-sleeping polls so that non-sleeping polls extend for about one jiffy. In addition, the delay-calculation call to srcu_get_delay() in srcu_gp_end() is replaced with a simple check for an expedited grace period. This change schedules callback invocation immediately after expedited grace periods complete, which results in greatly improved boot times. Testing done by Marc and Zhangfei confirms that this change recovers most of the performance degradation in boottime; for CONFIG_HZ_250 configuration, specifically, boot times improve from 3m50s to 41s on Marc's setup; and from 2m40s to ~9.7s on Zhangfei's setup. In addition to the changes to default per phase delays, this change adds 3 new kernel parameters - srcutree.srcu_max_nodelay, srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay. This allows users to configure the srcu grace period scanning delays in order to more quickly react to additional use cases. Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reported-by: yueluck <yueluck@163.com> Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Tested-by: Marc Zyngier <maz@kernel.org> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org> Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-19rcu: Forbid RCU_STRICT_GRACE_PERIOD in TINY_RCU kernelsPaul E. McKenney
The RCU_STRICT_GRACE_PERIOD Kconfig option does nothing in kernels built with CONFIG_TINY_RCU=y, so this commit adjusts the dependencies to disallow this combination. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19srcu: Block less aggressively for expedited grace periodsPaul E. McKenney
Commit 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") fixed a problem where a long-running expedited SRCU grace period could block kernel live patching. It did so by giving up on expediting once a given SRCU expedited grace period grew too old. Unfortunately, this added excessive delays to boots of virtual embedded systems specifying "-bios QEMU_EFI.fd" to qemu. This commit therefore makes the transition away from expediting less aggressive, increasing the per-grace-period phase number of non-sleeping polls of readers from one to three and increasing the required grace-period age from one jiffy (actually from zero to one jiffies) to two jiffies (actually from one to two jiffies). Fixes: 282d8998e997 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org> Reported-by: chenxiang (M)" <chenxiang66@hisilicon.com> Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com> Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
2022-07-19rcu: Immediately boost preempted readers for strict grace periodsZqiang
The intent of the CONFIG_RCU_STRICT_GRACE_PERIOD Konfig option is to cause normal grace periods to complete quickly in order to better catch errors resulting from improperly leaking pointers from RCU read-side critical sections. However, kernels built with this option enabled still wait for some hundreds of milliseconds before boosting RCU readers that have been preempted within their current critical section. The value of this delay is set by the CONFIG_RCU_BOOST_DELAY Kconfig option, which defaults to 500 milliseconds. This commit therefore causes kernels build with strict grace periods to ignore CONFIG_RCU_BOOST_DELAY. This causes rcu_initiate_boost() to start boosting immediately after all CPUs on a given leaf rcu_node structure have passed through their quiescent states. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu: Add rnp->cbovldmask check in rcutree_migrate_callbacks()Zqiang
Currently, the rcu_node structure's ->cbovlmask field is set in call_rcu() when a given CPU is suffering from callback overload. But if that CPU goes offline, the outgoing CPU's callbacks is migrated to the running CPU, which is likely to overload the running CPU. However, that CPU's bit in its leaf rcu_node structure's ->cbovlmask field remains zero. Initially, this is OK because the outgoing CPU's bit remains set. However, that bit will be cleared at the next end of a grace period, at which time it is quite possible that the running CPU will still be overloaded. If the running CPU invokes call_rcu(), then overload will be checked for and the bit will be set. Except that there is no guarantee that the running CPU will invoke call_rcu(), in which case the next grace period will fail to take the running CPU's overload condition into account. Plus, because the bit is not set, the end of the grace period won't check for overload on this CPU. This commit therefore adds a call to check_cb_ovld_locked() in rcutree_migrate_callbacks() to set the running CPU's ->cbovlmask bit appropriately. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu: Avoid tracing a few functions executed in stop machinePatrick Wang
Stop-machine recently started calling additional functions while waiting: ---------------------------------------------------------------- Former stop machine wait loop: do { cpu_relax(); => macro ... } while (curstate != STOPMACHINE_EXIT); ----------------------------------------------------------------- Current stop machine wait loop: do { stop_machine_yield(cpumask); => function (notraced) ... touch_nmi_watchdog(); => function (notraced, inside calls also notraced) ... rcu_momentary_dyntick_idle(); => function (notraced, inside calls traced) } while (curstate != MULTI_STOP_EXIT); ------------------------------------------------------------------ These functions (and the functions that they call) must be marked notrace to prevent them from being updated while they are executing. The consequences of failing to mark these functions can be severe: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 1-...!: (0 ticks this GP) idle=14f/1/0x4000000000000000 softirq=3397/3397 fqs=0 rcu: 3-...!: (0 ticks this GP) idle=ee9/1/0x4000000000000000 softirq=5168/5168 fqs=0 (detected by 0, t=8137 jiffies, g=5889, q=2 ncpus=4) Task dump for CPU 1: task:migration/1 state:R running task stack: 0 pid: 19 ppid: 2 flags:0x00000000 Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174 Call Trace: Task dump for CPU 3: task:migration/3 state:R running task stack: 0 pid: 29 ppid: 2 flags:0x00000000 Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174 Call Trace: rcu: rcu_preempt kthread timer wakeup didn't happen for 8136 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 rcu: Possible timer handling issue on cpu=2 timer-softirq=594 rcu: rcu_preempt kthread starved for 8137 jiffies! g5889 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2 rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior. rcu: RCU grace-period kthread stack dump: task:rcu_preempt state:I stack: 0 pid: 14 ppid: 2 flags:0x00000000 Call Trace: schedule+0x56/0xc2 schedule_timeout+0x82/0x184 rcu_gp_fqs_loop+0x19a/0x318 rcu_gp_kthread+0x11a/0x140 kthread+0xee/0x118 ret_from_exception+0x0/0x14 rcu: Stack dump where RCU GP kthread last ran: Task dump for CPU 2: task:migration/2 state:R running task stack: 0 pid: 24 ppid: 2 flags:0x00000000 Stopper: multi_cpu_stop+0x0/0x18c <- stop_machine_cpuslocked+0x128/0x174 Call Trace: This commit therefore marks these functions notrace: rcu_preempt_deferred_qs() rcu_preempt_need_deferred_qs() rcu_preempt_deferred_qs_irqrestore() [ paulmck: Apply feedback from Neeraj Upadhyay. ] Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19rcu: Decrease FQS scan wait time in case of callback overloadingPaul E. McKenney
The force-quiesce-state loop function rcu_gp_fqs_loop() checks for callback overloading and does an immediate initial scan for idle CPUs if so. However, subsequent rescans will be carried out at as leisurely a rate as they always are, as specified by the rcutree.jiffies_till_next_fqs module parameter. It might be tempting to just continue immediately rescanning, but this turns the RCU grace-period kthread into a CPU hog. It might also be tempting to reduce the time between rescans to a single jiffy, but this can be problematic on larger systems. This commit therefore divides the normal time between rescans by three, rounding up. Thus a small system running at HZ=1000 that is suffering from callback overload will wait only one jiffy instead of the normal three between rescans. [ paulmck: Apply Neeraj Upadhyay feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
2022-07-19bpf: remove obsolete KMALLOC_MAX_SIZE restriction on array map value sizeAndrii Nakryiko
Syscall-side map_lookup_elem() and map_update_elem() used to use kmalloc() to allocate temporary buffers of value_size, so KMALLOC_MAX_SIZE limit on value_size made sense to prevent creation of array map that won't be accessible through syscall interface. But this limitation since has been lifted by relying on kvmalloc() in syscall handling code. So remove KMALLOC_MAX_SIZE, which among other things means that it's possible to have BPF global variable sections (.bss, .data, .rodata) bigger than 8MB now. Keep the sanity check to prevent trivial overflows like round_up(map->value_size, 8) and restrict value size to <= INT_MAX (2GB). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220715053146.1291891-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-19bpf: make uniform use of array->elem_size everywhere in arraymap.cAndrii Nakryiko
BPF_MAP_TYPE_ARRAY is rounding value_size to closest multiple of 8 and stores that as array->elem_size for various memory allocations and accesses. But the code tends to re-calculate round_up(map->value_size, 8) in multiple places instead of using array->elem_size. Cleaning this up and making sure we always use array->size to avoid duplication of this (admittedly simple) logic for consistency. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220715053146.1291891-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-19bpf: fix potential 32-bit overflow when accessing ARRAY map elementAndrii Nakryiko
If BPF array map is bigger than 4GB, element pointer calculation can overflow because both index and elem_size are u32. Fix this everywhere by forcing 64-bit multiplication. Extract this formula into separate small helper and use it consistently in various places. Speculative-preventing formula utilizing index_mask trick is left as is, but explicit u64 casts are added in both places. Fixes: c85d69135a91 ("bpf: move memory size checks to bpf_map_charge_init()") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20220715053146.1291891-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>