Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Two minor audit patches: on marks a function as static, the other
removes a redundant length check"
* tag 'audit-pr-20220801' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: make is_audit_feature_set() static
audit: remove redundant data_len check
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook:
- Allow unsharing time namespace on vfork+exec (Andrei Vagin)
- Replace usage of deprecated kmap APIs (Fabio M. De Francesco)
- Fix spelling mistake (Zhang Jiaming)
* tag 'execve-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
exec: Call kmap_local_page() in copy_string_kernel()
exec: Fix a spelling mistake
selftests/timens: add a test for vfork+exit
fs/exec: allow to unshare a time namespace on vfork+exec
|
|
Using msleep() is problematic because it's compared against
ratelimiter.c's ktime_get_coarse_boottime_ns(), which means on systems
with slow jiffies (such as UML's forced HZ=100), the result is
inaccurate. So switch to using schedule_hrtimeout().
However, hrtimer gives us access only to the traditional posix timers,
and none of the _COARSE variants. So now, rather than being too
imprecise like jiffies, it's too precise.
One solution would be to give it a large "range" value, but this will
still fire early on a loaded system. A better solution is to align the
timeout to the actual coarse timer, and then round up to the nearest
tick, plus change.
So add the timeout to the current coarse time, and then
schedule_hrtimer() until the absolute computed time.
This should hopefully reduce flakes in CI as well. Note that we keep the
retry loop in case the entire function is running behind, because the
test could still be scheduled out, by either the kernel or by the
hypervisor's kernel, in which case restarting the test and hoping to not
be scheduled out still helps.
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull block updates from Jens Axboe:
- Improve the type checking of request flags (Bart)
- Ensure queue mapping for a single queues always picks the right queue
(Bart)
- Sanitize the io priority handling (Jan)
- rq-qos race fix (Jinke)
- Reserved tags handling improvements (John)
- Separate memory alignment from file/disk offset aligment for O_DIRECT
(Keith)
- Add new ublk driver, userspace block driver using io_uring for
communication with the userspace backend (Ming)
- Use try_cmpxchg() to cleanup the code in various spots (Uros)
- Finally remove bdevname() (Christoph)
- Clean up the zoned device handling (Christoph)
- Clean up independent access range support (Christoph)
- Clean up and improve block sysfs handling (Christoph)
- Clean up and improve teardown of block devices.
This turns the usual two step process into something that is simpler
to implement and handle in block drivers (Christoph)
- Clean up chunk size handling (Christoph)
- Misc cleanups and fixes (Bart, Bo, Dan, GuoYong, Jason, Keith, Liu,
Ming, Sebastian, Yang, Ying)
* tag 'for-5.20/block-2022-07-29' of git://git.kernel.dk/linux-block: (178 commits)
ublk_drv: fix double shift bug
ublk_drv: make sure that correct flags(features) returned to userspace
ublk_drv: fix error handling of ublk_add_dev
ublk_drv: fix lockdep warning
block: remove __blk_get_queue
block: call blk_mq_exit_queue from disk_release for never added disks
blk-mq: fix error handling in __blk_mq_alloc_disk
ublk: defer disk allocation
ublk: rewrite ublk_ctrl_get_queue_affinity to not rely on hctx->cpumask
ublk: fold __ublk_create_dev into ublk_ctrl_add_dev
ublk: cleanup ublk_ctrl_uring_cmd
ublk: simplify ublk_ch_open and ublk_ch_release
ublk: remove the empty open and release block device operations
ublk: remove UBLK_IO_F_PREFLUSH
ublk: add a MAINTAINERS entry
block: don't allow the same type rq_qos add more than once
mmc: fix disk/queue leak in case of adding disk failure
ublk_drv: fix an IS_ERR() vs NULL check
ublk: remove UBLK_IO_F_INTEGRITY
ublk_drv: remove unneeded semicolon
...
|
|
Pull io_uring updates from Jens Axboe:
- As per (valid) complaint in the last merge window, fs/io_uring.c has
grown quite large these days. io_uring isn't really tied to fs
either, as it supports a wide variety of functionality outside of
that.
Move the code to io_uring/ and split it into files that either
implement a specific request type, and split some code into helpers
as well. The code is organized a lot better like this, and io_uring.c
is now < 4K LOC (me).
- Deprecate the epoll_ctl opcode. It'll still work, just trigger a
warning once if used. If we don't get any complaints on this, and I
don't expect any, then we can fully remove it in a future release
(me).
- Improve the cancel hash locking (Hao)
- kbuf cleanups (Hao)
- Efficiency improvements to the task_work handling (Dylan, Pavel)
- Provided buffer improvements (Dylan)
- Add support for recv/recvmsg multishot support. This is similar to
the accept (or poll) support for have for multishot, where a single
SQE can trigger everytime data is received. For applications that
expect to do more than a few receives on an instantiated socket, this
greatly improves efficiency (Dylan).
- Efficiency improvements for poll handling (Pavel)
- Poll cancelation improvements (Pavel)
- Allow specifiying a range for direct descriptor allocations (Pavel)
- Cleanup the cqe32 handling (Pavel)
- Move io_uring types to greatly cleanup the tracing (Pavel)
- Tons of great code cleanups and improvements (Pavel)
- Add a way to do sync cancelations rather than through the sqe -> cqe
interface, as that's a lot easier to use for some use cases (me).
- Add support to IORING_OP_MSG_RING for sending direct descriptors to a
different ring. This avoids the usually problematic SCM case, as we
disallow those. (me)
- Make the per-command alloc cache we use for apoll generic, place
limits on it, and use it for netmsg as well (me).
- Various cleanups (me, Michal, Gustavo, Uros)
* tag 'for-5.20/io_uring-2022-07-29' of git://git.kernel.dk/linux-block: (172 commits)
io_uring: ensure REQ_F_ISREG is set async offload
net: fix compat pointer in get_compat_msghdr()
io_uring: Don't require reinitable percpu_ref
io_uring: fix types in io_recvmsg_multishot_overflow
io_uring: Use atomic_long_try_cmpxchg in __io_account_mem
io_uring: support multishot in recvmsg
net: copy from user before calling __get_compat_msghdr
net: copy from user before calling __copy_msghdr
io_uring: support 0 length iov in buffer select in compat
io_uring: fix multishot ending when not polled
io_uring: add netmsg cache
io_uring: impose max limit on apoll cache
io_uring: add abstraction around apoll cache
io_uring: move apoll cache to poll.c
io_uring: consolidate hash_locked io-wq handling
io_uring: clear REQ_F_HASH_LOCKED on hash removal
io_uring: don't race double poll setting REQ_F_ASYNC_DATA
io_uring: don't miss setting REQ_F_DOUBLE_POLL
io_uring: disable multishot recvmsg
io_uring: only trace one of complete or overflow
...
|
|
Show the syntax errors for event probes in error_log file as same as
other dynamic events, so that user can understand what is the problem.
Link: https://lkml.kernel.org/r/165932113556.2850673.3483079297896607612.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
In allocate_trace_buffers(), if allocating tr->max_buffer
fails, we can directly call free_trace_buffer to free
tr->array_buffer.
Link: https://lkml.kernel.org/r/65f0702d-07f6-08de-2a07-4c50af56a67b@huawei.com
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These are mostly minor improvements all over including new CPU IDs for
the Intel RAPL driver, an Energy Model rework to use micro-Watt as the
power unit, cpufreq fixes and cleanus, cpuidle updates, devfreq
updates, documentation cleanups and a new version of the pm-graph
suite of utilities.
Specifics:
- Make cpufreq_show_cpus() more straightforward (Viresh Kumar).
- Drop unnecessary CPU hotplug locking from store() used by cpufreq
sysfs attributes (Viresh Kumar).
- Make the ACPI cpufreq driver support the boost control interface on
Zhaoxin/Centaur processors (Tony W Wang-oc).
- Print a warning message on attempts to free an active cpufreq
policy which should never happen (Viresh Kumar).
- Fix grammar in the Kconfig help text for the loongson2 cpufreq
driver (Randy Dunlap).
- Use cpumask_var_t for an on-stack CPU mask in the ondemand cpufreq
governor (Zhao Liu).
- Add trace points for guest_halt_poll_ns grow/shrink to the haltpoll
cpuidle driver (Eiichi Tsukata).
- Modify intel_idle to treat C1 and C1E as independent idle states on
Sapphire Rapids (Artem Bityutskiy).
- Extend support for wakeirq to callback wrappers used during system
suspend and resume (Ulf Hansson).
- Defer waiting for device probe before loading a hibernation image
till the first actual device access to avoid possible deadlocks
reported by syzbot (Tetsuo Handa).
- Unify device_init_wakeup() for PM_SLEEP and !PM_SLEEP (Bjorn
Helgaas).
- Add Raptor Lake-P to the list of processors supported by the Intel
RAPL driver (George D Sworo).
- Add Alder Lake-N and Raptor Lake-P to the list of processors for
which Power Limit4 is supported in the Intel RAPL driver (Sumeet
Pawnikar).
- Make pm_genpd_remove() check genpd_debugfs_dir against NULL before
attempting to remove it (Hsin-Yi Wang).
- Change the Energy Model code to represent power in micro-Watts and
adjust its users accordingly (Lukasz Luba).
- Add new devfreq driver for Mediatek CCI (Cache Coherent
Interconnect) (Johnson Wang).
- Convert the Samsung Exynos SoC Bus bindings to DT schema of
exynos-bus.c (Krzysztof Kozlowski).
- Address kernel-doc warnings by adding the description for unused
function parameters in devfreq core (Mauro Carvalho Chehab).
- Use NULL to pass a null pointer rather than zero according to the
function propotype in imx-bus.c (Colin Ian King).
- Print error message instead of error interger value in
tegra30-devfreq.c (Dmitry Osipenko).
- Add checks to prevent setting negative frequency QoS limits for
CPUs (Shivnandan Kumar).
- Update the pm-graph suite of utilities to the latest revision 5.9
including multiple improvements (Todd Brandt).
- Drop pme_interrupt reference from the PCI power management
documentation (Mario Limonciello)"
* tag 'pm-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (27 commits)
powercap: RAPL: Add Power Limit4 support for Alder Lake-N and Raptor Lake-P
PM: QoS: Add check to make sure CPU freq is non-negative
PM: hibernate: defer device probing when resuming from hibernation
intel_idle: make SPR C1 and C1E be independent
cpufreq: ondemand: Use cpumask_var_t for on-stack cpu mask
cpufreq: loongson2: fix Kconfig "its" grammar
pm-graph v5.9
cpufreq: Warn users while freeing active policy
cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1
firmware: arm_scmi: Get detailed power scale from perf
Documentation: EM: Switch to micro-Watts scale
PM: EM: convert power field to micro-Watts precision and align drivers
PM / devfreq: tegra30: Add error message for devm_devfreq_add_device()
PM / devfreq: imx-bus: use NULL to pass a null pointer rather than zero
PM / devfreq: shut up kernel-doc warnings
dt-bindings: interconnect: samsung,exynos-bus: convert to dtschema
PM / devfreq: mediatek: Introduce MediaTek CCI devfreq driver
dt-bindings: interconnect: Add MediaTek CCI dt-bindings
PM: domains: Ensure genpd_debugfs_dir exists before remove
PM: runtime: Extend support for wakeirq for force_suspend|resume
...
|
|
kernel_text_address() treats ftrace_trampoline, kprobe_insn_slot
and bpf_text_address as valid kprobe addresses - which is not ideal.
These text areas are removable and changeable without any notification
to kprobes, and probing on them can trigger unexpected behavior:
https://lkml.org/lkml/2022/7/26/1148
Considering that jump_label and static_call text are already
forbiden to probe, kernel_text_address() should be replaced with
core_kernel_text() and is_module_text_address() to check other text
areas which are unsafe to kprobe.
[ mingo: Rewrote the changelog. ]
Fixes: 5b485629ba0d ("kprobes, extable: Identify kprobes trampolines as kernel text area")
Fixes: 74451e66d516 ("bpf: make jited programs visible in traces")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20220801033719.228248-1-chenzhongjin@huawei.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for interrupt core and drivers:
Core:
- Fix a few inconsistencies between UP and SMP vs interrupt
affinities
- Small updates and cleanups all over the place
New drivers:
- LoongArch interrupt controller
- Renesas RZ/G2L interrupt controller
Updates:
- Hotpath optimization for SiFive PLIC
- Workaround for broken PLIC edge triggered interrupts
- Simall cleanups and improvements as usual"
* tag 'irq-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
irqchip/mmp: Declare init functions in common header file
irqchip/mips-gic: Check the return value of ioremap() in gic_of_init()
genirq: Use for_each_action_of_desc in actions_show()
irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch
irqchip: Add LoongArch CPU interrupt controller support
irqchip: Add Loongson Extended I/O interrupt controller support
irqchip/loongson-liointc: Add ACPI init support
irqchip/loongson-pch-msi: Add ACPI init support
irqchip/loongson-pch-pic: Add ACPI init support
irqchip: Add Loongson PCH LPC controller support
LoongArch: Prepare to support multiple pch-pic and pch-msi irqdomain
LoongArch: Use ACPI_GENERIC_GSI for gsi handling
genirq/generic_chip: Export irq_unmap_generic_chip
ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallback
APCI: irq: Add support for multiple GSI domains
LoongArch: Provisionally add ACPICA data structures
irqdomain: Use hwirq_max instead of revmap_size for NOMAP domains
irqdomain: Report irq number for NOMAP domains
irqchip/gic-v3: Fix comment typo
dt-bindings: interrupt-controller: renesas,rzg2l-irqc: Document RZ/V2L SoC
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Fix Intel Alder Lake PEBS memory access latency & data source
profiling info bugs.
- Use Intel large-PEBS hardware feature in more circumstances, to
reduce PMI overhead & reduce sampling data.
- Extend the lost-sample profiling output with the PERF_FORMAT_LOST ABI
variant, which tells tooling the exact number of samples lost.
- Add new IBS register bits definitions.
- AMD uncore events: Add PerfMonV2 DF (Data Fabric) enhancements.
* tag 'perf-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/ibs: Add new IBS register bits into header
perf/x86/intel: Fix PEBS data source encoding for ADL
perf/x86/intel: Fix PEBS memory access info encoding for ADL
perf/core: Add a new read format to get a number of lost samples
perf/x86/amd/uncore: Add PerfMonV2 RDPMC assignments
perf/x86/amd/uncore: Add PerfMonV2 DF event format
perf/x86/amd/uncore: Detect available DF counters
perf/x86/amd/uncore: Use attr_update for format attributes
perf/x86/amd/uncore: Use dynamic events array
x86/events/intel/ds: Enable large PEBS for PERF_SAMPLE_WEIGHT_TYPE
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"This was a fairly quiet cycle for the locking subsystem:
- lockdep: Fix a handful of the more complex lockdep_init_map_*()
primitives that can lose the lock_type & cause false reports. No
such mishap was observed in the wild.
- jump_label improvements: simplify the cross-arch support of initial
NOP patching by making it arch-specific code (used on MIPS only),
and remove the s390 initial NOP patching that was superfluous"
* tag 'locking-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Fix lockdep_init_map_*() confusion
jump_label: make initial NOP patching the special case
jump_label: mips: move module NOP patching into arch code
jump_label: s390: avoid pointless initial NOP patching
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Load-balancing improvements:
- Improve NUMA balancing on AMD Zen systems for affine workloads.
- Improve the handling of reduced-capacity CPUs in load-balancing.
- Energy Model improvements: fix & refine all the energy fairness
metrics (PELT), and remove the conservative threshold requiring 6%
energy savings to migrate a task. Doing this improves power
efficiency for most workloads, and also increases the reliability
of energy-efficiency scheduling.
- Optimize/tweak select_idle_cpu() to spend (much) less time
searching for an idle CPU on overloaded systems. There's reports of
several milliseconds spent there on large systems with large
workloads ...
[ Since the search logic changed, there might be behavioral side
effects. ]
- Improve NUMA imbalance behavior. On certain systems with spare
capacity, initial placement of tasks is non-deterministic, and such
an artificial placement imbalance can persist for a long time,
hurting (and sometimes helping) performance.
The fix is to make fork-time task placement consistent with runtime
NUMA balancing placement.
Note that some performance regressions were reported against this,
caused by workloads that are not memory bandwith limited, which
benefit from the artificial locality of the placement bug(s). Mel
Gorman's conclusion, with which we concur, was that consistency is
better than random workload benefits from non-deterministic bugs:
"Given there is no crystal ball and it's a tradeoff, I think
it's better to be consistent and use similar logic at both fork
time and runtime even if it doesn't have universal benefit."
- Improve core scheduling by fixing a bug in
sched_core_update_cookie() that caused unnecessary forced idling.
- Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs
for newly woken tasks.
- Fix a newidle balancing bug that introduced unnecessary wakeup
latencies.
ABI improvements/fixes:
- Do not check capabilities and do not issue capability check denial
messages when a scheduler syscall doesn't require privileges. (Such
as increasing niceness.)
- Add forced-idle accounting to cgroups too.
- Fix/improve the RSEQ ABI to not just silently accept unknown flags.
(No existing tooling is known to have learned to rely on the
previous behavior.)
- Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
Optimizations:
- Optimize & simplify leaf_cfs_rq_list()
- Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
Misc fixes & cleanups:
- Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
- Fix a full-NOHZ bug that can in some cases result in the tick not
being re-enabled when the last SCHED_RT task is gone from a
runqueue but there's still SCHED_OTHER tasks around.
- Various PREEMPT_RT related fixes.
- Misc cleanups & smaller fixes"
* tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
rseq: Kill process when unknown flags are encountered in ABI structures
rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags
sched/core: Fix the bug that task won't enqueue into core tree when update cookie
nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
sched/core: Always flush pending blk_plug
sched/fair: fix case with reduced capacity CPU
sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
sched/core: add forced idle accounting for cgroups
sched/fair: Remove the energy margin in feec()
sched/fair: Remove task_util from effective utilization in feec()
sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
sched/fair: Rename select_idle_mask to select_rq_mask
sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
sched/fair: Decay task PELT values during wakeup migration
sched/fair: Provide u64 read for 32-bits arch helper
sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
sched: only perform capability check on privileged operation
sched: Remove unused function group_first_cpu()
sched/fair: Remove redundant word " *"
selftests/rseq: check if libc rseq support is registered
...
|
|
rseq_abi()->flags and rseq_abi()->rseq_cs->flags 29 upper bits are
currently unused.
The current behavior when those bits are set is to ignore them. This is
not an ideal behavior, because when future features will start using
those flags, if user-space fails to correctly validate that the kernel
indeed supports those flags (e.g. with a new sys_rseq flags bit) before
using them, it may incorrectly assume that the kernel will handle those
flags way when in fact those will be silently ignored on older kernels.
Validating that unused flags bits are cleared will allow a smoother
transition when those flags will start to be used by allowing
applications to fail early, and obviously, when they attempt to use the
new flags on an older kernel that does not support them.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20220622194617.1155957-2-mathieu.desnoyers@efficios.com
|
|
The pretty much unused RSEQ_CS_FLAG_NO_RESTART_ON_* flags introduce
complexity in rseq, and are subtly buggy [1]. Solving those issues
requires introducing additional complexity in the rseq implementation
for each supported architecture.
Considering that it complexifies the rseq ABI, I am proposing that we
deprecate those flags. [2]
So far there appears to be consensus from maintainers of user-space
projects impacted by this feature that its removal would be a welcome
simplification. [3]
The deprecation approach proposed here is to issue WARN_ON_ONCE() when
encountering those flags and kill the offending process with sigsegv.
This should allow us to quickly identify whether anyone yells at us for
removing this.
Link: https://lore.kernel.org/lkml/20220618182515.95831-1-minhquangbui99@gmail.com/ [1]
Link: https://lore.kernel.org/lkml/258546133.12151.1655739550814.JavaMail.zimbra@efficios.com/ [2]
Link: https://lore.kernel.org/lkml/87pmj1enjh.fsf@email.froward.int.ebiederm.org/ [3]
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/lkml/20220622194617.1155957-1-mathieu.desnoyers@efficios.com
|
|
The only use case of the platform_has() infrastructure has been
removed again, so remove the whole feature.
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 guest using Xen
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Link: https://lore.kernel.org/r/20220622063838.8854-3-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Update the 'mitigations=' kernel param documentation
- Check the IBPB feature flag before enabling IBPB in firmware calls
because cloud vendors' fantasy when it comes to creating guest
configurations is unlimited
- Unexport sev_es_ghcb_hv_call() before 5.19 releases now that HyperV
doesn't need it anymore
- Remove dead CONFIG_* items
* tag 'x86_urgent_for_v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
docs/kernel-parameters: Update descriptions for "mitigations=" param with retbleed
x86/bugs: Do not enable IBPB at firmware entry when IBPB is not available
Revert "x86/sev: Expose sev_es_ghcb_hv_call() for use by HyperV"
x86/configs: Update configs in x86_debug.config
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
- Avoid rwsem lockups in certain situations when handling the handoff
bit
* tag 'locking_urgent_for_v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Allow slowpath writer to ignore handoff bit if not set by first waiter
|
|
Sample reactor that panics the system when an exception is found. This
is useful both to capture a vmcore, or to fail-safe a critical system.
Link: https://lkml.kernel.org/r/729aae3aba95f35738b8f8180e626d747d1d9da2.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
A reactor that printks the reaction message.
Link: https://lkml.kernel.org/r/b65f18a7fd6dc6659a3008fd7b7392de3465d47b.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Per task wakeup while not running (wwnr) monitor.
This model is broken, the reason is that a task can be running in the
processor without being set as RUNNABLE. Think about a task about to
sleep:
1: set_current_state(TASK_UNINTERRUPTIBLE);
2: schedule();
And then imagine an IRQ happening in between the lines one and two,
waking the task up. BOOM, the wakeup will happen while the task is
running.
Q: Why do we need this model, so?
A: To test the reactors.
Link: https://lkml.kernel.org/r/473c0fc39967250fdebcff8b620311c11dccad30.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The wakeup in preemptive (wip) monitor verifies if the
wakeup events always take place with preemption disabled:
|
|
v
#==================#
H preemptive H <+
#==================# |
| |
| preempt_disable | preempt_enable
v |
sched_waking +------------------+ |
+--------------- | | |
| | non_preemptive | |
+--------------> | | -+
+------------------+
The wakeup event always takes place with preemption disabled because
of the scheduler synchronization. However, because the preempt_count
and its trace event are not atomic with regard to interrupts, some
inconsistencies might happen.
The documentation illustrates one of these cases.
Link: https://lkml.kernel.org/r/c98ca678df81115fddc04921b3c79720c836b18f.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
THIS CODE IS NOT LINKED TO THE MAKEFILE.
This model does not compile because it lacks the instrumentation
part, which will be added next. In the typical case, there will be
only one patch, but it was split into two patches for educational
purposes.
This is the direct output this command line:
$ dot2k -d tools/verification/models/wip.dot -t per_cpu
Link: https://lkml.kernel.org/r/5eb7a9118917e8a814c5e49853a72fc62be0a101.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add the runtime-verification.rst document, explaining the basics of RV
and how to use the interface.
Link: https://lkml.kernel.org/r/4be7d1a88ab1e2eb0767521e1ab52a149a154bc4.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
In Linux terms, the runtime verification monitors are encapsulated
inside the "RV monitor" abstraction. The "RV monitor" includes a set
of instances of the monitor (per-cpu monitor, per-task monitor, and
so on), the helper functions that glue the monitor to the system
reference model, and the trace output as a reaction for event parsing
and exceptions, as depicted below:
Linux +----- RV Monitor ----------------------------------+ Formal
Realm | | Realm
+-------------------+ +----------------+ +-----------------+
| Linux kernel | | Monitor | | Reference |
| Tracing | -> | Instance(s) | <- | Model |
| (instrumentation) | | (verification) | | (specification) |
+-------------------+ +----------------+ +-----------------+
| | |
| V |
| +----------+ |
| | Reaction | |
| +--+--+--+-+ |
| | | | |
| | | +-> trace output ? |
+------------------------|--|----------------------+
| +----> panic ?
+-------> <user-specified>
Add the rv/da_monitor.h, enabling automatic code generation for the
*Monitor Instance(s)* using C macros, and code to support it.
The benefits of the usage of macro for monitor synthesis are 3-fold as it:
- Reduces the code duplication;
- Facilitates the bug fix/improvement;
- Avoids the case of developers changing the core of the monitor code
to manipulate the model in a (let's say) non-standard way.
This initial implementation presents three different types of monitor
instances:
- DECLARE_DA_MON_GLOBAL(name, type)
- DECLARE_DA_MON_PER_CPU(name, type)
- DECLARE_DA_MON_PER_TASK(name, type)
The first declares the functions for a global deterministic automata monitor,
the second for monitors with per-cpu instances, and the third with per-task
instances.
Link: https://lkml.kernel.org/r/51b0bf425a281e226dfeba7401d2115d6091f84e.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
A runtime monitor can cause a reaction to the detection of an
exception on the model's execution. By default, the monitors have
tracing reactions, printing the monitor output via tracepoints.
But other reactions can be added (on-demand) via this interface.
The user interface resembles the kernel tracing interface and
presents these files:
"available_reactors"
- Reading shows the available reactors, one per line.
For example:
# cat available_reactors
nop
panic
printk
"reacting_on"
- It is an on/off general switch for reactors, disabling
all reactions.
"monitors/MONITOR/reactors"
- List available reactors, with the select reaction for the given
MONITOR inside []. The default one is the nop (no operation)
reactor.
- Writing the name of a reactor enables it to the given
MONITOR.
For example:
# cat monitors/wip/reactors
[nop]
panic
printk
# echo panic > monitors/wip/reactors
# cat monitors/wip/reactors
nop
[panic]
printk
Link: https://lkml.kernel.org/r/1794eb994637457bdeaa6bad0b8263d2f7eece0c.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
RV is a lightweight (yet rigorous) method that complements classical
exhaustive verification techniques (such as model checking and
theorem proving) with a more practical approach to complex systems.
RV works by analyzing the trace of the system's actual execution,
comparing it against a formal specification of the system behavior.
RV can give precise information on the runtime behavior of the
monitored system while enabling the reaction for unexpected
events, avoiding, for example, the propagation of a failure on
safety-critical systems.
The development of this interface roots in the development of the
paper:
De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo
Silva. Efficient formal verification for the Linux kernel. In:
International Conference on Software Engineering and Formal Methods.
Springer, Cham, 2019. p. 315-332.
And:
De Oliveira, Daniel Bristot. Automata-based formal analysis
and verification of the real-time Linux kernel. PhD Thesis, 2020.
The RV interface resembles the tracing/ interface on purpose. The current
path for the RV interface is /sys/kernel/tracing/rv/.
It presents these files:
"available_monitors"
- List the available monitors, one per line.
For example:
# cat available_monitors
wip
wwnr
"enabled_monitors"
- Lists the enabled monitors, one per line;
- Writing to it enables a given monitor;
- Writing a monitor name with a '!' prefix disables it;
- Truncating the file disables all enabled monitors.
For example:
# cat enabled_monitors
# echo wip > enabled_monitors
# echo wwnr >> enabled_monitors
# cat enabled_monitors
wip
wwnr
# echo '!wip' >> enabled_monitors
# cat enabled_monitors
wwnr
# echo > enabled_monitors
# cat enabled_monitors
#
Note that more than one monitor can be enabled concurrently.
"monitoring_on"
- It is an on/off general switcher for monitoring. Note
that it does not disable enabled monitors or detach events,
but stop the per-entity monitors of monitoring the events
received from the system. It resembles the "tracing_on" switcher.
"monitors/"
Each monitor will have its one directory inside "monitors/". There
the monitor specific files will be presented.
The "monitors/" directory resembles the "events" directory on
tracefs.
For example:
# cd monitors/wip/
# ls
desc enable
# cat desc
wakeup in preemptive per-cpu testing monitor.
# cat enable
0
For further information, see the comments in the header of
kernel/trace/rv/rv.c from this patch.
Link: https://lkml.kernel.org/r/a4bfe038f50cb047bfb343ad0e12b0e646ab308b.1659052063.git.bristot@kernel.org
Cc: Wim Van Sebroeck <wim@linux-watchdog.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Clark Williams <williams@redhat.com>
Cc: Tao Zhou <tao.zhou@linux.dev>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-trace-devel@vger.kernel.org
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
first waiter
With commit d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more
consistent"), the writer that sets the handoff bit can be interrupted
out without clearing the bit if the wait queue isn't empty. This disables
reader and writer optimistic lock spinning and stealing.
Now if a non-first writer in the queue is somehow woken up or a new
waiter enters the slowpath, it can't acquire the lock. This is not the
case before commit d257cc8cb8d5 as the writer that set the handoff bit
will clear it when exiting out via the out_nolock path. This is less
efficient as the busy rwsem stays in an unlock state for a longer time.
In some cases, this new behavior may cause lockups as shown in [1] and
[2].
This patch allows a non-first writer to ignore the handoff bit if it
is not originally set or initiated by the first waiter. This patch is
shown to be effective in fixing the lockup problem reported in [1].
[1] https://lore.kernel.org/lkml/20220617134325.GC30825@techsingularity.net/
[2] https://lore.kernel.org/lkml/3f02975c-1a9d-be20-32cf-f1d8e3dfafcc@oracle.com/
Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Donnelly <john.p.donnelly@oracle.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20220622200419.778799-1-longman@redhat.com
|
|
Andrii Nakryiko says:
====================
bpf-next 2022-07-29
We've added 22 non-merge commits during the last 4 day(s) which contain
a total of 27 files changed, 763 insertions(+), 120 deletions(-).
The main changes are:
1) Fixes to allow setting any source IP with bpf_skb_set_tunnel_key() helper,
from Paul Chaignon.
2) Fix for bpf_xdp_pointer() helper when doing sanity checking, from Joanne Koong.
3) Fix for XDP frame length calculation, from Lorenzo Bianconi.
4) Libbpf BPF_KSYSCALL docs improvements and fixes to selftests to accommodate
s390x quirks with socketcall(), from Ilya Leoshkevich.
5) Allow/denylist and CI configs additions to selftests/bpf to improve BPF CI,
from Daniel Müller.
6) BPF trampoline + ftrace follow up fixes, from Song Liu and Xu Kuohai.
7) Fix allocation warnings in netdevsim, from Jakub Kicinski.
8) bpf_obj_get_opts() libbpf API allowing to provide file flags, from Joe Burton.
9) vsnprintf usage fix in bpf_snprintf_btf(), from Fedor Tokarev.
10) Various small fixes and clean ups, from Daniel Müller, Rongguang Wei,
Jörn-Thorben Hinz, Yang Li.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (22 commits)
bpf: Remove unneeded semicolon
libbpf: Add bpf_obj_get_opts()
netdevsim: Avoid allocation warnings triggered from user space
bpf: Fix NULL pointer dereference when registering bpf trampoline
bpf: Fix test_progs -j error with fentry/fexit tests
selftests/bpf: Bump internal send_signal/send_signal_tracepoint timeout
bpftool: Don't try to return value from void function in skeleton
bpftool: Replace sizeof(arr)/sizeof(arr[0]) with ARRAY_SIZE macro
bpf: btf: Fix vsnprintf return value check
libbpf: Support PPC in arch_specific_syscall_pfx
selftests/bpf: Adjust vmtest.sh to use local kernel configuration
selftests/bpf: Copy over libbpf configs
selftests/bpf: Sort configuration
selftests/bpf: Attach to socketcall() in test_probe_user
libbpf: Extend BPF_KSYSCALL documentation
bpf, devmap: Compute proper xdp_frame len redirecting frames
bpf: Fix bpf_xdp_pointer return pointer
selftests/bpf: Don't assign outer source IP to host
bpf: Set flow flag to allow any source IP in bpf_tunnel_key
geneve: Use ip_tunnel_key flow flags in route lookups
...
====================
Link: https://lore.kernel.org/r/20220729230948.1313527-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The setup_profiling_timer() is mostly un-implemented by many
architectures. In many places it isn't guarded by CONFIG_PROFILE which is
needed for it to be used. Make it a weak symbol in kernel/profile.c and
remove the 'return -EINVAL' implementations from the kenrel.
There are a couple of architectures which do return 0 from the
setup_profiling_timer() function but they don't seem to do anything else
with it. To keep the /proc compatibility for now, leave these for a
future update or removal.
On ARM, this fixes the following sparse warning:
arch/arm/kernel/smp.c:793:5: warning: symbol 'setup_profiling_timer' was not declared. Should it be static?
Link: https://lkml.kernel.org/r/20220721195509.418205-1-ben-linux@fluff.org
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The proc_dohung_task_timeout_secs() function is incorrectly marked
as having a __user buffer as argument 3. However this is not the
case and it is casing multiple sparse warnings. Fix the following
warnings by removing __user from the argument:
kernel/hung_task.c:237:52: warning: incorrect type in argument 3 (different address spaces)
kernel/hung_task.c:237:52: expected void *
kernel/hung_task.c:237:52: got void [noderef] __user *buffer
kernel/hung_task.c:287:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
kernel/hung_task.c:287:35: expected int ( [usertype] *proc_handler )( ... )
kernel/hung_task.c:287:35: got int ( * )( ... )
kernel/hung_task.c:295:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
kernel/hung_task.c:295:35: expected int ( [usertype] *proc_handler )( ... )
kernel/hung_task.c:295:35: got int ( * )( ... )
Link: https://lkml.kernel.org/r/20220714074744.189017-1-ben.dooks@sifive.com
Signed-off-by: Ben Dooks <ben.dooks@sifive.com>
Cc: <Conor.Dooley@microchip.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Eliminate the following coccicheck warning:
/kernel/bpf/trampoline.c:101:2-3: Unneeded semicolon
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220725222733.55613-1-yang.lee@linux.alibaba.com
|
|
A panic was reported on arm64:
[ 44.517109] audit: type=1334 audit(1658859870.268:59): prog-id=19 op=LOAD
[ 44.622031] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000010
[ 44.624321] Mem abort info:
[ 44.625049] ESR = 0x0000000096000004
[ 44.625935] EC = 0x25: DABT (current EL), IL = 32 bits
[ 44.627182] SET = 0, FnV = 0
[ 44.627930] EA = 0, S1PTW = 0
[ 44.628684] FSC = 0x04: level 0 translation fault
[ 44.629788] Data abort info:
[ 44.630474] ISV = 0, ISS = 0x00000004
[ 44.631362] CM = 0, WnR = 0
[ 44.632041] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000100ab5000
[ 44.633494] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
[ 44.635202] Internal error: Oops: 96000004 [#1] SMP
[ 44.636452] Modules linked in: xfs crct10dif_ce ghash_ce virtio_blk
virtio_console virtio_mmio qemu_fw_cfg
[ 44.638713] CPU: 2 PID: 1 Comm: systemd Not tainted 5.19.0-rc7 #1
[ 44.640164] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 44.641799] pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 44.643404] pc : ftrace_set_filter_ip+0x24/0xa0
[ 44.644659] lr : bpf_trampoline_update.constprop.0+0x428/0x4a0
[ 44.646118] sp : ffff80000803b9f0
[ 44.646950] x29: ffff80000803b9f0 x28: ffff0b5d80364400 x27: ffff80000803bb48
[ 44.648721] x26: ffff8000085ad000 x25: ffff0b5d809d2400 x24: 0000000000000000
[ 44.650493] x23: 00000000ffffffed x22: ffff0b5dd7ea0900 x21: 0000000000000000
[ 44.652279] x20: 0000000000000000 x19: 0000000000000000 x18: ffffffffffffffff
[ 44.654067] x17: 0000000000000000 x16: 0000000000000000 x15: ffffffffffffffff
[ 44.655787] x14: ffff0b5d809d2498 x13: ffff0b5d809d2432 x12: 0000000005f5e100
[ 44.657535] x11: abcc77118461cefd x10: 000000000000005f x9 : ffffa7219cb5b190
[ 44.659254] x8 : ffffa7219c8e0000 x7 : 0000000000000000 x6 : ffffa7219db075e0
[ 44.661066] x5 : ffffa7219d3130e0 x4 : ffffa7219cab9da0 x3 : 0000000000000000
[ 44.662837] x2 : 0000000000000000 x1 : ffffa7219cb7a5c0 x0 : 0000000000000000
[ 44.664675] Call trace:
[ 44.665274] ftrace_set_filter_ip+0x24/0xa0
[ 44.666327] bpf_trampoline_update.constprop.0+0x428/0x4a0
[ 44.667696] __bpf_trampoline_link_prog+0xcc/0x1c0
[ 44.668834] bpf_trampoline_link_prog+0x40/0x64
[ 44.669919] bpf_tracing_prog_attach+0x120/0x490
[ 44.671011] link_create+0xe0/0x2b0
[ 44.671869] __sys_bpf+0x484/0xd30
[ 44.672706] __arm64_sys_bpf+0x30/0x40
[ 44.673678] invoke_syscall+0x78/0x100
[ 44.674623] el0_svc_common.constprop.0+0x4c/0xf4
[ 44.675783] do_el0_svc+0x38/0x4c
[ 44.676624] el0_svc+0x34/0x100
[ 44.677429] el0t_64_sync_handler+0x11c/0x150
[ 44.678532] el0t_64_sync+0x190/0x194
[ 44.679439] Code: 2a0203f4 f90013f5 2a0303f5 f9001fe1 (f9400800)
[ 44.680959] ---[ end trace 0000000000000000 ]---
[ 44.682111] Kernel panic - not syncing: Oops: Fatal exception
[ 44.683488] SMP: stopping secondary CPUs
[ 44.684551] Kernel Offset: 0x2721948e0000 from 0xffff800008000000
[ 44.686095] PHYS_OFFSET: 0xfffff4a380000000
[ 44.687144] CPU features: 0x010,00022811,19001080
[ 44.688308] Memory Limit: none
[ 44.689082] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
It's caused by a NULL tr->fops passed to ftrace_set_filter_ip(). tr->fops
is initialized to NULL and is assigned to an allocated memory address if
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is enabled. Since there is no
direct call on arm64 yet, the config can't be enabled.
To fix it, call ftrace_set_filter_ip() only if tr->fops is not NULL.
Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)")
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Bruno Goncalves <bgoncalv@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20220728114048.3540461-1-xukuohai@huaweicloud.com
|
|
When multiple threads are attaching/detaching fentry/fexit programs to
the same trampoline, we may call register_fentry on the same trampoline
twice: register_fentry(), unregister_fentry(), then register_fentry again.
This causes ftrace_set_filter_ip() for the same ip on tr->fops twice,
which leaves duplicated ip in tr->fops. The extra ip is not cleaned up
properly on unregister and thus causes failures with further register in
register_ftrace_direct_multi():
register_ftrace_direct_multi()
{
...
for (i = 0; i < size; i++) {
hlist_for_each_entry(entry, &hash->buckets[i], hlist) {
if (ftrace_find_rec_direct(entry->ip))
goto out_unlock;
}
}
...
}
This can be triggered with parallel fentry/fexit tests with test_progs:
./test_progs -t fentry,fexit -j
Fix this by resetting tr->fops in ftrace_set_filter_ip(), so that there
will never be duplicated entries in tr->fops.
Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)")
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220729194106.1207472-1-song@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"Just one commit to suppress a spurious warning added during the 5.19
cycle"
* tag 'wq-for-5.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Avoid a false warning in unbind_workers()
|
|
Doing set_cpus_allowed_ptr() with wq_unbound_cpumask can be possible
fails and trigger the false warning.
Use cpu_possible_mask instead when wq_unbound_cpumask has no active CPUs.
It is very easy to trigger the warning:
Set wq_unbound_cpumask to a small set of CPUs.
Offline all the CPUs of wq_unbound_cpumask.
Offline an extra CPU and trigger the warning.
Fixes: 10a5a651e3af ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Merge devfreq changes, PM QoS change, and power management tools and
documentation changes for v5.20-rc1:
- Add new devfreq driver for Mediatek CCI (Cache Coherent
Interconnect) (Johnson Wang).
- Convert the Samsung Exynos SoC Bus bindings to DT schema of
exynos-bus.c (Krzysztof Kozlowski).
- Address kernel-doc warnings by adding the description for unused
fucntion parameters in devfreq core (Mauro Carvalho Chehab).
- Use NULL to pass a null pointer rather than zero according to the
function propotype in imx-bus.c (Colin Ian King).
- Print error message instead of error interger value in
tegra30-devfreq.c (Dmitry Osipenko).
- Add checks to prevent setting negative frequency QoS limits for
CPUs (Shivnandan Kumar).
- Update the pm-graph suite of utilities to the latest revision 5.9
including multiple improvements (Todd Brandt).
- Drop pme_interrupt reference from the PCI power management
documentation (Mario Limonciello).
* pm-devfreq:
PM / devfreq: tegra30: Add error message for devm_devfreq_add_device()
PM / devfreq: imx-bus: use NULL to pass a null pointer rather than zero
PM / devfreq: shut up kernel-doc warnings
dt-bindings: interconnect: samsung,exynos-bus: convert to dtschema
PM / devfreq: mediatek: Introduce MediaTek CCI devfreq driver
dt-bindings: interconnect: Add MediaTek CCI dt-bindings
* pm-qos:
PM: QoS: Add check to make sure CPU freq is non-negative
* pm-tools:
pm-graph v5.9
* pm-docs:
Documentation: PM: Drop pme_interrupt reference
|
|
Merge core device power management changes for v5.20-rc1:
- Extend support for wakeirq to callback wrappers used during system
suspend and resume (Ulf Hansson).
- Defer waiting for device probe before loading a hibernation image
till the first actual device access to avoid possible deadlocks
reported by syzbot (Tetsuo Handa).
- Unify device_init_wakeup() for PM_SLEEP and !PM_SLEEP (Bjorn
Helgaas).
- Add Raptor Lake-P to the list of processors supported by the Intel
RAPL driver (George D Sworo).
- Add Alder Lake-N and Raptor Lake-P to the list of processors for
which Power Limit4 is supported in the Intel RAPL driver (Sumeet
Pawnikar).
- Make pm_genpd_remove() check genpd_debugfs_dir against NULL before
attempting to remove it (Hsin-Yi Wang).
- Change the Energy Model code to represent power in micro-Watts and
adjust its users accordingly (Lukasz Luba).
* pm-core:
PM: runtime: Extend support for wakeirq for force_suspend|resume
* pm-sleep:
PM: hibernate: defer device probing when resuming from hibernation
PM: wakeup: Unify device_init_wakeup() for PM_SLEEP and !PM_SLEEP
* powercap:
powercap: RAPL: Add Power Limit4 support for Alder Lake-N and Raptor Lake-P
powercap: intel_rapl: Add support for RAPTORLAKE_P
* pm-domains:
PM: domains: Ensure genpd_debugfs_dir exists before remove
* pm-em:
cpufreq: scmi: Support the power scale in micro-Watts in SCMI v3.1
firmware: arm_scmi: Get detailed power scale from perf
Documentation: EM: Switch to micro-Watts scale
PM: EM: convert power field to micro-Watts precision and align drivers
|
|
vsnprintf returns the number of characters which would have been written if
enough space had been available, excluding the terminating null byte. Thus,
the return value of 'len_left' means that the last character has been
dropped.
Signed-off-by: Fedor Tokarev <ftokarev@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/bpf/20220711211317.GA1143610@laptop
|
|
No conflicts.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The cgroup_update_dfl_csses() function updates css associations when a
cgroup's subtree_control file is modified. Any changes made to a cgroup's
subtree_control file, however, will only affect its descendants but not
the cgroup itself. So there is no point in migrating csses associated
with that cgroup. We can skip them instead.
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
If a watch is being added to a queue, it needs to guard against
interference from addition of a new watch, manual removal of a watch and
removal of a watch due to some other queue being destroyed.
KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by
holding the key->sem writelocked and by holding refs on both the key and
the queue - but that doesn't prevent interaction from other {key,queue}
pairs.
While add_watch_to_object() does take the spinlock on the event queue,
it doesn't take the lock on the source's watch list. The assumption was
that the caller would prevent that (say by taking key->sem) - but that
doesn't prevent interference from the destruction of another queue.
Fix this by locking the watcher list in add_watch_to_object().
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: keyrings@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since __post_watch_notification() walks wlist->watchers with only the
RCU read lock held, we need to use RCU methods to add to the list (we
already use RCU methods to remove from the list).
Fix add_watch_to_object() to use hlist_add_head_rcu() instead of
hlist_add_head() for that list.
Fixes: c73be61cede5 ("pipe: Add general notification queue support")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Debugfs node will be run-timely checked and so local variable
should be not passed to debugfs_create_ulong(). Fix it via
debugfs_create_file() to create io_tlb_used node and calculate
used io tlb number with fops_io_tlb_used attribute.
Fixes: 20347fca71a3 ("swiotlb: split up the global swiotlb lock")
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
make html doc reports a cryptic warning with the commit named below:
kernel/dma/mapping.c:258: WARNING: Option list ends without a blank
line; unexpected unindent.
Seems the parser is a bit fussy about the tabbing and having a single
space tab causes the warning. To suppress the warning add another
tab to the list and reindent everything.
Fixes: 7c2645a2a30a ("dma-mapping: allow EREMOTEIO return code for P2PDMA transfers")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip/genirq updates from Marc Zyngier:
* Core code update:
- Non-SMP IRQ affinity fixes, allowing UP kernel to behave similarly
to SMP ones for the purpose of interrupt affinity
- Let irq_set_chip_handler_name_locked() take a const struct irq_chip *
- Tidy-up the NOMAP irqdomain API variant
- Teach action_show() to use for_each_action_of_desc()
- Make irq_chip_request_resources_parent() allow the parent callback
to be optional
- Remove dynamic allocations from populate_parent_alloc_arg()
* New drivers:
- Merge the long awaited IRQ support for the LoongArch architecture,
with the provisional ACPICA update (to be reverted once the official
support lands)
- New Renesas RZ/G2L IRQC driver, equipped with its companion GPIO
driver
* Driver updates
- Optimise the hot path operations for the SiFive PLIC, trading the
locking for per-CPU priority masking masking operations which are
apparently faster
- Work around broken PLIC implementations that deal pretty badly with
edge-triggered interrupts. Flag two implementations as affected.
- Simplify the irq-stm32-exti driver, particularly the table that
remaps the interrupts from exti to the GIC, reducing the memory usage
- Convert the ocelot irq_chip to being immutable
- Check ioremap() return value in the MIPS GIC driver
- Move MMP driver init function declarations into the common .h
- The obligatory typo fixes
Link: https://lore.kernel.org/all/20220727192356.1860546-1-maz@kernel.org
|
|
30312730bd02 ("cgroup: Add "no" prefixed mount options") added "no" prefixed
mount options to allow turning them off and 6a010a49b63a ("cgroup: Make
!percpu threadgroup_rwsem operations optional") added one more "no" prefixed
mount option. However, Michal pointed out that the "no" prefixed options
aren't necessary in allowing mount options to be turned off:
# grep group /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,relatime,nsdelegate,memory_recursiveprot 0 0
# mount -o remount,nsdelegate,memory_recursiveprot none /sys/fs/cgroup
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,relatime,nsdelegate,memory_recursiveprot 0 0
Note that this is different from the remount behavior when the mount(1) is
invoked without the device argument - "none":
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
# mount -o remount,nsdelegate,memory_recursiveprot /sys/fs/cgroup
# grep cgroup /proc/mounts
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
While a bit confusing, given that there is a way to turn off the options,
there's no reason to have the explicit "no" prefixed options. Let's remove
them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Commit
4675ff05de2d ("kmemcheck: rip it out")
removed kmemcheck and its corresponding build config KMEMCHECK.
Commit
0f620cefd775 ("objtool: Rename "VMLINUX_VALIDATION" -> "NOINSTR_VALIDATION"")
renamed the debug config option.
Adjust x86_debug.config to those changes in debug configs.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20220722121815.27535-1-lukas.bulwahn@gmail.com
|
|
In some circumstances it may be interesting to reconfigure the watchdog
from inside the kernel.
On PowerPC, this may helpful before and after a LPAR migration (LPM) is
initiated, because it implies some latencies, watchdog, and especially NMI
watchdog is expected to be triggered during this operation. Reconfiguring
the watchdog with a factor, would prevent it to happen too frequently
during LPM.
Rename lockup_detector_reconfigure() as __lockup_detector_reconfigure() and
create a new function lockup_detector_reconfigure() calling
__lockup_detector_reconfigure() under the protection of watchdog_mutex.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
[mpe: Squash in build fix from Laurent, reported by Sachin]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220713154729.80789-3-ldufour@linux.ibm.com
|
|
CPU frequency should never be negative.
If some client driver calls freq_qos_update_request with a
negative value which will be very high in absolute terms,
then frequency QoS sets max CPU freq at fmax as it considers
it's absolute value but it will add plist node with negative
priority.
plist node has priority from INT_MIN (highest) to INT_MAX(lowest).
Once priority is set as negative, another client will not be able
to reduce CPU frequency.
Adding check to make sure CPU freq is non-negative will fix
this problem.
Signed-off-by: Shivnandan Kumar <quic_kshivnan@quicinc.com>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|