|
Hook alloc_workqueue() and alloc_workqueue_attrs() so that they're
accounted to the callsite. Since we're doing allocations on behalf of
another subsystem, this helps when using memory allocation profiling to
check for leaks.
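For reference, a minimal sketch of the usual memory allocation profiling
hook pattern; the _noprof names and exact parameter lists below are
assumptions for illustration, not the actual hunk:

  /* The implementation is renamed to a *_noprof() variant ... */
  struct workqueue_struct *
  alloc_workqueue_noprof(const char *fmt, unsigned int flags, int max_active, ...);

  /* ... and the public name becomes an alloc_hooks() wrapper, so the
   * allocation tag is charged to the calling subsystem's callsite. */
  #define alloc_workqueue(fmt, flags, max_active, args...) \
          alloc_hooks(alloc_workqueue_noprof(fmt, flags, max_active, ##args))

  #define alloc_workqueue_attrs() \
          alloc_hooks(alloc_workqueue_attrs_noprof())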
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Prevent perf_sigtrap() from observing an exiting task and warning
about it
* tag 'perf_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix WARN in perf_sigtrap()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"19 hotfixes. A whopping 16 are cc:stable and the remainder address
post-6.15 issues or aren't considered necessary for -stable kernels.
14 are for MM. Three gdb-script fixes and a kallsyms build fix"
* tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "sched/numa: add statistics of numa balance task"
mm: fix the inaccurate memory statistics issue for users
mm/damon: fix divide by zero in damon_get_intervals_score()
samples/damon: fix damon sample mtier for start failure
samples/damon: fix damon sample wsse for start failure
samples/damon: fix damon sample prcl for start failure
kasan: remove kasan_find_vm_area() to prevent possible deadlock
scripts: gdb: vfs: support external dentry names
mm/migrate: fix do_pages_stat in compat mode
mm/damon/core: handle damon_call_control as normal under kdmond deactivation
mm/rmap: fix potential out-of-bounds page table access during batched unmap
mm/hugetlb: don't crash when allocating a folio if there are no resv
scripts/gdb: de-reference per-CPU MCE interrupts
scripts/gdb: fix interrupts.py after maple tree conversion
maple_tree: fix mt_destroy_walk() on root leaf node
mm/vmalloc: leave lazy MMU mode on PTE mapping error
scripts/gdb: fix interrupts display after MCP on x86
lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users()
kallsyms: fix build without execinfo
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fix from Marek Szyprowski:
- small fix relevant to arm64 server and custom CMA configuration (Feng
Tang)
* tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-contiguous: hornor the cma address limit setup by user
|
|
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382.
This commit introduced per-memcg/task NUMA balance statistics, but
unfortunately it also introduced a NULL pointer exception due to the
following race condition: after a swap task candidate was chosen, its
mm_struct pointer was set to NULL due to task exit. Later, when
performing the actual task swapping, dereferencing p->mm caused the
problem.
CPU0                                     CPU1
:
...
task_numa_migrate
    task_numa_find_cpu
        task_numa_compare
            # a normal task p is chosen
            env->best_task = p

                                         # p exit:
                                         exit_signals(p);
                                             p->flags |= PF_EXITING
                                         exit_mm
                                             p->mm = NULL;

migrate_swap_stop
    __migrate_swap_task((arg->src_task, arg->dst_cpu)
        count_memcg_event_mm(p->mm, NUMA_TASK_SWAP) # p->mm is NULL
task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening. After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations. Revert
the change and rely on the tracepoint for this purpose.
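A hypothetical sketch of the guarded access discussed above, i.e. the
locking that was judged not worth taking for a statistics counter:

  /* Hypothetical sketch only, not the reverted code: pin the task and
   * skip it once PF_EXITING is set, so p->mm cannot become NULL here. */
  static void count_numa_task_swap(struct task_struct *p)
  {
          task_lock(p);
          if (!(p->flags & PF_EXITING) && p->mm)
                  count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
          task_unlock(p);
  }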
Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Since exit_task_work() runs after perf_event_exit_task_context() updated
ctx->task to TASK_TOMBSTONE, perf_sigtrap() from perf_pending_task() might
observe event->ctx->task == TASK_TOMBSTONE.
Swap the early exit tests in order not to hit WARN_ON_ONCE().
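A sketch of the reordered early-exit tests, assuming the current shape of
perf_sigtrap() (not the verbatim hunk):

  static void perf_sigtrap(struct perf_event *event)
  {
          /* Check the exiting task first: once it passed
           * perf_event_exit_task_context(), ctx->task is TASK_TOMBSTONE
           * and must not trip the warning below. */
          if (current->flags & PF_EXITING)
                  return;

          if (WARN_ON_ONCE(event->ctx->task != current))
                  return;

          send_sig_perf((void __user *)event->pending_addr,
                        event->attr.type, event->attr.sig_data);
  }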
Closes: https://syzkaller.appspot.com/bug?extid=2fe61cb2a86066be6985
Reported-by: syzbot <syzbot+2fe61cb2a86066be6985@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/b1c224bd-97f9-462c-a3e3-125d5e19c983@I-love.SAKURA.ne.jp
|
|
The per-CPU data section is handled differently than the other sections.
The memory allocation requires a special __percpu pointer and then the
section is copied into the view of each CPU. Therefore the SHF_ALLOC
flag is removed to ensure move_module() skips it.
Later, relocations are applied and apply_relocations() skips sections
without SHF_ALLOC because they have not been copied. This also skips the
per-CPU data section.
The missing relocations result in a NULL pointer on x86-64 and in very
small values on x86-32. This leads to a crash because, unlike a NULL
pointer, such a value is not skipped and cannot be dereferenced.
Such an assignment happens during static per-CPU lock initialisation
with lockdep enabled.
Allow relocation processing for the per-CPU section even if SHF_ALLOC is
missing.
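A sketch of the relaxed check in the relocation loop, assuming
info->index.pcpu identifies the per-CPU section (not the verbatim hunk):

  for (i = 1; i < info->hdr->e_shnum; i++) {
          unsigned int infosec = info->sechdrs[i].sh_info;

          /* Not a valid relocation section? */
          if (infosec >= info->hdr->e_shnum)
                  continue;

          /* Don't bother with non-allocated sections, except for the
           * per-CPU section: it lost SHF_ALLOC only because it is copied
           * separately, yet its relocations still must be applied. */
          if (!(info->sechdrs[infosec].sh_flags & SHF_ALLOC) &&
              infosec != info->index.pcpu)
                  continue;

          /* apply the relocations as before */
  }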
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202506041623.e45e4f7d-lkp@intel.com
Fixes: 1a6100caae425 ("Don't relocate non-allocated regions in modules.") #v2.6.1-rc3
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Link: https://lore.kernel.org/r/20250610163328.URcsSUC1@linutronix.de
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Message-ID: <20250610163328.URcsSUC1@linutronix.de>
|
|
All error conditions in move_module() set the return value by updating the
ret variable. Therefore, it is not necessary to initialize the variable
when declaring it.
Remove the unnecessary initialization.
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://lore.kernel.org/r/20250618122730.51324-3-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Message-ID: <20250618122730.51324-3-petr.pavlu@suse.com>
|
|
The function move_module() uses the variable t to track how many memory
types it has allocated and consequently how many should be freed if an
error occurs.
The variable is initially set to 0 and is updated when a call to
module_memory_alloc() fails. However, move_module() can fail for other
reasons as well, in which case t remains set to 0 and no memory is freed.
Fix the problem by initializing t to MOD_MEM_NUM_TYPES. Additionally, make
the deallocation loop more robust by not relying on the mod_mem_type_t enum
having a signed integer as its underlying type.
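A sketch of the resulting allocation/unwind flow, assuming
module_memory_free(mod, type) is the per-type free helper:

  enum mod_mem_type t = MOD_MEM_NUM_TYPES;   /* free everything by default */

  for_each_mod_mem_type(type) {
          ret = module_memory_alloc(mod, type);
          if (ret) {
                  t = type;                  /* only free what was allocated */
                  goto out_err;
          }
  }
  /* other failure paths jump to out_err with t left at MOD_MEM_NUM_TYPES */

  out_err:
          while (t--)                        /* no reliance on a signed enum */
                  module_memory_free(mod, t);
          return ret;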
Fixes: c7ee8aebf6c0 ("module: add stop-grap sanity check on module memcpy()")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://lore.kernel.org/r/20250618122730.51324-2-petr.pavlu@suse.com
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Message-ID: <20250618122730.51324-2-petr.pavlu@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix the calculation of the deadline server task's runtime as this
mishap was preventing realtime tasks from running
- Avoid a race condition during migrate-swapping two tasks
- Fix the string reported for the "none" dynamic preemption option
* tag 'sched_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix dl_server runtime calculation formula
sched/core: Fix migrate_swap() vs. hotplug
sched: Fix preemption string of preempt_dynamic_none
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Revert uprobes to using CAP_SYS_ADMIN again as currently they can
destructively modify kernel code from an unprivileged process
- Move a warning to where it belongs
* tag 'perf_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Revert to requiring CAP_SYS_ADMIN for uprobes
perf/core: Fix the WARN_ON_ONCE is out of lock protected region
|
|
Merge fixes related to system sleep for 6.16-rc5:
- Fix typo in the ABI documentation (Sumanth Gavini).
- Allow swap to be used a bit longer during system suspend and
hibernation to avoid suspend failures under memory pressure (Mario
Limonciello).
* pm-sleep:
PM: sleep: docs: Replace "diasble" with "disable"
PM: Restrict swap use to later in the suspend sequence
|
|
In our testing with a 6.12 based kernel on a big.LITTLE system, we were
seeing instances of RT tasks being blocked from running on the LITTLE
cpus for multiple seconds at a time, apparently by the dl_server. This
far exceeds the default configured 50ms per second runtime.
This is due to the fair dl_server runtime calculation being scaled
for frequency & capacity of the cpu.
Consider the following case under a Big.LITTLE architecture:
Assume the runtime is 50,000,000 ns, and the frequency/capacity
scale-invariance factors are defined as below:
  Frequency scale-invariance: 100
  Capacity scale-invariance: 50
First, by frequency scale-invariance, the runtime is scaled to
  50,000,000 * 100 >> 10 = 4,882,812
Then, by capacity scale-invariance, it is further scaled to
  4,882,812 * 50 >> 10 = 238,418
So the 50 ms of runtime is accounted as only 238,418 ns.
This smaller "accounted runtime" value is what ends up being
subtracted against the fair-server's runtime for the current period.
Thus after 50ms of real time, we've only accounted ~238us against the
fair server's runtime. This 209:1 ratio in this example means that on
the smaller cpu the fair server is allowed to continue running,
blocking RT tasks, for over 10 seconds before it exhausts its supposed
50ms of runtime. And on other hardware configurations it can be even
worse.
For the fair deadline_server, to prevent realtime tasks from being
unexpectedly delayed, we really do want to use fixed time, and not
scaled time for smaller capacity/frequency cpus. So remove the scaling
from the fair server's accounting to fix this.
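A standalone sketch of the scaling arithmetic from the example above; the
scale factors are the made-up example values (out of 1024, i.e.
SCHED_CAPACITY_SCALE):

  #include <stdio.h>

  #define SCHED_CAPACITY_SHIFT 10

  int main(void)
  {
          unsigned long long delta_exec = 50000000ULL; /* 50 ms in ns */
          unsigned long long freq_scale = 100;         /* out of 1024 */
          unsigned long long cap_scale  = 50;          /* out of 1024 */

          delta_exec = (delta_exec * freq_scale) >> SCHED_CAPACITY_SHIFT;
          delta_exec = (delta_exec * cap_scale)  >> SCHED_CAPACITY_SHIFT;

          /* prints 238418: ~238us accounted for 50ms of real runtime */
          printf("accounted runtime: %llu ns\n", delta_exec);
          return 0;
  }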
Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: John Stultz <jstultz@google.com>
Signed-off-by: kuyo chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Acked-by: John Stultz <jstultz@google.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20250702021440.2594736-1-kuyo.chang@mediatek.com
|
|
Jann reports that uprobes can be used destructively when used in the
middle of an instruction. The kernel only verifies there is a valid
instruction at the requested offset, but due to variable instruction
length cannot determine if this is an instruction as seen by the
intended execution stream.
Additionally, Mark Rutland notes that on architectures that mix data
in the text segment (like arm64), a similar thing can be done if the
data word is 'mistaken' for an instruction.
As such, require CAP_SYS_ADMIN for uprobes.
Fixes: c9e0924e5c2b ("perf/core: open access to probes for CAP_PERFMON privileged process")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/CAG48ez1n4520sq0XrWYDHKiKxE_+WCfAK+qt9qkY4ZiBGmL-5g@mail.gmail.com
|
|
On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote:
> So, the potential race scenario is:
>
>      CPU0                                        CPU1
>
> // doing migrate_swap(cpu0/cpu1)
> stop_two_cpus()
>                                                  ...
>                                                  // doing _cpu_down()
>                                                  sched_cpu_deactivate()
>                                                    set_cpu_active(cpu, false);
>                                                    balance_push_set(cpu, true);
> cpu_stop_queue_two_works
>   __cpu_stop_queue_work(stopper1,...);
>   __cpu_stop_queue_work(stopper2,..);
>   stop_cpus_in_progress -> true
>   preempt_enable();
>                                                  ...
>                                                  1st balance_push
>                                                    stop_one_cpu_nowait
>                                                      cpu_stop_queue_work
>                                                        __cpu_stop_queue_work
>                                                          list_add_tail -> 1st add push_work
>                                                          wake_up_q(&wakeq); -> "wakeq is empty.
>                                                            This implies that the stopper is at wakeq@migrate_swap."
> preempt_disable
> wake_up_q(&wakeq);
>   wake_up_process // wakeup migrate/0
>     try_to_wake_up
>       ttwu_queue
>         ttwu_queue_cond -> meet below case
>           if (cpu == smp_processor_id())
>             return false;
>         ttwu_do_activate
>         // migrate/0 wakeup done
>   wake_up_process // wakeup migrate/1
>     try_to_wake_up
>       ttwu_queue
>         ttwu_queue_cond
>           ttwu_queue_wakelist
>             __ttwu_queue_wakelist
>               __smp_call_single_queue
> preempt_enable();
>
>                                                  2nd balance_push
>                                                    stop_one_cpu_nowait
>                                                      cpu_stop_queue_work
>                                                        __cpu_stop_queue_work
>                                                          list_add_tail -> 2nd add push_work, so the double list add is detected
>                                                  ...
>                                                  ...
>                                                  cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1
>
So this balance_push() is part of schedule(), and schedule() is supposed
to switch to the stopper task, but because of this race condition, the
stopper task is stuck in the WAKING state and not actually visible to be
picked. Therefore CPU1 can do another schedule() and end up doing another
balance_push() even though the last one hasn't been done yet.
This is a confluence of fail, where both wake_q and ttwu_wakelist can
cause crucial wakeups to be delayed, resulting in the malfunction of
balance_push.
Since there is only a single stopper thread to be woken, the wake_q
doesn't really add anything here, and can be removed in favour of
direct wakeups of the stopper thread.
Then add a clause to ttwu_queue_cond() to ensure the stopper threads
are never queued / delayed.
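A sketch of the kind of clause described, assuming the stopper is
identified via its scheduling class (not the verbatim hunk):

  static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
  {
          /* The stopper task must run as soon as possible on its own CPU;
           * never defer its wakeup through the remote wakelist. */
          if (p->sched_class == &stop_sched_class)
                  return false;

          /* ... existing conditions ... */
          return true;
  }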
Of all 3 moving parts, the last addition was the balance_push()
machinery, so pick that as the point the bug was introduced.
Fixes: 2558aacff858 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug")
Reported-by: Kuyo Chang <kuyo.chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Kuyo Chang <kuyo.chang@mediatek.com>
Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net
|
|
Zero is a valid value for "preempt_dynamic_mode", namely
"preempt_dynamic_none".
Fix the off-by-one in preempt_model_str(), so that "preempt_dynamic_none"
is correctly formatted as PREEMPT(none) instead of PREEMPT(undef).
Fixes: 8bdc5daaa01e ("sched: Add a generic function to return the preemption string")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20250626-preempt-str-none-v2-1-526213b70a89@linutronix.de
|
|
commit 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()") tried
to fix a concurrency problem between perf_cgroup_switch and
perf_cgroup_event_disable. But it did not move the WARN_ON_ONCE into the
lock-protected region, so the warning can still be triggered.
Fixes: 3172fb986666 ("perf/core: Fix WARN in perf_cgroup_switch()")
Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250626135403.2454105-1-luogengkun@huaweicloud.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Make sure an AUX perf event is really disabled when it overruns
* tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/aux: Fix pending disable flow when the AUX ring buffer overruns
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- Fix possible UAF on error path in filter_free_subsystem_filters()
When freeing a subsystem filter, the filter for the subsystem is
passed in to be freed and all the events within the subsystem will
have their filter freed too. In order to free without waiting for RCU
synchronization, list items are allocated to hold what is going to be
freed to free it via a call_rcu(). If the allocation of these items
fails, it will call the synchronization directly and free after that
(causing a bit of delay for the user).
The subsystem filter is first added to this list and then the filters
for all the events under the subsystem. The bug is that if one of the
allocations of the list items for the event filters fails,
it jumps to the "free_now" label which will free the subsystem
filter, then all the items on the allocated list, and then the event
filters that were not added to the list yet. But because the
subsystem filter was added first, it gets freed twice.
The solution is to add the subsystem filter after the events, and
then if any of the allocations fail it will not try to free any of
them twice.
* tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix filter logic error
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"16 hotfixes.
6 are cc:stable and the remainder address post-6.15 issues or aren't
considered necessary for -stable kernels. 5 are for MM"
* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add Lorenzo as THP co-maintainer
mailmap: update Duje Mihanović's email address
selftests/mm: fix validate_addr() helper
crashdump: add CONFIG_KEYS dependency
mailmap: correct name for a historical account of Zijun Hu
mailmap: add entries for Zijun Hu
fuse: fix runtime warning on truncate_folio_batch_exceptionals()
scripts/gdb: fix dentry_name() lookup
mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
mm/hugetlb: remove unnecessary holding of hugetlb_lock
MAINTAINERS: add missing files to mm page alloc section
MAINTAINERS: add tree entry to mm init block
mm: add OOM killer maintainer structure
fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio
|
|
If the processing of the tr->events loop fails, the filter that has been
added to filter_head will be released twice, in free_filter_list(&head->rcu)
and in __free_filter(filter).
Add the subsystem filter to filter_head only after the filters for
tr->events have been added, to avoid triggering the use-after-free.
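A sketch of the corrected ordering; queue_filter_for_free() is a
hypothetical helper name used only for illustration:

  /* Queue the per-event filters first; add the subsystem filter only
   * after every list item was allocated, so a failure in the loop can
   * no longer free it twice. */
  list_for_each_entry(file, &tr->events, list) {
          if (queue_filter_for_free(head, file->filter)) /* hypothetical helper */
                  goto free_now;  /* subsystem filter is not on the list yet */
  }
  queue_filter_for_free(head, filter);                   /* subsystem filter last */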
Link: https://lore.kernel.org/tencent_4EF87A626D702F816CD0951CE956EC32CD0A@qq.com
Fixes: a9d0aab5eb33 ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=daba72c4af9915e9c894
Tested-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Currently swap is restricted before drivers have had a chance to do
their prepare() PM callbacks. Restricting swap this early means that if
a driver needs to evict some content from memory into swap in its
prepare() callback, it won't be able to.
On AMD dGPUs this can lead to failed suspends under memory pressure
situations as all VRAM must be evicted to system memory or swap.
Move the swap restriction to right after all devices have had a chance
to do the prepare() callback. If there is any problem with the sequence,
restore swap in the appropriate dpm resume callbacks or error handling
paths.
Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Nat Wittstock <nat@fardog.io>
Tested-by: Lucian Langa <lucilanga@7pot.org>
Link: https://patch.msgid.link/20250613214413.4127087-1-superm1@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
If an AUX event overruns, the event core layer intends to disable the
event by setting the 'pending_disable' flag. Unfortunately, the event
is not actually disabled afterwards.
In commit:
ca6c21327c6a ("perf: Fix missing SIGTRAPs")
the 'pending_disable' flag was changed to a boolean. However, the
AUX event code was not updated accordingly. The flag ends up holding a
CPU number. If this number is zero, the flag is taken as false and the
IRQ work is never triggered.
Later, with commit:
2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")
a new IRQ work 'pending_disable_irq' was introduced to handle event
disabling. The AUX event path was not updated to kick off this IRQ work.
To fix this bug, when an AUX ring buffer overrun is detected, call
perf_event_disable_inatomic() to initiate the pending disable flow.
Also update the outdated comment for setting the flag, to reflect the
boolean values (0 or 1).
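A sketch of the intended flow in the AUX output path, assuming the
overrun is reflected by PERF_AUX_FLAG_TRUNCATED (not the verbatim hunk):

  /* In perf_aux_output_end(), roughly: */
  if (wakeup) {
          /* An overrun truncated the AUX data: go through the real
           * pending-disable machinery instead of writing a CPU number
           * into the old flag. */
          if (handle->aux_flags & PERF_AUX_FLAG_TRUNCATED)
                  perf_event_disable_inatomic(handle->event);

          handle->event->pending_wakeup = 1;
          irq_work_queue(&handle->event->pending_irq);
  }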
Fixes: 2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")
Fixes: ca6c21327c6a ("perf: Fix missing SIGTRAPs")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Liang Kan <kan.liang@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-perf-users@vger.kernel.org
Link: https://lore.kernel.org/r/20250625170737.2918295-1-leo.yan@arm.com
|
|
Pull bpf fixes from Alexei Starovoitov:
- Fix use-after-free in libbpf when map is resized (Adin Scannell)
- Fix verifier assumptions about 2nd argument of bpf_sysctl_get_name
(Jerome Marchand)
- Fix verifier assumption of nullness of d_inode in dentry (Song Liu)
- Fix global starvation of LRU map (Willem de Bruijn)
- Fix potential NULL dereference in btf_dump__free (Yuan Chen)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: adapt one more case in test_lru_map to the new target_free
libbpf: Fix possible use-after-free for externs
selftests/bpf: Convert test_sysctl to prog_tests
bpf: Specify access type of bpf_sysctl_get_name args
libbpf: Fix null pointer dereference in btf_dump__free on allocation failure
bpf: Adjust free target to avoid global starvation of LRU map
bpf: Mark dentry->d_inode as trusted_or_null
|
|
Pull mount fixes from Al Viro:
"Several mount-related fixes"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
userns and mnt_idmap leak in open_tree_attr(2)
attach_recursive_mnt(): do not lock the covering tree when sliding something under it
replace collect_mounts()/drop_collected_mounts() with a safer variant
|
|
The dm_crypt code fails to build without CONFIG_KEYS:
kernel/crash_dump_dm_crypt.c: In function 'restore_dm_crypt_keys_to_thread_keyring':
kernel/crash_dump_dm_crypt.c:105:9: error: unknown type name 'key_ref_t'; did you mean 'key_ref_put'?
There is a mix of 'select KEYS' and 'depends on KEYS' in Kconfig,
so there is no single obvious solution here, but generally using 'depends on'
makes more sense and is less likely to cause dependency loops.
Link: https://lkml.kernel.org/r/20250620112140.3396316-1-arnd@kernel.org
Fixes: 62f17d9df692 ("crash_dump: retrieve dm crypt keys in kdump kernel")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Coiby Xu <coxu@redhat.com>
Cc: Dave Vasilevsky <dave@vasilevsky.ca>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The second argument of the bpf_sysctl_get_name() helper is a pointer to a
buffer that is being written to. However that isn't specified in the
prototype.
Until commit 37cce22dbd51a ("bpf: verifier: Refactor helper access
type tracking"), all helper accesses were considered as a possible
write access by the verifier, so no big harm was done. However, since
then, the verifier might make wrong assumptions about the content of
that address, which might lead it to make faulty optimizations (such as
removing code that was wrongly labeled dead). This is what happens in
the test_sysctl selftest for the tests related to sysctl_get_name.
Add the MEM_WRITE flag to the second argument of bpf_sysctl_get_name().
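A sketch of the resulting helper prototype; the surrounding field values
are assumptions based on the description:

  static const struct bpf_func_proto bpf_sysctl_get_name_proto = {
          .func           = bpf_sysctl_get_name,
          .gpl_only       = false,
          .ret_type       = RET_INTEGER,
          .arg1_type      = ARG_PTR_TO_CTX,
          .arg2_type      = ARG_PTR_TO_MEM | MEM_WRITE, /* buffer is written to */
          .arg3_type      = ARG_CONST_SIZE,
          .arg4_type      = ARG_ANYTHING,
  };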
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250619140603.148942-2-jmarchan@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
collect_mounts() has several problems - one can't iterate over the results
directly, so it has to be done with callback passed to iterate_mounts();
it has an oopsable race with d_invalidate(); it creates temporary clones
of mounts invisibly for sync umount (IOW, you can have non-lazy umount
succeed leaving filesystem not mounted anywhere and yet still busy).
A saner approach is to give caller an array of struct path that would pin
every mount in a subtree, without cloning any mounts.
* collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone
* collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or
a pointer to array of struct path, one for each chunk of tree visible under
'where' (i.e. the first element is a copy of where, followed by (mount,root)
for everything mounted under it - the same set collect_mounts() would give).
Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning
references to the roots of subtrees in the caller's namespace.
Array is terminated by {NULL, NULL} struct path. If it fits into
preallocated array (on-stack, normally), that's where it goes; otherwise
it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated'
is ignored (and expected to be NULL).
* drop_collected_paths(paths, preallocated) is given the array returned
by an earlier call of collect_paths() and the preallocated array passed to that
call. All mount/dentry references are dropped and array is kfree'd if it's not
equal to 'preallocated'.
* instead of iterate_mounts(), users should just iterate over array
of struct path - nothing exotic is needed for that. Existing users (all in
audit_tree.c) are converted.
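A usage sketch of the interface as described above (inspect_subtree_root()
is a hypothetical per-subtree action, not part of the API):

  struct path prealloc[16], *paths;
  const struct path *p;

  paths = collect_paths(&where, prealloc, ARRAY_SIZE(prealloc));
  if (IS_ERR(paths))
          return PTR_ERR(paths);

  for (p = paths; p->mnt; p++)        /* array is {NULL, NULL}-terminated */
          inspect_subtree_root(p);    /* hypothetical per-subtree work */

  drop_collected_paths(paths, prealloc);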
[folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>]
Fixes: 80b5dce8c59b0 ("vfs: Add a function to lazily unmount all mounts from any dentry")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 7 are cc:stable and the remainder address post-6.15
issues or aren't considered necessary for -stable kernels. Only 4 are
for MM.
- The series `Revert "bcache: update min_heap_callbacks to use
default builtin swap"' from Kuan-Wei Chiu backs out the author's
recent min_heap changes due to a performance regression.
A fix for this regression has been developed but we felt it best to
go back to the known-good version to give the new code more bake
time.
- A lot of MAINTAINERS maintenance.
I like to get these changes upstreamed promptly because they can't
break things and more accurate/complete MAINTAINERS info hopefully
improves the speed and accuracy of our responses to submitters and
reporters"
* tag 'mm-hotfixes-stable-2025-06-22-18-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add additional mmap-related files to mmap section
MAINTAINERS: add memfd, shmem quota files to shmem section
MAINTAINERS: add stray rmap file to mm rmap section
MAINTAINERS: add hugetlb_cgroup.c to hugetlb section
MAINTAINERS: add further init files to mm init block
MAINTAINERS: update maintainers for HugeTLB
maple_tree: fix MA_STATE_PREALLOC flag in mas_preallocate()
MAINTAINERS: add missing test files to mm gup section
MAINTAINERS: add missing mm/workingset.c file to mm reclaim section
selftests/mm: skip uprobe vma merge test if uprobes are not enabled
bcache: remove unnecessary select MIN_HEAP
Revert "bcache: remove heap-related macros and switch to generic min_heap"
Revert "bcache: update min_heap_callbacks to use default builtin swap"
selftests/mm: add configs to fix testcase failure
kho: initialize tail pages for higher order folios properly
MAINTAINERS: add linux-mm@ list to Kexec Handover
mm: userfaultfd: fix race of userfaultfd_move and swap cache
mm/gup: revert "mm: gup: fix infinite loop within __get_longterm_locked"
selftests/mm: increase timeout from 180 to 900 seconds
mm/shmem, swap: fix softlockup with mTHP swapin
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Fix missing prototypes warnings
- Properly initialize work context when allocating it
- Remove a method tracking when managed interrupts are suspended during
hotplug, in favor of the code using a IRQ disable depth tracking now,
and have interrupts get properly enabled again on restore
- Make sure multiple CPUs getting hotplugged don't cause wrong tracking
of the managed IRQ disable depth
* tag 'irq_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/ath79-misc: Fix missing prototypes warnings
genirq/irq_sim: Initialize work context pointers properly
genirq/cpuhotplug: Restore affinity even for suspended IRQ
genirq/cpuhotplug: Rebalance managed interrupts across multi-CPU hotplug
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Avoid a crash on a heterogeneous machine where not all cores support
the same hw events features
- Avoid a deadlock when throttling events
- Document the perf event states more
- Make sure a number of perf paths switching off or rescheduling events
call perf_cgroup_event_disable()
- Make sure perf does task sampling before its userspace mapping is
torn down, and not after
* tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix crash in icl_update_topdown_event()
perf: Fix the throttle error of some clock events
perf: Add comment to enum perf_event_state
perf/core: Fix WARN in perf_cgroup_switch()
perf: Fix dangling cgroup pointer in cpuctx
perf: Fix cgroup state vs ERROR
perf: Fix sample vs do_exit()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Borislav Petkov:
- Make sure the switch to the global hash is requested always under a
lock so that two threads requesting that simultaneously cannot get to
inconsistent state
- Reject negative NUMA nodes earlier in the futex NUMA interface
handling code
- Selftests fixes
* tag 'locking_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Verify under the lock if hash can be replaced
futex: Handle invalid node numbers supplied by user
selftests/futex: Set the home_node in futex_numa_mpol
selftests/futex: getopt() requires int as return value.
|
|
Currently the call_rcu() API does not check whether a callback
pointer is NULL. If NULL is passed, rcu_core() will try to invoke
it, resulting in NULL pointer dereference and a kernel crash.
To prevent this and improve debuggability, this patch adds a check
for NULL and emits a kernel stack trace to help identify a faulty
caller.
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
|
|
Currently, when restoring higher order folios, kho_restore_folio() only
calls prep_compound_page() on all the pages. That is not enough to
properly initialize the folios. The managed page count does not get
updated, the reserved flag does not get dropped, and page count does not
get initialized properly.
Restoring a higher order folio with it results in the following BUG with
CONFIG_DEBUG_VM when attempting to free the folio:
BUG: Bad page state in process test pfn:104e2b
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffffffffffffff pfn:0x104e2b
flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff)
raw: 002fffff80000000 0000000000000000 00000000ffffffff 0000000000000000
raw: ffffffffffffffff 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: nonzero _refcount
[...]
Call Trace:
<TASK>
dump_stack_lvl+0x4b/0x70
bad_page.cold+0x97/0xb2
__free_frozen_pages+0x616/0x850
[...]
Combine the path for 0-order and higher order folios, initialize the tail
pages with a count of zero, and call adjust_managed_page_count() to
account for all the pages instead of just missing them.
In addition, since all the KHO-preserved pages get marked with
MEMBLOCK_RSRV_NOINIT by deserialize_bitmap(), the reserved flag is not
actually set (as can also be seen from the flags of the dumped page in the
logs above). So drop the ClearPageReserved() calls.
[ptyadav@amazon.de: declare i in the loop instead of at the top]
Link: https://lkml.kernel.org/r/20250613125916.39272-1-pratyush@kernel.org
Link: https://lkml.kernel.org/r/20250605171143.76963-1-pratyush@kernel.org
Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
map is full, due to percpu reservations and force shrink before
neighbor stealing. Once a CPU is unable to borrow from the global map,
it will steal one element from a neighbor once, and after that it will
each time flush this one element to the global list and immediately
recycle it.
Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
with 79 CPUs. CPU 79 will observe this behavior even while its
neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
CPUs need not be active concurrently. The issue can appear with
affinity migration, e.g., irqbalance. Each CPU can reserve and then
hold onto its 128 elements indefinitely.
Avoid global list exhaustion by limiting aggregate percpu caches to
half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
This change has no effect on sufficiently large tables.
Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable
lru->free_target. The extra field fits in a hole in struct bpf_lru.
The cacheline is already warm where read in the hot path. The field is
only accessed with the lru lock held.
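A hypothetical illustration of the sizing rule described above; the exact
formula in the patch may differ:

  /* Hypothetical formula: cap aggregate per-CPU reservations at half of
   * the map, and never exceed the old 128-element batch. */
  lru->free_target = min_t(u32, LOCAL_FREE_TARGET,
                           max_t(u32, 1, nr_elems / 2 / num_possible_cpus()));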
Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20250618215803.3587312-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
- In cgroup1 freezer, a task migrating into a frozen cgroup might not
get frozen immediately due to the wrong operation order. Fix it.
* tag 'cgroup-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup,freezer: fix incomplete freezing when attaching tasks
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
- Fix missed early init of wq_isolated_cpumask
* tag 'wq-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Initialize wq_isolated_cpumask in workqueue_init_early()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix a couple bugs in cgroup cpu.weight support
- Add the new sched-ext@lists.linux.dev to MAINTAINERS
* tag 'sched_ext-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group()
sched_ext: Make scx_group_set_weight() always update tg->scx.weight
sched_ext: Update mailing list entry in MAINTAINERS
|
|
An issue was found:
# cd /sys/fs/cgroup/freezer/
# mkdir test
# echo FROZEN > test/freezer.state
# cat test/freezer.state
FROZEN
# sleep 1000 &
[1] 863
# echo 863 > test/cgroup.procs
# cat test/freezer.state
FREEZING
When tasks are migrated to a frozen cgroup, the freezer fails to
immediately freeze the tasks, causing the cgroup to remain in the
"FREEZING" state.
The freeze_task() function is called before clearing the CGROUP_FROZEN
flag. This causes the freezing() check to incorrectly return false,
preventing __freeze_task() from being invoked for the migrated task.
To fix this issue, clear the CGROUP_FROZEN state before calling
freeze_task().
Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic")
Cc: stable@vger.kernel.org # v6.1+
Reported-by: Zhong Jiawei <zhongjiawei1@huawei.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When setting the funcgraph-args option while the function graph tracer is
not enabled, it incorrectly enables it. Worse, it unregisters itself when
it was never registered. Then when it gets enabled again, it will register
itself a second time, causing a WARNing.
~# echo 1 > /sys/kernel/tracing/options/funcgraph-args
~# head -20 /sys/kernel/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 813/26317372 #P:8
#
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<idle>-0 [007] d..4. 358.966010: 7) 1.692 us | fetch_next_timer_interrupt(basej=4294981640, basem=357956000000, base_local=0xffff88823c3ae040, base_global=0xffff88823c3af300, tevt=0xffff888100e47cb8);
<idle>-0 [007] d..4. 358.966012: 7) | tmigr_cpu_deactivate(nextexp=357988000000) {
<idle>-0 [007] d..4. 358.966013: 7) | _raw_spin_lock(lock=0xffff88823c3b2320) {
<idle>-0 [007] d..4. 358.966014: 7) 0.981 us | preempt_count_add(val=1);
<idle>-0 [007] d..5. 358.966017: 7) 1.058 us | do_raw_spin_lock(lock=0xffff88823c3b2320);
<idle>-0 [007] d..4. 358.966019: 7) 5.824 us | }
<idle>-0 [007] d..5. 358.966021: 7) | tmigr_inactive_up(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) {
<idle>-0 [007] d..5. 358.966022: 7) | tmigr_update_events(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) {
Notice the "tracer: nop" at the top there. The current tracer is the "nop"
tracer, but the content is obviously the function graph tracer.
Enabling function graph tracing will cause it to register again and
trigger a warning in the accounting:
~# echo function_graph > /sys/kernel/tracing/current_tracer
-bash: echo: write error: Device or resource busy
With the dmesg of:
------------[ cut here ]------------
WARNING: CPU: 7 PID: 1095 at kernel/trace/ftrace.c:3509 ftrace_startup_subops+0xc1e/0x1000
Modules linked in: kvm_intel kvm irqbypass
CPU: 7 UID: 0 PID: 1095 Comm: bash Not tainted 6.16.0-rc2-test-00006-gea03de4105d3 #24 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:ftrace_startup_subops+0xc1e/0x1000
Code: 48 b8 22 01 00 00 00 00 ad de 49 89 84 24 88 01 00 00 8b 44 24 08 89 04 24 e9 c3 f7 ff ff c7 04 24 ed ff ff ff e9 b7 f7 ff ff <0f> 0b c7 04 24 f0 ff ff ff e9 a9 f7 ff ff c7 04 24 f4 ff ff ff e9
RSP: 0018:ffff888133cff948 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 1ffff1102679ff31 RCX: 0000000000000000
RDX: 1ffffffff0b27a60 RSI: ffffffff8593d2f0 RDI: ffffffff85941140
RBP: 00000000000c2041 R08: ffffffffffffffff R09: ffffed1020240221
R10: ffff88810120110f R11: ffffed1020240214 R12: ffffffff8593d2f0
R13: ffffffff8593d300 R14: ffffffff85941140 R15: ffffffff85631100
FS: 00007f7ec6f28740(0000) GS:ffff8882b5251000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7ec6f181c0 CR3: 000000012f1d0005 CR4: 0000000000172ef0
Call Trace:
<TASK>
? __pfx_ftrace_startup_subops+0x10/0x10
? find_held_lock+0x2b/0x80
? ftrace_stub_direct_tramp+0x10/0x10
? ftrace_stub_direct_tramp+0x10/0x10
? trace_preempt_on+0xd0/0x110
? __pfx_trace_graph_entry_args+0x10/0x10
register_ftrace_graph+0x4d2/0x1020
? tracing_reset_online_cpus+0x14b/0x1e0
? __pfx_register_ftrace_graph+0x10/0x10
? ring_buffer_record_enable+0x16/0x20
? tracing_reset_online_cpus+0x153/0x1e0
? __pfx_tracing_reset_online_cpus+0x10/0x10
? __pfx_trace_graph_return+0x10/0x10
graph_trace_init+0xfd/0x160
tracing_set_tracer+0x500/0xa80
? __pfx_tracing_set_tracer+0x10/0x10
? lock_release+0x181/0x2d0
? _copy_from_user+0x26/0xa0
tracing_set_trace_write+0x132/0x1e0
? __pfx_tracing_set_trace_write+0x10/0x10
? ftrace_graph_func+0xcc/0x140
? ftrace_stub_direct_tramp+0x10/0x10
? ftrace_stub_direct_tramp+0x10/0x10
? ftrace_stub_direct_tramp+0x10/0x10
vfs_write+0x1d0/0xe90
? __pfx_vfs_write+0x10/0x10
Have the setting of the funcgraph-args check if function_graph tracer is
the current tracer of the instance, and if not, do nothing, as there's
nothing to do (the option is checked when function_graph tracing starts).
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250618073801.057ea636@gandalf.local.home
Fixes: c7a60a733c373 ("ftrace: Have funcgraph-args take affect during tracing")
Closes: https://lore.kernel.org/all/4ab1a7bdd0174ab09c7b0d68cdbff9a4@huawei.com/
Reported-by: Changbin Du <changbin.du@huawei.com>
Tested-by: Changbin Du <changbin.du@huawei.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Now when isolcpus is enabled via the cmdline, wq_isolated_cpumask does
not include these isolated CPUs, even though wq_unbound_cpumask has already
excluded them. It is only when we successfully configure an isolated cpuset
partition that wq_isolated_cpumask gets overwritten by
workqueue_unbound_exclude_cpumask(), including both the cmdline-specified
isolated CPUs and the isolated CPUs within the cpuset partitions.
Fix this issue by initializing wq_isolated_cpumask properly in
workqueue_init_early().
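A sketch of the early initialization, assuming the isolated set is derived
from the HK_TYPE_DOMAIN housekeeping mask (not the verbatim hunk):

  /* In workqueue_init_early(), roughly: */
  BUG_ON(!alloc_cpumask_var(&wq_isolated_cpumask, GFP_KERNEL));
  cpumask_andnot(wq_isolated_cpumask, cpu_possible_mask,
                 housekeeping_cpumask(HK_TYPE_DOMAIN));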
Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
During task_group creation, sched_create_group() calls
scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext
portion. This is premature and ends up calling ops.cgroup_set_weight() with
an incorrect @cgrp before ops.cgroup_init() is called.
sched_create_group() should just initialize SCX related fields in the new
task_group. Fix it by factoring out scx_tg_init() from sched_init() and
making sched_create_group() call that function instead of
scx_group_set_weight().
v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads
to build failures on !CONFIG_GROUP_SCHED configs.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
|
|
Otherwise, tg->scx.weight can go out of sync while scx_cgroup is not enabled
and ops.cgroup_init() may be called with a stale weight value.
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
|
|
LSM hooks such as security_path_mknod() and security_inode_rename() have
access to a newly allocated negative dentry, which has a NULL d_inode.
Therefore, it is necessary to do the NULL pointer check for d_inode.
Also add selftests that check that the verifier enforces the NULL pointer
check.
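A hypothetical BPF LSM program illustrating the NULL check the verifier
now enforces (not one of the added selftests):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  SEC("lsm/inode_rename")
  int BPF_PROG(rename_check, struct inode *old_dir, struct dentry *old_dentry,
               struct inode *new_dir, struct dentry *new_dentry)
  {
          struct inode *inode = new_dentry->d_inode;

          /* d_inode is trusted_or_null: dereferencing it without this
           * check is now rejected by the verifier. */
          if (!inode)
                  return 0;

          bpf_printk("rename over existing inode %lu", inode->i_ino);
          return 0;
  }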
Signed-off-by: Song Liu <song@kernel.org>
Reviewed-by: Matt Bobrowski <mattbobrowski@google.com>
Link: https://lore.kernel.org/r/20250613052857.1992233-1-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
If an exiting non-autoreaping task has already passed exit_notify() and
calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent
or debugger right after unlock_task_sighand().
If a concurrent posix_cpu_timer_del() runs at that moment, it won't be
able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or
lock_task_sighand() will fail.
Add the tsk->exit_state check into run_posix_cpu_timers() to fix this.
This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because
exit_task_work() is called before exit_notify(). But the check still
makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail
anyway in this case.
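A sketch of the added bail-out (assuming the current void-argument form of
run_posix_cpu_timers()):

  void run_posix_cpu_timers(void)
  {
          struct task_struct *tsk = current;

          lockdep_assert_irqs_disabled();

          /* Past exit_notify() the task can be reaped by its parent or
           * debugger right after unlock_task_sighand(), so don't start
           * firing timers that posix_cpu_timer_del() could then miss. */
          if (tsk->exit_state)
                  return;

          /* ... existing handling ... */
  }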
Cc: stable@vger.kernel.org
Reported-by: Benoît Sevens <bsevens@google.com>
Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Initialize `ops` member's pointers properly by using kzalloc() instead of
kmalloc() when allocating the simulation work context. Otherwise the
pointers contain random content leading to invalid dereferencing.
Signed-off-by: Gyeyoung Baek <gye976@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250612124827.63259-1-gye976@gmail.com
|
|
Commit 788019eb559f ("genirq: Retain disable depth for managed interrupts
across CPU hotplug") tried to make managed shutdown/startup properly
reference counted, but it missed the fact that the unplug and hotplug code
has an intentional imbalance by skipping IRQS_SUSPENDED interrupts on
the "restore" path.
This means that if a managed-affinity interrupt was both suspended and
managed-shutdown (such as may happen during system suspend / S3), resume
skips calling irq_startup_managed(), and would again have an unbalanced
depth this time, with a positive value (i.e., remaining unexpectedly
masked).
This IRQS_SUSPENDED check was introduced in commit a60dd06af674
("genirq/cpuhotplug: Skip suspended interrupts when restoring affinity")
for essentially the same reason as commit 788019eb559f, to prevent that
irq_startup() would unconditionally re-enable an interrupt too early.
Because irq_startup_managed() now respects the disable-depth count, the
IRQS_SUSPENDED check is no longer needed, and instead, it causes harm.
Thus, drop the IRQS_SUSPENDED check, and restore balance.
This effectively reverts commit a60dd06af674 ("genirq/cpuhotplug: Skip
suspended interrupts when restoring affinity"), because it is replaced
by commit 788019eb559f ("genirq: Retain disable depth for managed
interrupts across CPU hotplug").
Fixes: 788019eb559f ("genirq: Retain disable depth for managed interrupts across CPU hotplug")
Reported-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Link: https://lore.kernel.org/all/20250612183303.3433234-3-briannorris@chromium.org
Closes: https://lore.kernel.org/lkml/24ec4adc-7c80-49e9-93ee-19908a97ab84@gmail.com/
|
|
Commit 788019eb559f ("genirq: Retain disable depth for managed interrupts
across CPU hotplug") intended to only decrement the disable depth once per
managed shutdown, but instead it decrements for each CPU hotplug in the
affinity mask, until its depth reaches a point where it finally gets
re-started.
For example, consider:
1. Interrupt is affine to CPU {M,N}
2. disable_irq() -> depth is 1
3. CPU M goes offline -> interrupt migrates to CPU N / depth is still 1
4. CPU N goes offline -> irq_shutdown() / depth is 2
5. CPU N goes online
-> irq_restore_affinity_of_irq()
-> irqd_is_managed_and_shutdown()==true
-> irq_startup_managed() -> depth is 1
6. CPU M goes online
-> irq_restore_affinity_of_irq()
-> irqd_is_managed_and_shutdown()==true
-> irq_startup_managed() -> depth is 0
*** BUG: driver expects the interrupt is still disabled ***
-> irq_startup() -> irqd_clr_managed_shutdown()
7. enable_irq() -> depth underflow / unbalanced enable_irq() warning
This should clear the managed-shutdown flag at step 6, so that further
hotplugs don't cause further imbalance.
Note: It might be cleaner to also remove the irqd_clr_managed_shutdown()
invocation from __irq_startup_managed(). But this is currently not possible
because of irq_update_affinity_desc() as it sets IRQD_MANAGED_SHUTDOWN and
expects irq_startup() to clear it.
Fixes: 788019eb559f ("genirq: Retain disable depth for managed interrupts across CPU hotplug")
Reported-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Aleksandrs Vinarskis <alex.vinarskis@gmail.com>
Link: https://lore.kernel.org/all/20250612183303.3433234-2-briannorris@chromium.org
|
|
When porting a CMA-related usage from an x86_64 server to an arm64 server,
the "cma=4G@4G" setup failed on arm64. The reason is that arm64 and some
other architectures have a specific physical address limit for the reserved
CMA area, like 4GB due to some devices' need for 32 bit DMA. Actually lots
of platforms of those architectures don't have this device DMA limit, but
still have to obey it, and are not able to reserve a huge CMA pool.
This situation could be improved by honoring the user-specified CMA
physical address rather than the arch limit: when users specify it, they
already know what the default is and that it probably doesn't suit them.
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20250612021417.44929-1-feng.tang@linux.alibaba.com
|
|
Once the global hash is requested there is no way to switch back to
the per-task private hash. This is checked at the beginning of the function.
It is possible that two threads simultaneously request the global hash
and both pass the initial check and block later on the
mm::futex_hash_lock. In this case the first thread performs the switch
to the global hash. The second thread will also attempt to switch to the
global hash and, while doing so, accesses the nonexistent slot 1 of the
struct futex_private_hash.
The same applies if the hash is made immutable: There is no reference
counting and the hash must not be replaced.
Verify under mm_struct::futex_phash that neither the global hash nor an
immutable hash is in use.
Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/aDwDw9Aygqo6oAx+@ly-workstation/
Fixes: bd54df5ea7cad ("futex: Allow to resize the private local hash")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20250610104400.1077266-5-bigeasy@linutronix.de/
|