Age | Commit message (Collapse) | Author |
|
The file /sys/kernel/debug/tracing/eneabled_functions is used to debug
ftrace function hooks. Add to the output what function is being called
by the trampoline if the arch supports it.
Add support for this feature in x86_64.
Cc: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
The current method of handling multiple function callbacks is to register
a list function callback that calls all the other callbacks based on
their hash tables and compare it to the function that the callback was
called on. But this is very inefficient.
For example, if you are tracing all functions in the kernel and then
add a kprobe to a function such that the kprobe uses ftrace, the
mcount trampoline will switch from calling the function trace callback
to calling the list callback that will iterate over all registered
ftrace_ops (in this case, the function tracer and the kprobes callback).
That means for every function being traced it checks the hash of the
ftrace_ops for function tracing and kprobes, even though the kprobes
is only set at a single function. The kprobes ftrace_ops is checked
for every function being traced!
Instead of calling the list function for functions that are only being
traced by a single callback, we can call a dynamically allocated
trampoline that calls the callback directly. The function graph tracer
already uses a direct call trampoline when it is being traced by itself
but it is not dynamically allocated. It's trampoline is static in the
kernel core. The infrastructure that called the function graph trampoline
can also be used to call a dynamically allocated one.
For now, only ftrace_ops that are not dynamically allocated can have
a trampoline. That is, users such as function tracer or stack tracer.
kprobes and perf allocate their ftrace_ops, and until there's a safe
way to free the trampoline, it can not be used. The dynamically allocated
ftrace_ops may, although, use the trampoline if the kernel is not
compiled with CONFIG_PREEMPT. But that will come later.
Tested-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls. If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.
# trace-cmd record -e raw_syscalls:* true && trace-cmd report
...
true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264
true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
true-653 [000] 384.675988: sys_exit: NR 983045 = 0
...
# trace-cmd record -e syscalls:* true
[ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace
[ 17.289590] pgd = 9e71c000
[ 17.289696] [aaaaaace] *pgd=00000000
[ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 17.290169] Modules linked in:
[ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
[ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
[ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
[ 17.290866] LR is at syscall_trace_enter+0x124/0x184
Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.
Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
added the check for less than zero, but it should have also checked
for greater than NR_syscalls.
Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in
Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls"
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
Add a space between subj= and feature= fields to make them parsable.
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moore <pmoore@redhat.com>
|
|
verifier keeps track of register state spilled to stack.
registers are 8-byte wide and always aligned, so instead of tracking them
in every byte-sized stack slot, use MAX_BPF_STACK / 8 array to track
spilled register state.
Though verifier runs in user context and its state freed immediately
after verification, it makes sense to reduce its memory usage.
This optimization reduces sizeof(struct verifier_state)
from 12464 to 1712 on 64-bit and from 6232 to 1112 on 32-bit.
Note, this patch doesn't change existing limits, which are there to bound
time and memory during verification: 4k total number of insns in a program,
1k number of jumps (states to visit) and 32k number of processed insn
(since an insn may be visited multiple times). Theoretical worst case memory
during verification is 1712 * 1k = 17Mbyte. Out-of-memory situation triggers
cleanup and rejects the program.
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Pull two RCU fixes from Paul E. McKenney:
" - Complete the work of commit dd56af42bd82 (rcu: Eliminate deadlock
between CPU hotplug and expedited grace periods), which was
intended to allow synchronize_sched_expedited() to be safely
used when holding locks acquired by CPU-hotplug notifiers.
This commit makes the put_online_cpus() avoid the deadlock
instead of just handling the get_online_cpus().
- Complete the work of commit 35ce7f29a44a (rcu: Create rcuo
kthreads only for onlined CPUs), which was intended to allow
RCU to avoid allocating unneeded kthreads on systems where the
firmware says that there are more CPUs than are really present.
This commit makes rcu_barrier() aware of the mismatch, so that
it doesn't hang waiting for non-existent CPUs. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Found this in the message log on a s390 system:
BUG kmalloc-192 (Not tainted): Poison overwritten
Disabling lock debugging due to kernel taint
INFO: 0x00000000684761f4-0x00000000684761f7. First byte 0xff instead of 0x6b
INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648
__slab_alloc.isra.47.constprop.56+0x5f6/0x658
kmem_cache_alloc_trace+0x106/0x408
call_usermodehelper_setup+0x70/0x128
call_usermodehelper+0x62/0x90
cgroup_release_agent+0x178/0x1c0
process_one_work+0x36e/0x680
worker_thread+0x2f0/0x4f8
kthread+0x10a/0x120
kernel_thread_starter+0x6/0xc
kernel_thread_starter+0x0/0xc
INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648
__slab_free+0x94/0x560
kfree+0x364/0x3e0
call_usermodehelper_exec+0x110/0x1b8
cgroup_release_agent+0x178/0x1c0
process_one_work+0x36e/0x680
worker_thread+0x2f0/0x4f8
kthread+0x10a/0x120
kernel_thread_starter+0x6/0xc
kernel_thread_starter+0x0/0xc
There is a use-after-free bug on the subprocess_info structure allocated
by the user mode helper. In case do_execve() returns with an error
____call_usermodehelper() stores the error code to sub_info->retval, but
sub_info can already have been freed.
Regarding UMH_NO_WAIT, the sub_info structure can be freed by
__call_usermodehelper() before the worker thread returns from
do_execve(), allowing memory corruption when do_execve() failed after
exec_mmap() is called.
Regarding UMH_WAIT_EXEC, the call to umh_complete() allows
call_usermodehelper_exec() to continue which then frees sub_info.
To fix this race the code needs to make sure that the call to
call_usermodehelper_freeinfo() is always done after the last store to
sub_info->retval.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Following up the arm testing of gcov, turns out gcov on ARM64 works fine
as well. Only change needed is adding ARM64 to Kconfig depends.
Tested with qemu and mach-virt
Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
contains checks for the case where CPUs are brought online out of
order, re-wiring the rcuo leader-follower relationships as needed.
Unfortunately, this rewiring was broken. This apparently went undetected
due to the tendency of systems to bring CPUs online in order. This commit
nevertheless fixes the rewiring.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
If a no-CBs CPU were to post an RCU callback with interrupts disabled
after it entered the idle loop for the last time, there might be no
deferred wakeup for the corresponding rcuo kthreads. This commit
therefore adds a set of calls to do_nocb_deferred_wakeup() after the
CPU has gone completely offline.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
PREEMPT_RCU and TREE_PREEMPT_RCU serve the same function after
TINY_PREEMPT_RCU has been removed. This patch removes TREE_PREEMPT_RCU
and uses PREEMPT_RCU config option in its place.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
Rename CONFIG_RCU_BOOST_PRIO to CONFIG_RCU_KTHREAD_PRIO and use this
value for both the per-CPU kthreads (rcuc/N) and the rcu boosting
threads (rcub/n).
Also, create the module_parameter rcutree.kthread_prio to be used on
the kernel command line at boot to set a new value (rcutree.kthread_prio=N).
Signed-off-by: Clark Williams <clark.williams@gmail.com>
[ paulmck: Ported to rcu/dev, applied Paul Bolle and Peter Zijlstra feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
__cleanup_sighand() frees sighand without RCU grace period. This is
correct but this looks "obviously buggy" and constantly confuses the
readers, add the comments to explain how this works.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
|
|
The kill_pid_info() can potentially loop indefinitely if tasks are created
and deleted sufficiently quickly, and if this happens, this function
will remain in a single RCU read-side critical section indefinitely.
This commit therefore exits the RCU read-side critical section on each
pass through the loop. Because a race must happen to retry the loop,
this should have no performance impact in the common case.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
|
|
During the 3.18 merge period additional __get_cpu_var uses were
added. The patch converts these to this_cpu_ptr().
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
CLOCK_REALTIME
ktime_get_real_seconds() is the replacement function for get_seconds()
returning the seconds portion of CLOCK_REALTIME in a time64_t. For
64bit the function is equivivalent to get_seconds(), but for 32bit it
protects the readout with the timekeeper sequence count. This is
required because 32-bit machines cannot access 64-bit tk->xtime_sec
variable atomically.
[tglx: Massaged changelog and added docbook comment ]
Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergman <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: opw-kernel@googlegroups.com
Link: http://lkml.kernel.org/r/7adcfaa8962b8ad58785d9a2456c3f77d93c0ffb.1414578445.git.heenasirwani@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This is the counterpart to get_seconds() based on CLOCK_MONOTONIC. The
use case for this interface are kernel internal coarse grained
timestamps which do neither require the nanoseconds fraction of
current time nor the CLOCK_REALTIME properties. Such timestamps can
currently only retrieved by calling ktime_get_ts64() and using the
tv_sec field of the returned timespec64. That's inefficient as it
involves the read of the clocksource, math operations and must be
protected by the timekeeper sequence counter.
To avoid the sequence counter protection we restrict the return value
to unsigned 32bit on 32bit machines. This covers ~136 years of uptime
and therefor an overflow is not expected to hit anytime soon.
To avoid math in the function we calculate the current seconds portion
of CLOCK_MONOTONIC when the timekeeper gets updated in
tk_update_ktime_data() similar to the CLOCK_REALTIME counterpart
xtime_sec.
[ tglx: Massaged changelog, simplified and commented the update
function, added docbook comment ]
Signed-off-by: Heena Sirwani <heenasirwani@gmail.com>
Reviewed-by: Arnd Bergman <arnd@arndb.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: opw-kernel@googlegroups.com
Link: http://lkml.kernel.org/r/da0b63f4bdf3478909f92becb35861197da3a905.1414578445.git.heenasirwani@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
Simple typo in a comment, so fix it.
Signed-off-by: James Hartley <james.hartley@imgtec.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
Currently, synchronize_sched_expedited() sends IPIs to all online CPUs,
even those that are idle or executing in nohz_full= userspace. Because
idle CPUs and nohz_full= userspace CPUs are in extended quiescent states,
there is no need to IPI them in the first place. This commit therefore
avoids IPIing CPUs that are already in extended quiescent states.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
There are some RCU_BOOST-specific per-CPU variable declarations that
are needlessly defined under #ifdef in kernel/rcu/tree.c. This commit
therefore moves these declarations into a pre-existing #ifdef in
kernel/rcu/tree_plugin.h.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
The CONFIG_RCU_CPU_STALL_VERBOSE Kconfig parameter causes preemptible
RCU's CPU stall warnings to dump out any preempted tasks that are blocking
the current RCU grace period. This information is useful, and the default
has been CONFIG_RCU_CPU_STALL_VERBOSE=y for some years. It is therefore
time for this commit to remove this Kconfig parameter, so that future
kernel builds will always act as if CONFIG_RCU_CPU_STALL_VERBOSE=y.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace trampoline accounting fixes from Steven Rostedt:
"Adding the new code for 3.19, I discovered a couple of minor bugs with
the accounting of the ftrace_ops trampoline logic.
One was that the old hash was not updated before calling the modify
code for an ftrace_ops. The second bug was what let the first bug go
unnoticed, as the update would check the current hash for all
ftrace_ops (where it should only check the old hash for modified
ones). This let things work when only one ftrace_ops was registered
to a function, but could break if more than one was registered
depending on the order of the look ups.
The worse thing that can happen if this bug triggers is that the
ftrace self checks would find an anomaly and shut itself down"
* tag 'trace-fixes-v3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix checking of trampoline ftrace_ops in finding trampoline
ftrace: Set ops->old_hash on modifying what an ops hooks to
|
|
Commit 35ce7f29a44a (rcu: Create rcuo kthreads only for onlined CPUs)
avoids creating rcuo kthreads for CPUs that never come online. This
fixes a bug in many instances of firmware: Instead of lying about their
age, these systems instead lie about the number of CPUs that they have.
Before commit 35ce7f29a44a, this could result in huge numbers of useless
rcuo kthreads being created.
It appears that experience indicates that I should have told the
people suffering from this problem to fix their broken firmware, but
I instead produced what turned out to be a partial fix. The missing
piece supplied by this commit makes sure that rcu_barrier() knows not to
post callbacks for no-CBs CPUs that have not yet come online, because
otherwise rcu_barrier() will hang on systems having firmware that lies
about the number of CPUs.
It is tempting to simply have rcu_barrier() refuse to post a callback on
any no-CBs CPU that does not have an rcuo kthread. This unfortunately
does not work because rcu_barrier() is required to wait for all pending
callbacks. It is therefore required to wait even for those callbacks
that cannot possibly be invoked. Even if doing so hangs the system.
Given that posting a callback to a no-CBs CPU that does not yet have an
rcuo kthread can hang rcu_barrier(), It is tempting to report an error
in this case. Unfortunately, this will result in false positives at
boot time, when it is perfectly legal to post callbacks to the boot CPU
before the scheduler has started, in other words, before it is legal
to invoke rcu_barrier().
So this commit instead has rcu_barrier() avoid posting callbacks to
CPUs having neither rcuo kthread nor pending callbacks, and has it
complain bitterly if it finds CPUs having no rcuo kthread but some
pending callbacks. And when rcu_barrier() does find CPUs having no rcuo
kthread but pending callbacks, as noted earlier, it has no choice but
to hang indefinitely.
Reported-by: Yanko Kaneti <yaneti@declera.com>
Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Eric B Munson <emunson@akamai.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Eric B Munson <emunson@akamai.com>
Tested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Tested-by: Yanko Kaneti <yaneti@declera.com>
Tested-by: Kevin Fenzi <kevin@scrye.com>
Tested-by: Meelis Roos <mroos@linux.ee>
|
|
cond_resched() is a preemption point, not strictly a blocking
primitive, so exclude it from the ->state test.
In particular, preemption preserves task_struct::state.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.656559952@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Validate we call might_sleep() with TASK_RUNNING, which catches places
where we nest blocking primitives, eg. mutex usage in a wait loop.
Since all blocking is arranged through task_struct::state, nesting
this will cause the inner primitive to set TASK_RUNNING and the outer
will thus not block.
Another observed problem is calling a blocking function from
schedule()->sched_submit_work()->blk_schedule_flush_plug() which will
then destroy the task state for the actual __schedule() call that
comes after it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.591637616@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
This is a genuine bug in add_unformed_module(), we cannot use blocking
primitives inside a wait loop.
So rewrite the wait_event_interruptible() usage to use the fresh
wait_woken() stuff.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: http://lkml.kernel.org/r/20140924082242.458562904@infradead.org
[ So this is probably complex to backport and the race wasn't reported AFAIK,
so not marked for -stable. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
smp_hotplug_thread::{setup,unpark} functions can sleep too, so be
consistent and do the same for all callbacks.
__might_sleep+0x74/0x80
kmem_cache_alloc_trace+0x4e/0x1c0
perf_event_alloc+0x55/0x450
perf_event_create_kernel_counter+0x2f/0x100
watchdog_nmi_enable+0x8d/0x160
watchdog_enable+0x45/0x90
smpboot_thread_fn+0xec/0x2b0
kthread+0xe4/0x100
ret_from_fork+0x7c/0xb0
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: oleg@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.392279328@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
do_wait() is a big wait loop, but we set TASK_RUNNING too late; we end
up calling potential sleeps before we reset it.
Not strictly a bug since we're guaranteed to exit the loop and not
call schedule(); put in annotations to quiet might_sleep().
WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:7123 __might_sleep+0x7e/0x90()
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109a788>] do_wait+0x88/0x270
Call Trace:
[<ffffffff81694991>] dump_stack+0x4e/0x7a
[<ffffffff8109877c>] warn_slowpath_common+0x8c/0xc0
[<ffffffff8109886c>] warn_slowpath_fmt+0x4c/0x50
[<ffffffff810bca6e>] __might_sleep+0x7e/0x90
[<ffffffff811a1c15>] might_fault+0x55/0xb0
[<ffffffff8109a3fb>] wait_consider_task+0x90b/0xc10
[<ffffffff8109a804>] do_wait+0x104/0x270
[<ffffffff8109b837>] SyS_wait4+0x77/0x100
[<ffffffff8169d692>] system_call_fastpath+0x16/0x1b
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: umgwanakikbuti@gmail.com
Cc: ilya.dryomov@inktank.com
Cc: Alex Elder <alex.elder@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Lin <axel.lin@ingics.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140924082242.186408915@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There are a few places that call blocking primitives from wait loops,
provide infrastructure to support this without the typical
task_struct::state collision.
We record the wakeup in wait_queue_t::flags which leaves
task_struct::state free to be used by others.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082242.051202318@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
We're going to make might_sleep() test for TASK_RUNNING, because
blocking without TASK_RUNNING will destroy the task state by setting
it to TASK_RUNNING.
There are a few occasions where its 'valid' to call blocking
primitives (and mutex_lock in particular) and not have TASK_RUNNING,
typically such cases are right before we set TASK_RUNNING anyhow.
Robustify the code by not assuming this; this has the beneficial side
effect of allowing optional code emission for fixing the above
might_sleep() false positives.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: ilya.dryomov@inktank.com
Cc: umgwanakikbuti@gmail.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140924082241.988560063@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Andy reported that the current state of event_idx is rather confused.
So remove all but the x86_pmu implementation and change the default to
return 0 (the safe option).
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Cody P Schafer <dev@codyps.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Himangi Saraogi <himangi774@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: sukadev@linux.vnet.ibm.com <sukadev@linux.vnet.ibm.com>
Cc: Thomas Huth <thuth@linux.vnet.ibm.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux390@de.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Use nr_cpus_allowed to bail from select_task_rq() when only one cpu
can be used, and saves some cycles for pinned tasks.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413253360-5318-2-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
There is no need to do balance during fork since SCHED_DEADLINE
tasks can't fork. This patch avoid the SD_BALANCE_FORK check.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1413253360-5318-1-git-send-email-wanpeng.li@linux.intel.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
How we deal with updates to exclusive cpusets is currently broken.
As an example, suppose we have an exclusive cpuset composed of
two cpus: A[cpu0,cpu1]. We can assign SCHED_DEADLINE task to it
up to the allowed bandwidth. If we want now to modify cpusetA's
cpumask, we have to check that removing a cpu's amount of
bandwidth doesn't break AC guarantees. This thing isn't checked
in the current code.
This patch fixes the problem above, denying an update if the
new cpumask won't have enough bandwidth for SCHED_DEADLINE tasks
that are currently active.
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/5433E6AF.5080105@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
exclusive cpusets
Exclusive cpusets are the only way users can restrict SCHED_DEADLINE tasks
affinity (performing what is commonly called clustered scheduling).
Unfortunately, such thing is currently broken for two reasons:
- No check is performed when the user tries to attach a task to
an exlusive cpuset (recall that exclusive cpusets have an
associated maximum allowed bandwidth).
- Bandwidths of source and destination cpusets are not correctly
updated after a task is migrated between them.
This patch fixes both things at once, as they are opposite faces
of the same coin.
The check is performed in cpuset_can_attach(), as there aren't any
points of failure after that function. The updated is split in two
halves. We first reserve bandwidth in the destination cpuset, after
we pass the check in cpuset_can_attach(). And we then release
bandwidth from the source cpuset when the task's affinity is
actually changed. Even if there can be time windows when sched_setattr()
may erroneously fail in the source cpuset, we are fine with it, as
we can't perfom an atomic update of both cpusets at once.
Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Reported-by: Vincent Legout <vincent@legout.info>
Signed-off-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: Michael Trimarchi <michael@amarulasolutions.com>
Cc: Fabio Checconi <fchecconi@gmail.com>
Cc: michael@amarulasolutions.com
Cc: luca.abeni@unitn.it
Cc: Li Zefan <lizefan@huawei.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: cgroups@vger.kernel.org
Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
As Kirill mentioned (https://lkml.org/lkml/2013/1/29/118):
| If rq has already had 2 or more pushable tasks and we try to add a
| pinned task then call of push_rt_task will just waste a time.
Just switched pinned task is not able to be pushed. If the rq has had
several dl tasks before they have already been considered as candidates
to be pushed (or pulled). This patch implements the same behavior as rt
class which introduced by commit 10447917551e ("sched/rt: Do not try to
push tasks if pinned task switches to RT").
Suggested-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Juri Lelli <juri.lelli@arm.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413938203-224610-1-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
task_preempt_count() is pointless if preemption counter is per-cpu,
currently this is x86 only. It is only valid if the task is not
running, and even in this case the only info it can provide is the
state of PREEMPT_ACTIVE bit.
Change its single caller to check p->on_rq instead, this should be
the same if p->state != TASK_RUNNING, and kill this helper.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Link: http://lkml.kernel.org/r/20141008183348.GC17495@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Both callers of finish_task_switch() need to recalculate this_rq()
and pass it as an argument, plus __schedule() does this again after
context_switch().
It would be simpler to call this_rq() once in finish_task_switch()
and return the this rq to the callers.
Note: probably "int cpu" in __schedule() should die; it is not used
and both rcu_note_context_switch() and wq_worker_sleeping() do not
really need this argument.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141009193232.GB5408@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
finish_task_switch() enables preemption, so post_schedule(rq) can be
called on the wrong (and even dead) CPU. Afaics, nothing really bad
can happen, but in this case we can wrongly clear rq->post_schedule
on that CPU. And this simply looks wrong in any case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141008193644.GA32055@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In pseudo-interleaved numa_groups, all tasks try to relocate to
the group's preferred_nid. When a group is spread across multiple
NUMA nodes, this can lead to tasks swapping their location with
other tasks inside the same group, instead of swapping location with
tasks from other NUMA groups. This can keep NUMA groups from converging.
Examining all nodes, when dealing with a task in a pseudo-interleaved
NUMA group, avoids this problem. Note that only CPUs in nodes that
improve the task or group score are examined, so the loop isn't too
bad.
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Vinod Chegu" <chegu_vinod@hp.com>
Cc: mgorman@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141009172747.0d97c38c@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
On systems with complex NUMA topologies, the node scoring is adjusted
to allow workloads to converge on nodes that are near each other.
The way a task group's preferred nid is determined needs to be adjusted,
in order for the preferred_nid to be consistent with group_weight scoring.
This ensures that we actually try to converge workloads on adjacent nodes.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-6-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.
In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.
Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by right shifting the faults on
nearby nodes by the number of hops from the node being scored. In this
context, a nearby node is any node less than the maximum distance in the
system away from the node. Those nodes are skipped for efficiency reasons,
there is no real policy reason to do so.
Placement policy on directly connected NUMA systems is not affected.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Preparatory patch for adding NUMA placement on systems with
complex NUMA topology. Also fix a potential divide by zero
in group_weight()
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Smaller NUMA systems tend to have all NUMA nodes directly connected
to each other. This includes the degenerate case of a system with just
one node, ie. a non-NUMA system.
Larger systems can have two kinds of NUMA topology, which affects how
tasks and memory should be placed on the system.
On glueless mesh systems, nodes that are not directly connected to
each other will bounce traffic through intermediary nodes. Task groups
can be run closer to each other by moving tasks from a node to an
intermediary node between it and the task's preferred node.
On NUMA systems with backplane controllers, the intermediary hops
are incapable of running programs. This creates "islands" of nodes
that are at an equal distance to anywhere else in the system.
Each kind of topology requires a slightly different placement
algorithm; this patch provides the mechanism to detect the kind
of NUMA topology of a system.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
[ Changed to use kernel/sched/sched.h ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Link: http://lkml.kernel.org/r/1413530994-9732-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Export some information that is necessary to do placement of
tasks on systems with multi-level NUMA topologies.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: mgorman@suse.de
Cc: chegu_vinod@hp.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413530994-9732-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
1) switched_to_dl() check is wrong. We reschedule only
if rq->curr is deadline task, and we do not reschedule
if it's a lower priority task. But we must always
preempt a task of other classes.
2) dl_task_timer():
Policy does not change in case of priority inheritance.
rt_mutex_setprio() changes prio, while policy remains old.
So we lose some balancing logic in dl_task_timer() and
switched_to_dl() when we check policy instead of priority. Boosted
task may be rq->curr.
(I didn't change switched_from_dl() because no check is necessary
there at all).
I've looked at this place(switched_to_dl) several times and even fixed
this function, but found just now... I suppose some performance tests
may work better after this.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
preempt_schedule_context() does preempt_enable_notrace() at the end
and this can call the same function again; exception_exit() is heavy
and it is quite possible that need-resched is true again.
1. Change this code to dec preempt_count() and check need_resched()
by hand.
2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
the enable/disable dance around __schedule(). But in this case
we need to move into sched/core.c.
3. Cosmetic, but x86 forgets to declare this function. This doesn't
really matter because it is only called by asm helpers, still it
make sense to add the declaration into asm/preempt.h to match
preempt_schedule().
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.
This bash command reproduces problem:
$ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done
divide error: 0000 [#1] SMP
Modules linked in:
CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
RIP: 0010:[<ffffffff81074191>] [<ffffffff81074191>] task_scan_min+0x21/0x50
RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246
RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
Stack:
ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
Call Trace:
[<ffffffff810741d1>] task_scan_max+0x11/0x40
[<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
[<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
[<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
[<ffffffff8103e2f1>] __do_page_fault+0x191/0x510
[<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
[<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
[<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
[<ffffffff8104887d>] ? do_fork+0x14d/0x340
[<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
[<ffffffff811799df>] ? __fd_install+0x1f/0x60
[<ffffffff8103e67c>] do_page_fault+0xc/0x10
[<ffffffff8150d322>] page_fault+0x22/0x30
RIP [<ffffffff81074191>] task_scan_min+0x21/0x50
RSP <ffff880037a6bce0>
---[ end trace 9a826d16936c04de ]---
Also fix race in task_scan_min (it depends on compiler behaviour).
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dario Faggioli <raistlin@linux.it>
Cc: David Rientjes <rientjes@google.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
While offling node by hot removing memory, the following divide error
occurs:
divide error: 0000 [#1] SMP
[...]
Call Trace:
[...] handle_mm_fault
[...] ? try_to_wake_up
[...] ? wake_up_state
[...] __do_page_fault
[...] ? do_futex
[...] ? put_prev_entity
[...] ? __switch_to
[...] do_page_fault
[...] page_fault
[...]
RIP [<ffffffff810a7081>] task_numa_fault
RSP <ffff88084eb2bcb0>
The issue occurs as follows:
1. When page fault occurs and page is allocated from node 1,
task_struct->numa_faults_buffer_memory[] of node 1 is
incremented and p->numa_faults_locality[] is also incremented
as follows:
o numa_faults_buffer_memory[] o numa_faults_locality[]
NR_NUMA_HINT_FAULT_TYPES
| 0 | 1 |
---------------------------------- ----------------------
node 0 | 0 | 0 | remote | 0 |
node 1 | 0 | 1 | locale | 1 |
---------------------------------- ----------------------
2. node 1 is offlined by hot removing memory.
3. When page fault occurs, fault_types[] is calculated by using
p->numa_faults_buffer_memory[] of all online nodes in
task_numa_placement(). But node 1 was offline by step 2. So
the fault_types[] is calculated by using only
p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
are set to 0.
4. The values(0) of fault_types[] pass to update_task_scan_period().
5. numa_faults_locality[1] is set to 1. So the following division is
calculated.
static void update_task_scan_period(struct task_struct *p,
unsigned long shared, unsigned long private){
...
ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
}
6. But both of private and shared are set to 0. So divide error
occurs here.
The divide error is rare case because the trigger is node offline.
This patch always increments denominator for avoiding divide error.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If curr task is exiting this may be a reason of use-after-free:
task_numa_compare() do_exit()
... current->flags |= PF_EXITING;
... release_task()
... ~~delayed_put_task_struct()~~
... schedule()
rcu_read_lock() ...
cur = ACCESS_ONCE(dst_rq->curr) ...
... rq->curr = next;
... context_switch()
... finish_task_switch()
... put_task_struct()
... __put_task_struct()
... free_task_struct()
task_numa_assign() ...
get_task_struct() ...
As noted by Oleg:
<<The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on a rcu
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_pid() can not drop the (potentially) last reference
until rcu_read_unlock().
And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp>>
The patch provides simple check of PF_EXITING flag. If it's not set,
this guarantees that call_rcu() of delayed_put_task_struct() callback
hasn't happened yet, so we can safely do get_task_struct() in
task_numa_assign().
Locked dst_rq->lock protects from concurrency with the last schedule().
Reusing or unmapping of cur's memory may happen without it.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|