summaryrefslogtreecommitdiff
path: root/include/linux/sched.h
AgeCommit message (Collapse)Author
2011-05-25oom: replace PF_OOM_ORIGIN with toggling oom_score_adjDavid Rientjes
There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-23Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Increase SCHED_LOAD_SCALE resolution sched: Introduce SCHED_POWER_SCALE to scale cpu_power calculations sched: Cleanup set_load_weight()
2011-05-23Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Fix sample size bit operations perf tools: Fix ommitted mmap data update on remap watchdog: Change the default timeout and configure nmi watchdog period based on watchdog_thresh watchdog: Disable watchdog when thresh is zero watchdog: Only disable/enable watchdog if neccessary watchdog: Fix rounding bug in get_sample_period() perf tools: Propagate event parse error handling perf tools: Robustify dynamic sample content fetch perf tools: Pre-check sample size before parsing perf tools: Move evlist sample helpers to evlist area perf tools: Remove junk code in mmap size handling perf tools: Check we are able to read the event size on mmap
2011-05-23watchdog: Disable watchdog when thresh is zeroMandeep Singh Baines
This restores the previous behavior of softlock_thresh. Currently, setting watchdog_thresh to zero causes the watchdog kthreads to consume a lot of CPU. In addition, the logic of proc_dowatchdog_thresh and proc_dowatchdog_enabled has been factored into proc_dowatchdog. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1306127423-3347-3-git-send-email-msb@chromium.org Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <20110517071018.GE22305@elte.hu>
2011-05-20Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/miscLinus Torvalds
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (41 commits) signal: trivial, fix the "timespec declared inside parameter list" warning job control: reorganize wait_task_stopped() ptrace: fix signal->wait_chldexit usage in task_clear_group_stop_trapping() signal: sys_sigprocmask() needs retarget_shared_pending() signal: cleanup sys_sigprocmask() signal: rename signandsets() to sigandnsets() signal: do_sigtimedwait() needs retarget_shared_pending() signal: introduce do_sigtimedwait() to factor out compat/native code signal: sys_rt_sigtimedwait: simplify the timeout logic signal: cleanup sys_rt_sigprocmask() x86: signal: sys_rt_sigreturn() should use set_current_blocked() x86: signal: handle_signal() should use set_current_blocked() signal: sigprocmask() should do retarget_shared_pending() signal: sigprocmask: narrow the scope of ->siglock signal: retarget_shared_pending: optimize while_each_thread() loop signal: retarget_shared_pending: consider shared/unblocked signals only signal: introduce retarget_shared_pending() ptrace: ptrace_check_attach() should not do s/STOPPED/TRACED/ signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED signal: do_signal_stop: Remove the unneeded task_clear_group_stop_pending() ...
2011-05-20sched: Increase SCHED_LOAD_SCALE resolutionNikhil Rao
Introduce SCHED_LOAD_RESOLUTION, which scales is added to SCHED_LOAD_SHIFT and increases the resolution of SCHED_LOAD_SCALE. This patch sets the value of SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all sched entities by a factor of 1024. With this extra resolution, we can handle deeper cgroup hiearchies and the scheduler can do better shares distribution and load load balancing on larger systems (especially for low weight task groups). This does not change the existing user interface, the scaled weights are only used internally. We do not modify prio_to_weight values or inverses, but use the original weights when calculating the inverse which is used to scale execution time delta in calc_delta_mine(). This ensures we do not lose accuracy when accounting time to the sched entities. Thanks to Nikunj Dadhania for fixing an bug in c_d_m() that broken fairness. Below is some analysis of the performance costs/improvements of this patch. 1. Micro-arch performance costs: Experiment was to run Ingo's pipe_test_100k 200 times with the task pinned to one cpu. I measured instruction, cycles and stalled-cycles for the runs. See: http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389 for more info. -tip (baseline): Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs): 964,991,769 instructions # 0.82 insns per cycle # 0.33 stalled cycles per insn # ( +- 0.05% ) 1,171,186,635 cycles # 0.000 GHz ( +- 0.08% ) 306,373,664 stalled-cycles-backend # 26.16% backend cycles idle ( +- 0.28% ) 314,933,621 stalled-cycles-frontend # 26.89% frontend cycles idle ( +- 0.34% ) 1.122405684 seconds time elapsed ( +- 0.05% ) -tip+patches: Performance counter stats for './load-scale/pipe-test-100k' (200 runs): 963,624,821 instructions # 0.82 insns per cycle # 0.33 stalled cycles per insn # ( +- 0.04% ) 1,175,215,649 cycles # 0.000 GHz ( +- 0.08% ) 315,321,126 stalled-cycles-backend # 26.83% backend cycles idle ( +- 0.28% ) 316,835,873 stalled-cycles-frontend # 26.96% frontend cycles idle ( +- 0.29% ) 1.122238659 seconds time elapsed ( +- 0.06% ) With this patch, instructions decrease by ~0.10% and cycles increase by 0.27%. This doesn't look statistically significant. The number of stalled cycles in the backend increased from 26.16% to 26.83%. This can be attributed to the shifts we do in c_d_m() and other places. The fraction of stalled cycles in the frontend remains about the same, at 26.96% compared to 26.89% in -tip. 2. Balancing low-weight task groups Test setup: run 50 tasks with random sleep/busy times (biased around 100ms) in a low weight container (with cpu.shares = 2). Measure %idle as reported by mpstat over a 10s window. -tip (baseline): 06:47:48 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle intr/s 06:47:49 PM all 94.32 0.00 0.06 0.00 0.00 0.00 0.00 0.00 5.62 15888.00 06:47:50 PM all 94.57 0.00 0.62 0.00 0.00 0.00 0.00 0.00 4.81 16180.00 06:47:51 PM all 94.69 0.00 0.06 0.00 0.00 0.00 0.00 0.00 5.25 15966.00 06:47:52 PM all 95.81 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.19 16053.00 06:47:53 PM all 94.88 0.06 0.00 0.00 0.00 0.00 0.00 0.00 5.06 15984.00 06:47:54 PM all 93.31 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.69 15806.00 06:47:55 PM all 94.19 0.00 0.06 0.00 0.00 0.00 0.00 0.00 5.75 15896.00 06:47:56 PM all 92.87 0.00 0.00 0.00 0.00 0.00 0.00 0.00 7.13 15716.00 06:47:57 PM all 94.88 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.12 15982.00 06:47:58 PM all 95.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.56 16075.00 Average: all 94.49 0.01 0.08 0.00 0.00 0.00 0.00 0.00 5.42 15954.60 -tip+patches: 06:47:03 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle intr/s 06:47:04 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16630.00 06:47:05 PM all 99.69 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.31 16580.20 06:47:06 PM all 99.69 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.25 16596.00 06:47:07 PM all 99.20 0.00 0.74 0.00 0.00 0.06 0.00 0.00 0.00 17838.61 06:47:08 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16540.00 06:47:09 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16575.00 06:47:10 PM all 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16614.00 06:47:11 PM all 99.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 16588.00 06:47:12 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.00 16593.00 06:47:13 PM all 99.94 0.00 0.06 0.00 0.00 0.00 0.00 0.00 0.00 16551.00 Average: all 99.84 0.00 0.09 0.00 0.00 0.01 0.00 0.00 0.06 16711.58 We see an improvement in idle% on the system (drops from 5.42% on -tip to 0.06% with the patches). We see an improvement in idle% on the system (drops from 5.42% on -tip to 0.06% with the patches). Signed-off-by: Nikhil Rao <ncrao@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1305754668-18792-1-git-send-email-ncrao@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-20sched: Introduce SCHED_POWER_SCALE to scale cpu_power calculationsNikhil Rao
SCHED_LOAD_SCALE is used to increase nice resolution and to scale cpu_power calculations in the scheduler. This patch introduces SCHED_POWER_SCALE and converts all uses of SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE instead. This is a preparatory patch for increasing the resolution of SCHED_LOAD_SCALE, and there is no need to increase resolution for cpu_power calculations. Signed-off-by: Nikhil Rao <ncrao@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de> Cc: Mike Galbraith <efault@gmx.de> Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-12sched: Remove unused parameters from sched_fork() and wake_up_new_task()Samir Bellabes
sched_fork() and wake_up_new_task() are defined with a parameter 'unsigned long clone_flags', which is unused. This patch removes the parameters. Signed-off-by: Samir Bellabes <sam@synack.fr> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1305130685-1047-1-git-send-email-sam@synack.fr Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-05-12Merge commit 'v2.6.39-rc7' into sched/coreIngo Molnar
2011-04-25ptrace: Prepare to fix racy accesses on task breakpointsFrederic Weisbecker
When a task is traced and is in a stopped state, the tracer may execute a ptrace request to examine the tracee state and get its task struct. Right after, the tracee can be killed and thus its breakpoints released. This can happen concurrently when the tracer is in the middle of reading or modifying these breakpoints, leading to dereferencing a freed pointer. Hence, to prepare the fix, create a generic breakpoint reference holding API. When a reference on the breakpoints of a task is held, the breakpoints won't be released until the last reference is dropped. After that, no more ptrace request on the task's breakpoints can be serviced for the tracer. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Will Deacon <will.deacon@arm.com> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: v2.6.33.. <stable@kernel.org> Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com
2011-04-24sched: Get rid of lock_depthJonathan Corbet
Neil Brown pointed out that lock_depth somehow escaped the BKL removal work. Let's get rid of it now. Note that the perf scripting utilities still have a bunch of code for dealing with common_lock_depth in tracepoints; I have left that in place in case anybody wants to use that code with older kernels. Suggested-by: Neil Brown <neilb@suse.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110422111910.456c0e84@bike.lwn.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-21Merge commit 'v2.6.39-rc4' into sched/coreIngo Molnar
Merge reason: Pick up upstream fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-18Merge branch 'sched/locking' into sched/coreIngo Molnar
Merge reason: the rq locking changes are stable, propagate them into the .40 queue. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14brk: COMPAT_BRK: fix detection of randomized brkJiri Kosina
5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK") tried to get the whole logic of brk randomization for legacy (libc5-based) applications finally right. It turns out that the way to detect whether brk has actually been randomized in the end or not introduced by that patch still doesn't work for those binaries, as reported by Geert: : /sbin/init from my old m68k ramdisk exists prematurely. : : Before the patch: : : | brk(0x80005c8e) = 0x80006000 : : After the patch: : : | brk(0x80005c8e) = 0x80005c8e : : Old libc5 considers brk() to have failed if the return value is not : identical to the requested value. I don't like it, but currently see no better option than a bit flag in task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2 case. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-14sched: Move the second half of ttwu() to the remote cpuPeter Zijlstra
Now that we've removed the rq->lock requirement from the first part of ttwu() and can compute placement without holding any rq->lock, ensure we execute the second half of ttwu() on the actual cpu we want the task to run on. This avoids having to take rq->lock and doing the task enqueue remotely, saving lots on cacheline transfers. As measured using: http://oss.oracle.com/~mason/sembench.c $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i; done $ echo 4096 32000 64 128 > /proc/sys/kernel/sem $ ./sembench -t 2048 -w 1900 -o 0 unpatched: run time 30 seconds 647278 worker burns per second patched: run time 30 seconds 816715 worker burns per second Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152729.515897185@chello.nl
2011-04-14sched: Delay task_contributes_to_load()Peter Zijlstra
In prepratation of having to call task_contributes_to_load() without holding rq->lock, we need to store the result until we do and can update the rq accounting accordingly. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152729.151523907@chello.nl
2011-04-14sched: Remove rq argument to sched_class::task_waking()Peter Zijlstra
In preparation of calling this without rq->lock held, remove the dependency on the rq argument. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152729.071474242@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14sched: Drop the rq argument to sched_class::select_task_rq()Peter Zijlstra
In preparation of calling select_task_rq() without rq->lock held, drop the dependency on the rq argument. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152729.031077745@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14sched: Provide p->on_rqPeter Zijlstra
Provide a generic p->on_rq because the p->se.on_rq semantics are unfavourable for lockless wakeups but needed for sched_fair. In particular, p->on_rq is only cleared when we actually dequeue the task in schedule() and not on any random dequeue as done by things like __migrate_task() and __sched_setscheduler(). This also allows us to remove p->se usage from !sched_fair code. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152728.949545047@chello.nl
2011-04-14mutex: Use p->on_cpu for the adaptive spinPeter Zijlstra
Since we now have p->on_cpu unconditionally available, use it to re-implement mutex_spin_on_owner. Requested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152728.826338173@chello.nl
2011-04-14sched: Always provide p->on_cpuPeter Zijlstra
Always provide p->on_cpu so that we can determine if its on a cpu without having to lock the rq. Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110405152728.785452014@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-14sched: Provide scheduler_ipi() callback in response to smp_send_reschedule()Peter Zijlstra
For future rework of try_to_wake_up() we'd like to push part of that function onto the CPU the task is actually going to run on. In order to do so we need a generic callback from the existing scheduler IPI. This patch introduces such a generic callback: scheduler_ipi() and implements it as a NOP. BenH notes: PowerPC might use this IPI on offline CPUs under rare conditions! Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110405152728.744338123@chello.nl
2011-04-11sched: Dynamic sched_domain::levelPeter Zijlstra
Remove the SD_LV_ enum and use dynamic level assignments. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.969433965@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11sched: Remove some dead codePeter Zijlstra
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.553814623@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11sched: Reduce some allocation pressurePeter Zijlstra
Since we now allocate SD_LV_MAX * nr_cpu_ids sched_domain/sched_group structures when rebuilding the scheduler toplogy it might make sense to shrink that depending on the CONFIG_ options. This is only needed until we get rid of SD_LV_* alltogether and provide a full dynamic topology interface. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.406226449@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-11sched: Dynamically allocate sched_domain/sched_group data-structuresPeter Zijlstra
Instead of relying on static allocations for the sched_domain and sched_group trees, dynamically allocate and RCU free them. Allocating this dynamically also allows for some build_sched_groups() simplification since we can now (like with other simplifications) rely on the sched_domain tree instead of hard-coded knowledge. One tricky to note is that detach_destroy_domains() needs to hold rcu_read_lock() over the entire tear-down, per-cpu is not sufficient since that can lead to partial sched_group existance (could possibly be solved by doing the tear-down backwards but this is much more robust). A concequence of the above is that we can no longer print the sched_domain debug stuff from cpu_attach_domain() since that might now run with preemption disabled (due to classic RCU etc.) and sched_domain_debug() does some GFP_KERNEL allocations. Another thing to note is that we now fully rely on normal RCU and not RCU-sched, this is because with the new and exiting RCU flavours we grew over the years BH doesn't necessarily hold off RCU-sched grace periods (-rt is known to break this). This would in fact already cause us grief since we do sched_domain/sched_group iterations from softirq context. This patch is somewhat larger than I would like it to be, but I didn't find any means of shrinking/splitting this. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-07Merge branch 'ptrace' of ↵Oleg Nesterov
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into ptrace
2011-04-04signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUEDOleg Nesterov
This patch moves SIGNAL_STOP_DEQUEUED from signal_struct->flags to task_struct->group_stop, and thus makes it per-thread. Like SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can be false-positive after return from get_signal_to_deliver(), this is fine. The only purpose of this bit is: we can drop ->siglock after __dequeue_signal() returns the sig_kernel_stop() signal and before we call do_signal_stop(), in this case we must not miss SIGCONT if it comes in between. But, unlike SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can not be false-positive in do_signal_stop() if multiple threads dequeue the sig_kernel_stop() signal at the same time. Consider two threads T1 and T2, SIGTTIN has a hanlder. - T1 dequeues SIGTSTP and sets SIGNAL_STOP_DEQUEUED, then it drops ->siglock - SIGCONT comes and clears SIGNAL_STOP_DEQUEUED, SIGTSTP should be cancelled. - T2 dequeues SIGTTIN and sets SIGNAL_STOP_DEQUEUED again. Since we have a handler we should not stop, T2 returns to usermode to run the handler. - T1 continues, calls do_signal_stop() and wrongly starts the group stop because SIGNAL_STOP_DEQUEUED was restored in between. With or without this change: - we need to do something with ptrace_signal() which can return SIGSTOP, but this needs another discussion - SIGSTOP can be lost if it races with the mt exec, will be fixed later. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-25Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched, doc: Update sched-design-CFS.txt sched: Remove unused 'rq' variable and cpu_rq() call from alloc_fair_sched_group() sched.h: Fix a typo ("its") sched: Fix yield_to kernel-doc
2011-03-24Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}
2011-03-23memcg: convert uncharge batching from bytes to page granularityJohannes Weiner
We never uncharge subpage quantities. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23sched.h: Fix a typo ("its")Jonathan Neuschäfer
The sentence uses the possessive pronoun, which is spelled without an apostrophe. Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Cc: Jiri Kosina <trivial@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1300735487-2406-1-git-send-email-j.neuschaefer@gmx.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-03-23ptrace: Clean transitions between TASK_STOPPED and TRACEDTejun Heo
Currently, if the task is STOPPED on ptrace attach, it's left alone and the state is silently changed to TRACED on the next ptrace call. The behavior breaks the assumption that arch_ptrace_stop() is called before any task is poked by ptrace and is ugly in that a task manipulates the state of another task directly. With GROUP_STOP_PENDING, the transitions between TASK_STOPPED and TRACED can be made clean. The tracer can use the flag to tell the tracee to retry stop on attach and detach. On retry, the tracee will enter the desired state in the correct way. The lower 16bits of task->group_stop is used to remember the signal number which caused the last group stop. This is used while retrying for ptrace attach as the original group_exit_code could have been consumed with wait(2) by then. As the real parent may wait(2) and consume the group_exit_code anytime, the group_exit_code needs to be saved separately so that it can be used when switching from regular sleep to ptrace_stop(). This is recorded in the lower 16bits of task->group_stop. If a task is already stopped and there's no intervening SIGCONT, a ptrace request immediately following a successful PTRACE_ATTACH should always succeed even if the tracer doesn't wait(2) for attach completion; however, with this change, the tracee might still be TASK_RUNNING trying to enter TASK_TRACED which would cause the following request to fail with -ESRCH. This intermediate state is hidden from the ptracer by setting GROUP_STOP_TRAPPING on attach and making ptrace_check_attach() wait for it to clear on its signal->wait_chldexit. Completing the transition or getting killed clears TRAPPING and wakes up the tracer. Note that the STOPPED -> RUNNING -> TRACED transition is still visible to other threads which are in the same group as the ptracer and the reverse transition is visible to all. Please read the comments for details. Oleg: * Spotted a race condition where a task may retry group stop without proper bookkeeping. Fixed by redoing bookkeeping on retry. * Spotted that the transition is visible to userland in several different ways. Most are fixed with GROUP_STOP_TRAPPING. Unhandled corner case is documented. * Pointed out not setting GROUP_STOP_SIGMASK on an already stopped task would result in more consistent behavior. * Pointed out that calling ptrace_stop() from do_signal_stop() in TASK_STOPPED can race with group stop start logic and then confuse the TRAPPING wait in ptrace_check_attach(). ptrace_stop() is now called with TASK_RUNNING. * Suggested using signal->wait_chldexit instead of bit wait. * Spotted a race condition between TRACED transition and clearing of TRAPPING. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
2011-03-23signal: Use GROUP_STOP_PENDING to stop once for a single group stopTejun Heo
Currently task->signal->group_stop_count is used to decide whether to stop for group stop. However, if there is a task in the group which is taking a long time to stop, other tasks which are continued by ptrace would repeatedly stop for the same group stop until the group stop is complete. Conversely, if a ptraced task is in TASK_TRACED state, the debugger won't get notified of group stops which is inconsistent compared to the ptraced task in any other state. This patch introduces GROUP_STOP_PENDING which tracks whether a task is yet to stop for the group stop in progress. The flag is set when a group stop starts and cleared when the task stops the first time for the group stop, and consulted whenever whether the task should participate in a group stop needs to be determined. Note that now tasks in TASK_TRACED also participate in group stop. This results in the following behavior changes. * For a single group stop, a ptracer would see at most one stop reported. * A ptracee in TASK_TRACED now also participates in group stop and the tracer would get the notification. However, as a ptraced task could be in TASK_STOPPED state or any ptrace trap could consume group stop, the notification may still be missing. These will be addressed with further patches. * A ptracee may start a group stop while one is still in progress if the tracer let it continue with stop signal delivery. Group stop code handles this correctly. Oleg: * Spotted that a task might skip signal check even when its GROUP_STOP_PENDING is set. Fixed by updating recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of group_stop_count. * Pointed out that task->group_stop should be cleared whenever task->signal->group_stop_count is cleared. Fixed accordingly. * Pointed out the behavior inconsistency between TASK_TRACED and RUNNING and the last behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com>
2011-03-23signal: Fix premature completion of group stop when interfered by ptraceTejun Heo
task->signal->group_stop_count is used to track the progress of group stop. It's initialized to the number of tasks which need to stop for group stop to finish and each stopping or trapping task decrements. However, each task doesn't keep track of whether it decremented the counter or not and if woken up before the group stop is complete and stops again, it can decrement the counter multiple times. Please consider the following example code. static void *worker(void *arg) { while (1) ; return NULL; } int main(void) { pthread_t thread; pid_t pid; int i; pid = fork(); if (!pid) { for (i = 0; i < 5; i++) pthread_create(&thread, NULL, worker, NULL); while (1) ; return 0; } ptrace(PTRACE_ATTACH, pid, NULL, NULL); while (1) { waitid(P_PID, pid, NULL, WSTOPPED); ptrace(PTRACE_SINGLESTEP, pid, NULL, (void *)(long)SIGSTOP); } return 0; } The child creates five threads and the parent continuously traps the first thread and whenever the child gets a signal, SIGSTOP is delivered. If an external process sends SIGSTOP to the child, all other threads in the process should reliably stop. However, due to the above bug, the first thread will often end up consuming group_stop_count multiple times and SIGSTOP often ends up stopping none or part of the other four threads. This patch adds a new field task->group_stop which is protected by siglock and uses GROUP_STOP_CONSUME flag to track which task is still to consume group_stop_count to fix this bug. task_clear_group_stop_pending() and task_participate_group_stop() are added to help manipulating group stop states. As ptrace_stop() now also uses task_participate_group_stop(), it will set SIGNAL_STOP_STOPPED if it completes a group stop. There still are many issues regarding the interaction between group stop and ptrace. Patches to address them will follow. - Oleg spotted duplicate GROUP_STOP_CONSUME. Dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Roland McGrath <roland@redhat.com>
2011-03-22kthread: NUMA aware kthread_create_on_node()Eric Dumazet
All kthreads being created from a single helper task, they all use memory from a single node for their kernel stack and task struct. This patch suite creates kthread_create_on_node(), adding a 'cpu' parameter to parameters already used by kthread_create(). This parameter serves in allocating memory for the new kthread on its memory node if possible. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: David Howells <dhowells@redhat.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-15Merge branch 'timers-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (62 commits) posix-clocks: Check write permissions in posix syscalls hrtimer: Remove empty hrtimer_init_hres_timer() hrtimer: Update hrtimer->state documentation hrtimer: Update base[CLOCK_BOOTTIME].offset correctly timers: Export CLOCK_BOOTTIME via the posix timers interface timers: Add CLOCK_BOOTTIME hrtimer base time: Extend get_xtime_and_monotonic_offset() to also return sleep time: Introduce get_monotonic_boottime and ktime_get_boottime hrtimers: extend hrtimer base code to handle more then 2 clockids ntp: Remove redundant and incorrect parameter check mn10300: Switch do_timer() to xtimer_update() posix clocks: Introduce dynamic clocks posix-timers: Cleanup namespace posix-timers: Add support for fd based clocks x86: Add clock_adjtime for x86 posix-timers: Introduce a syscall for clock tuning. time: Splitout compat timex accessors ntp: Add ADJ_SETOFFSET mode bit time: Introduce timekeeping_inject_offset posix-timer: Update comment ... Fix up new system-call-related conflicts in arch/x86/ia32/ia32entry.S arch/x86/include/asm/unistd_32.h arch/x86/include/asm/unistd_64.h arch/x86/kernel/syscall_table_32.S (name_to_handle_at()/open_by_handle_at() vs clock_adjtime()), and some due to movement of get_jiffies_64() in: kernel/time.c
2011-03-15Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (26 commits) sched: Resched proper CPU on yield_to() sched: Allow users with sufficient RLIMIT_NICE to change from SCHED_IDLE policy sched: Allow SCHED_BATCH to preempt SCHED_IDLE tasks sched: Clean up the IRQ_TIME_ACCOUNTING code sched: Add #ifdef around irq time accounting functions sched, autogroup: Stop claiming ownership of the root task group sched, autogroup: Stop going ahead if autogroup is disabled sched, autogroup, sysctl: Use proc_dointvec_minmax() instead sched: Fix the group_imb logic sched: Clean up some f_b_g() comments sched: Clean up remnants of sd_idle sched: Wholesale removal of sd_idle logic sched: Add yield_to(task, preempt) functionality sched: Use a buddy to implement yield_task_fair() sched: Limit the scope of clear_buddies sched: Check the right ->nr_running in yield_task_fair() sched: Avoid expensive initial update_cfs_load(), on UP too sched: Fix switch_from_fair() sched: Simplify the idle scheduling class softirqs: Account ksoftirqd time as cpustat softirq ...
2011-03-10block: initial patch for on-stack per-task pluggingJens Axboe
This patch adds support for creating a queuing context outside of the queue itself. This enables us to batch up pieces of IO before grabbing the block device queue lock and submitting them to the IO scheduler. The context is created on the stack of the process and assigned in the task structure, so that we can auto-unplug it if we hit a schedule event. The current queue plugging happens implicitly if IO is submitted to an empty device, yet callers have to remember to unplug that IO when they are going to wait for it. This is an ugly API and has caused bugs in the past. Additionally, it requires hacks in the vm (->sync_page() callback) to handle that logic. By switching to an explicit plugging scheme we make the API a lot nicer and can get rid of the ->sync_page() hack in the vm. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-04Merge branch 'perf/urgent' into perf/coreIngo Molnar
Merge reason: Pick up updates before queueing up dependent patches. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-23Merge commit 'v2.6.38-rc6' into sched/coreIngo Molnar
Merge reason: Pick up the latest fixes before queueing up new changes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16workqueue, freezer: unify spelling of 'freeze' + 'able' to 'freezable'Tejun Heo
There are two spellings in use for 'freeze' + 'able' - 'freezable' and 'freezeable'. The former is the more prominent one. The latter is mostly used by workqueue and in a few other odd places. Unify the spelling to 'freezable'. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Dmitry Torokhov <dtor@mail.ru> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alex Dubov <oakad@yahoo.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Steven Whitehouse <swhiteho@redhat.com>
2011-02-03sched: Add yield_to(task, preempt) functionalityMike Galbraith
Currently only implemented for fair class tasks. Add a yield_to_task method() to the fair scheduling class. allowing the caller of yield_to() to accelerate another thread in it's thread group, task group. Implemented via a scheduler hint, using cfs_rq->next to encourage the target being selected. We can rely on pick_next_entity to keep things fair, so noone can accelerate a thread that has already used its fair share of CPU time. This also means callers should only call yield_to when they really mean it. Calling it too often can result in the scheduler just ignoring the hint. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-03sched: Use a buddy to implement yield_task_fair()Rik van Riel
Use the buddy mechanism to implement yield_task_fair. This allows us to skip onto the next highest priority se at every level in the CFS tree, unless doing so would introduce gross unfairness in CPU time distribution. We order the buddy selection in pick_next_entity to check yield first, then last, then next. We need next to be able to override yield, because it is possible for the "next" and "yield" task to be different processen in the same sub-tree of the CFS tree. When they are, we need to go into that sub-tree regardless of the "yield" hint, and pick the correct entity once we get to the right level. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-03perf: Cure task_oncpu_function_call() racesPeter Zijlstra
Oleg reported that on architectures with __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI from task_oncpu_function_call() can land before perf_event_task_sched_in() and cause interesting situations for eg. perf_install_in_context(). This patch reworks the task_oncpu_function_call() interface to give a more usable primitive as well as rework all its users to hopefully be more obvious as well as remove the races. While looking at the code I also found a number of races against perf_event_task_sched_out() which can flip contexts between tasks so plug those too. Reported-and-reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-31time: Make do_timer() and xtime_lock local to kernel/time/Torben Hohn
All callers of do_timer() are converted to xtime_update(). The only users of xtime_lock are in kernel/time/. Make both local to kernel/time/ and remove them from the global header files. [ tglx: Reuse tick-internal.h instead of creating another local header file. Massaged changelog ] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-01-31time: Provide xtime_update()Torben Hohn
xtime_update() takes xtime_lock write locked and calls do_timer(). Provided to replace the do_timer() calls in the architecture code. Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: johnstul@us.ibm.com Cc: yong.zhang0@gmail.com Cc: hch@infradead.org LKML-Reference: <20110127145910.23248.21379.stgit@localhost> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-01-26sched: Fix switch_from_fair()Peter Zijlstra
When a task is taken out of the fair class we must ensure the vruntime is properly normalized because when we put it back in it will assume to be normalized. The case that goes wrong is when changing away from the fair class while sleeping. Sleeping tasks have non-normalized vruntime in order to make sleeper-fairness work. So treat the switch away from fair as a wakeup and preserve the relative vruntime. Also update sysrq-n to call the ->switch_{to,from} methods. Reported-by: Onkalo Samu <samu.p.onkalo@nokia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26softirqs: Free up pf flag PF_KSOFTIRQDVenkatesh Pallipadi
Cleanup patch, freeing up PF_KSOFTIRQD and use per_cpu ksoftirqd pointer instead, as suggested by Eric Dumazet. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>