summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2011-01-31watchdog: Don't change watchdog state on read of sysctlMarcin Slusarz
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> [ add {}'s to fix a warning ] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-3-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-31watchdog: Fix sysctl consistencyMarcin Slusarz
If it was not possible to enable watchdog for any cpu, switch watchdog_enabled back to 0, because it's visible via kernel.watchdog sysctl. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-2-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-31watchdog: Fix broken nowatchdog logicMarcin Slusarz
Passing nowatchdog to kernel disables 2 things: creation of watchdog threads AND initialization of percpu watchdog_hrtimer. As hrtimers are initialized only at boot it's not possible to enable watchdog later - for me all watchdog threads started to eat 100% of CPU time, but they could just crash. Additionally, even if these threads would start properly, watchdog_disable_all_cpus was guarded by no_watchdog check, so you couldn't disable watchdog. To fix this, remove no_watchdog variable and use already existing watchdog_enabled variable. Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> [ removed another no_watchdog instance ] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: <stable@kernel.org> LKML-Reference: <1296230433-6261-1-git-send-email-dzickus@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-31genirq: Fix affinity notifier falloutThomas Gleixner
The new code of commit cd7eab44e(genirq: Add IRQ affinity notifiers) references irq_desc.affinity which fails to compile with CONFIG_GENERIC_HARDIRQS_NO_DEPRECATED=y. Use irq_desc.irq_data.affinity instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ben Hutchings <bhutchings@solarflare.com>
2011-01-31Fix prlimit64 for suid/sgid processesKacper Kornet
Since check_prlimit_permission always fails in the case of SUID/GUID processes, such processes are not able to read or set their own limits. This commit changes this by assuming that process can always read/change its own limits. Signed-off-by: Kacper Kornet <kornet@camk.edu.pl> Acked-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-27rtmutex: Simplify PI algorithm and make highest prio task get lockLai Jiangshan
In current rtmutex, the pending owner may be boosted by the tasks in the rtmutex's waitlist when the pending owner is deboosted or a task in the waitlist is boosted. This boosting is unrelated, because the pending owner does not really take the rtmutex. It is not reasonable. Example. time1: A(high prio) onwers the rtmutex. B(mid prio) and C (low prio) in the waitlist. time2 A release the lock, B becomes the pending owner A(or other high prio task) continues to run. B's prio is lower than A, so B is just queued at the runqueue. time3 A or other high prio task sleeps, but we have passed some time The B and C's prio are changed in the period (time2 ~ time3) due to boosting or deboosting. Now C has the priority higher than B. ***Is it reasonable that C has to boost B and help B to get the rtmutex? NO!! I think, it is unrelated/unneed boosting before B really owns the rtmutex. We should give C a chance to beat B and win the rtmutex. This is the motivation of this patch. This patch *ensures* only the top waiter or higher priority task can take the lock. How? 1) we don't dequeue the top waiter when unlock, if the top waiter is changed, the old top waiter will fail and go to sleep again. 2) when requiring lock, it will get the lock when the lock is not taken and: there is no waiter OR higher priority than waiters OR it is top waiter. 3) In any time, the top waiter is changed, the top waiter will be woken up. The algorithm is much simpler than before, no pending owner, no boosting for pending owner. Other advantage of this patch: 1) The states of a rtmutex are reduced a half, easier to read the code. 2) the codes become shorter. 3) top waiter is not dequeued until it really take the lock: they will retain FIFO when it is stolen. Not advantage nor disadvantage 1) Even we may wakeup multiple waiters(any time when top waiter changed), we hardly cause "thundering herd", the number of wokenup task is likely 1 or very little. 2) two APIs are changed. rt_mutex_owner() will not return pending owner, it will return NULL when the top waiter is going to take the lock. rt_mutex_next_owner() always return the top waiter. will not return NULL if we have waiters because the top waiter is not dequeued. I have fixed the code that use these APIs. need updated after this patch is accepted 1) Document/* 2) the testcase scripts/rt-tester/t4-l2-pi-deboost.tst Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> LKML-Reference: <4D3012D5.4060709@cn.fujitsu.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-01-28Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Use rq->clock_task instead of rq->clock for correctly maintaining load averages sched: Fix/remove redundant cfs_rq checks sched: Fix sign under-flows in wake_affine
2011-01-27perf: Fix alloc_callchain_buffers()Eric Dumazet
Commit 927c7a9e92c4 ("perf: Fix race in callchains") introduced a mismatch in the sizing of struct callchain_cpus_entries. nr_cpu_ids must be used instead of num_possible_cpus(), or we might get out of bound memory accesses on some machines. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Miller <davem@davemloft.net> Cc: Stephane Eranian <eranian@google.com> CC: stable@kernel.org LKML-Reference: <1295980851.3588.351.camel@edumazet-laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-27sched: Avoid expensive initial update_cfs_load(), on UP tooPeter Zijlstra
Fix the build on UP. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> LKML-Reference: <20110122044852.102126037@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-27cred: Replace deprecated spinlock initializationThomas Gleixner
SPIN_LOCK_UNLOCK is deprecated. Use the lockdep capable variant instead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-01-27Merge commit 'v2.6.38-rc2' into core/lockingThomas Gleixner
Reason: Update to mainline before adding the locking cleanup Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-01-26rtmutex-tester: Remove BKL testsArnd Bergmann
The BKL is going away, no need to test it any more. I left the definitions of the test case numbers in, so that the other tests do not get renumbered. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrew Morton <akpm@linux-foundation.org> LKML-Reference: <1295993854-4971-19-git-send-email-arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-01-26sched: Fix switch_from_fair()Peter Zijlstra
When a task is taken out of the fair class we must ensure the vruntime is properly normalized because when we put it back in it will assume to be normalized. The case that goes wrong is when changing away from the fair class while sleeping. Sleeping tasks have non-normalized vruntime in order to make sleeper-fairness work. So treat the switch away from fair as a wakeup and preserve the relative vruntime. Also update sysrq-n to call the ->switch_{to,from} methods. Reported-by: Onkalo Samu <samu.p.onkalo@nokia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26sched: Simplify the idle scheduling classPeter Zijlstra
Since commit 48c5ccae88dcd (sched: Simplify cpu-hot-unplug task migration) this should no longer happen, so remove the code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26softirqs: Account ksoftirqd time as cpustat softirqVenkatesh Pallipadi
softirq time in ksoftirqd context is not accounted in ns granularity per cpu softirq stats, as we want that to be a part of ksoftirqd exec_runtime. Accounting them as softirq on /proc/stat separately. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-6-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26sched: Export ns irqtimes through /proc/statVenkatesh Pallipadi
CONFIG_IRQ_TIME_ACCOUNTING adds ns granularity irq time on each CPU. This info is already used in scheduler to do proper task chargeback (earlier patches). This patch retro-fits this ns granularity hardirq and softirq information to /proc/stat irq and softirq fields. The update is still done on timer tick, where we look at accumulated ns hardirq/softirq time and account the tick to user/system/irq/hardirq/guest accordingly. No new interface added. Earlier versions looked at adding this as new fields in some /proc files. This one seems to be the best in terms of impact to existing apps, even though it has somewhat more kernel code than earlier versions. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-5-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26sched: Refactor account_system_time separating id-updateVenkatesh Pallipadi
Refactor account_system_time, to separate out the logic of identifying the update needed and code that does actual update. This is used by following patch for IRQ_TIME_ACCOUNTING, which has different identification logic and same update logic. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-4-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26time: Add nsecs_to_cputime64 interface for asm-genericVenkatesh Pallipadi
Add nsecs_to_cputime64 interface. This is used in following patches that updates cpu irq stat based on ns granularity info in IRQ_TIME_ACCOUNTING. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-3-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26softirqs: Free up pf flag PF_KSOFTIRQDVenkatesh Pallipadi
Cleanup patch, freeing up PF_KSOFTIRQD and use per_cpu ksoftirqd pointer instead, as suggested by Eric Dumazet. Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292980144-28796-2-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26sched: Avoid expensive initial update_cfs_load()Paul Turner
Since cfs->{load_stamp,load_last} are zero-initalized the initial load update will consider the delta to be 'since the beginning of time'. This results in a lot of pointless divisions to bring this large period to be within the sysctl_sched_shares_window. Fix this by initializing load_stamp to be 1 at cfs_rq initialization, this allows for an initial load_stamp > load_last which then lets standard idle truncation proceed. We avoid spinning (and slightly improve consistency) by fixing delta to be [period - 1] in this path resulting in a slightly more predictable shares ramp. (Previously the amount of idle time preserved by the overflow would range between [period/2,period-1].) Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044852.102126037@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26sched: Simplify update_cfs_shares parametersPaul Turner
Re-visiting this: Since update_cfs_shares will now only ever re-weight an entity that is a relative parent of the current entity in enqueue_entity; we can safely issue the account_entity_enqueue relative to that cfs_rq and avoid the requirement for special handling of the enqueue case in update_cfs_shares. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044851.915214637@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26sched: Use rq->clock_task instead of rq->clock for correctly maintaining ↵Paul Turner
load averages The delta in clock_task is a more fair attribution of how much time a tg has been contributing load to the current cpu. While not really important it also means we're more in sync (by magnitude) with respect to periodic updates (since __update_curr deltas are clock_task based). Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044852.007092349@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26sched: Fix/remove redundant cfs_rq checksPaul Turner
Since updates are against an entity's queuing cfs_rq it's not possible to enter update_cfs_{shares,load} with a NULL cfs_rq. (Indeed, update_cfs_load would crash prior to the check if we did anyway since we load is examined during the initializers). Also, in the update_cfs_load case there's no point in maintaining averages for rq->cfs_rq since we don't perform shares distribution at that level -- NULL check is replaced accordingly. Thanks to Dan Carpenter for pointing out the deference before NULL check. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044851.825284940@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26sched: Fix sign under-flows in wake_affinePaul Turner
While care is taken around the zero-point in effective_load to not exceed the instantaneous rq->weight, it's still possible (e.g. using wake_idx != 0) for (load + effective_load) to underflow. In this case the comparing the unsigned values can result in incorrect balanced decisions. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20110122044851.734245014@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: Input: wacom - pass touch resolution to clients through input_absinfo Input: wacom - add 2 Bamboo Pen and touch models Input: sysrq - ensure sysrq_enabled and __sysrq_enabled are consistent Input: sparse-keymap - fix KEY_VSW handling in sparse_keymap_setup Input: tegra-kbc - add tegra keyboard driver Input: gpio_keys - switch to using request_any_context_irq Input: serio - allow registered drivers to get status flag Input: ct82710c - return proper error code for ct82c710_open Input: bu21013_ts - added regulator support Input: bu21013_ts - remove duplicate resolution parameters Input: tnetv107x-ts - don't treat NULL clk as an error Input: tnetv107x-keypad - don't treat NULL clk as an error Fix up trivial conflicts in drivers/input/keyboard/Makefile due to additions of tc3589x/Tegra drivers
2011-01-26console: rename acquire/release_console_sem() to console_lock/unlock()Torben Hohn
The -rt patches change the console_semaphore to console_mutex. As a result, a quite large chunk of the patches changes all acquire/release_console_sem() to acquire/release_console_mutex() This commit makes things use more neutral function names which dont make implications about the underlying lock. The only real change is the return value of console_trylock which is inverted from try_acquire_console_sem() This patch also paves the way to switching console_sem from a semaphore to a mutex. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: make console_trylock return 1 on success, per Geert] Signed-off-by: Torben Hohn <torbenh@gmx.de> Cc: Thomas Gleixner <tglx@tglx.de> Cc: Greg KH <gregkh@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-24Merge branch 'irq/numa' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
2011-01-24Merge branch 'master' of ↵David S. Miller
master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: net/sched/sch_hfsc.c net/sched/sch_htb.c net/sched/sch_tbf.c
2011-01-25Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Fix time function double declaration with glibc perf tools: Fix build by checking if extra warnings are supported perf tools: Fix build when using gcc 3.4.6 perf tools: Add missing header, fixes build perf tools: Fix 64 bit integer format strings perf test: Fix build on older glibcs perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/ perf test: Use cpu_map->[cpu] when setting affinity perf symbols: Fix annotation of thumb code perf: Annotate cpuctx->ctx.mutex to avoid a lockdep splat powerpc, perf: Fix frequency calculation for overflowing counters (FSL version) perf: Fix perf_event_init_task()/perf_event_free_task() interaction perf: Fix find_get_context() vs perf_event_exit_task() race
2011-01-25Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: RTC: Remove Kconfig symbol for UIE emulation RTC: Properly handle rtc_read_alarm error propagation and fix bug RTC: Propagate error handling via rtc_timer_enqueue properly acpi_pm: Clear pmtmr_ioport if acpi_pm initialization fails rtc: Cleanup removed UIE emulation declaration hrtimers: Notify hrtimer users of switches to NOHZ mode
2011-01-25Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix poor interactivity on UP systems due to group scheduler nice tune bug
2011-01-24Input: sysrq - ensure sysrq_enabled and __sysrq_enabled are consistentAndy Whitcroft
Currently sysrq_enabled and __sysrq_enabled are initialised separately and inconsistently, leading to sysrq being actually enabled by reported as not enabled in sysfs. The first change to the sysfs configurable synchronises these two: static int __read_mostly sysrq_enabled = 1; static int __sysrq_enabled; Add a common define to carry the default for these preventing them becoming out of sync again. Default this to 1 to mirror previous behaviour. Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: stable@kernel.org Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
2011-01-24sched: Fix poor interactivity on UP systems due to group scheduler nice tune bugYong Zhang
Michael Witten and Christian Kujau reported that the autogroup scheduling feature hurts interactivity on their UP systems. It turns out that this is an older bug in the group scheduling code, and the wider appeal provided by the autogroup feature exposed it more prominently. When on UP with FAIR_GROUP_SCHED enabled, tune shares only affect tg->shares, but is not reflected in tg->se->load. The reason is that update_cfs_shares() does nothing on UP. So introduce update_cfs_shares() for UP && FAIR_GROUP_SCHED. This issue was found when enable autogroup scheduling was enabled, but it is an older bug that also exists on cgroup.cpu on UP. Reported-and-Tested-by: Michael Witten <mfwitten@gmail.com> Reported-and-Tested-by: Christian Kujau <christian@nerdbynature.de> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Mike Galbraith <efault@gmx.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110124073352.GA24186@windriver.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-24module: show version information for built-in modules in sysfsDmitry Torokhov
Currently only drivers that are built as modules have their versions shown in /sys/module/<module_name>/version, but this information might also be useful for built-in drivers as well. This especially important for drivers that do not define any parameters - such drivers, if built-in, are completely invisible from userspace. This patch changes MODULE_VERSION() macro so that in case when we are compiling built-in module, version information is stored in a separate section. Kernel then uses this data to create 'version' sysfs attribute in the same fashion it creates attributes for module parameters. Signed-off-by: Dmitry Torokhov <dtor@vmware.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2011-01-22genirq: Add IRQ affinity notifiersBen Hutchings
When initiating I/O on a multiqueue and multi-IRQ device, we may want to select a queue for which the response will be handled on the same or a nearby CPU. This requires a reverse-map of IRQ affinity. Add a notification mechanism to support this. This is based closely on work by Thomas Gleixner <tglx@linutronix.de>. Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Cc: linux-net-drivers@solarflare.com Cc: Tom Herbert <therbert@google.com> Cc: David Miller <davem@davemloft.net> LKML-Reference: <1295470904.11126.84.camel@bwh-desktop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-01-21Merge branch 'fixes-2.6.38' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'fixes-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: note the nested NOT_RUNNING test in worker_clr_flags() isn't a noop workqueue: relax lockdep annotation on flush_work()
2011-01-21perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/Oleg Nesterov
In theory, almost every user of task->child->perf_event_ctxp[] is wrong. find_get_context() can install the new context at any moment, we need read_barrier_depends(). dbe08d82ce3967ccdf459f7951d02589cf967300 "perf: Fix find_get_context() vs perf_event_exit_task() race" added rcu_dereference() into perf_event_exit_task_context() to make the precedent, but this makes __rcu_dereference_check() unhappy. Use rcu_dereference_raw() to shut up the warning. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: acme@redhat.com Cc: paulus@samba.org Cc: stern@rowland.harvard.edu Cc: a.p.zijlstra@chello.nl Cc: fweisbec@gmail.com Cc: roland@redhat.com Cc: prasad@linux.vnet.ibm.com Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> LKML-Reference: <20110121174547.GA8796@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-21perf: Annotate cpuctx->ctx.mutex to avoid a lockdep splatPeter Zijlstra
Lockdep spotted: loop_1b_instruc/1899 is trying to acquire lock: (event_mutex){+.+.+.}, at: [<ffffffff810e1908>] perf_trace_init+0x3b/0x2f7 but task is already holding lock: (&ctx->mutex){+.+.+.}, at: [<ffffffff810eb45b>] perf_event_init_context+0xc0/0x218 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&ctx->mutex){+.+.+.}: -> #2 (cpu_hotplug.lock){+.+.+.}: -> #1 (module_mutex){+.+...}: -> #0 (event_mutex){+.+.+.}: But because the deadlock would be cpuhotplug (cpu-event) vs fork (task-event) it cannot, in fact, happen. We can annotate this by giving the perf_event_context used for the cpuctx a different lock class from those used by tasks. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-21genirq: Remove __do_IRQThomas Gleixner
All architectures are finally converted. Remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Mike Frysinger <vapier@gentoo.org> Cc: David Howells <dhowells@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Michal Simek <monstr@monstr.eu> Acked-by: David Howells <dhowells@redhat.com> Cc: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jeff Dike <jdike@addtoit.com>
2011-01-20Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: smp: Allow on_each_cpu() to be called while early_boot_irqs_disabled status to init/main.c lockdep: Move early boot local IRQ enable/disable status to init/main.c
2011-01-20Merge branch 'akpm'Linus Torvalds
* akpm: kernel/smp.c: consolidate writes in smp_call_function_interrupt() kernel/smp.c: fix smp_call_function_many() SMP race memcg: correctly order reading PCG_USED and pc->mem_cgroup backlight: fix 88pm860x_bl macro collision drivers/leds/ledtrig-gpio.c: make output match input, tighten input checking MAINTAINERS: update Atmel AT91 entry mm: fix truncate_setsize() comment memcg: fix rmdir, force_empty with THP memcg: fix LRU accounting with THP memcg: fix USED bit handling at uncharge in THP memcg: modify accounting function for supporting THP better fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bio mm: compaction: prevent division-by-zero during user-requested compaction mm/vmscan.c: remove duplicate include of compaction.h memblock: fix memblock_is_region_memory() thp: keep highpte mapped until it is no longer needed kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT
2011-01-20kernel/smp.c: consolidate writes in smp_call_function_interrupt()Milton Miller
We have to test the cpu mask in the interrupt handler before checking the refs, otherwise we can start to follow an entry before its deleted and find it partially initailzed for the next trip. Presently we also clear the cpumask bit before executing the called function, which implies getting write access to the line. After the function is called we then decrement refs, and if they go to zero we then unlock the structure. However, this implies getting write access to the call function data before and after another the function is called. If we can assert that no smp_call_function execution function is allowed to enable interrupts, then we can move both writes to after the function is called, hopfully allowing both writes with one cache line bounce. On a 256 thread system with a kernel compiled for 1024 threads, the time to execute testcase in the "smp_call_function_many race" changelog was reduced by about 30-40ms out of about 545 ms. I decided to keep this as WARN because its now a buggy function, even though the stack trace is of no value -- a simple printk would give us the information needed. Raw data: Without patch: ipi_test startup took 1219366ns complete 539819014ns total 541038380ns ipi_test startup took 1695754ns complete 543439872ns total 545135626ns ipi_test startup took 7513568ns complete 539606362ns total 547119930ns ipi_test startup took 13304064ns complete 533898562ns total 547202626ns ipi_test startup took 8668192ns complete 544264074ns total 552932266ns ipi_test startup took 4977626ns complete 548862684ns total 553840310ns ipi_test startup took 2144486ns complete 541292318ns total 543436804ns ipi_test startup took 21245824ns complete 530280180ns total 551526004ns With patch: ipi_test startup took 5961748ns complete 500859628ns total 506821376ns ipi_test startup took 8975996ns complete 495098924ns total 504074920ns ipi_test startup took 19797750ns complete 492204740ns total 512002490ns ipi_test startup took 14824796ns complete 487495878ns total 502320674ns ipi_test startup took 11514882ns complete 494439372ns total 505954254ns ipi_test startup took 8288084ns complete 502570774ns total 510858858ns ipi_test startup took 6789954ns complete 493388112ns total 500178066ns #include <linux/module.h> #include <linux/init.h> #include <linux/sched.h> /* sched clock */ #define ITERATIONS 100 static void do_nothing_ipi(void *dummy) { } static void do_ipis(struct work_struct *dummy) { int i; for (i = 0; i < ITERATIONS; i++) smp_call_function(do_nothing_ipi, NULL, 1); printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id()); } static struct work_struct work[NR_CPUS]; static int __init testcase_init(void) { int cpu; u64 start, started, done; start = local_clock(); for_each_online_cpu(cpu) { INIT_WORK(&work[cpu], do_ipis); schedule_work_on(cpu, &work[cpu]); } started = local_clock(); for_each_online_cpu(cpu) flush_work(&work[cpu]); done = local_clock(); pr_info("ipi_test startup took %lldns complete %lldns total %lldns\n", started-start, done-started, done-start); return 0; } static void __exit testcase_exit(void) { } module_init(testcase_init) module_exit(testcase_exit) MODULE_LICENSE("GPL"); MODULE_AUTHOR("Anton Blanchard"); Signed-off-by: Milton Miller <miltonm@bga.com> Cc: Anton Blanchard <anton@samba.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20kernel/smp.c: fix smp_call_function_many() SMP raceAnton Blanchard
I noticed a failure where we hit the following WARN_ON in generic_smp_call_function_interrupt: if (!cpumask_test_and_clear_cpu(cpu, data->cpumask)) continue; data->csd.func(data->csd.info); refs = atomic_dec_return(&data->refs); WARN_ON(refs < 0); <------------------------- We atomically tested and cleared our bit in the cpumask, and yet the number of cpus left (ie refs) was 0. How can this be? It turns out commit 54fdade1c3332391948ec43530c02c4794a38172 ("generic-ipi: make struct call_function_data lockless") is at fault. It removes locking from smp_call_function_many and in doing so creates a rather complicated race. The problem comes about because: - The smp_call_function_many interrupt handler walks call_function.queue without any locking. - We reuse a percpu data structure in smp_call_function_many. - We do not wait for any RCU grace period before starting the next smp_call_function_many. Imagine a scenario where CPU A does two smp_call_functions back to back, and CPU B does an smp_call_function in between. We concentrate on how CPU C handles the calls: CPU A CPU B CPU C CPU D smp_call_function smp_call_function_interrupt walks call_function.queue sees data from CPU A on list smp_call_function smp_call_function_interrupt walks call_function.queue sees (stale) CPU A on list smp_call_function int clears last ref on A list_del_rcu, unlock smp_call_function reuses percpu *data A data->cpumask sees and clears bit in cpumask might be using old or new fn! decrements refs below 0 set data->refs (too late!) The important thing to note is since the interrupt handler walks a potentially stale call_function.queue without any locking, then another cpu can view the percpu *data structure at any time, even when the owner is in the process of initialising it. The following test case hits the WARN_ON 100% of the time on my PowerPC box (having 128 threads does help :) #include <linux/module.h> #include <linux/init.h> #define ITERATIONS 100 static void do_nothing_ipi(void *dummy) { } static void do_ipis(struct work_struct *dummy) { int i; for (i = 0; i < ITERATIONS; i++) smp_call_function(do_nothing_ipi, NULL, 1); printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id()); } static struct work_struct work[NR_CPUS]; static int __init testcase_init(void) { int cpu; for_each_online_cpu(cpu) { INIT_WORK(&work[cpu], do_ipis); schedule_work_on(cpu, &work[cpu]); } return 0; } static void __exit testcase_exit(void) { } module_init(testcase_init) module_exit(testcase_exit) MODULE_LICENSE("GPL"); MODULE_AUTHOR("Anton Blanchard"); I tried to fix it by ordering the read and the write of ->cpumask and ->refs. In doing so I missed a critical case but Paul McKenney was able to spot my bug thankfully :) To ensure we arent viewing previous iterations the interrupt handler needs to read ->refs then ->cpumask then ->refs _again_. Thanks to Milton Miller and Paul McKenney for helping to debug this issue. [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ] [miltonm@bga.com: remove excess tests] Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Milton Miller <miltonm@bga.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: <stable@kernel.org> [2.6.32+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched, cgroup: Use exit hook to avoid use-after-free crash sched: Fix signed unsigned comparison in check_preempt_tick() sched: Replace rq->bkl_count with rq->rq_sched_info.bkl_count sched, autogroup: Fix CONFIG_RT_GROUP_SCHED sched_setscheduler() failure sched: Display autogroup names in /proc/sched_debug sched: Reinstate group names in /proc/sched_debug sched: Update effective_load() to use global share weights
2011-01-20smp: Allow on_each_cpu() to be called while early_boot_irqs_disabled status ↵Tejun Heo
to init/main.c percpu may end up calling vfree() during early boot which in turn may call on_each_cpu() for TLB flushes. The function of on_each_cpu() can be done safely while IRQ is disabled during early boot but it assumed that the function is always called with local IRQ enabled which ended up enabling local IRQ prematurely during boot and triggering a couple of warnings. This patch updates on_each_cpu() and smp_call_function_many() such on_each_cpu() can be used safely while early_boot_irqs_disabled is set. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110120110713.GC6036@htj.dyndns.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Reported-by: Ingo Molnar <mingo@elte.hu>
2011-01-20lockdep: Move early boot local IRQ enable/disable status to init/main.cTejun Heo
During early boot, local IRQ is disabled until IRQ subsystem is properly initialized. During this time, no one should enable local IRQ and some operations which usually are not allowed with IRQ disabled, e.g. operations which might sleep or require communications with other processors, are allowed. lockdep tracked this with early_boot_irqs_off/on() callbacks. As other subsystems need this information too, move it to init/main.c and make it generally available. While at it, toggle the boolean to early_boot_irqs_disabled instead of enabled so that it can be initialized with %false and %true indicates the exceptional condition. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110120110635.GB6036@htj.dyndns.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-19Merge branch 'master' of /repos/git/net-next-2.6Patrick McHardy
2011-01-19hrtimers: Notify hrtimer users of switches to NOHZ modeStephen Boyd
When NOHZ=y and high res timers are disabled (via cmdline or Kconfig) tick_nohz_switch_to_nohz() will notify the user about switching into NOHZ mode. Nothing is printed for the case where HIGH_RES_TIMERS=y. Fix this for the HIGH_RES_TIMERS=y case by duplicating the printk from the low res NOHZ path in the high res NOHZ path. This confused me since I was thinking 'dmesg | grep -i NOHZ' would tell me if NOHZ was enabled, but if I have hrtimers there is nothing. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1295419594-13085-1-git-send-email-sboyd@codeaurora.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-19perf: Fix perf_event_init_task()/perf_event_free_task() interactionOleg Nesterov
perf_event_init_task() should clear child->perf_event_ctxp[] before anything else. Otherwise, if perf_event_init_context(perf_hw_context) fails, perf_event_free_task() can free perf_event_ctxp[perf_sw_context] copied from parent->perf_event_ctxp[] by dup_task_struct(). Also move the initialization of perf_event_mutex and perf_event_list from perf_event_init_context() to perf_event_init_context(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> LKML-Reference: <20110119182228.GC12183@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-19perf: Fix find_get_context() vs perf_event_exit_task() raceOleg Nesterov
find_get_context() must not install the new perf_event_context if the task has already passed perf_event_exit_task(). If nothing else, this means the memory leak. Initially ctx->refcount == 2, it is supposed that perf_event_exit_task_context() should participate and do the necessary put_ctx(). find_lively_task_by_vpid() checks PF_EXITING but this buys nothing, by the time we call find_get_context() this task can be already dead. To the point, cmpxchg() can succeed when the task has already done the last schedule(). Change find_get_context() to populate task->perf_event_ctxp[] under task->perf_event_mutex, this way we can trust PF_EXITING because perf_event_exit_task() takes the same mutex. Also, change perf_event_exit_task_context() to use rcu_dereference(). Probably this is not strictly needed, but with or without this change find_get_context() can race with setup_new_exec()->perf_event_exit_task(), rcu_dereference() looks better. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Prasad <prasad@linux.vnet.ibm.com> Cc: Roland McGrath <roland@redhat.com> LKML-Reference: <20110119182207.GB12183@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>