path: root/kernel
Age Commit message (Author)
2013-04-26 lockdep: Print out additional debugging advice when we hit lockdep BUGs (Dave Jones)
We occasionally get reports of these BUGs being hit, and the stack trace doesn't necessarily always tell us what we need to know about why we are hitting those limits. If users start attaching /proc/lock_stats to reports we may have more of a clue what's going on. Signed-off-by: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20130423163403.GA12839@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-25 clockevents: Set dummy handler on CPU_DEAD shutdown (Thomas Gleixner)
Vitaliy reported that a per cpu HPET timer interrupt crashes the system during hibernation. What happens is that the per cpu HPET timer gets shut down when the nonboot cpus are stopped. When the nonboot cpus are onlined again the HPET code sets up the MSI interrupt which fires before the clock event device is registered. The event handler is still set to hrtimer_interrupt, which then crashes the machine due to highres mode not being active. See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700333 There is no real good way to avoid that in the HPET code. The HPET code already has a mechanism to detect spurious interrupts when event handler == NULL for a similar reason. We can handle that in the clockevent/tick layer and replace the previous functional handler with a dummy handler like we do in tick_setup_new_device(). The original clockevents code did this in clockevents_exchange_device(), but that got removed by commit 7c1e76897 (clockevents: prevent clockevent event_handler ending up handler_noop) which forgot to fix it up in tick_shutdown(). Same issue with the broadcast device. Reported-by: Vitaliy Fillipov <vitalif@yourcmc.ru> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: stable@vger.kernel.org Cc: 700333@bugs.debian.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
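The idea of swapping in a dummy handler on shutdown can be illustrated with a small user-space sketch. This is not the kernel's code: `struct clock_dev`, `real_handler`, `noop_handler`, `dev_shutdown` and `dev_interrupt` are hypothetical names chosen for the illustration. It only shows why an interrupt that fires after shutdown becomes harmless once the handler pointer has been replaced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a clock event device; not the kernel's struct. */
struct clock_dev {
    void (*event_handler)(struct clock_dev *dev);
    int fired;   /* events processed by the functional handler */
    int ignored; /* late events absorbed by the dummy handler */
};

/* Functional handler: only safe while the device is fully set up. */
static void real_handler(struct clock_dev *dev)
{
    dev->fired++;
}

/* Dummy handler: does nothing harmful if an interrupt arrives late. */
static void noop_handler(struct clock_dev *dev)
{
    dev->ignored++;
}

/* On shutdown, replace the functional handler instead of leaving it in place. */
static void dev_shutdown(struct clock_dev *dev)
{
    dev->event_handler = noop_handler;
}

/* Simulated interrupt: always calls whatever handler is currently installed. */
static void dev_interrupt(struct clock_dev *dev)
{
    assert(dev->event_handler != NULL);
    dev->event_handler(dev);
}
```

In this sketch an interrupt delivered after `dev_shutdown()` increments `ignored` rather than running the functional handler in a state it was not designed for.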
2013-04-24 Merge branch 'linus' into timers/core (Thomas Gleixner)
Reason: Get upstream fixes before adding conflicting code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-24 nohz: Remove full dynticks' superfluous dependency on RCU tree (Frederic Weisbecker)
Remove the dependency on (TREE_RCU || TREE_PREEMPT_RCU). The full dynticks option already depends on SMP which implies (whatever flavour of) RCU tree config anyway. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-24 sched: Prevent to re-select dst-cpu in load_balance() (Joonsoo Kim)
Commit 88b8dac0 makes load_balance() consider other cpus in its group. However, that commit has no code to prevent re-selecting the same dst-cpu, so the same dst-cpu can be selected over and over. This patch adds functionality to load_balance() to exclude a cpu once it has been selected. We prevent re-selecting dst_cpu via env's cpus, so env's cpus is now a candidate set not only for src_cpus but also for dst_cpus. With this patch, we can remove lb_iterations and max_lb_iterations, because we decide whether we can go ahead or not via env's cpus. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Jason Low <jason.low2@hp.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366705662-3587-7-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
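A rough sketch of the "exclude a cpu once it has been selected" idea, using a plain 32-bit mask in place of the kernel's cpumask. The helper names `first_cpu` and `pick_dst_cpu` are hypothetical and the selection policy (lowest set bit) is an assumption made only for the example. The point is that clearing the chosen bit guarantees termination without a separate iteration counter.

```c
#include <stdint.h>

/* Return the index of the lowest set bit in mask, or -1 if mask is empty. */
static int first_cpu(uint32_t mask)
{
    for (int cpu = 0; cpu < 32; cpu++) {
        if (mask & (UINT32_C(1) << cpu))
            return cpu;
    }
    return -1;
}

/*
 * Pick a destination cpu from the candidate mask and remove it from the
 * mask, so that a later call can never return the same cpu again.
 * Returns -1 once every candidate has been tried.
 */
static int pick_dst_cpu(uint32_t *cpus)
{
    int cpu = first_cpu(*cpus);

    if (cpu >= 0)
        *cpus &= ~(UINT32_C(1) << cpu);
    return cpu;
}
```

Because each call shrinks the candidate set, a retry loop built on `pick_dst_cpu()` ends when the mask is empty, which is the role the message assigns to env's cpus after lb_iterations and max_lb_iterations are removed.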
2013-04-24 sched: Rename load_balance_tmpmask to load_balance_mask (Joonsoo Kim)
This name doesn't convey any specific meaning, so rename it to imply its purpose. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Jason Low <jason.low2@hp.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366705662-3587-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 sched: Move up affinity check to mitigate useless redoing overhead (Joonsoo Kim)
Currently, LBF_ALL_PINNED is cleared after the affinity check has passed. So, if task migration is skipped because of a small load value or a small imbalance value in move_tasks(), we don't clear LBF_ALL_PINNED, and we end up triggering 'redo' in load_balance(). The imbalance value is often so small that no task can be moved to other cpus and, of course, this situation may continue after we change the target cpu. So this patch moves up the affinity check code and clears LBF_ALL_PINNED before evaluating the load value in order to mitigate useless redoing overhead. In addition, re-order some comments correctly. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Jason Low <jason.low2@hp.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366705662-3587-5-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
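The ordering described above can be sketched in user-space C. The structures and the function `can_move_task` are simplified, hypothetical stand-ins for the scheduler's own types, the flag value is arbitrary, and the load test (`load > imbalance`) is a placeholder for the real heuristic. What the sketch does show is the order: affinity first, then clearing LBF_ALL_PINNED, then the load evaluation.

```c
#include <stdbool.h>

#define LBF_ALL_PINNED 0x01u /* illustrative value, assumed for this sketch */

/* Hypothetical, simplified stand-ins for a task and the balancing env. */
struct task {
    unsigned int allowed_cpus; /* bit n set: task may run on cpu n */
    unsigned int load;
};

struct lb_env {
    unsigned int dst_cpu;
    unsigned int imbalance;
    unsigned int flags;
};

/* Returns true if the task may be moved to env->dst_cpu. */
static bool can_move_task(struct lb_env *env, const struct task *t)
{
    /* 1. Affinity check first: a pinned task leaves LBF_ALL_PINNED set. */
    if (!(t->allowed_cpus & (1u << env->dst_cpu)))
        return false;

    /* 2. At least one task is allowed on dst_cpu, so not all are pinned. */
    env->flags &= ~LBF_ALL_PINNED;

    /* 3. Only now evaluate the load (placeholder test for the sketch). */
    if (t->load > env->imbalance)
        return false;

    return true;
}
```

With this order, a task that is skipped only because it is too heavy still clears the flag, so the caller has no reason to redo the balance with another cpu.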
2013-04-24 sched: Don't consider other cpus in our group in case of NEWLY_IDLE (Joonsoo Kim)
Commit 88b8dac0 makes load_balance() consider other cpus in its group, regardless of idle type. When we do NEWLY_IDLE balancing, we should not consider them, because the motivation for NEWLY_IDLE balancing is to turn this cpu to a non-idle state if needed. This is not the case for other cpus. So, change the code not to consider other cpus for NEWLY_IDLE balancing. With this patch, the assignment 'if (pulled_task) this_rq->idle_stamp = 0' in idle_balance() becomes correct: because NEWLY_IDLE balancing doesn't consider other cpus, assigning to 'this_rq->idle_stamp' is now valid. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Jason Low <jason.low2@hp.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366705662-3587-4-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 sched: Explicitly cpu_idle_type checking in rebalance_domains() (Joonsoo Kim)
After commit 88b8dac0, dst-cpu can be changed in load_balance(), so we can't know the cpu_idle_type of dst-cpu when load_balance() returns a positive value. So, add explicit cpu_idle_type checking. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Jason Low <jason.low2@hp.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366705662-3587-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-24 sched: Change position of resched_cpu() in load_balance() (Joonsoo Kim)
cur_ld_moved is reset if env.flags hit LBF_NEED_BREAK, so there is a possibility that we miss doing resched_cpu(). Correct this by moving resched_cpu() before the LBF_NEED_BREAK check. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Jason Low <jason.low2@hp.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366705662-3587-2-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-22 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (David S. Miller)
Conflicts: drivers/net/ethernet/emulex/benet/be_main.c drivers/net/ethernet/intel/igb/igb_main.c drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c include/net/scm.h net/batman-adv/routing.c net/ipv4/tcp_input.c The e{uid,gid} --> {uid,gid} credentials fix conflicted with the cleanup in net-next to now pass cred structs around. The be2net driver had a bug fix in 'net' that overlapped with the VLAN interface changes by Patrick McHardy in net-next. An IGB conflict existed because in 'net' the build_skb() support was reverted, and in 'net-next' there was a comment style fix within that code. Several batman-adv conflicts were resolved by making sure that all calls to batadv_is_my_mac() are changed to have a new bat_priv first argument. Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO rewrite in 'net-next', mostly overlapping changes. Thanks to Stephen Rothwell and Antonio Quartulli for help with several of these merge resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-22 nohz: Add basic tracing (Frederic Weisbecker)
It is not easy to find out why the full dynticks subsystem doesn't always stop the tick: whether this is due to kthreads, posix timers, perf events, etc... These new tracepoints are here to help the user diagnose the failures and test this feature. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 nohz: Select wide RCU nocb for full dynticks (Frederic Weisbecker)
It makes testing and implementation much easier as we know in advance that all CPUs are RCU nocbs. Also this prepares to remove the dynamic check for nohz_full= boot mask to be a subset of rcu_nocbs= Eventually this should also help removing the requirement for the boot CPU to be outside the full dynticks range. Suggested-by: Christoph Lameter <cl@linux.com> Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 nohz: Disable the tick when irq resume in full dynticks CPU (Frederic Weisbecker)
Eventually try to disable tick on irq exit, now that the fundamental infrastructure is in place. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 nohz: Re-evaluate the tick for the new task after a context switch (Frederic Weisbecker)
When a task is scheduled in, it may have some properties of its own that could make the CPU reconsider the need for the tick: posix cpu timers, perf events, ... So notify the full dynticks subsystem when a task gets scheduled in and re-check the tick dependency at this stage. This is done through a self IPI to avoid messing up with any current lock scenario. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 nohz: Prepare to stop the tick on irq exit (Frederic Weisbecker)
Interrupt exit is a natural place to stop the tick: it happens after all the events, before and during the irq, that are liable to update the dependency on the tick have occurred. Also it makes sure that any check on tick dependency is well ordered against dynticks kick IPIs. Bring in the infrastructure that performs the tick dependency checks on irq exit and shut the tick down if these checks show that we can do it safely. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 nohz: Implement full dynticks kick (Frederic Weisbecker)
Implement the full dynticks kick that is performed from IPIs sent by various subsystems (scheduler, posix timers, ...) when they want to notify about a new event that may require reconsidering the dependency on the tick. Most of the time, such an event ends up restarting the tick. (Part of the design with subsystems providing *_can_stop_tick() helpers suggested by Peter Zijlstra a while ago). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
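The design with per-subsystem *_can_stop_tick() helpers can be pictured as a set of predicates that must all agree before the tick is stopped. The sketch below is an assumption-laden illustration, not the kernel's implementation: `struct cpu_state` and every `sketch_*` function are hypothetical, and the three conditions are the ones named in the neighbouring commit messages (more than one task running, a posix CPU timer running, events in the perf rotation list).

```c
#include <stdbool.h>

/* Hypothetical snapshot of what each subsystem knows about one cpu. */
struct cpu_state {
    unsigned int nr_running;    /* tasks on this cpu's runqueue */
    bool posix_cpu_timer_armed; /* current task runs a posix CPU timer */
    bool perf_rotation_pending; /* events sit in the local rotation list */
};

/* Scheduler: the tick must stay alive to keep fairness between tasks. */
static bool sketch_sched_can_stop_tick(const struct cpu_state *s)
{
    return s->nr_running <= 1;
}

/* Posix CPU timers: a running timer needs the tick to be accounted. */
static bool sketch_posix_can_stop_tick(const struct cpu_state *s)
{
    return !s->posix_cpu_timer_armed;
}

/* Perf: pending rotation means the per-tick perf work must be serviced. */
static bool sketch_perf_can_stop_tick(const struct cpu_state *s)
{
    return !s->perf_rotation_pending;
}

/* The tick may stop only if no subsystem still depends on it. */
static bool sketch_can_stop_tick(const struct cpu_state *s)
{
    return sketch_sched_can_stop_tick(s) &&
           sketch_posix_can_stop_tick(s) &&
           sketch_perf_can_stop_tick(s);
}
```

A kick, in this picture, simply makes the target cpu run the combined check again after one of the inputs has changed.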
2013-04-22 timekeeping: Update tk->cycle_last in resume (Thomas Gleixner)
commit 7ec98e15aa (timekeeping: Delay update of clock->cycle_last) forgot to update tk->cycle_last in the resume path. This results in a stale value versus clock->cycle_last and prevents resume in the worst case. Reported-by: Jiri Slaby <jslaby@suse.cz> Reported-and-tested-by: Borislav Petkov <bp@alien8.de> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Linux-pm mailing list <linux-pm@lists.linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304211648150.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 nohz: Re-evaluate the tick from the scheduler IPI (Frederic Weisbecker)
The scheduler IPI is used by the scheduler to kick full dynticks CPUs asynchronously when more than one task is running or when a new timer list timer is enqueued. This way the destination CPU can decide to restart the tick to handle this new situation. Now let's call that kick in the scheduler IPI. (Reusing the scheduler IPI rather than implementing a new IPI was suggested by Peter Zijlstra a while ago) Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 sched: New helper to prevent from stopping the tick in full dynticks (Frederic Weisbecker)
Provide a new helper to be called from the full dynticks engine before stopping the tick in order to make sure we don't stop it when there is more than one task running on the CPU. This way we make sure that the tick stays alive to maintain fairness. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 sched: Kick full dynticks CPU that have more than one task enqueued. (Frederic Weisbecker)
Kick the tick on full dynticks CPUs when they get more than one task running on their queue. This makes sure that local fairness is maintained by the tick on the destination. This is done regardless of these tasks' class. We should be able to be more clever in the future depending on these. eg: a CPU that runs a SCHED_FIFO task doesn't need to maintain fairness against local pending tasks of the fair class. But keep things simple for now. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 perf: New helper to prevent full dynticks CPUs from stopping tick (Frederic Weisbecker)
Provide a new helper that prevents full dynticks CPUs from stopping their tick when there are events in the local rotation list. This way we make sure that perf_event_task_tick() is serviced on demand. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephane Eranian <eranian@google.com> Cc: Jiri Olsa <jolsa@redhat.com>
2013-04-22 perf: Kick full dynticks CPU if events rotation is needed (Frederic Weisbecker)
Kick the current CPU's tick by sending it a self IPI when an event is queued on the rotation list and it is the first element inserted. This makes sure that perf_event_task_tick() works on full dynticks CPUs. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephane Eranian <eranian@google.com> Cc: Jiri Olsa <jolsa@redhat.com>
2013-04-22 posix_timers: Fix pre-condition to stop the tick on full dynticks (Frederic Weisbecker)
The test that checks if a CPU can stop its tick from posix CPU timers angle was mistakenly inverted. What we want is to prevent the tick from being stopped as long as the current CPU's task runs a posix CPU timer. Fix this. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-22 kernel/hz.bc: ignore. (Rusty Russell)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-21 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)
Pull perf fixes from Ingo Molnar: "Misc fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix offcore_rsp valid mask for SNB/IVB
  perf: Treat attr.config as u64 in perf_swevent_init()
2013-04-21 sched: Fix wrong rq's runnable_avg update with rt tasks (Vincent Guittot)
The current update of the rq's load can be erroneous when RT tasks are involved. The update of the load of a rq that becomes idle, is done only if the avg_idle is less than sysctl_sched_migration_cost. If RT tasks and short idle duration alternate, the runnable_avg will not be updated correctly and the time will be accounted as idle time when a CFS task wakes up. A new idle_enter function is called when the next task is the idle function so the elapsed time will be accounted as run time in the load of the rq, whatever the average idle time is. The function update_rq_runnable_avg is removed from idle_balance. When a RT task is scheduled on an idle CPU, the update of the rq's load is not done when the rq exit idle state because CFS's functions are not called. Then, the idle_balance, which is called just before entering the idle function, updates the rq's load and makes the assumption that the elapsed time since the last update, was only running time. As a consequence, the rq's load of a CPU that only runs a periodic RT task, is close to LOAD_AVG_MAX whatever the running duration of the RT task is. A new idle_exit function is called when the prev task is the idle function so the elapsed time will be accounted as idle time in the rq's load. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: linaro-kernel@lists.linaro.org Cc: peterz@infradead.org Cc: pjt@google.com Cc: fweisbec@gmail.com Cc: efault@gmx.de Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 events: Protect access via task_subsys_state_check() (Paul E. McKenney)
The following RCU splat indicates lack of RCU protection:
[ 953.267649] ===============================
[ 953.267652] [ INFO: suspicious RCU usage. ]
[ 953.267657] 3.9.0-0.rc6.git2.4.fc19.ppc64p7 #1 Not tainted
[ 953.267661] -------------------------------
[ 953.267664] include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!
[ 953.267669]
[ 953.267669] other info that might help us debug this:
[ 953.267669]
[ 953.267675]
[ 953.267675] rcu_scheduler_active = 1, debug_locks = 0
[ 953.267680] 1 lock held by glxgears/1289:
[ 953.267683] #0: (&sig->cred_guard_mutex){+.+.+.}, at: [<c00000000027f884>] .prepare_bprm_creds+0x34/0xa0
[ 953.267700]
[ 953.267700] stack backtrace:
[ 953.267704] Call Trace:
[ 953.267709] [c0000001f0d1b6e0] [c000000000016e30] .show_stack+0x130/0x200 (unreliable)
[ 953.267717] [c0000001f0d1b7b0] [c0000000001267f8] .lockdep_rcu_suspicious+0x138/0x180
[ 953.267724] [c0000001f0d1b840] [c0000000001d43a4] .perf_event_comm+0x4c4/0x690
[ 953.267731] [c0000001f0d1b950] [c00000000027f6e4] .set_task_comm+0x84/0x1f0
[ 953.267737] [c0000001f0d1b9f0] [c000000000280414] .setup_new_exec+0x94/0x220
[ 953.267744] [c0000001f0d1ba70] [c0000000002f665c] .load_elf_binary+0x58c/0x19b0
...
This commit therefore adds the required RCU read-side critical section to perf_event_comm(). Reported-by: Adam Jackson <ajax@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: a.p.zijlstra@chello.nl Cc: paulus@samba.org Cc: acme@ghostprotocols.net Link: http://lkml.kernel.org/r/20130419190124.GA8638@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Gustavo Luiz Duarte <gusld@br.ibm.com>
2013-04-21 Merge branch 'timers/nohz-posix-timers-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/nohz (Ingo Molnar)
Pull posix cpu timers handling on full dynticks from Frederic Weisbecker. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-21 Merge branch 'perf/urgent' into perf/core (Ingo Molnar)
Conflicts: arch/x86/kernel/cpu/perf_event_intel.c Merge in the latest fixes before applying new patches, resolve the conflict. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-20 Merge branch 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (Linus Torvalds)
Pull kdump fixes from Peter Anvin: "The kexec/kdump people have found several problems with the support for loading over 4 GiB that was introduced in this merge cycle. This is partly due to a number of design problems inherent in the way the various pieces of kdump fit together (it is pretty horrifically manual in many places.) After a *lot* of iterations this is the patchset that was agreed upon, but of course it is now very late in the cycle. However, because it changes both the syntax and semantics of the crashkernel option, it would be desirable to avoid a stable release with the broken interfaces."
I'm not happy with the timing, since originally the plan was to release the final 3.9 tomorrow. But apparently I'm doing an -rc8 instead...
* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kexec: use Crash kernel for Crash kernel low
  x86, kdump: Change crashkernel_high/low= to crashkernel=,high/low
  x86, kdump: Retore crashkernel= to allocate under 896M
  x86, kdump: Set crashkernel_low automatically
2013-04-19 tracepoints: Prevent null probe from being added (Sahara)
Somehow tracepoint_entry_add_probe() function allows a null probe function. And, this may lead to unexpected results since the number of probe functions in an entry can be counted by checking whether a probe is null or not in the for-loop. This patch prevents a null probe from being added. In tracepoint_entry_remove_probe() function, checking probe parameter within the for-loop is moved out for code efficiency, leaving the null probe feature which removes all probe functions in the entry. Link: http://lkml.kernel.org/r/1365991995-19445-1-git-send-email-kpark3469@gmail.com Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Sahara <keun-o.park@windriver.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
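Why a null probe is a problem when probes are counted by scanning for NULL can be seen in a small sketch. This is not the tracepoint code: `struct entry`, `count_probes`, `add_probe`, `sample_probe` and the fixed `MAX_PROBES` capacity are hypothetical, and returning `-EINVAL` is an assumption made for the example.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_PROBES 8

typedef void (*probe_fn)(void);

/* Hypothetical entry holding a NULL-terminated array of probe functions. */
struct entry {
    probe_fn probes[MAX_PROBES + 1];
};

/* Example probe used only to exercise the sketch. */
static void sample_probe(void)
{
}

/* Count probes by scanning until the NULL terminator. */
static int count_probes(const struct entry *e)
{
    int n = 0;

    while (n < MAX_PROBES && e->probes[n] != NULL)
        n++;
    return n;
}

/*
 * Add a probe. A NULL probe is rejected up front: storing it would look
 * like the terminator and make count_probes() under-count the entry.
 */
static int add_probe(struct entry *e, probe_fn fn)
{
    int n;

    if (fn == NULL)
        return -EINVAL;

    n = count_probes(e);
    if (n == MAX_PROBES)
        return -ENOSPC;

    e->probes[n] = fn;
    e->probes[n + 1] = NULL;
    return 0;
}
```

Rejecting NULL at insertion keeps the count reliable, while a removal path remains free to treat a NULL argument as "remove everything", as the message describes.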
2013-04-19 posix_timers: New API to prevent from stopping the tick when timers are running (Frederic Weisbecker)
Bring a new helper that the full dynticks infrastructure can call in order to know if it can safely stop the tick from the posix cpu timers subsystem point of view. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 posix_timers: Kick full dynticks CPUs when a posix cpu timer is armed (Frederic Weisbecker)
Kick the full dynticks CPUs when a posix cpu timer is enqueued by way of a standard call to posix_cpu_timer_set() or set_process_cpu_timer(). This also includes rescheduled firing timers. This way they can re-evaluate the state of (and possibly restart) their tick against the new expiry modification. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 nohz: New option to default all CPUs in full dynticks range (Frederic Weisbecker)
Provide a new kernel config that defaults all CPUs to be part of the full dynticks range, except the boot one for timekeeping. This default setting is overridden by the nohz_full= boot option if passed by the user. This is helpful for those who don't need a fine-grained range of full dynticks CPUs and also for automated testing. Suggested-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19 nohz: Ensure full dynticks CPUs are RCU nocbs (Frederic Weisbecker)
We need full dynticks CPU to also be RCU nocb so that we don't have to keep the tick to handle RCU callbacks. Make sure the range passed to nohz_full= boot parameter is a subset of rcu_nocbs= The CPUs that fail to meet this requirement will be excluded from the nohz_full range. This is checked early in boot time, before any CPU has the opportunity to stop its tick. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
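The subset requirement amounts to a mask intersection. The following sketch uses a plain 64-bit mask instead of the kernel's cpumask, and `restrict_nohz_full` is a hypothetical helper name. It returns the CPUs that may stay in the full dynticks range and reports the ones that were excluded because they are not RCU nocb CPUs.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Keep only the CPUs of nohz_full that are also in rcu_nocbs.
 * If dropped is not NULL, it receives the CPUs that were excluded.
 */
static uint64_t restrict_nohz_full(uint64_t nohz_full, uint64_t rcu_nocbs,
                                   uint64_t *dropped)
{
    if (dropped != NULL)
        *dropped = nohz_full & ~rcu_nocbs;
    return nohz_full & rcu_nocbs;
}
```

For example, with nohz_full covering CPUs 1-3 and rcu_nocbs covering CPUs 1-2, CPU 3 is dropped and CPUs 1-2 remain.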
2013-04-19nohz: Force boot CPU outside full dynticks rangeFrederic Weisbecker
The timekeeping job must be able to run early on boot because there may be some pre-SMP (and thus pre-initcalls) components that rely on it. The IO-APIC is one such user, as it tests the timer health by watching jiffies progression. Given that it happens before we know the initial online set, we can't rely on it to select a timekeeper. We need one before SMP time, otherwise we simply crash on boot. To fix this and keep things simple for now, force the boot CPU outside of the full dynticks range in any case and do this early at kernel parameter parsing time. We might want a trickier solution later, especially for aSMP architectures that need to assign housekeeping tasks to arbitrary low power CPUs. But it's still first pass KISS time for now. Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-19mutex: Back out architecture specific check for negative mutex countWaiman Long
Linus suggested that probably all the supported architectures can allow a negative mutex count without incorrect behavior, so we can then back out the architecture specific change and allow the mutex count to go to any negative number. That should further reduce contention for non-x86 architectures. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chandramouleeswaran Aswin <aswin@hp.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Norton Scott J <scott.norton@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366226594-5506-5-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-19mutex: Queue mutex spinners with MCS lock to reduce cacheline contentionWaiman Long
The current mutex spinning code (with MUTEX_SPIN_ON_OWNER option turned on) allows multiple tasks to spin on a single mutex concurrently. A potential problem with the current approach is that when the mutex becomes available, all the spinning tasks will try to acquire the mutex more or less simultaneously. As a result, there will be a lot of cacheline bouncing, especially on systems with a large number of CPUs. This patch tries to reduce this kind of contention by putting the mutex spinners into a queue so that only the first one in the queue will try to acquire the mutex. This will reduce contention and allow all the tasks to move forward faster. The queuing of mutex spinners is done using an MCS lock based implementation, which reduces contention on the mutex cacheline further than a similar ticket spinlock based implementation would. This patch adds a new field to the mutex data structure for holding the MCS lock. This expands the mutex size by 8 bytes on 64-bit systems and 4 bytes on 32-bit systems. This overhead is avoided if the MUTEX_SPIN_ON_OWNER option is turned off. The following table shows the jobs per minute (JPM) scalability data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl command is used to restrict the running of the fserver workloads to 1/2/4/8 nodes with hyperthreading off. 
+-----------------+-----------+-----------+-------------+----------+
| Configuration   | Mean JPM  | Mean JPM  | Mean JPM    | % Change |
|                 | w/o patch | patch 1   | patches 1&2 | 1->1&2   |
+-----------------+------------------------------------------------+
|                 |             User Range 1100 - 2000             |
+-----------------+------------------------------------------------+
| 8 nodes, HT off |  227972   |  227237   |   305043    |  +34.2%  |
| 4 nodes, HT off |  393503   |  381558   |   394650    |  +3.4%   |
| 2 nodes, HT off |  334957   |  325240   |   338853    |  +4.2%   |
| 1 node , HT off |  198141   |  197972   |   198075    |  +0.1%   |
+-----------------+------------------------------------------------+
|                 |             User Range 200 - 1000              |
+-----------------+------------------------------------------------+
| 8 nodes, HT off |  282325   |  312870   |   332185    |  +6.2%   |
| 4 nodes, HT off |  390698   |  378279   |   393419    |  +4.0%   |
| 2 nodes, HT off |  336986   |  326543   |   340260    |  +4.2%   |
| 1 node , HT off |  197588   |  197622   |   197582    |   0.0%   |
+-----------------+-----------+-----------+-------------+----------+

At low user range 10-100, the JPM differences were within +/-1%. So they are not that interesting.

The fserver workload uses mutex spinning extensively. With just the mutex change in the first patch, there is no noticeable change in performance. Rather, there is a slight drop in performance. This mutex spinning patch more than recovers the lost performance and shows a significant increase of +30% at high user load with the full 8 nodes. Similar improvements were also seen in a 3.8 kernel.

The table below shows the %time spent by different kernel functions as reported by perf when running the fserver workload at 1500 users with all 8 nodes.
+-----------------------+-----------+---------+-------------+
| Function              | % time    | % time  | % time      |
|                       | w/o patch | patch 1 | patches 1&2 |
+-----------------------+-----------+---------+-------------+
| __read_lock_failed    |  34.96%   | 34.91%  |   29.14%    |
| __write_lock_failed   |  10.14%   | 10.68%  |    7.51%    |
| mutex_spin_on_owner   |   3.62%   |  3.42%  |    2.33%    |
| mspin_lock            |    N/A    |   N/A   |    9.90%    |
| __mutex_lock_slowpath |   1.46%   |  0.81%  |    0.14%    |
| _raw_spin_lock        |   2.25%   |  2.50%  |    1.10%    |
+-----------------------+-----------+---------+-------------+

The fserver workload for an 8-node system is dominated by the contention in the read/write lock. Mutex contention also plays a role. With the first patch only, mutex contention is down (as shown by the __mutex_lock_slowpath figure), which helps a little bit. We saw only a few percent improvement with that.

By applying patch 2 as well, the single mutex_spin_on_owner figure is now split out into an additional mspin_lock figure. The time increases from 3.42% to 11.23%. It shows a great reduction in contention among the spinners leading to a 30% improvement. The time ratio 9.9/2.33=4.3 indicates that there are on average 4+ spinners waiting in the spin_lock loop for each spinner in the mutex_spin_on_owner loop. Contention in other locking functions also goes down by quite a lot.

The table below shows the performance change of both patches 1 & 2 over patch 1 alone in other AIM7 workloads (at 8 nodes, hyperthreading off).
+--------------+---------------+----------------+-----------------+
| Workload     | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |     0.0%      |     -0.8%      |     +0.6%       |
| five_sec     |    -0.3%      |     +0.8%      |     +0.8%       |
| high_systime |    +0.4%      |     +2.4%      |     +2.1%       |
| new_fserver  |    +0.1%      |    +14.1%      |    +34.2%       |
| shared       |    -0.5%      |     -0.3%      |     -0.4%       |
| short        |    -1.7%      |     -9.8%      |     -8.3%       |
+--------------+---------------+----------------+-----------------+

The short workload is the only one that shows a decline in performance, probably due to the spinner locking and queuing overhead.

Signed-off-by: Waiman Long <Waiman.Long@hp.com> Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chandramouleeswaran Aswin <aswin@hp.com> Cc: Norton Scott J <scott.norton@hp.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366226594-5506-4-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
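The queuing idea in the commit above can be sketched as a textbook MCS lock written with C11 atomics. This is not the kernel's mspin_lock code; `struct mcs_node`, `mcs_tail`, `mcs_acquire()` and `mcs_release()` are hypothetical names used for illustration, and sequentially consistent atomics are assumed for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue node; each waiter spins only on its own 'locked' flag. */
struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;               /* set to true by the predecessor */
};

/* Tail pointer of the waiter queue; NULL means nobody is queued. */
typedef _Atomic(struct mcs_node *) mcs_tail;

static void mcs_acquire(mcs_tail *tail, struct mcs_node *me)
{
    assert(tail != NULL && me != NULL);
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, false);

    /* Atomically append ourselves and learn who was last before us. */
    struct mcs_node *prev = atomic_exchange(tail, me);
    if (prev == NULL)
        return;                       /* queue was empty: we are the head */

    atomic_store(&prev->next, me);    /* link behind the predecessor */
    while (!atomic_load(&me->locked))
        ;                             /* spin on our own node only */
}

static void mcs_release(mcs_tail *tail, struct mcs_node *me)
{
    struct mcs_node *next = atomic_load(&me->next);

    if (next == NULL) {
        /* No visible successor: try to reset the queue to empty. */
        struct mcs_node *expected = me;
        if (atomic_compare_exchange_strong(tail, &expected, NULL))
            return;
        /* A successor swapped the tail but has not linked itself yet. */
        while ((next = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&next->locked, true); /* hand over to the next waiter */
}
```

Because every waiter spins on a flag in its own node, a release touches only the successor's cacheline instead of waking all spinners at once, which is the contention reduction the commit message describes.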
2013-04-19mutex: Make more scalable by doing less atomic operationsWaiman Long
In the __mutex_lock_common() function, an initial entry into the lock slow path will cause two atomic_xchg instructions to be issued. Together with the atomic decrement in the fast path, a total of three atomic read-modify-write instructions will be issued in rapid succession. This can cause a lot of cache bouncing when many tasks are trying to acquire the mutex at the same time. This patch will reduce the number of atomic_xchg instructions used by checking the counter value first before issuing the instruction. The atomic_read() function is just a simple memory read. The atomic_xchg() function, on the other hand, can be up to two orders of magnitude or even more in cost when compared with atomic_read(). By using atomic_read() to check the value first before calling atomic_xchg(), we can avoid a lot of unnecessary cache coherency traffic. The only downside with this change is that a task on the slow path will have a tiny bit less chance of getting the mutex when competing with another task in the fast path. The same is true for the atomic_cmpxchg() function in the mutex-spin-on-owner loop. So an atomic_read() is also performed before calling atomic_cmpxchg(). The mutex locking and unlocking code for the x86 architecture can allow any negative number to be used in the mutex count to indicate that some tasks are waiting for the mutex. I am not so sure if that is the case for the other architectures. So the default is to avoid atomic_xchg() if the count has already been set to -1. For x86, the check is modified to include all negative numbers to cover a larger case. The following table shows the jobs per minute (JPM) scalability data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The numactl command is used to restrict the running of the high_systime workloads to 1/2/4/8 nodes with hyperthreading on and off. 
+-----------------+-----------+------------+----------+
| Configuration   | Mean JPM  | Mean JPM   | % Change |
|                 | w/o patch | with patch |          |
+-----------------+-----------------------------------+
|                 |      User Range 1100 - 2000       |
+-----------------+-----------------------------------+
| 8 nodes, HT on  |   36980   |   148590   | +301.8%  |
| 8 nodes, HT off |   42799   |   145011   | +238.8%  |
| 4 nodes, HT on  |   61318   |   118445   |  +51.1%  |
| 4 nodes, HT off |  158481   |   158592   |   +0.1%  |
| 2 nodes, HT on  |  180602   |   173967   |   -3.7%  |
| 2 nodes, HT off |  198409   |   198073   |   -0.2%  |
| 1 node , HT on  |  149042   |   147671   |   -0.9%  |
| 1 node , HT off |  126036   |   126533   |   +0.4%  |
+-----------------+-----------------------------------+
|                 |       User Range 200 - 1000       |
+-----------------+-----------------------------------+
| 8 nodes, HT on  |   41525   |   122349   | +194.6%  |
| 8 nodes, HT off |   49866   |   124032   | +148.7%  |
| 4 nodes, HT on  |   66409   |   106984   |  +61.1%  |
| 4 nodes, HT off |  119880   |   130508   |   +8.9%  |
| 2 nodes, HT on  |  138003   |   133948   |   -2.9%  |
| 2 nodes, HT off |  132792   |   131997   |   -0.6%  |
| 1 node , HT on  |  116593   |   115859   |   -0.6%  |
| 1 node , HT off |  104499   |   104597   |   +0.1%  |
+-----------------+-----------+------------+----------+

At low user range 10-100, the JPM differences were within +/-1%. So they are not that interesting. AIM7 benchmark runs have a pretty large run-to-run variance due to the random nature of the subtests executed, so a difference of less than +/-5% may not be really significant.

This patch improves high_systime workload performance at 4 nodes and up by maintaining transaction rates without significant drop-off at high node count. The patch has practically no impact on 1- and 2-node systems.

The table below shows the percentage time (as reported by perf record -a -s -g) spent on the __mutex_lock_slowpath() function by the high_systime workload at 1500 users for 2/4/8-node configurations with hyperthreading off.
+---------------+-----------------+------------------+---------+
| Configuration | %Time w/o patch | %Time with patch | %Change |
+---------------+-----------------+------------------+---------+
| 8 nodes       |     65.34%      |      0.69%       |  -99%   |
| 4 nodes       |      8.70%      |      1.02%       |  -88%   |
| 2 nodes       |      0.41%      |      0.32%       |  -22%   |
+---------------+-----------------+------------------+---------+

It is obvious that the dramatic performance improvement at 8 nodes was due to the drastic cut in the time spent within the __mutex_lock_slowpath() function.

The table below shows the improvements in other AIM7 workloads (at 8 nodes, hyperthreading off).

+--------------+---------------+----------------+-----------------+
| Workload     | mean % change | mean % change  | mean % change   |
|              | 10-100 users  | 200-1000 users | 1100-2000 users |
+--------------+---------------+----------------+-----------------+
| alltests     |    +0.6%      |   +104.2%      |   +185.9%       |
| five_sec     |    +1.9%      |     +0.9%      |     +0.9%       |
| fserver      |    +1.4%      |     -7.7%      |     +5.1%       |
| new_fserver  |    -0.5%      |     +3.2%      |     +3.1%       |
| shared       |   +13.1%      |   +146.1%      |   +181.5%       |
| short        |    +7.4%      |     +5.0%      |     +4.2%       |
+--------------+---------------+----------------+-----------------+

Signed-off-by: Waiman Long <Waiman.Long@hp.com> Reviewed-by: Davidlohr Bueso <davidlohr.bueso@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chandramouleeswaran Aswin <aswin@hp.com> Cc: Norton Scott J <scott.norton@hp.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366226594-5506-3-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
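The "read before exchange" pattern from the commit above can be sketched in user-space C11. This models the x86 variant of the check (any negative count means waiters are already recorded); `slowpath_try_acquire()` is a hypothetical name, and the count encoding (1 unlocked, 0 locked, negative locked with waiters) is taken from the description in the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the mutex count:
 *   1 = unlocked, 0 = locked with no waiters, negative = locked with waiters.
 * Returns true if the caller obtained the lock.
 */
static bool slowpath_try_acquire(atomic_int *count)
{
    assert(count != NULL);

    /*
     * Cheap plain load first. If the count is already negative, the lock
     * is held and marked as contended, so an exchange could not succeed
     * and would only generate cache coherency traffic.
     */
    if (atomic_load_explicit(count, memory_order_relaxed) < 0)
        return false;

    /* Count is 0 or 1: mark it contended and check whether it was free. */
    return atomic_exchange(count, -1) == 1;
}
```

The trade-off noted in the commit message is visible here: between the load and the exchange another task may release the mutex, so a slow-path task occasionally misses an acquisition it would otherwise have won.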
2013-04-19mutex: Move mutex spinning code from sched/core.c back to mutex.cWaiman Long
As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler feature bit was really just an early hack to make with/without mutex-spinning testable. So it is no longer necessary. This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and moves the mutex spinning code from kernel/sched/core.c back to kernel/mutex.c, which is where it belongs. Signed-off-by: Waiman Long <Waiman.Long@hp.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chandramouleeswaran Aswin <aswin@hp.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Norton Scott J <scott.norton@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-04-18cgroup: fix broken file xattrsLi Zefan
We should store file xattrs in struct cfent instead of struct cftype, because cftype is a type while cfent is an object instance of cftype. For example, each cgroup has a tasks file, and each tasks file is associated with a unique cfent, but all those files share the same struct cftype.

Alexey Kodanev reported a crash, which can be reproduced:

    # mount -t cgroup -o xattr /sys/fs/cgroup
    # mkdir /sys/fs/cgroup/test
    # setfattr -n trusted.value -v test_value /sys/fs/cgroup/tasks
    # rmdir /sys/fs/cgroup/test
    # umount /sys/fs/cgroup
    oops!

In this case, simple_xattrs_free() will free the same struct simple_xattrs twice.

tj: Dropped unused local variable @cft from cgroup_diput().

Cc: <stable@vger.kernel.org> # 3.8.x Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-04-18Merge branch 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linuxLinus Torvalds
Pull user-namespace fixes from Andy Lutomirski.

* 'userns-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux:
  userns: Changing any namespace id mappings should require privileges
  userns: Check uid_map's opener's fsuid, not the current fsuid
  userns: Don't let unprivileged users trick privileged users into setting the id_map
2013-04-18nohz: New APIs to re-evaluate the tick on full dynticks CPUsFrederic Weisbecker
Provide two new helpers in order to notify the full dynticks CPUs about some internal system changes against which they may reconsider the state of their tick. Some practical examples include: posix cpu timers, perf tick and sched clock tick. For now the notifying handler, implemented through IPIs, is a stub that will be implemented when we get the tick stop/restart infrastructure in. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kevin Hilman <khilman@linaro.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2013-04-18Revert "block: add missing block_bio_complete() tracepoint"Linus Torvalds
This reverts commit 3a366e614d0837d9fc23f78cdb1a1186ebc3387f. Wanlong Gao reports that it causes a kernel panic on his machine several minutes after boot. Reverting it removes the panic. Jens says: "It's not quite clear why that is yet, so I think we should just revert the commit for 3.9 final (which I'm assuming is pretty close). The wifi is crap at the LSF hotel, so sending this email instead of queueing up a revert and pull request." Reported-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Requested-by: Jens Axboe <axboe@kernel.dk> Cc: Tejun Heo <tj@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18kprobes: Fix a double lock bug of kprobe_mutexMasami Hiramatsu
Fix a double locking bug caused when debug.kprobe-optimization=0. While the proc_kprobes_optimization_handler locks kprobe_mutex, wait_for_kprobe_optimizer locks it again and that causes a double lock. To fix the bug, this introduces a different mutex for protecting the sysctl parameter and locks it in proc_kprobes_optimization_handler. Of course, since we need to lock kprobe_mutex when touching kprobes resources, that is done in *optimize_all_kprobes(). This bug was introduced by commit ad72b3bea744 ("kprobes: fix wait_for_kprobe_optimizer()") Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-18posix-timers: Remove unused variableThomas Gleixner
Remove the unused variable *node introduced by commit 5ed67f05 (posix timers: Allocate timer id per process) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Pavel Emelyanov <xemul@parallels.com>
2013-04-17kernel/signal.c: stop info leak via the tkill and the tgkill syscallsEmese Revfy
This fixes a kernel memory contents leak via the tkill and tgkill syscalls for compat processes. This is visible in the siginfo_t->_sifields._rt.si_sigval.sival_ptr field when handling signals delivered from tkill.

The place of the infoleak:

    int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
    {
            ...
            put_user_ex(ptr_to_compat(from->si_ptr), &to->si_ptr);
            ...
    }

Signed-off-by: Emese Revfy <re.emese@gmail.com> Reviewed-by: PaX Team <pageexec@freemail.hu> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
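The class of leak described above arises when a structure containing a union is filled through one union member and later read through a larger one. The commit message does not spell out the fix, so the following is only a sketch of one common remedy (clearing the whole structure before filling it), using hypothetical names: `struct fake_siginfo`, `FAKE_SI_TKILL` and `fill_tkill_info()` are simplified stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FAKE_SI_TKILL (-6)  /* hypothetical code value used by this sketch */

/* Hypothetical, simplified stand-in for siginfo_t: a header plus a union. */
struct fake_siginfo {
    int signo;
    int code;
    union {
        struct { int pid; unsigned int uid; } kill;           /* filled for tkill */
        struct { int pid; unsigned int uid; void *ptr; } rt;  /* has an extra ptr */
    } fields;
};

static void fill_tkill_info(struct fake_siginfo *info, int sig, int pid,
                            unsigned int uid)
{
    assert(info != NULL);

    /*
     * Clear everything first, so members we never assign (such as rt.ptr)
     * cannot carry stale stack contents to whoever copies the struct out.
     */
    memset(info, 0, sizeof(*info));

    info->signo = sig;
    info->code = FAKE_SI_TKILL;
    info->fields.kill.pid = pid;
    info->fields.kill.uid = uid;
}
```

Without the memset, a reader that copies `fields.rt.ptr` to user space would expose whatever bytes happened to occupy that part of the caller's stack.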
2013-04-17kexec: use Crash kernel for Crash kernel lowYinghai Lu
We can extend kexec-tools to support multiple "Crash kernel" in /proc/iomem instead. So we can use "Crash kernel" instead of "Crash kernel low" in /proc/iomem. Suggested-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-3-git-send-email-yinghai@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-17x86, kdump: Change crashkernel_high/low= to crashkernel=,high/lowYinghai Lu
Per hpa, use crashkernel=X,high crashkernel=Y,low instead of crashkernel_high=X crashkernel_low=Y, as that form is extensible.

-v2: according to Vivek, change delimiter to ;
-v3: let high and low only handle simple form and it conforms to description in kernel-parameters.txt still keep crashkernel=X override any crashkernel=X,high crashkernel=Y,low
-v4: update get_last_crashkernel returning and add more strict checking in parse_crashkernel_simple() found by HATAYAMA.
-v5: Change delimiter back to , according to HPA. also separate parse_suffix from parse_simple according to vivek. so we can avoid @pos in that path.
-v6: Tighten the checking about crashkernel=X,highblahblah,high found by HATAYAMA.

Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1366089828-19692-5-git-send-email-yinghai@kernel.org Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
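The crashkernel=X,high / crashkernel=Y,low form and the stricter suffix check mentioned in -v6 can be sketched as a small user-space parser. This is an illustration under stated assumptions, not the kernel's parse_crashkernel code: `parse_size_with_suffix()` is a hypothetical helper, the value is assumed to be a single already-tokenized string, and only K/M/G units are handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse "SIZE[KMG],suffix" (for example "512M,high").
 * Returns true and stores the size in bytes only when the value ends with
 * exactly ",<suffix>"; anything trailing the suffix is rejected.
 */
static bool parse_size_with_suffix(const char *val, const char *suffix,
                                   unsigned long long *bytes)
{
    char *end;
    unsigned long long n;

    assert(val != NULL && suffix != NULL && bytes != NULL);

    n = strtoull(val, &end, 0);
    if (end == val)
        return false;                 /* no digits at all */

    switch (*end) {                   /* optional binary unit */
    case 'G': case 'g': n <<= 10;     /* fall through */
    case 'M': case 'm': n <<= 10;     /* fall through */
    case 'K': case 'k': n <<= 10; end++; break;
    default: break;
    }

    if (*end != ',')
        return false;                 /* simple form without a suffix */
    if (strcmp(end + 1, suffix) != 0)
        return false;                 /* wrong or over-long suffix */

    *bytes = n;
    return true;
}
```

Comparing the whole remainder against the suffix, rather than searching for it, is what rejects inputs such as crashkernel=X,highblahblah,high.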