summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2012-05-08fork: Remove the weak insanityThomas Gleixner
We error out when compiling with gcc4.1.[01] as it miscompiles __weak. The workaround with magic defines is not longer necessary. Make it __weak again. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20120505150141.306358267@linutronix.de
2012-05-08smp: Implement kick_all_cpus_sync()Thomas Gleixner
Will replace the misnomed cpu_idle_wait() function which is copied a gazillion times all over arch/* Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120507175652.049316594@linutronix.de
2012-05-07kmsg: export printk records to the /dev/kmsg interfaceKay Sievers
Support for multiple concurrent readers of /dev/kmsg, with read(), seek(), poll() support. Output of message sequence numbers, to allow userspace log consumers to reliably reconnect and reconstruct their state at any given time. After open("/dev/kmsg"), read() always returns *all* buffered records. If only future messages should be read, SEEK_END can be used. In case records get overwritten while /dev/kmsg is held open, or records get faster overwritten than they are read, the next read() will return -EPIPE and the current reading position gets updated to the next available record. The passed sequence numbers allow the log consumer to calculate the amount of lost messages. [root@mop ~]# cat /dev/kmsg 5,0,0;Linux version 3.4.0-rc1+ (kay@mop) (gcc version 4.7.0 20120315 ... 6,159,423091;ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff]) 7,160,424069;pci_root PNP0A03:00: host bridge window [io 0x0000-0x0cf7] (ignored) SUBSYSTEM=acpi DEVICE=+acpi:PNP0A03:00 6,339,5140900;NET: Registered protocol family 10 30,340,5690716;udevd[80]: starting version 181 6,341,6081421;FDC 0 is a S82078B 6,345,6154686;microcode: CPU0 sig=0x623, pf=0x0, revision=0x0 7,346,6156968;sr 1:0:0:0: Attached scsi CD-ROM sr0 SUBSYSTEM=scsi DEVICE=+scsi:1:0:0:0 6,347,6289375;microcode: CPU1 sig=0x623, pf=0x0, revision=0x0 Cc: Karel Zak <kzak@redhat.com> Tested-by: William Douglas <william.douglas@intel.com> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07printk: convert byte-buffer to variable-length record bufferKay Sievers
- Record-based stream instead of the traditional byte stream buffer. All records carry a 64 bit timestamp, the syslog facility and priority in the record header. - Records consume almost the same amount, sometimes less memory than the traditional byte stream buffer (if printk_time is enabled). The record header is 16 bytes long, plus some padding bytes at the end if needed. The byte-stream buffer needed 3 chars for the syslog prefix, 15 char for the timestamp and a newline. - Buffer management is based on message sequence numbers. When records need to be discarded, the reading heads move on to the next full record. Unlike the byte-stream buffer, no old logged lines get truncated or partly overwritten by new ones. Sequence numbers also allow consumers of the log stream to get notified if any message in the stream they are about to read gets discarded during the time of reading. - Better buffered IO support for KERN_CONT continuation lines, when printk() is called multiple times for a single line. The use of KERN_CONT is now mandatory to use continuation; a few places in the kernel need trivial fixes here. The buffering could possibly be extended to per-cpu variables to allow better thread-safety for multiple printk() invocations for a single line. - Full-featured syslog facility value support. Different facilities can tag their messages. All userspace-injected messages enforce a facility value > 0 now, to be able to reliably distinguish them from the kernel-generated messages. Independent subsystems like a baseband processor running its own firmware, or a kernel-related userspace process can use their own unique facility values. Multiple independent log streams can co-exist that way in the same buffer. All share the same global sequence number counter to ensure proper ordering (and interleaving) and to allow the consumers of the log to reliably correlate the events from different facilities. Tested-by: William Douglas <william.douglas@intel.com> Signed-off-by: Kay Sievers <kay@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-07sched: Update documentation and commentsHiroshi Shimamoto
Change sched_*.c to sched/*.c in documentation and comments. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4F795CAC.9080206@ct.jp.nec.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-07Merge branch 'linus' into sched/coreIngo Molnar
Merge reason: We were on a pretty old base, refresh before moving on. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-07Merge branch 'tip/perf/core-4' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core
2012-05-05PM / Sleep: Fix a mistake in a conditional in autosleep_store()Arve Hjønnevåg
The condition check in autosleep_store() is incorrect and prevents /sys/power/autosleep from working as advertised. Fix that. [rjw: Added the changelog.] Signed-off-by: Arve Hjønnevåg <arve@android.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-05-05init_task: Create generic init_task instanceThomas Gleixner
All archs define init_task in the same way (except ia64, but there is no particular reason why ia64 cannot use the common version). Create a generic instance so all archs can be converted over. The config switch is temporary and will be removed when all archs are converted over. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chen Liqin <liqin.chen@sunplusct.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Howells <dhowells@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: James E.J. Bottomley <jejb@parisc-linux.org> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Mark Salter <msalter@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Richard Weinberger <richard@nod.at> Cc: Russell King <linux@arm.linux.org.uk> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20120503085034.092585287@linutronix.de
2012-05-04params: replace printk(KERN_<LVL>...) with pr_<lvl>(...)Jim Cromie
I left 1 printk which uses __FILE__, __LINE__ explicitly, which should not be subject to generic preferences expressed via pr_fmt(). + tweaks suggested by Joe Perches: - add doing to irq-enabled warning, like others. It wont happen often.. - change sysfs failure crit, not just err, make it 1 line in logs. - coalese 2 format fragments into 1 >80 char line cc: Joe Perches <joe@perches.com> Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-04params.c: fix Smack complaint about parse_argsJim Cromie
In commit 9fb48c744: "params: add 3rd arg to option handler callback signature", the if-guard added to the pr_debug was overzealous; no callers pass NULL, and existing code above and below the guard assumes as much. Change the if-guard to match, and silence the Smack complaint. CC: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-04genirq: Do not consider disabled wakeup irqsThomas Gleixner
If an wakeup interrupt has been disabled before the suspend code disables all interrupts then we have to ignore the pending flag. Otherwise we would abort suspend over and over as nothing clears the pending flag because the interrupt is disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: NeilBrown <neilb@suse.de>
2012-05-04genirq: Allow check_wakeup_irqs to notice level-triggered interruptsThomas Gleixner
Level triggered interrupts do not cause IRQS_PENDING to be set when they fire while "disabled" as the 'pending' state is always present in the level - they automatically refire where re-enabled. However the IRQS_PENDING flag is also used to abort a suspend cycle - if any 'is_wakeup_set' interrupt is PENDING, check_wakeup_irqs() will cause suspend to abort. Without IRQS_PENDING, suspend won't abort. Consequently, level-triggered interrupts that fire during the 'noirq' phase of suspend do not currently abort suspend. So set IRQS_PENDING even for level triggered interrupts, and make sure to clear the flag in check_irq_resend. [ Changelog by courtesy of Neil ] Tested-by: NeilBrown <neilb@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-04smp: Fix idle_thread_init() inline stubThomas Gleixner
idle_thread_init() does not have arguments. Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-04Merge tag 'v3.4-rc5' into nextJames Morris
Linux 3.4-rc5 Merge to pull in prerequisite change for Smack: 86812bb0de1a3758dc6c7aa01a763158a7c0638a Requested by Casey.
2012-05-03smp, idle: Allocate idle thread for each possible cpu during bootSuresh Siddha
percpu areas are already allocated during boot for each possible cpu. percpu idle threads can be considered as an extension of the percpu areas, and allocate them for each possible cpu during boot. This will eliminate the need for workqueue based idle thread allocation. In future we can move the idle thread area into the percpu area too. [ tglx: Moved the loop into smpboot.c and added an error check when the init code failed to allocate an idle thread for a cpu which should be onlined ] Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: venki@google.com Link: http://lkml.kernel.org/r/1334966930.28674.245.camel@sbsiddha-desk.sc.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-05-03userns: Convert in_group_p and in_egroup_p to use kgid_tEric W. Biederman
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03userns: Convert ptrace, kill, set_priority permission checks to work with ↵Eric W. Biederman
kuids and kgids Update the permission checks to use the new uid_eq and gid_eq helpers and remove the now unnecessary user_ns equality comparison. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03userns: Convert setting and getting uid and gid system calls to use kuid and ↵Eric W. Biederman
kgid Convert setregid, setgid, setreuid, setuid, setresuid, getresuid, setresgid, getresgid, setfsuid, setfsgid, getuid, geteuid, getgid, getegid, waitpid, waitid, wait4. Convert userspace uids and gids into kuids and kgids before being placed on struct cred. Convert struct cred kuids and kgids into userspace uids and gids when returning them. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03userns: Convert sched_set_affinity and sched_set_scheduler's permission checksEric W. Biederman
- Compare kuids with uid_eq - kuid are uniuqe across all user namespaces so there is no longer the need for a user_namespace comparison. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03userns: Replace user_ns_map_uid and user_ns_map_gid with from_kuid and from_kgidEric W. Biederman
These function are no longer needed replace them with their more useful equivalents. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03userns: Store uid and gid values in struct cred with kuid_t and kgid_t typesEric W. Biederman
cred.h and a few trivial users of struct cred are changed. The rest of the users of struct cred are left for other patches as there are too many changes to make in one go and leave the change reviewable. If the user namespace is disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS are disabled the code will contiue to compile and behave correctly. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03userns: Convert group_info values from gid_t to kgid_t.Eric W. Biederman
As a first step to converting struct cred to be all kuid_t and kgid_t values convert the group values stored in group_info to always be kgid_t values. Unless user namespaces are used this change should have no effect. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-02rcu: Make exit_rcu() more precise and consolidatePaul E. McKenney
When running preemptible RCU, if a task exits in an RCU read-side critical section having blocked within that same RCU read-side critical section, the task must be removed from the list of tasks blocking a grace period (perhaps the current grace period, perhaps the next grace period, depending on timing). The exit() path invokes exit_rcu() to do this cleanup. However, the current implementation of exit_rcu() needlessly does the cleanup even if the task did not block within the current RCU read-side critical section, which wastes time and needlessly increases the size of the state space. Fix this by only doing the cleanup if the current task is actually on the list of tasks blocking some grace period. While we are at it, consolidate the two identical exit_rcu() functions into a single function. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: kernel/rcupdate.c
2012-05-02rcu: Move PREEMPT_RCU preemption to switch_to() invocationPaul E. McKenney
Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler. This is inefficient because enqueuing is required only if there is a context switch, and entry to the scheduler does not guarantee a context switch. The commit therefore moves the enqueuing to immediately precede the call to switch_to() from the scheduler. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-02Merge 3.4-rc5 into driver-core-nextGreg Kroah-Hartman
This was done to resolve a merge issue with the init/main.c file. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-02Merge 3.4-rc5 into staging-nextGreg Kroah-Hartman
This resolves the conflict in: drivers/staging/vt6656/ioctl.c Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-01PM / Sleep: Add user space interface for manipulating wakeup sources, v3Rafael J. Wysocki
Android allows user space to manipulate wakelocks using two sysfs file located in /sys/power/, wake_lock and wake_unlock. Writing a wakelock name and optionally a timeout to the wake_lock file causes the wakelock whose name was written to be acquired (it is created before is necessary), optionally with the given timeout. Writing the name of a wakelock to wake_unlock causes that wakelock to be released. Implement an analogous interface for user space using wakeup sources. Add the /sys/power/wake_lock and /sys/power/wake_unlock files allowing user space to create, activate and deactivate wakeup sources, such that writing a name and optionally a timeout to wake_lock causes the wakeup source of that name to be activated, optionally with the given timeout. If that wakeup source doesn't exist, it will be created and then activated. Writing a name to wake_unlock causes the wakeup source of that name, if there is one, to be deactivated. Wakeup sources created with the help of wake_lock that haven't been used for more than 5 minutes are garbage collected and destroyed. Moreover, there can be only WL_NUMBER_LIMIT wakeup sources created with the help of wake_lock present at a time. The data type used to track wakeup sources created by user space is called "struct wakelock" to indicate the origins of this feature. This version of the patch includes an rbtree manipulation fix from John Stultz. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NeilBrown <neilb@suse.de>
2012-05-01PM / Sleep: Add "prevent autosleep time" statistics to wakeup sourcesRafael J. Wysocki
Android uses one wakelock statistics that is only necessary for opportunistic sleep. Namely, the prevent_suspend_time field accumulates the total time the given wakelock has been locked while "automatic suspend" was enabled. Add an analogous field, prevent_sleep_time, to wakeup sources and make it behave in a similar way. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-01PM / Sleep: Implement opportunistic sleep, v2Rafael J. Wysocki
Introduce a mechanism by which the kernel can trigger global transitions to a sleep state chosen by user space if there are no active wakeup sources. It consists of a new sysfs attribute, /sys/power/autosleep, that can be written one of the strings returned by reads from /sys/power/state, an ordered workqueue and a work item carrying out the "suspend" operations. If a string representing the system's sleep state is written to /sys/power/autosleep, the work item triggering transitions to that state is queued up and it requeues itself after every execution until user space writes "off" to /sys/power/autosleep. That work item enables the detection of wakeup events using the functions already defined in drivers/base/power/wakeup.c (with one small modification) and calls either pm_suspend(), or hibernate() to put the system into a sleep state. If a wakeup event is reported while the transition is in progress, it will abort the transition and the "system suspend" work item will be queued up again. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NeilBrown <neilb@suse.de>
2012-05-01PM / Hibernate: Hibernate/thaw fixes/improvementsBojan Smojver
1. Do not allocate memory for buffers from emergency pools, unless absolutely required. Do not warn about and do not retry non-essential failed allocations. 2. Do not check the amount of free pages left on every single page write, but wait until one map is completely populated and then check. 3. Set maximum number of pages for read buffering consistently, instead of inadvertently depending on the size of the sector type. 4. Fix copyright line, which I missed when I submitted the hibernation threading patch. 5. Dispense with bit shifting arithmetic to improve readability. 6. Really recalculate the number of pages required to be free after all allocations have been done. 7. Fix calculation of pages required for read buffering. Only count in pages that do not belong to high memory. Signed-off-by: Bojan Smojver <bojan@rexursive.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-05-01rcu: Ensure that RCU_FAST_NO_HZ timers expire on correct CPUPaul E. McKenney
Timers are subject to migration, which can lead to the following system-hang scenario when CONFIG_RCU_FAST_NO_HZ=y: 1. CPU 0 executes synchronize_rcu(), which posts an RCU callback. 2. CPU 0 then goes idle. It cannot immediately invoke the callback, but there is nothing RCU needs from ti, so it enters dyntick-idle mode after posting a timer. 3. The timer gets migrated to CPU 1. 4. CPU 0 never wakes up, so the synchronize_rcu() never returns, so the system hangs. This commit fixes this problem by using mod_timer_pinned(), as suggested by Peter Zijlstra, to ensure that the timer is actually posted on the running CPU. Reported-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30dynamic_debug: make dynamic-debug work for module initializationJim Cromie
This introduces a fake module param $module.dyndbg. Its based upon Thomas Renninger's $module.ddebug boot-time debugging patch from https://lkml.org/lkml/2010/9/15/397 The 'fake' module parameter is provided for all modules, whether or not they need it. It is not explicitly added to each module, but is implemented in callbacks invoked from parse_args. For builtin modules, dynamic_debug_init() now directly calls parse_args(..., &ddebug_dyndbg_boot_params_cb), to process the params undeclared in the modules, just after the ddebug tables are processed. While its slightly weird to reprocess the boot params, parse_args() is already called repeatedly by do_initcall_levels(). More importantly, the dyndbg queries (given in ddebug_query or dyndbg params) cannot be activated until after the ddebug tables are ready, and reusing parse_args is cleaner than doing an ad-hoc parse. This reparse would break options like inc_verbosity, but they probably should be params, like verbosity=3. ddebug_dyndbg_boot_params_cb() handles both bare dyndbg (aka: ddebug_query) and module-prefixed dyndbg params, and ignores all other parameters. For example, the following will enable pr_debug()s in 4 builtin modules, in the order given: dyndbg="module params +p; module aio +p" module.dyndbg=+p pci.dyndbg For loadable modules, parse_args() in load_module() calls ddebug_dyndbg_module_params_cb(). This handles bare dyndbg params as passed from modprobe, and errors on other unknown params. Note that modprobe reads /proc/cmdline, so "modprobe foo" grabs all foo.params, strips the "foo.", and passes these to the kernel. ddebug_dyndbg_module_params_cb() is again called for the unknown params; it handles dyndbg, and errors on others. The "doing" arg added previously contains the module name. For non CONFIG_DYNAMIC_DEBUG builds, the stub function accepts and ignores $module.dyndbg params, other unknowns get -ENOENT. If no param value is given (as in pci.dyndbg example above), "+p" is assumed, which enables all pr_debug callsites in the module. The dyndbg fake parameter is not shown in /sys/module/*/parameters, thus it does not use any resources. Changes to it are made via the control file. Also change pr_info in ddebug_exec_queries to vpr_info, no need to see it all the time. Signed-off-by: Jim Cromie <jim.cromie@gmail.com> CC: Thomas Renninger <trenn@suse.de> CC: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-30params: add 3rd arg to option handler callback signatureJim Cromie
Add a 3rd arg, named "doing", to unknown-options callbacks invoked from parse_args(). The arg is passed as: "Booting kernel" from start_kernel(), initcall_level_names[i] from do_initcall_level(), mod->name from load_module(), via parse_args(), parse_one() parse_args() already has the "name" parameter, which is renamed to "doing" to better reflect current uses 1,2 above. parse_args() passes it to an altered parse_one(), which now passes it down into the unknown option handler callbacks. The mod->name will be needed to handle dyndbg for loadable modules, since params passed by modprobe are not qualified (they do not have a "$modname." prefix), and by the time the unknown-param callback is called, the module name is not otherwise available. Minor tweaks: Add param-name to parse_one's pr_debug(), current message doesnt identify the param being handled, add it. Add a pr_info to print current level and level_name of the initcall, and number of registered initcalls at that level. This adds 7 lines to dmesg output, like: initlevel:6=device, 172 registered initcalls Drop "parameters" from initcall_level_names[], its unhelpful in the pr_info() added above. This array is passed into parse_args() by do_initcall_level(). CC: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Acked-by: Jason Baron <jbaron@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-30rcu: Add rcutorture test for call_srcu()Lai Jiangshan
Add srcu_torture_deferred_free() for srcu_ops so as to test the new call_srcu(). Rename the original srcu_ops to srcu_sync_ops. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Implement per-domain single-threaded call_srcu() state machineLai Jiangshan
This commit implements an SRCU state machine in support of call_srcu(). The state machine is preemptible, light-weight, and single-threaded, minimizing synchronization overhead. In particular, there is no longer any need for synchronize_srcu() to be guarded by a mutex. Expedited processing is handled, at least in the absence of concurrent grace-period operations on that same srcu_struct structure, by having the synchronize_srcu_expedited() thread take on the role of the workqueue thread for one iteration. There is a reasonable probability that a given SRCU callback will be invoked on the same CPU that registered it, however, there is no guarantee. Concurrent SRCU grace-period primitives can cause callbacks to be executed elsewhere, even in absence of CPU-hotplug operations. Callbacks execute in process context, but under the influence of local_bh_disable(), so it is illegal to sleep in an SRCU callback function. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Use single value to handle expedited SRCU grace periodsLai Jiangshan
The earlier algorithm used an "expedited" flag combined with a "trycount" counter to differentiate between normal and expedited SRCU grace periods. However, the difference can be encoded into a single counter with a cutoff value and different initial values for expedited and normal SRCU grace periods. This commit makes that change. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Conflicts: kernel/srcu.c
2012-04-30rcu: Improve srcu_readers_active_idx()'s cache localityLai Jiangshan
Expand the calls to srcu_readers_active_idx() from srcu_readers_active() inline. This change improves cache locality by interating over the CPUs once rather than twice. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Implement a variant of Peter's SRCU algorithmLai Jiangshan
This commit implements a variant of Peter's algorithm, which may be found at https://lkml.org/lkml/2012/2/1/119. o Make the checking lock-free to enable parallel checking. Parallel checking is required when (1) the original checking task is preempted for a long time, (2) sychronize_srcu_expedited() starts during an ongoing SRCU grace period, or (3) we wish to avoid acquiring a lock. o Since the checking is lock-free, we avoid a mutex in state machine for call_srcu(). o Remove the SRCU_REF_MASK and remove the coupling with the flipping. This might allow us to remove the preempt_disable() in future versions, though such removal will need great care because it rescinds the one-old-reader-per-CPU guarantee. o Remove a smp_mb(), simplify the comments and make the smp_mb() pairs more intuitive. Inspired-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Improve SRCU's wait_idx() commentsLai Jiangshan
The safety of SRCU is provided byy wait_idx() rather than flipping. The flipping actually prevents starvation. This commit therefore updates the comments to more accurately and precisely describe what is going on. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Flip ->completed only once per SRCU grace periodLai Jiangshan
This is an optimization of the SRCU grace period. To guard against preempted readers with old values of the counter, it suffices to scan the old counters once more, then flip ->completed only one time. The reason this works is that the old readers must have incremented the old set of counters (if they have not yet incremented, then their critical section starts after this grace period, so they may be safely ignored). This commit therefore optimizes the second flip out in favor of a simple rescan. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Increment upper bit only for srcu_read_lock()Lai Jiangshan
The purpose of the upper bit of SRCU's per-CPU counters is to guarantee that no reasonable series of srcu_read_lock() and srcu_read_unlock() operations can return the value of the counter to its original value. This guarantee is require only after the index has been switched to the other set of counters, so at most one srcu_read_lock() can affect a given CPU's counter. The number of srcu_read_unlock() operations on a given counter is limited to the number of tasks in the system, which given the Linux kernel's current structure is limited to far less than 2^30 on 32-bit systems and far less than 2^62 on 64-bit systems. (Something about a limited number of bytes in the kernel's address space.) Therefore, if srcu_read_lock() increments the upper bits, then srcu_read_unlock() need not do so. In this case, an srcu_read_lock() and an srcu_read_unlock() will flip the lower bit of the upper field of the counter. An unreasonably large additional number of srcu_read_unlock() operations would be required to return the counter to its initial value, thus preserving the guarantee. This commit takes this approach, which further allows it to shrink the size of the upper field to one bit, making the number of srcu_read_unlock() operations required to return the counter to its initial value even more unreasonable than before. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Remove fast check path from __synchronize_srcu()Lai Jiangshan
The fastpath in __synchronize_srcu() is designed to handle cases where there are a large number of concurrent calls for the same srcu_struct structure. However, the Linux kernel currently does not use SRCU in this manner, so remove the fastpath checks for simplicity. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-30rcu: Direct algorithmic SRCU implementationPaul E. McKenney
The current implementation of synchronize_srcu_expedited() can cause severe OS jitter due to its use of synchronize_sched(), which in turn invokes try_stop_cpus(), which causes each CPU to be sent an IPI. This can result in severe performance degradation for real-time workloads and especially for short-interation-length HPC workloads. Furthermore, because only one instance of try_stop_cpus() can be making forward progress at a given time, only one instance of synchronize_srcu_expedited() can make forward progress at a time, even if they are all operating on distinct srcu_struct structures. This commit, inspired by an earlier implementation by Peter Zijlstra (https://lkml.org/lkml/2012/1/31/211) and by further offline discussions, takes a strictly algorithmic bits-in-memory approach. This has the disadvantage of requiring one explicit memory-barrier instruction in each of srcu_read_lock() and srcu_read_unlock(), but on the other hand completely dispenses with OS jitter and furthermore allows SRCU to be used freely by CPUs that RCU believes to be idle or offline. The update-side implementation handles the single read-side memory barrier by rechecking the per-CPU counters after summing them and by running through the update-side state machine twice. This implementation has passed moderate rcutorture testing on both x86 and Power. Also updated to use this_cpu_ptr() instead of per_cpu_ptr(), as suggested by Peter Zijlstra. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
2012-04-30rcu: Introduce rcutorture testing for rcu_barrier()Paul E. McKenney
Although rcutorture does invoke rcu_barrier() and friends, it cannot really be called a torture test given that it invokes them only once at the end of the test. This commit therefore introduces heavy-duty rcutorture testing for rcu_barrier(), which may be carried out concurrently with normal rcutorture testing. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-29Merge tag 'pm-for-3.4-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael J. Wysocki: "Fix for an issue causing hibernation to hang on systems with highmem (that practically means i386) due to broken memory management (bug introduced in 3.2, so -stable material) and PM documentation update making the freezer documentation follow the code again after some recent updates." * tag 'pm-for-3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / Freezer / Docs: Update documentation about freezing of tasks PM / Hibernate: fix the number of pages used for hibernate/thaw buffering
2012-04-27Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU fix from Ingo Molnar. * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcu: Permit call_rcu() from CPU_DYING notifiers
2012-04-27Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar. * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix OOPS when build_sched_domains() percpu allocation fails sched: Fix more load-balancing fallout
2012-04-27Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar. * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix perf_event_for_each() to use sibling perf symbols: Read plt symbols from proper symtab_type binary tracing: Fix stacktrace of latency tracers (irqsoff and friends) perf tools: Add 'G' and 'H' modifiers to event parsing tracing: Fix regression with tracing_on perf tools: Drop CROSS_COMPILE from flex and bison calls perf report: Fix crash showing warning related to kernel maps tracing: Fix build breakage without CONFIG_PERF_EVENTS (again)
2012-04-27Merge branch 'for-v3.4-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull build fixes for less mainstream architectures from Paul Gortmaker: "These are fixes for frv(1), blackfin(2), powerpc(1) and xtensa(4). Fortunately the touches are nearly all specific to files just used by the arch in question. The two touches to shared/common files [kernel/irq/debug.h and drivers/pci/Makefile] are trivial to assess as no risk to anyone. Half of them relate to xtensa directly. It was only when I fixed the last xtensa issue that I realized that the arch has been broken for a significant time, and isn't a specific v3.4 regression. So if you wanted, we could leave xtensa lying bleeding in the street for a couple more weeks and queue those for 3.5. But given they are no risk to anyone outside of xtensa, I figured to just leave them in. If you are OK with taking the xtensa fixes, then please pull to get: - one last implicit include uncovered by system.h that is in a file specific to just one powerpc defconfig. (I'd sync'd with BenH). - fix an oversight in the PCI makefile where shared code wasn't being compiled for ARCH=frv - fix a missing include for GPIO in blackfin framebuffer. - audit and tag endif in blackfin ezkit board file, in order to find and fix the misplaced endif masking a block of code. - fix irq/debug.h choice of temporary macro names to be more internal so they don't conflict with names used by xtensa. - fix a reference to an undeclared local var in xtensa's signal.c - fix an implicit bug.h usage in xtensa's asm/io.h uncovered by my removing bug.h from kernel.h - fix xtensa to properly indicate it is using asm-generic/hardirq.h in order to resolve the link error - undefined ack_bad_irq The xtensa still fails final link as my latest binutils does something evil when ld forward-relocates unlikely() blocks, but in theory people who have older/valid toolchains could now use the thing." * 'for-v3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: xtensa: fix build fail on undefined ack_bad_irq blackfin: fix ifdef fustercluck in mach-bf538/boards/ezkit.c blackfin: fix compile error in bfin-lq035q1-fb.c pci: frv architecture needs generic setup-bus infrastructure irq: hide debug macros so they don't collide with others. xtensa: fix build error in xtensa/include/asm/io.h xtensa: fix build failure in xtensa/kernel/signal.c powerpc: fix system.h fallout in sysdev/scom.c [chroma_defconfig]