path: root/kernel
2014-09-12alarmtimer: Do not signal SIGEV_NONE timersRichard Larocque
Avoids sending a signal to alarm timers created with sigev_notify set to SIGEV_NONE by checking for that special case in the timeout callback. The regular posix timers avoid sending signals to SIGEV_NONE timers by not scheduling any callbacks for them in the first place. Although it would be possible to do something similar for alarm timers, it's simpler to handle this as a special case in the timeout. Prior to this patch, the alarm timer would ignore the sigev_notify value and try to deliver signals to the process anyway. Even worse, the sanity check for the value of sigev_signo is skipped when SIGEV_NONE was specified, so the signal number could be bogus. If sigev_signo was an uninitialized value (as it often would be if SIGEV_NONE is used), then it's hard to predict which signal will be sent. Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by: Richard Larocque <rlarocque@google.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
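As a rough illustration of the special case described above, the timeout callback can simply skip signal delivery for SIGEV_NONE timers. The function and field names below follow the alarmtimer/posix-timers style but this is a sketch, not the literal patch:

/*
 * Sketch only: skip signal delivery when the timer was created with
 * SIGEV_NONE; names are illustrative, not the exact upstream code.
 */
static enum alarmtimer_restart alarm_handle_timer(struct alarm *alarm,
                                                  ktime_t now)
{
        struct k_itimer *ptr = container_of(alarm, struct k_itimer,
                                            it.alarm.alarmtimer);

        /* SIGEV_NONE timers must never fire a signal. */
        if ((ptr->it_sigev_notify & ~SIGEV_THREAD_ID) != SIGEV_NONE) {
                if (posix_timer_event(ptr, 0) != 0)
                        ptr->it_overrun++;
        }

        return ALARMTIMER_NORESTART;
}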
2014-09-12alarmtimer: Return relative times in timer_gettimeRichard Larocque
Returns the time remaining for an alarm timer, rather than the time at which it is scheduled to expire. If the timer has already expired or it is not currently scheduled, the it_value's members are set to zero. This new behavior matches that of the other posix-timers and the POSIX specifications. This is a change in user-visible behavior, and may break existing applications. Hopefully, few users rely on the old incorrect behavior. Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by: Richard Larocque <rlarocque@google.com> [jstultz: minor style tweak] Signed-off-by: John Stultz <john.stultz@linaro.org>
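A sketch of the reporting change (illustrative, not the exact patch): compute the time remaining relative to the alarm's base clock and clamp already-expired timers to zero.

/* Sketch: report remaining time, or zero if already expired. */
static void alarm_timer_get(struct k_itimer *timr,
                            struct itimerspec *cur_setting)
{
        struct alarm *alarm = &timr->it.alarm.alarmtimer;
        struct alarm_base *base = &alarm_bases[alarm->type];
        ktime_t remaining = ktime_sub(alarm->node.expires, base->gettime());

        if (ktime_to_ns(remaining) > 0) {
                cur_setting->it_value = ktime_to_timespec(remaining);
        } else {
                cur_setting->it_value.tv_sec = 0;
                cur_setting->it_value.tv_nsec = 0;
        }

        cur_setting->it_interval = ktime_to_timespec(timr->it.alarm.interval);
}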
2014-09-12jiffies: Fix timeval conversion to jiffiesAndrew Hunter
timeval_to_jiffies tried to round a timeval up to an integral number of jiffies, but the logic for doing so was incorrect: intervals corresponding to exactly N jiffies would become N+1. This manifested itself particularly when repeatedly stopping/starting an itimer:

        setitimer(ITIMER_PROF, &val, NULL);
        setitimer(ITIMER_PROF, NULL, &val);

would add a full tick to val, _even if it was exactly representable in terms of jiffies_ (say, the result of a previous rounding.) Doing this repeatedly would cause unbounded growth in val. So fix the math.

Here's what was wrong with the conversion: we essentially computed (eliding seconds)

        jiffies = usec * (NSEC_PER_USEC/TICK_NSEC)

by using scaling arithmetic, which took the best approximation of NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC = x/(2^USEC_JIFFIE_SC), and computed:

        jiffies = (usec * x) >> USEC_JIFFIE_SC

and rounded this calculation up in the intermediate form (since we can't necessarily exactly represent TICK_NSEC in usec.) But the scaling arithmetic is a (very slight) *over*approximation of the true value; that is, instead of dividing by (1 usec/1 jiffie), we effectively divided by (1 usec/1 jiffie)-epsilon (rounding down).

This would normally be fine, but we want to round timeouts up, and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this would be fine if our division was exact, but dividing this by the slightly smaller factor was equivalent to adding just _over_ 1 to the final result (instead of just _under_ 1, as desired.) In particular, with HZ=1000, we consistently computed that 10000 usec was 11 jiffies; the same was true for any exact multiple of TICK_NSEC.

We could possibly still round in the intermediate form, adding something less than 2^USEC_JIFFIE_SC - 1, but easier still is to convert usec->nsec, round in nanoseconds, and then convert using time*spec*_to_jiffies. This adds one constant multiplication, and is not observably slower in microbenchmarks on recent x86 hardware.

Tested: the following program:

        int main() {
                struct itimerval zero = {{0, 0}, {0, 0}};
                /* Initially set to 10 ms. */
                struct itimerval initial = zero;
                initial.it_interval.tv_usec = 10000;
                setitimer(ITIMER_PROF, &initial, NULL);
                /* Save and restore several times. */
                for (size_t i = 0; i < 10; ++i) {
                        struct itimerval prev;
                        setitimer(ITIMER_PROF, &zero, &prev);
                        /* on old kernels, this goes up by TICK_USEC every iteration */
                        printf("previous value: %ld %ld %ld %ld\n",
                               prev.it_interval.tv_sec, prev.it_interval.tv_usec,
                               prev.it_value.tv_sec, prev.it_value.tv_usec);
                        setitimer(ITIMER_PROF, &prev, NULL);
                }
                return 0;
        }

Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paul Turner <pjt@google.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Paul Turner <pjt@google.com> Reported-by: Aaron Jacobs <jacobsa@google.com> Signed-off-by: Andrew Hunter <ahh@google.com> [jstultz: Tweaked to apply to 3.17-rc] Signed-off-by: John Stultz <john.stultz@linaro.org>
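The idea of the fix can be sketched as follows (simplified; the real conversion also clamps large values and uses scaled arithmetic rather than a division): do the round-up in nanoseconds, where a tick is exactly representable, and only then divide.

/* Simplified sketch of "round in nanoseconds, then convert". */
static unsigned long my_timeval_to_jiffies(const struct timeval *value)
{
        u64 nsec = (u64)value->tv_sec * NSEC_PER_SEC +
                   (u64)value->tv_usec * NSEC_PER_USEC;

        /* Rounding up here cannot overshoot: TICK_NSEC is exact in nsec. */
        nsec += TICK_NSEC - 1;
        return div_u64(nsec, TICK_NSEC);
}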
2014-09-12futex: Unlock hb->lock in futex_wait_requeue_pi() error pathThomas Gleixner
futex_wait_requeue_pi() calls futex_wait_setup(). If futex_wait_setup() succeeds it returns with hb->lock held and preemption disabled. Now the sanity check after this does:

        if (match_futex(&q.key, &key2)) {
                ret = -EINVAL;
                goto out_put_keys;
        }

which releases the keys but does not release hb->lock. So we happily return to user space with hb->lock held and therefore preemption disabled. Unlock hb->lock before taking the exit route.

Reported-by: Dave "Trinity" Jones <davej@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
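The shape of the fix, as a simplified fragment (not the literal diff): drop the hash-bucket lock before taking the error exit.

        /* Fragment from futex_wait_requeue_pi(), simplified. */
        ret = futex_wait_setup(uaddr, val, flags, &q, &hb);
        if (ret)
                goto out_key2;

        /* The waited-on futex and the requeue target must differ. */
        if (match_futex(&q.key, &key2)) {
                queue_unlock(hb);       /* previously missing: releases hb->lock */
                ret = -EINVAL;
                goto out_put_keys;
        }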
2014-09-10kcmp: fix standard comparison bugRasmus Villemoes
The C operator <= defines a perfectly fine total ordering on the set of values representable in a long. However, unlike its namesake in the integers, it is not translation invariant, meaning that we do not have "b <= c" iff "a+b <= a+c" for all a,b,c.

This means that it is always wrong to try to boil down the relationship between two longs to a question about the sign of their difference, because the resulting relation [a LEQ b iff a-b <= 0] is neither anti-symmetric nor transitive. The former is due to -LONG_MIN==LONG_MIN (take any two a,b with a-b = LONG_MIN; then a LEQ b and b LEQ a, but a != b). The latter can either be seen by observing that x LEQ x+1 for all x, implying x LEQ x+1 LEQ x+2 ... LEQ x-1 LEQ x; or more directly with the simple example a=LONG_MIN, b=0, c=1, for which a-b < 0, b-c < 0, but a-c > 0.

Note that it makes absolutely no difference that a transmogrifying bijection has been applied before the comparison is done. In fact, had the obfuscation not been done, one could probably not observe the bug (assuming all values being compared always lie in one half of the address space, the mathematical value of a-b is always representable in a long). As it stands, one can easily obtain three file descriptors exhibiting the non-transitivity of kcmp().

Side note 1: I can't see that ensuring the MSB of the multiplier is set serves any purpose other than obfuscating the obfuscating code.

Side note 2:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <assert.h>
        #include <sys/syscall.h>

        enum kcmp_type { KCMP_FILE, KCMP_VM, KCMP_FILES, KCMP_FS,
                         KCMP_SIGHAND, KCMP_IO, KCMP_SYSVSEM, KCMP_TYPES, };
        pid_t pid;

        int kcmp(pid_t pid1, pid_t pid2, int type,
                 unsigned long idx1, unsigned long idx2)
        {
                return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2);
        }
        int cmp_fd(int fd1, int fd2)
        {
                int c = kcmp(pid, pid, KCMP_FILE, fd1, fd2);
                if (c < 0) {
                        perror("kcmp");
                        exit(1);
                }
                assert(0 <= c && c < 3);
                return c;
        }
        int cmp_fdp(const void *a, const void *b)
        {
                static const int normalize[] = {0, -1, 1};
                return normalize[cmp_fd(*(int*)a, *(int*)b)];
        }
        #define MAX 100 /* This is plenty; I've seen it trigger for MAX==3 */
        int main(int argc, char *argv[])
        {
                int r, s, count = 0;
                int REL[3] = {0,0,0};
                int fd[MAX];
                pid = getpid();
                while (count < MAX) {
                        r = open("/dev/null", O_RDONLY);
                        if (r < 0)
                                break;
                        fd[count++] = r;
                }
                printf("opened %d file descriptors\n", count);
                for (r = 0; r < count; ++r) {
                        for (s = r+1; s < count; ++s) {
                                REL[cmp_fd(fd[r], fd[s])]++;
                        }
                }
                printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
                qsort(fd, count, sizeof(fd[0]), cmp_fdp);
                memset(REL, 0, sizeof(REL));
                for (r = 0; r < count; ++r) {
                        for (s = r+1; s < count; ++s) {
                                REL[cmp_fd(fd[r], fd[s])]++;
                        }
                }
                printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]);
                return (REL[0] + REL[2] != 0);
        }

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
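For contrast, a comparison that does not go through the sign of a difference stays translation invariant and transitive over the whole value range. A sketch of that style of three-way compare, using kcmp's 0/1/2 return convention (illustrative, not the literal patch):

/* Sketch: branch on the ordering directly instead of subtracting. */
static int ordered_cmp(unsigned long a, unsigned long b)
{
        if (a < b)
                return 1;       /* kcmp reports 1 for "less than" */
        if (a > b)
                return 2;       /* kcmp reports 2 for "greater than" */
        return 0;               /* equal */
}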
2014-09-10kernel/printk/printk.c: fix faulty logic in the case of recursive printkPatrick Palka
We shouldn't set text_len in the code path that detects printk recursion because text_len corresponds to the length of the string inside textbuf. A few lines down from the line

        text_len = strlen(recursion_msg);

is the line

        text_len += vscnprintf(text + text_len, ...);

So if printk detects recursion, it sets text_len to 29 (the length of recursion_msg) and logs an error. Then the message supplied by the caller of printk is stored inside textbuf but offset by 29 bytes. This means that the output of the recursive call to printk will contain 29 bytes of garbage in front of it.

This defect is caused by commit 458df9fd4815 ("printk: remove separate printk_sched buffers and use printk buf instead") which turned the line

        text_len = vscnprintf(text, ...);

into

        text_len += vscnprintf(text + text_len, ...);

To fix this, this patch avoids setting text_len when logging the printk recursion error. This patch also marks the branch leading up to this code as unlikely(). Fixes: 458df9fd4815b478 ("printk: remove separate printk_sched buffers and use printk buf instead") Signed-off-by: Patrick Palka <patrick@parcs.ath.cx> Reviewed-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
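A simplified fragment of the fixed flow (a sketch, not the literal patch): the recursion warning is stored as its own record, and text_len is left alone so the caller's message still starts at offset zero in textbuf.

        if (unlikely(recursion_bug)) {
                static const char recursion_msg[] =
                        "BUG: recent printk recursion!";

                recursion_bug = 0;
                /* Log the warning as its own record; don't touch text_len. */
                printed_len += log_store(0, 2, LOG_PREFIX | LOG_NEWLINE, 0,
                                         NULL, 0, recursion_msg,
                                         strlen(recursion_msg));
        }

        /* text_len is still 0 here, so the caller's text starts at offset 0. */
        text_len += vscnprintf(text + text_len,
                               sizeof(textbuf) - text_len, fmt, args);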
2014-09-10net: bpf: only build bpf_jit_binary_{alloc, free}() when jit selectedDaniel Borkmann
Since BPF JIT depends on the availability of module_alloc() and module_free() helpers (HAVE_BPF_JIT and MODULES), we better build that code only in case we have BPF_JIT in our config enabled, just like with other JIT code. Fixes builds for arm/marzen_defconfig and sh/rsk7269_defconfig.

        ====================
        kernel/built-in.o: In function `bpf_jit_binary_alloc':
        /home/cwang/linux/kernel/bpf/core.c:144: undefined reference to `module_alloc'
        kernel/built-in.o: In function `bpf_jit_binary_free':
        /home/cwang/linux/kernel/bpf/core.c:164: undefined reference to `module_free'
        make: *** [vmlinux] Error 1
        ====================

Reported-by: Fengguang Wu <fengguang.wu@intel.com> Fixes: 738cbe72adc5 ("net: bpf: consolidate JIT binary allocator") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-10kernel: trace_syscalls: Replace rcu_assign_pointer() with RCU_INIT_POINTER()Andreea-Cristina Bernat
The uses of "rcu_assign_pointer()" are NULLing out the pointers. According to RCU_INIT_POINTER()'s block comment: "1. This use of RCU_INIT_POINTER() is NULLing out the pointer" it is better to use it instead of rcu_assign_pointer() because it has a smaller overhead. The following Coccinelle semantic patch was used: @@ @@ - rcu_assign_pointer + RCU_INIT_POINTER (..., NULL) Link: http://lkml.kernel.org/p/20140822142822.GA32391@ada Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10ftrace: Replace tramp_hash with old_*_hash to save spaceSteven Rostedt (Red Hat)
Allowing function callbacks to declare their own trampolines requires that each ftrace_ops that has a trampoline must have some sort of accounting that keeps track of which ops has a trampoline attached to a record. The easy way to solve this was to add a "tramp_hash" that created a hash entry for every function that a ops uses with a trampoline. But since we can have literally tens of thousands of functions being traced, that means we need tens of thousands of descriptors to map the ops to the function in the hash. This is quite expensive and can cause enabling and disabling the function graph tracer to take some time to start and stop. It can take up to several seconds to disable or enable all functions in the function graph tracer for this reason. The better approach, albeit more complex, is to keep track of how ops are being enabled and disabled, and use that along with the counting of the number of ops attached to records, to determine which ops have a trampoline attached to a record at enabling and disabling of tracing. To do this, the tramp_hash has been replaced with an old_filter_hash and old_notrace_hash, which get the copy of the ops filter_hash and notrace_hash respectively. The old hashes are kept until the ops has been modified or removed and the old hashes are used with the logic of the accounting to determine the ops that have the trampoline of a record. The reason this has less of a footprint is due to the trick that an "empty" hash in the filter_hash means "all functions" and an empty hash in the notrace hash means "no functions" in the hash. This is much more efficient, doesn't have the delay, and takes up much less memory, as we do not need to map all the functions but just figure out which functions are mapped at the time it is enabled or disabled. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10ftrace: Annotate the ops operation on updateSteven Rostedt (Red Hat)
Add three new flags for ftrace_ops:

        FTRACE_OPS_FL_ADDING
        FTRACE_OPS_FL_REMOVING
        FTRACE_OPS_FL_MODIFYING

These will be set for the ftrace_ops when they are first added to the function tracing, being removed from function tracing or just having their functions changed from function tracing, respectively. This will be needed to remove the tramp_hash, which can grow quite big. The tramp_hash is used to note what functions a ftrace_ops is using a trampoline for. Denoting which ftrace_ops is being modified will allow us to use the ftrace_ops hashes themselves, which are much smaller as they have a global flag to denote if a ftrace_ops is tracing all functions, as well as a notrace hash if the ftrace_ops is tracing all but a few. The tramp_hash just creates a hash item for every function, which can go into the 10s of thousands if all functions are using the ftrace_ops trampoline. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10ftrace: Grab any ops for a rec for enabled_functions outputSteven Rostedt (Red Hat)
When dumping the enabled_functions, use the first op that is found with a trampoline to the record, as there should only be one, as only one ops can be registered to a function that has a trampoline. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10ftrace: Remove freeing of old_hash from ftrace_hash_move()Steven Rostedt (Red Hat)
ftrace_hash_move() currently frees the old hash that is passed to it after replacing the pointer with the new hash. Instead of having the function do that chore, have the caller perform the free. This lets the ftrace_hash_move() be used a bit more freely, which is needed for changing the way the trampoline logic is done. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-10ftrace: Set callback to ftrace_stub when no ops are registeredSteven Rostedt (Red Hat)
The clean up that adds the helper function ftrace_ops_get_func() caused the default function to not change when DYNAMIC_FTRACE was not set and no ftrace_ops were registered. Although static tracing is not very useful (not having DYNAMIC_FTRACE set), it is still supported and we don't want to break it. Clean up the if statement even more to specifically have the default function call ftrace_stub when no ftrace_ops are registered. This fixes the small bug for static tracing as well as makes the code a bit more understandable. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-09net: bpf: consolidate JIT binary allocatorDaniel Borkmann
Introduced in commit 314beb9bcabf ("x86: bpf_jit_comp: secure bpf jit against spraying attacks") and later on replicated in aa2d2c73c21f ("s390/bpf,jit: address randomize and write protect jit code") for s390 architecture, write protection for BPF JIT images got added and a random start address of the JIT code, so that it's not on a page boundary anymore. Since both use a very similar allocator for the BPF binary header, we can consolidate this code into the BPF core as it's mostly JIT independent anyway. This will also allow for future archs that support DEBUG_SET_MODULE_RONX to just reuse instead of reimplementing it. JIT tested on x86_64 and s390x with BPF test suite. Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09ftrace: Add helper function ftrace_ops_get_func()Steven Rostedt (Red Hat)
Add a helper function that returns what the mcount trampoline is to call for a given ftrace_ops. This helper will be used by arch code in the future to set up dynamic trampolines. But as this does the same tests that are performed in choosing what function to call for the default mcount trampoline, we might as well use it to clean up the existing code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-09net: filter: add "load 64-bit immediate" eBPF instructionAlexei Starovoitov
add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. All previous instructions were 8-byte. This is the first 16-byte instruction. Two consecutive 'struct bpf_insn' blocks are interpreted as a single instruction:

        insn[0].code = BPF_LD | BPF_DW | BPF_IMM
        insn[0].dst_reg = destination register
        insn[0].imm = lower 32-bit
        insn[1].code = 0
        insn[1].imm = upper 32-bit

All unused fields must be zero. Classic BPF has a similar instruction: BPF_LD | BPF_W | BPF_IMM which loads a 32-bit immediate value into a register. x64 JITs it as a single 'movabsq %rax, imm64'; arm64 may JIT it as a sequence of four 'movk x0, #imm16, lsl #shift' insns. Note that old eBPF programs are binary compatible with the new interpreter. It helps eBPF programs load a 64-bit constant into a register with one instruction instead of using two registers and 4 instructions:

        BPF_MOV32_IMM(R1, imm32)
        BPF_ALU64_IMM(BPF_LSH, R1, 32)
        BPF_MOV32_IMM(R2, imm32)
        BPF_ALU64_REG(BPF_OR, R1, R2)

User space generated programs will use this instruction to load constants only. To tell the kernel that user space needs a pointer, the _pseudo_ variant of this instruction may be added later, which will use extra bits of encoding to indicate what type of pointer user space is asking the kernel to provide. For example, the 'off' or 'src_reg' fields can be used for such purpose: src_reg = 1 could mean that user space is asking the kernel to validate and load an in-kernel map pointer; src_reg = 2 could mean that user space needs a read-only data section pointer; src_reg = 3 could mean that user space needs a pointer to per-cpu local data. All such future pseudo instructions will not be carrying the actual pointer as part of the instruction, but rather will be treated as a request to the kernel to provide one. The kernel will verify the request_for_a_pointer, then will drop the _pseudo_ marking and will store the actual internal pointer inside the instruction, so the end result is the interpreter and JITs never see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that loads a 64-bit immediate into a register. User space never operates on direct pointers and the verifier can easily recognize request_for_pointer vs other instructions. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
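A sketch of a helper that emits the two halves described above (the macro name is made up; the field layout follows struct bpf_insn):

/* Emits two struct bpf_insn initializers for one 64-bit immediate load. */
#define MY_BPF_LD_IMM64(DST, IMM)                               \
        ((struct bpf_insn) {                                    \
                .code    = BPF_LD | BPF_DW | BPF_IMM,           \
                .dst_reg = DST,                                 \
                .src_reg = 0,                                   \
                .off     = 0,                                   \
                .imm     = (__u32)(IMM) }),                     \
        ((struct bpf_insn) {                                    \
                .code    = 0, /* second half, reserved opcode */ \
                .dst_reg = 0,                                   \
                .src_reg = 0,                                   \
                .off     = 0,                                   \
                .imm     = ((__u64)(IMM)) >> 32 })

Used inside an instruction array it expands to the two consecutive blocks, e.g. struct bpf_insn prog[] = { MY_BPF_LD_IMM64(BPF_REG_1, 0x123456789abcdefULL), ... };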
2014-09-09ftrace: Add separate function for non recursive callbacksSteven Rostedt (Red Hat)
Instead of using the generic list function for callbacks that are not recursive, call a new helper function from the mcount trampoline called ftrace_ops_recur_func() that will do the recursion checking for the callback. This eliminates an indirection and will also help future code that will use dynamically allocated trampolines. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-09Documentation: Docbook: Fix generated DocBook/kernel-api.xmlMasanari Iida
This patch fixes a spelling typo found in DocBook/kernel-api.xml. Because the file is generated from the source comments, the comments in the source code have to be fixed. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-09-09perf: Fix a race condition in perf_remove_from_context()Cong Wang
We saw a kernel soft lockup in perf_remove_from_context(); it looks like the `perf` process, when exiting, could not go out of the retry loop. Meanwhile, the target process was forking a child. So either the target process should execute the smp function call to deactivate the event (if it was running) or it should do a context switch which deactivates the event. It seems we optimize out a context switch in perf_event_context_sched_out(), and what's more important, we still test an obsolete task pointer when retrying, so no one actually would deactivate that event in this situation. Fix it directly by reloading the task pointer in perf_remove_from_context(). This should cure the above soft lockup. Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1409696840-843-1-git-send-email-xiyou.wangcong@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
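The retry path, sketched as a simplified fragment: once the context turns out to be active again, the task pointer is refreshed before looping, so the removal is eventually attempted against the task that actually owns the context.

retry:
        /* ... cross-CPU removal attempt for the current 'task' elided ... */
        raw_spin_lock_irq(&ctx->lock);
        if (ctx->is_active) {
                raw_spin_unlock_irq(&ctx->lock);
                /*
                 * Reload the task pointer: a concurrent
                 * perf_event_context_sched_out() may have handed the
                 * context to a different task.
                 */
                task = ctx->task;
                goto retry;
        }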
2014-09-09perf/core: Replace rcu_assign_pointer() with RCU_INIT_POINTER()Andreea-Cristina Bernat
The use of "rcu_assign_pointer()" is NULLing out the pointer. According to RCU_INIT_POINTER()'s block comment: "1. This use of RCU_INIT_POINTER() is NULLing out the pointer" it is better to use it instead of rcu_assign_pointer() because it has a smaller overhead. The following Coccinelle semantic patch was used: @@ @@ - rcu_assign_pointer + RCU_INIT_POINTER (..., NULL) Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Link: http://lkml.kernel.org/r/20140822132605.GA20130@ada Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09perf/callchain: Replace rcu_assign_pointer() with RCU_INIT_POINTER()Andreea-Cristina Bernat
The use of "rcu_assign_pointer()" is NULLing out the pointer. According to RCU_INIT_POINTER()'s block comment: "1. This use of RCU_INIT_POINTER() is NULLing out the pointer" it is better to use it instead of rcu_assign_pointer() because it has a smaller overhead. The following Coccinelle semantic patch was used: @@ @@ - rcu_assign_pointer + RCU_INIT_POINTER (..., NULL) Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: paulmck@linux.vnet.ibm.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Link: http://lkml.kernel.org/r/20140822141536.GA32051@ada Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09Merge tag 'v3.17-rc4' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09sched: Reduce contention in update_cfs_rq_blocked_load()Jason Low
When running workloads on 2+ socket systems, based on perf profiles, the update_cfs_rq_blocked_load() function often shows up as taking up a noticeable % of run time. Much of the contention is in __update_cfs_rq_tg_load_contrib() when we update the tg load contribution stats. However, it turns out that in many cases, they don't need to be updated and "tg_contrib" is 0. This patch adds a check in __update_cfs_rq_tg_load_contrib() to skip updating tg load contribution stats when nothing needs to be updated. This avoids unnecessary cacheline contention. Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: jason.low2@hp.com Cc: Yuyang Du <yuyang.du@intel.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Chegu Vinod <chegu_vinod@hp.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409643684.19197.15.camel@j-VirtualBox Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09sched: Migrate waking tasksLai Jiangshan
Current code can fail to migrate a waking task (silently) when TTWU_QUEUE is enabled. When a task is waking, it is pending on the wake_list of the rq, but it is not queued (task->on_rq == 0). In this case, set_cpus_allowed_ptr() and __migrate_task() will not migrate it because it's invisible to them. This behavior is incorrect, because the task has already been woken; it will be running on the wrong CPU without correct placement until the next wake-up or update for cpus_allowed. To fix this problem, we need to finish the wakeup (so they appear on the runqueue) before we migrate them. Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Jason J. Herne <jjherne@linux.vnet.ibm.com> Tested-by: Jason J. Herne <jjherne@linux.vnet.ibm.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/538ED7EB.5050303@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-09PM / sleep: Enhance test_suspend option with repeat capabilitySrinivas Pandruvada
Enhance the test_suspend boot parameter to repeat tests multiple times by adding an optional repeat count. The new boot param syntax: test_suspend="mem|freeze|standby[,N]" Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-09PM / sleep: Support freeze as test_suspend optionSrinivas Pandruvada
Add freeze as one of the options for the test_suspend boot param. Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-08sched, time: Atomically increment stime & utimeRik van Riel
The functions task_cputime_adjusted() and thread_group_cputime_adjusted() can be called locklessly, as well as concurrently on many different CPUs. This can occasionally lead to the utime and stime reported by times(), and other syscalls like it, going backward. The cause for this appears to be multiple threads racing in cputime_adjust(), both with values for utime or stime that are larger than the original, but each with a different value. Sometimes the larger value gets saved first, only to be immediately overwritten with a smaller value by another thread. Using atomic exchange prevents that problem, and ensures time progresses monotonically. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Cc: fweisbec@gmail.com Cc: akpm@linux-foundation.org Cc: srao@redhat.com Cc: lwoodman@redhat.com Cc: atheurer@redhat.com Cc: oleg@redhat.com Link: http://lkml.kernel.org/r/1408133138-22048-4-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
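The monotonic update can be sketched as a compare-and-swap loop (illustrative; the real code may wrap the cmpxchg for the cputime_t type):

/* Only ever move the published counter forward. */
static void cputime_advance(cputime_t *counter, cputime_t new)
{
        cputime_t old;

        while (new > (old = ACCESS_ONCE(*counter)))
                cmpxchg(counter, old, new);
}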
2014-09-08time, signal: Protect resource use statistics with seqlockRik van Riel
Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability issues on large systems, due to both functions being serialized with a lock. The lock protects against reporting a wrong value, due to a thread in the task group exiting, its statistics reporting up to the signal struct, and that exited task's statistics being counted twice (or not at all). Protecting that with a lock results in times() and clock_gettime() being completely serialized on large systems. This can be fixed by using a seqlock around the events that gather and propagate statistics. As an additional benefit, the protection code can be moved into thread_group_cputime(), slightly simplifying the calling functions. In the case of posix_cpu_clock_get_task() things can be simplified a lot, because the calling function already ensures that the task sticks around, and the rest is now taken care of in thread_group_cputime(). This way the statistics reporting code can run lockless. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Daeseok Youn <daeseok.youn@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guillaume Morin <guillaume@morinfr.org> Cc: Ionut Alexa <ionut.m.alexa@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: umgwanakikbuti@gmail.com Cc: fweisbec@gmail.com Cc: srao@redhat.com Cc: lwoodman@redhat.com Cc: atheurer@redhat.com Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
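The locking scheme, sketched with the regular seqlock API (the function names here are illustrative): writers that fold an exiting task's counters into the group totals take the write side, while readers retry instead of blocking.

/* Writer side: fold an exiting thread's counters into the group totals. */
static void account_exited_thread(struct signal_struct *sig,
                                  cputime_t utime, cputime_t stime)
{
        unsigned long flags;

        write_seqlock_irqsave(&sig->stats_lock, flags);
        sig->utime += utime;
        sig->stime += stime;
        write_sequnlock_irqrestore(&sig->stats_lock, flags);
}

/* Reader side: lockless snapshot that retries if a writer was active. */
static void read_group_cputime(struct signal_struct *sig,
                               cputime_t *utime, cputime_t *stime)
{
        unsigned int seq;

        do {
                seq = read_seqbegin(&sig->stats_lock);
                *utime = sig->utime;
                *stime = sig->stime;
        } while (read_seqretry(&sig->stats_lock, seq));
}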
2014-09-08exit: Always reap resource stats in __exit_signal()Rik van Riel
Oleg pointed out that wait_task_zombie adds a task's usage statistics to the parent's signal struct, but the task's own signal struct should also propagate the statistics at exit time. This allows thread_group_cputime(reaped_zombie) to get the statistics after __unhash_process() has made the task invisible to for_each_thread, but before the thread has actually been rcu freed, making sure no non-monotonic results are returned inside that window. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Guillaume Morin <guillaume@morinfr.org> Cc: Ionut Alexa <ionut.m.alexa@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Michal Schmidt <mschmidt@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: umgwanakikbuti@gmail.com Cc: fweisbec@gmail.com Cc: srao@redhat.com Cc: lwoodman@redhat.com Cc: atheurer@redhat.com Link: http://lkml.kernel.org/r/1408133138-22048-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-08Merge tag 'v3.17-rc4' into sched/core, to prevent conflicts with upcoming patches, and to refresh the treeIngo Molnar
Linux 3.17-rc4
2014-09-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2014-09-07Merge branch 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroupLinus Torvalds
Pull cgroup fixes from Tejun Heo: "This pull request includes Alban's patch to disallow '\n' in cgroup names. Two other patches from Li to fix a possible oops when cgroup destruction races against other file operations and one from Vivek to fix a unified hierarchy devel behavior"

* 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: check cgroup liveliness before unbreaking kernfs
  cgroup: delay the clearing of cgrp->kn->priv
  cgroup: Display legacy cgroup files on default hierarchy
  cgroup: reject cgroup names with '\n'
2014-09-08percpu-refcount: add @gfp to percpu_ref_init()Tejun Heo
Percpu allocator now supports allocation mask. Add @gfp to percpu_ref_init() so that !GFP_KERNEL allocation masks can be used with percpu_refs too. This patch doesn't make any functional difference. v2: blk-mq conversion was missing. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Jens Axboe <axboe@kernel.dk>
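A sketch of a caller after this change, assuming the signature described here (release callback plus the new gfp argument); the object and helper names are made up:

static struct my_obj {
        struct percpu_ref ref;
        /* ... payload ... */
} obj;

static void my_obj_release(struct percpu_ref *ref)
{
        /* Last reference dropped: safe to tear the object down here. */
}

static int my_obj_setup(void)
{
        /* GFP_KERNEL as before; atomic contexts could pass GFP_NOWAIT. */
        return percpu_ref_init(&obj.ref, my_obj_release, GFP_KERNEL);
}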
2014-09-07rcu: Per-CPU operation cleanups to rcu_*_qs() functionsPaul E. McKenney
The rcu_bh_qs(), rcu_preempt_qs(), and rcu_sched_qs() functions use old-style per-CPU variable access and write to ->passed_quiesce even if it is already set. This commit therefore updates to use the new-style per-CPU variable access functions and avoids the spurious writes. This commit also eliminates the "cpu" argument to these functions because they are always invoked on the indicated CPU. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Remove local_irq_disable() in rcu_preempt_note_context_switch()Paul E. McKenney
The rcu_preempt_note_context_switch() function is on a scheduling fast path, so it would be good to avoid disabling irqs. The reason that irqs are disabled is to synchronize process-level and irq-handler access to the task_struct ->rcu_read_unlock_special bitmask. This commit therefore makes ->rcu_read_unlock_special instead be a union of bools with a short allowing single-access checks in RCU's __rcu_read_unlock(). This results in the process-level and irq-handler accesses being simple loads and stores, so that irqs need no longer be disabled. This commit therefore removes the irq disabling from rcu_preempt_note_context_switch(). Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Additional information on RCU-tasks stall-warning messagesPaul E. McKenney
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Make rcu_tasks_kthread()'s GP-wait loop allow preemptionPaul E. McKenney
The grace-period-wait loop in rcu_tasks_kthread() is under (unnecessary) RCU protection, and therefore has no preemption points in a PREEMPT=n kernel. This commit therefore removes the RCU protection and inserts cond_resched(). Reported-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Make TASKS_RCU handle nohz_full= CPUsPaul E. McKenney
Currently TASKS_RCU would ignore a CPU running a task in nohz_full= usermode execution. There would be neither a context switch nor a scheduling-clock interrupt to tell TASKS_RCU that the task in question had passed through a quiescent state. The grace period would therefore extend indefinitely. This commit therefore makes RCU's dyntick-idle subsystem record the task_struct structure of the task that is running in dyntick-idle mode on each CPU. The TASKS_RCU grace period can then access this information and record a quiescent state on behalf of any CPU running in dyntick-idle usermode. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Defer rcu_tasks_kthread() creation till first call_rcu_tasks()Paul E. McKenney
It is expected that many sites will have CONFIG_TASKS_RCU=y, but will never actually invoke call_rcu_tasks(). For such sites, creating rcu_tasks_kthread() at boot is wasteful. This commit therefore defers creation of this kthread until the time of the first call_rcu_tasks(). This of course means that the first call_rcu_tasks() must be invoked from process context after the scheduler is fully operational. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Improve RCU-tasks energy efficiencyPaul E. McKenney
The current RCU-tasks implementation uses strict polling to detect callback arrivals. This works quite well, but is not so good for energy efficiency. This commit therefore replaces the strict polling with a wait queue. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Add stall-warning checks for RCU-tasksPaul E. McKenney
This commit adds a ten-minute RCU-tasks stall warning. The actual time is controlled by the boot/sysfs parameter rcu_task_stall_timeout, with values less than or equal to zero disabling the stall warnings. The default value is ten minutes, which means that the tasks that have not yet responded will get their stacks dumped every ten minutes, until they pass through a voluntary context switch. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcutorture: Add torture tests for RCU-tasksPaul E. McKenney
This commit adds torture tests for RCU-tasks. It also fixes a bug that would segfault for an RCU flavor lacking a callback-barrier function. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-09-07rcu: Export RCU-tasks APIs to GPL modulesSteven Rostedt
This commit exports the RCU-tasks synchronous APIs, synchronize_rcu_tasks() and rcu_barrier_tasks(), to GPL-licensed kernel modules. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2014-09-07rcu: Make TASKS_RCU handle tasks that are almost done exitingPaul E. McKenney
Once a task has passed exit_notify() in the do_exit() code path, it is no longer on the task lists, and is therefore no longer visible to rcu_tasks_kthread(). This means that an almost-exited task might be preempted while within a trampoline, and this task won't be waited on by rcu_tasks_kthread(). This commit fixes this bug by adding an srcu_struct. An exiting task does srcu_read_lock() just before calling exit_notify(), and does the corresponding srcu_read_unlock() after doing the final preempt_disable(). This means that rcu_tasks_kthread() can do synchronize_srcu() to wait for all mostly-exited tasks to reach their final preempt_disable() region, and then use synchronize_sched() to wait for those tasks to finish exiting. Reported-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Add synchronous grace-period waiting for RCU-tasksPaul E. McKenney
It turns out to be easier to add the synchronous grace-period waiting functions to RCU-tasks than to work around their absence in rcutorture, so this commit adds them. The key point is that the existence of call_rcu_tasks() means that rcutorture needs an rcu_barrier_tasks(). Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loopsPaul E. McKenney
RCU-tasks requires the occasional voluntary context switch from CPU-bound in-kernel tasks. In some cases, this requires instrumenting cond_resched(). However, there is some reluctance to countenance unconditionally instrumenting cond_resched() (see http://lwn.net/Articles/603252/), so this commit creates a separate cond_resched_rcu_qs() that may be used in place of cond_resched() in locations prone to long-duration in-kernel looping. This commit currently instruments only RCU-tasks. Future possibilities include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce IPI usage. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
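Illustrative use in a long-running in-kernel loop (the work function is hypothetical):

static int my_scan_thread(void *arg)
{
        while (!kthread_should_stop()) {
                scan_one_batch();       /* hypothetical long-running work */
                /* Voluntary QS for RCU-tasks, plus the usual resched point. */
                cond_resched_rcu_qs();
        }
        return 0;
}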
2014-09-07rcu: Add call_rcu_tasks()Paul E. McKenney
This commit adds a new RCU-tasks flavor of RCU, which provides call_rcu_tasks(). This RCU flavor's quiescent states are voluntary context switch (not preemption!) and userspace execution (not the idle loop -- use some sort of schedule_on_each_cpu() if you need to handle the idle tasks). Note that unlike other RCU flavors, these quiescent states occur in tasks, not necessarily CPUs. Includes fixes from Steven Rostedt. This RCU flavor is assumed to have very infrequent latency-tolerant updaters. This assumption permits significant simplifications, including a single global callback list protected by a single global lock, along with a single task-private linked list containing all tasks that have not yet passed through a quiescent state. If experience shows this assumption to be incorrect, the required additional complexity will be added. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
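A sketch of an updater using the new flavor (the structure and helpers are hypothetical): the callback only runs once every task has voluntarily rescheduled or run in usermode, so memory that tasks may have been preempted in (e.g. a trampoline) can then be reclaimed.

struct my_tramp {
        struct rcu_head rcu;
        /* ... trampoline bookkeeping ... */
};

static void my_tramp_free(struct rcu_head *rhp)
{
        struct my_tramp *t = container_of(rhp, struct my_tramp, rcu);

        kfree(t);       /* no task can still be executing inside it now */
}

static void my_tramp_retire(struct my_tramp *t)
{
        call_rcu_tasks(&t->rcu, my_tramp_free);
}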
2014-09-07rcutorture: Add callback-flood testPaul E. McKenney
Although RCU is designed to handle arbitrary floods of callbacks, this capability is not routinely tested. This commit therefore adds a cbflood capability in which kthreads repeatedly register large numbers of callbacks. One such kthread is created for each four CPUs (rounding up), and the test may be controlled by several cbflood_* kernel boot parameters, which control the number of bursts per flood, the number of callbacks per burst, the time between bursts, and the time between floods. The default values are large enough to exercise RCU's emergency responses to callback flooding. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Miller <davem@davemloft.net> Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>
2014-09-07rcu: Use pr_alert/pr_cont for printing logsJoe Perches
Use pr_alert/pr_cont for printing the logs from the rcutorture module directly instead of writing them to a buffer and then printing it. This means we no longer have to allocate such buffers. Also remove a resulting empty function.

I tested this using the parse-torture.sh script as follows:

        $ dmesg | grep torture > log.txt
        $ bash parse-torture.sh log.txt test
        $

There were no warnings, which means that parsing went fine. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-09-07rcutorture: Fix a sparse warning by marking boost_mutex staticPranith Kumar
This commit fixes the following sparse warning by marking boost_mutex static: kernel/rcu/rcutorture.c:185:1: warning: symbol 'boost_mutex' was not declared. Should it be static? Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>