summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2013-02-19Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes from Ingo Molnar: "Main changes: - scheduler side full-dynticks (user-space execution is undisturbed and receives no timer IRQs) preparation changes that convert the cputime accounting code to be full-dynticks ready, from Frederic Weisbecker. - Initial sched.h split-up changes, by Clark Williams - select_idle_sibling() performance improvement by Mike Galbraith: " 1 tbench pair (worst case) in a 10 core + SMT package: pre 15.22 MB/sec 1 procs post 252.01 MB/sec 1 procs " - sched_rr_get_interval() ABI fix/change. We think this detail is not used by apps (so it's not an ABI in practice), but lets keep it under observation. - misc RT scheduling cleanups, optimizations" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) sched/rt: Add <linux/sched/rt.h> header to <linux/init_task.h> cputime: Remove irqsave from seqlock readers sched, powerpc: Fix sched.h split-up build failure cputime: Restore CPU_ACCOUNTING config defaults for PPC64 sched/rt: Move rt specific bits into new header file sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice sched: Move sched.h sysctl bits into separate header sched: Fix signedness bug in yield_to() sched: Fix select_idle_sibling() bouncing cow syndrome sched/rt: Further simplify pick_rt_task() sched/rt: Do not account zero delta_exec in update_curr_rt() cputime: Safely read cputime of full dynticks CPUs kvm: Prepare to add generic guest entry/exit callbacks cputime: Use accessors to read task cputime stats cputime: Allow dynamic switch between tick/virtual based cputime accounting cputime: Generic on-demand virtual cputime accounting cputime: Move default nsecs_to_cputime() to jiffies based cputime file cputime: Librarize per nsecs resolution cputime definitions cputime: Avoid multiplication overflow on utime scaling context_tracking: Export context state for generic vtime ... Fix up conflict in kernel/context_tracking.c due to comment additions.
2013-02-18mm: fix pageblock bitmap allocationLinus Torvalds
Commit c060f943d092 ("mm: use aligned zone start for pfn_to_bitidx calculation") fixed out calculation of the index into the pageblock bitmap when a !SPARSEMEM zome was not aligned to pageblock_nr_pages. However, the _allocation_ of that bitmap had never taken this alignment requirement into accout, so depending on the exact size and alignment of the zone, the use of that index could then access past the allocation, resulting in some very subtle memory corruption. This was reported (and bisected) by Ingo Molnar: one of his random config builds would hang with certain very specific kernel command line options. In the meantime, commit c060f943d092 has been marked for stable, so this fix needs to be back-ported to the stable kernels that backported the commit to use the right alignment. Bisected-and-tested-by: Ingo Molnar <mingo@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-14s390/mm: implement software dirty bitsMartin Schwidefsky
The s390 architecture is unique in respect to dirty page detection, it uses the change bit in the per-page storage key to track page modifications. All other architectures track dirty bits by means of page table entries. This property of s390 has caused numerous problems in the past, e.g. see git commit ef5d437f71afdf4a "mm: fix XFS oops due to dirty pages without buffers on s390". To avoid future issues in regard to per-page dirty bits convert s390 to a fault based software dirty bit detection mechanism. All user page table entries which are marked as clean will be hardware read-only, even if the pte is supposed to be writable. A write by the user process will trigger a protection fault which will cause the user pte to be marked as dirty and the hardware read-only bit is removed. With this change the dirty bit in the storage key is irrelevant for Linux as a host, but the storage key is still required for KVM guests. The effect is that page_test_and_clear_dirty and the related code can be removed. The referenced bit in the storage key is still used by the page_test_and_clear_young primitive to provide page age information. For page cache pages of mappings with mapping_cap_account_dirty there will not be any change in behavior as the dirty bit tracking already uses read-only ptes to control the amount of dirty pages. Only for swap cache pages and pages of mappings without mapping_cap_account_dirty there can be additional protection faults. To avoid an excessive number of additional faults the mk_pte primitive checks for PageDirty if the pgprot value allows for writes and pre-dirties the pte. That avoids all additional faults for tmpfs and shmem pages until these pages are added to the swap cache. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-02-12mm: cma: fix accounting of CMA pages placed in high memoryMarek Szyprowski
The total number of low memory pages is determined as totalram_pages - totalhigh_pages, so without this patch all CMA pageblocks placed in highmem were accounted to low memory. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-12memcg: fix kmemcg registration for late cachesGlauber Costa
The designed workflow for the caches in kmemcg is: register it with memcg_register_cache() if kmemcg is already available or later on when a new kmemcg appears at memcg_update_cache_sizes() which will handle all caches in the system. The caches created at boot time will be handled by the later, and the memcg-caches as well as any system caches that are registered later on by the former. There is a bug, however, in memcg_register_cache: we correctly set up the array size, but do not mark the cache as a root cache. This means that allocations for any cache appearing late in the game will see memcg->memcg_params->is_root_cache == false, and in particular, trigger VM_BUG_ON(!cachep->memcg_params->is_root_cache) in __memcg_kmem_cache_get. The obvious fix is to include the missing assignment. Signed-off-by: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-12mm: don't overwrite mm->def_flags in do_mlockall()Gerald Schaefer
With commit 8e72033f2a48 ("thp: make MADV_HUGEPAGE check for mm->def_flags") the VM_NOHUGEPAGE flag may be set on s390 in mm->def_flags for certain processes, to prevent future thp mappings. This would be overwritten by do_mlockall(), which sets it back to 0 with an optional VM_LOCKED flag set. To fix this, instead of overwriting mm->def_flags in do_mlockall(), only the VM_LOCKED flag should be set or cleared. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reported-by: Vivek Goyal <vgoyal@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-07sched/rt: Move rt specific bits into new header fileClark Williams
Move rt scheduler definitions out of include/linux/sched.h into new file include/linux/sched/rt.h Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094707.7b9f825f@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-07sched: Move sched.h sysctl bits into separate headerClark Williams
Move the sysctl-related bits from include/linux/sched.h into a new file: include/linux/sched/sysctl.h. Then update source files requiring access to those bits by including the new header file. Signed-off-by: Clark Williams <williams@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20130207094659.06dced96@riff.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-02-06mm/sl[au]b: correct allocation type check in kmalloc_slab()Joonsoo Kim
commit "slab: Common Kmalloc cache determination" made mistake in kmalloc_slab(). SLAB_CACHE_DMA is for kmem_cache creation, not for allocation. For allocation, we should use GFP_XXX to identify type of allocation. So, change SLAB_CACHE_DMA to GFP_DMA. Acked-by: Christoph Lameter <cl@linux.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-06slab: Fixup CONFIG_PAGE_ALLOC/DEBUG_SLAB_LEAK sectionsChristoph Lameter
Variables were not properly converted and the conversion caused a naming conflict. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-05mm: fix wrong comments about anon_vma lockYuanhan Liu
We use rwsem since commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex to an rwsem"). And most of comments are converted to the new rwsem lock; while just 2 more missed from: $ git grep 'anon_vma->mutex' Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-05mm/hugetlb: set PTE as huge in hugetlb_change_protection and ↵Tony Lu
remove_migration_pte When setting a huge PTE, besides calling pte_mkhuge(), we also need to call arch_make_huge_pte(), which we indeed do in make_huge_pte(), but we forget to do in hugetlb_change_protection() and remove_migration_pte(). Signed-off-by: Zhigang Lu <zlu@tilera.com> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hillf Danton <dhillf@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-05thp: avoid dumping huge zero pageKirill A. Shutemov
No reason to preserve the huge zero page in core dumps. Reported-by: Michel Lespinasse <walken@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-01slab: Common definition for kmem_cache_nodeChristoph Lameter
Put the definitions for the kmem_cache_node structures together so that we have one structure. That will allow us to create more common fields in the future which could yield more opportunities to share code. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Rename list3/l3 to nodeChristoph Lameter
The list3 or l3 pointers are pointing to per node structures. Reflect that in the names of variables used. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common Kmalloc cache determinationChristoph Lameter
Extract the optimized lookup functions from slub and put them into slab_common.c. Then make slab use these functions as well. Joonsoo notes that this fixes some issues with constant folding which also reduces the code size for slub. https://lkml.org/lkml/2012/10/20/82 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common function to create the kmalloc arrayChristoph Lameter
The kmalloc array is created in similar ways in both SLAB and SLUB. Create a common function and have both allocators call that function. V1->V2: Whitespace cleanup Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common definition for the array of kmalloc cachesChristoph Lameter
Have a common definition fo the kmalloc cache arrays in SLAB and SLUB Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common constants for kmalloc boundariesChristoph Lameter
Standardize the constants that describe the smallest and largest object kept in the kmalloc arrays for SLAB and SLUB. Differentiate between the maximum size for which a slab cache is used (KMALLOC_MAX_CACHE_SIZE) and the maximum allocatable size (KMALLOC_MAX_SIZE, KMALLOC_MAX_ORDER). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Rename nodelists to nodeChristoph Lameter
Have a common naming between both slab caches for future changes. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Common name for the per node structuresChristoph Lameter
Rename the structure used for the per node structures in slab to have a name that expresses that fact. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Use common kmalloc_index/kmalloc_size functionsChristoph Lameter
Make slab use the common functions. We can get rid of a lot of old ugly stuff as a results. Among them the sizes array and the weird include/linux/kmalloc_sizes file and some pretty bad #include statements in slab_def.h. The one thing that is different in slab is that the 32 byte cache will also be created for arches that have page sizes larger than 4K. There are numerous smaller allocations that SLOB and SLUB can handle better because of their support for smaller allocation sizes so lets keep the 32 byte slab also for arches with > 4K pages. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-02-01slab: Use proper formatting specs for unsigned size_tChristoph Lameter
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2013-01-29mm: Add alloc_bootmem_low_pages_nopanic()Yinghai Lu
We don't need to panic in some case, like for swiotlb preallocating. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-35-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29memblock: Add memblock_mem_size()Yinghai Lu
Use it to get mem size under the limit_pfn. to replace local version in x86 reserved_initrd. -v2: remove not needed cast that is pointed out by HPA. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Link: http://lkml.kernel.org/r/1359058816-7615-29-git-send-email-yinghai@kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-29Merge remote-tracking branch 'origin/x86/boot' into x86/mm2H. Peter Anvin
Coming patches to x86/mm2 require the changes and advanced baseline in x86/boot. Resolved Conflicts: arch/x86/kernel/setup.c mm/nobootmem.c Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-01-26userns: Allow the userns root to mount tmpfs.Eric W. Biederman
There is no backing store to tmpfs and file creation rules are the same as for any other filesystem so it is semantically safe to allow unprivileged users to mount it. ramfs is safe for the same reasons so allow either flavor of tmpfs to be mounted by a user namespace root user. The memory control group successfully limits how much memory tmpfs can consume on any system that cares about a user namespace root using tmpfs to exhaust memory the memory control group can be deployed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-01-24mm: minor cleanup of iov_iter_single_seg_count()Maxim Patlasov
The function does not modify iov_iter which 'i' points to. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-01-24Negative (setpoint-dirty) in bdi_position_ratio()paul.szabo@sydney.edu.au
In bdi_position_ratio(), get difference (setpoint-dirty) right even when negative. Both setpoint and dirty are unsigned long, the difference was zero-padded thus wrongly sign-extended to s64. This issue affects all 32-bit architectures, does not affect 64-bit architectures where long and s64 are equivalent. In this function, dirty is between freerun and limit, the pseudo-float x is between [-1,1], expected to be negative about half the time. With zero-padding, instead of a small negative x we obtained a large positive one so bdi_position_ratio() returned garbage. Casting the difference to s64 also prevents overflow with left-shift; though normally these numbers are small and I never observed a 32-bit overflow there. (This patch does not solve the PAE OOM issue.) Paul Szabo psz@maths.usyd.edu.au http://www.maths.usyd.edu.au/u/psz/ School of Mathematics and Statistics University of Sydney Australia Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: Paul Szabo <psz@maths.usyd.edu.au> Reference: http://bugs.debian.org/695182 Signed-off-by: Paul Szabo <psz@maths.usyd.edu.au> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2013-01-21taint: add explicit flag to show whether lock dep is still OK.Rusty Russell
Fix up all callers as they were before, with make one change: an unsigned module taints the kernel, but doesn't turn off lockdep. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-01-17Merge 3.9-rc4 into driver-core-nextGreg Kroah-Hartman
This is to fix up a build problem with a wireless driver due to the dynamic-debug patches in this branch. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-17mm: remove depends on CONFIG_EXPERIMENTALKees Cook
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. CC: Andrew Morton <akpm@linux-foundation.org> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: Jan Beulich <JBeulich@novell.com> CC: Mel Gorman <mel@csn.ul.ie> CC: Seth Jennings <sjenning@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-14writeback: add more tracepointsTejun Heo
Add tracepoints for page dirtying, writeback_single_inode start, inode dirtying and writeback. For the latter two inode events, a pair of events are defined to denote start and end of the operations (the starting one has _start suffix and the one w/o suffix happens after the operation is complete). These inode ops are FS specific and can be non-trivial and having enclosing tracepoints is useful for external tracers. This is part of tracepoint additions to improve visiblity into dirtying / writeback operations for io tracer and userland. v2: writeback_dirty_inode[_start] TPs may be called for files on pseudo FSes w/ unregistered bdi. Check whether bdi->dev is %NULL before dereferencing. v3: buffer dirtying moved to a block TP. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-01-11mm: compaction: partially revert capture of suitable high-order pageMel Gorman
Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when waiting for POLLIN on a local TCP socket. It was easier to trigger if there was disk IO and dirty pages at the same time and he bisected it to commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available"). The intention of that patch was to improve high-order allocations under memory pressure after changes made to reclaim in 3.6 drastically hurt THP allocations but the approach was flawed. For Eric, the problem was that page->pfmemalloc was not being cleared for captured pages leading to a poor interaction with swap-over-NFS support causing the packets to be dropped. However, I identified a few more problems with the patch including the fact that it can increase contention on zone->lock in some cases which could result in async direct compaction being aborted early. In retrospect the capture patch took the wrong approach. What it should have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it was allocating for THP and avoided races that way. While the patch was showing to improve allocation success rates at the time, the benefit is marginal given the relative complexity and it should be revisited from scratch in the context of the other reclaim-related changes that have taken place since the patch was first written and tested. This patch partially reverts commit 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available"). Reported-and-tested-by: Eric Wong <normalperson@yhbt.net> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11mm: thp: acquire the anon_vma rwsem for write during splitMel Gorman
Zhouping Liu reported the following against 3.8-rc1 when running a mmap testcase from LTP. mapcount 0 page_mapcount 3 ------------[ cut here ]------------ kernel BUG at mm/huge_memory.c:1798! invalid opcode: 0000 [#1] SMP Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_mirror dm_region_hash dm_log dm_mod cdc_ether iTCO_wdt i7core_edac coretemp usbnet iTCO_vendor_support mii crc32c_intel edac_core lpc_ich shpchp ioatdma mfd_core i2c_i801 pcspkr serio_raw bnx2 microcode dca vhost_net tun macvtap macvlan kvm_intel kvm uinput mgag200 sr_mod cdrom i2c_algo_bit sd_mod drm_kms_helper crc_t10dif ata_generic pata_acpi ttm ata_piix drm libata i2c_core megaraid_sas CPU 1 Pid: 23217, comm: mmap10 Not tainted 3.8.0-rc1mainline+ #17 IBM IBM System x3400 M3 Server -[7379I08]-/69Y4356 RIP: __split_huge_page+0x677/0x6d0 RSP: 0000:ffff88017a03fc08 EFLAGS: 00010293 RAX: 0000000000000003 RBX: ffff88027a6c22e0 RCX: 00000000000034d2 RDX: 000000000000748b RSI: 0000000000000046 RDI: 0000000000000246 RBP: ffff88017a03fcb8 R08: ffffffff819d2440 R09: 000000000000054a R10: 0000000000aaaaaa R11: 00000000ffffffff R12: 0000000000000000 R13: 00007f4f11a00000 R14: ffff880179e96e00 R15: ffffea0005c08000 FS: 00007f4f11f4a740(0000) GS:ffff88017bc20000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000037e9ebb404 CR3: 000000017a436000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process mmap10 (pid: 23217, threadinfo ffff88017a03e000, task ffff880172dd32e0) Stack: ffff88017a540ec8 ffff88017a03fc20 ffffffff816017b5 ffff88017a03fc88 ffffffff812fa014 0000000000000000 ffff880279ebd5c0 00000000f4f11a4c 00000007f4f11f49 00000007f4f11a00 ffff88017a540ef0 ffff88017a540ee8 Call Trace: split_huge_page+0x68/0xb0 __split_huge_page_pmd+0x134/0x330 split_huge_page_pmd_mm+0x51/0x60 split_huge_page_address+0x3b/0x50 __vma_adjust_trans_huge+0x9c/0xf0 vma_adjust+0x684/0x750 __split_vma.isra.28+0x1fa/0x220 do_munmap+0xf9/0x420 vm_munmap+0x4e/0x70 sys_munmap+0x2b/0x40 system_call_fastpath+0x16/0x1b Alexander Beregalov and Alex Xu reported similar bugs and Hillf Danton identified that commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex to an rwsem") and commit 4fc3f1d66b1e ("mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable") were likely the problem. Reverting these commits was reported to solve the problem for Alexander. Despite the reason for these commits, NUMA balancing is not the direct source of the problem. split_huge_page() expects the anon_vma lock to be exclusive to serialise the whole split operation. Ordinarily it is expected that the anon_vma lock would only be required when updating the avcs but THP also uses the anon_vma rwsem for collapse and split operations where the page lock or compound lock cannot be used (as the page is changing from base to THP or vice versa) and the page table locks are insufficient. This patch takes the anon_vma lock for write to serialise against parallel split_huge_page as THP expected before the conversion to rwsem. Reported-and-tested-by: Zhouping Liu <zliu@redhat.com> Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Reported-by: Alex Xu <alex_y_xu@yahoo.ca> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11mm: mmap: annotate vm_lock_anon_vma locking properly for lockdepJiri Kosina
Commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex to an rwsem") turned anon_vma mutex to rwsem. However, the properly annotated nested locking in mm_take_all_locks() has been converted from mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem); to down_write(&anon_vma->root->rwsem); which is incomplete, and causes the false positive report from lockdep below. Annotate the fact that mmap_sem is used as an outter lock to serialize taking of all the anon_vma rwsems at once no matter the order, using the down_write_nest_lock() primitive. This patch fixes this lockdep report: ============================================= [ INFO: possible recursive locking detected ] 3.8.0-rc2-00036-g5f73896 #171 Not tainted --------------------------------------------- qemu-kvm/2315 is trying to acquire lock: (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0 but task is already holding lock: (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&anon_vma->rwsem); lock(&anon_vma->rwsem); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by qemu-kvm/2315: #0: (&mm->mmap_sem){++++++}, at: do_mmu_notifier_register+0xfc/0x170 #1: (mm_all_locks_mutex){+.+...}, at: mm_take_all_locks+0x36/0x1b0 #2: (&mapping->i_mmap_mutex){+.+...}, at: mm_take_all_locks+0xc9/0x1b0 #3: (&anon_vma->rwsem){+.+...}, at: mm_take_all_locks+0x149/0x1b0 stack backtrace: Pid: 2315, comm: qemu-kvm Not tainted 3.8.0-rc2-00036-g5f73896 #171 Call Trace: print_deadlock_bug+0xf2/0x100 validate_chain+0x4f6/0x720 __lock_acquire+0x359/0x580 lock_acquire+0x121/0x190 down_write+0x3f/0x70 mm_take_all_locks+0x149/0x1b0 do_mmu_notifier_register+0x68/0x170 mmu_notifier_register+0xe/0x10 kvm_create_vm+0x22b/0x330 [kvm] kvm_dev_ioctl+0xf8/0x1a0 [kvm] do_vfs_ioctl+0x9d/0x350 sys_ioctl+0x91/0xb0 system_call_fastpath+0x16/0x1b Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mel Gorman <mel@csn.ul.ie> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11mm: bootmem: fix free_all_bootmem_core() with odd bitmap alignmentMax Filippov
Currently free_all_bootmem_core ignores that node_min_pfn may be not multiple of BITS_PER_LONG. Eg commit 6dccdcbe2c3e ("mm: bootmem: fix checking the bitmap when finally freeing bootmem") shifts vec by lower bits of start instead of lower bits of idx. Also if (IS_ALIGNED(start, BITS_PER_LONG) && vec == ~0UL) assumes that vec bit 0 corresponds to start pfn, which is only true when node_min_pfn is a multiple of BITS_PER_LONG. Also loop in the else clause can double-free pages (e.g. with node_min_pfn == start == 1, map[0] == ~0 on 32-bit machine page 32 will be double-freed). This bug causes the following message during xtensa kernel boot: bootmem::free_all_bootmem_core nid=0 start=1 end=8000 BUG: Bad page state in process swapper pfn:00001 page:d04bd020 count:0 mapcount:-127 mapping: (null) index:0x2 page flags: 0x0() Call Trace: bad_page+0x8c/0x9c free_pages_prepare+0x5e/0x88 free_hot_cold_page+0xc/0xa0 __free_pages+0x24/0x38 __free_pages_bootmem+0x54/0x56 free_all_bootmem_core$part$11+0xeb/0x138 free_all_bootmem+0x46/0x58 mem_init+0x25/0xa4 start_kernel+0x11e/0x25c should_never_return+0x0/0x3be7 The fix is the following: - always align vec so that its bit 0 corresponds to start - provide BITS_PER_LONG bits in vec, if those bits are available in the map - don't free pages past next start position in the else clause. Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Prasad Koya <prasad.koya@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11mm: use aligned zone start for pfn_to_bitidx calculationLaura Abbott
The current calculation in pfn_to_bitidx assumes that (pfn - zone->zone_start_pfn) >> pageblock_order will return the same bit for all pfn in a pageblock. If zone_start_pfn is not aligned to pageblock_nr_pages, this may not always be correct. Consider the following with pageblock order = 10, zone start 2MB: pfn | pfn - zone start | (pfn - zone start) >> page block order ---------------------------------------------------------------- 0x26000 | 0x25e00 | 0x97 0x26100 | 0x25f00 | 0x97 0x26200 | 0x26000 | 0x98 0x26300 | 0x26100 | 0x98 This means that calling {get,set}_pageblock_migratetype on a single page will not set the migratetype for the full block. Fix this by rounding down zone_start_pfn when doing the bitidx calculation. For our use case, the effects of this bug were mostly tied to the fact that CMA allocations would either take a long time or fail to happen. Depending on the driver using CMA, this could result in anything from visual glitches to application failures. Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11mm: compaction: fix echo 1 > compact_memory return error issueJason Liu
when run the folloing command under shell, it will return error sh/$ echo 1 > /proc/sys/vm/compact_memory sh/$ sh: write error: Bad address After strace, I found the following log: ... write(1, "1\n", 2) = 3 write(1, "", 4294967295) = -1 EFAULT (Bad address) write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address ) = 31 This tells system return 3(COMPACT_COMPLETE) after write data to compact_memory. The fix is to make the system just return 0 instead 3(COMPACT_COMPLETE) from sysctl_compaction_handler after compaction_nodes finished. Signed-off-by: Jason Liu <r64343@freescale.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11mm: memblock: fix wrong memmove size in memblock_merge_regions()Lin Feng
The memmove span covers from (next+1) to the end of the array, and the index of next is (i+1), so the index of (next+1) is (i+2). So the size of remaining array elements is (type->cnt - (i + 2)). Since the remaining elements of the memblock array are move forward by one element and there is only one additional element caused by this bug. So there won't be any write overflow here but read overflow. It may read one more element out of the array address if the array happens to be full. Commonly it doesn't matter at all but if the array happens to be located at the end a memblock, it may cause a invalid read operation for the physical address doesn't exist. There are 2 *happens to be* here, so I think the probability is quite low, I don't know if any guy is haunted by this bug before. Mostly I think it's user-invisible. Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-11mm: migrate: check page_count of THP before migratingMel Gorman
Hugh Dickins pointed out that migrate_misplaced_transhuge_page() does not check page_count before migrating like base page migration and khugepage. He could not see why this was safe and he is right. The potential impact of the bug is avoided due to the limitations of NUMA balancing. The page_mapcount() check ensures that only a single address space is using this page and as THPs are typically private it should not be possible for another address space to fault it in parallel. If the address space has one associated task then it's difficult to have both a GUP pin and be referencing the page at the same time. If there are multiple tasks then a buggy scenario requires that another thread be accessing the page while the direct IO is in flight. This is dodgy behaviour as there is a possibility of corruption with or without THP migration. It would be While we happen to be safe for the most part it is shoddy to depend on such "safety" so this patch checks the page count similar to anonymous pages. Note that this does not mean that the page_mapcount() check can go away. If we were to remove the page_mapcount() check the the THP would have to be unmapped from all referencing PTEs, replaced with migration PTEs and restored properly afterwards. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-09mm: reinstante dropped pmd_trans_splitting() checkLinus Torvalds
The check for a pmd being in the process of being split was dropped by mistake by commit d10e63f29488 ("mm: numa: Create basic numa page hinting infrastructure"). Put it back. Reported-by: Dave Jones <davej@redhat.com> Debugged-by: Hillf Danton <dhillf@gmail.com> Acked-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Kirill Shutemov <kirill@shutemov.name> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-04mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPTMichal Hocko
Since commit e303297e6c3a ("mm: extended batches for generic mmu_gather") we are batching pages to be freed until either tlb_next_batch cannot allocate a new batch or we are done. This works just fine most of the time but we can get in troubles with non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY) on large machines where too aggressive batching might lead to soft lockups during process exit path (exit_mmap) because there are no scheduling points down the free_pages_and_swap_cache path and so the freeing can take long enough to trigger the soft lockup. The lockup is harmless except when the system is setup to panic on softlockup which is not that unusual. The simplest way to work around this issue is to limit the maximum number of batches in a single mmu_gather. 10k of collected pages should be safe to prevent from soft lockups (we would have 2ms for one) even if they are all freed without an explicit scheduling point. This patch doesn't add any new explicit scheduling points because it relies on zap_pmd_range during page tables zapping which calls cond_resched per PMD. The following lockup has been reported for 3.0 kernel with a huge process (in order of hundreds gigs but I do know any more details). BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053] Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod Supported: Yes CPU 56 Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7 RIP: 0010: _raw_spin_unlock_irqrestore+0x8/0x10 RSP: 0018:ffff883ec1037af0 EFLAGS: 00000206 RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80 RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206 RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400 R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e FS: 00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100) Call Trace: release_pages+0xc5/0x260 free_pages_and_swap_cache+0x9d/0xc0 tlb_flush_mmu+0x5c/0x80 tlb_finish_mmu+0xe/0x50 exit_mmap+0xbd/0x120 mmput+0x49/0x120 exit_mm+0x122/0x160 do_exit+0x17a/0x430 do_group_exit+0x3d/0xb0 get_signal_to_deliver+0x247/0x480 do_signal+0x71/0x1b0 do_notify_resume+0x98/0xb0 int_signal+0x12/0x17 DWARF2 unwinder stuck at int_signal+0x12/0x17 Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> [3.0+] Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-04mm: fix zone_watermark_ok_safe() accounting of isolated pagesBartlomiej Zolnierkiewicz
Commit 702d1a6e0766 ("memory-hotplug: fix kswapd looping forever problem") added an isolated pageblocks counter (nr_pageblock_isolate in struct zone) and used it to adjust free pages counter in zone_watermark_ok_safe() to prevent kswapd looping forever problem. Then later, commit 2139cbe627b8 ("cma: fix counting of isolated pages") fixed accounting of isolated pages in global free pages counter. It made the previous zone_watermark_ok_safe() fix unnecessary and potentially harmful (cause now isolated pages may be accounted twice making free pages counter incorrect). This patch removes the special isolated pageblocks counter altogether which fixes zone_watermark_ok_safe() free pages check. Reported-by: Tomasz Stanislawski <t.stanislaws@samsung.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-03MM: vmscan: remove __devinit attribute.Greg Kroah-Hartman
CONFIG_HOTPLUG is going away as an option. As a result, the __dev* markings need to be removed. This change removes the use of __devinit from the file. Based on patches originally written by Bill Pemberton, but redone by me in order to handle some of the coding style issues better, by hand. Cc: Bill Pemberton <wfp5p@virginia.edu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-02mm: mempolicy: Convert shared_policy mutex to spinlockMel Gorman
Sasha was fuzzing with trinity and reported the following problem: BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main 2 locks held by trinity-main/6361: #0: (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0 #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0 Pid: 6361, comm: trinity-main Tainted: G W 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74 Call Trace: __might_sleep+0x1c3/0x1e0 mutex_lock_nested+0x29/0x50 mpol_shared_policy_lookup+0x2e/0x90 shmem_get_policy+0x2e/0x30 get_vma_policy+0x5a/0xa0 mpol_misplaced+0x41/0x1d0 handle_pte_fault+0x465/0x6a0 This was triggered by a different version of automatic NUMA balancing but in theory the current version is vunerable to the same problem. do_numa_page -> numa_migrate_prep -> mpol_misplaced -> get_vma_policy -> shmem_get_policy It's very unlikely this will happen as shared pages are not marked pte_numa -- see the page_mapcount() check in change_pte_range() -- but it is possible. To address this, this patch restores sp->lock as originally implemented by Kosaki Motohiro. In the path where get_vma_policy() is called, it should not be calling sp_alloc() so it is not necessary to treat the PTL specially. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-02mempolicy: remove arg from mpol_parse_str, mpol_to_strHugh Dickins
Remove the unused argument (formerly no_context) from mpol_parse_str() and from mpol_to_str(). Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-02tmpfs mempolicy: fix /proc/mounts corrupting memoryHugh Dickins
Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA mempolicy testing. Very nasty. Reading /proc/mounts, /proc/pid/mounts or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often in a page table (causing "Bad swap" or "Bad page map" warning or "Bad pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere worse. "mpol=prefer" and "mpol=prefer:Node" are equally toxic. Recent NUMA enhancements are not to blame: this dates back to 2.6.35, when commit e17f74af351c "mempolicy: don't call mpol_set_nodemask() when no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(), which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags. With slab poisoning, you can then rely on mpol_to_str() to set the bit for node 0x6b6b, probably in the next page above the caller's stack. mpol_parse_str() is only called from shmem_parse_options(): no_context is always true, so call it unused for now, and remove !no_context code. Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might expect. Then mpol_to_str() can ignore its no_context argument also, the mpol being appropriately initialized whether contextualized or not. Rename its no_context unused too, and let subsequent patch remove them (that's not needed for stable backporting, which would involve rejects). I don't understand why MPOL_LOCAL is described as a pseudo-policy: it's a reasonable policy which suffers from a confusing implementation in terms of MPOL_PREFERRED with MPOL_F_LOCAL. I believe this would be much more robust if MPOL_LOCAL were recognized in switch statements throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly empty) nodes mask like everyone else, instead of its preferred_node variant (I presume an optimization from the days before MPOL_LOCAL). But that would take me too long to get right and fully tested. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-28mm: fix null pointer dereference in wait_iff_congested()Zlatko Calusic
An unintended consequence of commit 4ae0a48b5efc ("mm: modify pgdat_balanced() so that it also handles order-0") is that wait_iff_congested() can now be called with NULL 'struct zone *' producing kernel oops like this: BUG: unable to handle kernel NULL pointer dereference IP: [<ffffffff811542d9>] wait_iff_congested+0x59/0x140 This trivial patch fixes it. Reported-by: Zhouping Liu <zliu@redhat.com> Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-23mm: modify pgdat_balanced() so that it also handles order-0Zlatko Calusic
Teach pgdat_balanced() about order-0 allocations so that we can simplify code in a few places in vmstat.c. Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>