summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2014-05-20mm/memblock: add physical memory listPhilipp Hachtmann
Add the physmem list to the memblock structure. This list only exists if HAVE_MEMBLOCK_PHYS_MAP is selected and contains the unmodified list of physically available memory. It differs from the memblock memory list as it always contains all memory ranges even if the memory has been restricted, e.g. by use of the mem= kernel parameter. Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-05-20mm/memblock: Do some refactoring, enhance APIPhilipp Hachtmann
Refactor the memblock code and extend the memblock API to make it more flexible. With the extended API it is simple to define and work with additional memory lists. The static functions memblock_add_region and __memblock_remove are renamed to memblock_add_range and meblock_remove_range and added to the memblock API. The __next_free_mem_range and __next_free_mem_range_rev functions are replaced with calls to the more generic list walkers __next_mem_range and __next_mem_range_rev. To walk an arbitrary memory list two new macros for_each_mem_range and for_each_mem_range_rev are added. These new macros are used to define for_each_free_mem_range and for_each_free_mem_range_reverse. Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-05-20Merge tag 'metag-for-v3.15-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag Pull Metag architecture and related fixes from James Hogan: "Mostly fixes for metag and parisc relating to upgrowing stacks. - Fix missing compiler barriers in metag memory barriers. - Fix BUG_ON on metag when RLIMIT_STACK hard limit is increased beyond safe value. - Make maximum stack size configurable. This reduces the default user stack size back to 80MB (especially on parisc after their removal of _STK_LIM_MAX override). This only affects metag and parisc. - Remove metag _STK_LIM_MAX override to match other arches and follow parisc, now that it is safe to do so (due to the BUG_ON fix mentioned above). - Finally now that both metag and parisc _STK_LIM_MAX overrides have been removed, it makes sense to remove _STK_LIM_MAX altogether" * tag 'metag-for-v3.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag: asm-generic: remove _STK_LIM_MAX metag: Remove _STK_LIM_MAX override parisc,metag: Do not hardcode maximum userspace stack size metag: Reduce maximum stack size to 256MB metag: fix memory barriers
2014-05-19block: move mm/bounce.c to block/Jens Axboe
Continue moving some of the block files that are scattered around. bounce.c contains only code for bouncing the contents of a bio. It's block proper code, not mm code. Suggested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2014-05-16memcg: update memcg_has_children() to use css_next_child()Tejun Heo
Currently, memcg_has_children() and mem_cgroup_hierarchy_write() directly test cgroup->children for list emptiness. It's semantically correct in traditional hierarchies as it actually wants to test for any children dead or alive; however, cgroup->children is not a published field and scheduled to go away. This patch moves out .use_hierarchy test out of memcg_has_children() and updates it to use css_next_child() to test whether there exists any children. With .use_hierarchy test moved out, it can also be used by mem_cgroup_hierarchy_write(). A side note: As .use_hierarchy is going away, it doesn't really matter but I'm not sure about how it's used in __memcg_activate_kmem(). The condition tested by memcg_has_children() is mushy when seen from userland as its result is affected by dead csses which aren't visible from userland. I think the rule would be a lot clearer if we have a dedicated "freshly minted" flag which gets cleared when the first task is migrated into it or the first child is created and then gate activation with that. v2: Added comment noting that testing use_hierarchy is the responsibility of the callers of memcg_has_children() as suggested by Michal Hocko. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-05-16memcg: remove tasks/children test from mem_cgroup_force_empty()Michal Hocko
Tejun has correctly pointed out that tasks/children test in mem_cgroup_force_empty is not correct because there is no other locking which preserves this state throughout the rest of the function so both new tasks can join the group or new children groups can be added while somebody is writing to memory.force_empty. A new task would break mem_cgroup_reparent_charges expectation that all failures as described by mem_cgroup_force_empty_list are temporal and there is no way out. The main use case for the knob as described by Documentation/cgroups/memory.txt is to: " The typical use case for this interface is before calling rmdir(). Because rmdir() moves all pages to parent, some out-of-use page caches can be moved to the parent. If you want to avoid that, force_empty will be useful. " This means that reparenting is not really required as rmdir will reparent pages implicitly from the safe context. If we remove it from mem_cgroup_force_empty then we are safe even with existing tasks because the number of reclaim attempts is bounded. Moreover the knob still does what the documentation claims (modulo reparenting which doesn't make any difference) and users might expect. Longterm we want to deprecate the whole knob and put the reparented pages to the tail of parent LRU during cgroup removal. tj: Removed unused variable @cgrp from mem_cgroup_force_empty() Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-05-16cgroup: remove css_parent()Tejun Heo
cgroup in general is moving towards using cgroup_subsys_state as the fundamental structural component and css_parent() was introduced to convert from using cgroup->parent to css->parent. It was quite some time ago and we're moving forward with making css more prominent. This patch drops the trivial wrapper css_parent() and let the users dereference css->parent. While at it, explicitly mark fields of css which are public and immutable. v2: New usage from device_cgroup.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Johannes Weiner <hannes@cmpxchg.org>
2014-05-15parisc,metag: Do not hardcode maximum userspace stack sizeHelge Deller
This patch affects only architectures where the stack grows upwards (currently parisc and metag only). On those do not hardcode the maximum initial stack size to 1GB for 32-bit processes, but make it configurable via a config option. The main problem with the hardcoded stack size is, that we have two memory regions which grow upwards: stack and heap. To keep most of the memory available for heap in a flexmap memory layout, it makes no sense to hard allocate up to 1GB of the memory for stack which can't be used as heap then. This patch makes the stack size for 32-bit processes configurable and uses 80MB as default value which has been in use during the last few years on parisc and which hasn't showed any problems yet. Signed-off-by: Helge Deller <deller@gmx.de> Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: linux-parisc@vger.kernel.org Cc: linux-metag@vger.kernel.org Cc: John David Anglin <dave.anglin@bell.net>
2014-05-13cgroup: replace cftype->trigger() with cftype->write()Tejun Heo
cftype->trigger() is pointless. It's trivial to ignore the input buffer from a regular ->write() operation. Convert all ->trigger() users to ->write() and remove ->trigger(). This patch doesn't introduce any visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz>
2014-05-13cgroup: replace cftype->write_string() with cftype->write()Tejun Heo
Convert all cftype->write_string() users to the new cftype->write() which maps directly to kernfs write operation and has full access to kernfs and cgroup contexts. The conversions are mostly mechanical. * @css and @cft are accessed using of_css() and of_cft() accessors respectively instead of being specified as arguments. * Should return @nbytes on success instead of 0. * @buf is not trimmed automatically. Trim if necessary. Note that blkcg and netprio don't need this as the parsers already handle whitespaces. cftype->write_string() has no user left after the conversions and removed. While at it, remove unnecessary local variable @p in cgroup_subtree_control_write() and stale comment about CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c. This patch doesn't introduce any visible behavior changes. v2: netprio was missing from conversion. Converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net>
2014-05-13cgroup: rename css_tryget*() to css_tryget_online*()Tejun Heo
Unlike the more usual refcnting, what css_tryget() provides is the distinction between online and offline csses instead of protection against upping a refcnt which already reached zero. cgroup is planning to provide actual tryget which fails if the refcnt already reached zero. Let's rename the existing trygets so that they clearly indicate that they're onliness. I thought about keeping the existing names as-are and introducing new names for the planned actual tryget; however, given that each controller participates in the synchronization of the online state, it seems worthwhile to make it explicit that these functions are about on/offline state. Rename css_tryget() to css_tryget_online() and css_tryget_from_dir() to css_tryget_online_from_dir(). This is pure rename. v2: cgroup_freezer grew new usages of css_tryget(). Update accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
2014-05-13Merge branch 'for-3.15-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull a percpu fix from Tejun Heo: "Fix for a percpu allocator bug where it could try to kfree() a memory region allocated using vmalloc(). The bug has been there for years now and is unlikely to have ever triggered given the size of struct pcpu_chunk. It's still theoretically possible and the fix is simple and safe enough, so the patch is marked with -stable" * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()
2014-05-12ext4: fix data integrity sync in ordered modeNamjae Jeon
When we perform a data integrity sync we tag all the dirty pages with PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages. Later we check for this tag in write_cache_pages_da and creates a struct mpage_da_data containing contiguously indexed pages tagged with this tag and sync these pages with a call to mpage_da_map_and_submit. This process is done in while loop until all the PAGECACHE_TAG_TOWRITE pages are synced. We also do journal start and stop in each iteration. journal_stop could initiate journal commit which would call ext4_writepage which in turn will call ext4_bio_write_page even for delayed OR unwritten buffers. When ext4_bio_write_page is called for such buffers, even though it does not sync them but it clears the PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages are also not synced by the currently running data integrity sync. We will end up with dirty pages although sync is completed. This could cause a potential data loss when the sync call is followed by a truncate_pagecache call, which is exactly the case in collapse_range. (It will cause generic/127 failure in xfstests) To avoid this issue, we can use set_page_writeback_keepwrite instead of set_page_writeback, which doesn't clear TOWRITE tag. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2014-05-11mm, thp: close race between mremap() and split_huge_page()Kirill A. Shutemov
It's critical for split_huge_page() (and migration) to catch and freeze all PMDs on rmap walk. It gets tricky if there's concurrent fork() or mremap() since usually we copy/move page table entries on dup_mm() or move_page_tables() without rmap lock taken. To get it work we rely on rmap walk order to not miss any entry. We expect to see destination VMA after source one to work correctly. But after switching rmap implementation to interval tree it's not always possible to preserve expected walk order. It works fine for dup_mm() since new VMA has the same vma_start_pgoff() / vma_last_pgoff() and explicitly insert dst VMA after src one with vma_interval_tree_insert_after(). But on move_vma() destination VMA can be merged into adjacent one and as result shifted left in interval tree. Fortunately, we can detect the situation and prevent race with rmap walk by moving page table entries under rmap lock. See commit 38a76013ad80. Problem is that we miss the lock when we move transhuge PMD. Most likely this bug caused the crash[1]. [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473 Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michel Lespinasse <walken@google.com> Cc: Dave Jones <davej@redhat.com> Cc: David Miller <davem@davemloft.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-11mm: postpone the disabling of kmemleak early loggingCatalin Marinas
Commit 8910ae896c8c ("kmemleak: change some global variables to int"), in addition to the atomic -> int conversion, moved the disabling of kmemleak_early_log to the beginning of the kmemleak_init() function, before the full kmemleak tracing is actually enabled. In this small window, kmem_cache_create() is called by kmemleak which triggers additional memory allocation that are not traced. This patch restores the original logic with kmemleak_early_log disabling when kmemleak is fully functional. Fixes: 8910ae896c8c (kmemleak: change some global variables to int) Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-07mm/numa: Remove BUG_ON() in __handle_mm_fault()Rik van Riel
Changing PTEs and PMDs to pte_numa & pmd_numa is done with the mmap_sem held for reading, which means a pmd can be instantiated and turned into a numa one while __handle_mm_fault() is examining the value of old_pmd. If that happens, __handle_mm_fault() should just return and let the page fault retry, instead of throwing an oops. This is handled by the test for pmd_trans_huge(*pmd) below. Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Sunil Pandey <sunil.k.pandey@intel.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: linux-mm@kvack.org Cc: lwoodman@redhat.com Cc: dave.hansen@intel.com Link: http://lkml.kernel.org/r/20140429153615.2d72098e@annuminas.surriel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-07Merge branch 'sched/urgent' into sched/core, to avoid conflictsIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-05-06bio_vec-backed iov_iterAl Viro
New variant of iov_iter - ITER_BVEC in iter->type, backed with bio_vec array instead of iovec one. Primitives taught to deal with such beasts, __swap_write() switched to using that kind of iov_iter. Note that bio_vec is just a <page, offset, length> triple - there's nothing block-specific about it. I've left the definition where it was, but took it from under ifdef CONFIG_BLOCK. Next target: ->splice_write()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06optimize copy_page_{to,from}_iter()Al Viro
if we'd ended up in the end of a segment, jump to the beginning of the next one (iov_offset = 0, iov++), rather than having the next primitive deal with that. Ought to be folded back... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06bury generic_file_aio_{read,write}Al Viro
no callers left Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06new helper: copy_page_from_iter()Al Viro
parallel to copy_page_to_iter(). pipe_write() switched to it (and became ->write_iter()). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06bury __generic_file_aio_write()Al Viro
all users converted to __generic_file_write_iter() now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06write_iter variants of {__,}generic_file_aio_write()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06shmem: switch to ->read_iter()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06iov_iter_truncate()Al Viro
Now It Can Be Done(tm) - we don't need to do iov_shorten() in generic_file_direct_write() anymore, now that all ->direct_IO() instances are converted to proper iov_iter methods and honour iter->count and iter->iov_offset properly. Get rid of count/ocount arguments of generic_file_direct_write(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06new helper: iov_iter_get_pages_alloc()Al Viro
same as iov_iter_get_pages(), except that pages array is allocated (kmalloc if possible, vmalloc if that fails) and left for caller to free. Lustre and NFS ->direct_IO() switched to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06new helper: iov_iter_npages()Al Viro
counts the pages covered by iov_iter, up to given limit. do_block_direct_io() and fuse_iter_npages() switched to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06new helper: iov_iter_get_pages()Al Viro
iov_iter_get_pages(iter, pages, maxsize, &start) grabs references pinning the pages of up to maxsize of (contiguous) data from iter. Returns the amount of memory grabbed or -error. In case of success, the requested area begins at offset start in pages[0] and runs through pages[1], etc. Less than requested amount might be returned - either because the contiguous area in the beginning of iterator is smaller than requested, or because the kernel failed to pin that many pages. direct-io.c switched to using iov_iter_get_pages() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06start adding the tag to iov_iterAl Viro
For now, just use the same thing we pass to ->direct_IO() - it's all iovec-based at the moment. Pass it explicitly to iov_iter_init() and account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO() uses. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06new helper: generic_file_read_iter()Al Viro
iov_iter-using variant of generic_file_aio_read(). Some callers converted. Note that it's still not quite there for use as ->read_iter() - we depend on having zero iter->iov_offset in O_DIRECT case. Fortunately, that's true for all converted callers (and for generic_file_aio_read() itself). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06new primitive: iov_iter_alignment()Al Viro
returns the value aligned as badly as the worst remaining segment in iov_iter is. Use instead of open-coded equivalents. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06give ->direct_IO() a copy of iov_iterAl Viro
the thing is, we want to advance what's given to ->direct_IO() as we are forming the request; however, the callers care about the amount of data actually transferred, not the amount we tried to transfer. It's more convenient to allow ->direct_IO() instances do use iov_iter_advance() on the copy of iov_iter, leaving the actual advancing of the original to caller. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06get rid of pointless iov_length() in ->direct_IO()Al Viro
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06pass iov_iter to ->direct_IO()Al Viro
unmodified, for now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06kill generic_segment_checks()Al Viro
all callers of ->aio_read() and ->aio_write() have iov/nr_segs already checked - generic_segment_checks() done after that is just an odd way to spell iov_length(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06generic_file_direct_write(): switch to iov_iterAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06kill iov_iter_copy_from_user()Al Viro
all callers can use copy_page_from_iter() and it actually simplifies them. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06Merge branch 'akpm' (incoming from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "13 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: agp: info leak in agpioc_info_wrap() fs/affs/super.c: bugfix / double free fanotify: fix -EOVERFLOW with large files on 64-bit slub: use sysfs'es release mechanism for kmem_cache revert "mm: vmscan: do not swap anon pages just because free+file is low" autofs: fix lockref lookup mm: filemap: update find_get_pages_tag() to deal with shadow entries mm/compaction: make isolate_freepages start at pageblock boundary MAINTAINERS: zswap/zbud: change maintainer email address mm/page-writeback.c: fix divide by zero in pos_ratio_polynom hugetlb: ensure hugepage access is denied if hugepages are not supported slub: fix memcg_propagate_slab_attrs drivers/rtc/rtc-pcf8523.c: fix month definition
2014-05-06slub: use sysfs'es release mechanism for kmem_cacheChristoph Lameter
debugobjects warning during netfilter exit: ------------[ cut here ]------------ WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0() ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20 Modules linked in: CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984 Workqueue: netns cleanup_net Call Trace: dump_stack+0x52/0x87 warn_slowpath_common+0x8c/0xc0 warn_slowpath_fmt+0x46/0x50 debug_print_object+0x8d/0xb0 __debug_check_no_obj_freed+0xa5/0x220 debug_check_no_obj_freed+0x15/0x20 kmem_cache_free+0x197/0x340 kmem_cache_destroy+0x86/0xe0 nf_conntrack_cleanup_net_list+0x131/0x170 nf_conntrack_pernet_exit+0x5d/0x70 ops_exit_list+0x5e/0x70 cleanup_net+0xfb/0x1c0 process_one_work+0x338/0x550 worker_thread+0x215/0x350 kthread+0xe7/0xf0 ret_from_fork+0x7c/0xb0 Also during dcookie cleanup: WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0() ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20 Modules linked in: CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408 Call Trace: dump_stack (lib/dump_stack.c:52) warn_slowpath_common (kernel/panic.c:430) warn_slowpath_fmt (kernel/panic.c:445) debug_print_object (lib/debugobjects.c:262) __debug_check_no_obj_freed (lib/debugobjects.c:697) debug_check_no_obj_freed (lib/debugobjects.c:726) kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717) kmem_cache_destroy (mm/slab_common.c:363) dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343) event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153) __fput (fs/file_table.c:217) ____fput (fs/file_table.c:253) task_work_run (kernel/task_work.c:125 (discriminator 1)) do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751) int_signal (arch/x86/kernel/entry_64.S:807) Sysfs has a release mechanism. Use that to release the kmem_cache structure if CONFIG_SYSFS is enabled. Only slub is changed - slab currently only supports /proc/slabinfo and not /sys/kernel/slab/*. We talked about adding that and someone was working on it. [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build] [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more] Signed-off-by: Christoph Lameter <cl@linux.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Greg KH <greg@kroah.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06revert "mm: vmscan: do not swap anon pages just because free+file is low"Johannes Weiner
This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages just because free+file is low") because it introduced a regression in mostly-anonymous workloads, where reclaim would become ineffective and trap every allocating task in direct reclaim. The problem is that there is a runaway feedback loop in the scan balance between file and anon, where the balance tips heavily towards a tiny thrashing file LRU and anonymous pages are no longer being looked at. The commit in question removed the safe guard that would detect such situations and respond with forced anonymous reclaim. This commit was part of a series to fix premature swapping in loads with relatively little cache, and while it made a small difference, the cure is obviously worse than the disease. Revert it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06mm: filemap: update find_get_pages_tag() to deal with shadow entriesJohannes Weiner
Dave Jones reports the following crash when find_get_pages_tag() runs into an exceptional entry: kernel BUG at mm/filemap.c:1347! RIP: find_get_pages_tag+0x1cb/0x220 Call Trace: find_get_pages_tag+0x36/0x220 pagevec_lookup_tag+0x21/0x30 filemap_fdatawait_range+0xbe/0x1e0 filemap_fdatawait+0x27/0x30 sync_inodes_sb+0x204/0x2a0 sync_inodes_one_sb+0x19/0x20 iterate_supers+0xb2/0x110 sys_sync+0x44/0xb0 ia32_do_call+0x13/0x13 1343 /* 1344 * This function is never used on a shmem/tmpfs 1345 * mapping, so a swap entry won't be found here. 1346 */ 1347 BUG(); After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") this comment and BUG() are out of date because exceptional entries can now appear in all mappings - as shadows of recently evicted pages. However, as Hugh Dickins notes, "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably any other PAGECACHE_TAG_*) to appear on an exceptional entry. I expect it comes down to an occasional race in RCU lookup of the radix_tree: lacking absolute synchronization, we might sometimes catch an exceptional entry, with the tag which really belongs with the unexceptional entry which was there an instant before." And indeed, not only is the tree walk lockless, the tags are also read in chunks, one radix tree node at a time. There is plenty of time for page reclaim to swoop in and replace a page that was already looked up as tagged with a shadow entry. Remove the BUG() and update the comment. While reviewing all other lookup sites for whether they properly deal with shadow entries of evicted pages, update all the comments and fix memcg file charge moving to not miss shmem/tmpfs swapcache pages. Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06mm/compaction: make isolate_freepages start at pageblock boundaryVlastimil Babka
The compaction freepage scanner implementation in isolate_freepages() starts by taking the current cc->free_pfn value as the first pfn. In a for loop, it scans from this first pfn to the end of the pageblock, and then subtracts pageblock_nr_pages from the first pfn to obtain the first pfn for the next for loop iteration. This means that when cc->free_pfn starts at offset X rather than being aligned on pageblock boundary, the scanner will start at offset X in all scanned pageblock, ignoring potentially many free pages. Currently this can happen when a) zone's end pfn is not pageblock aligned, or b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE enabled and a hole spanning the beginning of a pageblock This patch fixes the problem by aligning the initial pfn in isolate_freepages() to pageblock boundary. This also permits replacing the end-of-pageblock alignment within the for loop with a simple pageblock_nr_pages increment. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06mm/page-writeback.c: fix divide by zero in pos_ratio_polynomRik van Riel
It is possible for "limit - setpoint + 1" to equal zero, after getting truncated to a 32 bit variable, and resulting in a divide by zero error. Using the fully 64 bit divide functions avoids this problem. It also will cause pos_ratio_polynom() to return the correct value when (setpoint - limit) exceeds 2^32. Also uninline pos_ratio_polynom, at Andrew's request. Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06hugetlb: ensure hugepage access is denied if hugepages are not supportedNishanth Aravamudan
Currently, I am seeing the following when I `mount -t hugetlbfs /none /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's related to the fact that hugetlbfs is properly not correctly setting itself up in this state?: Unable to handle kernel paging request for data at address 0x00000031 Faulting instruction address: 0xc000000000245710 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2048 NUMA pSeries .... In KVM guests on Power, in a guest not backed by hugepages, we see the following: AnonHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 64 kB HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages are not supported at boot-time, but this is only checked in hugetlb_init(). Extract the check to a helper function, and use it in a few relevant places. This does make hugetlbfs not supported (not registered at all) in this environment. I believe this is fine, as there are no valid hugepages and that won't change at runtime. [akpm@linux-foundation.org: use pr_info(), per Mel] [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined] Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06slub: fix memcg_propagate_slab_attrsVladimir Davydov
After creating a cache for a memcg we should initialize its sysfs attrs with the values from its parent. That's what memcg_propagate_slab_attrs is for. Currently it's broken - we clearly muddled root-vs-memcg caches there. Let's fix it up. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "dcache fixes + kvfree() (uninlined, exported by mm/util.c) + posix_acl bugfix from hch" The dcache fixes are for a subtle LRU list corruption bug reported by Miklos Szeredi, where people inside IBM saw list corruptions with the LTP/host01 test. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nick kvfree() from apparmor posix_acl: handle NULL ACL in posix_acl_equiv_mode dcache: don't need rcu in shrink_dentry_list() more graceful recovery in umount_collect() don't remove from shrink list in select_collect() dentry_kill(): don't try to remove from shrink list expand the call of dentry_lru_del() in dentry_kill() new helper: dentry_free() fold try_prune_one_dentry() fold d_kill() and d_free() fix races between __d_instantiate() and checks of dentry flags
2014-05-06nick kvfree() from apparmorAl Viro
too many places open-code it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-05slab: Fix off by one in object max number tests.David Miller
If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and likewise if freelist_idx_t is a short, then it should be 65535 not 65536. This was leading to all kinds of random crashes on sparc64 where PAGE_SIZE is 8192. One problem shown was that if spinlock debugging was enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the same cpu already holding a lock it shouldn't hold, or the lock belonging to a completely unrelated process. Fixes: a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-05slab: fix the type of the index on freelist index accessorJoonsoo Kim
Commit a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab") changes the size of freelist index and also changes prototype of accessor function to freelist index. And there was a mistake. The mistake is that although it changes the size of freelist index correctly, it changes the size of the index of freelist index incorrectly. With patch, freelist index can be 1 byte or 2 bytes, that means that num of object on on a slab can be more than 255. So we need more than 1 byte for the index to find the index of free object on freelist. But, above patch makes this index type 1 byte, so slab which have more than 255 objects cannot work properly and in consequence of it, the system cannot boot. This issue was reported by Steven King on m68knommu which would use 2 bytes freelist index: https://lkml.org/lkml/2014/4/16/433 To fix is easy. To change the type of the index of freelist index on accessor functions is enough to fix this bug. Although 2 bytes is enough, I use 4 bytes since it have no bad effect and make things more easier. This fix was suggested and tested by Steven in his original report. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-and-acked-by: Steven King <sfking@fdwdc.com> Acked-by: Christoph Lameter <cl@linux.com> Tested-by: James Hogan <james.hogan@imgtec.com> Tested-by: David Miller <davem@davemloft.net> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-05-05mm: Fix printk typo in dmapool.cHiroshige Sato
Fix printk typo in dmapool.c Signed-off-by: Hiroshige Sato <sato.vintage@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>