From 10fbcf4c6cb122005cdf36fc24d7683da92c7a27 Mon Sep 17 00:00:00 2001 From: Kay Sievers Date: Wed, 21 Dec 2011 14:48:43 -0800 Subject: convert 'memory' sysdev_class to a regular subsystem This moves the 'memory sysdev_class' over to a regular 'memory' subsystem and converts the devices to regular devices. The sysdev drivers are implemented as subsystem interfaces now. After all sysdev classes are ported to regular driver core entities, the sysdev implementation will be entirely removed from the kernel. Signed-off-by: Kay Sievers Signed-off-by: Greg Kroah-Hartman --- mm/compaction.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index 899d95638586..1253d7ac332b 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -721,23 +721,23 @@ int sysctl_extfrag_handler(struct ctl_table *table, int write, } #if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA) -ssize_t sysfs_compact_node(struct sys_device *dev, - struct sysdev_attribute *attr, +ssize_t sysfs_compact_node(struct device *dev, + struct device_attribute *attr, const char *buf, size_t count) { compact_node(dev->id); return count; } -static SYSDEV_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node); +static DEVICE_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node); int compaction_register_node(struct node *node) { - return sysdev_create_file(&node->sysdev, &attr_compact); + return device_create_file(&node->dev, &dev_attr_compact); } void compaction_unregister_node(struct node *node) { - return sysdev_remove_file(&node->sysdev, &attr_compact); + return device_remove_file(&node->dev, &dev_attr_compact); } #endif /* CONFIG_SYSFS && CONFIG_NUMA */ -- cgit v1.2.3 From 31b8384a555d94c78e2ea2284a323cb985441f60 Mon Sep 17 00:00:00 2001 From: Hillf Danton Date: Tue, 10 Jan 2012 15:07:59 -0800 Subject: mm: compaction: push isolate search base of compact control one pfn ahead After isolated the current pfn will no longer be scanned and isolated if the next round is necessary, so push the isolate_migratepages search base of the given compact_control one step ahead. Signed-off-by: Hillf Danton Reviewed-by: Andrea Arcangeli Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index 1253d7ac332b..e6670c34eb49 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -365,8 +365,10 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, nr_isolated++; /* Avoid isolating too much */ - if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) + if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) { + ++low_pfn; break; + } } acct_isolated(zone, cc); -- cgit v1.2.3 From a77ebd333cd810d7b680d544be88c875131c2bd3 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Jan 2012 17:19:22 -0800 Subject: mm: compaction: allow compaction to isolate dirty pages Short summary: There are severe stalls when a USB stick using VFAT is used with THP enabled that are reduced by this series. If you are experiencing this problem, please test and report back and considering I have seen complaints from openSUSE and Fedora users on this as well as a few private mails, I'm guessing it's a widespread issue. This is a new type of USB-related stall because it is due to synchronous compaction writing where as in the past the big problem was dirty pages reaching the end of the LRU and being written by reclaim. Am cc'ing Andrew this time and this series would replace mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch. I'm also cc'ing Dave Jones as he might have merged that patch to Fedora for wider testing and ideally it would be reverted and replaced by this series. That said, the later patches could really do with some review. If this series is not the answer then a new direction needs to be discussed because as it is, the stalls are unacceptable as the results in this leader show. For testers that try backporting this to 3.1, it won't work because there is a non-obvious dependency on not writing back pages in direct reclaim so you need those patches too. Changelog since V5 o Rebase to 3.2-rc5 o Tidy up the changelogs a bit Changelog since V4 o Added reviewed-bys, credited Andrea properly for sync-light o Allow dirty pages without mappings to be considered for migration o Bound the number of pages freed for compaction o Isolate PageReclaim pages on their own LRU list This is against 3.2-rc5 and follows on from discussions on "mm: Do not stall in synchronous compaction for THP allocations" and "[RFC PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed patch eliminated stalls due to compaction which sometimes resulted in user-visible interactivity problems on browsers by simply never using sync compaction. The downside was that THP success allocation rates were lower because dirty pages were not being migrated as reported by Andrea. His approach at fixing this was nacked on the grounds that it reverted fixes from Rik merged that reduced the amount of pages reclaimed as it severely impacted his workloads performance. This series attempts to reconcile the requirements of maximising THP usage, without stalling in a user-visible fashion due to compaction or cheating by reclaiming an excessive number of pages. Patch 1 partially reverts commit 39deaf85 to allow migration to isolate dirty pages. This is because migration can move some dirty pages without blocking. Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using synchronous compaction when it should be. This is unrelated to the reported stalls but is worth fixing. Patch 3 checks if we isolated a compound page during lumpy scan and account for it properly. For the most part, this affects tracing so it's unrelated to the stalls but worth fixing. Patch 4 notes that it is possible to abort reclaim early for compaction and return 0 to the page allocator potentially entering the "may oom" path. This has not been observed in practice but the rest of the series potentially makes it easier to happen. Patch 5 adds a sync parameter to the migratepage callback and gives the callback responsibility for migrating the page without blocking if sync==false. For example, fallback_migrate_page will not call writepage if sync==false. This increases the number of pages that can be handled by asynchronous compaction thereby reducing stalls. Patch 6 restores filter-awareness to isolate_lru_page for migration. In practice, it means that pages under writeback and pages without a ->migratepage callback will not be isolated for migration. Patch 7 avoids calling direct reclaim if compaction is deferred but makes sure that compaction is only deferred if sync compaction was used. Patch 8 introduces a sync-light migration mechanism that sync compaction uses. The objective is to allow some stalls but to not call ->writepage which can lead to significant user-visible stalls. Patch 9 notes that while we want to abort reclaim ASAP to allow compation to go ahead that we leave a very small window of opportunity for compaction to run. This patch allows more pages to be freed by reclaim but bounds the number to a reasonable level based on the high watermark on each zone. Patch 10 allows slabs to be shrunk even after compaction_ready() is true for one zone. This is to avoid a problem whereby a single small zone can abort reclaim even though no pages have been reclaimed and no suitably large zone is in a usable state. Patch 11 fixes a problem with the rate of page scanning. As reclaim is rarely stalling on pages under writeback it means that scan rates are very high. This is particularly true for direct reclaim which is not calling writepage. The vmstat figures implied that much of this was busy work with PageReclaim pages marked for immediate reclaim. This patch is a prototype that moves these pages to their own LRU list. This has been tested and other than 2 USB keys getting trashed, nothing horrible fell out. That said, I am a bit unhappy with the rescue logic in patch 11 but did not find a better way around it. It does significantly reduce scan rates and System CPU time indicating it is the right direction to take. What is of critical importance is that stalls due to compaction are massively reduced even though sync compaction was still allowed. Testing from people complaining about stalls copying to USBs with THP enabled are particularly welcome. The following tests all involve THP usage and USB keys in some way. Each test follows this type of pattern 1. Read from some fast fast storage, be it raw device or file. Each time the copy finishes, start again until the test ends 2. Write a large file to a filesystem on a USB stick. Each time the copy finishes, start again until the test ends 3. When memory is low, start an alloc process that creates a mapping the size of physical memory to stress THP allocation. This is the "real" part of the test and the part that is meant to trigger stalls when THP is enabled. Copying continues in the background. 4. Record the CPU usage and time to execute of the alloc process 5. Record the number of THP allocs and fallbacks as well as the number of THP pages in use a the end of the test just before alloc exited 6. Run the test 5 times to get an idea of variability 7. Between each run, sync is run and caches dropped and the test waits until nr_dirty is a small number to avoid interference or caching between iterations that would skew the figures. The individual tests were then writebackCPDeviceBasevfat Disable THP, read from a raw device (sda), vfat on USB stick writebackCPDeviceBaseext4 Disable THP, read from a raw device (sda), ext4 on USB stick writebackCPDevicevfat THP enabled, read from a raw device (sda), vfat on USB stick writebackCPDeviceext4 THP enabled, read from a raw device (sda), ext4 on USB stick writebackCPFilevfat THP enabled, read from a file on fast storage and USB, both vfat writebackCPFileext4 THP enabled, read from a file on fast storage and USB, both ext4 The kernels tested were 3.1 3.1 vanilla 3.2-rc5 freemore Patches 1-10 immediate Patches 1-11 andrea The 8 patches Andrea posted as a basis of comparison The results are very long unfortunately. I'll start with the case where we are not using THP at all writebackCPDeviceBasevfat 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%) +/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%) User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%) +/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%) Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%) +/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%) THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) The THP figures are obviously all 0 because THP was enabled. The main thing to watch is the elapsed times and how they compare to times when THP is enabled later. It's also important to note that elapsed time is improved by this series as System CPu time is much reduced. writebackCPDevicevfat 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%) +/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60 (-10818.53%) User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%) +/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%) Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 ( 95.48%) +/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%) THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%) +/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%) Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%) +/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%) Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%) +/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03 Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28 The first thing to note is the "Elapsed Time" for the vanilla kernels of 2249 seconds versus 56 with THP disabled which might explain the reports of USB stalls with THP enabled. Applying the patches brings performance in line with THP-disabled performance while isolating pages for immediate reclaim from the LRU cuts down System CPU time. The "Fault Alloc" success rate figures are also improved. The vanilla kernel only managed to allocate 76.6 pages on average over the course of 5 iterations where as applying the series allocated 181.20 on average albeit it is well within variance. It's worth noting that applies the series at least descreases the amount of variance which implies an improvement. Andrea's series had a higher success rate for THP allocations but at a severe cost to elapsed time which is still better than vanilla but still much worse than disabling THP altogether. One can bring my series close to Andrea's by removing this check /* * If compaction is deferred for high-order allocations, it is because * sync compaction recently failed. In this is the case and the caller * has requested the system not be heavily disrupted, fail the * allocation now instead of entering direct reclaim */ if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD)) goto nopage; I didn't include a patch that removed the above check because hurting overall performance to improve the THP figure is not what the average user wants. It's something to consider though if someone really wants to maximise THP usage no matter what it does to the workload initially. This is summary of vmstat figures from the same test. 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 Page Ins 3257266139 1111844061 17263623 10901575 161423219 Page Outs 81054922 30364312 3626530 3657687 8753730 Swap Ins 3294 2851 6560 4964 4592 Swap Outs 390073 528094 620197 790912 698285 Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831 Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966 Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726 Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360 Kswapd efficiency 83% 69% 58% 57% 79% Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490 Direct efficiency 74% 9% 0% 1% 0% Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435 Percentage direct scans 96% 99% 99% 98% 99% Page writes by reclaim 722646 529174 620319 791018 699198 Page writes file 332573 1080 122 106 913 Page writes anon 390073 528094 620197 790912 698285 Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032 Page rescued immediate 0 0 0 87848 0 Slabs scanned 23552 23552 9216 8192 9216 Direct inode steals 231 0 0 0 0 Kswapd inode steals 0 0 0 0 0 Kswapd skipped wait 28076 786 0 61 6 THP fault alloc 609 383 753 906 1433 THP collapse alloc 12 6 0 0 6 THP splits 536 211 456 593 1136 THP fault fallback 4406 4633 4263 4110 3583 THP collapse fail 120 127 0 0 4 Compaction stalls 1810 728 623 779 3200 Compaction success 196 53 60 80 123 Compaction failures 1614 675 563 699 3077 Compaction pages moved 193158 53545 243185 333457 226688 Compaction move failure 9952 9396 16424 23676 45070 The main things to look at are 1. Page In/out figures are much reduced by the series. 2. Direct page scanning is incredibly high (264745.137 pages scanned per second on the vanilla kernel) but isolating PageReclaim pages on their own list reduces the number of pages scanned significantly. 3. The fact that "Page rescued immediate" is a positive number implies that we sometimes race removing pages from the LRU_IMMEDIATE list that need to be put back on a normal LRU but it happens only for 0.07% of the pages marked for immediate reclaim. writebackCPDeviceext4 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%) +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%) User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%) +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%) Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%) +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%) THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%) +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%) Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%) +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%) Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%) +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45 Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26 Similar test but the USB stick is using ext4 instead of vfat. As ext4 does not use writepage for migration, the large stalls due to compaction when THP is enabled are not observed. Still, isolating PageReclaim pages on their own list helped completion time largely by reducing the number of pages scanned by direct reclaim although time spend in congestion_wait could also be a factor. Again, Andrea's series had far higher success rates for THP allocation at the cost of elapsed time. I didn't look too closely but a quick look at the vmstat figures tells me kswapd reclaimed 8 times more pages than the patch series and direct reclaim reclaimed roughly three times as many pages. It follows that if memory is aggressively reclaimed, there will be more available for THP. writebackCPFilevfat 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%) +/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%) User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%) +/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%) Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%) +/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%) THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%) +/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%) Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%) +/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%) Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%) +/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76 Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28 In this case, the test is reading/writing only from filesystems but as it's vfat, it's slow due to calling writepage during compaction. Little to observe really - the time to complete the test goes way down with the series applied and THP allocation success rates go up in comparison to 3.2-rc5. The success rates are lower than 3.1.0 but the elapsed time for that kernel is abysmal so it is not really a sensible comparison. As before, Andrea's series allocates more THPs at the cost of overall performance. writebackCPFileext4 3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1 System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%) +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%) User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%) +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%) Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%) +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%) THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%) +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%) Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%) +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%) Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%) +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%) MMTests Statistics: duration User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45 Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26 Same type of story - elapsed times go down. In this case, allocation success rates are roughtly the same. As before, Andrea's has higher success rates but takes a lot longer. Overall the series does reduce latencies and while the tests are inherency racy as alloc competes with the cp processes, the variability was included. The THP allocation rates are not as high as they could be but that is because we would have to be more aggressive about reclaim and compaction impacting overall performance. This patch: Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware") noted that compaction does not migrate dirty or writeback pages and that is was meaningless to pick the page and re-add it to the LRU list. What was missed during review is that asynchronous migration moves dirty pages if their ->migratepage callback is migrate_page() because these can be moved without blocking. This potentially impacted hugepage allocation success rates by a factor depending on how many dirty pages are in the system. This patch partially reverts 39deaf85 to allow migration to isolate dirty pages again. This increases how much compaction disrupts the LRU but that is addressed later in the series. Signed-off-by: Mel Gorman Reviewed-by: Andrea Arcangeli Reviewed-by: Rik van Riel Reviewed-by: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 3 --- 1 file changed, 3 deletions(-) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index e6670c34eb49..396ea2b47f7e 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -349,9 +349,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, continue; } - if (!cc->sync) - mode |= ISOLATE_CLEAN; - /* Try isolate the page */ if (__isolate_lru_page(page, mode, 0) != 0) continue; -- cgit v1.2.3 From b16d3d5a5219d01e9be5e586e5d50fbf1ca955ea Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Jan 2012 17:19:26 -0800 Subject: mm: compaction: use synchronous compaction for /proc/sys/vm/compact_memory When asynchronous compaction was introduced, the /proc/sys/vm/compact_memory handler should have been updated to always use synchronous compaction. This did not happen so this patch addresses it. The assumption is if a user writes to /proc/sys/vm/compact_memory, they are willing for that process to stall. Signed-off-by: Mel Gorman Reviewed-by: Andrea Arcangeli Reviewed-by: Rik van Riel Reviewed-by: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 1 + 1 file changed, 1 insertion(+) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index 396ea2b47f7e..d31e64becb38 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -668,6 +668,7 @@ static int compact_node(int nid) .nr_freepages = 0, .nr_migratepages = 0, .order = -1, + .sync = true, }; zone = &pgdat->node_zones[zoneid]; -- cgit v1.2.3 From c82449352854ff09e43062246af86bdeb628f0c3 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Jan 2012 17:19:38 -0800 Subject: mm: compaction: make isolate_lru_page() filter-aware again Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware") noted that compaction does not migrate dirty or writeback pages and that is was meaningless to pick the page and re-add it to the LRU list. This had to be partially reverted because some dirty pages can be migrated by compaction without blocking. This patch updates "mm: compaction: make isolate_lru_page" by skipping over pages that migration has no possibility of migrating to minimise LRU disruption. Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Reviewed-by: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/mmzone.h | 2 ++ mm/compaction.c | 3 +++ mm/vmscan.c | 35 +++++++++++++++++++++++++++++++++-- 3 files changed, 38 insertions(+), 2 deletions(-) (limited to 'mm/compaction.c') diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 42e544cd4c9f..2038b90ca6e3 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -177,6 +177,8 @@ struct lruvec { #define ISOLATE_CLEAN ((__force isolate_mode_t)0x4) /* Isolate unmapped file */ #define ISOLATE_UNMAPPED ((__force isolate_mode_t)0x8) +/* Isolate for asynchronous migration */ +#define ISOLATE_ASYNC_MIGRATE ((__force isolate_mode_t)0x10) /* LRU Isolation modes. */ typedef unsigned __bitwise__ isolate_mode_t; diff --git a/mm/compaction.c b/mm/compaction.c index d31e64becb38..fb291585e1bf 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -349,6 +349,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, continue; } + if (!cc->sync) + mode |= ISOLATE_ASYNC_MIGRATE; + /* Try isolate the page */ if (__isolate_lru_page(page, mode, 0) != 0) continue; diff --git a/mm/vmscan.c b/mm/vmscan.c index cb68c53db4ec..efbcab1c8f54 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1075,8 +1075,39 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file) ret = -EBUSY; - if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page))) - return ret; + /* + * To minimise LRU disruption, the caller can indicate that it only + * wants to isolate pages it will be able to operate on without + * blocking - clean pages for the most part. + * + * ISOLATE_CLEAN means that only clean pages should be isolated. This + * is used by reclaim when it is cannot write to backing storage + * + * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages + * that it is possible to migrate without blocking + */ + if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) { + /* All the caller can do on PageWriteback is block */ + if (PageWriteback(page)) + return ret; + + if (PageDirty(page)) { + struct address_space *mapping; + + /* ISOLATE_CLEAN means only clean pages */ + if (mode & ISOLATE_CLEAN) + return ret; + + /* + * Only pages without mappings or that have a + * ->migratepage callback are possible to migrate + * without blocking + */ + mapping = page_mapping(page); + if (mapping && !mapping->a_ops->migratepage) + return ret; + } + } if ((mode & ISOLATE_UNMAPPED) && page_mapped(page)) return ret; -- cgit v1.2.3 From a6bc32b899223a877f595ef9ddc1e89ead5072b8 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 12 Jan 2012 17:19:43 -0800 Subject: mm: compaction: introduce sync-light migration for use by compaction This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT mode that avoids writing back pages to backing storage. Async compaction maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT. For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is used. This avoids sync compaction stalling for an excessive length of time, particularly when copying files to a USB stick where there might be a large number of dirty pages backed by a filesystem that does not support ->writepages. [aarcange@redhat.com: This patch is heavily based on Andrea's work] [akpm@linux-foundation.org: fix fs/nfs/write.c build] [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build] Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Cc: Andrea Arcangeli Cc: Minchan Kim Cc: Dave Jones Cc: Jan Kara Cc: Andy Isaacson Cc: Nai Xia Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/btrfs/disk-io.c | 5 ++-- fs/hugetlbfs/inode.c | 2 +- fs/nfs/internal.h | 2 +- fs/nfs/write.c | 4 +-- include/linux/fs.h | 6 ++-- include/linux/migrate.h | 23 +++++++++++---- mm/compaction.c | 2 +- mm/memory-failure.c | 2 +- mm/memory_hotplug.c | 2 +- mm/mempolicy.c | 2 +- mm/migrate.c | 78 +++++++++++++++++++++++++++---------------------- 11 files changed, 76 insertions(+), 52 deletions(-) (limited to 'mm/compaction.c') diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 1375494c8cb6..d8525662ca7a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -872,7 +872,8 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio, #ifdef CONFIG_MIGRATION static int btree_migratepage(struct address_space *mapping, - struct page *newpage, struct page *page, bool sync) + struct page *newpage, struct page *page, + enum migrate_mode mode) { /* * we can't safely write a btree page from here, @@ -887,7 +888,7 @@ static int btree_migratepage(struct address_space *mapping, if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL)) return -EAGAIN; - return migrate_page(mapping, newpage, page, sync); + return migrate_page(mapping, newpage, page, mode); } #endif diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 06fd4608a990..1e85a7ac0217 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -584,7 +584,7 @@ static int hugetlbfs_set_page_dirty(struct page *page) static int hugetlbfs_migrate_page(struct address_space *mapping, struct page *newpage, struct page *page, - bool sync) + enum migrate_mode mode) { int rc; diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 114398a15830..8102db9b926c 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -332,7 +332,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data); #ifdef CONFIG_MIGRATION extern int nfs_migrate_page(struct address_space *, - struct page *, struct page *, bool); + struct page *, struct page *, enum migrate_mode); #else #define nfs_migrate_page NULL #endif diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 889e98bc5a21..834f0fe96f89 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -1688,7 +1688,7 @@ out_error: #ifdef CONFIG_MIGRATION int nfs_migrate_page(struct address_space *mapping, struct page *newpage, - struct page *page, bool sync) + struct page *page, enum migrate_mode mode) { /* * If PagePrivate is set, then the page is currently associated with @@ -1703,7 +1703,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage, nfs_fscache_release_page(page, GFP_KERNEL); - return migrate_page(mapping, newpage, page, sync); + return migrate_page(mapping, newpage, page, mode); } #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index b92b73d0b2b9..e694bd4434a4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -525,6 +525,7 @@ enum positive_aop_returns { struct page; struct address_space; struct writeback_control; +enum migrate_mode; struct iov_iter { const struct iovec *iov; @@ -614,7 +615,7 @@ struct address_space_operations { * is false, it must not block. */ int (*migratepage) (struct address_space *, - struct page *, struct page *, bool); + struct page *, struct page *, enum migrate_mode); int (*launder_page) (struct page *); int (*is_partially_uptodate) (struct page *, read_descriptor_t *, unsigned long); @@ -2540,7 +2541,8 @@ extern int generic_check_addressable(unsigned, u64); #ifdef CONFIG_MIGRATION extern int buffer_migrate_page(struct address_space *, - struct page *, struct page *, bool); + struct page *, struct page *, + enum migrate_mode); #else #define buffer_migrate_page NULL #endif diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 14e6d2a88475..eaf867412f7a 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -6,18 +6,31 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **); +/* + * MIGRATE_ASYNC means never block + * MIGRATE_SYNC_LIGHT in the current implementation means to allow blocking + * on most operations but not ->writepage as the potential stall time + * is too significant + * MIGRATE_SYNC will block when migrating pages + */ +enum migrate_mode { + MIGRATE_ASYNC, + MIGRATE_SYNC_LIGHT, + MIGRATE_SYNC, +}; + #ifdef CONFIG_MIGRATION #define PAGE_MIGRATION 1 extern void putback_lru_pages(struct list_head *l); extern int migrate_page(struct address_space *, - struct page *, struct page *, bool); + struct page *, struct page *, enum migrate_mode); extern int migrate_pages(struct list_head *l, new_page_t x, unsigned long private, bool offlining, - bool sync); + enum migrate_mode mode); extern int migrate_huge_pages(struct list_head *l, new_page_t x, unsigned long private, bool offlining, - bool sync); + enum migrate_mode mode); extern int fail_migrate_page(struct address_space *, struct page *, struct page *); @@ -36,10 +49,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping, static inline void putback_lru_pages(struct list_head *l) {} static inline int migrate_pages(struct list_head *l, new_page_t x, unsigned long private, bool offlining, - bool sync) { return -ENOSYS; } + enum migrate_mode mode) { return -ENOSYS; } static inline int migrate_huge_pages(struct list_head *l, new_page_t x, unsigned long private, bool offlining, - bool sync) { return -ENOSYS; } + enum migrate_mode mode) { return -ENOSYS; } static inline int migrate_prep(void) { return -ENOSYS; } static inline int migrate_prep_local(void) { return -ENOSYS; } diff --git a/mm/compaction.c b/mm/compaction.c index fb291585e1bf..71a58f67f481 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -557,7 +557,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc) nr_migrate = cc->nr_migratepages; err = migrate_pages(&cc->migratepages, compaction_alloc, (unsigned long)cc, false, - cc->sync); + cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC); update_nr_listpages(cc); nr_remaining = cc->nr_migratepages; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 06d3479513aa..56080ea36140 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1557,7 +1557,7 @@ int soft_offline_page(struct page *page, int flags) page_is_file_cache(page)); list_add(&page->lru, &pagelist); ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, - 0, true); + 0, MIGRATE_SYNC); if (ret) { putback_lru_pages(&pagelist); pr_info("soft offline: %#lx: migration failed %d, type %lx\n", diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 2168489c0bc9..6629fafd6ce4 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -809,7 +809,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) } /* this function returns # of failed pages */ ret = migrate_pages(&source, hotremove_migrate_alloc, 0, - true, true); + true, MIGRATE_SYNC); if (ret) putback_lru_pages(&source); } diff --git a/mm/mempolicy.c b/mm/mempolicy.c index e3d58f088466..06b145fb64ab 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -942,7 +942,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, new_node_page, dest, - false, true); + false, MIGRATE_SYNC); if (err) putback_lru_pages(&pagelist); } diff --git a/mm/migrate.c b/mm/migrate.c index 4e86f3bacb85..9871a56d82c3 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -218,12 +218,13 @@ out: #ifdef CONFIG_BLOCK /* Returns true if all buffers are successfully locked */ -static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync) +static bool buffer_migrate_lock_buffers(struct buffer_head *head, + enum migrate_mode mode) { struct buffer_head *bh = head; /* Simple case, sync compaction */ - if (sync) { + if (mode != MIGRATE_ASYNC) { do { get_bh(bh); lock_buffer(bh); @@ -259,7 +260,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync) } #else static inline bool buffer_migrate_lock_buffers(struct buffer_head *head, - bool sync) + enum migrate_mode mode) { return true; } @@ -275,7 +276,7 @@ static inline bool buffer_migrate_lock_buffers(struct buffer_head *head, */ static int migrate_page_move_mapping(struct address_space *mapping, struct page *newpage, struct page *page, - struct buffer_head *head, bool sync) + struct buffer_head *head, enum migrate_mode mode) { int expected_count; void **pslot; @@ -311,7 +312,8 @@ static int migrate_page_move_mapping(struct address_space *mapping, * the mapping back due to an elevated page count, we would have to * block waiting on other references to be dropped. */ - if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) { + if (mode == MIGRATE_ASYNC && head && + !buffer_migrate_lock_buffers(head, mode)) { page_unfreeze_refs(page, expected_count); spin_unlock_irq(&mapping->tree_lock); return -EAGAIN; @@ -472,13 +474,14 @@ EXPORT_SYMBOL(fail_migrate_page); * Pages are locked upon entry and exit. */ int migrate_page(struct address_space *mapping, - struct page *newpage, struct page *page, bool sync) + struct page *newpage, struct page *page, + enum migrate_mode mode) { int rc; BUG_ON(PageWriteback(page)); /* Writeback must be complete */ - rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync); + rc = migrate_page_move_mapping(mapping, newpage, page, NULL, mode); if (rc) return rc; @@ -495,17 +498,17 @@ EXPORT_SYMBOL(migrate_page); * exist. */ int buffer_migrate_page(struct address_space *mapping, - struct page *newpage, struct page *page, bool sync) + struct page *newpage, struct page *page, enum migrate_mode mode) { struct buffer_head *bh, *head; int rc; if (!page_has_buffers(page)) - return migrate_page(mapping, newpage, page, sync); + return migrate_page(mapping, newpage, page, mode); head = page_buffers(page); - rc = migrate_page_move_mapping(mapping, newpage, page, head, sync); + rc = migrate_page_move_mapping(mapping, newpage, page, head, mode); if (rc) return rc; @@ -515,8 +518,8 @@ int buffer_migrate_page(struct address_space *mapping, * with an IRQ-safe spinlock held. In the sync case, the buffers * need to be locked now */ - if (sync) - BUG_ON(!buffer_migrate_lock_buffers(head, sync)); + if (mode != MIGRATE_ASYNC) + BUG_ON(!buffer_migrate_lock_buffers(head, mode)); ClearPagePrivate(page); set_page_private(newpage, page_private(page)); @@ -593,10 +596,11 @@ static int writeout(struct address_space *mapping, struct page *page) * Default handling if a filesystem does not provide a migration function. */ static int fallback_migrate_page(struct address_space *mapping, - struct page *newpage, struct page *page, bool sync) + struct page *newpage, struct page *page, enum migrate_mode mode) { if (PageDirty(page)) { - if (!sync) + /* Only writeback pages in full synchronous migration */ + if (mode != MIGRATE_SYNC) return -EBUSY; return writeout(mapping, page); } @@ -609,7 +613,7 @@ static int fallback_migrate_page(struct address_space *mapping, !try_to_release_page(page, GFP_KERNEL)) return -EAGAIN; - return migrate_page(mapping, newpage, page, sync); + return migrate_page(mapping, newpage, page, mode); } /* @@ -624,7 +628,7 @@ static int fallback_migrate_page(struct address_space *mapping, * == 0 - success */ static int move_to_new_page(struct page *newpage, struct page *page, - int remap_swapcache, bool sync) + int remap_swapcache, enum migrate_mode mode) { struct address_space *mapping; int rc; @@ -645,7 +649,7 @@ static int move_to_new_page(struct page *newpage, struct page *page, mapping = page_mapping(page); if (!mapping) - rc = migrate_page(mapping, newpage, page, sync); + rc = migrate_page(mapping, newpage, page, mode); else if (mapping->a_ops->migratepage) /* * Most pages have a mapping and most filesystems provide a @@ -654,9 +658,9 @@ static int move_to_new_page(struct page *newpage, struct page *page, * is the most common path for page migration. */ rc = mapping->a_ops->migratepage(mapping, - newpage, page, sync); + newpage, page, mode); else - rc = fallback_migrate_page(mapping, newpage, page, sync); + rc = fallback_migrate_page(mapping, newpage, page, mode); if (rc) { newpage->mapping = NULL; @@ -671,7 +675,7 @@ static int move_to_new_page(struct page *newpage, struct page *page, } static int __unmap_and_move(struct page *page, struct page *newpage, - int force, bool offlining, bool sync) + int force, bool offlining, enum migrate_mode mode) { int rc = -EAGAIN; int remap_swapcache = 1; @@ -680,7 +684,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, struct anon_vma *anon_vma = NULL; if (!trylock_page(page)) { - if (!force || !sync) + if (!force || mode == MIGRATE_ASYNC) goto out; /* @@ -726,10 +730,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage, if (PageWriteback(page)) { /* - * For !sync, there is no point retrying as the retry loop - * is expected to be too short for PageWriteback to be cleared + * Only in the case of a full syncronous migration is it + * necessary to wait for PageWriteback. In the async case, + * the retry loop is too short and in the sync-light case, + * the overhead of stalling is too much */ - if (!sync) { + if (mode != MIGRATE_SYNC) { rc = -EBUSY; goto uncharge; } @@ -800,7 +806,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage, skip_unmap: if (!page_mapped(page)) - rc = move_to_new_page(newpage, page, remap_swapcache, sync); + rc = move_to_new_page(newpage, page, remap_swapcache, mode); if (rc && remap_swapcache) remove_migration_ptes(page, page); @@ -823,7 +829,8 @@ out: * to the newly allocated page in newpage. */ static int unmap_and_move(new_page_t get_new_page, unsigned long private, - struct page *page, int force, bool offlining, bool sync) + struct page *page, int force, bool offlining, + enum migrate_mode mode) { int rc = 0; int *result = NULL; @@ -843,7 +850,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private, if (unlikely(split_huge_page(page))) goto out; - rc = __unmap_and_move(page, newpage, force, offlining, sync); + rc = __unmap_and_move(page, newpage, force, offlining, mode); out: if (rc != -EAGAIN) { /* @@ -891,7 +898,8 @@ out: */ static int unmap_and_move_huge_page(new_page_t get_new_page, unsigned long private, struct page *hpage, - int force, bool offlining, bool sync) + int force, bool offlining, + enum migrate_mode mode) { int rc = 0; int *result = NULL; @@ -904,7 +912,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, rc = -EAGAIN; if (!trylock_page(hpage)) { - if (!force || !sync) + if (!force || mode != MIGRATE_SYNC) goto out; lock_page(hpage); } @@ -915,7 +923,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page, try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS); if (!page_mapped(hpage)) - rc = move_to_new_page(new_hpage, hpage, 1, sync); + rc = move_to_new_page(new_hpage, hpage, 1, mode); if (rc) remove_migration_ptes(hpage, hpage); @@ -958,7 +966,7 @@ out: */ int migrate_pages(struct list_head *from, new_page_t get_new_page, unsigned long private, bool offlining, - bool sync) + enum migrate_mode mode) { int retry = 1; int nr_failed = 0; @@ -979,7 +987,7 @@ int migrate_pages(struct list_head *from, rc = unmap_and_move(get_new_page, private, page, pass > 2, offlining, - sync); + mode); switch(rc) { case -ENOMEM: @@ -1009,7 +1017,7 @@ out: int migrate_huge_pages(struct list_head *from, new_page_t get_new_page, unsigned long private, bool offlining, - bool sync) + enum migrate_mode mode) { int retry = 1; int nr_failed = 0; @@ -1026,7 +1034,7 @@ int migrate_huge_pages(struct list_head *from, rc = unmap_and_move_huge_page(get_new_page, private, page, pass > 2, offlining, - sync); + mode); switch(rc) { case -ENOMEM: @@ -1155,7 +1163,7 @@ set_status: err = 0; if (!list_empty(&pagelist)) { err = migrate_pages(&pagelist, new_page_node, - (unsigned long)pm, 0, true); + (unsigned long)pm, 0, MIGRATE_SYNC); if (err) putback_lru_pages(&pagelist); } -- cgit v1.2.3 From 0bf380bc70ecba68cb4d74dc656cc2fa8c4d801a Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Fri, 3 Feb 2012 15:37:18 -0800 Subject: mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72ec #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d14] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8de #9 [d72d3d68] compact_zone_order at c030bba1 #10 [d72d3db4] try_to_compact_pages at c030bc84 #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 #14 [d72d3eb8] alloc_pages_vma at c030a845 #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb #16 [d72d3f00] handle_mm_fault at c02f36c6 #17 [d72d3f30] do_page_fault at c05c70ed #18 [d72d3fb0] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh Tested-by: Herbert van den Bergh Signed-off-by: Mel Gorman Acked-by: Michal Nazarewicz Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index 71a58f67f481..bd939a574b84 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -313,6 +313,19 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, } else if (!locked) spin_lock_irq(&zone->lru_lock); + /* + * migrate_pfn does not necessarily start aligned to a + * pageblock. Ensure that pfn_valid is called when moving + * into a new MAX_ORDER_NR_PAGES range in case of large + * memory holes within the zone + */ + if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) { + if (!pfn_valid(low_pfn)) { + low_pfn += MAX_ORDER_NR_PAGES - 1; + continue; + } + } + if (!pfn_valid_within(low_pfn)) continue; nr_scanned++; -- cgit v1.2.3 From dc9086004b3d5db75997a645b3fe08d9138b7ad0 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 8 Feb 2012 17:13:38 -0800 Subject: mm: compaction: check for overlapping nodes during isolation for migration When isolating pages for migration, migration starts at the start of a zone while the free scanner starts at the end of the zone. Migration avoids entering a new zone by never going beyond the free scanned. Unfortunately, in very rare cases nodes can overlap. When this happens, migration isolates pages without the LRU lock held, corrupting lists which will trigger errors in reclaim or during page free such as in the following oops BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 IP: [] free_pcppages_bulk+0xcc/0x450 PGD 1dda554067 PUD 1e1cb58067 PMD 0 Oops: 0000 [#1] SMP CPU 37 Pid: 17088, comm: memcg_process_s Tainted: G X RIP: free_pcppages_bulk+0xcc/0x450 Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0) Call Trace: free_hot_cold_page+0x17e/0x1f0 __pagevec_free+0x90/0xb0 release_pages+0x22a/0x260 pagevec_lru_move_fn+0xf3/0x110 putback_lru_page+0x66/0xe0 unmap_and_move+0x156/0x180 migrate_pages+0x9e/0x1b0 compact_zone+0x1f3/0x2f0 compact_zone_order+0xa2/0xe0 try_to_compact_pages+0xdf/0x110 __alloc_pages_direct_compact+0xee/0x1c0 __alloc_pages_slowpath+0x370/0x830 __alloc_pages_nodemask+0x1b1/0x1c0 alloc_pages_vma+0x9b/0x160 do_huge_pmd_anonymous_page+0x160/0x270 do_page_fault+0x207/0x4c0 page_fault+0x25/0x30 The "X" in the taint flag means that external modules were loaded but but is unrelated to the bug triggering. The real problem was because the PFN layout looks like this Zone PFN ranges: DMA 0x00000010 -> 0x00001000 DMA32 0x00001000 -> 0x00100000 Normal 0x00100000 -> 0x01e80000 Movable zone start PFN for each node early_node_map[14] active PFN ranges 0: 0x00000010 -> 0x0000009b 0: 0x00000100 -> 0x0007a1ec 0: 0x0007a354 -> 0x0007a379 0: 0x0007f7ff -> 0x0007f800 0: 0x00100000 -> 0x00680000 1: 0x00680000 -> 0x00e80000 0: 0x00e80000 -> 0x01080000 1: 0x01080000 -> 0x01280000 0: 0x01280000 -> 0x01480000 1: 0x01480000 -> 0x01680000 0: 0x01680000 -> 0x01880000 1: 0x01880000 -> 0x01a80000 0: 0x01a80000 -> 0x01c80000 1: 0x01c80000 -> 0x01e80000 The fix is straight-forward. isolate_migratepages() has to make a similar check to isolate_freepage to ensure that it never isolates pages from a zone it does not hold the LRU lock for. This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x and current mainline. Signed-off-by: Mel Gorman Acked-by: Michal Nazarewicz Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/compaction.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) (limited to 'mm/compaction.c') diff --git a/mm/compaction.c b/mm/compaction.c index bd939a574b84..d9ebebe1a2aa 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -330,8 +330,17 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone, continue; nr_scanned++; - /* Get the page and skip if free */ + /* + * Get the page and ensure the page is within the same zone. + * See the comment in isolate_freepages about overlapping + * nodes. It is deliberate that the new zone lock is not taken + * as memory compaction should not move pages between nodes. + */ page = pfn_to_page(low_pfn); + if (page_zone(page) != zone) + continue; + + /* Skip if free */ if (PageBuddy(page)) continue; -- cgit v1.2.3