summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2017-04-20dm rq: don't pass irrelevant error code to blk_mq_complete_requestChristoph Hellwig
dm never uses rq->errors, so there is no need to pass an error argument to blk_mq_complete_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20md/raid10: wait up frozen array in handle_write_completedGuoqing Jiang
Since nr_queued is changed, we need to call wake_up here if the array is already frozen and waiting for condition "nr_pending == nr_queued + extra" to be true. And commit 824e47daddbf ("RAID1: avoid unnecessary spin locks in I/O barrier code") which has already added the wake_up for raid1. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-14md-cluster: Fix a memleak in an error handling pathChristophe JAILLET
We know that 'bm_lockres' is NULL here, so 'lockres_free(bm_lockres)' is a no-op. According to resource handling in case of error a few lines below, it is likely that 'bitmap_free(bitmap)' was expected instead. Fixes: b98938d16a10 ("md-cluster: introduce cluster_check_sync_size") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-12md: support disabling of create-on-open semantics.NeilBrown
md allows a new array device to be created by simply opening a device file. This make it difficult to remove the device and udev is likely to open the device file as part of processing the REMOVE event. There is an alternate mechanism for creating arrays by writing to the new_array module parameter. When using tools that work with this parameter, it is best to disable the old semantics. This new module parameter allows that. Signed-off-by: NeilBrown <neilb@suse.com> Acted-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-12md: allow creation of mdNNN arrays via md_mod/parameters/new_arrayNeilBrown
The intention when creating the "new_array" parameter and the possibility of having array names line "md_HOME" was to transition away from the old way of creating arrays and to eventually only use this new way. The "old" way of creating array is to create a device node in /dev and then open it. The act of opening creates the array. This is problematic because sometimes the device node can be opened when we don't want to create an array. This can easily happen when some rule triggered by udev looks at a device as it is being destroyed. The node in /dev continues to exist for a short period after an array is stopped, and opening it during this time recreates the array (as an inactive array). Unfortunately no clear plan for the transition was created. It is now time to fix that. This patch allows devices with numeric names, like "md999" to be created by writing to "new_array". This will only work if the minor number given is not already in use. This will allow mdadm to support the creation of arrays with numbers > 511 (currently not possible) by writing to new_array. mdadm can, at some point, use this approach to create *all* arrays, which will allow the transition to only using the new-way. Signed-off-by: NeilBrown <neilb@suse.com> Acted-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11raid5-ppl: use a single mempool for ppl_io_unit and header_pageArtur Paszkiewicz
Allocate both struct ppl_io_unit and its header_page from a shared mempool to avoid a possible deadlock. Implement allocate and free functions for the mempool, remove the second pool for allocating header_page. The header_pages are now freed with their io_units, not when the ppl bio completes. Also, use GFP_NOWAIT instead of GFP_ATOMIC for allocating ppl_io_unit because we can handle failed allocations and there is no reason to utilize emergency reserves. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11md/raid0: fix up bio splitting.NeilBrown
raid0_make_request() should use a private bio_set rather than the shared fs_bio_set, which is only meant for filesystems to use. raid0_make_request() shouldn't loop around using the bio_set multiple times as that can deadlock. So use mddev->bio_set and pass the tail to generic_make_request() instead of looping on it. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11md/linear: improve bio splitting.NeilBrown
linear_make_request() uses fs_bio_set, which is meant for filesystems to use, and loops, possible allocating from the same bio set multiple times. These behaviors can theoretically cause deadlocks, though as linear requests are hardly ever split, it is unlikely in practice. Change to use mddev->bio_set - otherwise unused for linear, and submit the tail of a split request to generic_make_request() for it to handle. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11md/raid5: make chunk_aligned_read() split bios more cleanly.NeilBrown
chunk_aligned_read() currently uses fs_bio_set - which is meant for filesystems to use - and loops if multiple splits are needed, which is not best practice. As this is only used for READ requests, not writes, it is unlikely to cause a problem. However it is best to be consistent in how we split bios, and to follow the pattern used in raid1/raid10. So create a private bioset, bio_split, and use it to perform a single split, submitting the remainder to generic_make_request() for later processing. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11md/raid10: simplify handle_read_error()NeilBrown
handle_read_error() duplicates a lot of the work that raid10_read_request() does, so it makes sense to just use that function. handle_read_error() relies on the same r10bio being re-used so that, in the case of a read-only array, setting IO_BLOCKED in r1bio->devs[].bio ensures read_balance() won't re-use that device. So when called from raid10_make_request() we clear that array, but not when called from handle_read_error(). Two parts of handle_read_error() that need to be preserved are the warning message it prints, so they are conditionally added to raid10_read_request(). If the failing rdev can be found, messages are printed. Otherwise they aren't. Not that as rdev_dec_pending() has already been called on the failing rdev, we need to use rcu_read_lock() to get a new reference from the conf. We only use this to get the name of the failing block device. With this change, we no longer need inc_pending(). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11md/raid10: simplify the splitting of requests.NeilBrown
raid10 splits requests in two different ways for two different reasons. First, bio_split() is used to ensure the bio fits with a chunk. Second, multiple r10bio structures are allocated to represent the different sections that need to go to different devices, to avoid known bad blocks. This can be simplified to just use bio_split() once, and not to use multiple r10bios. We delay the split until we know a maximum bio size that can be handled with a single r10bio, and then split the bio and queue the remainder for later handling. As with raid1, we allocate a new bio_set to help with the splitting. It is not correct to use fs_bio_set in a device driver. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11md/raid1: factor out flush_bio_list()NeilBrown
flush_pending_writes() and raid1_unplug() each contain identical copies of a fairly large slab of code. So factor that out into new flush_bio_list() to simplify maintenance. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11md/raid1: simplify handle_read_error().NeilBrown
handle_read_error() duplicates a lot of the work that raid1_read_request() does, so it makes sense to just use that function. This doesn't quite work as handle_read_error() relies on the same r1bio being re-used so that, in the case of a read-only array, setting IO_BLOCKED in r1bio->bios[] ensures read_balance() won't re-use that device. So we need to allow a r1bio to be passed to raid1_read_request(), and to have that function mostly initialise the r1bio, but leave the bios[] array untouched. Two parts of handle_read_error() that need to be preserved are the warning message it prints, so they are conditionally added to raid1_read_request(). Note that this highlights a minor bug on alloc_r1bio(). It doesn't initalise the bios[] array, so it is possible that old content is there, which might cause read_balance() to ignore some devices with no good reason. With this change, we no longer need inc_pending(), or the sectors_handled arg to alloc_r1bio(). As handle_read_error() is called from raid1d() and allocates memory, there is tiny chance of a deadlock. All element of various pools could be queued waiting for raid1 to handle them, and there may be no extra memory free. Achieving guaranteed forward progress would probably require a second thread and another mempool. Instead of that complexity, add __GFP_HIGH to any allocations when read1_read_request() is called from raid1d. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11md/raid1: simplify alloc_behind_master_bio()NeilBrown
Now that we always always pass an offset of 0 and a size that matches the bio to alloc_behind_master_bio(), we can remove the offset/size args and simplify the code. We could probably remove bio_copy_data_partial() too. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-11md/raid1: simplify the splitting of requests.NeilBrown
raid1 currently splits requests in two different ways for two different reasons. First, bio_split() is used to ensure the bio fits within a resync accounting region. Second, multiple r1bios are allocated for each bio to handle the possiblity of known bad blocks on some devices. This can be simplified to just use bio_split() once, and not use multiple r1bios. We delay the split until we know a maximum bio size that can be handled with a single r1bio, and then split the bio and queue the remainder for later handling. This avoids all loops inside raid1.c request handling. Just a single read, or a single set of writes, is submitted to lower-level devices for each bio that comes from generic_make_request(). When the bio needs to be split, generic_make_request() will do the necessary looping and call md_make_request() multiple times. raid1_make_request() no longer queues request for raid1 to handle, so we can remove that branch from the 'if'. This patch also creates a new private bio_set (conf->bio_split) for splitting bios. Using fs_bio_set is wrong, as it is meant to be used by filesystems, not block devices. Using it inside md can lead to deadlocks under high memory pressure. Delete unused variable in raid1_write_request() (Shaohua) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10raid5-ppl: partial parity calculation optimizationArtur Paszkiewicz
In case of read-modify-write, partial partity is the same as the result of ops_run_prexor5(), so we can just copy sh->dev[pd_idx].page into sh->ppl_page instead of calculating it again. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10raid5-ppl: use resize_stripes() when enabling or disabling pplArtur Paszkiewicz
Use resize_stripes() instead of raid5_reset_stripe_cache() to allocate or free sh->ppl_page at runtime for all stripes in the stripe cache. raid5_reset_stripe_cache() required suspending the mddev and could deadlock because of GFP_KERNEL allocations. Move the 'newsize' check to check_reshape() to allow reallocating the stripes with the same number of disks. Allocate sh->ppl_page in alloc_stripe() instead of grow_buffers(). Pass 'struct r5conf *conf' as a parameter to alloc_stripe() because it is needed to check whether to allocate ppl_page. Add free_stripe() and use it to free stripes rather than directly call kmem_cache_free(). Also free sh->ppl_page in free_stripe(). Set MD_HAS_PPL at the end of ppl_init_log() instead of explicitly setting it in advance and add another parameter to log_init() to allow calling ppl_init_log() without the bit set. Don't try to calculate partial parity or add a stripe to log if it does not have ppl_page set. Enabling ppl can now be performed without suspending the mddev, because the log won't be used until new stripes are allocated with ppl_page. Calling mddev_suspend/resume is still necessary when disabling ppl, because we want all stripes to finish before stopping the log, but resize_stripes() can be called after mddev_resume() when ppl is no longer active. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10raid5-ppl: move no_mem_stripes to struct ppl_confArtur Paszkiewicz
Use a single no_mem_stripes list instead of per member device lists for handling stripes that need retrying in case of failed io_unit allocation. Because io_units are allocated from a memory pool shared between all member disks, the no_mem_stripes list should be checked when an io_unit for any member is freed. This fixes a deadlock that could happen if there are stripes in more than one no_mem_stripes list. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10md/raid1: avoid reusing a resync bio after error handling.NeilBrown
fix_sync_read_error() modifies a bio on a newly faulty device by setting bi_end_io to end_sync_write. This ensure that put_buf() will still call rdev_dec_pending() as required, but makes sure that subsequent code in fix_sync_read_error() doesn't try to read from the device. Unfortunately this interacts badly with sync_request_write() which assumes that any bio with bi_end_io set to non-NULL other than end_sync_read is safe to write to. As the device is now faulty it doesn't make sense to write. As the bio was recently used for a read, it is "dirty" and not suitable for immediate submission. In particular, ->bi_next might be non-NULL, which will cause generic_make_request() to complain. Break this interaction by refusing to write to devices which are marked as Faulty. Reported-and-tested-by: Michael Wang <yun.wang@profitbricks.com> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.") Cc: stable@vger.kernel.org (v4.10+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10md.c:didn't unlock the mddev before return EINVAL in array_size_storeZhilong Liu
md.c: it needs to release the mddev lock before the array_size_store() returns. Fixes: ab5a98b132fd ("md-cluster: change array_sectors and update size are not supported") Signed-off-by: Zhilong Liu <zlliu@suse.com> Reviewed-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stopNeilBrown
if called md_set_readonly and set MD_CLOSING bit, the mddev cannot be opened any more due to the MD_CLOING bit wasn't cleared. Thus it needs to be cleared in md_ioctl after any call to md_set_readonly() or do_md_stop(). Signed-off-by: NeilBrown <neilb@suse.com> Fixes: af8d8e6f0315 ("md: changes for MD_STILL_CLOSED flag") Cc: stable@vger.kernel.org (v4.9+) Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10md/raid10: reset the 'first' at the end of loopGuoqing Jiang
We need to set "first = 0' at the end of rdev_for_each loop, so we can get the array's min_offset_diff correctly otherwise min_offset_diff just means the last rdev's offset diff. Suggested-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10md/raid6: Fix anomily when recovering a single device in RAID6.NeilBrown
When recoverying a single missing/failed device in a RAID6, those stripes where the Q block is on the missing device are handled a bit differently. In these cases it is easy to check that the P block is correct, so we do. This results in the P block be destroy. Consequently the P block needs to be read a second time in order to compute Q. This causes lots of seeks and hurts performance. It shouldn't be necessary to re-read P as it can be computed from the DATA. But we only compute blocks on missing devices, since c337869d9501 ("md: do not compute parity unless it is on a failed drive"). So relax the change made in that commit to allow computing of the P block in a RAID6 which it is the only missing that block. This makes RAID6 recovery run much faster as the disk just "before" the recovering device is no longer seeking back-and-forth. Reported-by-tested-by: Brad Campbell <lists2009@fnarfbargle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-10md: update slab_cache before releasing new stripes when stripes resizingDennis Yang
When growing raid5 device on machine with small memory, there is chance that mdadm will be killed and the following bug report can be observed. The same bug could also be reproduced in linux-4.10.6. [57600.075774] BUG: unable to handle kernel NULL pointer dereference at (null) [57600.083796] IP: [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20 [57600.110378] PGD 421cf067 PUD 4442d067 PMD 0 [57600.114678] Oops: 0002 [#1] SMP [57600.180799] CPU: 1 PID: 25990 Comm: mdadm Tainted: P O 4.2.8 #1 [57600.187849] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS QV05AR66 03/06/2013 [57600.197490] task: ffff880044e47240 ti: ffff880043070000 task.ti: ffff880043070000 [57600.204963] RIP: 0010:[<ffffffff81a6aa87>] [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20 [57600.213057] RSP: 0018:ffff880043073810 EFLAGS: 00010046 [57600.218359] RAX: 0000000000000000 RBX: 000000000000000c RCX: ffff88011e296dd0 [57600.225486] RDX: 0000000000000001 RSI: ffffe8ffffcb46c0 RDI: 0000000000000000 [57600.232613] RBP: ffff880043073878 R08: ffff88011e5f8170 R09: 0000000000000282 [57600.239739] R10: 0000000000000005 R11: 28f5c28f5c28f5c3 R12: ffff880043073838 [57600.246872] R13: ffffe8ffffcb46c0 R14: 0000000000000000 R15: ffff8800b9706a00 [57600.253999] FS: 00007f576106c700(0000) GS:ffff88011e280000(0000) knlGS:0000000000000000 [57600.262078] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [57600.267817] CR2: 0000000000000000 CR3: 00000000428fe000 CR4: 00000000001406e0 [57600.274942] Stack: [57600.276949] ffffffff8114ee35 ffff880043073868 0000000000000282 000000000000eb3f [57600.284383] ffffffff81119043 ffff880043073838 ffff880043073838 ffff88003e197b98 [57600.291820] ffffe8ffffcb46c0 ffff88003e197360 0000000000000286 ffff880043073968 [57600.299254] Call Trace: [57600.301698] [<ffffffff8114ee35>] ? cache_flusharray+0x35/0xe0 [57600.307523] [<ffffffff81119043>] ? __page_cache_release+0x23/0x110 [57600.313779] [<ffffffff8114eb53>] kmem_cache_free+0x63/0xc0 [57600.319344] [<ffffffff81579942>] drop_one_stripe+0x62/0x90 [57600.324915] [<ffffffff81579b5b>] raid5_cache_scan+0x8b/0xb0 [57600.330563] [<ffffffff8111b98a>] shrink_slab.part.36+0x19a/0x250 [57600.336650] [<ffffffff8111e38c>] shrink_zone+0x23c/0x250 [57600.342039] [<ffffffff8111e4f3>] do_try_to_free_pages+0x153/0x420 [57600.348210] [<ffffffff8111e851>] try_to_free_pages+0x91/0xa0 [57600.353959] [<ffffffff811145b1>] __alloc_pages_nodemask+0x4d1/0x8b0 [57600.360303] [<ffffffff8157a30b>] check_reshape+0x62b/0x770 [57600.365866] [<ffffffff8157a4a5>] raid5_check_reshape+0x55/0xa0 [57600.371778] [<ffffffff81583df7>] update_raid_disks+0xc7/0x110 [57600.377604] [<ffffffff81592b73>] md_ioctl+0xd83/0x1b10 [57600.382827] [<ffffffff81385380>] blkdev_ioctl+0x170/0x690 [57600.388307] [<ffffffff81195238>] block_ioctl+0x38/0x40 [57600.393525] [<ffffffff811731c5>] do_vfs_ioctl+0x2b5/0x480 [57600.399010] [<ffffffff8115e07b>] ? vfs_write+0x14b/0x1f0 [57600.404400] [<ffffffff811733cc>] SyS_ioctl+0x3c/0x70 [57600.409447] [<ffffffff81a6ad97>] entry_SYSCALL_64_fastpath+0x12/0x6a [57600.415875] Code: 00 00 00 00 55 48 89 e5 8b 07 85 c0 74 04 31 c0 5d c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 ef b0 01 5d c3 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 85 d1 63 ff 5d [57600.435460] RIP [<ffffffff81a6aa87>] _raw_spin_lock+0x7/0x20 [57600.441208] RSP <ffff880043073810> [57600.444690] CR2: 0000000000000000 [57600.448000] ---[ end trace cbc6b5cc4bf9831d ]--- The problem is that resize_stripes() releases new stripe_heads before assigning new slab cache to conf->slab_cache. If the shrinker function raid5_cache_scan() gets called after resize_stripes() starting releasing new stripes but right before new slab cache being assigned, it is possible that these new stripe_heads will be freed with the old slab_cache which was already been destoryed and that triggers this bug. Signed-off-by: Dennis Yang <dennisyang@qnap.com> Fixes: edbe83ab4c27 ("md/raid5: allow the stripe_cache to grow and shrink.") Cc: stable@vger.kernel.org (4.1+) Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-04-08Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Here's a pull request for 4.11-rc, fixing a set of issues mostly centered around the new scheduling framework. These have been brewing for a while, but split up into what we absolutely need in 4.11, and what we can defer until 4.12. These are well tested, on both single queue and multiqueue setups, and with and without shared tags. They fix several hangs that have happened in testing. This is obviously larger than I would have preferred at this point in time, but I don't think we can shave much off this and still get the desired results. In detail, this pull request contains: - a set of five fixes for NVMe, mostly from Christoph and one from Roland. - a series from Bart, fixing issues with dm-mq and SCSI shared tags and scheduling. Note that one of those patches commit messages may read like an optimization, but it is in fact an important fix for queue restarts in particular. - a series from Omar, most importantly fixing a hang with multiple hardware queues when we fail to get a driver tag. Another important fix in there is for resizing hardware queues, which nbd does when handling multiple sockets for one connection. - fixing an imbalance in putting the ctx for hctx request allocations from Minchan" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: Restart a single queue if tag sets are shared dm rq: Avoid that request processing stalls sporadically scsi: Avoid that SCSI queues get stuck blk-mq: Introduce blk_mq_delay_run_hw_queue() blk-mq: remap queues when adding/removing hardware queues blk-mq-sched: fix crash in switch error path blk-mq-sched: set up scheduler tags when bringing up new queues blk-mq-sched: refactor scheduler initialization blk-mq: use the right hctx when getting a driver tag fails nvmet: fix byte swap in nvmet_parse_io_cmd nvmet: fix byte swap in nvmet_execute_write_zeroes nvmet: add missing byte swap in nvmet_get_smart_log nvme: add missing byte swap in nvme_setup_discard nvme: Correct NVMF enum values to match NVMe-oF rev 1.0 block: do not put mq context in blk_mq_alloc_request_hctx
2017-04-08block: remove the discard_zeroes_data flagChristoph Hellwig
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08dm kcopyd: switch to use REQ_OP_WRITE_ZEROESChristoph Hellwig
It seems like the code currently passes whatever it was using for writes to WRITE SAME. Just switch it to WRITE ZEROES, although that doesn't need any payload. Untested, and confused by the code, maybe someone who understands it better than me can help.. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08dm: support REQ_OP_WRITE_ZEROESChristoph Hellwig
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08dm io: discards don't take a payloadChristoph Hellwig
Fix up do_region to not allocate a bio_vec for discards. We've got rid of the discard payload allocated by the caller years ago. Obviously this wasn't actually harmful given how long it's been there, but it's still good to avoid the pointless allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08md: support REQ_OP_WRITE_ZEROESChristoph Hellwig
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07Merge branch 'for-linus' into for-4.12/blockJens Axboe
We've added a considerable amount of fixes for stalls and issues with the blk-mq scheduling in the 4.11 series since forking off the for-4.12/block branch. We need to do improvements on top of that for 4.12, so pull in the previous fixes to make our lives easier going forward. Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07dm rq: Avoid that request processing stalls sporadicallyBart Van Assche
While running the srp-test software I noticed that request processing stalls sporadically at the beginning of a test, namely when mkfs is run against a dm-mpath device. Every time when that happened the following command was sufficient to resume request processing: echo run >/sys/kernel/debug/block/dm-0/state This patch avoids that such request processing stalls occur. The test I ran is as follows: while srp-test/run_tests -d -r 30 -t 02-mq; do :; done Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07Merge tag 'dm-4.11-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - two stable fixes for the verity target's FEC support - a stable fix for raid target's raid1 support (when no bitmap is used) - a 4.11 cache metadata v2 format fix to properly test blocks are clean * tag 'dm-4.11-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm verity fec: fix bufio leaks dm raid: fix NULL pointer dereference for raid1 without bitmap dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty dm verity fec: limit error correction recursion
2017-04-07block: trace completion of all bios.NeilBrown
Currently only dm and md/raid5 bios trigger trace_block_bio_complete(). Now that we have bio_chain() and bio_inc_remaining(), it is not possible, in general, for a driver to know when the bio is really complete. Only bio_endio() knows that. So move the trace_block_bio_complete() call to bio_endio(). Now trace_block_bio_complete() pairs with trace_block_bio_queue(). Any bio for which a 'queue' event is traced, will subsequently generate a 'complete' event. There are a few cases where completion tracing is not wanted. 1/ If blk_update_request() has already generated a completion trace event at the 'request' level, there is no point generating one at the bio level too. In this case the bi_sector and bi_size will have changed, so the bio level event would be wrong 2/ If the bio hasn't actually been queued yet, but is being aborted early, then a trace event could be confusing. Some filesystems call bio_endio() but do not want tracing. 3/ The bio_integrity code interposes itself by replacing bi_end_io, then restoring it and calling bio_endio() again. This would produce two identical trace events if left like that. To handle these, we introduce a flag BIO_TRACE_COMPLETION and only produce the trace event when this is set. We address point 1 above by clearing the flag in blk_update_request(). We address point 2 above by only setting the flag when generic_make_request() is called. We address point 3 above by clearing the flag after generating a completion event. When bio_split() is used on a bio, particularly in blk_queue_split(), there is an extra complication. A new bio is split off the front, and may be handle directly without going through generic_make_request(). The old bio, which has been advanced, is passed to generic_make_request(), so it will trigger a trace event a second time. Probably the best result when a split happens is to see a single 'queue' event for the whole bio, then multiple 'complete' events - one for each component. To achieve this was can: - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split() - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set. This way, the split-off bio won't create a queue event, the original won't either even if it re-submitted to generic_make_request(), but both will produce completion events, each for their own range. So if generic_make_request() is called (which generates a QUEUED event), then bi_endio() will create a single COMPLETE event for each range that the bio is split into, unless the driver has explicitly requested it not to. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-31dm verity fec: fix bufio leaksSami Tolvanen
Buffers read through dm_bufio_read() were not released in all code paths. Fixes: a739ff3f543a ("dm verity: add support for forward error correction") Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31dm cache policy smq: make the cleaner policy write-back more aggressivelyJoe Thornber
By ignoring the sentinels the cleaner policy is able to write-back dirty cache data much faster. There is no reason to respect the sentinels, which denote that a block was changed recently, when using the cleaner policy given that the cleaner is tasked with writing back all dirty data. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31dm cache: set/clear the cache core's dirty_bitset when loading mappingsJoe Thornber
When loading metadata make sure to set/clear the dirty bits in the cache core's dirty_bitset as well as the policy. Otherwise the cache core is unaware that any blocks were dirty when the cache was last shutdown. A very serious side-effect being that the cleaner policy would therefore never be tasked with writing back dirty data from a cache that was in writeback mode (e.g. when switching from smq policy to cleaner policy when decommissioning a writeback cache). This fixes a serious data corruption bug associated with writeback mode. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31dm raid: fix NULL pointer dereference for raid1 without bitmapDmitry Bilunov
Commit 4257e08 ("dm raid: support to change bitmap region size") introduced a bitmap resize call during preresume phase. User can create a DM device with "raid" target configured as raid1 with no metadata devices to hold superblock/bitmap info. It can be achieved using the following sequence: truncate -s 32M /dev/shm/raid-test LOOP=$(losetup --show -f /dev/shm/raid-test) dmsetup create raid-test-linear0 --table "0 1024 linear $LOOP 0" dmsetup create raid-test-linear1 --table "0 1024 linear $LOOP 1024" dmsetup create raid-test --table "0 1024 raid raid1 1 2048 2 - /dev/mapper/raid-test-linear0 - /dev/mapper/raid-test-linear1" This results in the following crash: [ 4029.110216] device-mapper: raid: Ignoring chunk size parameter for RAID 1 [ 4029.110217] device-mapper: raid: Choosing default region size of 4MiB [ 4029.111349] md/raid1:mdX: active with 2 out of 2 mirrors [ 4029.114770] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 [ 4029.114802] IP: bitmap_resize+0x25/0x7c0 [md_mod] [ 4029.114816] PGD 0 … [ 4029.115059] Hardware name: Aquarius Pro P30 S85 BUY-866/B85M-E, BIOS 2304 05/25/2015 [ 4029.115079] task: ffff88015cc29a80 task.stack: ffffc90001a5c000 [ 4029.115097] RIP: 0010:bitmap_resize+0x25/0x7c0 [md_mod] [ 4029.115112] RSP: 0018:ffffc90001a5fb68 EFLAGS: 00010246 [ 4029.115127] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000 [ 4029.115146] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000000 [ 4029.115166] RBP: ffffc90001a5fc28 R08: 0000000800000000 R09: 00000008ffffffff [ 4029.115185] R10: ffffea0005661600 R11: ffff88015cc29a80 R12: ffff88021231f058 [ 4029.115204] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 4029.115223] FS: 00007fe73a6b4740(0000) GS:ffff88021ea80000(0000) knlGS:0000000000000000 [ 4029.115245] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4029.115261] CR2: 0000000000000030 CR3: 0000000159a74000 CR4: 00000000001426e0 [ 4029.115281] Call Trace: [ 4029.115291] ? raid_iterate_devices+0x63/0x80 [dm_raid] [ 4029.115309] ? dm_table_all_devices_attribute.isra.23+0x41/0x70 [dm_mod] [ 4029.115329] ? dm_table_set_restrictions+0x225/0x2d0 [dm_mod] [ 4029.115346] raid_preresume+0x81/0x2e0 [dm_raid] [ 4029.115361] dm_table_resume_targets+0x47/0xe0 [dm_mod] [ 4029.115378] dm_resume+0xa8/0xd0 [dm_mod] [ 4029.115391] dev_suspend+0x123/0x250 [dm_mod] [ 4029.115405] ? table_load+0x350/0x350 [dm_mod] [ 4029.115419] ctl_ioctl+0x1c2/0x490 [dm_mod] [ 4029.115433] dm_ctl_ioctl+0xe/0x20 [dm_mod] [ 4029.115447] do_vfs_ioctl+0x8d/0x5a0 [ 4029.115459] ? ____fput+0x9/0x10 [ 4029.115470] ? task_work_run+0x79/0xa0 [ 4029.115481] SyS_ioctl+0x3c/0x70 [ 4029.115493] entry_SYSCALL_64_fastpath+0x13/0x94 The raid_preresume() function incorrectly assumes that the raid_set has a bitmap enabled if RT_FLAG_RS_BITMAP_LOADED is set. But RT_FLAG_RS_BITMAP_LOADED is getting set in __load_dirty_region_bitmap() even if there is no bitmap present (and bitmap_load() happily returns 0 even if a bitmap isn't present). So the only way forward in the near-term is to check if the bitmap is present by seeing if mddev->bitmap is not NULL after bitmap_load() has been called. By doing so the above NULL pointer is avoided. Fixes: 4257e08 ("dm raid: support to change bitmap region size") Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru> Signed-off-by: Andrey Smetanin <asmetanin@yandex-team.ru> Acked-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-31blk-mq: constify struct blk_mq_opsEric Biggers
Constify all instances of blk_mq_ops, as they are never modified. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-30dm raid: select the Kconfig option CONFIG_MD_RAID0Mikulas Patocka
Since the commit 0cf4503174c1 ("dm raid: add support for the MD RAID0 personality"), the dm-raid subsystem can activate a RAID-0 array. Therefore, add MD_RAID0 to the dependencies of DM_RAID, so that MD_RAID0 will be selected when DM_RAID is selected. Fixes: 0cf4503174c1 ("dm raid: add support for the MD RAID0 personality") Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-28md: raid1: kill warning on powerpc_pseriesMing Lei
This patch kills the warning reported on powerpc_pseries, and actually we don't need the initialization. After merging the md tree, today's linux-next build (powerpc pseries_le_defconfig) produced this warning: drivers/md/raid1.c: In function 'raid1d': drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized] if (memcmp(page_address(ppages[j]), ^ drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here int page_len[RESYNC_PAGES]; ^ Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-28Fix dead URLs to ftp.kernel.orgSeongJae Park
URLs to ftp.kernel.org are still exist though the service is closed [0]. This commit fixes the URLs to use www.kernel.org instead. [0] https://www.kernel.org/shutting-down-ftp-services.html Signed-off-by: SeongJae Park <sj38.park@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2017-03-27md/raid5: use consistency_policy to remove journal featureSong Liu
When journal device of an array fails, the array is forced into read-only mode. To make the array normal without adding another journal device, we need to remove journal _feature_ from the array. This patch allows remove journal _feature_ from an array, For journal existing journal should be either missing or faulty. To remove journal feature, it is necessary to remove the journal device first: mdadm --fail /dev/md0 /dev/sdb mdadm: set /dev/sdb faulty in /dev/md0 mdadm --remove /dev/md0 /dev/sdb mdadm: hot removed /dev/sdb from /dev/md0 Then the journal feature can be removed by echoing into the sysfs file: cat /sys/block/md0/md/consistency_policy journal echo resync > /sys/block/md0/md/consistency_policy cat /sys/block/md0/md/consistency_policy resync Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-27dm raid: add raid4/5/6 journal write-back support via journal_mode optionHeinz Mauelshagen
Commit 63c32ed4afc ("dm raid: add raid4/5/6 journaling support") added journal support to close the raid4/5/6 "write hole" -- in terms of writethrough caching. Introduce a "journal_mode" feature and use the new r5c_journal_mode_set() API to add support for switching the journal device's cache mode between write-through (the current default) and write-back. NOTE: If the journal device is not layered on resilent storage and it fails, write-through mode will cause the "write hole" to reoccur. But if the journal fails while in write-back mode it will cause data loss for any dirty cache entries unless resilent storage is used for the journal. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-27dm raid: fix table line argument order in statusHeinz Mauelshagen
Commit 3a1c1ef2f ("dm raid: enhance status interface and fixup takeover/raid0") added new table line arguments and introduced an ordering flaw. The sequence of the raid10_copies and raid10_format raid parameters got reversed which causes lvm2 userspace to fail by falsely assuming a changed table line. Sequence those 2 parameters as before so that old lvm2 can function properly with new kernels by adjusting the table line output as documented in Documentation/device-mapper/dm-raid.txt. Also, add missing version 1.10.1 highlight to the documention. Fixes: 3a1c1ef2f ("dm raid: enhance status interface and fixup takeover/raid0") Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-27md: add raid4/5/6 journal mode switching APIHeinz Mauelshagen
Commit 2ded370373a4 ("md/r5cache: State machine for raid5-cache write back mode") added support for "write-back" caching on the raid journal device. In order to allow the dm-raid target to switch between the available "write-through" and "write-back" modes, provide a new r5c_journal_mode_set() API. Use the new API in existing r5c_journal_mode_store() Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Acked-by: Shaohua Li <shli@fb.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-25md/raid5-cache: fix payload endianness problem in raid5-cacheJason Yan
The payload->header.type and payload->size are little-endian, so just convert them to the right byte order. Signed-off-by: Jason Yan <yanaijie@huawei.com> Cc: <stable@vger.kernel.org> #v4.10+ Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-25md/raid1: skip data copy for behind io for discard requestShaohua Li
discard request doesn't have data attached, so it's meaningless to allocate memory and copy from original bio for behind IO. And the copy is bogus because bio_copy_data_partial can't handle discard request. We don't support writesame/writezeros request so far. Reviewed-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-24dm crypt: use shifts instead of sector_divMikulas Patocka
sector_div is very slow, so we introduce a variable sector_shift and use shift instead of sector_div. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-03-24dm integrity: add recovery modeMikulas Patocka
In recovery mode, we don't: - replay the journal - check checksums - allow writes to the device This mode can be used as a last resort for data recovery. The motivation for recovery mode is that when there is a single error in the journal, the user should not lose access to the whole device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>