summaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2016-11-07md/bitmap: Don't write bitmap while earlier writes might be in-flightNeilBrown
As we don't wait for writes to complete in bitmap_daemon_work, they could still be in-flight when bitmap_unplug writes again. Or when bitmap_daemon_work tries to write again. This can be confusing and could risk the wrong data being written last. So make sure we wait for old writes to complete before new writes start. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/raid10: abort delayed writes when device fails.NeilBrown
When writing to an array with a bitmap enabled, the writes are grouped in batches which are preceded by an update to the bitmap. It is quite likely if that a drive develops a problem which is not media related, that the bitmap write will be the first to report an error and cause the device to be marked faulty (as the bitmap write is at the start of a batch). In this case, there is point submiting the subsequent writes to the failed device - that just wastes times. So re-check the Faulty state of a device before submitting a delayed write. This requires that we keep the 'rdev', rather than the 'bdev' in the bio, then swap in the bdev just before final submission. Reported-by: Hannes Reinecke <hare@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/raid1: abort delayed writes when device fails.NeilBrown
When writing to an array with a bitmap enabled, the writes are grouped in batches which are preceded by an update to the bitmap. It is quite likely if that a drive develops a problem which is not media related, that the bitmap write will be the first to report an error and cause the device to be marked faulty (as the bitmap write is at the start of a batch). In this case, there is point submiting the subsequent writes to the failed device - that just wastes times. So re-check the Faulty state of a device before submitting a delayed write. This requires that we keep the 'rdev', rather than the 'bdev' in the bio, then swap in the bdev just before final submission. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md: perform async updates for metadata where possible.NeilBrown
When adding devices to, or removing device from, an array we need to update the metadata. However we don't need to do it synchronously as data integrity doesn't depend on these changes being recorded instantly. So avoid the synchronous call to md_update_sb and just set a flag so that the thread will do it. This can reduce the number of updates performed when lots of devices are being added or removed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07raid5-cache: restrict the use area of the log_offset variableJackieLiu
We can calculate this offset by using ctx->meta_total_blocks, without passing in from the function Signed-off-by: JackieLiu <liuyun01@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/raid5: change printk() to pr_*()NeilBrown
Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/raid10: change printk() to pr_*()NeilBrown
Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/raid1: change printk() to pr_*()NeilBrown
Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/raid0: replace printk() with pr_*()NeilBrown
This makes md/raid0 much less verbose as the messages about the array geometry are now pr_debug() Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/multipath: replace printk() with pr_*()NeilBrown
Also remove all messages about memory allocation failure. page_alloc() reports those. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/linear: replace printk() with pr_*()NeilBrown
Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/bitmap: change all printk() to pr_*()NeilBrown
Follow err/warn distinction introduced in md.c Join multi-part strings into single string. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md: change all printk() to pr_err() or pr_warn() etc.NeilBrown
1/ using pr_debug() for a number of messages reduces the noise of md, but still allows them to be enabled when needed. 2/ try to be consistent in the usage of pr_err() and pr_warn(), and document the intention 3/ When strings have been split onto multiple lines, rejoin into a single string. The cost of having lines > 80 chars is less than the cost of not being able to easily search for a particular message. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md: fix some issues with alloc_disk_sb()NeilBrown
1/ don't print a warning if allocation fails. page_alloc() does that already. 2/ always check return status for error. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md/bitmap: call bitmap_file_unmap once bitmap_storage_alloc returns -ENOMEMGuoqing Jiang
It is possible that bitmap_storage_alloc could return -ENOMEM, and some member inside store could be allocated such as filemap. To avoid memory leak, we need to call bitmap_file_unmap to free those members in the bitmap_resize. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07raid5: revert commit 11367799f3d1Tomasz Majchrzak
Revert commit 11367799f3d1 ("md: Prevent IO hold during accessing to faulty raid5 array") as it doesn't comply with commit c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns."). That change is not required anymore as the problem is resolved by commit 16f889499a52 ("md: report 'write_pending' state when array in sync") - read request is stuck as array state is not reported correctly via sysfs attribute. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md: wake up personality thread after array state updateTomasz Majchrzak
When raid1/raid10 array fails to write to one of the drives, the request is added to bio_end_io_list and finished by personality thread. The thread doesn't handle it as long as MD_CHANGE_PENDING flag is set. In case of external metadata this flag is cleared, however the thread is not woken up. It causes request to be blocked for few seconds (until another action on the array wakes up the thread) or to get stuck indefinitely. Wake up personality thread once MD_CHANGE_PENDING has been cleared. Moving 'restart_array' call after the flag is cleared it not a solution because in read-write mode the call doesn't wake up the thread. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md: don't fail an array if there are unacknowledged bad blocksTomasz Majchrzak
If external metadata handler supports bad blocks and unacknowledged bad blocks are present, don't report disk via sysfs as faulty. Such situation can be still handled so disk just has to be blocked for a moment. It makes it consistent with kernel state as corresponding rdev flag is also not set. When the disk in being unblocked there are few cases: 1. Disk has been in blocked and faulty state, it is being unblocked but it still remains in faulty state. Metadata handler will remove it from array in the next call. 2. There is no bad block support in external metadata handler and bad blocks are present - put the disk in blocked and faulty state (see case 1). 3. There is bad block support in external metadata handler and all bad blocks are acknowledged - clear all flags, continue. 4. There is bad block support in external metadata handler but there are still unacknowledged bad blocks - clear all flags, continue. It is fine to clear Blocked flag because it was probably not set anyway (if it was it is case 1). BlockedBadBlocks flag can also be cleared because the request waiting for it will set it again when it finds out that some bad block is still not acknowledged. Recovery is not necessary but there are no problems if the flag is set. Sysfs rdev state is still reported as blocked (due to unacknowledged bad blocks) so metadata handler will process remaining bad blocks and unblock disk again. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-07md: add bad block support for external metadataTomasz Majchrzak
Add new rdev flag which external metadata handler can use to switch on/off bad block support. If new bad block is encountered, notify it via rdev 'unacknowledged_bad_blocks' sysfs file. If bad block has been cleared, notify update to rdev 'bad_blocks' sysfs file. When bad blocks support is being removed, just clear rdev flag. It is not necessary to reset badblocks->shift field. If there are bad blocks cleared or added at the same time, it is ok for those changes to be applied to the structure. The array is in blocked state and the drive which cannot handle bad blocks any more will be removed from the array before it is unlocked. Simplify state_show function by adding a separator at the end of each string and overwrite last separator with new line. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-11-05Merge tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD fixes from Shaohua Li: "There are several bug fixes queued: - fix raid5-cache recovery bugs - fix discard IO error handling for raid1/10 - fix array sync writes bogus position to superblock - fix IO error handling for raid array with external metadata" * tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: be careful not lot leak internal curr_resync value into metadata. -- (all) raid1: handle read error also in readonly mode raid5-cache: correct condition for empty metadata write md: report 'write_pending' state when array in sync md/raid5: write an empty meta-block when creating log super-block md/raid5: initialize next_checkpoint field before use RAID10: ignore discard error RAID1: ignore discard error
2016-11-02dm: Fix a race condition related to stopping and starting queuesBart Van Assche
Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request() calls have stopped before setting the "queue stopped" flag. This allows to remove the "queue stopped" test from dm_mq_queue_rq() and dm_mq_requeue_request(). This patch fixes a race condition because dm_mq_queue_rq() is called without holding the queue lock and hence BLK_MQ_S_STOPPED can be set at any time while dm_mq_queue_rq() is in progress. This patch prevents that the following hang occurs sporadically when using dm-mq: INFO: task systemd-udevd:10111 blocked for more than 480 seconds. Call Trace: [<ffffffff8161f397>] schedule+0x37/0x90 [<ffffffff816239ef>] schedule_timeout+0x27f/0x470 [<ffffffff8161e76f>] io_schedule_timeout+0x9f/0x110 [<ffffffff8161fb36>] bit_wait_io+0x16/0x60 [<ffffffff8161f929>] __wait_on_bit_lock+0x49/0xa0 [<ffffffff8114fe69>] __lock_page+0xb9/0xc0 [<ffffffff81165d90>] truncate_inode_pages_range+0x3e0/0x760 [<ffffffff81166120>] truncate_inode_pages+0x10/0x20 [<ffffffff81212a20>] kill_bdev+0x30/0x40 [<ffffffff81213d41>] __blkdev_put+0x71/0x360 [<ffffffff81214079>] blkdev_put+0x49/0x170 [<ffffffff812141c0>] blkdev_close+0x20/0x30 [<ffffffff811d48e8>] __fput+0xe8/0x1f0 [<ffffffff811d4a29>] ____fput+0x9/0x10 [<ffffffff810842d3>] task_work_run+0x83/0xb0 [<ffffffff8106606e>] do_exit+0x3ee/0xc40 [<ffffffff8106694b>] do_group_exit+0x4b/0xc0 [<ffffffff81073d9a>] get_signal+0x2ca/0x940 [<ffffffff8101bf43>] do_signal+0x23/0x660 [<ffffffff810022b3>] exit_to_usermode_loop+0x73/0xb0 [<ffffffff81002cb0>] syscall_return_slowpath+0xb0/0xc0 [<ffffffff81624e33>] entry_SYSCALL_64_fastpath+0xa6/0xa8 Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq codeBart Van Assche
Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED in the dm start and stop queue functions, only manipulate the latter flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped(). Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()Bart Van Assche
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls are followed by kicking the requeue list. Hence add an argument to these two functions that allows to kick the requeue list. This was proposed by Christoph Hellwig. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Remove blk_mq_cancel_requeue_work()Bart Van Assche
Since blk_mq_requeue_work() no longer restarts stopped queues canceling requeue work is no longer needed to prevent that a stopped queue would be restarted. Hence remove this function. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02blk-mq: Avoid that requeueing starts stopped queuesBart Van Assche
Since blk_mq_requeue_work() starts stopped queues and since execution of this function can be scheduled after a queue has been stopped it is not possible to stop queues without using an additional state variable to track whether or not the queue has been stopped. Hence modify blk_mq_requeue_work() such that it does not start stopped queues. My conclusion after a review of the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list() callers is as follows: * In the dm driver starting and stopping queues should only happen if __dm_suspend() or __dm_resume() is called and not if the requeue list is processed. * In the SCSI core queue stopping and starting should only be performed by the scsi_internal_device_block() and scsi_internal_device_unblock() functions but not by any other function. Although the blk_mq_stop_hw_queue() call in scsi_queue_rq() may help to reduce CPU load if a LLD queue is full, figuring out whether or not a queue should be restarted when requeueing a command would require to introduce additional locking in scsi_mq_requeue_cmd() to avoid a race with scsi_internal_device_block(). Avoid this complexity by removing the blk_mq_stop_hw_queue() call from scsi_queue_rq(). * In the NVMe core only the functions that call blk_mq_start_stopped_hw_queues() explicitly should start stopped queues. * A blk_mq_start_stopped_hwqueues() call must be added in the xen-blkfront driver in its blkif_recover() function. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <jejb@linux.vnet.ibm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01block,fs: use REQ_* flags directlyChristoph Hellwig
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags directly. Where applicable this also drops usage of the bio_set_op_attrs wrapper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01bcache: use op_is_sync to check for synchronous requestsChristoph Hellwig
(and remove one layer of masking for the op_is_write call next to it). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28md: be careful not lot leak internal curr_resync value into metadata. -- (all)NeilBrown
mddev->curr_resync usually records where the current resync is up to, but during the starting phase it has some "magic" values. 1 - means that the array is trying to start a resync, but has yielded to another array which shares physical devices, and also needs to start a resync 2 - means the array is trying to start resync, but has found another array which shares physical devices and has already started resync. 3 - means that resync has commensed, but it is possible that nothing has actually been resynced yet. It is important that this value not be visible to user-space and particularly that it doesn't get written to the metadata, as the resync or recovery checkpoint. In part, this is because it may be slightly higher than the correct value, though this is very rare. In part, because it is not a multiple of 4K, and some devices only support 4K aligned accesses. There are two places where this value is propagates into either ->curr_resync_completed or ->recovery_cp or ->recovery_offset. These currently avoid the propagation of values 1 and 3, but will allow 3 to leak through. Change them to only propagate the value if it is > 3. As this can cause an array to fail, the patch is suitable for -stable. Cc: stable@vger.kernel.org (v3.7+) Reported-by: Viswesh <viswesh.vichu@gmail.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28raid1: handle read error also in readonly modeTomasz Majchrzak
If write is the first operation on a disk and it happens not to be aligned to page size, block layer sends read request first. If read operation fails, the disk is set as failed as no attempt to fix the error is made because array is in auto-readonly mode. Similarily, the disk is set as failed for read-only array. Take the same approach as in raid10. Don't fail the disk if array is in readonly or auto-readonly mode. Try to redirect the request first and if unsuccessful, return a read error. Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28raid5-cache: correct condition for empty metadata writeShaohua Li
As long as we recover one metadata block, we should write the empty metadata write. The original code could make recovery corrupted if only one meta is valid. Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28Merge tag 'dm-4.9-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - a couple DM raid and DM mirror fixes - a couple .request_fn request-based DM NULL pointer fixes - a fix for a DM target reference count leak, on target load error, that prevented associated DM target kernel module(s) from being removed * tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm table: fix missing dm_put_target_type() in dm_table_add_target() dm rq: clear kworker_task if kthread_run() returned an error dm: free io_barrier after blk_cleanup_queue call dm raid: fix activation of existing raid4/10 devices dm mirror: use all available legs on multiple failures dm mirror: fix read error on recovery after default leg failure dm raid: fix compat_features validation
2016-10-28block: better op and flags encodingChristoph Hellwig
Now that we don't need the common flags to overflow outside the range of a 32-bit type we can encode them the same way for both the bio and request fields. This in addition allows us to place the operation first (and make some room for more ops while we're at it) and to stop having to shift around the operation values. In addition this allows passing around only one value in the block layer instead of two (and eventuall also in the file systems, but we can do that later) and thus clean up a lot of code. Last but not least this allows decreasing the size of the cmd_flags field in struct request to 32-bits. Various functions passing this value could also be updated, but I'd like to avoid the churn for now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-28block: split out request-only flags into a new namespaceChristoph Hellwig
A lot of the REQ_* flags are only used on struct requests, and only of use to the block layer and a few drivers that dig into struct request internals. This patch adds a new req_flags_t rq_flags field to struct request for them, and thus dramatically shrinks the number of common requests. It also removes the unfortunate situation where we have to fit the fields from the same enum into 32 bits for struct bio and 64 bits for struct request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-24md: report 'write_pending' state when array in syncTomasz Majchrzak
If there is a bad block on a disk and there is a recovery performed from this disk, the same bad block is reported for a new disk. It involves setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external metadata this flag is not being cleared as array state is reported as 'clean'. The read request to bad block in RAID5 array gets stuck as it is waiting for a flag to be cleared - as per commit c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns."). The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been clarified in commit 070dc6dd7103 ("md: resolve confusion of MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in personality error handlers since and it doesn't fully comply with initial purpose. It was supposed to notify that write request is about to start, however now it is also used to request metadata update. Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has been set and in_sync has been set to 0 at the same time. Error handlers just set the flag without modifying in_sync value. Sysfs array state is a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is set and in_sync is set to 1. Userspace has no idea it is expected to take some action. Swap the order that array state is checked so 'write_pending' is reported ahead of 'clean' ('write_pending' is a misleading name but it is too late to rename it now). Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24md/raid5: write an empty meta-block when creating log super-blockZhengyuan Liu
If superblock points to an invalid meta block, r5l_load_log will set create_super with true and create an new superblock, this runtime path would always happen if we do no writing I/O to this array since it was created. Writing an empty meta block could avoid this unnecessary action at the first time we created log superblock. Another reason is for the corretness of log recovery. Currently we have bellow code to guarantee log revocery to be correct. if (ctx.seq > log->last_cp_seq + 1) { int ret; ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10); if (ret) return ret; log->seq = ctx.seq + 11; log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS); r5l_write_super(log, ctx.pos); } else { log->log_start = ctx.pos; log->seq = ctx.seq; } If we just created a array with a journal device, log->log_start and log->last_checkpoint should all be 0, then we write three meta block which are valid except mid one and supposed crash happened. The ctx.seq would equal to log->last_cp_seq + 1 and log->log_start would be set to position of mid invalid meta block after we did a recovery, this will lead to problems which could be avoided with this patch. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24md/raid5: initialize next_checkpoint field before useZhengyuan Liu
No initial operation was done to this field when we load/recovery the log, it got assignment only when IO to raid disk was finished. So r5l_quiesce may use wrong next_checkpoint to reclaim log space, that would make reclaimable space calculation confused. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24RAID10: ignore discard errorShaohua Li
This is the counterpart of raid10 fix. If a write error occurs, raid10 will try to rewrite the bio in small chunk size. If the rewrite fails, raid10 will record the error in bad block. narrow_write_error will always use WRITE for the bio, but actually it could be a discard. Since discard bio hasn't payload, write the bio will cause different issues. But discard error isn't fatal, we can safely ignore it. This is what this patch does. This issue should exist since discard is added, but only exposed with recent arbitrary bio size feature. Cc: Sitsofe Wheeler <sitsofe@gmail.com> Cc: stable@vger.kernel.org (v3.6) Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24RAID1: ignore discard errorShaohua Li
If a write error occurs, raid1 will try to rewrite the bio in small chunk size. If the rewrite fails, raid1 will record the error in bad block. narrow_write_error will always use WRITE for the bio, but actually it could be a discard. Since discard bio hasn't payload, write the bio will cause different issues. But discard error isn't fatal, we can safely ignore it. This is what this patch does. This issue should exist since discard is added, but only exposed with recent arbitrary bio size feature. Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com> Cc: stable@vger.kernel.org (v3.6) Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24dm table: fix missing dm_put_target_type() in dm_table_add_target()tang.junhui
dm_get_target_type() was previously called so any error returned from dm_table_add_target() must first call dm_put_target_type(). Otherwise the DM target module's reference count will leak and the associated kernel module will be unable to be removed. Also, leverage the fact that r is already -EINVAL and remove an extra newline. Fixes: 36a0456 ("dm table: add immutable feature") Fixes: cc6cbe1 ("dm table: add always writeable feature") Fixes: 3791e2f ("dm table: add singleton feature") Cc: stable@vger.kernel.org # 3.2+ Signed-off-by: tang.junhui <tang.junhui@zte.com.cn> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-18dm rq: clear kworker_task if kthread_run() returned an errorMike Snitzer
cleanup_mapped_device() calls kthread_stop() if kworker_task is non-NULL. Currently the assigned value could be a valid task struct or an error code (e.g -ENOMEM). Reset md->kworker_task to NULL if kthread_run() returned an erorr. Fixes: 7193a9defc ("dm rq: check kthread_run return for .request_fn request-based DM") Cc: stable@vger.kernel.org # 4.8 Reported-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-18dm: free io_barrier after blk_cleanup_queue callTahsin Erdogan
dm_old_request_fn() has paths that access md->io_barrier. The party destroying io_barrier should ensure that no future execution of dm_old_request_fn() is possible. Move io_barrier destruction to below blk_cleanup_queue() to ensure this and avoid a NULL pointer crash during request-based DM device shutdown. Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-17dm raid: fix activation of existing raid4/10 devicesHeinz Mauelshagen
dm-raid 1.9.0 fails to activate existing RAID4/10 devices that have the old superblock format (which does not have takeover/reshaping support that was added via commit 33e53f06850f). Fix validation path for old superblocks by reverting to the old raid4 layout and basing checks on mddev->new_{level,layout,...} members in super_init_validation(). Cc: stable@vger.kernel.org # 4.8 Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-14dm mirror: use all available legs on multiple failuresHeinz Mauelshagen
When any leg(s) have failed, any read will cause a new operational default leg to be selected and the read is resubmitted to it. If that new default leg fails the read too, no other still accessible legs are used to resubmit the read again -- thus failing the io. Fix by allowing the read to get resubmitted until all operational legs have been exhausted. Also, remove any details.bi_dev use as a flag. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-14dm mirror: fix read error on recovery after default leg failureHeinz Mauelshagen
If a default leg has failed, any read will cause a new operational default leg to be selected and the read is resubmitted. But until now the read will return failure even though it was successful due to resubmission. The reason for this is bio->bi_error was not being cleared before resubmitting the bio. Fix by clearing bio->bi_error before resubmission. Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio") Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-11kthread: kthread worker API cleanupPetr Mladek
A good practice is to prefix the names of functions by the name of the subsystem. The kthread worker API is a mix of classic kthreads and workqueues. Each worker has a dedicated kthread. It runs a generic function that process queued works. It is implemented as part of the kthread subsystem. This patch renames the existing kthread worker API to use the corresponding name from the workqueues API prefixed by kthread_: __init_kthread_worker() -> __kthread_init_worker() init_kthread_worker() -> kthread_init_worker() init_kthread_work() -> kthread_init_work() insert_kthread_work() -> kthread_insert_work() queue_kthread_work() -> kthread_queue_work() flush_kthread_work() -> kthread_flush_work() flush_kthread_worker() -> kthread_flush_worker() Note that the names of DEFINE_KTHREAD_WORK*() macros stay as they are. It is common that the "DEFINE_" prefix has precedence over the subsystem names. Note that INIT() macros and init() functions use different naming scheme. There is no good solution. There are several reasons for this solution: + "init" in the function names stands for the verb "initialize" aka "initialize worker". While "INIT" in the macro names stands for the noun "INITIALIZER" aka "worker initializer". + INIT() macros are used only in DEFINE() macros + init() functions are used close to the other kthread() functions. It looks much better if all the functions use the same scheme. + There will be also kthread_destroy_worker() that will be used close to kthread_cancel_work(). It is related to the init() function. Again it looks better if all functions use the same naming scheme. + there are several precedents for such init() function names, e.g. amd_iommu_init_device(), free_area_init_node(), jump_label_init_type(), regmap_init_mmio_clk(), + It is not an argument but it was inconsistent even before. [arnd@arndb.de: fix linux-next merge conflict] Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11dm raid: fix compat_features validationAndy Whitcroft
In ecbfb9f118bce4 ("dm raid: add raid level takeover support") a new compatible feature flag was added. Validation for these compat_features was added but this only passes for new raid mappings with this feature flag. This causes previously created raid mappings to be failed at import. Check compat_features for the only valid combination. Fixes: ecbfb9f118bce4 ("dm raid: add raid level takeover support") Cc: stable@vger.kernel.org # v4.8 Signed-off-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-09Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull blk-mq irq/cpu mapping updates from Jens Axboe: "This is the block-irq topic branch for 4.9-rc. It's mostly from Christoph, and it allows drivers to specify their own mappings, and more importantly, to share the blk-mq mappings with the IRQ affinity mappings. It's a good step towards making this work better out of the box" * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block: blk_mq: linux/blk-mq.h does not include all the headers it depends on blk-mq: kill unused blk_mq_create_mq_map() blk-mq: get rid of the cpumask in struct blk_mq_tags nvme: remove the post_scan callout nvme: switch to use pci_alloc_irq_vectors blk-mq: provide a default queue mapping for PCI device blk-mq: allow the driver to pass in a queue mapping blk-mq: remove ->map_queue blk-mq: only allocate a single mq_map per tag_set blk-mq: don't redistribute hardware queues on a CPU hotplug event
2016-10-09Merge tag 'dm-4.9-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - various fixes and cleanups for request-based DM core - add support for delaying the requeue of requests; used by DM multipath when all paths have failed and 'queue_if_no_path' is enabled - DM cache improvements to speedup the loading metadata and the writing of the hint array - fix potential for a dm-crypt crash on device teardown - remove dm_bufio_cond_resched() and just using cond_resched() - change DM multipath to return a reservation conflict error immediately; rather than failing the path and retrying (potentially indefinitely) * tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits) dm mpath: always return reservation conflict without failing over dm bufio: remove dm_bufio_cond_resched() dm crypt: fix crash on exit dm cache metadata: switch to using the new cursor api for loading metadata dm array: introduce cursor api dm btree: introduce cursor api dm cache policy smq: distribute entries to random levels when switching to smq dm cache: speed up writing of the hint array dm array: add dm_array_new() dm mpath: delay the requeue of blk-mq requests while all paths down dm mpath: use dm_mq_kick_requeue_list() dm rq: introduce dm_mq_kick_requeue_list() dm rq: reduce arguments passed to map_request() and dm_requeue_original_request() dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests dm: convert wait loops to use autoremove_wake_function() dm: use signal_pending_state() in dm_wait_for_completion() dm: rename task state function arguments dm: add two lockdep_assert_held() statements dm rq: simplify dm_old_stop_queue() dm mpath: check if path's request_queue is dying in activate_path() ...
2016-10-07Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "This is the main pull request for block layer changes in 4.9. As mentioned at the last merge window, I've changed things up and now do just one branch for core block layer changes, and driver changes. This avoids dependencies between the two branches. Outside of this main pull request, there are two topical branches coming as well. This pull request contains: - A set of fixes, and a conversion to blk-mq, of nbd. From Josef. - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd. Followup dependency fix from Geert. - General fixes from Bart, Baoyou, Guoqing, and Linus W. - CFQ async write starvation fix from Glauber. - Add supprot for delayed kick of the requeue list, from Mike. - Pull out the scalable bitmap code from blk-mq-tag.c and make it generally available under the name of sbitmap. Only blk-mq-tag uses it for now, but the blk-mq scheduling bits will use it as well. From Omar. - bdev thaw error progagation from Pierre. - Improve the blk polling statistics, and allow the user to clear them. From Stephen. - Set of minor cleanups from Christoph in block/blk-mq. - Set of cleanups and optimizations from me for block/blk-mq. - Various nvme/nvmet/nvmeof fixes from the various folks" * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits) fs/block_dev.c: return the right error in thaw_bdev() nvme: Pass pointers, not dma addresses, to nvme_get/set_features() nvme/scsi: Remove power management support nvmet: Make dsm number of ranges zero based nvmet: Use direct IO for writes admin-cmd: Added smart-log command support. nvme-fabrics: Add host_traddr options field to host infrastructure nvme-fabrics: revise host transport option descriptions nvme-fabrics: rework nvmf_get_address() for variable options nbd: use BLK_MQ_F_BLOCKING blkcg: Annotate blkg_hint correctly cfq: fix starvation of asynchronous writes blk-mq: add flag for drivers wanting blocking ->queue_rq() blk-mq: remove non-blocking pass in blk_mq_map_request blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue() block: export bio_free_pages to other modules lightnvm: propagate device_add() error code lightnvm: expose device geometry through sysfs lightnvm: control life of nvm_dev in driver blk-mq: register device instead of disk ...
2016-10-07Merge tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/mdLinus Torvalds
Pull MD updates from Shaohua Li: "This update includes: - new AVX512 instruction based raid6 gen/recovery algorithm - a couple of md-cluster related bug fixes - fix a potential deadlock - set nonrotational bit for raid array with SSD - set correct max_hw_sectors for raid5/6, which hopefuly can improve performance a little bit - other minor fixes" * tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: md: set rotational bit raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays raid5: handle register_shrinker failure raid5: fix to detect failure of register_shrinker md: fix a potential deadlock md/bitmap: fix wrong cleanup raid5: allow arbitrary max_hw_sectors lib/raid6: Add AVX512 optimized xor_syndrome functions lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions lib/raid6: Add AVX512 optimized recovery functions lib/raid6: Add AVX512 optimized gen_syndrome functions md-cluster: make resync lock also could be interruptted md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang md-cluster: convert the completion to wait queue md-cluster: protect md_find_rdev_nr_rcu with rcu lock md-cluster: clean related infos of cluster md: changes for MD_STILL_CLOSED flag md-cluster: remove some unnecessary dlm_unlock_sync md-cluster: use FORCEUNLOCK in lockres_free md-cluster: call md_kick_rdev_from_array once ack failed