summaryrefslogtreecommitdiff
path: root/fs/nfs
AgeCommit message (Collapse)Author
2017-09-09NFS: Remove pnfs_generic_transfer_commit_list()Trond Myklebust
It's pretty much a duplicate of nfs_scan_commit_list() that also clears the PG_COMMIT_TO_DS flag. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-09NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlockTrond Myklebust
Since the commit list is not ordered, it is possible for nfs_scan_commit_list to hold a request that nfs_lock_and_join_requests() is waiting for, while at the same time trying to grab a request that nfs_lock_and_join_requests already holds. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-08NFS: Fix 2 use after free issues in the I/O codeTrond Myklebust
The writeback code wants to send a commit after processing the pages, which is why we want to delay releasing the struct path until after that's done. Also, the layout code expects that we do not free the inode before we've put the layout segments in pnfs_writehdr_free() and pnfs_readhdr_free() Fixes: 919e3bd9a875 ("NFS: Ensure we commit after writeback is complete") Fixes: 4714fb51fd03 ("nfs: remove pgio_header refcount, related cleanup") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-07Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block layer updates from Jens Axboe: "This is the first pull request for 4.14, containing most of the code changes. It's a quiet series this round, which I think we needed after the churn of the last few series. This contains: - Fix for a registration race in loop, from Anton Volkov. - Overflow complaint fix from Arnd for DAC960. - Series of drbd changes from the usual suspects. - Conversion of the stec/skd driver to blk-mq. From Bart. - A few BFQ improvements/fixes from Paolo. - CFQ improvement from Ritesh, allowing idling for group idle. - A few fixes found by Dan's smatch, courtesy of Dan. - A warning fixup for a race between changing the IO scheduler and device remova. From David Jeffery. - A few nbd fixes from Josef. - Support for cgroup info in blktrace, from Shaohua. - Also from Shaohua, new features in the null_blk driver to allow it to actually hold data, among other things. - Various corner cases and error handling fixes from Weiping Zhang. - Improvements to the IO stats tracking for blk-mq from me. Can drastically improve performance for fast devices and/or big machines. - Series from Christoph removing bi_bdev as being needed for IO submission, in preparation for nvme multipathing code. - Series from Bart, including various cleanups and fixes for switch fall through case complaints" * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits) kernfs: checking for IS_ERR() instead of NULL drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set drbd: Fix allyesconfig build, fix recent commit drbd: switch from kmalloc() to kmalloc_array() drbd: abort drbd_start_resync if there is no connection drbd: move global variables to drbd namespace and make some static drbd: rename "usermode_helper" to "drbd_usermode_helper" drbd: fix race between handshake and admin disconnect/down drbd: fix potential deadlock when trying to detach during handshake drbd: A single dot should be put into a sequence. drbd: fix rmmod cleanup, remove _all_ debugfs entries drbd: Use setup_timer() instead of init_timer() to simplify the code. drbd: fix potential get_ldev/put_ldev refcount imbalance during attach drbd: new disk-option disable-write-same drbd: Fix resource role for newly created resources in events2 drbd: mark symbols static where possible drbd: Send P_NEG_ACK upon write error in protocol != C drbd: add explicit plugging when submitting batches drbd: change list_for_each_safe to while(list_first_entry_or_null) drbd: introduce drbd_recv_header_maybe_unplug ...
2017-09-07NFS: Sync the correct byte range during synchronous writestarangg@amazon.com
Since commit 18290650b1c8 ("NFS: Move buffered I/O locking into nfs_file_write()") nfs_file_write() has not flushed the correct byte range during synchronous writes. generic_write_sync() expects that iocb->ki_pos points to the right edge of the range rather than the left edge. To replicate the problem, open a file with O_DSYNC, have the client write at increasing offsets, and then print the successful offsets. Block port 2049 partway through that sequence, and observe that the client application indicates successful writes in advance of what the server received. Fixes: 18290650b1c8 ("NFS: Move buffered I/O locking into nfs_file_write()") Signed-off-by: Jacob Strauss <jsstraus@amazon.com> Signed-off-by: Tarang Gupta <tarangg@amazon.com> Tested-by: Tarang Gupta <tarangg@amazon.com> Cc: stable@vger.kernel.org # v4.8+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06fscache: remove unused ->now_uncached callbackJan Kara
Patch series "Ranged pagevec lookup", v2. In this series I make pagevec_lookup() update the index (to be consistent with pagevec_lookup_tag() and also as a preparation for ranged lookups), provide ranged variant of pagevec_lookup() and use it in places where it makes sense. This not only removes some common code but is also a measurable performance win for some use cases (see patch 4/10) where radix tree is sparse and searching & grabing of a page after the end of the range has measurable overhead. This patch (of 10): The callback doesn't ever get called. Remove it. Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06NFS: remove jiffies field from access cacheNeilBrown
This field hasn't been used since commit 57b691819ee2 ("NFS: Cache access checks more aggressively"). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06NFS: flush data when locking a file to ensure cache coherence for mmap.NeilBrown
When a byte range lock (or flock) is taken out on an NFS file, the validity of the cached data is checked and the inode is marked NFS_INODE_INVALID_DATA. However the cached data isn't flushed from the page cache. This is sufficient for future read() requests or mmap() requests as they call nfs_revalidate_mapping() which performs the flush if necessary. However an existing mapping is not affected. Accessing data through that mapping will continue to return old data even though the inode is marked NFS_INODE_INVALID_DATA. This can easily be confirmed using the 'nfs' tool in git://github.com/okirch/twopence-nfs.git and running nfs coherence FILENAME on one client, and nfs coherence -r FILENAME on another client. It appears that prior to Linux 2.6.0 this worked correctly. However commit: http://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/?id=ca9268fe3ddd075714005adecd4afbd7f9ab87d0 removed the call to inode_invalidate_pages() from nfs_zap_caches(). I haven't tested this code, but inspection suggests that prior to this commit, file locking would invalidate all inode pages. This patch adds a call to nfs_revalidate_mapping() after a successful SETLK so that invalid data is flushed. With this patch the above test passes. To minimize impact (and possibly avoid a GETATTR call) this only happens if the mapping might be mapped into userspace. Cc: Olaf Kirch <okir@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-06NFS: don't expect errors from mempool_alloc().NeilBrown
Commit fbe77c30e9ab ("NFS: move rw_mode to nfs_pageio_header") reintroduced some pointless code that commit 518662e0fcb9 ("NFS: fix usage of mempools.") had recently removed. Remove it again. Cc: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-24sunrpc: Const-ify struct sv_serv_opsChuck Lever
Close an attack vector by moving the arrays of per-server methods to read-only memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-23block: replace bi_bdev with a gendisk pointer and partitions indexChristoph Hellwig
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-20Merge branch 'bugfixes'Trond Myklebust
2017-08-20NFS: Fix NFSv2 security settingsChuck Lever
For a while now any NFSv2 mount where sec= is specified uses AUTH_NULL. If sec= is not specified, the mount uses AUTH_UNIX. Commit e68fd7c8071d ("mount: use sec= that was specified on the command line") attempted to address a very similar problem with NFSv3, and should have fixed this too, but it has a bug. The MNTv1 MNT procedure does not return a list of security flavors, so our client makes up a list containing just AUTH_NULL. This should enable nfs_verify_authflavors() to assign the sec= specified flavor, but instead, it incorrectly sets it to AUTH_NULL. I expect this would also be a problem for any NFSv3 server whose MNTv3 MNT procedure returned a security flavor list containing only AUTH_NULL. Fixes: e68fd7c8071d ("mount: use sec= that was specified on ... ") BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=310 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys'NeilBrown
An NFSv4.1 client might close a file after the user who opened it has logged off. In this case the user's credentials may no longer be valid, if they are e.g. kerberos credentials that have expired. NFSv4.1 has a mechanism to allow the client to use machine credentials to close a file. However due to a short-coming in the RFC, a CLOSE with those credentials may not be possible if the file in question isn't exported to the same security flavor - the required PUTFH must be rejected when this is the case. Specifically if a server and client support kerberos in general and have used it to form a machine credential, but the file is only exported to "sec=sys", a PUTFH with the machine credentials will fail, so CLOSE is not possible. As RPC_AUTH_UNIX (used by sec=sys) credentials can never expire, there is no value in using the machine credential in place of them. So in that case, just use the users credentials for CLOSE etc, as you would in NFSv4.0 Signed-off-by: Neil Brown <neilb@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20NFS: Remove unused parameter gfp_flags from nfs_pageio_init()Trond Myklebust
Now that the mirror allocation has been moved, the parameter can go. Also remove the redundant symbol export. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-20NFSv4: Fix up mirror allocationTrond Myklebust
There are a number of callers of nfs_pageio_complete() that want to continue using the nfs_pageio_descriptor without needing to call nfs_pageio_init() again. Examples include nfs_pageio_resend() and nfs_pageio_cond_complete(). The problem is that nfs_pageio_complete() also calls nfs_pageio_cleanup_mirroring(), which frees up the array of mirrors. This can lead to writeback errors, in the next call to nfs_pageio_setup_mirroring(). Fix by simply moving the allocation of the mirrors to nfs_pageio_setup_mirroring(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=196709 Reported-by: JianhongYin <yin-jianhong@163.com> Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-18Merge branch 'writeback'Trond Myklebust
2017-08-15NFS: Wait for requests that are locked on the commit listTrond Myklebust
If a request is on the commit list, but is locked, we will currently skip it, which can lead to livelocking when the commit count doesn't reduce to zero. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFSv4/pnfs: Replace pnfs_put_lseg_locked() with pnfs_put_lseg()Trond Myklebust
Now that we no longer hold the inode->i_lock when manipulating the commit lists, it is safe to call pnfs_put_lseg() again. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Switch to using mapping->private_lock for page writeback lookups.Trond Myklebust
Switch from using the inode->i_lock for this to avoid contention with other metadata manipulation. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Use an atomic_long_t to count the number of commitsTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Use an atomic_long_t to count the number of requestsTrond Myklebust
Rather than forcing us to take the inode->i_lock just in order to bump the number. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFSv4: Use a mutex to protect the per-inode commit listsTrond Myklebust
The commit lists can get very large, so using the inode->i_lock can end up affecting general metadata performance. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Refactor nfs_page_find_head_request()Trond Myklebust
Split out the 2 cases so that we can treat the locking differently. The issue is that the locking in the pageswapcache cache is highly linked to the commit list locking. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFSv4: Convert nfs_lock_and_join_requests() to use nfs_page_find_head_request()Trond Myklebust
Hide the locking from nfs_lock_and_join_requests() so that we can separate out the requirements for swapcache pages. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix up nfs_page_group_covers_page()Trond Myklebust
Fix up the test in nfs_page_group_covers_page(). The simplest implementation is to check that we have a set of intersecting or contiguous subrequests that connect page offset 0 to nfs_page_length(req->wb_page). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove unused parameter from nfs_page_group_lock()Trond Myklebust
nfs_page_group_lock() is now always called with the 'nonblock' parameter set to 'false'. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove unuse function nfs_page_group_lock_wait()Trond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove nfs_page_group_clear_bits()Trond Myklebust
At this point, we only expect ever to potentially see PG_REMOVE and PG_TEARDOWN being set on the subrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race casesTrond Myklebust
Since nfs_page_group_destroy() does not take any locks on the requests to be freed, we need to ensure that we don't inadvertently free the request in nfs_destroy_unlinked_subrequests() while the last reference is being released elsewhere. Do this by: 1) Taking a reference to the request unless it is already being freed 2) Checking (under the page group lock) if PG_TEARDOWN is already set before freeing an unreferenced request in nfs_destroy_unlinked_subrequests() Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Further optimise nfs_lock_and_join_requests()Trond Myklebust
When locking the entire group in order to remove subrequests, the locks are always taken in order, and with the page group lock being taken after the page head is locked. The intention is that: 1) The lock on the group head guarantees that requests may not be removed from the group (although new entries could be appended if we're not holding the group lock). 2) It is safe to drop and retake the page group lock while iterating through the list, in particular when waiting for a subrequest lock. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Reduce inode->i_lock contention in nfs_lock_and_join_requests()Trond Myklebust
We should no longer need the inode->i_lock, now that we've straightened out the request locking. The locking schema is now: 1) Lock page head request 2) Lock the page group 3) Lock the subrequests one by one Note that there is a subtle race with nfs_inode_remove_request() due to the fact that the latter does not lock the page head, when removing it from the struct page. Only the last subrequest is locked, hence we need to re-check that the PagePrivate(page) is still set after we've locked all the subrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Remove page group limit in nfs_flush_incompatible()Trond Myklebust
nfs_try_to_update_request() should be able to cope now. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Teach nfs_try_to_update_request() to deal with request page_groupsTrond Myklebust
Simplify the code, and avoid some flushes to disk. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix the inode request accounting when pages have subrequestsTrond Myklebust
Both nfs_destroy_unlinked_subrequests() and nfs_lock_and_join_requests() manipulate the inode flags adjusting the NFS_I(inode)->nrequests. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Don't unlock writebacks before declaring PG_WB_ENDTrond Myklebust
We don't want nfs_lock_and_join_requests() to start fiddling with the request before the call to nfs_page_group_sync_on_bit(). Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Don't check request offset and size without holding a lockTrond Myklebust
Request offsets and sizes are not guaranteed to be stable unless you are holding the request locked. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix an ABBA issue in nfs_lock_and_join_requests()Trond Myklebust
All other callers of nfs_page_group_lock() appear to already hold the page lock on the head page, so doing it in the opposite order here is inefficient, although not deadlock prone since we roll back all locks on contention. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Fix a reference and lock leak in nfs_lock_and_join_requests()Trond Myklebust
Yes, this is a situation that should never happen (hence the WARN_ON) but we should still ensure that we free up the locks and references to the faulty pages. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Ensure we always dereference the page head lastTrond Myklebust
This fixes a race with nfs_page_group_sync_on_bit() whereby the call to wake_up_bit() in nfs_page_group_unlock() could occur after the page header had been freed. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Reduce lock contention in nfs_try_to_update_request()Trond Myklebust
Micro-optimisation to move the lockless check into the for(;;) loop. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Reduce lock contention in nfs_page_find_head_request()Trond Myklebust
Add a lockless check for whether or not the page might be carrying an existing writeback before we grab the inode->i_lock. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15NFS: Simplify page writebackTrond Myklebust
We don't expect the page header lock to ever be held across I/O, so it should always be safe to wait for it, even if we're doing nonblocking writebacks. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-15Merge branch 'open_state'Trond Myklebust
2017-08-13NFSv4: Use the nfs4_state being recovered in _nfs4_opendata_to_nfs4_state()Trond Myklebust
If we're recovering a nfs4_state, then we should try to use that instead of looking up a new stateid. Only do that if the inodes match, though. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-13NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state()Trond Myklebust
When doing open by filehandle we don't really want to lookup a new inode, but rather update the one we've got. Add a helper which does this for us. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-08-11Merge tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "A few more NFS client bugfixes from me for rc5. Dros has a stable fix for flexfiles to prevent leaking the nfs4_ff_ds_version arrays when freeing a layout, Trond fixed a potential recovery loop situation with the TEST_STATEID operation, and Christoph fixed up the pNFS blocklayout Kconfig options to prevent unsafe use with kernels that don't have large block device support. Summary: Stable fix: - fix leaking nfs4_ff_ds_version array Other fixes: - improve TEST_STATEID OLD_STATEID handling to prevent recovery loop - require 64-bit sector_t for pNFS blocklayout to prevent 32-bit compile errors" * tag 'nfs-for-4.13-5' of git://git.linux-nfs.org/projects/anna/linux-nfs: pnfs/blocklayout: require 64-bit sector_t NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid() nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays
2017-08-11pnfs/blocklayout: require 64-bit sector_tChristoph Hellwig
The blocklayout code does not compile cleanly for a 32-bit sector_t, and also has no reliable checks for devices sizes, which makes it unsafe to use with a kernel that doesn't support large block devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Arnd Bergmann <arnd@arndb.de> Fixes: 5c83746a0cf2 ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-09NFSv4: Ignore NFS4ERR_OLD_STATEID in nfs41_check_open_stateid()Trond Myklebust
If the call to TEST_STATEID returns NFS4ERR_OLD_STATEID, then it just means we raced with other calls to OPEN. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-08nfs/flexfiles: fix leak of nfs4_ff_ds_version arraysWeston Andros Adamson
The client was freeing the nfs4_ff_layout_ds, but not the contained nfs4_ff_ds_version array. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>