summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2017-04-25can: peak: add support for PEAK PCAN-PCIe FD CAN-FD boardsStephane Grosjean
This patch adds the support of the PCAN-PCI Express FD boards made by PEAK-System, for computers using the PCI Express slot. The PCAN-PCI Express FD has one or two CAN FD channels, depending on the model. A galvanic isolation of the CAN ports protects the electronics of the card and the respective computer against disturbances of up to 500 Volts. The PCAN-PCI Express FD can be operated with ambient temperatures in a range of -40 to +85 °C. Such boards run an extented version of the CAN-FD IP running into USB CAN-FD interfaces from PEAK-System, so this patch adds several new commands and their corresponding data types to the PEAK CAN-FD common definitions header file too. Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-25can: peak: move header file to new can common subdirStephane Grosjean
The CAN-FD IP from PEAK-System runs into several kinds of PC CAN-FD interfaces. Up to now, only the USB CAN-FD adapters were supported by the Kernel. In order to prepare the adding of some new non-USB CAN-FD interfaces, this patch moves - and rename - the IP definitions file from its private (usb) sub-directory into a - newly created - CAN specific one. Signed-off-by: Stephane Grosjean <s.grosjean@peak-system.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2017-04-24bpf: make bpf_xdp_adjust_head support mandatoryDaniel Borkmann
Now that also the last in-tree user of the xdp_adjust_head bit has been removed, we can remove the flag from struct bpf_prog altogether. This, at the same time, also makes sure that any future driver for XDP comes with bpf_xdp_adjust_head() support right away. A rejection based on this flag would also mean that tail calls couldn't be used with such driver as per c2002f983767 ("bpf: fix checking xdp_adjust_head on tail calls") fix, thus lets not allow for it in the first place. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24f2fs: add parentheses for macro variables moreJaegeuk Kim
This patch adds parentheses for macro variables more in include/linux/f2fs_fs.h. Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-04-24mmc: core: Export API to allow hosts to get the card addressUlf Hansson
Some hosts controllers, like Cavium, needs to know whether the card operates in byte- or block-address mode. Therefore export a new API, mmc_card_is_blockaddr(), which provides this information. Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Steven J. Hill <Steven.Hill@cavium.com> Acked-by: David Daney <david.daney@cavium.com>
2017-04-24mmc: mmc_test: Disable Command Queue while mmc_test is usedAdrian Hunter
Normal read and write commands may not be used while the command queue is enabled. Disable the Command Queue when mmc_test is probed and re-enable it when it is removed. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Harjani Ritesh <riteshh@codeaurora.org> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-04-24mmc: mmc: Add functions to enable / disable the Command QueueAdrian Hunter
Add helper functions to enable or disable the Command Queue. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-04-24mmc: queue: Share mmc request array between partitionsAdrian Hunter
eMMC can have multiple internal partitions that are represented as separate disks / queues. However switching between partitions is only done when the queue is empty. Consequently the array of mmc requests that are queued can be shared between partitions saving memory. Keep a pointer to the mmc request queue on the card, and use that instead of allocating a new one for each partition. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-04-24mmc: core: add mmc_get_dma_dirHeiner Kallweit
Add function for determining DMA direction to core. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Shawn Lin <shawn.lin@rock-chips.com> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2017-04-24PCI: Implement devm_pci_remap_cfgspace()Lorenzo Pieralisi
The introduction of the pci_remap_cfgspace() interface allows PCI host controller drivers to map PCI config space through a dedicated kernel interface. Current PCI host controller drivers use the devm_ioremap_*() devres interfaces to map PCI configuration space regions so in order to update them to the new pci_remap_cfgspace() mapping interface a new set of devres interfaces should be implemented so that PCI host controller drivers can make use of them. Introduce two new functions in the PCI kernel layer and Devres documentation: - devm_pci_remap_cfgspace() - devm_pci_remap_cfg_resource() so that PCI host controller drivers can make use of them to map PCI configuration space regions. Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: Jonathan Corbet <corbet@lwn.net>
2017-04-24flow_dissector: add mpls support (v2)Benjamin LaHaise
Add support for parsing MPLS flows to the flow dissector in preparation for adding MPLS match support to cls_flower. Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Simon Horman <simon.horman@netronome.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Eric Dumazet <jhs@mojatatu.com> Cc: Hadar Hen Zion <hadarh@mellanox.com> Cc: Gao Feng <fgao@ikuai8.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24net/tcp_fastopen: Disable active side TFO in certain scenariosWei Wang
Middlebox firewall issues can potentially cause server's data being blackholed after a successful 3WHS using TFO. Following are the related reports from Apple: https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf Slide 31 identifies an issue where the client ACK to the server's data sent during a TFO'd handshake is dropped. C ---> syn-data ---> S C <--- syn/ack ----- S C (accept & write) C <---- data ------- S C ----- ACK -> X S [retry and timeout] https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf Slide 5 shows a similar situation that the server's data gets dropped after 3WHS. C ---- syn-data ---> S C <--- syn/ack ----- S C ---- ack --------> S S (accept & write) C? X <- data ------ S [retry and timeout] This is the worst failure b/c the client can not detect such behavior to mitigate the situation (such as disabling TFO). Failing to proceed, the application (e.g., SSL library) may simply timeout and retry with TFO again, and the process repeats indefinitely. The proposed solution is to disable active TFO globally under the following circumstances: 1. client side TFO socket detects out of order FIN 2. client side TFO socket receives out of order RST We disable active side TFO globally for 1hr at first. Then if it happens again, we disable it for 2h, then 4h, 8h, ... And we reset the timeout to 1hr if a client side TFO sockets not opened on loopback has successfully received data segs from server. And we examine this condition during close(). The rational behind it is that when such firewall issue happens, application running on the client should eventually close the socket as it is not able to get the data it is expecting. Or application running on the server should close the socket as it is not able to receive any response from client. In both cases, out of order FIN or RST will get received on the client given that the firewall will not block them as no data are in those frames. And we want to disable active TFO globally as it helps if the middle box is very close to the client and most of the connections are likely to fail. Also, add a debug sysctl: tcp_fastopen_blackhole_detect_timeout_sec: the initial timeout to use when firewall blackhole issue happens. This can be set and read. When setting it to 0, it means to disable the active disable logic. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24mdio_bus: Issue GPIO RESET to PHYs.Roger Quadros
Some boards [1] leave the PHYs at an invalid state during system power-up or reset thus causing unreliability issues with the PHY which manifests as PHY not being detected or link not functional. To fix this, these PHYs need to be RESET via a GPIO connected to the PHY's RESET pin. Some boards have a single GPIO controlling the PHY RESET pin of all PHYs on the bus whereas some others have separate GPIOs controlling individual PHY RESETs. In both cases, the RESET de-assertion cannot be done in the PHY driver as the PHY will not probe till its reset is de-asserted. So do the RESET de-assertion in the MDIO bus driver. [1] - am572x-idk, am571x-idk, a437x-idk Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24VSOCK: Add virtio vsock vsockmon hooksGerard Garcia
The virtio drivers deal with struct virtio_vsock_pkt. Add virtio_transport_deliver_tap_pkt(pkt) for handing packets to the vsockmon device. We call virtio_transport_deliver_tap_pkt(pkt) from net/vmw_vsock/virtio_transport.c and drivers/vhost/vsock.c instead of common code. This is because the drivers may drop packets before handing them to common code - we still want to capture them. Signed-off-by: Gerard Garcia <ggarcia@deic.uab.cat> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24qed: Add support for static dcbx.sudarsana.kalluru@cavium.com
The patch adds driver support for static/local dcbx mode. In this mode adapter brings up the dcbx link with locally configured parameters instead of performing the dcbx negotiation with the peer. The feature is useful when peer device/switch doesn't support dcbx. Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24dm: mark targets that pass integrity dataMikulas Patocka
A dm-crypt on dm-integrity device incorrectly advertises an integrity profile on the DM crypt device. It can be seen in the files "/sys/block/dm-*/integrity/*" that both dm-integrity and dm-crypt target advertise the integrity profile. That is incorrect, only the dm-integrity target should advertise the integrity profile. A general problem in DM is that if we have a DM device that depends on another device with an integrity profile, the upper device will always advertise the integrity profile, even when the target driver doesn't support handling integrity data. Most targets don't support integrity data, so we provide a whitelist of targets that support it (linear, delay and striped). The targets that support passing integrity data to the lower device are marked with the flag DM_TARGET_PASSES_INTEGRITY. The DM core will now advertise integrity data on a DM device only if all the targets support the integrity data. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2017-04-25Merge branch 'topic/kprobes' into nextMichael Ellerman
Although most of these kprobes patches are powerpc specific, there's a couple that touch generic code (with Acks). At the moment there's one conflict with acme's tree, but it's not too bad. Still just in case some other conflicts show up, we've put these in a topic branch so another tree could merge some or all of it if necessary.
2017-04-23module: Unify the return value type of try_module_getGao Feng
The prototypes of try_module_get are different with different macro. When enable module and module unload, it returns bool, but others not. Make the return type for try_module_get consistent across all module config options. Signed-off-by: Gao Feng <fgao@ikuai8.com> [jeyu: slightly amended changelog to make it clearer] Signed-off-by: Jessica Yu <jeyu@redhat.com>
2017-04-23Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Documentation updates. - Miscellaneous fixes. - Parallelize SRCU callback handling (plus overlapping patches). Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-23kbuild: fix asm-offset generation to work with clangJeroen Hofstee
KBuild abuses the asm statement to write to a file and clang chokes about these invalid asm statements. Hack it even more by fooling this is actual valid asm code. [masahiro: Import Jeroen's work for U-Boot: http://patchwork.ozlabs.org/patch/375026/ Tweak sed script a little to avoid garbage '#' for GCC case, like #define NR_PAGEFLAGS 23 /* __NR_PAGEFLAGS # */ ] Signed-off-by: Jeroen Hofstee <jeroen@myspectrum.nl> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Matthias Kaehlcke <mka@chromium.org> Tested-by: Matthias Kaehlcke <mka@chromium.org>
2017-04-21signal: Make kill_proc_info staticEric W. Biederman
There are no users outside of signal.c so make the function static so the compiler and other developers have that information. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Both conflict were simple overlapping changes. In the kaweth case, Eric Dumazet's skb_cow() bug fix overlapped the conversion of the driver in net-next to use in-netdev stats. Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21block: get rid of blk_integrity_revalidate()Ilya Dryomov
Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") introduced blk_integrity_revalidate(), which seems to assume ownership of the stable pages flag and unilaterally clears it if no blk_integrity profile is registered: if (bi->profile) disk->queue->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; else disk->queue->backing_dev_info->capabilities &= ~BDI_CAP_STABLE_WRITES; It's called from revalidate_disk() and rescan_partitions(), making it impossible to enable stable pages for drivers that support partitions and don't use blk_integrity: while the call in revalidate_disk() can be trivially worked around (see zram, which doesn't support partitions and hence gets away with zram_revalidate_disk()), rescan_partitions() can be triggered from userspace at any time. This breaks rbd, where the ceph messenger is responsible for generating/verifying CRCs. Since blk_integrity_{un,}register() "must" be used for (un)registering the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES setting there. This way drivers that call blk_integrity_register() and use integrity infrastructure won't interfere with drivers that don't but still want stable pages. Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk") Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.4+, needs backporting Tested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-21Merge tag 'nfc-next-4.12-1' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next Samuel Ortiz says: ==================== NFC 4.12 pull request This is the NFC pull request for 4.12. We have: - Improvements for the pn533 command queue handling and device registration order. - Removal of platform data for the pn544 and st21nfca drivers. - Additional device tree options to support more trf7970a hardware options. - Support for Sony's RC-S380P through the port100 driver. - Removal of the obsolte nfcwilink driver. - Headers inclusion cleanups (miscdevice.h, unaligned.h) for many drivers. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2017-04-20 This adds the basic infrastructure for IPsec hardware offloading, it creates a configuration API and adjusts the packet path. 1) Add the needed netdev features to configure IPsec offloads. 2) Add the IPsec hardware offloading API. 3) Prepare the ESP packet path for hardware offloading. 4) Add gso handlers for esp4 and esp6, this implements the software fallback for GSO packets. 5) Add xfrm replay handler functions for offloading. 6) Change ESP to use a synchronous crypto algorithm on offloading, we don't have the option for asynchronous returns when we handle IPsec at layer2. 7) Add a xfrm validate function to validate_xmit_skb. This implements the software fallback for non GSO packets. 8) Set the inner_network and inner_transport members of the SKB, as well as encapsulation, to reflect the actual positions of these headers, and removes them only once encryption is done on the payload. From Ilan Tayari. 9) Prepare the ESP GRO codepath for hardware offloading. 10) Fix incorrect null pointer check in esp6. From Colin Ian King. 11) Fix for the GSO software fallback path to detect the fallback correctly. From Ilan Tayari. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21make sure that mntns_install() doesn't end up with referral for rootAl Viro
new flag: LOOKUP_DOWN. If the starting point is overmounted, cross into whatever's mounted on top, triggering referrals et.al. Use that instead of follow_down_one() loop in mntns_install(), handle errors properly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21switch memcpy_from_msg() to copy_from_iter_full()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-04-21Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuningMatthew Whitehead
Constants used for tuning are generally a bad idea, especially as hardware changes over time. Replace the constant 2 jiffies with sysctl variable netdev_budget_usecs to enable sysadmins to tune the softirq processing. Also document the variable. For example, a very fast machine might tune this to 1000 microseconds, while my regression testing 486DX-25 needs it to be 4000 microseconds on a nearly idle network to prevent time_squeeze from being incremented. Version 2: changed jiffies to microseconds for predictable units. Signed-off-by: Matthew Whitehead <tedheadster@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-21spi: Add can_dma like interface for spi_flash_readVignesh R
Add an interface analogous to ->can_dma() for spi_flash_read() interface. This will enable SPI controller drivers to inform SPI core when not to do DMA mappings. Signed-off-by: Vignesh R <vigneshr@ti.com> Signed-off-by: Mark Brown <broonie@kernel.org>
2017-04-21IB/mlx5: Support congestion related countersParav Pandit
This patch adds support to query the congestion related hardware counters through new command and links them with other hw counters being available in hw_counters sysfs location. In order to reuse existing infrastructure it renames related q_counter data structures to more generic counters to reflect q_counters and congestion counters and maybe some other counters in the future. New hardware counters: * rp_cnp_handled - CNP packets handled by the reaction point * rp_cnp_ignored - CNP packets ignored by the reaction point * np_cnp_sent - CNP packets sent by notification point to respond to CE marked RoCE packets * np_ecn_marked_roce_packets - CE marked RoCE packets received by notification point It also avoids returning ENOSYS which is specific for invalid system call and produces the following checkpatch.pl warning. WARNING: ENOSYS means 'invalid syscall nr' and nothing else + return -ENOSYS; Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-04-21IB/mlx5: Use IP version matching to classify IP trafficAriel Levkovich
This change adds the ability for flow steering to classify IPv4/6 packets with MPLS tag (Ethertype 0x8847 and 0x8848) as standard IP packets and hit IPv4/6 classifed steering rules. When user added a flow rule with IP classification, driver was implicitly adding ethertype matching to the created rule in order to distinguish between IPv4 and IPv6 protocols. Since IP packets with MPLS tag header have MPLS ethertype, they missed the rule and ended up hitting the default filters. Such behavior prevented from MPLS packets to undergo inbound traffic load balancing flows (if such were defined by configuring RSS) to achieve higher throughput - the way that non-MPLS IP packets performed. Since our device is able to look past the MPLS tag and identify the next protocol we introduce this solution which replaces Ethertype matching by the device's capability to perform IP version parsing and matching in order to distinguish between IPv4 and IPv6. Therefore, whenever a flow with IP spec is added and device support IP version matching, driver will implicitly add IP version matching to the rule (Based on the IP spec type) without Ethertype matching which will cause relevant MPLS tagged packets to hit this rule as well. Otherwise (device doesn't support IP version matching), we fall back to setting Ethertype matching. If the user's filters specify an L2 ethertype and an IP spec the rule will then match both the ethertype and the IP version. The device's support for IP version matching is reported by the device via dedicated capability bit in query_device_cap and named outer/inner_ip_version. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-04-21IB/mlx4: Support RAW Ethernet when RoCE is disabledMajd Dibbiny
On some environments, such as certain SR-IOV VF configurations, RoCE isn't supported for mlx4 Ethernet ports. Currently the driver will not open IB device on that port. This is problematic since we do want user-space RAW Ethernet QPs functionality to remain in place. For that end, enhance the relevant driver flows such that we do create a device instance in that case. Signed-off-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-04-21video: fbdev: imxfb: support AUS modeMartin Kaiser
Some displays require setting AUS mode in the LDCD AUS Mode Control Register to work with the imxfb driver. Like the value of the Panel Configuration Register, the AUS mode setting depends on the display mode. Allow setting AUS mode from the device tree by adding a boolean property. Make this property optional to keep the DT ABI stable. AUS mode can be set only on imx21 and compatible chipsets. Signed-off-by: Martin Kaiser <martin@kaiser.cx> Cc: Sascha Hauer <kernel@pengutronix.de> Cc: Rob Herring <robh+dt@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
2017-04-21lockd: Introduce nlmclnt_operationsBenjamin Coddington
NFS would enjoy the ability to modify the behavior of the NLM client's unlock RPC task in order to delay the transmission of the unlock until IO that was submitted under that lock has completed. This ability can ensure that the NLM client will always complete the transmission of an unlock even if the waiting caller has been interrupted with fatal signal. For this purpose, a pointer to a struct nlmclnt_operations can be assigned in a nfs_module's nfs_rpc_ops that will install those nlmclnt_operations on the nlm_host. The struct nlmclnt_operations defines three callback operations that will be used in a following patch: nlmclnt_alloc_call - used to call back after a successful allocation of a struct nlm_rqst in nlmclnt_proc(). nlmclnt_unlock_prepare - used to call back during NLM unlock's rpc_call_prepare. The NLM client defers calling rpc_call_start() until this callback returns false. nlmclnt_release_call - used to call back when the NLM client's struct nlm_rqst is freed. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21NFS: Add an iocounter wait function for async RPC tasksBenjamin Coddington
By sleeping on a new NFS Unlock-On-Close waitqueue, rpc tasks may wait for a lock context's iocounter to reach zero. The rpc waitqueue is only woken when the open_context has the NFS_CONTEXT_UNLOCK flag set in order to mitigate spurious wake-ups for any iocounter reaching zero. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21locks: Set FL_CLOSE when removing flock locks on close()Benjamin Coddington
Set FL_CLOSE in fl_flags as in locks_remove_posix() when clearing locks. NFS will check for this flag to ensure an unlock is sent in a following patch. Fuse handles flock and posix locks differently for FL_CLOSE, and so requires a fixup to retain the existing behavior for flock. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Acked-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-04-21nvmet_fc: Rework target side abort handlingJames Smart
target transport: ---------------------- There are cases when there is a need to abort in-progress target operations (writedata) so that controller termination or errors can clean up. That can't happen currently as the abort is another target op type, so it can't be used till the running one finishes (and it may not). Solve by removing the abort op type and creating a separate downcall from the transport to the lldd to request an io to be aborted. The transport will abort ios on queue teardown or io errors. In general the transport tries to call the lldd abort only when the io state is idle. Meaning: ops that transmit data (readdata or rsp) will always finish their transmit (or the lldd will see a state on the link or initiator port that fails the transmit) and the done call for the operation will occur. The transport will wait for the op done upcall before calling the abort function, and as the io is idle, the io can be cleaned up immediately after the abort call; Similarly, ios that are not waiting for data or transmitting data must be in the nvmet layer being processed. The transport will wait for the nvmet layer completion before calling the abort function, and as the io is idle, the io can be cleaned up immediately after the abort call; As for ops that are waiting for data (writedata), they may be outstanding indefinitely if the lldd doesn't see a condition where the initiatior port or link is bad. In those cases, the transport will call the abort function and wait for the lldd's op done upcall for the operation, where it will then clean up the io. Additionally, if a lldd receives an ABTS and matches it to an outstanding request in the transport, A new new transport upcall was created to abort the outstanding request in the transport. The transport expects any outstanding op call (readdata or writedata) will completed by the lldd and the operation upcall made. The transport doesn't act on the reported abort (e.g. clean up the io) until an op done upcall occurs, a new op is attempted, or the nvmet layer completes the io processing. fcloop: ---------------------- Updated to support the new target apis. On fcp io aborts from the initiator, the loopback context is updated to NULL out the half that has completed. The initiator side is immediately called after the abort request with an io completion (abort status). On fcp io aborts from the target, the io is stopped and the initiator side sees it as an aborted io. Target side ops, perhaps in progress while the initiator side is done, continue but noop the data movement as there's no structure on the initiator side to reference. patch also contains: ---------------------- Revised lpfc to support the new abort api commonized rsp buffer syncing and nulling of private data based on calling paths. errors in op done calls don't take action on the fod. They're bad operations which implies the fod may be bad. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-04-21nvmet_fc: add req_release to lldd apiJames Smart
With the advent of the opdone calls changing context, the lldd can no longer assume that once the op->done call returns for RSP operations that the request struct is no longer being accessed. As such, revise the lldd api for a req_release callback that the transport will call when the job is complete. This will also be used with abort cases. Fixed text in api header for change in io complete semantics. Revised lpfc to support the new req_release api. Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-04-21nvmet_fc: add target feature flags for upcall isr contextsJames Smart
Two new feature flags were added to control whether upcalls to the transport result in context switches or stay in the calling context. NVMET_FCTGTFEAT_CMD_IN_ISR: By default, if the flag is not set, the transport assumes the lldd is in a non-isr context and in the cpu context it should be for the io queue. As such, the cmd handler is called directly in the calling context. If the flag is set, indicating the upcall is an isr context, the transport mandates a transition to a workqueue. The workqueue assigned to the queue is used for the context. NVMET_FCTGTFEAT_OPDONE_IN_ISR By default, if the flag is not set, the transport assumes the lldd is in a non-isr context and in the cpu context it should be for the io queue. As such, the fcp operation done callback is called directly in the calling context. If the flag is set, indicating the upcall is an isr context, the transport mandates a transition to a workqueue. The workqueue assigned to the queue is used for the context. Updated lpfc for flags Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-04-21nvme: improve performance for virtual NVMe devicesHelen Koike
This change provides a mechanism to reduce the number of MMIO doorbell writes for the NVMe driver. When running in a virtualized environment like QEMU, the cost of an MMIO is quite hefy here. The main idea for the patch is provide the device two memory location locations: 1) to store the doorbell values so they can be lookup without the doorbell MMIO write 2) to store an event index. I believe the doorbell value is obvious, the event index not so much. Similar to the virtio specification, the virtual device can tell the driver (guest OS) not to write MMIO unless you are writing past this value. FYI: doorbell values are written by the nvme driver (guest OS) and the event index is written by the virtual device (host OS). The patch implements a new admin command that will communicate where these two memory locations reside. If the command fails, the nvme driver will work as before without any optimizations. Contributions: Eric Northup <digitaleric@google.com> Frank Swiderski <fes@google.com> Ted Tso <tytso@mit.edu> Keith Busch <keith.busch@intel.com> Just to give an idea on the performance boost with the vendor extension: Running fio [1], a stock NVMe driver I get about 200K read IOPs with my vendor patch I get about 1000K read IOPs. This was running with a null device i.e. the backing device simply returned success on every read IO request. [1] Running on a 4 core machine: fio --time_based --name=benchmark --runtime=30 --filename=/dev/nvme0n1 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randread --blocksize=4k --randrepeat=false Signed-off-by: Rob Nelson <rlnelson@google.com> [mlin: port for upstream] Signed-off-by: Ming Lin <mlin@kernel.org> [koike: updated for upstream] Signed-off-by: Helen Koike <helen.koike@collabora.co.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
2017-04-21Merge branches 'doc.2017.04.12a', 'fixes.2017.04.19a' and 'srcu.2017.04.21a' ↵Paul E. McKenney
into HEAD doc.2017.04.12a: Documentation updates fixes.2017.04.19a: Miscellaneous fixes srcu.2017.04.21a: Parallelize SRCU callback handling
2017-04-21rcu: Make non-preemptive schedule be Tasks RCU quiescent statePaul E. McKenney
Currently, a call to schedule() acts as a Tasks RCU quiescent state only if a context switch actually takes place. However, just the call to schedule() guarantees that the calling task has moved off of whatever tracing trampoline that it might have been one previously. This commit therefore plumbs schedule()'s "preempt" parameter into rcu_note_context_switch(), which then records the Tasks RCU quiescent state, but only if this call to schedule() was -not- due to a preemption. To avoid adding overhead to the common-case context-switch path, this commit hides the rcu_note_context_switch() check under an existing non-common-case check. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2017-04-21srcu: Parallelize callback handlingPaul E. McKenney
Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1,2], however, there are workloads that could result in a high volume of concurrent invocations of call_srcu(), which with current SRCU would result in excessive lock contention on the srcu_struct structure's ->queue_lock, which protects SRCU's callback lists. This commit therefore moves SRCU to per-CPU callback lists, thus greatly reducing contention. Because a given SRCU instance no longer has a single centralized callback list, starting grace periods and invoking callbacks are both more complex than in the single-list Classic SRCU implementation. Starting grace periods and handling callbacks are now handled using an srcu_node tree that is in some ways similar to the rcu_node trees used by RCU-bh, RCU-preempt, and RCU-sched (for example, the srcu_node tree shape is controlled by exactly the same Kconfig options and boot parameters that control the shape of the rcu_node tree). In addition, the old per-CPU srcu_array structure is now named srcu_data and contains an rcu_segcblist structure named ->srcu_cblist for its callbacks (and a spinlock to protect this). The srcu_struct gets an srcu_gp_seq that is used to associate callback segments with the corresponding completion-time grace-period number. These completion-time grace-period numbers are propagated up the srcu_node tree so that the grace-period workqueue handler can determine whether additional grace periods are needed on the one hand and where to look for callbacks that are ready to be invoked. The srcu_barrier() function must now wait on all instances of the per-CPU ->srcu_cblist. Because each ->srcu_cblist is protected by ->lock, srcu_barrier() can remotely add the needed callbacks. In theory, it could also remotely start grace periods, but in practice doing so is complex and racy. And interestingly enough, it is never necessary for srcu_barrier() to start a grace period because srcu_barrier() only enqueues a callback when a callback is already present--and it turns out that a grace period has to have already been started for this pre-existing callback. Furthermore, it is only the callback that srcu_barrier() needs to wait on, not any particular grace period. Therefore, a new rcu_segcblist_entrain() function enqueues the srcu_barrier() function's callback into the same segment occupied by the last pre-existing callback in the list. The special case where all the pre-existing callbacks are on a different list (because they are in the process of being invoked) is handled by enqueuing srcu_barrier()'s callback into the RCU_DONE_TAIL segment, relying on the done-callbacks check that takes place after all callbacks are inovked. Note that the readers use the same algorithm as before. Note that there is a separate srcu_idx that tells the readers what counter to increment. This unfortunately cannot be combined with srcu_gp_seq because they need to be incremented at different times. This commit introduces some ugly #ifdefs in rcutorture. These will go away when I feel good enough about Tree SRCU to ditch Classic SRCU. Some crude performance comparisons, courtesy of a quickly hacked rcuperf asynchronous-grace-period capability: Callback Queuing Overhead ------------------------- # CPUS Classic SRCU Tree SRCU ------ ------------ --------- 2 0.349 us 0.342 us 16 31.66 us 0.4 us 41 --------- 0.417 us The times are the 90th percentiles, a statistic that was chosen to reject the overheads of the occasional srcu_barrier() call needed to avoid OOMing the test machine. The rcuperf test hangs when running Classic SRCU at 41 CPUs, hence the line of dashes. Despite the hacks to both the rcuperf code and that statistics, this is a convincing demonstration of Tree SRCU's performance and scalability advantages. [1] https://lwn.net/Articles/309030/ [2] https://patchwork.kernel.org/patch/5108281/ Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [ paulmck: Fix initialization if synchronize_srcu_expedited() called first. ]
2017-04-21kvm: Move srcu_struct fields to end of struct kvmPaul E. McKenney
Parallelizing SRCU callback handling will increase the size of srcu_struct, which will move the kvm structure's kvm_arch field out of reach of powerpc's current assembly code, which will result in the following sort of build error: arch/powerpc/kvm/book3s_hv_rmhandlers.S:617: Error: operand out of range (0x000000000000b328 is not between 0xffffffffffff8000 and 0x0000000000007fff) This commit moves the srcu_struct fields in the kvm structure to follow the kvm_arch field, which will allow powerpc's assembly code to continue to be able to reach the kvm_arch field. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Michael Ellerman <michaele@au1.ibm.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Paolo Bonzini <pbonzini@redhat.com> [ paulmck: Moved this commit to precede SRCU callback parallelization, and reworded the commit log into future tense, all in the name of bisectability. ]
2017-04-21linux/kernel.h: Add ALIGN_DOWN macroKrzysztof Kozlowski
Few parts of kernel define their own macro for aligning down so provide a common define for this, with the same usage and assumptions as existing ALIGN. Convert also three existing implementations to this one. Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2017-04-21Merge branch 'x86/process' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD Required for KVM support of the CPUID faulting feature. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-21Merge tag 'drm-misc-next-fixes-2017-04-20' of ↵Dave Airlie
git://anongit.freedesktop.org/git/drm-misc into drm-next drm-misc-next-fixes-2017-04-20 Core changes: - Maintain sti via drm-misc (Vincent) - Rename dma_buf_ops->kmap_* to avoid naming collision (Logan) Driver changes: - Fix UHD displays on stih407 (Vincent) - Fix uninitialized var return in atmel-hlcdc (Dan) * tag 'drm-misc-next-fixes-2017-04-20' of git://anongit.freedesktop.org/git/drm-misc: dma-buf: Rename dma-ops to prevent conflict with kunmap_atomic macro drm: atmel-hlcdc: Uninitialized return in atmel_hlcdc_create_outputs() drm/sti: fix GDP size to support up to UHD resolution MAINTAINERS: add drm/sti driver into drm-misc
2017-04-20ftrace: Have the function probes call their own functionSteven Rostedt (VMware)
Now that the function probes have their own ftrace_ops, there's no reason to continue using the ftrace_func_hash to find which probe to call in the function callback. The ops that is passed in to the function callback is part of the probe_ops to call. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20ftrace: Move the function commands into the tracing directorySteven Rostedt (VMware)
As nothing outside the tracing directory uses the function command mechanism, I'm moving the prototypes out of the include/linux/ftrace.h and into the local kernel/trace/trace.h header. I plan on making them hook to the trace_array structure which is local to kernel/trace, and I do not want to expose it to the rest of the kernel. This requires that the command functions must also be local to tracing. But luckily nothing else uses them. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-20blk-mq: Fix poll_stat for new size-based bucketing.Stephen Bates
Fixes an issue where the size of the poll_stat array in request_queue does not match the size expected by the new size based bucketing for IO completion polling. Fixes: 720b8ccc4500 ("blk-mq: Add a polling specific stats function") Signed-off-by: Stephen Bates <sbates@raithlin.com> Signed-off-by: Jens Axboe <axboe@fb.com>