summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2022-07-23cgroup: Make !percpu threadgroup_rwsem operations optionalTejun Heo
3942a9bd7b58 ("locking, rcu, cgroup: Avoid synchronize_sched() in __cgroup_procs_write()") disabled percpu operations on threadgroup_rwsem because the impiled synchronize_rcu() on write locking was pushing up the latencies too much for android which constantly moves processes between cgroups. This makes the hotter paths - fork and exit - slower as they're always forced into the slow path. There is no reason to force this on everyone especially given that more common static usage pattern can now completely avoid write-locking the rwsem. Write-locking is elided when turning on and off controllers on empty sub-trees and CLONE_INTO_CGROUP enables seeding a cgroup without grabbing the rwsem. Restore the default percpu operations and introduce the mount option "favordynmods" and config option CGROUP_FAVOR_DYNMODS for users who need lower latencies for the dynamic operations. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Michal Koutn� <mkoutny@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Oleg Nesterov <oleg@redhat.com>
2022-07-23vfio: Make vfio_unpin_pages() return voidNicolin Chen
There's only one caller that checks its return value with a WARN_ON_ONCE, while all other callers don't check the return value at all. Above that, an undo function should not fail. So, simplify the API to return void by embedding similar WARN_ONs. Also for users to pinpoint which condition fails, separate WARN_ON lines, yet remove the "driver->ops->unpin_pages" check, since it's unreasonable for callers to unpin on something totally random that wasn't even pinned. And remove NULL pointer checks for they would trigger oops vs. warnings. Note that npage is already validated in the vfio core, thus drop the same check in the type1 code. Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Link: https://lore.kernel.org/r/20220723020256.30081-2-nicolinc@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-07-22Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== bpf-next 2022-07-22 We've added 73 non-merge commits during the last 12 day(s) which contain a total of 88 files changed, 3458 insertions(+), 860 deletions(-). The main changes are: 1) Implement BPF trampoline for arm64 JIT, from Xu Kuohai. 2) Add ksyscall/kretsyscall section support to libbpf to simplify tracing kernel syscalls through kprobe mechanism, from Andrii Nakryiko. 3) Allow for livepatch (KLP) and BPF trampolines to attach to the same kernel function, from Song Liu & Jiri Olsa. 4) Add new kfunc infrastructure for netfilter's CT e.g. to insert and change entries, from Kumar Kartikeya Dwivedi & Lorenzo Bianconi. 5) Add a ksym BPF iterator to allow for more flexible and efficient interactions with kernel symbols, from Alan Maguire. 6) Bug fixes in libbpf e.g. for uprobe binary path resolution, from Dan Carpenter. 7) Fix BPF subprog function names in stack traces, from Alexei Starovoitov. 8) libbpf support for writing custom perf event readers, from Jon Doron. 9) Switch to use SPDX tag for BPF helper man page, from Alejandro Colomar. 10) Fix xsk send-only sockets when in busy poll mode, from Maciej Fijalkowski. 11) Reparent BPF maps and their charging on memcg offlining, from Roman Gushchin. 12) Multiple follow-up fixes around BPF lsm cgroup infra, from Stanislav Fomichev. 13) Use bootstrap version of bpftool where possible to speed up builds, from Pu Lehui. 14) Cleanup BPF verifier's check_func_arg() handling, from Joanne Koong. 15) Make non-prealloced BPF map allocations low priority to play better with memcg limits, from Yafang Shao. 16) Fix BPF test runner to reject zero-length data for skbs, from Zhengchao Shao. 17) Various smaller cleanups and improvements all over the place. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (73 commits) bpf: Simplify bpf_prog_pack_[size|mask] bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch) bpf, x64: Allow to use caller address from stack ftrace: Allow IPMODIFY and DIRECT ops on the same function ftrace: Add modify_ftrace_direct_multi_nolock bpf/selftests: Fix couldn't retrieve pinned program in xdp veth test bpf: Fix build error in case of !CONFIG_DEBUG_INFO_BTF selftests/bpf: Fix test_verifier failed test in unprivileged mode selftests/bpf: Add negative tests for new nf_conntrack kfuncs selftests/bpf: Add tests for new nf_conntrack kfuncs selftests/bpf: Add verifier tests for trusted kfunc args net: netfilter: Add kfuncs to set and change CT status net: netfilter: Add kfuncs to set and change CT timeout net: netfilter: Add kfuncs to allocate and insert CT net: netfilter: Deduplicate code in bpf_{xdp,skb}_ct_lookup bpf: Add documentation for kfuncs bpf: Add support for forcing kfunc args to be trusted bpf: Switch to new kfunc flags infrastructure tools/resolve_btfids: Add support for 8-byte BTF sets bpf: Introduce 8-byte BTF set ... ==================== Link: https://lore.kernel.org/r/20220722221218.29943-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-22PCI: Move isa_dma_bridge_buggy out of asm/dma.hStafford Horne
The isa_dma_bridge_buggy symbol is only used for x86_32, and only x86_32 platforms or quirks ever set it. Add a new linux/isa-dma.h header that #defines isa_dma_bridge_buggy to 0 except on x86_32, where we keep it as a variable, and remove all the arch- specific definitions. [bhelgaas: commit log] Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/r/20220722214944.831438-3-shorne@gmail.com Signed-off-by: Stafford Horne <shorne@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
2022-07-22PCI: Extend isolated function probing to s390Niklas Schnelle
Like the jailhouse hypervisor, s390's PCI architecture allows passing isolated PCI functions to a guest OS instance. As of now this is was not utilized even with multi-function support as the s390 PCI code makes sure that only virtual PCI busses including a function with devfn 0 are presented to the PCI subsystem. A subsequent change will remove this restriction. Allow probing such functions by replacing the existing check for jailhouse_paravirt() with a new hypervisor_isolated_pci_functions() helper. Link: https://lore.kernel.org/r/20220628143100.3228092-5-schnelle@linux.ibm.com Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com> Cc: Jan Kiszka <jan.kiszka@siemens.com>
2022-07-22bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)Song Liu
When tracing a function with IPMODIFY ftrace_ops (livepatch), the bpf trampoline must follow the instruction pointer saved on stack. This needs extra handling for bpf trampolines with BPF_TRAMP_F_CALL_ORIG flag. Implement bpf_tramp_ftrace_ops_func and use it for the ftrace_ops used by BPF trampoline. This enables tracing functions with livepatch. This also requires moving bpf trampoline to *_ftrace_direct_mult APIs. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/all/20220602193706.2607681-2-song@kernel.org/ Link: https://lore.kernel.org/bpf/20220720002126.803253-5-song@kernel.org
2022-07-22bpf, x64: Allow to use caller address from stackJiri Olsa
Currently we call the original function by using the absolute address given at the JIT generation. That's not usable when having trampoline attached to multiple functions, or the target address changes dynamically (in case of live patch). In such cases we need to take the return address from the stack. Adding support to retrieve the original function address from the stack by adding new BPF_TRAMP_F_ORIG_STACK flag for arch_prepare_bpf_trampoline function. Basically we take the return address of the 'fentry' call: function + 0: call fentry # stores 'function + 5' address on stack function + 5: ... The 'function + 5' address will be used as the address for the original function to call. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220720002126.803253-4-song@kernel.org
2022-07-22ftrace: Allow IPMODIFY and DIRECT ops on the same functionSong Liu
IPMODIFY (livepatch) and DIRECT (bpf trampoline) ops are both important users of ftrace. It is necessary to allow them work on the same function at the same time. First, DIRECT ops no longer specify IPMODIFY flag. Instead, DIRECT flag is handled together with IPMODIFY flag in __ftrace_hash_update_ipmodify(). Then, a callback function, ops_func, is added to ftrace_ops. This is used by ftrace core code to understand whether the DIRECT ops can share with an IPMODIFY ops. To share with IPMODIFY ops, the DIRECT ops need to implement the callback function and adjust the direct trampoline accordingly. If DIRECT ops is attached before the IPMODIFY ops, ftrace core code calls ENABLE_SHARE_IPMODIFY_PEER on the DIRECT ops before registering the IPMODIFY ops. If IPMODIFY ops is attached before the DIRECT ops, ftrace core code calls ENABLE_SHARE_IPMODIFY_SELF in __ftrace_hash_update_ipmodify. Owner of the DIRECT ops may return 0 if the DIRECT trampoline can share with IPMODIFY, so error code otherwise. The error code is propagated to register_ftrace_direct_multi so that onwer of the DIRECT trampoline can handle it properly. For more details, please refer to comment before enum ftrace_ops_cmd. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/all/20220602193706.2607681-2-song@kernel.org/ Link: https://lore.kernel.org/all/20220718055449.3960512-1-song@kernel.org/ Link: https://lore.kernel.org/bpf/20220720002126.803253-3-song@kernel.org
2022-07-22ftrace: Add modify_ftrace_direct_multi_nolockSong Liu
This is similar to modify_ftrace_direct_multi, but does not acquire direct_mutex. This is useful when direct_mutex is already locked by the user. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/bpf/20220720002126.803253-2-song@kernel.org
2022-07-22iomap: remove iomap_writepageiomap-6.0-merge-2iomap-5.20-merge-2Christoph Hellwig
Unused now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-07-22bpf: Fix build error in case of !CONFIG_DEBUG_INFO_BTFKumar Kartikeya Dwivedi
BTF_ID_FLAGS macro needs to be able to take 0 or 1 args, so make it a variable argument. BTF_SET8_END is incorrect, it should just be empty. Reported-by: kernel test robot <lkp@intel.com> Fixes: ab21d6063c01 ("bpf: Introduce 8-byte BTF set") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20220722113605.6513-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-22firmware: cs_dsp: Add memory chunk helpersCharles Keepax
Add helpers that can be layered on top of a buffer read from or to be written to the DSP to faciliate accessing datastructures within the DSP memory. These functions handle adding the padding bytes for the DSP, converting to big endian, and packing arbitrary length data. Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://lore.kernel.org/r/20220722094851.92521-2-ckeepax@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2022-07-22firmware: cs_dsp: Add pre_stop callbackCharles Keepax
The code already has a post_stop callback, add a matching pre_stop callback to the client_ops that is called before execution is stopped. This callback provides a convenient place for the client code to communicate with the DSP before it is stopped. Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://lore.kernel.org/r/20220722094851.92521-1-ckeepax@opensource.cirrus.com Signed-off-by: Mark Brown <broonie@kernel.org>
2022-07-22wifi: ieee80211: add helper functions for detecting TM/FTM framesAvraham Stern
Add helper functions for detection timing measurement and fine timing measurement frames. Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2022-07-22net: add missing includes and forward declarations under net/Jakub Kicinski
This patch adds missing includes to headers under include/net. All these problems are currently masked by the existing users including the missing dependency before the broken header. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-07-22Merge tag 'kvm-s390-next-5.20-1' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD KVM: s390x: Fixes and features for 5.20 * First part of deferred teardown * CPU Topology * interpretive execution for PCI instructions * PV attestation * Minor fixes
2022-07-21bpf: Add support for forcing kfunc args to be trustedKumar Kartikeya Dwivedi
Teach the verifier to detect a new KF_TRUSTED_ARGS kfunc flag, which means each pointer argument must be trusted, which we define as a pointer that is referenced (has non-zero ref_obj_id) and also needs to have its offset unchanged, similar to how release functions expect their argument. This allows a kfunc to receive pointer arguments unchanged from the result of the acquire kfunc. This is required to ensure that kfunc that operate on some object only work on acquired pointers and not normal PTR_TO_BTF_ID with same type which can be obtained by pointer walking. The restrictions applied to release arguments also apply to trusted arguments. This implies that strict type matching (not deducing type by recursively following members at offset) and OBJ_RELEASE offset checks (ensuring they are zero) are used for trusted pointer arguments. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220721134245.2450-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-21bpf: Switch to new kfunc flags infrastructureKumar Kartikeya Dwivedi
Instead of populating multiple sets to indicate some attribute and then researching the same BTF ID in them, prepare a single unified BTF set which indicates whether a kfunc is allowed to be called, and also its attributes if any at the same time. Now, only one call is needed to perform the lookup for both kfunc availability and its attributes. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220721134245.2450-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-21bpf: Introduce 8-byte BTF setKumar Kartikeya Dwivedi
Introduce support for defining flags for kfuncs using a new set of macros, BTF_SET8_START/BTF_SET8_END, which define a set which contains 8 byte elements (each of which consists of a pair of BTF ID and flags), using a new BTF_ID_FLAGS macro. This will be used to tag kfuncs registered for a certain program type as acquire, release, sleepable, ret_null, etc. without having to create more and more sets which was proving to be an unscalable solution. Now, when looking up whether a kfunc is allowed for a certain program, we can also obtain its kfunc flags in the same call and avoid further lookups. The resolve_btfids change is split into a separate patch. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220721134245.2450-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-07-21Merge branch 'ctxt.2022.07.05a' into HEADPaul E. McKenney
ctxt.2022.07.05a: Linux-kernel memory model development branch.
2022-07-21Merge branches 'doc.2022.06.21a', 'fixes.2022.07.19a', 'nocb.2022.07.19a', ↵Paul E. McKenney
'poll.2022.07.21a', 'rcu-tasks.2022.06.21a' and 'torture.2022.06.21a' into HEAD doc.2022.06.21a: Documentation updates. fixes.2022.07.19a: Miscellaneous fixes. nocb.2022.07.19a: Callback-offload updates. poll.2022.07.21a: Polled grace-period updates. rcu-tasks.2022.06.21a: Tasks RCU updates. torture.2022.06.21a: Torture-test updates.
2022-07-21rcu: Add polled expedited grace-period primitivesPaul E. McKenney
This commit adds expedited grace-period functionality to RCU's polled grace-period API, adding start_poll_synchronize_rcu_expedited() and cond_synchronize_rcu_expedited(), which are similar to the existing start_poll_synchronize_rcu() and cond_synchronize_rcu() functions, respectively. Note that although start_poll_synchronize_rcu_expedited() can be invoked very early, the resulting expedited grace periods are not guaranteed to start until after workqueues are fully initialized. On the other hand, both synchronize_rcu() and synchronize_rcu_expedited() can also be invoked very early, and the resulting grace periods will be taken into account as they occur. [ paulmck: Apply feedback from Neeraj Upadhyay. ] Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-07-21resource: Introduce alloc_free_mem_region()Dan Williams
The core of devm_request_free_mem_region() is a helper that searches for free space in iomem_resource and performs __request_region_locked() on the result of that search. The policy choices of the implementation conform to what CONFIG_DEVICE_PRIVATE users want which is memory that is immediately marked busy, and a preference to search for the first-fit free range in descending order from the top of the physical address space. CXL has a need for a similar allocator, but with the following tweaks: 1/ Search for free space in ascending order 2/ Search for free space relative to a given CXL window 3/ 'insert' rather than 'request' the new resource given downstream drivers from the CXL Region driver (like the pmem or dax drivers) are responsible for request_mem_region() when they activate the memory range. Rework __request_free_mem_region() into get_free_mem_region() which takes a set of GFR_* (Get Free Region) flags to control the allocation policy (ascending vs descending), and "busy" policy (insert_resource() vs request_region()). As part of the consolidation of the legacy GFR_REQUEST_REGION case with the new default of just inserting a new resource into the free space some minor cleanups like not checking for NULL before calling devres_free() (which does its own check) is included. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/linux-cxl/20220420143406.GY2120790@nvidia.com/ Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/165784333333.1758207.13703329337805274043.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-07-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-21Merge tag 'net-5.19-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from can. Still no major regressions, most of the changes are still due to data races fixes, plus the usual bunch of drivers fixes. Previous releases - regressions: - tcp/udp: make early_demux back namespacified. - dsa: fix issues with vlan_filtering_is_global Previous releases - always broken: - ip: fix data-races around ipv4_net_table (round 2, 3 & 4) - amt: fix validation and synchronization bugs - can: fix detection of mcp251863 - eth: iavf: fix handling of dummy receive descriptors - eth: lan966x: fix issues with MAC table - eth: stmmac: dwmac-mediatek: fix clock issue Misc: - dsa: update documentation" * tag 'net-5.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits) mlxsw: spectrum_router: Fix IPv4 nexthop gateway indication net/sched: cls_api: Fix flow action initialization tcp: Fix data-races around sysctl_tcp_max_reordering. tcp: Fix a data-race around sysctl_tcp_abort_on_overflow. tcp: Fix a data-race around sysctl_tcp_rfc1337. tcp: Fix a data-race around sysctl_tcp_stdurg. tcp: Fix a data-race around sysctl_tcp_retrans_collapse. tcp: Fix data-races around sysctl_tcp_slow_start_after_idle. tcp: Fix a data-race around sysctl_tcp_thin_linear_timeouts. tcp: Fix data-races around sysctl_tcp_recovery. tcp: Fix a data-race around sysctl_tcp_early_retrans. tcp: Fix data-races around sysctl knobs related to SYN option. udp: Fix a data-race around sysctl_udp_l3mdev_accept. ip: Fix data-races around sysctl_ip_prot_sock. ipv4: Fix data-races around sysctl_fib_multipath_hash_fields. ipv4: Fix data-races around sysctl_fib_multipath_hash_policy. ipv4: Fix a data-race around sysctl_fib_multipath_use_neigh. can: rcar_canfd: Add missing of_node_put() in rcar_canfd_probe() can: mcp251xfd: fix detection of mcp251863 Documentation: fix udp_wmem_min in ip-sysctl.rst ...
2022-07-21cxl/acpi: Track CXL resources in iomem_resourceDan Williams
Recall that CXL capable address ranges, on ACPI platforms, are published in the CEDT.CFMWS (CXL Early Discovery Table: CXL Fixed Memory Window Structures). These windows represent both the actively mapped capacity and the potential address space that can be dynamically assigned to a new CXL decode configuration (region / interleave-set). CXL endpoints like DDR DIMMs can be mapped at any physical address including 0 and legacy ranges. There is an expectation and requirement that the /proc/iomem interface and the iomem_resource tree in the kernel reflect the full set of platform address ranges. I.e. that every address range that platform firmware and bus drivers enumerate be reflected as an iomem_resource entry. The hard requirement to do this for CXL arises from the fact that facilities like CONFIG_DEVICE_PRIVATE expect to be able to treat empty iomem_resource ranges as free for software to use as proxy address space. Without CXL publishing its potential address ranges in iomem_resource, the CONFIG_DEVICE_PRIVATE mechanism may inadvertently steal capacity reserved for runtime provisioning of new CXL regions. So, iomem_resource needs to know about both active and potential CXL resource ranges. The active CXL resources might already be reflected in iomem_resource as "System RAM". insert_resource_expand_to_fit() handles re-parenting "System RAM" underneath a CXL window. The "_expand_to_fit()" behavior handles cases where a CXL window is not a strict superset of an existing entry in the iomem_resource tree. The "_expand_to_fit()" behavior is acceptable from the perspective of resource allocation. The expansion happens because a conflicting resource range is already populated, which means the resource boundary expansion does not result in any additional free CXL address space being made available. CXL address space allocation is always bounded by the orginal unexpanded address range. However, the potential for expansion does mean that something like walk_iomem_res_desc(IORES_DESC_CXL...) can only return fuzzy answers on corner case platforms that cause the resource tree to expand a CXL window resource over a range that is not decoded by CXL. This would be an odd platform configuration, but if it becomes a problem in practice the CXL subsytem could just publish an API that returns definitive answers. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/165784325943.1758207.5310344844375305118.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-07-21dmaengine: imx-sdma: Add FIFO stride support for multi FIFO scriptShengjiu Wang
The peripheral may have several FIFOs, but some case just select some FIFOs from them for data transfer, which means FIFO0 and FIFO2 may be selected. So add FIFO address stride support, 0 means all FIFOs are continuous, 1 means 1 word stride between FIFOs. All stride between FIFOs should be same. Another option words_per_fifo means how many audio channel data copied to one FIFO one time, 1 means one channel per FIFO, 2 means 2 channels per FIFO. If 'n_fifos_src = 4' and 'words_per_fifo = 2', it means the first two words(channels) fetch from FIFO0 and then jump to FIFO1 for next two words, and so on after the last FIFO3 fetched, roll back to FIFO0. Signed-off-by: Joy Zou <joy.zou@nxp.com> Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com> Link: https://lore.kernel.org/r/1657162829-9273-1-git-send-email-shengjiu.wang@nxp.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2022-07-21net/cdc_ncm: Increase NTB max RX/TX values to 64kbŁukasz Spintzyk
DisplayLink ethernet devices require NTB buffers larger then 32kb in order to run with highest performance. This patch is changing upper limit of the rx and tx buffers. Those buffers are initialized with CDC_NCM_NTB_DEF_SIZE_RX and CDC_NCM_NTB_DEF_SIZE_TX which is 16kb so by default no device is affected by increased limit. Rx and tx buffer is increased under two conditions: - Device need to advertise that it supports higher buffer size in dwNtbMaxInMaxSize and dwNtbMaxOutMaxSize. - cdc_ncm/rx_max and cdc_ncm/tx_max driver parameters must be adjusted with udev rule or ethtool. Summary of testing and performance results: Tests were performed on following devices: - DisplayLink DL-3xxx family device - DisplayLink DL-6xxx family device - ASUS USB-C2500 2.5G USB3 ethernet adapter - Plugable USB3 1G USB3 ethernet adapter - EDIMAX EU-4307 USB-C ethernet adapter - Dell DBQBCBC064 USB-C ethernet adapter Performance measurements were done with: - iperf3 between two linux boxes - http://openspeedtest.com/ instance running on local test machine Insights from tests results: - All except one from third party usb adapters were not affected by increased buffer size to their advertised dwNtbOutMaxSize and dwNtbInMaxSize. Devices were generally reaching 912-940Mbps both download and upload. Only EDIMAX adapter experienced decreased download size from 929Mbps to 827Mbps with iper3, with openspeedtest decrease was from 968Mbps to 886Mbps. - DisplayLink DL-3xxx family devices experienced performance increase with iperf3 download from 300Mbps to 870Mbps and upload from 782Mbps to 844Mbps. With openspeedtest download increased from 556Mbps to 873Mbps and upload from 727Mbps to 973Mbps - DiplayLink DL-6xxx family devices are not affected by increased buffer size. Signed-off-by: Łukasz Spintzyk <lukasz.spintzyk@synaptics.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20220720060518.541-2-lukasz.spintzyk@synaptics.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-07-21x86/extable: Fix ex_handler_msr() print conditionPeter Zijlstra
On Fri, Jun 17, 2022 at 02:08:52PM +0300, Stephane Eranian wrote: > Some changes to the way invalid MSR accesses are reported by the > kernel is causing some problems with messages printed on the > console. > > We have seen several cases of ex_handler_msr() printing invalid MSR > accesses once but the callstack multiple times causing confusion on > the console. > The problem here is that another earlier commit (5.13): > > a358f40600b3 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality") > > Modifies all the pr_*_once() calls to always return true claiming > that no caller is ever checking the return value of the functions. > > This is why we are seeing the callstack printed without the > associated printk() msg. Extract the ONCE_IF(cond) part into __ONCE_LTE_IF() and use that to implement DO_ONCE_LITE_IF() and fix the extable code. Fixes: a358f40600b3 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality") Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Stephane Eranian <eranian@google.com> Link: https://lkml.kernel.org/r/YqyVFsbviKjVGGZ9@worktop.programming.kicks-ass.net
2022-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter/IPVS updates for net-next The following patchset contains Netfilter/IPVS updates for net-next: 1) Simplify nf_ct_get_tuple(), from Jackie Liu. 2) Add format to request_module() call, from Bill Wendling. 3) Add /proc/net/stats/nf_flowtable to monitor in-flight pending hardware offload objects to be processed, from Vlad Buslov. 4) Missing rcu annotation and accessors in the netfilter tree, from Florian Westphal. 5) Merge h323 conntrack helper nat hooks into single object, also from Florian. 6) A batch of update to fix sparse warnings treewide, from Florian Westphal. 7) Move nft_cmp_fast_mask() where it used, from Florian. 8) Missing const in nf_nat_initialized(), from James Yonan. 9) Use bitmap API for Maglev IPVS scheduler, from Christophe Jaillet. 10) Use refcount_inc instead of _inc_not_zero in flowtable, from Florian Westphal. 11) Remove pr_debug in xt_TPROXY, from Nathan Cancellor. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: xt_TPROXY: remove pr_debug invocations netfilter: flowtable: prefer refcount_inc netfilter: ipvs: Use the bitmap API to allocate bitmaps netfilter: nf_nat: in nf_nat_initialized(), use const struct nf_conn * netfilter: nf_tables: move nft_cmp_fast_mask to where its used netfilter: nf_tables: use correct integer types netfilter: nf_tables: add and use BE register load-store helpers netfilter: nf_tables: use the correct get/put helpers netfilter: x_tables: use correct integer types netfilter: nfnetlink: add missing __be16 cast netfilter: nft_set_bitmap: Fix spelling mistake netfilter: h323: merge nat hook pointers into one netfilter: nf_conntrack: use rcu accessors where needed netfilter: nf_conntrack: add missing __rcu annotations netfilter: nf_flow_table: count pending offload workqueue tasks net/sched: act_ct: set 'net' pointer when creating new nf_flow_table netfilter: conntrack: use correct format characters netfilter: conntrack: use fallthrough to cleanup ==================== Link: https://lore.kernel.org/r/20220720230754.209053-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-20Merge tag 'mlx5-updates-2022-07-17' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2022-07-17 1) Add resiliency for lost completions for PTP TX port timestamp 2) Report Header-data split state via ethtool 3) Decouple HTB code from main regular TX code * tag 'mlx5-updates-2022-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux: net/mlx5: CT: Remove warning of ignore_flow_level support for non PF net/mlx5e: Add resiliency for PTP TX port timestamp net/mlx5: Expose ts_cqe_metadata_size2wqe_counter net/mlx5e: HTB, move htb functions to a new file net/mlx5e: HTB, change functions name to follow convention net/mlx5e: HTB, remove priv from htb function calls net/mlx5e: HTB, hide and dynamically allocate mlx5e_htb structure net/mlx5e: HTB, move stats and max_sqs to priv net/mlx5e: HTB, move section comment to the right place net/mlx5e: HTB, move ids to selq_params struct net/mlx5e: HTB, reduce visibility of htb functions net/mlx5e: Fix mqprio_rl handling on devlink reload net/mlx5e: Report header-data split state through ethtool ==================== Link: https://lore.kernel.org/r/20220719203529.51151-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-20bpf: Fix bpf_trampoline_{,un}link_cgroup_shim ifdef guardsStanislav Fomichev
They were updated in kernel/bpf/trampoline.c to fix another build issue. We should to do the same for include/linux/bpf.h header. Fixes: 3908fcddc65d ("bpf: fix lsm_cgroup build errors on esoteric configs") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220720155220.4087433-1-sdf@google.com
2022-07-20vfio: Replace the iommu notifier with a device listJason Gunthorpe
Instead of bouncing the function call to the driver op through a blocking notifier just have the iommu layer call it directly. Register each device that is being attached to the iommu with the lower driver which then threads them on a linked list and calls the appropriate driver op at the right time. Currently the only use is if dma_unmap() is defined. Also, fully lock all the debugging tests on the pinning path that a dma_unmap is registered. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v4-681e038e30fd+78-vfio_unmap_notif_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-07-20vfio: Replace the DMA unmapping notifier with a callbackJason Gunthorpe
Instead of having drivers register the notifier with explicit code just have them provide a dma_unmap callback op in their driver ops and rely on the core code to wire it up. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Zhenyu Wang <zhenyuw@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v4-681e038e30fd+78-vfio_unmap_notif_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-07-20x86/amd_nb: Add AMD PCI IDs for SMN communicationMario Limonciello
Add support for SMN communication on family 17h model A0h and family 19h models 60h-70h. [ bp: Merge into a single patch. ] Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h Acked-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20220719195256.1516-1-mario.limonciello@amd.com
2022-07-20Merge branch irq/loongarch into irq/irqchip-nextMarc Zyngier
* irq/loongarch: : . : Merge the long awaited IRQ support for the LoongArch architecture. : : From the cover letter: : : "Currently, LoongArch based processors (e.g. Loongson-3A5000) : can only work together with LS7A chipsets. The irq chips in : LoongArch computers include CPUINTC (CPU Core Interrupt : Controller), LIOINTC (Legacy I/O Interrupt Controller), : EIOINTC (Extended I/O Interrupt Controller), PCH-PIC (Main : Interrupt Controller in LS7A chipset), PCH-LPC (LPC Interrupt : Controller in LS7A chipset) and PCH-MSI (MSI Interrupt Controller)." : : Note that this comes with non-official, arch private ACPICA : definitions until the official ACPICA update is realeased. : . irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch irqchip: Add LoongArch CPU interrupt controller support irqchip: Add Loongson Extended I/O interrupt controller support irqchip/loongson-liointc: Add ACPI init support irqchip/loongson-pch-msi: Add ACPI init support irqchip/loongson-pch-pic: Add ACPI init support irqchip: Add Loongson PCH LPC controller support LoongArch: Prepare to support multiple pch-pic and pch-msi irqdomain LoongArch: Use ACPI_GENERIC_GSI for gsi handling genirq/generic_chip: Export irq_unmap_generic_chip ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallback APCI: irq: Add support for multiple GSI domains LoongArch: Provisionally add ACPICA data structures Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-07-20irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArchJianmin Lv
For LoongArch, ACPI_IRQ_MODEL_LPIC is introduced, and then the callback acpi_get_gsi_domain_id and acpi_gsi_to_irq_fallback are implemented. The acpi_get_gsi_domain_id callback returns related fwnode handle of irqdomain for different GSI range. The acpi_gsi_to_irq_fallback will create new mapping for gsi when the mapping of it is not found. Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1658314292-35346-14-git-send-email-lvjianmin@loongson.cn
2022-07-20irqchip: Add Loongson Extended I/O interrupt controller supportHuacai Chen
EIOINTC stands for "Extended I/O Interrupts" that described in Section 11.2 of "Loongson 3A5000 Processor Reference Manual". For more information please refer Documentation/loongarch/irq-chip-model.rst. Loongson-3A5000 has 4 cores per NUMA node, and each NUMA node has an EIOINTC; while Loongson-3C5000 has 16 cores per NUMA node, and each NUMA node has 4 EIOINTCs. In other words, 16 cores of one NUMA node in Loongson-3C5000 are organized in 4 groups, each group connects to an EIOINTC. We call the "group" here as an EIOINTC node, so each EIOINTC node always includes 4 cores (both in Loongson-3A5000 and Loongson- 3C5000). Co-developed-by: Jianmin Lv <lvjianmin@loongson.cn> Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1658314292-35346-12-git-send-email-lvjianmin@loongson.cn
2022-07-20genirq/generic_chip: Export irq_unmap_generic_chipJianmin Lv
Some irq controllers have to re-implement a private version for irq_generic_chip_ops, because they have a different xlate to translate hwirq. Export irq_unmap_generic_chip to allow reusing in drivers. Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1658314292-35346-5-git-send-email-lvjianmin@loongson.cn
2022-07-20ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallbackMarc Zyngier
It appears that the generic version of acpi_gsi_to_irq() doesn't fallback to establishing a mapping if there is no pre-existing one while the x86 version does. While arm64 seems unaffected by it, LoongArch is relying on the x86 behaviour. In an effort to prevent new architectures from reinventing the proverbial wheel, provide an optional callback that the arch code can set to restore the x86 behaviour. Hopefully we can eventually get rid of this in the future once the expected behaviour has been clarified. Reported-by: Jianmin Lv <lvjianmin@loongson.cn> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn> Tested-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Link: https://lore.kernel.org/r/1658314292-35346-4-git-send-email-lvjianmin@loongson.cn
2022-07-20APCI: irq: Add support for multiple GSI domainsMarc Zyngier
In an unfortunate departure from the ACPI spec, the LoongArch architecture split its GSI space across multiple interrupt controllers. In order to be able to reuse the core code and prevent architectures from reinventing an already square wheel, offer the arch code the ability to register a dispatcher function that will return the domain fwnode for a given GSI. The ARM GIC drivers are updated to support this (with a single domain, as intended). Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn> Tested-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Link: https://lore.kernel.org/r/1658314292-35346-3-git-send-email-lvjianmin@loongson.cn
2022-07-20arm64: enable THP_SWAP for arm64Barry Song
THP_SWAP has been proven to improve the swap throughput significantly on x86_64 according to commit bd4c82c22c367e ("mm, THP, swap: delay splitting THP after swapped out"). As long as arm64 uses 4K page size, it is quite similar with x86_64 by having 2MB PMD THP. THP_SWAP is architecture-independent, thus, enabling it on arm64 will benefit arm64 as well. A corner case is that MTE has an assumption that only base pages can be swapped. We won't enable THP_SWAP for ARM64 hardware with MTE support until MTE is reworked to coexist with THP_SWAP. A micro-benchmark is written to measure thp swapout throughput as below, unsigned long long tv_to_ms(struct timeval tv) { return tv.tv_sec * 1000 + tv.tv_usec / 1000; } main() { struct timeval tv_b, tv_e;; #define SIZE 400*1024*1024 volatile void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (!p) { perror("fail to get memory"); exit(-1); } madvise(p, SIZE, MADV_HUGEPAGE); memset(p, 0x11, SIZE); /* write to get mem */ gettimeofday(&tv_b, NULL); madvise(p, SIZE, MADV_PAGEOUT); gettimeofday(&tv_e, NULL); printf("swp out bandwidth: %ld bytes/ms\n", SIZE/(tv_to_ms(tv_e) - tv_to_ms(tv_b))); } Testing is done on rk3568 64bit Quad Core Cortex-A55 platform - ROCK 3A. thp swp throughput w/o patch: 2734bytes/ms (mean of 10 tests) thp swp throughput w/ patch: 3331bytes/ms (mean of 10 tests) Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Steven Price <steven.price@arm.com> Cc: Yang Shi <shy828301@gmail.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Link: https://lore.kernel.org/r/20220720093737.133375-1-21cnbao@gmail.com Signed-off-by: Will Deacon <will@kernel.org>
2022-07-19driver-core: Introduce BIN_ATTR_ADMIN_{RO,RW}Ira Weiny
Many binary attributes need to limit access to CAP_SYS_ADMIN only; ie many binary attributes specify is_visible with 0400 or 0600. Make setting the permissions of such attributes more explicit by defining BIN_ATTR_ADMIN_{RO,RW}. Cc: Bjorn Helgaas <bhelgaas@google.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Suggested-by: Krzysztof Wilczyński <kw@linux.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20220719205249.566684-6-ira.weiny@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-07-19PCI/DOE: Add DOE mailbox support functionsJonathan Cameron
Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based mailbox with standard protocol discovery. Each mailbox is accessed through a DOE Extended Capability. Each DOE mailbox must support the DOE discovery protocol in addition to any number of additional protocols. Define core PCIe functionality to manage a single PCIe DOE mailbox at a defined config space offset. Functionality includes iterating, creating, query of supported protocol, and task submission. Destruction of the mailboxes is device managed. Cc: "Li, Ming" <ming4.li@intel.com> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Acked-by: Bjorn Helgaas <helgaas@kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Co-developed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20220719205249.566684-4-ira.weiny@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-07-19PCI: Add vendor ID for the PCI SIGJonathan Cameron
This ID is used in DOE headers to identify protocols that are defined within the PCI Express Base Specification, PCIe r6.0, sec 6.30.1.1 table 6-32. Acked-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20220719205249.566684-2-ira.weiny@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-07-19Merge branch 'io_uring-zerocopy-send' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/kuba/linux Pavel Begunkov says: ==================== io_uring zerocopy send The patchset implements io_uring zerocopy send. It works with both registered and normal buffers, mixing is allowed but not recommended. Apart from usual request completions, just as with MSG_ZEROCOPY, io_uring separately notifies the userspace when buffers are freed and can be reused (see API design below), which is delivered into io_uring's Completion Queue. Those "buffer-free" notifications are not necessarily per request, but the userspace has control over it and should explicitly attaching a number of requests to a single notification. The series also adds some internal optimisations when used with registered buffers like removing page referencing. From the kernel networking perspective there are two main changes. The first one is passing ubuf_info into the network layer from io_uring (inside of an in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info caching on the io_uring side, but also helps to avoid cross-referencing and synchronisation problems. The second part is an optional optimisation removing page referencing for requests with registered buffers. Benchmarking UDP with an optimised version of the selftest (see [1]), which sends a bunch of requests, waits for completions and repeats. "+ flush" column posts one additional "buffer-free" notification per request, and just "zc" doesn't post buffer notifications at all. NIC (requests / second): IO size | non-zc | zc | zc + flush 4000 | 495134 | 606420 (+22%) | 558971 (+12%) 1500 | 551808 | 577116 (+4.5%) | 565803 (+2.5%) 1000 | 584677 | 592088 (+1.2%) | 560885 (-4%) 600 | 596292 | 598550 (+0.4%) | 555366 (-6.7%) dummy (requests / second): IO size | non-zc | zc | zc + flush 8000 | 1299916 | 2396600 (+84%) | 2224219 (+71%) 4000 | 1869230 | 2344146 (+25%) | 2170069 (+16%) 1200 | 2071617 | 2361960 (+14%) | 2203052 (+6%) 600 | 2106794 | 2381527 (+13%) | 2195295 (+4%) Previously it also brought a massive performance speedup compared to the msg_zerocopy tool (see [3]), which is probably not super interesting. There is also an additional bunch of refcounting optimisations that was omitted from the series for simplicity and as they don't change the picture drastically, they will be sent as follow up, as well as flushing optimisations closing the performance gap b/w two last columns. For TCP on localhost (with hacks enabling localhost zerocopy) and including additional overhead for receive: IO size | non-zc | zc 1200 | 4174 | 4148 4096 | 7597 | 11228 Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the omitted optimisations will somewhat help, should look better for 4000, but couldn't test properly because of setup problems. Links: liburing (benchmark + tests): [1] https://github.com/isilence/liburing/tree/zc_v4 kernel repo: [2] https://github.com/isilence/linux/tree/zc_v4 RFC v1: [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/ RFC v2: https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/ Net patches based: git@github.com:isilence/linux.git zc_v4-net-base or https://github.com/isilence/linux/tree/zc_v4-net-base API design overview: The series introduces an io_uring concept of notifactors. From the userspace perspective it's an entity to which it can bind one or more requests and then requesting to flush it. Flushing a notifier makes it impossible to attach new requests to it, and instructs the notifier to post a completion once all requests attached to it are completed and the kernel doesn't need the buffers anymore. Notifications are stored in notification slots, which should be registered as an array in io_uring. Each slot stores only one notifier at any particular moment. Flushing removes it from the slot and the slot automatically replaces it with a new notifier. All operations with notifiers are done by specifying an index of a slot it's currently in. When registering a notification the userspace specifies a u64 tag for each slot, which will be copied in notification completion entries as cqe::user_data. cqe::res is 0 and cqe::flags is equal to wrap around u32 sequence number counting notifiers of a slot. ==================== Link: https://lore.kernel.org/r/cover.1657643355.git.asml.silence@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19net: introduce __skb_fill_page_desc_noaccPavel Begunkov
Managed pages contain pinned userspace pages and controlled by upper layers, there is no need in tracking skb->pfmemalloc for them. Introduce a helper for filling frags but ignoring page tracking, it'll be needed later. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19net: introduce managed frags infrastructurePavel Begunkov
Some users like io_uring can do page pinning more efficiently, so we want a way to delegate referencing to other subsystems. For that add a new flag called SKBFL_MANAGED_FRAG_REFS. When set, skb doesn't hold page references and upper layers are responsivle to managing page lifetime. It's allowed to convert skbs from managed to normal by calling skb_zcopy_downgrade_managed(). The function will take all needed page references and clear the flag. It's needed, for instance, to avoid mixing managed modes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19net: Allow custom iter handler in msghdrDavid Ahern
Add support for custom iov_iter handling to msghdr. The idea is that in-kernel subsystems want control over how an SG is split. Signed-off-by: David Ahern <dsahern@kernel.org> [pavel: move callback into msghdr] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-19skbuff: carry external ubuf_info in msghdrPavel Begunkov
Make possible for network in-kernel callers like io_uring to pass in a custom ubuf_info by setting it in a new field of struct msghdr. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>