summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2020-01-17XArray: Add xa_for_each_rangexarray-5.5Matthew Wilcox (Oracle)
This function supports iterating over a range of an array. Also add documentation links for xa_for_each_start(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2020-01-17XArray: Add wrappers for nested spinlocksMatthew Wilcox (Oracle)
Some users need to take an xarray lock while holding another xarray lock. Reported-by: Doug Gilbert <dgilbert@interlog.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2019-11-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) BPF sample build fixes from Björn Töpel 2) Fix powerpc bpf tail call implementation, from Eric Dumazet. 3) DCCP leaks jiffies on the wire, fix also from Eric Dumazet. 4) Fix crash in ebtables when using dnat target, from Florian Westphal. 5) Fix port disable handling whne removing bcm_sf2 driver, from Florian Fainelli. 6) Fix kTLS sk_msg trim on fallback to copy mode, from Jakub Kicinski. 7) Various KCSAN fixes all over the networking, from Eric Dumazet. 8) Memory leaks in mlx5 driver, from Alex Vesker. 9) SMC interface refcounting fix, from Ursula Braun. 10) TSO descriptor handling fixes in stmmac driver, from Jose Abreu. 11) Add a TX lock to synchonize the kTLS TX path properly with crypto operations. From Jakub Kicinski. 12) Sock refcount during shutdown fix in vsock/virtio code, from Stefano Garzarella. 13) Infinite loop in Intel ice driver, from Colin Ian King. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (108 commits) ixgbe: need_wakeup flag might not be set for Tx i40e: need_wakeup flag might not be set for Tx igb/igc: use ktime accessors for skb->tstamp i40e: Fix for ethtool -m issue on X722 NIC iavf: initialize ITRN registers with correct values ice: fix potential infinite loop because loop counter being too small qede: fix NULL pointer deref in __qede_remove() net: fix data-race in neigh_event_send() vsock/virtio: fix sock refcnt holding during the shutdown net: ethernet: octeon_mgmt: Account for second possible VLAN header mac80211: fix station inactive_time shortly after boot net/fq_impl: Switch to kvmalloc() for memory allocation mac80211: fix ieee80211_txq_setup_flows() failure path ipv4: Fix table id reference in fib_sync_down_addr ipv6: fixes rt6_probe() and fib6_nh->last_probe init net: hns: Fix the stray netpoll locks causing deadlock in NAPI path net: usb: qmi_wwan: add support for DW5821e with eSIM support CDC-NCM: handle incomplete transfer of MTU nfc: netlink: fix double device reference drop NFC: st21nfca: fix double free ...
2019-11-08Merge tag 'xarray-5.4' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds
Pull XArray fixes from Matthew Wilcox: "These all fix various bugs, some of which people have tripped over and some of which have been caught by automatic tools" * tag 'xarray-5.4' of git://git.infradead.org/users/willy/linux-dax: idr: Fix idr_alloc_u32 on 32-bit systems idr: Fix integer overflow in idr_for_each_entry radix tree: Remove radix_tree_iter_find idr: Fix idr_get_next_ul race with idr_remove XArray: Fix xas_next() with a single entry at 0
2019-11-06mm: thp: handle page cache THP correctly in PageTransCompoundMapYang Shi
We have a usecase to use tmpfs as QEMU memory backend and we would like to take the advantage of THP as well. But, our test shows the EPT is not PMD mapped even though the underlying THP are PMD mapped on host. The number showed by /sys/kernel/debug/kvm/largepage is much less than the number of PMD mapped shmem pages as the below: 7f2778200000-7f2878200000 rw-s 00000000 00:14 262232 /dev/shm/qemu_back_mem.mem.Hz2hSf (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 579584 kB [snip] Locked: 0 kB cat /sys/kernel/debug/kvm/largepages 12 And some benchmarks do worse than with anonymous THPs. By digging into the code we figured out that commit 127393fbe597 ("mm: thp: kvm: fix memory corruption in KVM with THP enabled") checks if there is a single PTE mapping on the page for anonymous THP when setting up EPT map. But the _mapcount < 0 check doesn't work for page cache THP since every subpage of page cache THP would get _mapcount inc'ed once it is PMD mapped, so PageTransCompoundMap() always returns false for page cache THP. This would prevent KVM from setting up PMD mapped EPT entry. So we need handle page cache THP correctly. However, when page cache THP's PMD gets split, kernel just remove the map instead of setting up PTE map like what anonymous THP does. Before KVM calls get_user_pages() the subpages may get PTE mapped even though it is still a THP since the page cache THP may be mapped by other processes at the mean time. Checking its _mapcount and whether the THP has PTE mapped or not. Although this may report some false negative cases (PTE mapped by other processes), it looks not trivial to make this accurate. With this fix /sys/kernel/debug/kvm/largepage would show reasonable pages are PMD mapped by EPT as the below: 7fbeaee00000-7fbfaee00000 rw-s 00000000 00:14 275464 /dev/shm/qemu_back_mem.mem.SKUvat (deleted) Size: 4194304 kB [snip] AnonHugePages: 0 kB ShmemPmdMapped: 557056 kB [snip] Locked: 0 kB cat /sys/kernel/debug/kvm/largepages 271 And the benchmarks are as same as anonymous THPs. [yang.shi@linux.alibaba.com: v4] Link: http://lkml.kernel.org/r/1571865575-42913-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1571769577-89735-1-git-send-email-yang.shi@linux.alibaba.com Fixes: dd78fedde4b9 ("rmap: support file thp") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Gang Deng <gavin.dg@linux.alibaba.com> Tested-by: Gang Deng <gavin.dg@linux.alibaba.com> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-05net/tls: fix sk_msg trim on fallback to copy modeJakub Kicinski
sk_msg_trim() tries to only update curr pointer if it falls into the trimmed region. The logic, however, does not take into the account pointer wrapping that sk_msg_iter_var_prev() does nor (as John points out) the fact that msg->sg is a ring buffer. This means that when the message was trimmed completely, the new curr pointer would have the value of MAX_MSG_FRAGS - 1, which is neither smaller than any other value, nor would it actually be correct. Special case the trimming to 0 length a little bit and rework the comparison between curr and end to take into account wrapping. This bug caused the TLS code to not copy all of the message, if zero copy filled in fewer sg entries than memcopy would need. Big thanks to Alexander Potapenko for the non-KMSAN reproducer. v2: - take into account that msg->sg is a ring buffer (John). Link: https://lore.kernel.org/netdev/20191030160542.30295-1-jakub.kicinski@netronome.com/ (v1) Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface") Reported-by: syzbot+f8495bff23a879a6d0bd@syzkaller.appspotmail.com Reported-by: syzbot+6f50c99e8f6194bf363f@syzkaller.appspotmail.com Co-developed-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-11-02 The following pull-request contains BPF updates for your *net* tree. We've added 6 non-merge commits during the last 6 day(s) which contain a total of 8 files changed, 35 insertions(+), 9 deletions(-). The main changes are: 1) Fix ppc BPF JIT's tail call implementation by performing a second pass to gather a stable JIT context before opcode emission, from Eric Dumazet. 2) Fix build of BPF samples sys_perf_event_open() usage to compiled out unavailable test_attr__{enabled,open} checks. Also fix potential overflows in bpf_map_{area_alloc,charge_init} on 32 bit archs, from Björn Töpel. 3) Fix narrow loads of bpf_sysctl context fields with offset > 0 on big endian archs like s390x and also improve the test coverage, from Ilya Leoshkevich. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-03idr: Fix integer overflow in idr_for_each_entryMatthew Wilcox (Oracle)
If there is an entry at INT_MAX then idr_for_each_entry() will increment id after handling it. This is undefined behaviour, and is caught by UBSAN. Adding 1U to id forces the operation to be carried out as an unsigned addition which (when assigned to id) will result in INT_MIN. Since there is never an entry stored at INT_MIN, idr_get_next() will return NULL, ending the loop as expected. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2019-11-01radix tree: Remove radix_tree_iter_findMatthew Wilcox (Oracle)
This API is unsafe to use under the RCU lock. With no in-tree users remaining, remove it to prevent future bugs. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2019-11-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix free/alloc races in batmanadv, from Sven Eckelmann. 2) Several leaks and other fixes in kTLS support of mlx5 driver, from Tariq Toukan. 3) BPF devmap_hash cost calculation can overflow on 32-bit, from Toke Høiland-Jørgensen. 4) Add an r8152 device ID, from Kazutoshi Noguchi. 5) Missing include in ipv6's addrconf.c, from Ben Dooks. 6) Use siphash in flow dissector, from Eric Dumazet. Attackers can easily infer the 32-bit secret otherwise etc. 7) Several netdevice nesting depth fixes from Taehee Yoo. 8) Fix several KCSAN reported errors, from Eric Dumazet. For example, when doing lockless skb_queue_empty() checks, and accessing sk_napi_id/sk_incoming_cpu lockless as well. 9) Fix jumbo packet handling in RXRPC, from David Howells. 10) Bump SOMAXCONN and tcp_max_syn_backlog values, from Eric Dumazet. 11) Fix DMA synchronization in gve driver, from Yangchun Fu. 12) Several bpf offload fixes, from Jakub Kicinski. 13) Fix sk_page_frag() recursion during memory reclaim, from Tejun Heo. 14) Fix ping latency during high traffic rates in hisilicon driver, from Jiangfent Xiao. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (146 commits) net: fix installing orphaned programs net: cls_bpf: fix NULL deref on offload filter removal selftests: bpf: Skip write only files in debugfs selftests: net: reuseport_dualstack: fix uninitalized parameter r8169: fix wrong PHY ID issue with RTL8168dp net: dsa: bcm_sf2: Fix IMP setup for port different than 8 net: phylink: Fix phylink_dbg() macro gve: Fixes DMA synchronization. inet: stop leaking jiffies on the wire ixgbe: Remove duplicate clear_bit() call Documentation: networking: device drivers: Remove stray asterisks e1000: fix memory leaks i40e: Fix receive buffer starvation for AF_XDP igb: Fix constant media auto sense switching when no cable is connected net: ethernet: arc: add the missed clk_disable_unprepare igb: Enable media autosense for the i350. igb/igc: Don't warn on fatal read failures when the device is removed tcp: increase tcp_max_syn_backlog max value net: increase SOMAXCONN to 4096 netdevsim: Fix use-after-free during device dismantle ...
2019-11-01Merge tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Anna Schumaker: "This contains two delegation fixes (with the RCU lock leak fix marked for stable), and three patches to fix destroying the the sunrpc back channel. Stable bugfixes: - Fix an RCU lock leak in nfs4_refresh_delegation_stateid() Other fixes: - The TCP back channel mustn't disappear while requests are outstanding - The RDMA back channel mustn't disappear while requests are outstanding - Destroy the back channel when we destroy the host transport - Don't allow a cached open with a revoked delegation" * tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid() NFSv4: Don't allow a cached open with a revoked delegation SUNRPC: Destroy the back channel when we destroy the host transport SUNRPC: The RDMA back channel mustn't disappear while requests are outstanding SUNRPC: The TCP back channel mustn't disappear while requests are outstanding
2019-11-01Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixes: an ABI fix for a reserved field, AMD IBS fixes, an Intel uncore PMU driver fix and a header typo fix" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/headers: Fix spelling s/EACCESS/EACCES/, s/privilidge/privilege/ perf/x86/uncore: Fix event group support perf/x86/amd/ibs: Handle erratum #420 only on the affected CPU family (10h) perf/x86/amd/ibs: Fix reading of the IBS OpData register and thus precise RIP validity perf/core: Start rejecting the syscall with attr.__reserved_2 set
2019-11-01Merge branch 'efi-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fixes from Ingo Molnar: "Various fixes all over the map: prevent boot crashes on HyperV, classify UEFI randomness as bootloader randomness, fix EFI boot for the Raspberry Pi2, fix efi_test permissions, etc" * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/efi_test: Lock down /dev/efi_test and require CAP_SYS_ADMIN x86, efi: Never relocate kernel below lowest acceptable address efi: libstub/arm: Account for firmware reserved memory at the base of RAM efi/random: Treat EFI_RNG_PROTOCOL output as bootloader randomness efi/tpm: Return -EINVAL when determining tpm final events log size fails efi: Make CONFIG_EFI_RCI2_TABLE selectable on x86 only
2019-10-31net: increase SOMAXCONN to 4096Eric Dumazet
SOMAXCONN is /proc/sys/net/core/somaxconn default value. It has been defined as 128 more than 20 years ago. Since it caps the listen() backlog values, the very small value has caused numerous problems over the years, and many people had to raise it on their hosts after beeing hit by problems. Google has been using 1024 for at least 15 years, and we increased this to 4096 after TCP listener rework has been completed, more than 4 years ago. We got no complain of this change breaking any legacy application. Many applications indeed setup a TCP listener with listen(fd, -1); meaning they let the system select the backlog. Raising SOMAXCONN lowers chance of the port being unavailable under even small SYNFLOOD attack, and reduces possibilities of side channel vulnerabilities. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Yue Cao <ycao009@ucr.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31bpf: Change size to u64 for bpf_map_{area_alloc, charge_init}()Björn Töpel
The functions bpf_map_area_alloc() and bpf_map_charge_init() prior this commit passed the size parameter as size_t. In this commit this is changed to u64. All users of these functions avoid size_t overflows on 32-bit systems, by explicitly using u64 when calculating the allocation size and memory charge cost. However, since the result was narrowed by the size_t when passing size and cost to the functions, the overflow handling was in vain. Instead of changing all call sites to size_t and handle overflow at the call site, the parameter is changed to u64 and checked in the functions above. Fixes: d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc") Fixes: c85d69135a91 ("bpf: move memory size checks to bpf_map_charge_init()") Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Link: https://lore.kernel.org/bpf/20191029154307.23053-1-bjorn.topel@gmail.com
2019-10-31efi/efi_test: Lock down /dev/efi_test and require CAP_SYS_ADMINJavier Martinez Canillas
The driver exposes EFI runtime services to user-space through an IOCTL interface, calling the EFI services function pointers directly without using the efivar API. Disallow access to the /dev/efi_test character device when the kernel is locked down to prevent arbitrary user-space to call EFI runtime services. Also require CAP_SYS_ADMIN to open the chardev to prevent unprivileged users to call the EFI runtime services, instead of just relying on the chardev file mode bits for this. The main user of this driver is the fwts [0] tool that already checks if the effective user ID is 0 and fails otherwise. So this change shouldn't cause any regression to this tool. [0]: https://wiki.ubuntu.com/FirmwareTestSuite/Reference/uefivarinfo Signed-off-by: Javier Martinez Canillas <javierm@redhat.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Laszlo Ersek <lersek@redhat.com> Acked-by: Matthew Garrett <mjg59@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: https://lkml.kernel.org/r/20191029173755.27149-7-ardb@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-31x86, efi: Never relocate kernel below lowest acceptable addressKairui Song
Currently, kernel fails to boot on some HyperV VMs when using EFI. And it's a potential issue on all x86 platforms. It's caused by broken kernel relocation on EFI systems, when below three conditions are met: 1. Kernel image is not loaded to the default address (LOAD_PHYSICAL_ADDR) by the loader. 2. There isn't enough room to contain the kernel, starting from the default load address (eg. something else occupied part the region). 3. In the memmap provided by EFI firmware, there is a memory region starts below LOAD_PHYSICAL_ADDR, and suitable for containing the kernel. EFI stub will perform a kernel relocation when condition 1 is met. But due to condition 2, EFI stub can't relocate kernel to the preferred address, so it fallback to ask EFI firmware to alloc lowest usable memory region, got the low region mentioned in condition 3, and relocated kernel there. It's incorrect to relocate the kernel below LOAD_PHYSICAL_ADDR. This is the lowest acceptable kernel relocation address. The first thing goes wrong is in arch/x86/boot/compressed/head_64.S. Kernel decompression will force use LOAD_PHYSICAL_ADDR as the output address if kernel is located below it. Then the relocation before decompression, which move kernel to the end of the decompression buffer, will overwrite other memory region, as there is no enough memory there. To fix it, just don't let EFI stub relocate the kernel to any address lower than lowest acceptable address. [ ardb: introduce efi_low_alloc_above() to reduce the scope of the change ] Signed-off-by: Kairui Song <kasong@redhat.com> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: https://lkml.kernel.org/r/20191029173755.27149-6-ardb@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-31Merge tag 'dmaengine-fix-5.4-rc6' of ↵Linus Torvalds
git://git.infradead.org/users/vkoul/slave-dma Pull dmaengine fixes from Vinod Koul: "A few fixes to the dmaengine drivers: - fix in sprd driver for link list and potential memory leak - tegra transfer failure fix - imx size check fix for script_number - xilinx fix for 64bit AXIDMA and control reg update - qcom bam dma resource leak fix - cppi slave transfer fix when idle" * tag 'dmaengine-fix-5.4-rc6' of git://git.infradead.org/users/vkoul/slave-dma: dmaengine: cppi41: Fix cppi41_dma_prep_slave_sg() when idle dmaengine: qcom: bam_dma: Fix resource leak dmaengine: sprd: Fix the possible memory leak issue dmaengine: xilinx_dma: Fix control reg update in vdma_channel_set_config dmaengine: xilinx_dma: Fix 64-bit simple AXIDMA transfer dmaengine: imx-sdma: fix size check for sdma script_number dmaengine: tegra210-adma: fix transfer failure dmaengine: sprd: Fix the link-list pointer register configuration issue
2019-10-30SUNRPC: Destroy the back channel when we destroy the host transportTrond Myklebust
When we're destroying the host transport mechanism, we should ensure that we do not leak memory by failing to release any back channel slots that might still exist. Reported-by: Neil Brown <neilb@suse.de> Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-29net/mlx5: Fix flow counter list auto bits structRoi Dayan
The union should contain the extended dest and counter list. Remove the resevered 0x40 bits which is redundant. This change doesn't break any functionally. Everything works today because the code in fs_cmd.c is using the correct structs if extended dest or the basic dest. Fixes: 1b115498598f ("net/mlx5: Introduce extended destination fields") Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-28net: fix sk_page_frag() recursion from memory reclaimTejun Heo
sk_page_frag() optimizes skb_frag allocations by using per-task skb_frag cache when it knows it's the only user. The condition is determined by seeing whether the socket allocation mask allows blocking - if the allocation may block, it obviously owns the task's context and ergo exclusively owns current->task_frag. Unfortunately, this misses recursion through memory reclaim path. Please take a look at the following backtrace. [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10 ... tcp_sendmsg+0x27/0x40 sock_sendmsg+0x30/0x40 sock_xmit.isra.24+0xa1/0x170 [nbd] nbd_send_cmd+0x1d2/0x690 [nbd] nbd_queue_rq+0x1b5/0x3b0 [nbd] __blk_mq_try_issue_directly+0x108/0x1b0 blk_mq_request_issue_directly+0xbd/0xe0 blk_mq_try_issue_list_directly+0x41/0xb0 blk_mq_sched_insert_requests+0xa2/0xe0 blk_mq_flush_plug_list+0x205/0x2a0 blk_flush_plug_list+0xc3/0xf0 [1] blk_finish_plug+0x21/0x2e _xfs_buf_ioapply+0x313/0x460 __xfs_buf_submit+0x67/0x220 xfs_buf_read_map+0x113/0x1a0 xfs_trans_read_buf_map+0xbf/0x330 xfs_btree_read_buf_block.constprop.42+0x95/0xd0 xfs_btree_lookup_get_block+0x95/0x170 xfs_btree_lookup+0xcc/0x470 xfs_bmap_del_extent_real+0x254/0x9a0 __xfs_bunmapi+0x45c/0xab0 xfs_bunmapi+0x15/0x30 xfs_itruncate_extents_flags+0xca/0x250 xfs_free_eofblocks+0x181/0x1e0 xfs_fs_destroy_inode+0xa8/0x1b0 destroy_inode+0x38/0x70 dispose_list+0x35/0x50 prune_icache_sb+0x52/0x70 super_cache_scan+0x120/0x1a0 do_shrink_slab+0x120/0x290 shrink_slab+0x216/0x2b0 shrink_node+0x1b6/0x4a0 do_try_to_free_pages+0xc6/0x370 try_to_free_mem_cgroup_pages+0xe3/0x1e0 try_charge+0x29e/0x790 mem_cgroup_charge_skmem+0x6a/0x100 __sk_mem_raise_allocated+0x18e/0x390 __sk_mem_schedule+0x2a/0x40 [0] tcp_sendmsg_locked+0x8eb/0xe10 tcp_sendmsg+0x27/0x40 sock_sendmsg+0x30/0x40 ___sys_sendmsg+0x26d/0x2b0 __sys_sendmsg+0x57/0xa0 do_syscall_64+0x42/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 In [0], tcp_send_msg_locked() was using current->page_frag when it called sk_wmem_schedule(). It already calculated how many bytes can be fit into current->page_frag. Due to memory pressure, sk_wmem_schedule() called into memory reclaim path which called into xfs and then IO issue path. Because the filesystem in question is backed by nbd, the control goes back into the tcp layer - back into tcp_sendmsg_locked(). nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes sense - it's in the process of freeing memory and wants to be able to, e.g., drop clean pages to make forward progress. However, this confused sk_page_frag() called from [2]. Because it only tests whether the allocation allows blocking which it does, it now thinks current->page_frag can be used again although it already was being used in [0]. After [2] used current->page_frag, the offset would be increased by the used amount. When the control returns to [0], current->page_frag's offset is increased and the previously calculated number of bytes now may overrun the end of allocated memory leading to silent memory corruptions. Fix it by adding gfpflags_normal_context() which tests sleepable && !reclaim and use it to determine whether to use current->task_frag. v2: Eric didn't like gfp flags being tested twice. Introduce a new helper gfpflags_normal_context() and combine the two tests. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28net: add skb_queue_empty_lockless()Eric Dumazet
Some paths call skb_queue_empty() without holding the queue lock. We must use a barrier in order to not let the compiler do strange things, and avoid KCSAN splats. Adding a barrier in skb_queue_empty() might be overkill, I prefer adding a new helper to clearly identify points where the callers might be lockless. This might help us finding real bugs. The corresponding WRITE_ONCE() should add zero cost for current compilers. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio fixes from Michael Tsirkin: "Some minor fixes" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vringh: fix copy direction of vringh_iov_push_kern() vsock/virtio: remove unused 'work' field from 'struct virtio_vsock_pkt' virtio_ring: fix stalls for packed rings
2019-10-28perf/headers: Fix spelling s/EACCESS/EACCES/, s/privilidge/privilege/Geert Uytterhoeven
As per POSIX, the correct spelling of the error code is EACCES: include/uapi/asm-generic/errno-base.h:#define EACCES 13 /* Permission denied */ Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Kosina <trivial@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: https://lkml.kernel.org/r/20191024122904.12463-1-geert+renesas@glider.be Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-28vsock/virtio: remove unused 'work' field from 'struct virtio_vsock_pkt'Stefano Garzarella
The 'work' field was introduced with commit 06a8fc78367d0 ("VSOCK: Introduce virtio_vsock_common.ko") but it is never used in the code, so we can remove it to save memory allocated in the per-packet 'struct virtio_vsock_pkt' Suggested-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2019-10-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-10-27 The following pull-request contains BPF updates for your *net* tree. We've added 7 non-merge commits during the last 11 day(s) which contain a total of 7 files changed, 66 insertions(+), 16 deletions(-). The main changes are: 1) Fix two use-after-free bugs in relation to RCU in jited symbol exposure to kallsyms, from Daniel Borkmann. 2) Fix NULL pointer dereference in AF_XDP rx-only sockets, from Magnus Karlsson. 3) Fix hang in netdev unregister for hash based devmap as well as another overflow bug on 32 bit archs in memlock cost calculation, from Toke Høiland-Jørgensen. 4) Fix wrong memory access in LWT BPF programs on reroute due to invalid dst. Also fix BPF selftests to use more compatible nc options, from Jiri Benc. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26Merge tag 'driver-core-5.4-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fix from Greg KH: "Here is a single sysfs fix for 5.4-rc5. It resolves an error if you actually try to use the __BIN_ATTR_WO() macro, seems I never tested it properly before :( This has been in linux-next for a while with no reported issues" * tag 'driver-core-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: sysfs: Fixes __BIN_ATTR_WO() macro
2019-10-25Merge tag 'modules-for-v5.4-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules fixes from Jessica Yu: - Revert __ksymtab_$namespace.$symbol naming scheme back to __ksymtab_$symbol, as it was causing issues with depmod. Instead, have modpost extract a symbol's namespace from __kstrtabns and __ksymtab_strings. - Fix `make nsdeps` for out of tree kernel builds (make O=...) caused by unescaped '/'. Use a different sed delimiter to avoid this problem. * tag 'modules-for-v5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: scripts/nsdeps: use alternative sed delimiter symbol namespaces: revert to previous __ksymtab name scheme modpost: make updating the symbol namespace explicit modpost: delegate updating namespaces to separate function
2019-10-24net: remove unnecessary variables and callbackTaehee Yoo
This patch removes variables and callback these are related to the nested device structure. devices that can be nested have their own nest_level variable that represents the depth of nested devices. In the previous patch, new {lower/upper}_level variables are added and they replace old private nest_level variable. So, this patch removes all 'nest_level' variables. In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added to get lockdep subclass value, which is actually lower nested depth value. But now, they use the dynamic lockdep key to avoid lockdep warning instead of the subclass. So, this patch removes ->ndo_get_lock_subclass() callback. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24net: core: add ignore flag to netdev_adjacent structureTaehee Yoo
In order to link an adjacent node, netdev_upper_dev_link() is used and in order to unlink an adjacent node, netdev_upper_dev_unlink() is used. unlink operation does not fail, but link operation can fail. In order to exchange adjacent nodes, we should unlink an old adjacent node first. then, link a new adjacent node. If link operation is failed, we should link an old adjacent node again. But this link operation can fail too. It eventually breaks the adjacent link relationship. This patch adds an ignore flag into the netdev_adjacent structure. If this flag is set, netdev_upper_dev_link() ignores an old adjacent node for a moment. This patch also adds new functions for other modules. netdev_adjacent_change_prepare() netdev_adjacent_change_commit() netdev_adjacent_change_abort() netdev_adjacent_change_prepare() inserts new device into adjacent list but new device is not allowed to use immediately. If netdev_adjacent_change_prepare() fails, it internally rollbacks adjacent list so that we don't need any other action. netdev_adjacent_change_commit() deletes old device in the adjacent list and allows new device to use. netdev_adjacent_change_abort() rollbacks adjacent list. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24team: fix nested locking lockdep warningTaehee Yoo
team interface could be nested and it's lock variable could be nested too. But this lock uses static lockdep key and there is no nested locking handling code such as mutex_lock_nested() and so on. so the Lockdep would warn about the circular locking scenario that couldn't happen. In order to fix, this patch makes the team module to use dynamic lock key instead of static key. Test commands: ip link add team0 type team ip link add team1 type team ip link set team0 master team1 ip link set team0 nomaster ip link set team1 master team0 ip link set team1 nomaster Splat that looks like: [ 40.364352] WARNING: possible recursive locking detected [ 40.364964] 5.4.0-rc3+ #96 Not tainted [ 40.365405] -------------------------------------------- [ 40.365973] ip/750 is trying to acquire lock: [ 40.366542] ffff888060b34c40 (&team->lock){+.+.}, at: team_set_mac_address+0x151/0x290 [team] [ 40.367689] but task is already holding lock: [ 40.368729] ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team] [ 40.370280] other info that might help us debug this: [ 40.371159] Possible unsafe locking scenario: [ 40.371942] CPU0 [ 40.372338] ---- [ 40.372673] lock(&team->lock); [ 40.373115] lock(&team->lock); [ 40.373549] *** DEADLOCK *** [ 40.374432] May be due to missing lock nesting notation [ 40.375338] 2 locks held by ip/750: [ 40.375851] #0: ffffffffabcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0 [ 40.376927] #1: ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team] [ 40.377989] stack backtrace: [ 40.378650] CPU: 0 PID: 750 Comm: ip Not tainted 5.4.0-rc3+ #96 [ 40.379368] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 40.380574] Call Trace: [ 40.381208] dump_stack+0x7c/0xbb [ 40.381959] __lock_acquire+0x269d/0x3de0 [ 40.382817] ? register_lock_class+0x14d0/0x14d0 [ 40.383784] ? check_chain_key+0x236/0x5d0 [ 40.384518] lock_acquire+0x164/0x3b0 [ 40.385074] ? team_set_mac_address+0x151/0x290 [team] [ 40.385805] __mutex_lock+0x14d/0x14c0 [ 40.386371] ? team_set_mac_address+0x151/0x290 [team] [ 40.387038] ? team_set_mac_address+0x151/0x290 [team] [ 40.387632] ? mutex_lock_io_nested+0x1380/0x1380 [ 40.388245] ? team_del_slave+0x60/0x60 [team] [ 40.388752] ? rcu_read_lock_sched_held+0x90/0xc0 [ 40.389304] ? rcu_read_lock_bh_held+0xa0/0xa0 [ 40.389819] ? lock_acquire+0x164/0x3b0 [ 40.390285] ? lockdep_rtnl_is_held+0x16/0x20 [ 40.390797] ? team_port_get_rtnl+0x90/0xe0 [team] [ 40.391353] ? __module_text_address+0x13/0x140 [ 40.391886] ? team_set_mac_address+0x151/0x290 [team] [ 40.392547] team_set_mac_address+0x151/0x290 [team] [ 40.393111] dev_set_mac_address+0x1f0/0x3f0 [ ... ] Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24net: core: add generic lockdep keysTaehee Yoo
Some interface types could be nested. (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..) These interface types should set lockdep class because, without lockdep class key, lockdep always warn about unexisting circular locking. In the current code, these interfaces have their own lockdep class keys and these manage itself. So that there are so many duplicate code around the /driver/net and /net/. This patch adds new generic lockdep keys and some helper functions for it. This patch does below changes. a) Add lockdep class keys in struct net_device - qdisc_running, xmit, addr_list, qdisc_busylock - these keys are used as dynamic lockdep key. b) When net_device is being allocated, lockdep keys are registered. - alloc_netdev_mqs() c) When net_device is being free'd llockdep keys are unregistered. - free_netdev() d) Add generic lockdep key helper function - netdev_register_lockdep_key() - netdev_unregister_lockdep_key() - netdev_update_lockdep_key() e) Remove unnecessary generic lockdep macro and functions f) Remove unnecessary lockdep code of each interfaces. After this patch, each interface modules don't need to maintain their lockdep keys. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24net: core: limit nested device depthTaehee Yoo
Current code doesn't limit the number of nested devices. Nested devices would be handled recursively and this needs huge stack memory. So, unlimited nested devices could make stack overflow. This patch adds upper_level and lower_level, they are common variables and represent maximum lower/upper depth. When upper/lower device is attached or dettached, {lower/upper}_level are updated. and if maximum depth is bigger than 8, attach routine fails and returns -EMLINK. In addition, this patch converts recursive routine of netdev_walk_all_{lower/upper} to iterator routine. Test commands: ip link add dummy0 type dummy ip link add link dummy0 name vlan1 type vlan id 1 ip link set vlan1 up for i in {2..55} do let A=$i-1 ip link add vlan$i link vlan$A type vlan id $i done ip link del dummy0 Splat looks like: [ 155.513226][ T908] BUG: KASAN: use-after-free in __unwind_start+0x71/0x850 [ 155.514162][ T908] Write of size 88 at addr ffff8880608a6cc0 by task ip/908 [ 155.515048][ T908] [ 155.515333][ T908] CPU: 0 PID: 908 Comm: ip Not tainted 5.4.0-rc3+ #96 [ 155.516147][ T908] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 155.517233][ T908] Call Trace: [ 155.517627][ T908] [ 155.517918][ T908] Allocated by task 0: [ 155.518412][ T908] (stack is not available) [ 155.518955][ T908] [ 155.519228][ T908] Freed by task 0: [ 155.519885][ T908] (stack is not available) [ 155.520452][ T908] [ 155.520729][ T908] The buggy address belongs to the object at ffff8880608a6ac0 [ 155.520729][ T908] which belongs to the cache names_cache of size 4096 [ 155.522387][ T908] The buggy address is located 512 bytes inside of [ 155.522387][ T908] 4096-byte region [ffff8880608a6ac0, ffff8880608a7ac0) [ 155.523920][ T908] The buggy address belongs to the page: [ 155.524552][ T908] page:ffffea0001822800 refcount:1 mapcount:0 mapping:ffff88806c657cc0 index:0x0 compound_mapcount:0 [ 155.525836][ T908] flags: 0x100000000010200(slab|head) [ 155.526445][ T908] raw: 0100000000010200 ffffea0001813808 ffffea0001a26c08 ffff88806c657cc0 [ 155.527424][ T908] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000 [ 155.528429][ T908] page dumped because: kasan: bad access detected [ 155.529158][ T908] [ 155.529410][ T908] Memory state around the buggy address: [ 155.530060][ T908] ffff8880608a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 155.530971][ T908] ffff8880608a6c00: fb fb fb fb fb f1 f1 f1 f1 00 f2 f2 f2 f3 f3 f3 [ 155.531889][ T908] >ffff8880608a6c80: f3 fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 155.532806][ T908] ^ [ 155.533509][ T908] ffff8880608a6d00: fb fb fb fb fb fb fb fb fb f1 f1 f1 f1 00 00 00 [ 155.534436][ T908] ffff8880608a6d80: f2 f3 f3 f3 f3 fb fb fb 00 00 00 00 00 00 00 00 [ ... ] Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-23net/flow_dissector: switch to siphashEric Dumazet
UDP IPv6 packets auto flowlabels are using a 32bit secret (static u32 hashrnd in net/core/flow_dissector.c) and apply jhash() over fields known by the receivers. Attackers can easily infer the 32bit secret and use this information to identify a device and/or user, since this 32bit secret is only set at boot time. Really, using jhash() to generate cookies sent on the wire is a serious security concern. Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be a dead end. Trying to periodically change the secret (like in sch_sfq.c) could change paths taken in the network for long lived flows. Let's switch to siphash, as we did in commit df453700e8d8 ("inet: switch IP ID generator to siphash") Using a cryptographically strong pseudo random function will solve this privacy issue and more generally remove other weak points in the stack. Packet schedulers using skb_get_hash_perturb() benefit from this change. Fixes: b56774163f99 ("ipv6: Enable auto flow labels by default") Fixes: 42240901f7c4 ("ipv6: Implement different admin modes for automatic flow labels") Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel") Fixes: cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Berger <jonathann1@walla.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-22dynamic_debug: provide dynamic_hex_dump stubArnd Bergmann
The ionic driver started using dymamic_hex_dump(), but that is not always defined: drivers/net/ethernet/pensando/ionic/ionic_main.c:229:2: error: implicit declaration of function 'dynamic_hex_dump' [-Werror,-Wimplicit-function-declaration] Add a dummy implementation to use when CONFIG_DYNAMIC_DEBUG is disabled, printing nothing. Fixes: 938962d55229 ("ionic: Add adminq action") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22bpf: Fix use after free in subprog's jited symbol removalDaniel Borkmann
syzkaller managed to trigger the following crash: [...] BUG: unable to handle page fault for address: ffffc90001923030 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD aa551067 P4D aa551067 PUD aa552067 PMD a572b067 PTE 80000000a1173163 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 7982 Comm: syz-executor912 Not tainted 5.4.0-rc3+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:bpf_jit_binary_hdr include/linux/filter.h:787 [inline] RIP: 0010:bpf_get_prog_addr_region kernel/bpf/core.c:531 [inline] RIP: 0010:bpf_tree_comp kernel/bpf/core.c:600 [inline] RIP: 0010:__lt_find include/linux/rbtree_latch.h:115 [inline] RIP: 0010:latch_tree_find include/linux/rbtree_latch.h:208 [inline] RIP: 0010:bpf_prog_kallsyms_find kernel/bpf/core.c:674 [inline] RIP: 0010:is_bpf_text_address+0x184/0x3b0 kernel/bpf/core.c:709 [...] Call Trace: kernel_text_address kernel/extable.c:147 [inline] __kernel_text_address+0x9a/0x110 kernel/extable.c:102 unwind_get_return_address+0x4c/0x90 arch/x86/kernel/unwind_frame.c:19 arch_stack_walk+0x98/0xe0 arch/x86/kernel/stacktrace.c:26 stack_trace_save+0xb6/0x150 kernel/stacktrace.c:123 save_stack mm/kasan/common.c:69 [inline] set_track mm/kasan/common.c:77 [inline] __kasan_kmalloc+0x11c/0x1b0 mm/kasan/common.c:510 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:518 slab_post_alloc_hook mm/slab.h:584 [inline] slab_alloc mm/slab.c:3319 [inline] kmem_cache_alloc+0x1f5/0x2e0 mm/slab.c:3483 getname_flags+0xba/0x640 fs/namei.c:138 getname+0x19/0x20 fs/namei.c:209 do_sys_open+0x261/0x560 fs/open.c:1091 __do_sys_open fs/open.c:1115 [inline] __se_sys_open fs/open.c:1110 [inline] __x64_sys_open+0x87/0x90 fs/open.c:1110 do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe [...] After further debugging it turns out that we walk kallsyms while in parallel we tear down a BPF program which contains subprograms that have been JITed though the program itself has not been fully exposed and is eventually bailing out with error. The bpf_prog_kallsyms_del_subprogs() in bpf_prog_load()'s error path removes the symbols, however, bpf_prog_free() tears down the JIT memory too early via scheduled work. Instead, it needs to properly respect RCU grace period as the kallsyms walk for BPF is under RCU. Fix it by refactoring __bpf_prog_put()'s tear down and reuse it in our error path where we defer final destruction when we have subprogs in the program. Fixes: 7d1982b4e335 ("bpf: fix panic in prog load calls cleanup") Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs") Reported-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Tested-by: syzbot+710043c5d1d5b5013bc7@syzkaller.appspotmail.com Link: https://lore.kernel.org/bpf/55f6367324c2d7e9583fa9ccf5385dcbba0d7a6e.1571752452.git.daniel@iogearbox.net
2019-10-21PM: QoS: Drop frequency QoS types from device PM QoSRafael J. Wysocki
There are no more active users of DEV_PM_QOS_MIN_FREQUENCY and DEV_PM_QOS_MAX_FREQUENCY device PM QoS request types, so drop them along with the code supporting them. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-10-21cpufreq: Use per-policy frequency QoSRafael J. Wysocki
Replace the CPU device PM QoS used for the management of min and max frequency constraints in cpufreq (and its users) with per-policy frequency QoS to avoid problems with cpufreq policies covering more then one CPU. Namely, a cpufreq driver is registered with the subsys interface which calls cpufreq_add_dev() for each CPU, starting from CPU0, so currently the PM QoS notifiers are added to the first CPU in the policy (i.e. CPU0 in the majority of cases). In turn, when the cpufreq driver is unregistered, the subsys interface doing that calls cpufreq_remove_dev() for each CPU, starting from CPU0, and the PM QoS notifiers are only removed when cpufreq_remove_dev() is called for the last CPU in the policy, say CPUx, which as a rule is not CPU0 if the policy covers more than one CPU. Then, the PM QoS notifiers cannot be removed, because CPUx does not have them, and they are still there in the device PM QoS notifiers list of CPU0, which prevents new PM QoS notifiers from being registered for CPU0 on the next attempt to register the cpufreq driver. The same issue occurs when the first CPU in the policy goes offline before unregistering the driver. After this change it does not matter which CPU is the policy CPU at the driver registration time and whether or not it is online all the time, because the frequency QoS is per policy and not per CPU. Fixes: 67d874c3b2c6 ("cpufreq: Register notifiers with the PM QoS framework") Reported-by: Dmitry Osipenko <digetx@gmail.com> Tested-by: Dmitry Osipenko <digetx@gmail.com> Reported-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Diagnosed-by: Viresh Kumar <viresh.kumar@linaro.org> Link: https://lore.kernel.org/linux-pm/5ad2624194baa2f53acc1f1e627eb7684c577a19.1562210705.git.viresh.kumar@linaro.org/T/#md2d89e95906b8c91c15f582146173dce2e86e99f Link: https://lore.kernel.org/linux-pm/20191017094612.6tbkwoq4harsjcqv@vireshk-i7/T/#m30d48cc23b9a80467fbaa16e30f90b3828a5a29b Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-10-21PM: QoS: Introduce frequency QoSRafael J. Wysocki
Introduce frequency QoS, based on the "raw" low-level PM QoS, to represent min and max frequency requests and aggregate constraints. The min and max frequency requests are to be represented by struct freq_qos_request objects and the aggregate constraints are to be represented by struct freq_constraints objects. The latter are expected to be initialized with the help of freq_constraints_init(). The freq_qos_read_value() helper is defined to retrieve the aggregate constraints values from a given struct freq_constraints object and there are the freq_qos_add_request(), freq_qos_update_request() and freq_qos_remove_request() helpers to manipulate the min and max frequency requests. It is assumed that the the helpers will not run concurrently with each other for the same struct freq_qos_request object, so if that may be the case, their uses must ensure proper synchronization between them (e.g. through locking). In addition, freq_qos_add_notifier() and freq_qos_remove_notifier() are provided to add and remove notifiers that will trigger on aggregate constraint changes to and from a given struct freq_constraints object, respectively. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
2019-10-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: "I was battling a cold after some recent trips, so quite a bit piled up meanwhile, sorry about that. Highlights: 1) Fix fd leak in various bpf selftests, from Brian Vazquez. 2) Fix crash in xsk when device doesn't support some methods, from Magnus Karlsson. 3) Fix various leaks and use-after-free in rxrpc, from David Howells. 4) Fix several SKB leaks due to confusion of who owns an SKB and who should release it in the llc code. From Eric Biggers. 5) Kill a bunc of KCSAN warnings in TCP, from Eric Dumazet. 6) Jumbo packets don't work after resume on r8169, as the BIOS resets the chip into non-jumbo mode during suspend. From Heiner Kallweit. 7) Corrupt L2 header during MPLS push, from Davide Caratti. 8) Prevent possible infinite loop in tc_ctl_action, from Eric Dumazet. 9) Get register bits right in bcmgenet driver, based upon chip version. From Florian Fainelli. 10) Fix mutex problems in microchip DSA driver, from Marek Vasut. 11) Cure race between route lookup and invalidation in ipv4, from Wei Wang. 12) Fix performance regression due to false sharing in 'net' structure, from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits) net: reorder 'struct net' fields to avoid false sharing net: dsa: fix switch tree list net: ethernet: dwmac-sun8i: show message only when switching to promisc net: aquantia: add an error handling in aq_nic_set_multicast_list net: netem: correct the parent's backlog when corrupted packet was dropped net: netem: fix error path for corrupted GSO frames macb: propagate errors when getting optional clocks xen/netback: fix error path of xenvif_connect_data() net: hns3: fix mis-counting IRQ vector numbers issue net: usb: lan78xx: Connect PHY before registering MAC vsock/virtio: discard packets if credit is not respected vsock/virtio: send a credit update when buffer size is changed mlxsw: spectrum_trap: Push Ethernet header before reporting trap net: ensure correct skb->tstamp in various fragmenters net: bcmgenet: reset 40nm EPHY on energy detect net: bcmgenet: soft reset 40nm EPHYs before MAC init net: phy: bcm7xxx: define soft_reset for 40nm EPHY net: bcmgenet: don't set phydev->link from MAC net: Update address for MediaTek ethernet driver in MAINTAINERS ipv4: fix race condition between route lookup and invalidation ...
2019-10-18symbol namespaces: revert to previous __ksymtab name schemeMatthias Maennich
The introduction of Symbol Namespaces changed the naming schema of the __ksymtab entries from __kysmtab__symbol to __ksymtab_NAMESPACE.symbol. That caused some breakages in tools that depend on the name layout in either the binaries(vmlinux,*.ko) or in System.map. E.g. kmod's depmod would not be able to read System.map without a patch to support symbol namespaces. A warning reported by depmod for namespaced symbols would look like depmod: WARNING: [...]/uas.ko needs unknown symbol usb_stor_adjust_quirks In order to address this issue, revert to the original naming scheme and rather read the __kstrtabns_<symbol> entries and their corresponding values from __ksymtab_strings to update the namespace values for symbols. After having read all symbols and handled them in handle_modversions(), the symbols are created. In a second pass, read the __kstrtabns_ entries and update the namespaces accordingly. Fixes: 8651ec01daed ("module: add support for symbol namespaces.") Reported-by: Stefan Wahren <stefan.wahren@i2se.com> Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Matthias Maennich <maennich@google.com> Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-10-17Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The main thing here is a long-awaited workaround for a CPU erratum on ThunderX2 which we have developed in conjunction with engineers from Cavium/Marvell. At the moment, the workaround is unconditionally enabled for affected CPUs at runtime but we may add a command-line option to disable it in future if performance numbers show up indicating a significant cost for real workloads. Summary: - Work around Cavium/Marvell ThunderX2 erratum #219 - Fix regression in mlock() ABI caused by sign-extension of TTBR1 addresses - More fixes to the spurious kernel fault detection logic - Fix pathological preemption race when enabling some CPU features at boot - Drop broken kcore macros in favour of generic implementations - Fix userspace view of ID_AA64ZFR0_EL1 when SVE is disabled - Avoid NULL dereference on allocation failure during hibernation" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: tags: Preserve tags for addresses translated via TTBR1 arm64: mm: fix inverted PAR_EL1.F check arm64: sysreg: fix incorrect definition of SYS_PAR_EL1_F arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled arm64: hibernate: check pgd table allocation arm64: cpufeature: Treat ID_AA64ZFR0_EL1 as RAZ when SVE is not enabled arm64: Fix kcore macros after 52-bit virtual addressing fallout arm64: Allow CAVIUM_TX2_ERRATUM_219 to be selected arm64: Avoid Cavium TX2 erratum 219 when switching TTBR arm64: Enable workaround for Cavium TX2 erratum 219 when running SMT arm64: KVM: Trap VM ops when ARM64_WORKAROUND_CAVIUM_TX2_219_TVM is set
2019-10-17net: phy: micrel: Update KSZ87xx PHY nameMarek Vasut
The KSZ8795 PHY ID is in fact used by KSZ8794/KSZ8795/KSZ8765 switches. Update the PHY ID and name to reflect that, as this family of switches is commonly refered to as KSZ87xx Signed-off-by: Marek Vasut <marex@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: David S. Miller <davem@davemloft.net> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: George McCollister <george.mccollister@gmail.com> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Sean Nyekjaer <sean.nyekjaer@prevas.dk> Cc: Tristram Ha <Tristram.Ha@microchip.com> Cc: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-17Merge branch 'errata/tx2-219' into for-next/fixesWill Deacon
Workaround for Cavium/Marvell ThunderX2 erratum #219. * errata/tx2-219: arm64: Allow CAVIUM_TX2_ERRATUM_219 to be selected arm64: Avoid Cavium TX2 erratum 219 when switching TTBR arm64: Enable workaround for Cavium TX2 erratum 219 when running SMT arm64: KVM: Trap VM ops when ARM64_WORKAROUND_CAVIUM_TX2_219_TVM is set
2019-10-17Merge tag 'gpio-v5.4-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO fixes from Linus Walleij: "The fixes pertain to a problem with initializing the Intel GPIO irqchips when adding gpiochips. Andy fixed it up elegantly by adding a hardware initialization callback to the struct gpio_irq_chip so let's use this. Tested and verified on the target hardware" * tag 'gpio-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpio: lynxpoint: set default handler to be handle_bad_irq() gpio: merrifield: Move hardware initialization to callback gpio: lynxpoint: Move hardware initialization to callback gpio: intel-mid: Move hardware initialization to callback gpiolib: Initialize the hardware with a callback gpio: merrifield: Restore use of irq_base
2019-10-16arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabledJulien Thierry
Preempting from IRQ-return means that the task has its PSTATE saved on the stack, which will get restored when the task is resumed and does the actual IRQ return. However, enabling some CPU features requires modifying the PSTATE. This means that, if a task was scheduled out during an IRQ-return before all CPU features are enabled, the task might restore a PSTATE that does not include the feature enablement changes once scheduled back in. * Task 1: PAN == 0 ---| |--------------- | |<- return from IRQ, PSTATE.PAN = 0 | <- IRQ | +--------+ <- preempt() +-- ^ | reschedule Task 1, PSTATE.PAN == 1 * Init: --------------------+------------------------ ^ | enable_cpu_features set PSTATE.PAN on all CPUs Worse than this, since PSTATE is untouched when task switching is done, a task missing the new bits in PSTATE might affect another task, if both do direct calls to schedule() (outside of IRQ/exception contexts). Fix this by preventing preemption on IRQ-return until features are enabled on all CPUs. This way the only PSTATE values that are saved on the stack are from synchronous exceptions. These are expected to be fatal this early, the exception is BRK for WARN_ON(), but as this uses do_debug_exception() which keeps IRQs masked, it shouldn't call schedule(). Signed-off-by: Julien Thierry <julien.thierry@arm.com> [james: Replaced a really cool hack, with an even simpler static key in C. expanded commit message with Julien's cover-letter ascii art] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2019-10-15net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actionsDavide Caratti
the following script: # tc qdisc add dev eth0 clsact # tc filter add dev eth0 egress protocol ip matchall \ > action mpls push protocol mpls_uc label 0x355aa bos 1 causes corruption of all IP packets transmitted by eth0. On TC egress, we can't rely on the value of skb->mac_len, because it's 0 and a MPLS 'push' operation will result in an overwrite of the first 4 octets in the packet L2 header (e.g. the Destination Address if eth0 is an Ethernet); the same error pattern is present also in the MPLS 'pop' operation. Fix this error in act_mpls data plane, computing 'mac_len' as the difference between the network header and the mac header (when not at TC ingress), and use it in MPLS 'push'/'pop' core functions. v2: unbreak 'make htmldocs' because of missing documentation of 'mac_len' in skb_mpls_pop(), reported by kbuild test robot CC: Lorenzo Bianconi <lorenzo@kernel.org> Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: John Hurley <john.hurley@netronome.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-15gpiolib: Initialize the hardware with a callbackAndy Shevchenko
After changing the drivers to use GPIO core to add an IRQ chip it appears that some of them requires a hardware initialization before adding the IRQ chip. Add an optional callback ->init_hw() to allow that drivers to initialize hardware if needed. This change is a part of the fix NULL pointer dereference brought to the several drivers recently. Cc: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-10-14xarray.h: fix kernel-doc warningRandy Dunlap
Fix (Sphinx) kernel-doc warning in <linux/xarray.h>: include/linux/xarray.h:232: WARNING: Unexpected indentation. Link: http://lkml.kernel.org/r/89ba2134-ce23-7c10-5ee1-ef83b35aa984@infradead.org Fixes: a3e4d3f97ec8 ("XArray: Redesign xa_alloc API") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14bitmap.h: fix kernel-doc warning and typoRandy Dunlap
Fix kernel-doc warning in <linux/bitmap.h>: include/linux/bitmap.h:341: warning: Function parameter or member 'nbits' not described in 'bitmap_or_equal' Also fix small typo (bitnaps). Link: http://lkml.kernel.org/r/0729ea7a-2c0d-b2c5-7dd3-3629ee0803e2@infradead.org Fixes: b9fa6442f704 ("cpumask: Implement cpumask_or_equal()") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>