summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2023-05-13mm: Count requests to free & nr freed per shrinkerKent Overstreet
The next step in this patch series for improving debugging of shrinker related issues: keep counts of number of objects we request to free vs. actually freed, and prints them in shrinker_to_text(). Shrinkers won't necessarily free all objects requested for a variety of reasons, but if the two counts are wildly different something is likely amiss. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-05-13mm: Add a .to_text() method for shrinkersKent Overstreet
This adds a new callback method to shrinkers which they can use to describe anything relevant to memory reclaim about their internal state, for example object dirtyness. This uses the new printbufs to output to heap allocated strings, so that the .to_text() methods can be used both for messages logged to the console, and also sysfs/debugfs. This patch also adds shrinkers_to_text(), which reports on the top 10 shrinkers - by object count - in sorted order, to be used in OOM reporting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-13seq_buf: seq_buf_human_readable_u64()Kent Overstreet
This adds a seq_buf wrapper for string_get_size(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-13iomap: add nosubmit flag to skip data I/O on iomap mappingBrian Foster
Implement a quick and dirty hack to skip data I/O submission on a specified mapping. The iomap layer will still perform every step up through constructing the bio as if it will be submitted, but instead invokes completion on the bio directly from submit context. The purpose of this is to facilitate filesystem metadata performance testing without the overhead of actual data I/O. Note that this may be dangerous in current form in that folios are not explicitly zeroed where they otherwise wouldn't be, so whatever previous data exists in a folio prior to being added to a read bio is mapped into pagecache for the file. Signed-off-by: Brian Foster <bfoster@redhat.com>
2023-05-13vfs: inode cache conversion to hash-blDave Chinner
Because scalability of the global inode_hash_lock really, really sucks. 32-way concurrent create on a couple of different filesystems before: - 52.13% 0.04% [kernel] [k] ext4_create - 52.09% ext4_create - 41.03% __ext4_new_inode - 29.92% insert_inode_locked - 25.35% _raw_spin_lock - do_raw_spin_lock - 24.97% __pv_queued_spin_lock_slowpath - 72.33% 0.02% [kernel] [k] do_filp_open - 72.31% do_filp_open - 72.28% path_openat - 57.03% bch2_create - 56.46% __bch2_create - 40.43% inode_insert5 - 36.07% _raw_spin_lock - do_raw_spin_lock 35.86% __pv_queued_spin_lock_slowpath 4.02% find_inode Convert the inode hash table to a RCU-aware hash-bl table just like the dentry cache. Note that we need to store a pointer to the hlist_bl_head the inode has been added to in the inode so that when it comes to unhash the inode we know what list to lock. We need to do this because the hash value that is used to hash the inode is generated from the inode itself - filesystems can provide this themselves so we have to either store the hash or the head pointer in the inode to be able to find the right list head for removal... Same workload after: Signed-off-by: Dave Chinner <dchinner@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org
2023-05-13hlist-bl: add hlist_bl_fake()Dave Chinner
in preparation for switching the VFS inode cache over the hlist_bl lists, we nee dto be able to fake a list node that looks like it is hased for correct operation of filesystems that don't directly use the VFS indoe cache. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2023-05-13Increase MAX_LOCK_DEPTH, bcachefs BTREE_ITER_MAX (do not upstream)Kent Overstreet
2023-05-12six locks: Improved optimistic spinningKent Overstreet
This adds a threshold for the maximum spin time, similar to the rwsem code, and a flag to the lock itself indicating when we've spun too long so other threads also refrain from spinning. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-12six locks: Expose tracepoint IPKent Overstreet
This adds _ip variations of the various lock functions that allow an IP to be passed in, which is used by lockstat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-12six locks: Wakeup now takes lock on behalf of waiterKent Overstreet
This brings back an important optimization, to avoid touching the wait lists an extra time, while preserving the property that a thread is on a lock waitlist iff it is waiting - it is never removed from the waitlist until it has the lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-12six locks: Add start_time to six_lock_waiterKent Overstreet
This is needed by the cycle detector in bcachefs - we need a way to iterater over waitlist entries while dropping and retaking the waitlist lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-12six locks: six_lock_waiter()Kent Overstreet
This allows passing in the wait list entry - to be used for a deadlock cycle detector. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-12six locks: Simplify wait listsKent Overstreet
This switches to a single list of waiters, instead of separate lists for read and intent, and switches write locks to also use the wait lists instead of being handled differently. Also, removal from the wait list is now done by the process waiting on the lock, not the process doing the wakeup. This is needed for the new deadlock cycle detector - we need tasks to stay on the waitlist until they've successfully acquired the lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-12six locks: Delete six_lock_pcpu_free_rcu()Kent Overstreet
Didn't have any users, and wasn't a good idea to begin with - delete it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-12six locks: Improve six_lock_countKent Overstreet
six_lock_count now counts up whether a write lock held, and this patch now also correctly counts six_lock->intent_lock_recurse. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-05-12bcachefs: Update export_operations for snapshotsKent Overstreet
When support for snapshots was merged, export operations weren't updated yet. This patch adds new filehandle types for bcachefs that include the subvolume ID and updates export operations for subvolumes - and also .get_parent, support for which was added just prior to snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-05-12lib: add mean and variance module.Daniel Hill
This module provides a fast 64bit implementation of basic statistics functions, including mean, variance and standard deviation in both weighted and unweighted variants, the unweighted variant has a 32bit limitation per sample to prevent overflow when squaring. Signed-off-by: Daniel Hill <daniel@gluo.nz>
2023-05-12lib/string_helpers: string_get_size() now returns characters wroteKent Overstreet
printbuf now needs to know the number of characters that would have been written if the buffer was too small, like snprintf(); this changes string_get_size() to return the the return value of snprintf(). Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-05-12lib/generic-radix-tree.c: Add peek_prev()Kent Overstreet
This patch adds genradix_peek_prev(), genradix_iter_rewind(), and genradix_for_each_reverse(), for iterating backwards over a generic radix tree. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-05-12lib/generic-radix-tree.c: Add a missing includeKent Overstreet
We now need linux/limits.h for SIZE_MAX. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-05-12lib/generic-radix-tree.c: Don't overflow in peek()Kent Overstreet
When we started spreading new inode numbers throughout most of the 64 bit inode space, that triggered some corner case bugs, in particular some integer overflows related to the radix tree code. Oops. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-05-12iov_iter: copy_folio_from_iter_atomic()Kent Overstreet
Add a foliated version of copy_page_from_iter_atomic() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org>
2023-05-12closures: closure_nr_remaining()Kent Overstreet
Factor out a new helper, which returns the number of events outstanding. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-12closures: closure_wait_event()Kent Overstreet
Like wait_event() - except, because it uses closures and closure waitlists it doesn't have the restriction on modifying task state inside the condition check, like wait_event() does. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Acked-by: Coly Li <colyli@suse.de>
2023-05-12bcache: move closures to lib/Kent Overstreet
Prep work for bcachefs - being a fork of bcache it also uses closures Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Coly Li <colyli@suse.de>
2023-05-12block: Don't block on s_umount from __invalidate_super()Kent Overstreet
__invalidate_super() is used to flush any filesystem mounted on a device, generally on some sort of media change event. However, when unmounting a filesystem and closing the underlying block devices, we can deadlock if the block driver then calls __invalidate_device() (e.g. because the block device goes away when it is no longer in use). This happens with bcachefs on top of loopback, and can be triggered by fstests generic/042: put_super -> blkdev_put -> lo_release -> disk_force_media_change -> __invalidate_device -> get_super This isn't inherently specific to bcachefs - it hasn't shown up with other filesystems before because most other filesystems use the sget() mechanism for opening/closing block devices (and enforcing exclusion), however sget() has its own downsides and weird/sketchy behaviour w.r.t. block device open lifetime - if that ever gets fixed more code will run into this issue. The __invalidate_device() call here is really a best effort "I just yanked the device for a mounted filesystem, please try not to lose my data" - if it's ever actually needed the user has already done something crazy, and we probably shouldn't make things worse by deadlocking. Switching to a trylock seems in keeping with what the code is trying to do. If we ever get revoke() at the block layer, perhaps we would look at rearchitecting to use that instead. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-05-12block: Rework bio_for_each_folio_all()Kent Overstreet
This reimplements bio_for_each_folio_all() on top of the newly-reworked bvec_iter_all, and since it's now trivial we also provide bio_for_each_folio. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Matthew Wilcox <willy@infradead.org> Cc: linux-block@vger.kernel.org
2023-05-12block: Rework bio_for_each_segment_all()Kent Overstreet
This patch reworks bio_for_each_segment_all() to be more inline with how the other bio iterators work: - bio_iter_all_peek() now returns a synthesized bio_vec; we don't stash one in the iterator and pass a pointer to it - bad. This way makes it clearer what's a constructed value vs. a reference to something pre-existing, and it also will help with cleaning up and consolidating code with bio_for_each_folio_all(). - We now provide bio_for_each_segment_all_continue(), for squashfs: this makes their code clearer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Ming Lei <ming.lei@redhat.com> Cc: Phillip Lougher <phillip@squashfs.org.uk>
2023-05-12block: Bring back zero_fill_bio_iterKent Overstreet
This reverts the commit that deleted it; it's used by bcachefs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org
2023-05-12block: Add some exports for bcachefsKent Overstreet
- bio_set_pages_dirty(), bio_check_pages_dirty() - dio path - blk_status_to_str() - error messages - bio_add_folio() - this should definitely be exported for everyone, it's the modern version of bio_add_page() Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: linux-block@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk>
2023-05-12fs: factor out d_mark_tmpfile()Kent Overstreet
New helper for bcachefs - bcachefs doesn't want the inode_dec_link_count() call that d_tmpfile does, it handles i_nlink on its own atomically with other btree updates Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: linux-fsdevel@vger.kernel.org
2023-05-12mm: Bring back vmalloc_execKent Overstreet
This reverts 7a0e27b2a0ce ("mm: remove vmalloc_exec"), which folded it into the only caller - but bcachefs needs it, so this patch brings it back. We're also exporting it now. (bcachefs does dynamic code generation for per btree node unpack functions). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: linux-mm@kvack.org
2023-05-12sched: Add task_struct->faults_disabled_mappingKent Overstreet
This is used by bcachefs to fix a page cache coherency issue with O_DIRECT writes. Also relevant: mapping->invalidate_lock, see below. O_DIRECT writes (and other filesystem operations that modify file data while bypassing the page cache) need to shoot down ranges of the page cache - and additionally, need locking to prevent those pages from pulled back in. But O_DIRECT writes invoke the page fault handler (via get_user_pages), and the page fault handler will need to take that same lock - this is a classic recursive deadlock if userspace has mmaped the file they're DIO writing to and uses those pages for the buffer to write from, and it's a lock ordering deadlock in general. Thus we need a way to signal from the dio code to the page fault handler when we already are holding the pagecache add lock on an address space - this patch just adds a member to task_struct for this purpose. For now only bcachefs is implementing this locking, though it may be moved out of bcachefs and made available to other filesystems in the future. --------------------------------- The closest current VFS equivalent is mapping->invalidate_lock, which comes from XFS. However, it's not used by direct IO. Instead, direct IO paths shoot down the page cache twice - before starting the IO and at the end, and they're still technically racy w.r.t. page cache coherency. This is a more complete approach: in the future we might consider replacing mapping->invalidate_lock with the bcachefs code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Jan Kara <jack@suse.cz> Cc: Darrick J. Wong <djwong@kernel.org> Cc: linux-fsdevel@vger.kernel.org
2023-05-12locking: SIX locks (shared/intent/exclusive)Kent Overstreet
New lock for bcachefs, like read/write locks but with a third state, intent. Intent locks conflict with each other, but not with read locks; taking a write lock requires first holding an intent lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com>
2023-04-26locking/lockdep: lockdep_set_no_check_recursion()Kent Overstreet
This adds a method to tell lockdep not to check lock ordering within a lock class - but to still check lock ordering w.r.t. other lock types. This is for bcachefs, where for btree node locks we have our own deadlock avoidance strategy w.r.t. other btree node locks (cycle detection), but we still want lockdep to check lock ordering w.r.t. other lock types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com>
2023-04-26locking/lockdep: lock_class_is_held()Kent Overstreet
This patch adds lock_class_is_held(), which can be used to assert that a particular type of lock is not held. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Boqun Feng <boqun.feng@gmail.com>
2023-04-26Compiler Attributes: add __flattenKent Overstreet
This makes __attribute__((flatten)) available, which is used by bcachefs. Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: Miguel Ojeda <ojeda@kernel.org> (maintainer:COMPILER ATTRIBUTES) Cc: Nick Desaulniers <ndesaulniers@google.com> (reviewer:COMPILER ATTRIBUTES)
2023-04-21Revert "ACPICA: Events: Support fixed PCIe wake event"Linus Torvalds
This reverts commit 5c62d5aab8752e5ee7bfbe75ed6060db1c787f98. This broke wake-on-lan for multiple people, and for much too long. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217069 Link: https://lore.kernel.org/all/754225a2-95a9-2c36-1886-7da1a78308c2@loongson.cn/ Link: https://github.com/acpica/acpica/pull/866 Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Jianmin Lv <lvjianmin@loongson.cn> Cc: Huacai Chen <chenhuacai@loongson.cn> Cc: Bob Moore <robert.moore@intel.com> Cc: stable@kernel.org # 6.2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-04-20Merge tag 'net-6.3-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and bpf. There are a few fixes for new code bugs, including the Mellanox one noted in the last networking pull. No known regressions outstanding. Current release - regressions: - sched: clear actions pointer in miss cookie init fail - mptcp: fix accept vs worker race - bpf: fix bpf_arch_text_poke() with new_addr == NULL on s390 - eth: bnxt_en: fix a possible NULL pointer dereference in unload path - eth: veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag Current release - new code bugs: - eth: revert "net/mlx5: Enable management PF initialization" Previous releases - regressions: - netfilter: fix recent physdev match breakage - bpf: fix incorrect verifier pruning due to missing register precision taints - eth: virtio_net: fix overflow inside xdp_linearize_page() - eth: cxgb4: fix use after free bugs caused by circular dependency problem - eth: mlxsw: pci: fix possible crash during initialization Previous releases - always broken: - sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg - netfilter: validate catch-all set elements - bridge: don't notify FDB entries with "master dynamic" - eth: bonding: fix memory leak when changing bond type to ethernet - eth: i40e: fix accessing vsi->active_filters without holding lock Misc: - Mat is back as MPTCP co-maintainer" * tag 'net-6.3-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits) net: bridge: switchdev: don't notify FDB entries with "master dynamic" Revert "net/mlx5: Enable management PF initialization" MAINTAINERS: Resume MPTCP co-maintainer role mailmap: add entries for Mat Martineau e1000e: Disable TSO on i219-LM card to increase speed bnxt_en: fix free-runnig PHC mode net: dsa: microchip: ksz8795: Correctly handle huge frame configuration bpf: Fix incorrect verifier pruning due to missing register precision taints hamradio: drop ISA_DMA_API dependency mlxsw: pci: Fix possible crash during initialization mptcp: fix accept vs worker race mptcp: stops worker on unaccepted sockets at listener close net: rpl: fix rpl header size calculation net: vmxnet3: Fix NULL pointer dereference in vmxnet3_rq_rx_complete() bonding: Fix memory leak when changing bond type to Ethernet veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag mlxfw: fix null-ptr-deref in mlxfw_mfa2_tlv_next() bnxt_en: Fix a possible NULL pointer dereference in unload path bnxt_en: Do not initialize PTP on older P3/P4 chips netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements ...
2023-04-19Revert "net/mlx5: Enable management PF initialization"Jakub Kicinski
This reverts commit fe998a3c77b9f989a30a2a01fb00d3729a6d53a4. Paul reports that it causes a regression with IB on CX4 and FW 12.18.1000. In addition I think that the concept of "management PF" is not fully accepted and requires a discussion. Fixes: fe998a3c77b9 ("net/mlx5: Enable management PF initialization") Reported-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/all/CAHC9VhQ7A4+msL38WpbOMYjAqLp0EtOjeLh4Dc6SQtD6OUvCQg@mail.gmail.com/ Link: https://lore.kernel.org/r/20230413222547.56901-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-19Merge tag 'mm-hotfixes-stable-2023-04-19-16-36' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "22 hotfixes. 19 are cc:stable and the remainder address issues which were introduced during this merge cycle, or aren't considered suitable for -stable backporting. 19 are for MM and the remainder are for other subsystems" * tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits) nilfs2: initialize unused bytes in segment summary blocks mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages mm/mmap: regression fix for unmapped_area{_topdown} maple_tree: fix mas_empty_area() search maple_tree: make maple state reusable after mas_empty_area_rev() mm: kmsan: handle alloc failures in kmsan_ioremap_page_range() mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush() tools/Makefile: do missed s/vm/mm/ mm: fix memory leak on mm_init error handling mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock kernel/sys.c: fix and improve control flow in __sys_setres[ug]id() Revert "userfaultfd: don't fail on unrecognized features" writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used mailmap: update jtoppins' entry to reference correct email mm/mempolicy: fix use-after-free of VMA iterator mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO mm/mprotect: fix do_mprotect_pkey() return on error mm/khugepaged: check again on anon uffd-wp during isolation ...
2023-04-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Unbreak br_netfilter physdev match support, from Florian Westphal. 2) Use GFP_KERNEL_ACCOUNT for stateful/policy objects, from Chen Aotian. 3) Use IS_ENABLED() in nf_reset_trace(), from Florian Westphal. 4) Fix validation of catch-all set element. 5) Tighten requirements for catch-all set elements. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements netfilter: nf_tables: validate catch-all set elements netfilter: nf_tables: fix ifdef to also consider nf_tables=m netfilter: nf_tables: Modify nla_memdup's flag to GFP_KERNEL_ACCOUNT netfilter: br_netfilter: fix recent physdev match breakage ==================== Link: https://lore.kernel.org/r/20230418145048.67270-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-04-18mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()Alexander Potapenko
Similarly to kmsan_vmap_pages_range_noflush(), kmsan_ioremap_page_range() must also properly handle allocation/mapping failures. In the case of such, it must clean up the already created metadata mappings and return an error code, so that the error can be propagated to ioremap_page_range(). Without doing so, KMSAN may silently fail to bring the metadata for the page range into a consistent state, which will result in user-visible crashes when trying to access them. Link: https://lkml.kernel.org/r/20230413131223.4135168-2-glider@google.com Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/ Reviewed-by: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()Alexander Potapenko
As reported by Dipanjan Das, when KMSAN is used together with kernel fault injection (or, generally, even without the latter), calls to kcalloc() or __vmap_pages_range_noflush() may fail, leaving the metadata mappings for the virtual mapping in an inconsistent state. When these metadata mappings are accessed later, the kernel crashes. To address the problem, we return a non-zero error code from kmsan_vmap_pages_range_noflush() in the case of any allocation/mapping failure inside it, and make vmap_pages_range_noflush() return an error if KMSAN fails to allocate the metadata. This patch also removes KMSAN_WARN_ON() from vmap_pages_range_noflush(), as these allocation failures are not fatal anymore. Link: https://lkml.kernel.org/r/20230413131223.4135168-1-glider@google.com Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/ Reviewed-by: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-18netfilter: nf_tables: validate catch-all set elementsPablo Neira Ayuso
catch-all set element might jump/goto to chain that uses expressions that require validation. Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-17netfilter: nf_tables: fix ifdef to also consider nf_tables=mFlorian Westphal
nftables can be built as a module, so fix the preprocessor conditional accordingly. Fixes: 478b360a47b7 ("netfilter: nf_tables: fix nf_trace always-on with XT_TRACE=n") Reported-by: Florian Fainelli <f.fainelli@gmail.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-04-13Merge tag 'net-6.3-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, and bluetooth. Not all that quiet given spring celebrations, but "current" fixes are thinning out, which is encouraging. One outstanding regression in the mlx5 driver when using old FW, not blocking but we're pushing for a fix. Current release - new code bugs: - eth: enetc: workaround for unresponsive pMAC after receiving express traffic Previous releases - regressions: - rtnetlink: restore RTM_NEW/DELLINK notification behavior, keep the pid/seq fields 0 for backward compatibility Previous releases - always broken: - sctp: fix a potential overflow in sctp_ifwdtsn_skip - mptcp: - use mptcp_schedule_work instead of open-coding it and make the worker check stricter, to avoid scheduling work on closed sockets - fix NULL pointer dereference on fastopen early fallback - skbuff: fix memory corruption due to a race between skb coalescing and releasing clones confusing page_pool reference counting - bonding: fix neighbor solicitation validation on backup slaves - bpf: tcp: use sock_gen_put instead of sock_put in bpf_iter_tcp - bpf: arm64: fixed a BTI error on returning to patched function - openvswitch: fix race on port output leading to inf loop - sfp: initialize sfp->i2c_block_size at sfp allocation to avoid returning a different errno than expected - phy: nxp-c45-tja11xx: unregister PTP, purge queues on remove - Bluetooth: fix printing errors if LE Connection times out - Bluetooth: assorted UaF, deadlock and data race fixes - eth: macb: fix memory corruption in extended buffer descriptor mode Misc: - adjust the XDP Rx flow hash API to also include the protocol layers over which the hash was computed" * tag 'net-6.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits) selftests/bpf: Adjust bpf_xdp_metadata_rx_hash for new arg mlx4: bpf_xdp_metadata_rx_hash add xdp rss hash type veth: bpf_xdp_metadata_rx_hash add xdp rss hash type mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash type xdp: rss hash types representation selftests/bpf: xdp_hw_metadata remove bpf_printk and add counters skbuff: Fix a race between coalescing and releasing SKBs net: macb: fix a memory corruption in extended buffer descriptor mode selftests: add the missing CONFIG_IP_SCTP in net config udp6: fix potential access to stale information selftests: openvswitch: adjust datapath NL message declaration selftests: mptcp: userspace pm: uniform verify events mptcp: fix NULL pointer dereference on fastopen early fallback mptcp: stricter state check in mptcp_worker mptcp: use mptcp_schedule_work instead of open-coding it net: enetc: workaround for unresponsive pMAC after receiving express traffic sctp: fix a potential overflow in sctp_ifwdtsn_skip net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume() rtnetlink: Restore RTM_NEW/DELLINK notification behavior net: ti/cpsw: Add explicit platform_device.h and of_platform.h includes ...
2023-04-13mlx5: bpf_xdp_metadata_rx_hash add xdp rss hash typeJesper Dangaard Brouer
Update API for bpf_xdp_metadata_rx_hash() with arg for xdp rss hash type via mapping table. The mlx5 hardware can also identify and RSS hash IPSEC. This indicate hash includes SPI (Security Parameters Index) as part of IPSEC hash. Extend xdp core enum xdp_rss_hash_type with IPSEC hash type. Fixes: bc8d405b1ba9 ("net/mlx5e: Support RX XDP metadata") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/168132892548.340624.11185734579430124869.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-13xdp: rss hash types representationJesper Dangaard Brouer
The RSS hash type specifies what portion of packet data NIC hardware used when calculating RSS hash value. The RSS types are focused on Internet traffic protocols at OSI layers L3 and L4. L2 (e.g. ARP) often get hash value zero and no RSS type. For L3 focused on IPv4 vs. IPv6, and L4 primarily TCP vs UDP, but some hardware supports SCTP. Hardware RSS types are differently encoded for each hardware NIC. Most hardware represent RSS hash type as a number. Determining L3 vs L4 often requires a mapping table as there often isn't a pattern or sorting according to ISO layer. The patch introduce a XDP RSS hash type (enum xdp_rss_hash_type) that contains both BITs for the L3/L4 types, and combinations to be used by drivers for their mapping tables. The enum xdp_rss_type_bits get exposed to BPF via BTF, and it is up to the BPF-programmer to match using these defines. This proposal change the kfunc API bpf_xdp_metadata_rx_hash() adding a pointer value argument for provide the RSS hash type. Change signature for all xmo_rx_hash calls in drivers to make it compile. The RSS type implementations for each driver comes as separate patches. Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/168132892042.340624.582563003880565460.stgit@firesoul Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-12rtnetlink: Restore RTM_NEW/DELLINK notification behaviorMartin Willi
The commits referenced below allows userspace to use the NLM_F_ECHO flag for RTM_NEW/DELLINK operations to receive unicast notifications for the affected link. Prior to these changes, applications may have relied on multicast notifications to learn the same information without specifying the NLM_F_ECHO flag. For such applications, the mentioned commits changed the behavior for requests not using NLM_F_ECHO. Multicast notifications are still received, but now use the portid of the requester and the sequence number of the request instead of zero values used previously. For the application, this message may be unexpected and likely handled as a response to the NLM_F_ACKed request, especially if it uses the same socket to handle requests and notifications. To fix existing applications relying on the old notification behavior, set the portid and sequence number in the notification only if the request included the NLM_F_ECHO flag. This restores the old behavior for applications not using it, but allows unicasted notifications for others. Fixes: f3a63cce1b4f ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_delete_link") Fixes: d88e136cab37 ("rtnetlink: Honour NLM_F_ECHO flag in rtnl_newlink_create") Signed-off-by: Martin Willi <martin@strongswan.org> Acked-by: Guillaume Nault <gnault@redhat.com> Acked-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20230411074319.24133-1-martin@strongswan.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>