Age | Commit message (Collapse) | Author |
|
An earlier attempt changed this to GFP_KERNEL, but the get helper is
also called for get requests from userspace, which uses rcu.
Let the caller pass in the kmalloc flags to allow insertions
to schedule if needed.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Insertions into the set are slow when we try to add many elements.
For 800k elements I get:
time nft -f pipapo_800k
real 19m34.849s
user 0m2.390s
sys 19m12.828s
perf stats:
--95.39%--nft_pipapo_insert
|--76.60%--pipapo_insert
| --76.37%--pipapo_resize
| |--72.87%--memcpy_orig
| |--1.88%--__free_pages_ok
| | --0.89%--free_tail_page_prepare
| --1.38%--kvmalloc_node
..
--18.56%--pipapo_get.isra.0
|--13.91%--__bitmap_and
|--3.01%--pipapo_refill
|--0.81%--__kmalloc
| --0.74%--__kmalloc_large_node
| --0.66%--__alloc_pages
..
--0.52%--memset_orig
So lots of time is spent in copying exising elements to make space for
the next one.
Instead of allocating to the exact size of the new rule count, allocate
extra slack to reduce alloc/copy/free overhead.
After:
time nft -f pipapo_800k
real 1m54.110s
user 0m2.515s
sys 1m51.377s
--80.46%--nft_pipapo_insert
|--73.45%--pipapo_get.isra.0
|--57.63%--__bitmap_and
| |--8.52%--pipapo_refill
|--3.45%--__kmalloc
| --3.05%--__kmalloc_large_node
| --2.58%--__alloc_pages
--2.59%--memset_orig
|--6.51%--pipapo_insert
--5.96%--pipapo_resize
|--3.63%--memcpy_orig
--2.13%--kvmalloc_node
The new @rules_alloc fills a hole, so struct size doesn't go up.
Also make it so rule removal doesn't shrink unless the free/extra space
exceeds two pages. This should be safe as well:
When a rule gets removed, the attempt to lower the allocated size is
already allowed to fail.
Exception: do exact allocations as long as set is very small (less
than one page needed).
v2: address comments from Stefano:
kdoc comment
formatting changes
remove redundant assignment
switch back to PAGE_SIZE
Link: https://lore.kernel.org/netfilter-devel/20240213141753.17ef27a6@elisabeth/
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
The set uses a mix of 'int', 'unsigned int', and size_t.
The rule count limit is NFT_PIPAPO_RULE0_MAX, which cannot
exceed INT_MAX (a few helpers use 'int' as return type).
Add a compile-time assertion for this.
Replace size_t usage in structs with unsigned int or u8 where
the stored values are smaller.
Replace signed-int arguments for lengths with 'unsigned int'
where possible.
Last, remove lt_aligned member: its set but never read.
struct nft_pipapo_match 40 bytes -> 32 bytes
struct nft_pipapo_field 56 bytes -> 32 bytes
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
pipapo relies on kmalloc(0) returning ZERO_SIZE_PTR (i.e., not NULL
but pointer is invalid).
Rework this to not call slab allocator when we'd request a 0-byte
allocation.
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Those get called from packet path, content must not be modified.
No functional changes intended.
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Randy Dunlap reports arptables build failure:
arp_tables.c:(.text+0x20): undefined reference to `xt_find_table'
... because recent change removed a 'select' on the xtables core.
Add a "depends" clause on arptables to resolve this.
Kernel test robot reports another build breakage:
iptable_nat.c:(.text+0x8): undefined reference to `ipt_unregister_table_exit'
... because of a typo, the nat table selected ip6tables.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/netfilter-devel/d0dfbaef-046a-4c42-9daa-53636664bf6d@infradead.org/
Fixes: a9525c7f6219 ("netfilter: xtables: allow xtables-nft only builds")
Fixes: 4654467dc7e1 ("netfilter: arptables: allow xtables-nft only builds")
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Remove useless branch to check for errors in nft_parse_register_store().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Sanitize nf_logger_find_get() input parameters, no caller in the tree
passes invalid values.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Consolidate pointer fetch to logger and check for NULL in
__find_logger().
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
nf_conntrack_expect_init
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create
to simplify the creation of SLAB caches.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Since commit aed65af1cc2f ("drivers: make device_type const"), the driver
core can properly handle constant struct device_type. Move the vlan_type
variable to be a constant structure as well, placing it into read-only
memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit aed65af1cc2f ("drivers: make device_type const"), the driver
core can properly handle constant struct device_type. Move the l2tpeth_type
variable to be a constant structure as well, placing it into read-only
memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit aed65af1cc2f ("drivers: make device_type const"), the driver
core can properly handle constant struct device_type. Move the hsr_type
variable to be a constant structure as well, placing it into read-only
memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit aed65af1cc2f ("drivers: make device_type const"), the driver
core can properly handle constant struct device_type. Move the br_type
variable to be a constant structure as well, placing it into read-only
memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since commit aed65af1cc2f ("drivers: make device_type const"), the driver
core can properly handle constant struct device_type. Move the dsa_type
variable to be a constant structure as well, placing it into read-only
memory which can not be modified at runtime.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Properly check page pointer returned by page_pool_dev_alloc routine in
skb_pp_cow_data() for non-linear part of the original skb.
Reported-by: Julian Wiedmann <jwiedmann.dev@gmail.com>
Closes: https://lore.kernel.org/netdev/cover.1707729884.git.lorenzo@kernel.org/T/#m7d189b0015a7281ed9221903902490c03ed19a7a
Fixes: e6d5dbdd20aa ("xdp: add multi-buff support for xdp running in generic mode")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/25512af3e09befa9dcb2cf3632bdc45b807cf330.1708167716.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next
Marc Kleine-Budde says:
====================
pull-request: can-next 2024-02-20
this is a pull request of 9 patches for net-next/master.
The first patch is by Francesco Dolcini and removes a redundant check
for pm_clock_support from the m_can driver.
Martin Hundebøll contributes 3 patches to the m_can/tcan4x5x driver to
allow resume upon RX of a CAN frame.
3 patches by Srinivas Goud add support for ECC statistics to the
xilinx_can driver.
The last 2 patches are by Oliver Hartkopp and me, target the CAN RAW
protocol and fix an error in the getsockopt() for CAN-XL introduced in
the previous pull request to net-next (linux-can-next-for-6.9-20240213).
linux-can-next-for-6.9-20240220
* tag 'linux-can-next-for-6.9-20240220' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
can: raw: raw_getsockopt(): reduce scope of err
can: raw: fix getsockopt() for new CAN_RAW_XL_VCID_OPTS
can: xilinx_can: Add ethtool stats interface for ECC errors
can: xilinx_can: Add ECC support
dt-bindings: can: xilinx_can: Add 'xlnx,has-ecc' optional property
can: tcan4x5x: support resuming from rx interrupt signal
can: m_can: allow keeping the transceiver running in suspend
dt-bindings: can: tcan4x5x: Document the wakeup-source flag
can: m_can: remove redundant check for pm_clock_support
====================
Link: https://lore.kernel.org/r/20240220085130.2936533-1-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
'def_bool X' is a shorthand for 'bool' plus 'default X'.
'def_bool' is redundant where 'bool' is already present, so 'def_bool X'
can be replaced with 'default X', or removed if X is 'n'.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
Last major reorg happened in commit 9115e8cd2a0c ("net: reorganize
struct sock for better data locality")
Since then, many changes have been done.
Before SO_PEEK_OFF support is added to TCP, we need
to move sk_peek_off to a better location.
It is time to make another pass, and add six groups,
without explicit alignment.
- sock_write_rx (following sk_refcnt) read-write fields in rx path.
- sock_read_rx read-mostly fields in rx path.
- sock_read_rxtx read-mostly fields in both rx and tx paths.
- sock_write_rxtx read-write fields in both rx and tx paths.
- sock_write_tx read-write fields in tx paths.
- sock_read_tx read-mostly fields in tx paths.
Results on TCP_RR benchmarks seem to show a gain (4 to 5 %).
It is possible UDP needs a change, because sk_peek_off
shares a cache line with sk_receive_queue.
If this the case, we can exchange roles of sk->sk_receive
and up->reader_queue queues.
After this change, we have the following layout:
struct sock {
struct sock_common __sk_common; /* 0 0x88 */
/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
__u8 __cacheline_group_begin__sock_write_rx[0]; /* 0x88 0 */
atomic_t sk_drops; /* 0x88 0x4 */
__s32 sk_peek_off; /* 0x8c 0x4 */
struct sk_buff_head sk_error_queue; /* 0x90 0x18 */
struct sk_buff_head sk_receive_queue; /* 0xa8 0x18 */
/* --- cacheline 3 boundary (192 bytes) --- */
struct {
atomic_t rmem_alloc; /* 0xc0 0x4 */
int len; /* 0xc4 0x4 */
struct sk_buff * head; /* 0xc8 0x8 */
struct sk_buff * tail; /* 0xd0 0x8 */
} sk_backlog; /* 0xc0 0x18 */
struct {
atomic_t rmem_alloc; /* 0 0x4 */
int len; /* 0x4 0x4 */
struct sk_buff * head; /* 0x8 0x8 */
struct sk_buff * tail; /* 0x10 0x8 */
/* size: 24, cachelines: 1, members: 4 */
/* last cacheline: 24 bytes */
};
__u8 __cacheline_group_end__sock_write_rx[0]; /* 0xd8 0 */
__u8 __cacheline_group_begin__sock_read_rx[0]; /* 0xd8 0 */
rcu * sk_rx_dst; /* 0xd8 0x8 */
int sk_rx_dst_ifindex; /* 0xe0 0x4 */
u32 sk_rx_dst_cookie; /* 0xe4 0x4 */
unsigned int sk_ll_usec; /* 0xe8 0x4 */
unsigned int sk_napi_id; /* 0xec 0x4 */
u16 sk_busy_poll_budget; /* 0xf0 0x2 */
u8 sk_prefer_busy_poll; /* 0xf2 0x1 */
u8 sk_userlocks; /* 0xf3 0x1 */
int sk_rcvbuf; /* 0xf4 0x4 */
rcu * sk_filter; /* 0xf8 0x8 */
/* --- cacheline 4 boundary (256 bytes) --- */
union {
rcu * sk_wq; /* 0x100 0x8 */
struct socket_wq * sk_wq_raw; /* 0x100 0x8 */
}; /* 0x100 0x8 */
union {
rcu * sk_wq; /* 0 0x8 */
struct socket_wq * sk_wq_raw; /* 0 0x8 */
};
void (*sk_data_ready)(struct sock *); /* 0x108 0x8 */
long sk_rcvtimeo; /* 0x110 0x8 */
int sk_rcvlowat; /* 0x118 0x4 */
__u8 __cacheline_group_end__sock_read_rx[0]; /* 0x11c 0 */
__u8 __cacheline_group_begin__sock_read_rxtx[0]; /* 0x11c 0 */
int sk_err; /* 0x11c 0x4 */
struct socket * sk_socket; /* 0x120 0x8 */
struct mem_cgroup * sk_memcg; /* 0x128 0x8 */
rcu * sk_policy[2]; /* 0x130 0x10 */
/* --- cacheline 5 boundary (320 bytes) --- */
__u8 __cacheline_group_end__sock_read_rxtx[0]; /* 0x140 0 */
__u8 __cacheline_group_begin__sock_write_rxtx[0]; /* 0x140 0 */
socket_lock_t sk_lock; /* 0x140 0x20 */
u32 sk_reserved_mem; /* 0x160 0x4 */
int sk_forward_alloc; /* 0x164 0x4 */
u32 sk_tsflags; /* 0x168 0x4 */
__u8 __cacheline_group_end__sock_write_rxtx[0]; /* 0x16c 0 */
__u8 __cacheline_group_begin__sock_write_tx[0]; /* 0x16c 0 */
int sk_write_pending; /* 0x16c 0x4 */
atomic_t sk_omem_alloc; /* 0x170 0x4 */
int sk_sndbuf; /* 0x174 0x4 */
int sk_wmem_queued; /* 0x178 0x4 */
refcount_t sk_wmem_alloc; /* 0x17c 0x4 */
/* --- cacheline 6 boundary (384 bytes) --- */
unsigned long sk_tsq_flags; /* 0x180 0x8 */
union {
struct sk_buff * sk_send_head; /* 0x188 0x8 */
struct rb_root tcp_rtx_queue; /* 0x188 0x8 */
}; /* 0x188 0x8 */
union {
struct sk_buff * sk_send_head; /* 0 0x8 */
struct rb_root tcp_rtx_queue; /* 0 0x8 */
};
struct sk_buff_head sk_write_queue; /* 0x190 0x18 */
u32 sk_dst_pending_confirm; /* 0x1a8 0x4 */
u32 sk_pacing_status; /* 0x1ac 0x4 */
struct page_frag sk_frag; /* 0x1b0 0x10 */
/* --- cacheline 7 boundary (448 bytes) --- */
struct timer_list sk_timer; /* 0x1c0 0x28 */
/* XXX last struct has 4 bytes of padding */
unsigned long sk_pacing_rate; /* 0x1e8 0x8 */
atomic_t sk_zckey; /* 0x1f0 0x4 */
atomic_t sk_tskey; /* 0x1f4 0x4 */
__u8 __cacheline_group_end__sock_write_tx[0]; /* 0x1f8 0 */
__u8 __cacheline_group_begin__sock_read_tx[0]; /* 0x1f8 0 */
unsigned long sk_max_pacing_rate; /* 0x1f8 0x8 */
/* --- cacheline 8 boundary (512 bytes) --- */
long sk_sndtimeo; /* 0x200 0x8 */
u32 sk_priority; /* 0x208 0x4 */
u32 sk_mark; /* 0x20c 0x4 */
rcu * sk_dst_cache; /* 0x210 0x8 */
netdev_features_t sk_route_caps; /* 0x218 0x8 */
u16 sk_gso_type; /* 0x220 0x2 */
u16 sk_gso_max_segs; /* 0x222 0x2 */
unsigned int sk_gso_max_size; /* 0x224 0x4 */
gfp_t sk_allocation; /* 0x228 0x4 */
u32 sk_txhash; /* 0x22c 0x4 */
u8 sk_pacing_shift; /* 0x230 0x1 */
bool sk_use_task_frag; /* 0x231 0x1 */
__u8 __cacheline_group_end__sock_read_tx[0]; /* 0x232 0 */
u8 sk_gso_disabled:1; /* 0x232: 0 0x1 */
u8 sk_kern_sock:1; /* 0x232:0x1 0x1 */
u8 sk_no_check_tx:1; /* 0x232:0x2 0x1 */
u8 sk_no_check_rx:1; /* 0x232:0x3 0x1 */
/* XXX 4 bits hole, try to pack */
u8 sk_shutdown; /* 0x233 0x1 */
u16 sk_type; /* 0x234 0x2 */
u16 sk_protocol; /* 0x236 0x2 */
unsigned long sk_lingertime; /* 0x238 0x8 */
/* --- cacheline 9 boundary (576 bytes) --- */
struct proto * sk_prot_creator; /* 0x240 0x8 */
rwlock_t sk_callback_lock; /* 0x248 0x8 */
int sk_err_soft; /* 0x250 0x4 */
u32 sk_ack_backlog; /* 0x254 0x4 */
u32 sk_max_ack_backlog; /* 0x258 0x4 */
kuid_t sk_uid; /* 0x25c 0x4 */
spinlock_t sk_peer_lock; /* 0x260 0x4 */
int sk_bind_phc; /* 0x264 0x4 */
struct pid * sk_peer_pid; /* 0x268 0x8 */
const struct cred * sk_peer_cred; /* 0x270 0x8 */
ktime_t sk_stamp; /* 0x278 0x8 */
/* --- cacheline 10 boundary (640 bytes) --- */
int sk_disconnects; /* 0x280 0x4 */
u8 sk_txrehash; /* 0x284 0x1 */
u8 sk_clockid; /* 0x285 0x1 */
u8 sk_txtime_deadline_mode:1; /* 0x286: 0 0x1 */
u8 sk_txtime_report_errors:1; /* 0x286:0x1 0x1 */
u8 sk_txtime_unused:6; /* 0x286:0x2 0x1 */
/* XXX 1 byte hole, try to pack */
void * sk_user_data; /* 0x288 0x8 */
void * sk_security; /* 0x290 0x8 */
struct sock_cgroup_data sk_cgrp_data; /* 0x298 0x8 */
void (*sk_state_change)(struct sock *); /* 0x2a0 0x8 */
void (*sk_write_space)(struct sock *); /* 0x2a8 0x8 */
void (*sk_error_report)(struct sock *); /* 0x2b0 0x8 */
int (*sk_backlog_rcv)(struct sock *, struct sk_buff *); /* 0x2b8 0x8 */
/* --- cacheline 11 boundary (704 bytes) --- */
void (*sk_destruct)(struct sock *); /* 0x2c0 0x8 */
rcu * sk_reuseport_cb; /* 0x2c8 0x8 */
rcu * sk_bpf_storage; /* 0x2d0 0x8 */
struct callback_head sk_rcu __attribute__((__aligned__(8))); /* 0x2d8 0x10 */
netns_tracker ns_tracker; /* 0x2e8 0x8 */
/* size: 752, cachelines: 12, members: 105 */
/* sum members: 749, holes: 1, sum holes: 1 */
/* sum bitfield members: 12 bits, bit holes: 1, sum bit holes: 4 bits */
/* paddings: 1, sum paddings: 4 */
/* forced alignments: 1 */
/* last cacheline: 48 bytes */
};
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20240216162006.2342759-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The variable len being initialized with a value that is never read, an
if statement is initializing it in both paths of the if statement.
The initialization is redundant and can be removed.
Cleans up clang scan build warning:
net/ipv4/tcp_ao.c:512:11: warning: Value stored to 'len' during its
initialization is never read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://lore.kernel.org/r/20240216125443.2107244-1-colin.i.king@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
syzkaller reported an overflown write in arp_req_get(). [0]
When ioctl(SIOCGARP) is issued, arp_req_get() looks up an neighbour
entry and copies neigh->ha to struct arpreq.arp_ha.sa_data.
The arp_ha here is struct sockaddr, not struct sockaddr_storage, so
the sa_data buffer is just 14 bytes.
In the splat below, 2 bytes are overflown to the next int field,
arp_flags. We initialise the field just after the memcpy(), so it's
not a problem.
However, when dev->addr_len is greater than 22 (e.g. MAX_ADDR_LEN),
arp_netmask is overwritten, which could be set as htonl(0xFFFFFFFFUL)
in arp_ioctl() before calling arp_req_get().
To avoid the overflow, let's limit the max length of memcpy().
Note that commit b5f0de6df6dc ("net: dev: Convert sa_data to flexible
array in struct sockaddr") just silenced syzkaller.
[0]:
memcpy: detected field-spanning write (size 16) of single field "r->arp_ha.sa_data" at net/ipv4/arp.c:1128 (size 14)
WARNING: CPU: 0 PID: 144638 at net/ipv4/arp.c:1128 arp_req_get+0x411/0x4a0 net/ipv4/arp.c:1128
Modules linked in:
CPU: 0 PID: 144638 Comm: syz-executor.4 Not tainted 6.1.74 #31
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
RIP: 0010:arp_req_get+0x411/0x4a0 net/ipv4/arp.c:1128
Code: fd ff ff e8 41 42 de fb b9 0e 00 00 00 4c 89 fe 48 c7 c2 20 6d ab 87 48 c7 c7 80 6d ab 87 c6 05 25 af 72 04 01 e8 5f 8d ad fb <0f> 0b e9 6c fd ff ff e8 13 42 de fb be 03 00 00 00 4c 89 e7 e8 a6
RSP: 0018:ffffc900050b7998 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88803a815000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8641a44a RDI: 0000000000000001
RBP: ffffc900050b7a98 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 203a7970636d656d R12: ffff888039c54000
R13: 1ffff92000a16f37 R14: ffff88803a815084 R15: 0000000000000010
FS: 00007f172bf306c0(0000) GS:ffff88805aa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f172b3569f0 CR3: 0000000057f12005 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
arp_ioctl+0x33f/0x4b0 net/ipv4/arp.c:1261
inet_ioctl+0x314/0x3a0 net/ipv4/af_inet.c:981
sock_do_ioctl+0xdf/0x260 net/socket.c:1204
sock_ioctl+0x3ef/0x650 net/socket.c:1321
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x18e/0x220 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x37/0x90 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x64/0xce
RIP: 0033:0x7f172b262b8d
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f172bf300b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f172b3abf80 RCX: 00007f172b262b8d
RDX: 0000000020000000 RSI: 0000000000008954 RDI: 0000000000000003
RBP: 00007f172b2d3493 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007f172b3abf80 R15: 00007f172bf10000
</TASK>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Reported-by: Bjoern Doebel <doebel@amazon.de>
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240215230516.31330-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The pernet operations structure for the subsystem must be registered
before registering the generic netlink family.
Make an unregister in case of unsuccessful registration.
Fixes: 687125b5799c ("devlink: split out core code")
Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org>
Link: https://lore.kernel.org/r/20240215203400.29976-1-kovalev@altlinux.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The pernet operations structure for the subsystem must be registered
before registering the generic netlink family.
Fixes: 915d7e5e5930 ("ipv6: sr: add code base for control plane support of SR-IPv6")
Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org>
Link: https://lore.kernel.org/r/20240215202717.29815-1-kovalev@altlinux.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Reduce the scope of the variable "err" to the individual cases. This
is to avoid the mistake of setting "err" in the mistaken belief that
it will be evaluated later.
Reviewed-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Link: https://lore.kernel.org/all/20240220-raw-setsockopt-v1-1-7d34cb1377fc@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Use struct netmem* instead of page in skb_frag_t. Currently struct
netmem* is always a struct page underneath, but the abstraction
allows efforts to add support for skb frags not backed by pages.
There is unfortunately 1 instance where the skb_frag_t is assumed to be
a exactly a bio_vec in kcm. For this case, WARN_ON_ONCE and return error
before doing a cast.
Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
that the API can be used to create netmem skbs.
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The code for the CAN_RAW_XL_VCID_OPTS getsockopt() was incompletely adopted
from the CAN_RAW_FILTER getsockopt().
Add the missing put_user() and return statements.
Flagged by Smatch.
Fixes: c83c22ec1493 ("can: canxl: add virtual CAN network identifier support")
Reported-by: Simon Horman <horms@kernel.org>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20240219200021.12113-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
Creation of sysfs entries is expensive, mainly for workloads that
constantly creates netdev and netns often.
Do not create BQL sysfs entries for devices that don't need,
basically those that do not have a real queue, i.e, devices that has
NETIF_F_LLTX and IFF_NO_QUEUE, such as `lo` interface.
This will remove the /sys/class/net/eth0/queues/tx-X/byte_queue_limits/
directory for these devices.
In the example below, eth0 has the `byte_queue_limits` directory but not
`lo`.
# ls /sys/class/net/lo/queues/tx-0/
traffic_class tx_maxrate tx_timeout xps_cpus xps_rxqs
# ls /sys/class/net/eth0/queues/tx-0/byte_queue_limits/
hold_time inflight limit limit_max limit_min
This also removes the #ifdefs, since we can also use netdev_uses_bql() to
check if the config is enabled. (as suggested by Jakub).
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240216094154.3263843-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use global percpu page_pool_recycle_stats counter for system page_pool
allocator instead of allocating a separate percpu variable for each
(also percpu) page pool instance.
Reviewed-by: Toke Hoiland-Jorgensen <toke@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/87f572425e98faea3da45f76c3c68815c01a20ee.1708075412.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that direct recycling is performed basing on pool->cpuid when set,
memory leaks are possible:
1. A pool is destroyed.
2. Alloc cache is emptied (it's done only once).
3. pool->cpuid is still set.
4. napi_pp_put_page() does direct recycling basing on pool->cpuid.
5. Now alloc cache is not empty, but it won't ever be freed.
In order to avoid that, rewrite pool->cpuid to -1 when unlinking NAPI to
make sure no direct recycling will be possible after emptying the cache.
This involves a bit of overhead as pool->cpuid now must be accessed
via READ_ONCE() to avoid partial reads.
Rename page_pool_unlink_napi() -> page_pool_disable_direct_recycling()
to reflect what it actually does and unexport it.
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240215113905.96817-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct tc_pedit.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1]
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fullmesh endpoints could end-up unexpectedly generating duplicate
subflows - same local and remote addresses - when multiple incoming
ADD_ADDR are processed before the PM creates the subflow for the local
endpoints.
Address the issue explicitly checking for duplicates at subflow
creation time.
To avoid a quadratic computational complexity, track the unavailable
remote address ids in a temporary bitmap and initialize such bitmap
with the remote ids of all the existing subflows matching the local
address currently processed.
The above allows additionally replacing the existing code checking
for duplicate entry in the current set with a simple bit test
operation.
Fixes: 2843ff6f36db ("mptcp: remote addresses fullmesh")
Cc: stable@vger.kernel.org
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/435
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Similar to the previous patch, address the data race on
remote_id, adding the suitable ONCE annotations.
Fixes: bedee0b56113 ("mptcp: address lookup improvements")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The local address id is accessed lockless by the NL PM, add
all the required ONCE annotation. There is a caveat: the local
id can be initialized late in the subflow life-cycle, and its
validity is controlled by the local_id_valid flag.
Remove such flag and encode the validity in the local_id field
itself with negative value before initialization. That allows
accessing the field consistently with a single read operation.
Fixes: 0ee4261a3681 ("mptcp: implement mptcp_pm_remove_subflow")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since the introduction of the subflow ULP diag interface, the
dump callback accessed all the subflow data with lockless.
We need either to annotate all the read and write operation accordingly,
or acquire the subflow socket lock. Let's do latter, even if slower, to
avoid a diffstat havoc.
Fixes: 5147dfb50832 ("mptcp: allow dumping subflow context to userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Just the same as userspace PM, a new parameter needs_id is added for
in-kernel PM mptcp_pm_nl_append_new_local_addr() too.
Add a new helper mptcp_pm_has_addr_attr_id() to check whether an address
ID is set from PM or not.
In mptcp_pm_nl_get_local_id(), needs_id is always true, but in
mptcp_pm_nl_add_addr_doit(), pass mptcp_pm_has_addr_attr_id() to
needs_it.
Fixes: efd5a4c04e18 ("mptcp: add the address ID assignment bitmap")
Cc: stable@vger.kernel.org
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When userspace PM requires to create an ID 0 subflow in "userspace pm
create id 0 subflow" test like this:
userspace_pm_add_sf $ns2 10.0.3.2 0
An ID 1 subflow, in fact, is created.
Since in mptcp_pm_nl_append_new_local_addr(), 'id 0' will be treated as
no ID is set by userspace, and will allocate a new ID immediately:
if (!e->addr.id)
e->addr.id = find_next_zero_bit(pernet->id_bitmap,
MPTCP_PM_MAX_ADDR_ID + 1,
1);
To solve this issue, a new parameter needs_id is added for
mptcp_userspace_pm_append_new_local_addr() to distinguish between
whether userspace PM has set an ID 0 or whether userspace PM has
not set any address.
needs_id is true in mptcp_userspace_pm_get_local_id(), but false in
mptcp_pm_nl_announce_doit() and mptcp_pm_nl_subflow_create_doit().
Fixes: e5ed101a6028 ("mptcp: userspace pm allow creating id 0 subflow")
Cc: stable@vger.kernel.org
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
net->dev_base_seq and ipv6.dev_addr_genid are monotonically increasing.
If we XOR their values, we could miss to detect if both values
were changed with the same amount.
Fixes: 63998ac24f83 ("ipv6: provide addr and netconf dump consistency info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
net->dev_base_seq and ipv4.dev_addr_genid are monotonically increasing.
If we XOR their values, we could miss to detect if both values
were changed with the same amount.
Fixes: 0465277f6b3f ("ipv4: provide addr and netconf dump consistency info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
It is impossible to disable BQL individually today, since there is no
prompt for the Kconfig entry, so, the BQL is always enabled if SYSFS is
enabled.
Create a prompt entry for BQL, so, it could be enabled or disabled at
build time independently of SYSFS.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If we're redirecting the skb, and haven't called tcf_mirred_forward(),
yet, we need to tell the core to drop the skb by setting the retcode
to SHOT. If we have called tcf_mirred_forward(), however, the skb
is out of our hands and returning SHOT will lead to UaF.
Move the retval override to the error path which actually need it.
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Fixes: e5cf1baf92cb ("act_mirred: use TC_ACT_REINSERT when possible")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The test Davide added in commit ca22da2fbd69 ("act_mirred: use the backlog
for nested calls to mirred ingress") hangs our testing VMs every 10 or so
runs, with the familiar tcp_v4_rcv -> tcp_v4_rcv deadlock reported by
lockdep.
The problem as previously described by Davide (see Link) is that
if we reverse flow of traffic with the redirect (egress -> ingress)
we may reach the same socket which generated the packet. And we may
still be holding its socket lock. The common solution to such deadlocks
is to put the packet in the Rx backlog, rather than run the Rx path
inline. Do that for all egress -> ingress reversals, not just once
we started to nest mirred calls.
In the past there was a concern that the backlog indirection will
lead to loss of error reporting / less accurate stats. But the current
workaround does not seem to address the issue.
Fixes: 53592b364001 ("net/sched: act_mirred: Implement ingress actions")
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Suggested-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix a misspelling of "circuit".
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
syzkaller reported a warning [0] in inet_csk_destroy_sock() with no
repro.
WARN_ON(inet_sk(sk)->inet_num && !inet_csk(sk)->icsk_bind_hash);
However, the syzkaller's log hinted that connect() failed just before
the warning due to FAULT_INJECTION. [1]
When connect() is called for an unbound socket, we search for an
available ephemeral port. If a bhash bucket exists for the port, we
call __inet_check_established() or __inet6_check_established() to check
if the bucket is reusable.
If reusable, we add the socket into ehash and set inet_sk(sk)->inet_num.
Later, we look up the corresponding bhash2 bucket and try to allocate
it if it does not exist.
Although it rarely occurs in real use, if the allocation fails, we must
revert the changes by check_established(). Otherwise, an unconnected
socket could illegally occupy an ehash entry.
Note that we do not put tw back into ehash because sk might have
already responded to a packet for tw and it would be better to free
tw earlier under such memory presure.
[0]:
WARNING: CPU: 0 PID: 350830 at net/ipv4/inet_connection_sock.c:1193 inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193)
Modules linked in:
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193)
Code: 41 5c 41 5d 41 5e e9 2d 4a 3d fd e8 28 4a 3d fd 48 89 ef e8 f0 cd 7d ff 5b 5d 41 5c 41 5d 41 5e e9 13 4a 3d fd e8 0e 4a 3d fd <0f> 0b e9 61 fe ff ff e8 02 4a 3d fd 4c 89 e7 be 03 00 00 00 e8 05
RSP: 0018:ffffc9000b21fd38 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000009e78 RCX: ffffffff840bae40
RDX: ffff88806e46c600 RSI: ffffffff840bb012 RDI: ffff88811755cca8
RBP: ffff88811755c880 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000009e78 R11: 0000000000000000 R12: ffff88811755c8e0
R13: ffff88811755c892 R14: ffff88811755c918 R15: 0000000000000000
FS: 00007f03e5243800(0000) GS:ffff88811ae00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b32f21000 CR3: 0000000112ffe001 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
? inet_csk_destroy_sock (net/ipv4/inet_connection_sock.c:1193)
dccp_close (net/dccp/proto.c:1078)
inet_release (net/ipv4/af_inet.c:434)
__sock_release (net/socket.c:660)
sock_close (net/socket.c:1423)
__fput (fs/file_table.c:377)
__fput_sync (fs/file_table.c:462)
__x64_sys_close (fs/open.c:1557 fs/open.c:1539 fs/open.c:1539)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
RIP: 0033:0x7f03e53852bb
Code: 03 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 43 c9 f5 ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 a1 c9 f5 ff 8b 44
RSP: 002b:00000000005dfba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f03e53852bb
RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000167c
R10: 0000000008a79680 R11: 0000000000000293 R12: 00007f03e4e43000
R13: 00007f03e4e43170 R14: 00007f03e4e43178 R15: 00007f03e4e43170
</TASK>
[1]:
FAULT_INJECTION: forcing a failure.
name failslab, interval 1, probability 0, space 0, times 0
CPU: 0 PID: 350833 Comm: syz-executor.1 Not tainted 6.7.0-12272-g2121c43f88f5 #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
should_fail_ex (lib/fault-inject.c:52 lib/fault-inject.c:153)
should_failslab (mm/slub.c:3748)
kmem_cache_alloc (mm/slub.c:3763 mm/slub.c:3842 mm/slub.c:3867)
inet_bind2_bucket_create (net/ipv4/inet_hashtables.c:135)
__inet_hash_connect (net/ipv4/inet_hashtables.c:1100)
dccp_v4_connect (net/dccp/ipv4.c:116)
__inet_stream_connect (net/ipv4/af_inet.c:676)
inet_stream_connect (net/ipv4/af_inet.c:747)
__sys_connect_file (net/socket.c:2048 (discriminator 2))
__sys_connect (net/socket.c:2065)
__x64_sys_connect (net/socket.c:2072)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
RIP: 0033:0x7f03e5284e5d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48
RSP: 002b:00007f03e4641cc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 00000000004bbf80 RCX: 00007f03e5284e5d
RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000003
RBP: 00000000004bbf80 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 000000000000000b R14: 00007f03e52e5530 R15: 0000000000000000
</TASK>
Reported-by: syzkaller <syzkaller@googlegroups.com>
Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When unoffloading a device, it is important to ensure that all
relevant deferred events are delivered to it before it disassociates
itself from the bridge.
Before this change, this was true for the normal case when a device
maps 1:1 to a net_bridge_port, i.e.
br0
/
swp0
When swp0 leaves br0, the call to switchdev_deferred_process() in
del_nbp() makes sure to process any outstanding events while the
device is still associated with the bridge.
In the case when the association is indirect though, i.e. when the
device is attached to the bridge via an intermediate device, like a
LAG...
br0
/
lag0
/
swp0
...then detaching swp0 from lag0 does not cause any net_bridge_port to
be deleted, so there was no guarantee that all events had been
processed before the device disassociated itself from the bridge.
Fix this by always synchronously processing all deferred events before
signaling completion of unoffloading back to the driver.
Fixes: 4e51bf44a03a ("net: bridge: move the switchdev object replay helpers to "push" mode")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Before this change, generation of the list of MDB events to replay
would race against the creation of new group memberships, either from
the IGMP/MLD snooping logic or from user configuration.
While new memberships are immediately visible to walkers of
br->mdb_list, the notification of their existence to switchdev event
subscribers is deferred until a later point in time. So if a replay
list was generated during a time that overlapped with such a window,
it would also contain a replay of the not-yet-delivered event.
The driver would thus receive two copies of what the bridge internally
considered to be one single event. On destruction of the bridge, only
a single membership deletion event was therefore sent. As a
consequence of this, drivers which reference count memberships (at
least DSA), would be left with orphan groups in their hardware
database when the bridge was destroyed.
This is only an issue when replaying additions. While deletion events
may still be pending on the deferred queue, they will already have
been removed from br->mdb_list, so no duplicates can be generated in
that scenario.
To a user this meant that old group memberships, from a bridge in
which a port was previously attached, could be reanimated (in
hardware) when the port joined a new bridge, without the new bridge's
knowledge.
For example, on an mv88e6xxx system, create a snooping bridge and
immediately add a port to it:
root@infix-06-0b-00:~$ ip link add dev br0 up type bridge mcast_snooping 1 && \
> ip link set dev x3 up master br0
And then destroy the bridge:
root@infix-06-0b-00:~$ ip link del dev br0
root@infix-06-0b-00:~$ mvls atu
ADDRESS FID STATE Q F 0 1 2 3 4 5 6 7 8 9 a
DEV:0 Marvell 88E6393X
33:33:00:00:00:6a 1 static - - 0 . . . . . . . . . .
33:33:ff:87:e4:3f 1 static - - 0 . . . . . . . . . .
ff:ff:ff:ff:ff:ff 1 static - - 0 1 2 3 4 5 6 7 8 9 a
root@infix-06-0b-00:~$
The two IPv6 groups remain in the hardware database because the
port (x3) is notified of the host's membership twice: once via the
original event and once via a replay. Since only a single delete
notification is sent, the count remains at 1 when the bridge is
destroyed.
Then add the same port (or another port belonging to the same hardware
domain) to a new bridge, this time with snooping disabled:
root@infix-06-0b-00:~$ ip link add dev br1 up type bridge mcast_snooping 0 && \
> ip link set dev x3 up master br1
All multicast, including the two IPv6 groups from br0, should now be
flooded, according to the policy of br1. But instead the old
memberships are still active in the hardware database, causing the
switch to only forward traffic to those groups towards the CPU (port
0).
Eliminate the race in two steps:
1. Grab the write-side lock of the MDB while generating the replay
list.
This prevents new memberships from showing up while we are generating
the replay list. But it leaves the scenario in which a deferred event
was already generated, but not delivered, before we grabbed the
lock. Therefore:
2. Make sure that no deferred version of a replay event is already
enqueued to the switchdev deferred queue, before adding it to the
replay list, when replaying additions.
Fixes: 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
iucv_path_table is a dynamically allocated array of pointers to
struct iucv_path items. Yet, its size is calculated as if it was
an array of struct iucv_path items.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix virtual vs physical address confusion. This does not fix a bug
since virtual and physical address spaces are currently the same.
Acked-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The code block under the "!ds->user_mii_bus && ds->ops->phy_read" check
under dsa_switch_setup() populates ds->user_mii_bus. The use of
ds->user_mii_bus is inappropriate when the MDIO bus of the switch is
described on the device tree [1].
For this reason, use this code block only for switches [with MDIO bus]
probed on platform_data, and OF which the switch MDIO bus isn't described
on the device tree. Therefore, remove OF-based MDIO bus registration as
it's useless for these cases.
These subdrivers which control switches [with MDIO bus] probed on OF, will
lose the ability to register the MDIO bus OF-based:
drivers/net/dsa/b53/b53_common.c
drivers/net/dsa/lan9303-core.c
drivers/net/dsa/vitesse-vsc73xx-core.c
These subdrivers let the DSA core driver register the bus:
- ds->ops->phy_read() and ds->ops->phy_write() are present.
- ds->user_mii_bus is not populated.
The commit fe7324b93222 ("net: dsa: OF-ware slave_mii_bus") which brought
OF-based MDIO bus registration on the DSA core driver is reasonably recent
and, in this time frame, there have been no device trees in the Linux
repository that started describing the MDIO bus, or dt-bindings defining
the MDIO bus for the switches these subdrivers control. So I don't expect
any devices to be affected.
The logic we encourage is that all subdrivers should register the switch
MDIO bus on their own [2]. And, for subdrivers which control switches [with
MDIO bus] probed on OF, this logic must be followed to support all cases
properly:
No switch MDIO bus defined: Populate ds->user_mii_bus, register the MDIO
bus, set the interrupts for PHYs if "interrupt-controller" is defined at
the switch node. This case should only be covered for the switches which
their dt-bindings documentation didn't document the MDIO bus from the
start. This is to keep supporting the device trees that do not describe the
MDIO bus on the device tree but the MDIO bus is being used nonetheless.
Switch MDIO bus defined: Don't populate ds->user_mii_bus, register the MDIO
bus, set the interrupts for PHYs if ["interrupt-controller" is defined at
the switch node and "interrupts" is defined at the PHY nodes under the
switch MDIO bus node].
Switch MDIO bus defined but explicitly disabled: If the device tree says
status = "disabled" for the MDIO bus, we shouldn't need an MDIO bus at all.
Instead, just exit as early as possible and do not call any MDIO API.
After all subdrivers that control switches with MDIO buses are made to
register the MDIO buses on their own, we will be able to get rid of
dsa_switch_ops :: phy_read() and :: phy_write(), and the code block for
registering the MDIO bus on the DSA core driver.
Link: https://lore.kernel.org/netdev/20231213120656.x46fyad6ls7sqyzv@skbuf/ [1]
Link: https://lore.kernel.org/netdev/20240103184459.dcbh57wdnlox6w7d@skbuf/ [2]
Suggested-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Acked-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20240213-for-netnext-dsa-mdio-bus-v2-1-0ff6f4823a9e@arinc9.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Stanislaw Gruszka says:
====================
thermal/netlink/intel_hfi: Enable HFI feature only when required
The patchset introduces a new genetlink family bind/unbind callbacks
and thermal/netlink notifications, which allow drivers to send netlink
multicast events based on the presence of actual user-space consumers.
This functionality optimizes resource usage by allowing disabling
of features when not needed.
v1: https://lore.kernel.org/linux-pm/20240131120535.933424-1-stanislaw.gruszka@linux.intel.com//
v2: https://lore.kernel.org/linux-pm/20240206133605.1518373-1-stanislaw.gruszka@linux.intel.com/
v3: https://lore.kernel.org/linux-pm/20240209120625.1775017-1-stanislaw.gruszka@linux.intel.com/
====================
Link: https://lore.kernel.org/r/20240212161615.161935-1-stanislaw.gruszka@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add genetlink family bind()/unbind() callbacks when adding/removing
multicast group to/from netlink client socket via setsockopt() or
bind() syscall.
They can be used to track if consumers of netlink multicast messages
emerge or disappear. Thus, a client implementing callbacks, can now
send events only when there are active consumers, preventing unnecessary
work when none exist.
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20240212161615.161935-2-stanislaw.gruszka@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|