summaryrefslogtreecommitdiff
path: root/net/mptcp
AgeCommit message (Collapse)Author
2022-03-08mptcp: add tracepoint in mptcp_sendmsg_fragGeliang Tang
The tracepoint in get_mapping_status() only dumped the incoming mpext fields. This patch added a new tracepoint in mptcp_sendmsg_frag() to dump the outgoing mpext too. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-04mptcp: add the mibs for MP_RSTGeliang Tang
This patch added two more mibs for MP_RST, MPTCP_MIB_MPRSTTX for the MP_RST sending and MPTCP_MIB_MPRSTRX for the MP_RST receiving. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-04mptcp: add the mibs for MP_FASTCLOSEGeliang Tang
This patch added two more mibs for MP_FASTCLOSE, MPTCP_MIB_MPFASTCLOSETX for the MP_FASTCLOSE sending and MPTCP_MIB_MPFASTCLOSERX for receiving. Also added a debug log for MP_FASTCLOSE receiving, printed out the recv_key of MP_FASTCLOSE in mptcp_parse_option to show that MP_RST is received. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
net/batman-adv/hard-interface.c commit 690bb6fb64f5 ("batman-adv: Request iflink once in batadv-on-batadv check") commit 6ee3c393eeb7 ("batman-adv: Demote batadv-on-batadv skip error message") https://lore.kernel.org/all/20220302163049.101957-1-sw@simonwunderlich.de/ net/smc/af_smc.c commit 4d08b7b57ece ("net/smc: Fix cleanup when register ULP fails") commit 462791bbfa35 ("net/smc: add sysctl interface for SMC") https://lore.kernel.org/all/20220302112209.355def40@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-24mptcp: Correctly set DATA_FIN timeout when number of retransmits is largeMat Martineau
Syzkaller with UBSAN uncovered a scenario where a large number of DATA_FIN retransmits caused a shift-out-of-bounds in the DATA_FIN timeout calculation: ================================================================================ UBSAN: shift-out-of-bounds in net/mptcp/protocol.c:470:29 shift exponent 32 is too large for 32-bit type 'unsigned int' CPU: 1 PID: 13059 Comm: kworker/1:0 Not tainted 5.17.0-rc2-00630-g5fbf21c90c60 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: events mptcp_worker Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 ubsan_epilogue+0xb/0x5a lib/ubsan.c:151 __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e lib/ubsan.c:330 mptcp_set_datafin_timeout net/mptcp/protocol.c:470 [inline] __mptcp_retrans.cold+0x72/0x77 net/mptcp/protocol.c:2445 mptcp_worker+0x58a/0xa70 net/mptcp/protocol.c:2528 process_one_work+0x9df/0x16d0 kernel/workqueue.c:2307 worker_thread+0x95/0xe10 kernel/workqueue.c:2454 kthread+0x2f4/0x3b0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> ================================================================================ This change limits the maximum timeout by limiting the size of the shift, which keeps all intermediate values in-bounds. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/259 Fixes: 6477dd39e62c ("mptcp: Retransmit DATA_FIN") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-24mptcp: accurate SIOCOUTQ for fallback socketPaolo Abeni
The MPTCP SIOCOUTQ implementation is not very accurate in case of fallback: it only measures the data in the MPTCP-level write queue, but it does not take in account the subflow write queue utilization. In case of fallback the first can be empty, while the latter is not. The above produces sporadic self-tests issues and can foul legit user-space application. Fix the issue additionally querying the subflow in case of fallback. Fixes: 644807e3e462 ("mptcp: add SIOCINQ, OUTQ and OUTQNSD ioctls") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/260 Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
tools/testing/selftests/net/mptcp/mptcp_join.sh 34aa6e3bccd8 ("selftests: mptcp: add ip mptcp wrappers") 857898eb4b28 ("selftests: mptcp: add missing join check") 6ef84b1517e0 ("selftests: mptcp: more robust signal race test") https://lore.kernel.org/all/20220221131842.468893-1-broonie@kernel.org/ drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/act.h drivers/net/ethernet/mellanox/mlx5/core/en/tc/act/ct.c fb7e76ea3f3b6 ("net/mlx5e: TC, Skip redundant ct clear actions") c63741b426e11 ("net/mlx5e: Fix MPLSoUDP encap to use MPLS action information") 09bf97923224f ("net/mlx5e: TC, Move pedit_headers_action to parse_attr") 84ba8062e383 ("net/mlx5e: Test CT and SAMPLE on flow attr") efe6f961cd2e ("net/mlx5e: CT, Don't set flow flag CT for ct clear flow") 3b49a7edec1d ("net/mlx5e: TC, Reject rules with multiple CT actions") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-19mptcp: add mibs counter for ignored incoming optionsPaolo Abeni
The MPTCP in kernel path manager has some constraints on incoming addresses announce processing, so that in edge scenarios it can end-up dropping (ignoring) some of such announces. The above is not very limiting in practice since such scenarios are very uncommon and MPTCP will recover due to ADD_ADDR retransmissions. This patch adds a few MIB counters to account for such drop events to allow easier introspection of the critical scenarios. Fixes: f7efc7771eac ("mptcp: drop argument port from mptcp_pm_announce_addr") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-19mptcp: fix race in incoming ADD_ADDR option processingPaolo Abeni
If an MPTCP endpoint received multiple consecutive incoming ADD_ADDR options, mptcp_pm_add_addr_received() can overwrite the current remote address value after the PM lock is released in mptcp_pm_nl_add_addr_received() and before such address is echoed. Fix the issue caching the remote address value a little earlier and always using the cached value after releasing the PM lock. Fixes: f7efc7771eac ("mptcp: drop argument port from mptcp_pm_announce_addr") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-19mptcp: fix race in overlapping signal eventsPaolo Abeni
After commit a88c9e496937 ("mptcp: do not block subflows creation on errors"), if a signal address races with a failing subflow creation, the subflow creation failure control path can trigger the selection of the next address to be announced while the current announced is still pending. The above will cause the unintended suppression of the ADD_ADDR announce. Fix the issue skipping the to-be-suppressed announce before it will mark an endpoint as already used. The relevant announce will be triggered again when the current one will complete. Fixes: a88c9e496937 ("mptcp: do not block subflows creation on errors") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-16mptcp: don't save tcp data_ready and write space callbacksFlorian Westphal
Assign the helpers directly rather than save/restore in the context structure. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mptcp: mark ops structures as ro_after_initFlorian Westphal
These structures are initialised from the init hooks, so we can't make them 'const'. But no writes occur afterwards, so we can use ro_after_init. Also, remove bogus EXPORT_SYMBOL, the only access comes from ip stack, not from kernel modules. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mptcp: constify a bunch of of helpersPaolo Abeni
A few pm-related helpers don't touch arguments which lacking the const modifier, let's constify them. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mptcp: drop port parameter of mptcp_pm_add_addr_signalGeliang Tang
Drop the port parameter of mptcp_pm_add_addr_signal() and reflect it to avoid passing too many parameters. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mptcp: drop unneeded type casts for hmacGeliang Tang
Drop the unneeded type casts to 'unsigned long long' for printing out the hmac values in add_addr_hmac_valid() and subflow_thmac_valid(). Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mptcp: drop unused sk in mptcp_get_optionsGeliang Tang
The parameter 'sk' became useless since the code using it was dropped from mptcp_get_options() in the commit 8d548ea1dd15 ("mptcp: do not set unconditionally csum_reqd on incoming opt"). Let's drop it. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-16mptcp: add SNDTIMEO setsockopt supportGeliang Tang
Add setsockopt support for SO_SNDTIMEO_OLD and SO_SNDTIMEO_NEW to fix this error reported by the mptcp bpf selftest: (network_helpers.c:64: errno: Operation not supported) Failed to set SO_SNDTIMEO test_mptcp:FAIL:115 All error logs: (network_helpers.c:64: errno: Operation not supported) Failed to set SO_SNDTIMEO test_mptcp:FAIL:115 Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-09mptcp: netlink: process IPv6 addrs in creating listening socketsKishen Maloor
This change updates mptcp_pm_nl_create_listen_socket() to create listening sockets bound to IPv6 addresses (where IPv6 is supported). Fixes: 1729cf186d8a ("mptcp: create the listening socket for new port") Acked-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Kishen Maloor <kishen.maloor@intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-04mptcp: allow to use port and non-signal in set_flagsGeliang Tang
It's illegal to use both port and non-signal flags for adding address. But it's legal to use both of them for setting flags, which always uses non-signal flags, backup or fullmesh. This patch moves this non-signal flag with port check from mptcp_pm_parse_addr() to mptcp_nl_cmd_add_addr(). Do the check only when adding addresses, not setting flags or deleting addresses. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-03mptcp: set fullmesh flag in pm_netlinkGeliang Tang
This patch added the fullmesh flag setting support in pm_netlink. If the fullmesh flag of the address is changed, remove all the related subflows, update the fullmesh flag and create subflows again. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-03mptcp: print out reset infos of MP_RSTGeliang Tang
This patch printed out the reset infos, reset_transient and reset_reason, of MP_RST in mptcp_parse_option() to show that MP_RST is received. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-03mptcp: clarify when options can be usedMatthieu Baerts
RFC8684 doesn't seem to clearly specify which MPTCP options can be used together. Some options are mutually exclusive -- e.g. MP_CAPABLE and MP_JOIN --, some can be used together -- e.g. DSS + MP_PRIO --, some can but we prefer not to -- e.g. DSS + ADD_ADDR -- and some have to be used together at some points -- e.g. MP_FAIL and DSS. We need to clarify this as a base before allowing other modifications. For example, does it make sense to send a RM_ADDR with an MPC or MPJ? This remains open for possible future discussions. Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-03mptcp: reduce branching when writing MP_FAIL optionMatthieu Baerts
MP_FAIL should be use in very rare cases, either when the TCP RST flag is set -- with or without an MP_RST -- or with a DSS, see mptcp_established_options(). Here, we do the same in mptcp_write_options(). Co-developed-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-03mptcp: move the declarations of ssk and subflowGeliang Tang
Move the declarations of ssk and subflow in MP_FAIL and MP_PRIO to the beginning of the function mptcp_write_options(). Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-21mptcp: Use struct_group() to avoid cross-field memset()Kees Cook
In preparation for FORTIFY_SOURCE performing compile-time and run-time field bounds checking for memcpy(), memmove(), and memset(), avoid intentionally writing across neighboring fields. Use struct_group() to capture the fields to be reset, so that memset() can be appropriately bounds-checked by the compiler. Cc: Matthieu Baerts <matthieu.baerts@tessares.net> Cc: mptcp@lists.linux.dev Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20220121073935.1154263-1-keescook@chromium.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-20mptcp: fix removing ids bitmap settingGeliang Tang
In mptcp_pm_nl_rm_addr_or_subflow(), the bit of rm_list->ids[i] in the id_avail_bitmap should be set, not rm_list->ids[1]. This patch fixed it. Fixes: 86e39e04482b ("mptcp: keep track of local endpoint still available for each msk") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-20mptcp: fix msk traversal in mptcp_nl_cmd_set_flags()Paolo Abeni
The MPTCP endpoint list is under RCU protection, guarded by the pernet spinlock. mptcp_nl_cmd_set_flags() traverses the list without acquiring the spin-lock nor under the RCU critical section. This change addresses the issue performing the lookup and the endpoint update under the pernet spinlock. Fixes: 0f9f696a502e ("mptcp: add set_flags command in PM netlink") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in fixes directly in prep for the 5.17 merge window. No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07mptcp: reuse __mptcp_make_csum in validate_data_csumGeliang Tang
This patch reused __mptcp_make_csum() in validate_data_csum() instead of open-coding. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07mptcp: change the parameter of __mptcp_make_csumGeliang Tang
This patch changed the type of the last parameter of __mptcp_make_csum() from __sum16 to __wsum. And export this function in protocol.h. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-07mptcp: Check reclaim amount before reducing allocationMat Martineau
syzbot found a page counter underflow that was triggered by MPTCP's reclaim code: page_counter underflow: -4294964789 nr_pages=4294967295 WARNING: CPU: 2 PID: 3785 at mm/page_counter.c:56 page_counter_cancel+0xcf/0xe0 mm/page_counter.c:56 Modules linked in: CPU: 2 PID: 3785 Comm: kworker/2:6 Not tainted 5.16.0-rc1-syzkaller #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:page_counter_cancel+0xcf/0xe0 mm/page_counter.c:56 Code: c7 04 24 00 00 00 00 45 31 f6 eb 97 e8 2a 2b b5 ff 4c 89 ea 48 89 ee 48 c7 c7 00 9e b8 89 c6 05 a0 c1 ba 0b 01 e8 95 e4 4b 07 <0f> 0b eb a8 4c 89 e7 e8 25 5a fb ff eb c7 0f 1f 00 41 56 41 55 49 RSP: 0018:ffffc90002d4f918 EFLAGS: 00010082 RAX: 0000000000000000 RBX: ffff88806a494120 RCX: 0000000000000000 RDX: ffff8880688c41c0 RSI: ffffffff815e8f28 RDI: fffff520005a9f15 RBP: ffffffff000009cb R08: 0000000000000000 R09: 0000000000000000 R10: ffffffff815e2cfe R11: 0000000000000000 R12: ffff88806a494120 R13: 00000000ffffffff R14: 0000000000000000 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2de21000 CR3: 000000005ad59000 CR4: 0000000000150ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> page_counter_uncharge+0x2e/0x60 mm/page_counter.c:160 drain_stock+0xc1/0x180 mm/memcontrol.c:2219 refill_stock+0x139/0x2f0 mm/memcontrol.c:2271 __sk_mem_reduce_allocated+0x24d/0x550 net/core/sock.c:2945 __mptcp_rmem_reclaim net/mptcp/protocol.c:167 [inline] __mptcp_mem_reclaim_partial+0x124/0x410 net/mptcp/protocol.c:975 mptcp_mem_reclaim_partial net/mptcp/protocol.c:982 [inline] mptcp_alloc_tx_skb net/mptcp/protocol.c:1212 [inline] mptcp_sendmsg_frag+0x18c6/0x2190 net/mptcp/protocol.c:1279 __mptcp_push_pending+0x232/0x720 net/mptcp/protocol.c:1545 mptcp_release_cb+0xfe/0x200 net/mptcp/protocol.c:2975 release_sock+0xb4/0x1b0 net/core/sock.c:3306 mptcp_worker+0x51e/0xc10 net/mptcp/protocol.c:2443 process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298 worker_thread+0x658/0x11f0 kernel/workqueue.c:2445 kthread+0x405/0x4f0 kernel/kthread.c:327 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> __mptcp_mem_reclaim_partial() could call __mptcp_rmem_reclaim() with a negative value, which passed that negative value to __sk_mem_reduce_allocated() and triggered the splat above. Check for a reclaim amount that is positive and large enough for __mptcp_rmem_reclaim() to actually adjust rmem_fwd_alloc (much like the sk_mem_reclaim_partial() code the function is based on). v2: Use '>' instead of '>=', since SK_MEM_QUANTUM - 1 would get right-shifted into nothing by __mptcp_rmem_reclaim. Fixes: 6511882cdd82 ("mptcp: allocate fwd memory separately on the rx and tx path") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/252 Reported-and-tested-by: syzbot+bc9e2d2dbcb347dd215a@syzkaller.appspotmail.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: fix a DSS option writing errorGeliang Tang
'ptr += 1;' was omitted in the original code. If the DSS is the last option -- which is what we have most of the time -- that's not an issue. But it is if we need to send something else after like a RM_ADDR or an MP_PRIO. Fixes: 1bff1e43a30e ("mptcp: optimize out option generation") Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: fix opt size when sending DSS + MP_FAILMatthieu Baerts
When these two options had to be sent -- which is not common -- the DSS size was not being taken into account in the remaining size. Additionally in this situation, the reported size was only the one of the MP_FAIL which can cause issue if at the end, we need to write more in the TCP options than previously said. Here we use a dedicated variable for MP_FAIL size to keep the WARN_ON_ONCE() just after. Fixes: c25aeb4e0953 ("mptcp: MP_FAIL suboption sending") Acked-and-tested-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: avoid atomic bit manipulation when possiblePaolo Abeni
Currently the msk->flags bitmask carries both state for the mptcp_release_cb() - mostly touched under the mptcp data lock - and others state info touched even outside such lock scope. As a consequence, msk->flags is always manipulated with atomic operations. This change splits such bitmask in two separate fields, so that we use plain bit operations when touching the cb-related info. The MPTCP_PUSH_PENDING bit needs additional care, as it is the only CB related field currently accessed either under the mptcp data lock or the mptcp socket lock. Let's add another mask just for such bit's sake. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: cleanup MPJ subflow list handlingPaolo Abeni
We can simplify the join list handling leveraging the mptcp_release_cb(): if we can acquire the msk socket lock at mptcp_finish_join time, move the new subflow directly into the conn_list, otherwise place it on join_list and let the release_cb process such list. Since pending MPJ connection are now always processed in a timely way, we can avoid flushing the join list every time we have to process all the current subflows. Additionally we can now use the mptcp data lock to protect the join_list, removing the additional spin lock. Finally, the MPJ handshake is now always finalized under the msk socket lock, we can drop the additional synchronization between mptcp_finish_join() and mptcp_close(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: do not block subflows creation on errorsPaolo Abeni
If the MPTCP configuration allows for multiple subflows creation, and the first additional subflows never reach the fully established status - e.g. due to packets drop or reset - the in kernel path manager do not move to the next subflow. This patch introduces a new PM helper to cope with MPJ subflow creation failure and delay and hook it where appropriate. Such helper triggers additional subflow creation, as needed and updates the PM subflow counter, if the current one is closing. Additionally start all the needed additional subflows as soon as the MPTCP socket is fully established, so we don't have to cope with slow MPJ handshake blocking the next subflow creation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: keep track of local endpoint still available for each mskPaolo Abeni
Include into the path manager status a bitmap tracking the list of local endpoints still available - not yet used - for the relevant mptcp socket. Keep such map updated at endpoint creation/deletion time, so that we can easily skip already used endpoint at local address selection time. The endpoint used by the initial subflow is lazyly accounted at subflow creation time: the usage bitmap is be up2date before endpoint selection and we avoid such unneeded task in some relevant scenarios - e.g. busy servers accepting incoming subflows but not creating any additional ones nor annuncing additional addresses. Overall this allows for fair local endpoints usage in case of subflow failure. As a side effect, this patch also enforces that each endpoint is used at most once for each mptcp connection. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: clean-up MPJ option writingPaolo Abeni
Check for all MPJ variant at once, this reduces the number of conditionals traversed on average and will simplify the next patch. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: fix per socket endpoint accountingPaolo Abeni
Since full-mesh endpoint support, the reception of a single ADD_ADDR option can cause multiple subflows creation. When such option is accepted we increment 'add_addr_accepted' by one. When we received a paired RM_ADDR option, we deleted all the relevant subflows, decrementing 'add_addr_accepted' by one for each of them. We have a similar issue for 'local_addr_used' Fix them moving the pm endpoint accounting outside the subflow traversal. Fixes: 1a0d6136c5f0 ("mptcp: local addresses fullmesh") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: implement support for user-space disconnectPaolo Abeni
Handle explicitly AF_UNSPEC in mptcp_stream_connnect() to allow user-space to disconnect established MPTCP connections Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: cleanup accept and pollPaolo Abeni
After the previous patch, msk->subflow will never be deleted during the whole msk lifetime. We don't need anymore to acquire references to it in mptcp_stream_accept() and we can use the listener subflow accept queue to simplify mptcp_poll() for listener socket. Overall this removes a lock pair and 4 more atomic operations per accept(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: full disconnect implementationPaolo Abeni
The current mptcp_disconnect() implementation lacks several steps, we additionally need to reset the msk socket state and flush the subflow list. Factor out the needed helper to avoid code duplication. Additionally ensure that the initial subflow is disposed only after mptcp_close(), just reset it at disconnect time. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: implement fastclose xmit pathPaolo Abeni
Allow the MPTCP xmit path to add MP_FASTCLOSE suboption on RST egress packets. Additionally reorder related options writing to reduce the number of conditionals required in the fast path. Co-developed-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: keep snd_una updated for fallback socketPaolo Abeni
After shutdown, for fallback MPTCP sockets, we always have write_seq == snd_una+1 The above will foul OUTQ ioctl(). Keep snd_una in sync with write_seq even after shutdown. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-17mptcp: clean up harmless false expressionsJean Sacren
entry->addr.id is u8 with a range from 0 to 255 and MAX_ADDR_ID is 255. We should drop both false expressions of (entry->addr.id > MAX_ADDR_ID). We should also remove the obsolete parentheses in the first if branch. Use U8_MAX for MAX_ADDR_ID and add a comment to show the link to mptcp_addr_info.id as suggested by Mr. Matthieu Baerts. Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Jean Sacren <sakiwit@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-17mptcp: enforce HoL-blocking estimationPaolo Abeni
The MPTCP packet scheduler has sub-optimal behavior with asymmetric subflows: if the faster subflow-level cwin is closed, the packet scheduler can enqueue "too much" data on a slower subflow. When all the data on the faster subflow is acked, if the mptcp-level cwin is closed, and link utilization becomes suboptimal. The solution is implementing blest-like[1] HoL-blocking estimation, transmitting only on the subflow with the shorter estimated time to flush the queued memory. If such subflows cwin is closed, we wait even if other subflows are available. This is quite simpler than the original blest implementation, as we leverage the pacing rate provided by the TCP socket. To get a more accurate estimation for the subflow linger-time, we maintain a per-subflow weighted average of such info. Additionally drop magic numbers usage in favor of newly defined macros and use more meaningful names for status variable. [1] http://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/137 Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-14mptcp: fix deadlock in __mptcp_push_pending()Maxim Galaganov
__mptcp_push_pending() may call mptcp_flush_join_list() with subflow socket lock held. If such call hits mptcp_sockopt_sync_all() then subsequently __mptcp_sockopt_sync() could try to lock the subflow socket for itself, causing a deadlock. sysrq: Show Blocked State task:ss-server state:D stack: 0 pid: 938 ppid: 1 flags:0x00000000 Call Trace: <TASK> __schedule+0x2d6/0x10c0 ? __mod_memcg_state+0x4d/0x70 ? csum_partial+0xd/0x20 ? _raw_spin_lock_irqsave+0x26/0x50 schedule+0x4e/0xc0 __lock_sock+0x69/0x90 ? do_wait_intr_irq+0xa0/0xa0 __lock_sock_fast+0x35/0x50 mptcp_sockopt_sync_all+0x38/0xc0 __mptcp_push_pending+0x105/0x200 mptcp_sendmsg+0x466/0x490 sock_sendmsg+0x57/0x60 __sys_sendto+0xf0/0x160 ? do_wait_intr_irq+0xa0/0xa0 ? fpregs_restore_userregs+0x12/0xd0 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f9ba546c2d0 RSP: 002b:00007ffdc3b762d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007f9ba56c8060 RCX: 00007f9ba546c2d0 RDX: 000000000000077a RSI: 0000000000e5e180 RDI: 0000000000000234 RBP: 0000000000cc57f0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f9ba56c8060 R13: 0000000000b6ba60 R14: 0000000000cc7840 R15: 41d8685b1d7901b8 </TASK> Fix the issue by using __mptcp_flush_join_list() instead of plain mptcp_flush_join_list() inside __mptcp_push_pending(), as suggested by Florian. The sockopt sync will be deferred to the workqueue. Fixes: 1b3e7ede1365 ("mptcp: setsockopt: handle SO_KEEPALIVE and SO_PRIORITY") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/244 Suggested-by: Florian Westphal <fw@strlen.de> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Maxim Galaganov <max@internet.ru> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-14mptcp: clear 'kern' flag from fallback socketsFlorian Westphal
The mptcp ULP extension relies on sk->sk_sock_kern being set correctly: It prevents setsockopt(fd, IPPROTO_TCP, TCP_ULP, "mptcp", 6); from working for plain tcp sockets (any userspace-exposed socket). But in case of fallback, accept() can return a plain tcp sk. In such case, sk is still tagged as 'kernel' and setsockopt will work. This will crash the kernel, The subflow extension has a NULL ctx->conn mptcp socket: BUG: KASAN: null-ptr-deref in subflow_data_ready+0x181/0x2b0 Call Trace: tcp_data_ready+0xf8/0x370 [..] Fixes: cf7da0d66cc1 ("mptcp: Create SUBFLOW socket for incoming connections") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>