Age | Commit message (Collapse) | Author |
|
SPSS should've been SSPS.
Fixes: c9c14be664cf ("usb: typec: tipd: Switch CD321X power state to S0")
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Hector Martin <marcan@marcan.st>
Link: https://lore.kernel.org/r/20211120030717.84287-2-marcan@marcan.st
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Fix the circular lock dependency and unbalanced unlock of addess0_mutex
introduced when fixing an address0_mutex enumeration retry race in commit
ae6dc22d2d1 ("usb: hub: Fix usb enumeration issue due to address0 race")
Make sure locking order between port_dev->status_lock and address0_mutex
is correct, and that address0_mutex is not unlocked in hub_port_connect
"done:" codepath which may be reached without locking address0_mutex
Fixes: 6ae6dc22d2d1 ("usb: hub: Fix usb enumeration issue due to address0 race")
Cc: <stable@vger.kernel.org>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Link: https://lore.kernel.org/r/20211123101656.1113518-1-mathias.nyman@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
After commit 4adcf2e5829f ("cpufreq: intel_pstate: Add ->offline and
->online callbacks") the EPP value set by the "performance" scaling
algorithm in the active mode is not restored after an offline/online
cycle which replaces it with the saved EPP value coming from user
space.
Address this issue by forcing intel_pstate_hwp_set() to set a new
EPP value when it runs first time after online.
Fixes: 4adcf2e5829f ("cpufreq: intel_pstate: Add ->offline and ->online callbacks")
Link: https://lore.kernel.org/linux-pm/adc7132c8655bd4d1c8b6129578e931a14fe1db2.camel@linux.intel.com/
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Commit fbdc21e9b038 ("cpufreq: intel_pstate: Add Icelake servers
support in no-HWP mode") enabled the use of Intel P-State driver
for Ice Lake servers.
But it doesn't cover the case when OS can't control P-States.
Therefore, for Ice Lake server, if MSR_MISC_PWR_MGMT bits 8 or 18
are enabled, then the Intel P-State driver should exit as OS can't
control P-States.
Fixes: fbdc21e9b038 ("cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Currently mvpp2_xdp_setup won't allow attaching XDP program if
mtu > ETH_DATA_LEN (1500).
The mvpp2_change_mtu on the other hand checks whether
MVPP2_RX_PKT_SIZE(mtu) > MVPP2_BM_LONG_PKT_SIZE.
These two checks are semantically different.
Moreover this limit can be increased to MVPP2_MAX_RX_BUF_SIZE, since in
mvpp2_rx we have
xdp.data = data + MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM;
xdp.frame_sz = PAGE_SIZE;
Change the checks to check whether
mtu > MVPP2_MAX_RX_BUF_SIZE
Fixes: 07dd0a7aae7f ("mvpp2: add basic XDP support")
Signed-off-by: Marek Behún <kabel@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Calling ipa_cmd_pipeline_clear() after stopping the channel
underlying the AP<-modem RX endpoint can lead to a deadlock.
This occurs in the ->runtime_suspend device power operation for the
IPA driver. While this callback is in progress, any other requests
for power will block until the callback returns.
Stopping the AP<-modem RX channel does not prevent the modem from
sending another packet to this endpoint. If a packet arrives for an
RX channel when the channel is stopped, an SUSPEND IPA interrupt
condition will be pending. Handling an IPA interrupt requires
power, so ipa_isr_thread() calls pm_runtime_get_sync() first thing.
The problem occurs because a "pipeline clear" command will not
complete while such a SUSPEND interrupt condition exists. So the
SUSPEND IPA interrupt handler won't proceed until it gets power;
that won't happen until the ->runtime_suspend callback (and its
"pipeline clear" command) completes; and that can't happen while
the SUSPEND interrupt condition exists.
It turns out that in this case there is no need to use the "pipeline
clear" command. There are scenarios in which clearing the pipeline
is required while suspending, but those are not (yet) supported
upstream. So a simple fix, avoiding the potential deadlock, is to
stop calling ipa_cmd_pipeline_clear() in ipa_endpoint_suspend().
This removes the only user of ipa_cmd_pipeline_clear(), so get rid
of that function. It can be restored again whenever it's needed.
This is basically a manual revert along with an explanation for
commit 6cb63ea6a39ea ("net: ipa: introduce ipa_cmd_tag_process()").
Fixes: 6cb63ea6a39ea ("net: ipa: introduce ipa_cmd_tag_process()")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The smsc95xx driver is dropping phy speed settings and causing a stack
trace at device unbind:
[ 536.379147] smsc95xx 2-1:1.0 eth1: unregister 'smsc95xx' usb-ci_hdrc.2-1, smsc95xx USB 2.0 Ethernet
[ 536.425029] ------------[ cut here ]------------
[ 536.429650] WARNING: CPU: 0 PID: 439 at fs/kernfs/dir.c:1535 kernfs_remove_by_name_ns+0xb8/0xc0
[ 536.438416] kernfs: can not remove 'attached_dev', no directory
[ 536.444363] Modules linked in: xts dm_crypt dm_mod atmel_mxt_ts smsc95xx usbnet
[ 536.451748] CPU: 0 PID: 439 Comm: sh Tainted: G W 5.15.0 #1
[ 536.458636] Hardware name: Freescale i.MX53 (Device Tree Support)
[ 536.464735] Backtrace:
[ 536.467190] [<80b1c904>] (dump_backtrace) from [<80b1cb48>] (show_stack+0x20/0x24)
[ 536.474787] r7:000005ff r6:8035b294 r5:600f0013 r4:80d8af78
[ 536.480449] [<80b1cb28>] (show_stack) from [<80b1f764>] (dump_stack_lvl+0x48/0x54)
[ 536.488035] [<80b1f71c>] (dump_stack_lvl) from [<80b1f788>] (dump_stack+0x18/0x1c)
[ 536.495620] r5:00000009 r4:80d9b820
[ 536.499198] [<80b1f770>] (dump_stack) from [<80124fac>] (__warn+0xfc/0x114)
[ 536.506187] [<80124eb0>] (__warn) from [<80b1d21c>] (warn_slowpath_fmt+0xa8/0xdc)
[ 536.513688] r7:000005ff r6:80d9b820 r5:80d9b8e0 r4:83744000
[ 536.519349] [<80b1d178>] (warn_slowpath_fmt) from [<8035b294>] (kernfs_remove_by_name_ns+0xb8/0xc0)
[ 536.528416] r9:00000001 r8:00000000 r7:824926dc r6:00000000 r5:80df6c2c r4:00000000
[ 536.536162] [<8035b1dc>] (kernfs_remove_by_name_ns) from [<80b1f56c>] (sysfs_remove_link+0x4c/0x50)
[ 536.545225] r6:7f00f02c r5:80df6c2c r4:83306400
[ 536.549845] [<80b1f520>] (sysfs_remove_link) from [<806f9c8c>] (phy_detach+0xfc/0x11c)
[ 536.557780] r5:82492000 r4:83306400
[ 536.561359] [<806f9b90>] (phy_detach) from [<806f9cf8>] (phy_disconnect+0x4c/0x58)
[ 536.568943] r7:824926dc r6:7f00f02c r5:82492580 r4:83306400
[ 536.574604] [<806f9cac>] (phy_disconnect) from [<7f00a310>] (smsc95xx_disconnect_phy+0x30/0x38 [smsc95xx])
[ 536.584290] r5:82492580 r4:82492580
[ 536.587868] [<7f00a2e0>] (smsc95xx_disconnect_phy [smsc95xx]) from [<7f001570>] (usbnet_stop+0x70/0x1a0 [usbnet])
[ 536.598161] r5:82492580 r4:82492000
[ 536.601740] [<7f001500>] (usbnet_stop [usbnet]) from [<808baa70>] (__dev_close_many+0xb4/0x12c)
[ 536.610466] r8:83744000 r7:00000000 r6:83744000 r5:83745b74 r4:82492000
[ 536.617170] [<808ba9bc>] (__dev_close_many) from [<808bab78>] (dev_close_many+0x90/0x120)
[ 536.625365] r7:00000001 r6:83745b74 r5:83745b8c r4:82492000
[ 536.631026] [<808baae8>] (dev_close_many) from [<808bf408>] (unregister_netdevice_many+0x15c/0x704)
[ 536.640094] r9:00000001 r8:81130b98 r7:83745b74 r6:83745bc4 r5:83745b8c r4:82492000
[ 536.647840] [<808bf2ac>] (unregister_netdevice_many) from [<808bfa50>] (unregister_netdevice_queue+0xa0/0xe8)
[ 536.657775] r10:8112bcc0 r9:83306c00 r8:83306c80 r7:8291e420 r6:83744000 r5:00000000
[ 536.665608] r4:82492000
[ 536.668143] [<808bf9b0>] (unregister_netdevice_queue) from [<808bfac0>] (unregister_netdev+0x28/0x30)
[ 536.677381] r6:7f01003c r5:82492000 r4:82492000
[ 536.682000] [<808bfa98>] (unregister_netdev) from [<7f000b40>] (usbnet_disconnect+0x64/0xdc [usbnet])
[ 536.691241] r5:82492000 r4:82492580
[ 536.694819] [<7f000adc>] (usbnet_disconnect [usbnet]) from [<8076b958>] (usb_unbind_interface+0x80/0x248)
[ 536.704406] r5:7f01003c r4:83306c80
[ 536.707984] [<8076b8d8>] (usb_unbind_interface) from [<8061765c>] (device_release_driver_internal+0x1c4/0x1cc)
[ 536.718005] r10:8112bcc0 r9:80dff1dc r8:83306c80 r7:83744000 r6:7f01003c r5:00000000
[ 536.725838] r4:8291e420
[ 536.728373] [<80617498>] (device_release_driver_internal) from [<80617684>] (device_release_driver+0x20/0x24)
[ 536.738302] r7:83744000 r6:810d4f4c r5:8291e420 r4:8176ae30
[ 536.743963] [<80617664>] (device_release_driver) from [<806156cc>] (bus_remove_device+0xf0/0x148)
[ 536.752858] [<806155dc>] (bus_remove_device) from [<80610018>] (device_del+0x198/0x41c)
[ 536.760880] r7:83744000 r6:8116e2e4 r5:8291e464 r4:8291e420
[ 536.766542] [<8060fe80>] (device_del) from [<80768fe8>] (usb_disable_device+0xcc/0x1e0)
[ 536.774576] r10:8112bcc0 r9:80dff1dc r8:00000001 r7:8112bc48 r6:8291e400 r5:00000001
[ 536.782410] r4:83306c00
[ 536.784945] [<80768f1c>] (usb_disable_device) from [<80769c30>] (usb_set_configuration+0x514/0x8dc)
[ 536.794011] r10:00000000 r9:00000000 r8:832c3600 r7:00000004 r6:810d5688 r5:00000000
[ 536.801844] r4:83306c00
[ 536.804379] [<8076971c>] (usb_set_configuration) from [<80775fac>] (usb_generic_driver_disconnect+0x34/0x38)
[ 536.814236] r10:832c3610 r9:83745ef8 r8:832c3600 r7:00000004 r6:810d5688 r5:83306c00
[ 536.822069] r4:83306c00
[ 536.824605] [<80775f78>] (usb_generic_driver_disconnect) from [<8076b850>] (usb_unbind_device+0x30/0x70)
[ 536.834100] r5:83306c00 r4:810d5688
[ 536.837678] [<8076b820>] (usb_unbind_device) from [<8061765c>] (device_release_driver_internal+0x1c4/0x1cc)
[ 536.847432] r5:822fb480 r4:83306c80
[ 536.851009] [<80617498>] (device_release_driver_internal) from [<806176a8>] (device_driver_detach+0x20/0x24)
[ 536.860853] r7:00000004 r6:810d4f4c r5:810d5688 r4:83306c80
[ 536.866515] [<80617688>] (device_driver_detach) from [<80614d98>] (unbind_store+0x70/0xe4)
[ 536.874793] [<80614d28>] (unbind_store) from [<80614118>] (drv_attr_store+0x30/0x3c)
[ 536.882554] r7:00000000 r6:00000000 r5:83739200 r4:80614d28
[ 536.888217] [<806140e8>] (drv_attr_store) from [<8035cb68>] (sysfs_kf_write+0x48/0x54)
[ 536.896154] r5:83739200 r4:806140e8
[ 536.899732] [<8035cb20>] (sysfs_kf_write) from [<8035be84>] (kernfs_fop_write_iter+0x11c/0x1d4)
[ 536.908446] r5:83739200 r4:00000004
[ 536.912024] [<8035bd68>] (kernfs_fop_write_iter) from [<802b87fc>] (vfs_write+0x258/0x3e4)
[ 536.920317] r10:00000000 r9:83745f58 r8:83744000 r7:00000000 r6:00000004 r5:00000000
[ 536.928151] r4:82adacc0
[ 536.930687] [<802b85a4>] (vfs_write) from [<802b8b0c>] (ksys_write+0x74/0xf4)
[ 536.937842] r10:00000004 r9:007767a0 r8:83744000 r7:00000000 r6:00000000 r5:82adacc0
[ 536.945676] r4:82adacc0
[ 536.948213] [<802b8a98>] (ksys_write) from [<802b8ba4>] (sys_write+0x18/0x1c)
[ 536.955367] r10:00000004 r9:83744000 r8:80100244 r7:00000004 r6:76f47b58 r5:76fc0350
[ 536.963200] r4:00000004
[ 536.965735] [<802b8b8c>] (sys_write) from [<80100060>] (ret_fast_syscall+0x0/0x48)
[ 536.973320] Exception stack(0x83745fa8 to 0x83745ff0)
[ 536.978383] 5fa0: 00000004 76fc0350 00000001 007767a0 00000004 00000000
[ 536.986569] 5fc0: 00000004 76fc0350 76f47b58 00000004 76f47c7c 76f48114 00000000 7e87991c
[ 536.994753] 5fe0: 00000498 7e879908 76e6dce8 76eca2e8
[ 536.999922] ---[ end trace 9b835d809816b435 ]---
The driver should not be connecting and disconnecting the PHY when the
device is opened and closed, it should be stopping and starting the PHY. The
phy should be connected as part of binding and disconnected during
unbinding.
As this results in the PHY not being reset during open, link speed, etc.
settings set prior to the link coming up are now not being lost.
It is necessary for phy_stop() to only be called when the phydev still
exists (resolving the above stack trace). When unbinding, ".unbind" will be
called prior to ".stop", with phy_disconnect() already having called
phy_stop() before the phydev becomes inaccessible.
Signed-off-by: Martyn Welch <martyn.welch@collabora.com>
Cc: Steve Glendinning <steve.glendinning@shawell.net>
Cc: UNGLinuxDriver@microchip.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: stable@kernel.org # v5.15
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
During the process of driver probing, probe function should return < 0
for failure, otherwise kernel will treat value == 0 as success.
Therefore, we should set err to -EINVAL when
adapter->registered_device_map is NULL. Otherwise kernel will assume
that driver has been successfully probed and will cause unexpected
errors.
Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The original changes brakes MAC address assignment on older chip
versions (see bug report [0]), and it brakes random MAC assignment.
is_valid_ether_addr() requires that its argument is word-aligned.
Add the missing alignment to array mac_addr.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=215087
Fixes: 1c5d09d58748 ("ethernet: r8169: use eth_hw_addr_set()")
Reported-by: Richard Herbert <rherbert@sympatico.ca>
Tested-by: Richard Herbert <rherbert@sympatico.ca>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2021-11-22
Maciej Fijalkowski says:
Here are the two fixes for issues around ethtool's set_channels()
callback for ice driver. Both are related to XDP resources. First one
corrects the size of vsi->txq_map that is used to track the usage of Tx
resources and the second one prevents the wrong refcounting of bpf_prog.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Alex Elder says:
====================
net: ipa: prevent shutdown during setup
The setup phase of the IPA driver occurs in one of two ways.
Normally, it is done directly by the main driver probe function.
But some systems (those having a "modem-init" DTS property) don't
start setup until an SMP2P interrupt (sent by the modem) arrives.
Because it isn't performed by the probe function, setup on
"modem-init" systems could be underway at the time a driver
remove (or shutdown) request arrives (or vice-versa). This
situation can lead to hardware state not being cleaned up
properly.
This series addresses this problem by having the driver remove
function disable the setup interrupt. A consequence of this is
that setup will complete if it is underway when the remove function
is called.
So now, when removing the driver, setup:
- will have already completed;
- is underway, and will complete before proceeding; or
- will not have begun (and will not occur).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The IPA setup_complete flag is set at the end of ipa_setup(), when
the setup phase of initialization has completed successfully. This
occurs as part of driver probe processing, or (if "modem-init" is
specified in the DTS file) it is triggered by the "ipa-setup-ready"
SMP2P interrupt generated by the modem.
In the latter case, it's possible for driver shutdown (or remove) to
begin while setup processing is underway, and this can't be allowed.
The problem is that the setup_complete flag is not adequate to signal
that setup is underway.
If setup_complete is set, it will never be un-set, so that case is
not a problem. But if setup_complete is false, there's a chance
setup is underway.
Because setup is triggered by an interrupt on a "modem-init" system,
there is a simple way to ensure the value of setup_complete is safe
to read. The threaded handler--if it is executing--will complete as
part of a request to disable the "ipa-modem-ready" interrupt. This
means that ipa_setup() (which is called from the handler) will run
to completion if it was underway, or will never be called otherwise.
The request to disable the "ipa-setup-ready" interrupt is currently
made within ipa_modem_stop(). Instead, disable the interrupt
outside that function in the two places it's called. In the case of
ipa_remove(), this ensures the setup_complete flag is safe to read
before we read it.
Rename ipa_smp2p_disable() to be ipa_smp2p_irq_disable_setup(), to be
more specific about its effect.
Fixes: 530f9216a953 ("soc: qcom: ipa: AP/modem communications")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We currently maintain a "disabled" Boolean flag to determine whether
the "ipa-setup-ready" SMP2P IRQ handler does anything. That flag
must be accessed under protection of a mutex.
Instead, disable the SMP2P interrupt when requested, which prevents
the interrupt handler from ever being called. More importantly, it
synchronizes a thread disabling the interrupt with the completion of
the interrupt handler in case they run concurrently.
Use the IPA setup_complete flag rather than the disabled flag in the
handler to determine whether to ignore any interrupts arriving after
the first.
Rename the "disabled" flag to be "setup_disabled", to be specific
about its purpose.
Fixes: 530f9216a953 ("soc: qcom: ipa: AP/modem communications")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sr is a repeated start, it is used in both I2C and SMBus protocols.
Provide its description and replace start ("S") conditions with repeated
start ("Sr") conditions when relevant. This allows the documentation to
match the SMBus specification available at [1].
[1] http://www.smbus.org/specs/SMBus_3_1_20180319.pdf
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
|
|
Ido Schimmel says:
====================
mlxsw: Two small fixes
Patch #1 fixes a recent regression that prevents the driver from loading
with old firmware versions.
Patch #2 protects the driver from a NULL pointer dereference when
working on top of a buggy firmware. This was never observed in an actual
system, only on top of an emulator during development.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When processing port up/down events generated by the device's firmware,
the driver protects itself from events reported for non-existent local
ports, but not the CPU port (local port 0), which exists, but lacks a
netdev.
This can result in a NULL pointer dereference when calling
netif_carrier_{on,off}().
Fix this by bailing early when processing an event reported for the CPU
port. Problem was only observed when running on top of a buggy emulator.
Fixes: 28b1987ef506 ("mlxsw: spectrum: Register CPU port with devlink")
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The driver fails to load with old firmware versions that cannot report
the maximum number of RIF MAC profiles [1].
Fix this by defaulting to a maximum of a single profile in such
situations, as multiple profiles are not supported by old firmware
versions.
[1]
mlxsw_spectrum 0000:03:00.0: cannot register bus device
mlxsw_spectrum: probe of 0000:03:00.0 failed with error -5
Fixes: 1c375ffb2efab ("mlxsw: spectrum_router: Expose RIF MAC profiles to devlink resource")
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reported-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Tony Lu says:
====================
smc: Fixes for closing process and minor cleanup
Patch 1 is a minor cleanup for local struct sock variables.
Patch 2 ensures the active closing side enters TIME_WAIT.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The side that actively closed socket, it's clcsock doesn't enter
TIME_WAIT state, but the passive side does it. It should show the same
behavior as TCP sockets.
Consider this, when client actively closes the socket, the clcsock in
server enters TIME_WAIT state, which means the address is occupied and
won't be reused before TIME_WAIT dismissing. If we restarted server, the
service would be unavailable for a long time.
To solve this issue, shutdown the clcsock in [A], perform the TCP active
close progress first, before the passive closed side closing it. So that
the actively closed side enters TIME_WAIT, not the passive one.
Client | Server
close() // client actively close |
smc_release() |
smc_close_active() // PEERCLOSEWAIT1 |
smc_close_final() // abort or closed = 1|
smc_cdc_get_slot_and_msg_send() |
[A] |
|smc_cdc_msg_recv_action() // ACTIVE
| queue_work(smc_close_wq, &conn->close_work)
| smc_close_passive_work() // PROCESSABORT or APPCLOSEWAIT1
| smc_close_passive_abort_received() // only in abort
|
|close() // server recv zero, close
| smc_release() // PROCESSABORT or APPCLOSEWAIT1
| smc_close_active()
| smc_close_abort() or smc_close_final() // CLOSED
| smc_cdc_get_slot_and_msg_send() // abort or closed = 1
smc_cdc_msg_recv_action() | smc_clcsock_release()
queue_work(smc_close_wq, &conn->close_work) | sock_release(tcp) // actively close clc, enter TIME_WAIT
smc_close_passive_work() // PEERCLOSEWAIT1 | smc_conn_free()
smc_close_passive_abort_received() // CLOSED|
smc_conn_free() |
smc_clcsock_release() |
sock_release(tcp) // passive close clc |
Link: https://www.spinics.net/lists/netdev/msg780407.html
Fixes: b38d732477e4 ("smc: socket closing and linkgroup cleanup")
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There remains some variables to replace with local struct sock. So clean
them up all.
Fixes: 3163c5071f25 ("net/smc: use local struct sock variables consistently")
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When we try to add an IPv6 nexthop and IPv6 is not enabled
(!CONFIG_IPV6) we'll hit a NULL pointer dereference[1] in the error path
of nh_create_ipv6() due to calling ipv6_stub->fib6_nh_release. The bug
has been present since the beginning of IPv6 nexthop gateway support.
Commit 1aefd3de7bc6 ("ipv6: Add fib6_nh_init and release to stubs") tells
us that only fib6_nh_init has a dummy stub because fib6_nh_release should
not be called if fib6_nh_init returns an error, but the commit below added
a call to ipv6_stub->fib6_nh_release in its error path. To fix it return
the dummy stub's -EAFNOSUPPORT error directly without calling
ipv6_stub->fib6_nh_release in nh_create_ipv6()'s error path.
[1]
Output is a bit truncated, but it clearly shows the error.
BUG: kernel NULL pointer dereference, address: 000000000000000000
#PF: supervisor instruction fetch in kernel modede
#PF: error_code(0x0010) - not-present pagege
PGD 0 P4D 0
Oops: 0010 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 638 Comm: ip Kdump: loaded Not tainted 5.16.0-rc1+ #446
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
RIP: 0010:0x0
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
RSP: 0018:ffff888109f5b8f0 EFLAGS: 00010286^Ac
RAX: 0000000000000000 RBX: ffff888109f5ba28 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8881008a2860
RBP: ffff888109f5b9d8 R08: 0000000000000000 R09: 0000000000000000
R10: ffff888109f5b978 R11: ffff888109f5b948 R12: 00000000ffffff9f
R13: ffff8881008a2a80 R14: ffff8881008a2860 R15: ffff8881008a2840
FS: 00007f98de70f100(0000) GS:ffff88822bf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 0000000100efc000 CR4: 00000000000006e0
Call Trace:
<TASK>
nh_create_ipv6+0xed/0x10c
rtm_new_nexthop+0x6d7/0x13f3
? check_preemption_disabled+0x3d/0xf2
? lock_is_held_type+0xbe/0xfd
rtnetlink_rcv_msg+0x23f/0x26a
? check_preemption_disabled+0x3d/0xf2
? rtnl_calcit.isra.0+0x147/0x147
netlink_rcv_skb+0x61/0xb2
netlink_unicast+0x100/0x187
netlink_sendmsg+0x37f/0x3a0
? netlink_unicast+0x187/0x187
sock_sendmsg_nosec+0x67/0x9b
____sys_sendmsg+0x19d/0x1f9
? copy_msghdr_from_user+0x4c/0x5e
? rcu_read_lock_any_held+0x2a/0x78
___sys_sendmsg+0x6c/0x8c
? asm_sysvec_apic_timer_interrupt+0x12/0x20
? lockdep_hardirqs_on+0xd9/0x102
? sockfd_lookup_light+0x69/0x99
__sys_sendmsg+0x50/0x6e
do_syscall_64+0xcb/0xf2
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f98dea28914
Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53
RSP: 002b:00007fff859f5e68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e2e
RAX: ffffffffffffffda RBX: 00000000619cb810 RCX: 00007f98dea28914
RDX: 0000000000000000 RSI: 00007fff859f5ed0 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008
R10: fffffffffffffce6 R11: 0000000000000246 R12: 0000000000000001
R13: 000055c0097ae520 R14: 000055c0097957fd R15: 00007fff859f63a0
</TASK>
Modules linked in: bridge stp llc bonding virtio_net
Cc: stable@vger.kernel.org
Fixes: 53010f991a9f ("nexthop: Add support for IPv6 gateways")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
MIPS/IA64 define END as assembly function ending, which conflict
with END definition in slip.h, just undef it at first
Reported-by: lkp@intel.com
Signed-off-by: Huang Pei <huangpei@loongson.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
MIPS/IA64 define END as assembly function ending, which conflict
with END definition in mkiss.c, just undef it at first
Reported-by: lkp@intel.com
Signed-off-by: Huang Pei <huangpei@loongson.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 5fa6863ba692 ("spi: Check we have a spi_device_id for each DT
compatible") added a test to check that every SPI driver has a
spi_device_id for each DT compatiable string defined by the driver
and warns if the spi_device_id is missing. The spi_device_id is
missing for the MMC SPI driver and the following warning is now seen.
WARNING KERN SPI driver mmc_spi has no spi_device_id for mmc-spi-slot
Fix this by adding the necessary spi_device_id.
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Link: https://lore.kernel.org/r/20211115113813.238044-1-jonathanh@nvidia.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
If a timeout is hit, it can result is incorrect data on the I2C bus
and/or memory corruptions in the guest since the device can still be
operating on the buffers it was given while the guest has freed them.
Here is, for example, the start of a slub_debug splat which was
triggered on the next transfer after one transfer was forced to timeout
by setting a breakpoint in the backend (rust-vmm/vhost-device):
BUG kmalloc-1k (Not tainted): Poison overwritten
First byte 0x1 instead of 0x6b
Allocated in virtio_i2c_xfer+0x65/0x35c age=350 cpu=0 pid=29
__kmalloc+0xc2/0x1c9
virtio_i2c_xfer+0x65/0x35c
__i2c_transfer+0x429/0x57d
i2c_transfer+0x115/0x134
i2cdev_ioctl_rdwr+0x16a/0x1de
i2cdev_ioctl+0x247/0x2ed
vfs_ioctl+0x21/0x30
sys_ioctl+0xb18/0xb41
Freed in virtio_i2c_xfer+0x32e/0x35c age=244 cpu=0 pid=29
kfree+0x1bd/0x1cc
virtio_i2c_xfer+0x32e/0x35c
__i2c_transfer+0x429/0x57d
i2c_transfer+0x115/0x134
i2cdev_ioctl_rdwr+0x16a/0x1de
i2cdev_ioctl+0x247/0x2ed
vfs_ioctl+0x21/0x30
sys_ioctl+0xb18/0xb41
There is no simple fix for this (the driver would have to always create
bounce buffers and hold on to them until the device eventually returns
the buffers), so just disable the timeout support for now.
Fixes: 3cfc88380413d20f ("i2c: virtio: add a virtio i2c frontend driver")
Acked-by: Jie Deng <jie.deng@intel.com>
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
|
|
At least some PL2303GC have a bcdDevice of 0x105 instead of 0x100 as the
datasheet claims. Add it to the list of known release numbers for the
HXN (G) type.
Note the chip type could only be determined indirectly based on its
package being of QFP type, which appears to only be available for
PL2303GC.
Fixes: 894758d0571d ("USB: serial: pl2303: tighten type HXN (G) detection")
Cc: stable@vger.kernel.org # 5.13
Reported-by: Anton Lundin <glance@acc.umu.se>
Link: https://lore.kernel.org/r/20211123071613.GZ108031@montezuma.acc.umu.se
Link: https://lore.kernel.org/r/20211123091017.30708-1-johan@kernel.org
Signed-off-by: Johan Hovold <johan@kernel.org>
|
|
Currently interrupt storm will occur from i2c-i801 after first
transaction if SMB_ALERT signal is enabled and ever asserted. It is
enough if the signal is asserted once even before the driver is loaded
and does not recover because that interrupt is not acknowledged.
This fix aims to fix it by two ways:
- Add acknowledging for the SMB_ALERT interrupt status
- Disable the SMB_ALERT interrupt on platforms where possible since the
driver currently does not make use for it
Acknowledging resets the SMB_ALERT interrupt status on all platforms and
also should help to avoid interrupt storm on older platforms where the
SMB_ALERT interrupt disabling is not available.
For simplicity this fix reuses the host notify feature for disabling and
restoring original register value.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=177311
Reported-by: ck+kernelbugzilla@bl4ckb0x.de
Reported-by: stephane.poignant@protonmail.com
Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Tested-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
|
|
If driver interrupts are enabled, SMBHSTCNT_INTREN will be 1 after
the first transaction, and will stay to that value forever. This
means that interrupts will be generated for both host-initiated
transactions and also SMBus Alert events even after the driver is
unloaded. To be on the safe side, we should restore the initial state
of this bit at suspend and reboot time, as we do for several other
configuration bits already and for the same reason: the BIOS should
be handed the device in the same configuration state in which we
received it. Otherwise interrupts may be generated which nobody
will process.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
|
|
The i.MX 8QM DTS files use two compatibles, so update the binding to fix
dtbs_check warnings like:
arch/arm64/boot/dts/freescale/imx8qm-mek.dt.yaml: i2c@5a800000:
compatible: ['fsl,imx8qm-lpi2c', 'fsl,imx7ulp-lpi2c'] is too long
Signed-off-by: Abel Vesa <abel.vesa@nxp.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Wolfram Sang <wsa@kernel.org>
|
|
syzbot reported that the warning in perf_sigtrap() fires, saying that
the event's task does not match current:
| WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| Modules linked in:
| CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0
| Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
| RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline]
| RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline]
| RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| ...
| Call Trace:
| <IRQ>
| irq_work_single+0x106/0x220 kernel/irq_work.c:211
| irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242
| irq_work_run+0x4f/0xd0 kernel/irq_work.c:251
| __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22
| sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17
| </IRQ>
| <TASK>
| asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664
| RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
| RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194
| ...
| coredump_task_exit kernel/exit.c:371 [inline]
| do_exit+0x1865/0x25c0 kernel/exit.c:771
| do_group_exit+0xe7/0x290 kernel/exit.c:929
| get_signal+0x3b0/0x1ce0 kernel/signal.c:2820
| arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
| handle_signal_work kernel/entry/common.c:148 [inline]
| exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
| exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
| __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
| syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
| do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
| entry_SYSCALL_64_after_hwframe+0x44/0xae
On x86 this shouldn't happen, which has arch_irq_work_raise().
The test program sets up a perf event with sigtrap set to fire on the
'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup().
This happened because the 'sched_wakeup' tracepoint also takes a task
argument passed on to perf_tp_event(), which is used to deliver the
event to that other task.
Since we cannot deliver synchronous signals to other tasks, skip an event if
perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is
set, which will avoid ever entering perf_sigtrap() for such events.
Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events")
Reported-by: syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com
|
|
We found that a process with 10 thousnads threads has been encountered
a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of
workload which will concurrently allocate lots of memory in different
threads sometimes. In this case, we will see the down_read_trylock()
with a high hotspot. Therefore, we suppose that rwsem has a regression
at least since Linux-v5.4. In order to easily debug this problem, we
write a simply benchmark to create the similar situation lile the
following.
```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>
#include <cstdio>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>
volatile int mutex;
void trigger(int cpu, char* ptr, std::size_t sz)
{
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(cpu, &set);
assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);
while (mutex);
for (std::size_t i = 0; i < sz; i += 4096) {
*ptr = '\0';
ptr += 4096;
}
}
int main(int argc, char* argv[])
{
std::size_t sz = 100;
if (argc > 1)
sz = atoi(argv[1]);
auto nproc = std::thread::hardware_concurrency();
std::vector<std::thread> thr;
sz <<= 30;
auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
MAP_PRIVATE, -1, 0);
assert(ptr != MAP_FAILED);
char* cptr = static_cast<char*>(ptr);
auto run = sz / nproc;
run = (run >> 12) << 12;
mutex = 1;
for (auto i = 0U; i < nproc; ++i) {
thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
cptr += run;
}
rusage usage_start;
getrusage(RUSAGE_SELF, &usage_start);
auto start = std::chrono::system_clock::now();
mutex = 0;
for (auto& t : thr)
t.join();
rusage usage_end;
getrusage(RUSAGE_SELF, &usage_end);
auto end = std::chrono::system_clock::now();
timeval utime;
timeval stime;
timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
printf("real: %lu\n",
std::chrono::duration_cast<std::chrono::milliseconds>(end -
start).count());
return 0;
}
```
The functionality of above program is simply which creates `nproc`
threads and each of them are trying to touch memory (trigger page
fault) on different CPU. Then we will see the similar profile by
`perf top`.
25.55% [kernel] [k] down_read_trylock
14.78% [kernel] [k] handle_mm_fault
13.45% [kernel] [k] up_read
8.61% [kernel] [k] clear_page_erms
3.89% [kernel] [k] __do_page_fault
The highest hot instruction, which accounts for about 92%, in
down_read_trylock() is cmpxchg like the following.
91.89 │ lock cmpxchg %rdx,(%rdi)
Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4,
so we easily found that the commit ddb20d1d3aed ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But it is not always
true for mmap lock which could be contended with thousands threads.
So most threads almost need to run at least 2 times of "cmpxchg" to
acquire the lock. The overhead of atomic operation is higher than
non-atomic instructions, which caused the regression.
By using the above benchmark, the real executing time on a x86-64 system
before and after the patch were:
Before Patch After Patch
# of Threads real real reduced by
------------ ------ ------ ----------
1 65,373 65,206 ~0.0%
4 15,467 15,378 ~0.5%
40 6,214 5,528 ~11.0%
For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended, the more fast.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
|
|
There are some inconsistency in the way that the handoff bit is being
handled in readers and writers that lead to a race condition.
Firstly, when a queue head writer set the handoff bit, it will clear
it when the writer is being killed or interrupted on its way out
without acquiring the lock. That is not the case for a queue head
reader. The handoff bit will simply be inherited by the next waiter.
Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes
empty. For rwsem_down_write_slowpath(), however, the handoff bit is
not checked and cleared if the wait queue is empty. This can
potentially make the handoff bit set with empty wait queue.
Worse, the situation in rwsem_down_write_slowpath() relies on wstate,
a variable set outside of the critical section containing the ->count
manipulation, this leads to race condition where RWSEM_FLAG_HANDOFF
can be double subtracted, corrupting ->count.
To make the handoff bit handling more consistent and robust, extract
out handoff bit clearing code into the new rwsem_del_waiter() helper
function. Also, completely eradicate wstate; always evaluate
everything inside the same critical section.
The common function will only use atomic_long_andnot() to clear bits
when the wait queue is empty to avoid possible race condition. If the
first waiter with handoff bit set is killed or interrupted to exit the
slowpath without acquiring the lock, the next waiter will inherit the
handoff bit.
While at it, simplify the trylock for loop in
rwsem_down_write_slowpath() to make it easier to read.
Fixes: 4f23dbc1e657 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com
|
|
We observed the following deadlock in the stress test under low
memory scenario:
Thread A Thread B
- erofs_shrink_scan
- erofs_try_to_release_workgroup
- erofs_workgroup_try_to_freeze -- A
- z_erofs_do_read_page
- z_erofs_collection_begin
- z_erofs_register_collection
- erofs_insert_workgroup
- xa_lock(&sbi->managed_pslots) -- B
- erofs_workgroup_get
- erofs_wait_on_workgroup_freezed -- A
- xa_erase
- xa_lock(&sbi->managed_pslots) -- B
To fix this, it needs to hold xa_lock before freezing the workgroup
since xarray will be touched then. So let's hold the lock before
accessing each workgroup, just like what we did with the radix tree
before.
[ Gao Xiang: Jianhua Hao also reports this issue at
https://lore.kernel.org/r/b10b85df30694bac8aadfe43537c897a@xiaomi.com ]
Link: https://lore.kernel.org/r/20211118135844.3559-1-huangjianan@oppo.com
Fixes: 64094a04414f ("erofs: convert workstn to XArray")
Reviewed-by: Chao Yu <chao@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Huang Jianan <huangjianan@oppo.com>
Reported-by: Jianhua Hao <haojianhua1@xiaomi.com>
Signed-off-by: Gao Xiang <xiang@kernel.org>
|
|
When a reset is requested the position of the write pointer is updated but
the data in the corresponding zone is not cleared. Instead scsi_debug
returns any data written before the write pointer was reset. This is an
error and prevents using scsi_debug for stale page cache testing of the
BLKRESETZONE ioctl.
Zero written data in the zone when resetting the write pointer.
Link: https://lore.kernel.org/r/20211122061223.298890-1-shinichiro.kawasaki@wdc.com
Fixes: f0d1cf9378bd ("scsi: scsi_debug: Add ZBC zone commands")
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
This fixes an issue added in commit 4edd8cd4e86d ("scsi: core: sysfs: Fix
hang when device state is set via sysfs") where if userspace is requesting
to set the device state to SDEV_RUNNING when the state is already
SDEV_RUNNING, we return -EINVAL instead of count. The commmit above set ret
to count for this case, when it should have set it to 0.
Link: https://lore.kernel.org/r/20211120164917.4924-1-michael.christie@oracle.com
Fixes: 4edd8cd4e86d ("scsi: core: sysfs: Fix hang when device state is set via sysfs")
Reviewed-by: Lee Duncan <lduncan@suse.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
In resp_mode_select() sanity check the block descriptor len to avoid UAF.
BUG: KASAN: use-after-free in resp_mode_select+0xa4c/0xb40 drivers/scsi/scsi_debug.c:2509
Read of size 1 at addr ffff888026670f50 by task scsicmd/15032
CPU: 1 PID: 15032 Comm: scsicmd Not tainted 5.15.0-01d0625 #15
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Call Trace:
<TASK>
dump_stack_lvl+0x89/0xb5 lib/dump_stack.c:107
print_address_description.constprop.9+0x28/0x160 mm/kasan/report.c:257
kasan_report.cold.14+0x7d/0x117 mm/kasan/report.c:443
__asan_report_load1_noabort+0x14/0x20 mm/kasan/report_generic.c:306
resp_mode_select+0xa4c/0xb40 drivers/scsi/scsi_debug.c:2509
schedule_resp+0x4af/0x1a10 drivers/scsi/scsi_debug.c:5483
scsi_debug_queuecommand+0x8c9/0x1e70 drivers/scsi/scsi_debug.c:7537
scsi_queue_rq+0x16b4/0x2d10 drivers/scsi/scsi_lib.c:1521
blk_mq_dispatch_rq_list+0xb9b/0x2700 block/blk-mq.c:1640
__blk_mq_sched_dispatch_requests+0x28f/0x590 block/blk-mq-sched.c:325
blk_mq_sched_dispatch_requests+0x105/0x190 block/blk-mq-sched.c:358
__blk_mq_run_hw_queue+0xe5/0x150 block/blk-mq.c:1762
__blk_mq_delay_run_hw_queue+0x4f8/0x5c0 block/blk-mq.c:1839
blk_mq_run_hw_queue+0x18d/0x350 block/blk-mq.c:1891
blk_mq_sched_insert_request+0x3db/0x4e0 block/blk-mq-sched.c:474
blk_execute_rq_nowait+0x16b/0x1c0 block/blk-exec.c:63
sg_common_write.isra.18+0xeb3/0x2000 drivers/scsi/sg.c:837
sg_new_write.isra.19+0x570/0x8c0 drivers/scsi/sg.c:775
sg_ioctl_common+0x14d6/0x2710 drivers/scsi/sg.c:941
sg_ioctl+0xa2/0x180 drivers/scsi/sg.c:1166
__x64_sys_ioctl+0x19d/0x220 fs/ioctl.c:52
do_syscall_64+0x3a/0x80 arch/x86/entry/common.c:50
entry_SYSCALL_64_after_hwframe+0x44/0xae arch/x86/entry/entry_64.S:113
Link: https://lore.kernel.org/r/1637262208-28850-1-git-send-email-george.kennedy@oracle.com
Reported-by: syzkaller <syzkaller@googlegroups.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
As io_remove_next_linked() is now under ->timeout_lock (see
io_link_timeout_fn), we should update locking around io_for_each_link()
and io_match_task() to use the new lock.
Cc: stable@kernel.org # 5.15+
Fixes: 89850fce16a1a ("io_uring: run timeouts from task_work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b54541cedf7de59cb5ae36109e58529ca16e66aa.1637631883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
disk->fops->owner is grabbed in blkdev_get_no_open() after the disk
kobject refcount is increased. This way can't make sure that
disk->fops->owner is still alive since del_gendisk() still can move
on if the kobject refcount of disk is grabbed by open() and
disk->fops->open() isn't called yet.
Fixes the issue by moving try_module_get() into blkdev_get_by_dev()
with ->open_mutex() held, then we can drain the in-progress open()
in del_gendisk(). Meantime new open() won't succeed because disk
becomes not alive.
This way is reasonable because blkdev_get_no_open() needn't to touch
disk->fops or defined callbacks.
Cc: Christoph Hellwig <hch@lst.de>
Cc: czhong@redhat.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211111020343.316126-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- fix VIDIOC_DQEVENT ioctl handling for 32-bit userspace with a 64-bit
kernel
- regression fix for videobuf2 core
- fix for CEC core when handling non-block transmit
- hi846: fix a clang warning
* tag 'media/v5.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: hi846: remove the of_match_ptr macro
media: hi846: include property.h instead of of_graph.h
media: cec: copy sequence field for the reply
media: videobuf2-dma-sg: Fix buf->vb NULL pointer dereference
media: v4l2-core: fix VIDIOC_DQEVENT handling on non-x86
|
|
xprtsock.c reclassifies sock locks based on the protocol.
However there are 3 protocols and only 2 classification keys.
The same key is used for both INET6 and LOCAL.
This causes lockdep complaints. The complaints started since Commit
ea9afca88bbe ("SUNRPC: Replace use of socket sk_callback_lock with
sock_lock") which resulted in the sock locks beings used more.
So add another key, and renumber them slightly.
Fixes: ea9afca88bbe ("SUNRPC: Replace use of socket sk_callback_lock with sock_lock")
Fixes: 176e21ee2ec8 ("SUNRPC: Support for RPC over AF_LOCAL transports")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
We must flush the TLB before releasing i_mmap_rwsem to avoid the
potential reuse of an unshared PMDs page. This is not true in the case
of move_hugetlb_page_tables(). The last reference on the page table can
therefore be dropped before the TLB flush took place.
Prevent it by reordering the operations and flushing the TLB before
releasing i_mmap_rwsem.
Fixes: 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed vma")
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When __unmap_hugepage_range() calls to huge_pmd_unshare() succeed, a TLB
flush is missing. This TLB flush must be performed before releasing the
i_mmap_rwsem, in order to prevent an unshared PMDs page from being
released and reused before the TLB flush took place.
Arguably, a comprehensive solution would use mmu_gather interface to
batch the TLB flushes and the PMDs page release, however it is not an
easy solution: (1) try_to_unmap_one() and try_to_migrate_one() also call
huge_pmd_unshare() and they cannot use the mmu_gather interface; and (2)
deferring the release of the page reference for the PMDs page until
after i_mmap_rwsem is dropeed can confuse huge_pmd_unshare() into
thinking PMDs are shared when they are not.
Fix __unmap_hugepage_range() by adding the missing TLB flush, and
forcing a flush when unshare is successful.
Fixes: 24669e58477e ("hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages)" # 3.6
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Ice driver has the routines for managing XDP resources that are shared
between ndo_bpf op and VSI rebuild flow. The latter takes place for
example when user changes queue count on an interface via ethtool's
set_channels().
There is an issue around the bpf_prog refcounting when VSI is being
rebuilt - since ice_prepare_xdp_rings() is called with vsi->xdp_prog as
an argument that is used later on by ice_vsi_assign_bpf_prog(), same
bpf_prog pointers are swapped with each other. Then it is also
interpreted as an 'old_prog' which in turn causes us to call
bpf_prog_put on it that will decrement its refcount.
Below splat can be interpreted in a way that due to zero refcount of a
bpf_prog it is wiped out from the system while kernel still tries to
refer to it:
[ 481.069429] BUG: unable to handle page fault for address: ffffc9000640f038
[ 481.077390] #PF: supervisor read access in kernel mode
[ 481.083335] #PF: error_code(0x0000) - not-present page
[ 481.089276] PGD 100000067 P4D 100000067 PUD 1001cb067 PMD 106d2b067 PTE 0
[ 481.097141] Oops: 0000 [#1] PREEMPT SMP PTI
[ 481.101980] CPU: 12 PID: 3339 Comm: sudo Tainted: G OE 5.15.0-rc5+ #1
[ 481.110840] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
[ 481.122021] RIP: 0010:dev_xdp_prog_id+0x25/0x40
[ 481.127265] Code: 80 00 00 00 00 0f 1f 44 00 00 89 f6 48 c1 e6 04 48 01 fe 48 8b 86 98 08 00 00 48 85 c0 74 13 48 8b 50 18 31 c0 48 85 d2 74 07 <48> 8b 42 38 8b 40 20 c3 48 8b 96 90 08 00 00 eb e8 66 2e 0f 1f 84
[ 481.148991] RSP: 0018:ffffc90007b63868 EFLAGS: 00010286
[ 481.155034] RAX: 0000000000000000 RBX: ffff889080824000 RCX: 0000000000000000
[ 481.163278] RDX: ffffc9000640f000 RSI: ffff889080824010 RDI: ffff889080824000
[ 481.171527] RBP: ffff888107af7d00 R08: 0000000000000000 R09: ffff88810db5f6e0
[ 481.179776] R10: 0000000000000000 R11: ffff8890885b9988 R12: ffff88810db5f4bc
[ 481.188026] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 481.196276] FS: 00007f5466d5bec0(0000) GS:ffff88903fb00000(0000) knlGS:0000000000000000
[ 481.205633] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 481.212279] CR2: ffffc9000640f038 CR3: 000000014429c006 CR4: 00000000003706e0
[ 481.220530] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 481.228771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 481.237029] Call Trace:
[ 481.239856] rtnl_fill_ifinfo+0x768/0x12e0
[ 481.244602] rtnl_dump_ifinfo+0x525/0x650
[ 481.249246] ? __alloc_skb+0xa5/0x280
[ 481.253484] netlink_dump+0x168/0x3c0
[ 481.257725] netlink_recvmsg+0x21e/0x3e0
[ 481.262263] ____sys_recvmsg+0x87/0x170
[ 481.266707] ? __might_fault+0x20/0x30
[ 481.271046] ? _copy_from_user+0x66/0xa0
[ 481.275591] ? iovec_from_user+0xf6/0x1c0
[ 481.280226] ___sys_recvmsg+0x82/0x100
[ 481.284566] ? sock_sendmsg+0x5e/0x60
[ 481.288791] ? __sys_sendto+0xee/0x150
[ 481.293129] __sys_recvmsg+0x56/0xa0
[ 481.297267] do_syscall_64+0x3b/0xc0
[ 481.301395] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 481.307238] RIP: 0033:0x7f5466f39617
[ 481.311373] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bd 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
[ 481.342944] RSP: 002b:00007ffedc7f4308 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
[ 481.361783] RAX: ffffffffffffffda RBX: 00007ffedc7f5460 RCX: 00007f5466f39617
[ 481.380278] RDX: 0000000000000000 RSI: 00007ffedc7f5360 RDI: 0000000000000003
[ 481.398500] RBP: 00007ffedc7f53f0 R08: 0000000000000000 R09: 000055d556f04d50
[ 481.416463] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffedc7f5360
[ 481.434131] R13: 00007ffedc7f5350 R14: 00007ffedc7f5344 R15: 0000000000000e98
[ 481.451520] Modules linked in: ice(OE) af_packet binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp mxm_wmi mei_me coretemp mei ipmi_si ipmi_msghandler wmi acpi_pad acpi_power_meter ip_tables x_tables autofs4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ahci crypto_simd cryptd libahci lpc_ich [last unloaded: ice]
[ 481.528558] CR2: ffffc9000640f038
[ 481.542041] ---[ end trace d1f24c9ecf5b61c1 ]---
Fix this by only calling ice_vsi_assign_bpf_prog() inside
ice_prepare_xdp_rings() when current vsi->xdp_prog pointer is NULL.
This way set_channels() flow will not attempt to swap the vsi->xdp_prog
pointers with itself.
Also, sprinkle around some comments that provide a reasoning about
correlation between driver and kernel in terms of bpf_prog refcount.
Fixes: efc2214b6047 ("ice: Add support for XDP")
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Marta Plantykow <marta.a.plantykow@intel.com>
Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
The approach of having XDP queue per CPU regardless of user's setting
exposed a hidden bug that could occur in case when Rx queue count differ
from Tx queue count. Currently vsi->txq_map's size is equal to the
doubled vsi->alloc_txq, which is not correct due to the fact that XDP
rings were previously based on the Rx queue count. Below splat can be
seen when ethtool -L is used and XDP rings are configured:
[ 682.875339] BUG: kernel NULL pointer dereference, address: 000000000000000f
[ 682.883403] #PF: supervisor read access in kernel mode
[ 682.889345] #PF: error_code(0x0000) - not-present page
[ 682.895289] PGD 0 P4D 0
[ 682.898218] Oops: 0000 [#1] PREEMPT SMP PTI
[ 682.903055] CPU: 42 PID: 2878 Comm: ethtool Tainted: G OE 5.15.0-rc5+ #1
[ 682.912214] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
[ 682.923380] RIP: 0010:devres_remove+0x44/0x130
[ 682.928527] Code: 49 89 f4 55 48 89 fd 4c 89 ff 53 48 83 ec 10 e8 92 b9 49 00 48 8b 9d a8 02 00 00 48 8d 8d a0 02 00 00 49 89 c2 48 39 cb 74 0f <4c> 3b 63 10 74 25 48 8b 5b 08 48 39 cb 75 f1 4c 89 ff 4c 89 d6 e8
[ 682.950237] RSP: 0018:ffffc90006a679f0 EFLAGS: 00010002
[ 682.956285] RAX: 0000000000000286 RBX: ffffffffffffffff RCX: ffff88908343a370
[ 682.964538] RDX: 0000000000000001 RSI: ffffffff81690d60 RDI: 0000000000000000
[ 682.972789] RBP: ffff88908343a0d0 R08: 0000000000000000 R09: 0000000000000000
[ 682.981040] R10: 0000000000000286 R11: 3fffffffffffffff R12: ffffffff81690d60
[ 682.989282] R13: ffffffff81690a00 R14: ffff8890819807a8 R15: ffff88908343a36c
[ 682.997535] FS: 00007f08c7bfa740(0000) GS:ffff88a03fd00000(0000) knlGS:0000000000000000
[ 683.006910] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 683.013557] CR2: 000000000000000f CR3: 0000001080a66003 CR4: 00000000003706e0
[ 683.021819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 683.030075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 683.038336] Call Trace:
[ 683.041167] devm_kfree+0x33/0x50
[ 683.045004] ice_vsi_free_arrays+0x5e/0xc0 [ice]
[ 683.050380] ice_vsi_rebuild+0x4c8/0x750 [ice]
[ 683.055543] ice_vsi_recfg_qs+0x9a/0x110 [ice]
[ 683.060697] ice_set_channels+0x14f/0x290 [ice]
[ 683.065962] ethnl_set_channels+0x333/0x3f0
[ 683.070807] genl_family_rcv_msg_doit+0xea/0x150
[ 683.076152] genl_rcv_msg+0xde/0x1d0
[ 683.080289] ? channels_prepare_data+0x60/0x60
[ 683.085432] ? genl_get_cmd+0xd0/0xd0
[ 683.089667] netlink_rcv_skb+0x50/0xf0
[ 683.094006] genl_rcv+0x24/0x40
[ 683.097638] netlink_unicast+0x239/0x340
[ 683.102177] netlink_sendmsg+0x22e/0x470
[ 683.106717] sock_sendmsg+0x5e/0x60
[ 683.110756] __sys_sendto+0xee/0x150
[ 683.114894] ? handle_mm_fault+0xd0/0x2a0
[ 683.119535] ? do_user_addr_fault+0x1f3/0x690
[ 683.134173] __x64_sys_sendto+0x25/0x30
[ 683.148231] do_syscall_64+0x3b/0xc0
[ 683.161992] entry_SYSCALL_64_after_hwframe+0x44/0xae
Fix this by taking into account the value that num_possible_cpus()
yields in addition to vsi->alloc_txq instead of doubling the latter.
Fixes: efc2214b6047 ("ice: Add support for XDP")
Fixes: 22bf877e528f ("ice: introduce XDP_TX fallback path")
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
|
|
Nikolay Aleksandrov says:
====================
net: nexthop: fix refcount issues when replacing groups
This set fixes a refcount bug when replacing nexthop groups and
modifying routes. It is complex because the objects look valid when
debugging memory dumps, but we end up having refcount dependency between
unlinked objects which can never be released, so in turn they cannot
free their resources and refcounts. The problem happens because we can
have stale IPv6 per-cpu dsts in nexthops which were removed from a
group. Even though the IPv6 gen is bumped, the dsts won't be released
until traffic passes through them or the nexthop is freed, that can take
arbitrarily long time, and even worse we can create a scenario[1] where it
can never be released. The fix is to release the IPv6 per-cpu dsts of
replaced nexthops after an RCU grace period so no new ones can be
created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
is used by the nexthop code only when necessary. We can further optimize
group replacement, but that is more suited for net-next as these patches
would have to be backported to stable releases.
v2: patch 02: update commit msg
patch 03: check for mausezahn before testing and make a few comments
more verbose
[1]
This info is also present in patch 02's commit message.
Initial state:
$ ip nexthop list
id 200 via 2002:db8::2 dev bridge.10 scope link onlink
id 201 via 2002:db8::3 dev bridge scope link onlink
id 203 group 201/200
$ ip -6 route
2001:db8::10 nhid 203 metric 1024 pref medium
nexthop via 2002:db8::3 dev bridge weight 1 onlink
nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
Create rt6_info through one of the multipath legs, e.g.:
$ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10
(pkt_inj is just a custom packet generator, nothing special)
Then remove that leg from the group by replace (let's assume it is id
200 in this case):
$ ip nexthop replace id 203 group 201
Now remove the IPv6 route:
$ ip -6 route del 2001:db8::10/128
The route won't be really deleted due to the stale rt6_info holding 1
refcnt in nexthop id 200.
At this point we have the following reference count dependency:
(deleted) IPv6 route holds 1 reference over nhid 203
nh 203 holds 1 ref over id 201
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
Now to create circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
$ ip nexthop replace id 203 group 201/200
And now we have a permanent circular dependncy because nhid 203 holds a
reference over nh 200 and 201, but the route holds a ref over nh 203 and
is deleted.
To trigger the bug just delete the group (nhid 203):
$ ip nexthop del id 203
It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the reference it holds over its
entries at free time (i.e. its own refcount needs to drop to 0) that will
never happen and we get a permanent ref on them, since one of the entries
holds a reference over the IPv6 route it will also never be released.
At this point the dependencies are:
(deleted, only unlinked) IPv6 route holds reference over group nh 203
(deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
This is the last point where it can be fixed by running traffic through
nh 200, and specifically through the same CPU so the rt6_info (dst) will
get released due to the IPv6 genid, that in turn will free the IPv6
route, which in turn will free the ref count over the group nh 203.
If nh 200 is deleted at this point, it will never be released due to the
ref from the unlinked group 203, it will only be unlinked:
$ ip nexthop del id 200
$ ip nexthop
$
Now we can never release that stale rt6_info, we have IPv6 route with ref
over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
rt6_info (dst) with ref over the net device and the IPv6 route. All of
these objects are only unlinked, and cannot be released, thus they can't
release their ref counts.
Message from syslogd@dev at Nov 19 14:04:10 ...
kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Message from syslogd@dev at Nov 19 14:04:20 ...
kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The new selftest runs a sequence which causes circular refcount
dependency between deleted objects which cannot be released and results
in a netdevice refcount imbalance.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
the removed nexthop entries after an RCU grace period because they
contain references to the nexthop's net device and to the fib6 info.
With specific series of events[1] we can reach net device refcount
imbalance which is unrecoverable. IPv4 is not affected because dsts
don't take a refcount on the route.
[1]
$ ip nexthop list
id 200 via 2002:db8::2 dev bridge.10 scope link onlink
id 201 via 2002:db8::3 dev bridge scope link onlink
id 203 group 201/200
$ ip -6 route
2001:db8::10 nhid 203 metric 1024 pref medium
nexthop via 2002:db8::3 dev bridge weight 1 onlink
nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
Create rt6_info through one of the multipath legs, e.g.:
$ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10
(pkt_inj is just a custom packet generator, nothing special)
Then remove that leg from the group by replace (let's assume it is id
200 in this case):
$ ip nexthop replace id 203 group 201
Now remove the IPv6 route:
$ ip -6 route del 2001:db8::10/128
The route won't be really deleted due to the stale rt6_info holding 1
refcnt in nexthop id 200.
At this point we have the following reference count dependency:
(deleted) IPv6 route holds 1 reference over nhid 203
nh 203 holds 1 ref over id 201
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
Now to create circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
$ ip nexthop replace id 203 group 201/200
And now we have a permanent circular dependncy because nhid 203 holds a
reference over nh 200 and 201, but the route holds a ref over nh 203 and
is deleted.
To trigger the bug just delete the group (nhid 203):
$ ip nexthop del id 203
It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the reference it holds over its
entries at free time (i.e. its own refcount needs to drop to 0) that will
never happen and we get a permanent ref on them, since one of the entries
holds a reference over the IPv6 route it will also never be released.
At this point the dependencies are:
(deleted, only unlinked) IPv6 route holds reference over group nh 203
(deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
nh 200 holds 1 ref over the net device and the route due to the stale
rt6_info
This is the last point where it can be fixed by running traffic through
nh 200, and specifically through the same CPU so the rt6_info (dst) will
get released due to the IPv6 genid, that in turn will free the IPv6
route, which in turn will free the ref count over the group nh 203.
If nh 200 is deleted at this point, it will never be released due to the
ref from the unlinked group 203, it will only be unlinked:
$ ip nexthop del id 200
$ ip nexthop
$
Now we can never release that stale rt6_info, we have IPv6 route with ref
over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
rt6_info (dst) with ref over the net device and the IPv6 route. All of
these objects are only unlinked, and cannot be released, thus they can't
release their ref counts.
Message from syslogd@dev at Nov 19 14:04:10 ...
kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Message from syslogd@dev at Nov 19 14:04:20 ...
kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
Fixes: 7bf4796dd099 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We need a way to release a fib6_nh's per-cpu dsts when replacing
nexthops otherwise we can end up with stale per-cpu dsts which hold net
device references, so add a new IPv6 stub called fib6_nh_release_dsts.
It must be used after an RCU grace period, so no new dsts can be created
through a group's nexthop entry.
Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed
so it doesn't need a dummy stub when IPv6 is not enabled.
Fixes: 7bf4796dd099 ("nexthops: add support for replace")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When IPv6 module gets initialized, but it's hitting an error in inet6_init()
where it then needs to undo all the prior initialization work, it also might
do a call to ndisc_cleanup() which then calls neigh_table_clear(). In there
is a missing timer cancellation of the table's managed_work item.
The kernel test robot explicitly triggered this error path and caused a UAF
crash similar to the below:
[...]
[ 28.833183][ C0] BUG: unable to handle page fault for address: f7a43288
[ 28.833973][ C0] #PF: supervisor write access in kernel mode
[ 28.834660][ C0] #PF: error_code(0x0002) - not-present page
[ 28.835319][ C0] *pde = 06b2c067 *pte = 00000000
[ 28.835853][ C0] Oops: 0002 [#1] PREEMPT
[ 28.836367][ C0] CPU: 0 PID: 303 Comm: sed Not tainted 5.16.0-rc1-00233-g83ff5faa0d3b #7
[ 28.837293][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[ 28.838338][ C0] EIP: __run_timers.constprop.0+0x82/0x440
[...]
[ 28.845607][ C0] Call Trace:
[ 28.845942][ C0] <SOFTIRQ>
[ 28.846333][ C0] ? check_preemption_disabled.isra.0+0x2a/0x80
[ 28.846975][ C0] ? __this_cpu_preempt_check+0x8/0xa
[ 28.847570][ C0] run_timer_softirq+0xd/0x40
[ 28.848050][ C0] __do_softirq+0xf5/0x576
[ 28.848547][ C0] ? __softirqentry_text_start+0x10/0x10
[ 28.849127][ C0] do_softirq_own_stack+0x2b/0x40
[ 28.849749][ C0] </SOFTIRQ>
[ 28.850087][ C0] irq_exit_rcu+0x7d/0xc0
[ 28.850587][ C0] common_interrupt+0x2a/0x40
[ 28.851068][ C0] asm_common_interrupt+0x119/0x120
[...]
Note that IPv6 module cannot be unloaded as per 8ce440610357 ("ipv6: do not
allow ipv6 module to be removed") hence this can only be seen during module
initialization error. Tested with kernel test robot's reproducer.
Fixes: 7482e3841d52 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Li Zhijian <zhijianx.li@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The change to eth_hw_addr_set() caused gcc to correctly spot a
bug that was introduced in an earlier incorrect fix:
In file included from include/linux/etherdevice.h:21,
from drivers/net/ethernet/ni/nixge.c:7:
In function '__dev_addr_set',
inlined from 'eth_hw_addr_set' at include/linux/etherdevice.h:319:2,
inlined from 'nixge_probe' at drivers/net/ethernet/ni/nixge.c:1286:3:
include/linux/netdevice.h:4648:9: error: 'memcpy' reading 6 bytes from a region of size 0 [-Werror=stringop-overread]
4648 | memcpy(dev->dev_addr, addr, len);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As nixge_get_nvmem_address() can return either NULL or an error
pointer, the NULL check is wrong, and we can end up reading from
ERR_PTR(-EOPNOTSUPP), which gcc knows to contain zero readable
bytes.
Make the function always return an error pointer again but fix
the check to match that.
Fixes: f3956ebb3bf0 ("ethernet: use eth_hw_addr_set() instead of ether_addr_copy()")
Fixes: abcd3d6fc640 ("net: nixge: Fix error path for obtaining mac address")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
|