Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull block handle updates from Christian Brauner:
"Last cycle we changed opening of block devices, and opening a block
device would return a bdev_handle. This allowed us to implement
support for restricting and forbidding writes to mounted block
devices. It was accompanied by converting and adding helpers to
operate on bdev_handles instead of plain block devices.
That was already a good step forward but ultimately it isn't necessary
to have special purpose helpers for opening block devices internally
that return a bdev_handle.
Fundamentally, opening a block device internally should just be
equivalent to opening files. So now all internal opens of block
devices return files just as a userspace open would. Instead of
introducing a separate indirection into bdev_open_by_*() via struct
bdev_handle bdev_file_open_by_*() is made to just return a struct
file. Opening and closing a block device just becomes equivalent to
opening and closing a file.
This all works well because internally we already have a pseudo fs for
block devices and so opening block devices is simple. There's a few
places where we needed to be careful such as during boot when the
kernel is supposed to mount the rootfs directly without init doing it.
Here we need to take care to ensure that we flush out any asynchronous
file close. That's what we already do for opening, unpacking, and
closing the initramfs. So nothing new here.
The equivalence of opening and closing block devices to regular files
is a win in and of itself. But it also has various other advantages.
We can remove struct bdev_handle completely. Various low-level helpers
are now private to the block layer. Other helpers were simply
removable completely.
A follow-up series that is already reviewed build on this and makes it
possible to remove bdev->bd_inode and allows various clean ups of the
buffer head code as well. All places where we stashed a bdev_handle
now just stash a file and use simple accessors to get to the actual
block device which was already the case for bdev_handle"
* tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits)
block: remove bdev_handle completely
block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access
bdev: remove bdev pointer from struct bdev_handle
bdev: make struct bdev_handle private to the block layer
bdev: make bdev_{release, open_by_dev}() private to block layer
bdev: remove bdev_open_by_path()
reiserfs: port block device access to file
ocfs2: port block device access to file
nfs: port block device access to files
jfs: port block device access to file
f2fs: port block device access to files
ext4: port block device access to file
erofs: port device access to file
btrfs: port device access to file
bcachefs: port block device access to file
target: port block device access to file
s390: port block device access to file
nvme: port block device access to file
block2mtd: port device access to files
bcache: port block device access to files
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull file locking updates from Christian Brauner:
"A few years ago struct file_lock_context was added to allow for
separate lists to track different types of file locks instead of using
a singly-linked list for all of them.
Now leases no longer need to be tracked using struct file_lock.
However, a lot of the infrastructure is identical for leases and locks
so separating them isn't trivial.
This splits a group of fields used by both file locks and leases into
a new struct file_lock_core. The new core struct is embedded in struct
file_lock. Coccinelle was used to convert a lot of the callers to deal
with the move, with the remaining 25% or so converted by hand.
Afterwards several internal functions in fs/locks.c are made to work
with struct file_lock_core. Ultimately this allows to split struct
file_lock into struct file_lock and struct file_lease. The file lease
APIs are then converted to take struct file_lease"
* tag 'vfs-6.9.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (51 commits)
filelock: fix deadlock detection in POSIX locking
filelock: always define for_each_file_lock()
smb: remove redundant check
filelock: don't do security checks on nfsd setlease calls
filelock: split leases out of struct file_lock
filelock: remove temporary compatibility macros
smb/server: adapt to breakup of struct file_lock
smb/client: adapt to breakup of struct file_lock
ocfs2: adapt to breakup of struct file_lock
nfsd: adapt to breakup of struct file_lock
nfs: adapt to breakup of struct file_lock
lockd: adapt to breakup of struct file_lock
fuse: adapt to breakup of struct file_lock
gfs2: adapt to breakup of struct file_lock
dlm: adapt to breakup of struct file_lock
ceph: adapt to breakup of struct file_lock
afs: adapt to breakup of struct file_lock
9p: adapt to breakup of struct file_lock
filelock: convert seqfile handling to use file_lock_core
filelock: convert locks_translate_pid to take file_lock_core
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull pdfd updates from Christian Brauner:
- Until now pidfds could only be created for thread-group leaders but
not for threads. There was no technical reason for this. We simply
had no users that needed support for this. Now we do have users that
need support for this.
This introduces a new PIDFD_THREAD flag for pidfd_open(). If that
flag is set pidfd_open() creates a pidfd that refers to a specific
thread.
In addition, we now allow clone() and clone3() to be called with
CLONE_PIDFD | CLONE_THREAD which wasn't possible before.
A pidfd that refers to an individual thread differs from a pidfd that
refers to a thread-group leader:
(1) Pidfds are pollable. A task may poll a pidfd and get notified
when the task has exited.
For thread-group leader pidfds the polling task is woken if the
thread-group is empty. In other words, if the thread-group
leader task exits when there are still threads alive in its
thread-group the polling task will not be woken when the
thread-group leader exits but rather when the last thread in the
thread-group exits.
For thread-specific pidfds the polling task is woken if the
thread exits.
(2) Passing a thread-group leader pidfd to pidfd_send_signal() will
generate thread-group directed signals like kill(2) does.
Passing a thread-specific pidfd to pidfd_send_signal() will
generate thread-specific signals like tgkill(2) does.
The default scope of the signal is thus determined by the type
of the pidfd.
Since use-cases exist where the default scope of the provided
pidfd needs to be overriden the following flags are added to
pidfd_send_signal():
- PIDFD_SIGNAL_THREAD
Send a thread-specific signal.
- PIDFD_SIGNAL_THREAD_GROUP
Send a thread-group directed signal.
- PIDFD_SIGNAL_PROCESS_GROUP
Send a process-group directed signal.
The scope change will only work if the struct pid is actually
used for this scope.
For example, in order to send a thread-group directed signal the
provided pidfd must be used as a thread-group leader and
similarly for PIDFD_SIGNAL_PROCESS_GROUP the struct pid must be
used as a process group leader.
- Move pidfds from the anonymous inode infrastructure to a tiny pseudo
filesystem. This will unblock further work that we weren't able to do
simply because of the very justified limitations of anonymous inodes.
Moving pidfds to a tiny pseudo filesystem allows for statx on pidfds
to become useful for the first time. They can now be compared by
inode number which are unique for the system lifetime.
Instead of stashing struct pid in file->private_data we can now stash
it in inode->i_private. This makes it possible to introduce concepts
that operate on a process once all file descriptors have been closed.
A concrete example is kill-on-last-close. Another side-effect is that
file->private_data is now freed up for per-file options for pidfds.
Now, each struct pid will refer to a different inode but the same
struct pid will refer to the same inode if it's opened multiple
times. In contrast to now where each struct pid refers to the same
inode.
The tiny pseudo filesystem is not visible anywhere in userspace
exactly like e.g., pipefs and sockfs. There's no lookup, there's no
complex inode operations, nothing. Dentries and inodes are always
deleted when the last pidfd is closed.
We allocate a new inode and dentry for each struct pid and we reuse
that inode and dentry for all pidfds that refer to the same struct
pid. The code is entirely optional and fairly small. If it's not
selected we fallback to anonymous inodes. Heavily inspired by nsfs.
The dentry and inode allocation mechanism is moved into generic
infrastructure that is now shared between nsfs and pidfs. The
path_from_stashed() helper must be provided with a stashing location,
an inode number, a mount, and the private data that is supposed to be
used and it will provide a path that can be passed to dentry_open().
The helper will try retrieve an existing dentry from the provided
stashing location. If a valid dentry is found it is reused. If not a
new one is allocated and we try to stash it in the provided location.
If this fails we retry until we either find an existing dentry or the
newly allocated dentry could be stashed. Subsequent openers of the
same namespace or task are then able to reuse it.
- Currently it is only possible to get notified when a task has exited,
i.e., become a zombie and userspace gets notified with EPOLLIN. We
now also support waiting until the task has been reaped, notifying
userspace with EPOLLHUP.
- Ensure that ESRCH is reported for getfd if a task is exiting instead
of the confusing EBADF.
- Various smaller cleanups to pidfd functions.
* tag 'vfs-6.9.pidfd' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (23 commits)
libfs: improve path_from_stashed()
libfs: add stashed_dentry_prune()
libfs: improve path_from_stashed() helper
pidfs: convert to path_from_stashed() helper
nsfs: convert to path_from_stashed() helper
libfs: add path_from_stashed()
pidfd: add pidfs
pidfd: move struct pidfd_fops
pidfd: allow to override signal scope in pidfd_send_signal()
pidfd: change pidfd_send_signal() to respect PIDFD_THREAD
signal: fill in si_code in prepare_kill_siginfo()
selftests: add ESRCH tests for pidfd_getfd()
pidfd: getfd should always report ESRCH if a task is exiting
pidfd: clone: allow CLONE_THREAD | CLONE_PIDFD together
pidfd: exit: kill the no longer used thread_group_exited()
pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN))
pid: kill the obsolete PIDTYPE_PID code in transfer_pid()
pidfd: kill the no longer needed do_notify_pidfd() in de_thread()
pidfd_poll: report POLLHUP when pid_task() == NULL
pidfd: implement PIDFD_THREAD flag for pidfd_open()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
- Restore read-write hints in struct bio through the bi_write_hint
member for the sake of UFS devices in mobile applications. This can
result in up to 40% lower write amplification in UFS devices. The
patch series that builds on this will be coming in via the SCSI
maintainers (Bart)
- Overhaul the iomap writeback code. Afterwards ->map_blocks() is able
to map multiple blocks at once as long as they're in the same folio.
This reduces CPU usage for buffered write workloads on e.g., xfs on
systems with lots of cores (Christoph)
- Record processed bytes in iomap_iter() trace event (Kassey)
- Extend iomap_writepage_map() trace event after Christoph's
->map_block() changes to map mutliple blocks at once (Zhang)
* tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
iomap: Add processed for iomap_iter
iomap: add pos and dirty_len into trace_iomap_writepage_map
block, fs: Restore the per-bio/request data lifetime fields
fs: Propagate write hints to the struct block_device inode
fs: Move enum rw_hint into a new header file
fs: Split fcntl_rw_hint()
fs: Verify write lifetime constants at compile time
fs: Fix rw_hint validation
iomap: pass the length of the dirty region to ->map_blocks
iomap: map multiple blocks at a time
iomap: submit ioends immediately
iomap: factor out a iomap_writepage_map_block helper
iomap: only call mapping_set_error once for each failed bio
iomap: don't chain bios
iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend
iomap: clean up the iomap_alloc_ioend calling convention
iomap: move all remaining per-folio logic into iomap_writepage_map
iomap: factor out a iomap_writepage_handle_eof helper
iomap: move the PF_MEMALLOC check to iomap_writepages
iomap: move the io_folios field out of struct iomap_ioend
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Misc features, cleanups, and fixes for vfs and individual filesystems.
Features:
- Support idmapped mounts for hugetlbfs.
- Add RWF_NOAPPEND flag for pwritev2(). This allows us to fix a bug
where the passed offset is ignored if the file is O_APPEND. The new
flag allows a caller to enforce that the offset is honored to
conform to posix even if the file was opened in append mode.
- Move i_mmap_rwsem in struct address_space to avoid false sharing
between i_mmap and i_mmap_rwsem.
- Convert efs, qnx4, and coda to use the new mount api.
- Add a generic is_dot_dotdot() helper that's used by various
filesystems and the VFS code instead of open-coding it multiple
times.
- Recently we've added stable offsets which allows stable ordering
when iterating directories exported through NFS on e.g., tmpfs
filesystems. Originally an xarray was used for the offset map but
that caused slab fragmentation issues over time. This switches the
offset map to the maple tree which has a dense mode that handles
this scenario a lot better. Includes tests.
- Finally merge the case-insensitive improvement series Gabriel has
been working on for a long time. This cleanly propagates case
insensitive operations through ->s_d_op which in turn allows us to
remove the quite ugly generic_set_encrypted_ci_d_ops() operations.
It also improves performance by trying a case-sensitive comparison
first and then fallback to case-insensitive lookup if that fails.
This also fixes a bug where overlayfs would be able to be mounted
over a case insensitive directory which would lead to all sort of
odd behaviors.
Cleanups:
- Make file_dentry() a simple accessor now that ->d_real() is
simplified because of the backing file work we did the last two
cycles.
- Use the dedicated file_mnt_idmap helper in ntfs3.
- Use smp_load_acquire/store_release() in the i_size_read/write
helpers and thus remove the hack to handle i_size reads in the
filemap code.
- The SLAB_MEM_SPREAD is a nop now. Remove it from various places in
fs/
- It's no longer necessary to perform a second built-in initramfs
unpack call because we retain the contents of the previous
extraction. Remove it.
- Now that we have removed various allocators kfree_rcu() always
works with kmem caches and kmalloc(). So simplify various places
that only use an rcu callback in order to handle the kmem cache
case.
- Convert the pipe code to use a lockdep comparison function instead
of open-coding the nesting making lockdep validation easier.
- Move code into fs-writeback.c that was located in a header but can
be made static as it's only used in that one file.
- Rewrite the alignment checking iterators for iovec and bvec to be
easier to read, and also significantly more compact in terms of
generated code. This saves 270 bytes of text on x86-64 (with
clang-18) and 224 bytes on arm64 (with gcc-13). In profiles it also
saves a bit of time for the same workload.
- Switch various places to use KMEM_CACHE instead of
kmem_cache_create().
- Use inode_set_ctime_to_ts() in inode_set_ctime_current()
- Use kzalloc() in name_to_handle_at() to avoid kernel infoleak.
- Various smaller cleanups for eventfds.
Fixes:
- Fix various comments and typos, and unneeded initializations.
- Fix stack allocation hack for clang in the select code.
- Improve dump_mapping() debug code on a best-effort basis.
- Fix build errors in various selftests.
- Avoid wrap-around instrumentation in various places.
- Don't allow user namespaces without an idmapping to be used for
idmapped mounts.
- Fix sysv sb_read() call.
- Fix fallback implementation of the get_name() export operation"
* tag 'vfs-6.9.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (70 commits)
hugetlbfs: support idmapped mounts
qnx4: convert qnx4 to use the new mount api
fs: use inode_set_ctime_to_ts to set inode ctime to current time
libfs: Drop generic_set_encrypted_ci_d_ops
ubifs: Configure dentry operations at dentry-creation time
f2fs: Configure dentry operations at dentry-creation time
ext4: Configure dentry operations at dentry-creation time
libfs: Add helper to choose dentry operations at mount-time
libfs: Merge encrypted_ci_dentry_ops and ci_dentry_ops
fscrypt: Drop d_revalidate once the key is added
fscrypt: Drop d_revalidate for valid dentries during lookup
fscrypt: Factor out a helper to configure the lookup dentry
ovl: Always reject mounting over case-insensitive directories
libfs: Attempt exact-match comparison first during casefolded lookup
efs: remove SLAB_MEM_SPREAD flag usage
jfs: remove SLAB_MEM_SPREAD flag usage
minix: remove SLAB_MEM_SPREAD flag usage
openpromfs: remove SLAB_MEM_SPREAD flag usage
proc: remove SLAB_MEM_SPREAD flag usage
qnx6: remove SLAB_MEM_SPREAD flag usage
...
|
|
ioremap_page_range() should be used for ranges within vmalloc range only.
The vmalloc ranges are allocated by get_vm_area(). PCI has "resource"
allocator that manages PCI_IOBASE, IO_SPACE_LIMIT address range, hence
introduce vmap_page_range() to be used exclusively to map pages
in PCI address space.
Fixes: 3e49a866c9dc ("mm: Enforce VM_IOREMAP flag and range in ioremap_page_range.")
Reported-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://lore.kernel.org/bpf/CANiq72ka4rir+RTN2FQoT=Vvprp_Ao-CvoYEkSNqtSY+RZj+AA@mail.gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm into pm
Merge OPP (operating performance points) updates for 6.9 from Viresh
Kumar:
"- Fix couple of warnings related to W=1 builds. (Viresh Kumar).
- Move Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h (Viresh Kumar).
- Extend dev_pm_opp_data with turbo support (Sibi Sankar).
- dt-bindings: drop maxItems from inner items (David Heidelberg)."
* tag 'opp-updates-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
dt-bindings: opp: drop maxItems from inner items
OPP: debugfs: Fix warning around icc_get_name()
OPP: debugfs: Fix warning with W=1 builds
cpufreq: Move dev_pm_opp_{init|free}_cpufreq_table() to pm_opp.h
OPP: Extend dev_pm_opp_data with turbo support
|
|
Merge Enery Model changes for 6.9-rc1:
- Allow the Energy Model to be updated dynamically (Lukasz Luba).
* pm-em: (24 commits)
PM: EM: Fix nr_states warnings in static checks
Documentation: EM: Update with runtime modification design
PM: EM: Add em_dev_compute_costs()
PM: EM: Remove old table
PM: EM: Change debugfs configuration to use runtime EM table data
drivers/thermal/devfreq_cooling: Use new Energy Model interface
drivers/thermal/cpufreq_cooling: Use new Energy Model interface
powercap/dtpm_devfreq: Use new Energy Model interface to get table
powercap/dtpm_cpu: Use new Energy Model interface to get table
PM: EM: Optimize em_cpu_energy() and remove division
PM: EM: Support late CPUs booting and capacity adjustment
PM: EM: Add performance field to struct em_perf_state and optimize
PM: EM: Add em_perf_state_from_pd() to get performance states table
PM: EM: Introduce em_dev_update_perf_domain() for EM updates
PM: EM: Add functions for memory allocations for new EM tables
PM: EM: Use runtime modified EM for CPUs energy estimation in EAS
PM: EM: Introduce runtime modifiable table
PM: EM: Split the allocation and initialization of the EM table
PM: EM: Check if the get_cost() callback is present in em_compute_costs()
PM: EM: Introduce em_compute_costs()
...
|
|
Merge power capping changes and power management utilities updates for
6.9-rc1:
- Address multiple issues in the TPMI RAPL driver and add support for
new platforms (Lunar Lake-M, Arrow Lake) to Intel RAPL (Zhang Rui).
- Fix freq_qos_add_request() return value check in dtpm_cpu (Daniel
Lezcano).
- Fix kernel-doc for dtpm_create_hierarchy() (Yang Li).
- Fix file leak in get_pkg_num() in x86_energy_perf_policy (Samasth
Norway Ananda).
- Fix cpupower-frequency-info.1 man page typo (Jan Kratochvil).
* pm-powercap:
powercap: dtpm: Fix kernel-doc for dtpm_create_hierarchy() function
powercap: dtpm_cpu: Fix error check against freq_qos_add_request()
powercap: intel_rapl: Add support for Arrow Lake
powercap: intel_rapl: Add support for Lunar Lake-M paltform
powercap: intel_rapl_tpmi: Fix System Domain probing
powercap: intel_rapl_tpmi: Fix a register bug
powercap: intel_rapl: Fix locking in TPMI RAPL
powercap: intel_rapl: Fix a NULL pointer dereference
* pm-tools:
Fix cpupower-frequency-info.1 man page typo
tools/power x86_energy_perf_policy: Fix file leak in get_pkg_num()
|
|
KVM Xen and pfncache changes for 6.9:
- Rip out the half-baked support for using gfn_to_pfn caches to manage pages
that are "mapped" into guests via physical addresses.
- Add support for using gfn_to_pfn caches with only a host virtual address,
i.e. to bypass the "gfn" stage of the cache. The primary use case is
overlay pages, where the guest may change the gfn used to reference the
overlay page, but the backing hva+pfn remains the same.
- Add an ioctl() to allow mapping Xen's shared_info page using an hva instead
of a gpa, so that userspace doesn't need to reconfigure and invalidate the
cache/mapping if the guest changes the gpa (but userspace keeps the resolved
hva the same).
- When possible, use a single host TSC value when computing the deadline for
Xen timers in order to improve the accuracy of the timer emulation.
- Inject pending upcall events when the vCPU software-enables its APIC to fix
a bug where an upcall can be lost (and to follow Xen's behavior).
- Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen
events fails, e.g. if the guest has aliased xAPIC IDs.
- Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to
refresh), and drop a now-redundant acquisition of xen_lock (that was
protecting the shared_info cache) to fix a deadlock due to recursively
acquiring xen_lock.
|
|
Merge cpufreq changes for 6.9-rc1:
- Enable preferred core support in the amd-pstate cpufreq driver (Meng
Li).
- Fix min_perf assignment in amd_pstate_adjust_perf() and make the
min/max limit perf values in amd-pstate always stay within the
(highest perf, lowest perf) range (Tor Vic, Meng Li).
- Change default transition delay in cpufreq to 2ms (Qais Yousef).
- Drop long-unused cpudata::prev_cummulative_iowait from the
intel_pstate cpufreq driver (Jiri Slaby).
- Allow intel_pstate to assign model-specific values to strings used in
the EPP sysfs interface and make it do so on Meteor Lake (Srinivas
Pandruvada).
- Remove references to 10ms minimum sampling rate from comments in the
cpufreq code (Pierre Gondois).
- Prevent scaling_cur_freq from exceeding scaling_max_freq when the
latter is an inefficient frequency (Shivnandan Kumar).
- Honour transition_latency over transition_delay_us in cpufreq (Qais
Yousef).
- Stop unregistering cpufreq cooling on CPU hot-remove (Viresh Kumar).
- General enhancements / cleanups to ARM cpufreq drivers (tianyu2,
Nícolas F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia
Belova).
- Update cpufreq-dt-platdev to block/approve devices (Richard Acayan).
- Make the SCMI cpufreq driver get a transition delay value from
firmware (Pierre Gondois).
* pm-cpufreq: (28 commits)
cpufreq: scmi: Set transition_delay_us
firmware: arm_scmi: Populate fast channel rate_limit
firmware: arm_scmi: Populate perf commands rate_limit
cpufreq: Don't unregister cpufreq cooling on CPU hotplug
cpufreq: Honour transition_latency over transition_delay_us
cpufreq: Limit resolving a frequency to policy min/max
cpufreq: amd-pstate: adjust min/max limit perf
cpufreq: Remove references to 10ms min sampling rate
cpufreq: intel_pstate: Update default EPPs for Meteor Lake
cpufreq: intel_pstate: Allow model specific EPPs
cpufreq: qcom-hw: add CONFIG_COMMON_CLK dependency
cpufreq: dt-platdev: block SDM670 in cpufreq-dt-platdev
cpufreq: intel_pstate: remove cpudata::prev_cummulative_iowait
cpufreq: Change default transition delay to 2ms
cpufreq: amd-pstate: Fix min_perf assignment in amd_pstate_adjust_perf()
Documentation: PM: amd-pstate: Fix section title underline
Documentation: introduce amd-pstate preferrd core mode kernel command line options
Documentation: amd-pstate: introduce amd-pstate preferred core
cpufreq: amd-pstate: Update amd-pstate preferred core ranking dynamically
ACPI: cpufreq: Add highest perf change notification
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm
Merge ARM cpufreq updates for 6.9 from Viresh Kumar:
"- General enhancements / cleanups to cpufreq drivers (tianyu2, Nícolas
F. R. A. Prado, Erick Archer, Arnd Bergmann, Anastasia Belova).
- Update cpufreq-dt-platdev to block/approve devices (Richard Acayan).
- scmi: get transition delay from firmware (Pierre Gondois)."
* tag 'cpufreq-arm-updates-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm:
cpufreq: scmi: Set transition_delay_us
firmware: arm_scmi: Populate fast channel rate_limit
firmware: arm_scmi: Populate perf commands rate_limit
cpufreq: qcom-hw: add CONFIG_COMMON_CLK dependency
cpufreq: dt-platdev: block SDM670 in cpufreq-dt-platdev
cpufreq: mediatek-hw: Don't error out if supply is not found
Documentation: power: Use kcalloc() instead of kzalloc()
cpufreq: mediatek-hw: Wait for CPU supplies before probing
cpufreq: brcmstb-avs-cpufreq: add check for cpufreq_cpu_get's return value
cpufreq: imx6: use regmap to read ocotp register
|
|
KVM x86 misc changes for 6.9:
- Explicitly initialize a variety of on-stack variables in the emulator that
triggered KMSAN false positives (though in fairness in KMSAN, it's comically
difficult to see that the uninitialized memory is never truly consumed).
- Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading
DR6 and DR7.
- Rework the "force immediate exit" code so that vendor code ultimately
decides how and when to force the exit. This allows VMX to further optimize
handling preemption timer exits, and allows SVM to avoid sending a duplicate
IPI (SVM also has a need to force an exit).
- Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if
vCPU creation ultimately failed, and add WARN to guard against similar bugs.
- Provide a dedicated arch hook for checking if a different vCPU was in-kernel
(for directed yield), and simplify the logic for checking if the currently
loaded vCPU is in-kernel.
- Misc cleanups and fixes.
|
|
The kernel CONFIG_UNWINDER_ORC option enables the ORC unwinder, which is
similar in concept to a DWARF unwinder. The difference is that the format
of the ORC data is much simpler than DWARF, which in turn allows the ORC
unwinder to be much simpler and faster.
The ORC data consists of unwind tables which are generated by objtool.
After analyzing all the code paths of a .o file, it determines information
about the stack state at each instruction address in the file and outputs
that information to the .orc_unwind and .orc_unwind_ip sections.
The per-object ORC sections are combined at link time and are sorted and
post-processed at boot time. The unwinder uses the resulting data to
correlate instruction addresses with their stack states at run time.
Most of the logic are similar with x86, in order to get ra info before ra
is saved into stack, add ra_reg and ra_offset into orc_entry. At the same
time, modify some arch-specific code to silence the objtool warnings.
Co-developed-by: Jinyang He <hejinyang@loongson.cn>
Signed-off-by: Jinyang He <hejinyang@loongson.cn>
Co-developed-by: Youling Tang <tangyouling@loongson.cn>
Signed-off-by: Youling Tang <tangyouling@loongson.cn>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
|
|
KVM async page fault changes for 6.9:
- Always flush the async page fault workqueue when a work item is being
removed, especially during vCPU destruction, to ensure that there are no
workers running in KVM code when all references to KVM-the-module are gone,
i.e. to prevent a use-after-free if kvm.ko is unloaded.
- Grab a reference to the VM's mm_struct in the async #PF worker itself instead
of gifting the worker a reference, e.g. so that there's no need to remember
to *conditionally* clean up after the worker.
|
|
Merge changes related to the runtime power management of devices for
6.9-rc1:
- Simplify pm_runtime_get_if_active() usage and add a replacement for
pm_runtime_put_autosuspend() (Sakari Ailus).
- Add a tracepoint for runtime_status changes tracking (Vilas Bhat).
- Fix section title markdown in the runtime PM documentation (Yiwei
Lin).
* pm-runtime:
Documentation: PM: Fix runtime_pm.rst markdown syntax
PM: runtime: add tracepoint for runtime_status changes
PM: runtime: Add pm_runtime_put_autosuspend() replacement
PM: runtime: Simplify pm_runtime_get_if_active() usage
|
|
Merge changes related to system-wide power management for 6.9-rc1:
- Fix and clean up system suspend statistics collection (Rafael
Wysocki).
- Simplify device suspend and resume handling in the power management
core code (Rafael Wysocki).
- Add support for LZ4 compression algorithm to the hibernation image
creation and loading code (Nikhil V).
- Fix PCI hibernation support description (Yiwei Lin).
- Make hibernation take set_memory_ro() return values into account as
appropriate (Christophe Leroy).
- Set mem_sleep_current during kernel command line setup to avoid an
ordering issue with handling it (Maulik Shah).
- Fix wake IRQs handling when pm_runtime_force_suspend() is used as a
driver's system suspend callback (Qingliang Li).
* pm-sleep: (21 commits)
PM: sleep: wakeirq: fix wake irq warning in system suspend
PM: suspend: Set mem_sleep_current during kernel command line setup
PM: hibernate: Don't ignore return from set_memory_ro()
PM: hibernate: Support to select compression algorithm
Documentation: PM: Fix PCI hibernation support description
PM: hibernate: Add support for LZ4 compression for hibernation
PM: hibernate: Move to crypto APIs for LZO compression
PM: hibernate: Rename lzo* to make it generic
PM: sleep: Call dpm_async_fn() directly in each suspend phase
PM: sleep: Move devices to new lists earlier in each suspend phase
PM: sleep: Move some assignments from under a lock
PM: sleep: stats: Log errors right after running suspend callbacks
PM: sleep: stats: Use locking in dpm_save_failed_dev()
PM: sleep: stats: Call dpm_save_failed_step() at most once per phase
PM: sleep: stats: Define suspend_stats next to the code using it
PM: sleep: stats: Use unsigned int for success and failure counters
PM: sleep: stats: Use an array of step failure counters
PM: sleep: stats: Use array of suspend step names
PM: sleep: Relocate two device PM core functions
PM: sleep: Simplify dpm_suspended_list walk in dpm_resume()
...
|
|
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.9
- Infrastructure for building KVM's trap configuration based on the
architectural features (or lack thereof) advertised in the VM's ID
registers
- Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to
x86's WC) at stage-2, improving the performance of interacting with
assigned devices that can tolerate it
- Conversion of KVM's representation of LPIs to an xarray, utilized to
address serialization some of the serialization on the LPI injection
path
- Support for _architectural_ VHE-only systems, advertised through the
absence of FEAT_E2H0 in the CPU's ID register
- Miscellaneous cleanups, fixes, and spelling corrections to KVM and
selftests
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.9
* Set reserved bits as zero in CPUCFG.
* Start SW timer only when vcpu is blocking.
* Do not restart SW timer when it is expired.
* Remove unnecessary CSR register saving during enter guest.
|
|
'acpi-thermal'
Merge ACPI tables parsing change, ACPI processor driver change, ACPI
device properties handling changes and an ACPI thermal code change for
6.9-rc1:
- Make the NFIT parsing code use acpi_evaluate_dsm_typed() (Andy
Shevchenko).
- Fix a memory leak in acpi_processor_power_exit() (Armin Wolf).
- Make it possible to quirk the CSI-2 and MIPI DisCo for Imaging
properties parsing and add a quirk for Dell XPS 9315 (Sakari Ailus).
- Prevent false-positive static checker warnings from triggering by
intializing some variables in the ACPI thermal code to zero (Colin
Ian King).
* acpi-tables:
ACPI: NFIT: Switch to use acpi_evaluate_dsm_typed()
* acpi-processor:
ACPI: processor_idle: Fix memory leak in acpi_processor_power_exit()
* acpi-property:
ACPI: property: Polish ignoring bad data nodes
ACPI: property: Ignore bad graph port nodes on Dell XPS 9315
ACPI: utils: Make acpi_handle_path() not static
* acpi-thermal:
ACPI: thermal_lib: Initialize temp_decik to zero
|
|
These helpers scatters or gathers a bitmap with the help of the mask
position bits parameter.
bitmap_scatter() does the following:
src: 0000000001011010
||||||
+------+|||||
| +----+||||
| |+----+|||
| || +-+||
| || | ||
mask: ...v..vv...v..vv
...0..11...0..10
dst: 0000001100000010
and bitmap_gather() performs this one:
mask: ...v..vv...v..vv
src: 0000001100000010
^ ^^ ^ 0
| || | 10
| || > 010
| |+--> 1010
| +--> 11010
+----> 011010
dst: 0000000000011010
bitmap_gather() can the seen as the reverse bitmap_scatter() operation.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/lkml/20230926052007.3917389-3-andriy.shevchenko@linux.intel.com/
Co-developed-by: Herve Codina <herve.codina@bootlin.com>
Signed-off-by: Herve Codina <herve.codina@bootlin.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Move the declaration of functions defined in the OPP core to pm_opp.h.
These were added to cpufreq.h as it was the only user of the APIs, but
that was a mistake perhaps. Fix it.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
|
|
Let's extend the dev_pm_opp_data with a turbo variable, to allow users to
specify if it's a boost frequency for a dynamically added OPP.
Signed-off-by: Sibi Sankar <quic_sibis@quicinc.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
|
|
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the input_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Link: https://lore.kernel.org/r/20240305-class_cleanup-input-v1-1-0c3d950c25db@marliere.net
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
|
|
Introduce PF_MEMALLOC_* equivalents of some GFP_ flags:
PF_MEMALLOC_NORECLAIM -> GFP_NOWAIT
PF_MEMALLOC_NOWARN -> __GFP_NOWARN
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-mm@kvack.org
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Our proliferation of memalloc_*_{save,restore} APIs is getting a bit
silly, this adds a generic version and converts the existing
save/restore functions to wrappers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Vlastimil Babka <vbabka@suse.cz>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Do not allow large strings (> 4096) as single write to trace_marker
The size of a string written into trace_marker was determined by the
size of the sub-buffer in the ring buffer. That size is dependent on
the PAGE_SIZE of the architecture as it can be mapped into user
space. But on PowerPC, where PAGE_SIZE is 64K, that made the limit of
the string of writing into trace_marker 64K.
One of the selftests looks at the size of the ring buffer sub-buffers
and writes that plus more into the trace_marker. The write will take
what it can and report back what it consumed so that the user space
application (like echo) will write the rest of the string. The string
is stored in the ring buffer and can be read via the "trace" or
"trace_pipe" files.
The reading of the ring buffer uses vsnprintf(), which uses a
precision "%.*s" to make sure it only reads what is stored in the
buffer, as a bug could cause the string to be non terminated.
With the combination of the precision change and the PAGE_SIZE of 64K
allowing huge strings to be added into the ring buffer, plus the test
that would actually stress that limit, a bug was reported that the
precision used was too big for "%.*s" as the string was close to 64K
in size and the max precision of vsnprintf is 32K.
Linus suggested not to have that precision as it could hide a bug if
the string was again stored without a nul byte.
Another issue that was brought up is that the trace_seq buffer is
also based on PAGE_SIZE even though it is not tied to the
architecture limit like the ring buffer sub-buffer is. Having it be
64K * 2 is simply just too big and wasting memory on systems with 64K
page sizes. It is now hardcoded to 8K which is what all other
architectures with 4K PAGE_SIZE has.
Finally, the write to trace_marker is now limited to 4K as there is
no reason to write larger strings into trace_marker.
- ring_buffer_wait() should not loop.
The ring_buffer_wait() does not have the full context (yet) on if it
should loop or not. Just exit the loop as soon as its woken up and
let the callers decide to loop or not (they already do, so it's a bit
redundant).
- Fix shortest_full field to be the smallest amount in the ring buffer
that a waiter is waiting for. The "shortest_full" field is updated
when a new waiter comes in and wants to wait for a smaller amount of
data in the ring buffer than other waiters. But after all waiters are
woken up, it's not reset, so if another waiter comes in wanting to
wait for more data, it will be woken up when the ring buffer has a
smaller amount from what the previous waiters were waiting for.
- The wake up all waiters on close is incorrectly called frome
.release() and not from .flush() so it will never wake up any waiters
as the .release() will not get called until all .read() calls are
finished. And the wakeup is for the waiters in those .read() calls.
* tag 'trace-ring-buffer-v6.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Use .flush() call to wake up readers
ring-buffer: Fix resetting of shortest_full
ring-buffer: Fix waking up ring buffer readers
tracing: Limit trace_marker writes to just 4K
tracing: Limit trace_seq size to just 8K and not depend on architecture PAGE_SIZE
tracing: Remove precision vsnprintf() check from print event
|
|
Pull kvm fixes from Paolo Bonzini:
"KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
to avoid creating an inconsistent ABI (KVM_MEM_GUEST_MEMFD is not
writable from userspace, so there would be no way to write to a
read-only guest_memfd).
- Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
clear that such VMs are purely for development and testing.
- Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term
plan is to support confidential VMs with deterministic private
memory (SNP and TDX) only in the TDP MMU.
- Fix a bug in a GUEST_MEMFD dirty logging test that caused false
passes.
x86 fixes:
- Fix missing marking of a guest page as dirty when emulating an
atomic access.
- Check for mmu_notifier invalidation events before faulting in the
pfn, and before acquiring mmu_lock, to avoid unnecessary work and
lock contention with preemptible kernels (including
CONFIG_PREEMPT_DYNAMIC in non-preemptible mode).
- Disable AMD DebugSwap by default, it breaks VMSA signing and will
be re-enabled with a better VM creation API in 6.10.
- Do the cache flush of converted pages in svm_register_enc_region()
before dropping kvm->lock, to avoid a race with unregistering of
the same region and the consequent use-after-free issue"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
SEV: disable SEV-ES DebugSwap by default
KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing
KVM: SVM: Flush pages under kvm->lock to fix UAF in svm_register_enc_region()
KVM: selftests: Add a testcase to verify GUEST_MEMFD and READONLY are exclusive
KVM: selftests: Create GUEST_MEMFD for relevant invalid flags testcases
KVM: x86/mmu: Restrict KVM_SW_PROTECTED_VM to the TDP MMU
KVM: x86: Update KVM_SW_PROTECTED_VM docs to make it clear they're a WIP
KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY
KVM: x86: Mark target gfn of emulated atomic instruction as dirty
|
|
https://github.com/kvm-x86/linux into HEAD
KVM GUEST_MEMFD fixes for 6.8:
- Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to
avoid creating ABI that KVM can't sanely support.
- Update documentation for KVM_SW_PROTECTED_VM to make it abundantly
clear that such VMs are purely a development and testing vehicle, and
come with zero guarantees.
- Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan
is to support confidential VMs with deterministic private memory (SNP
and TDX) only in the TDP MMU.
- Fix a bug in a GUEST_MEMFD negative test that resulted in false passes
when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
|
|
In production we have been hitting the following warning consistently
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0
Workqueue: nfsiod nfs_direct_write_schedule_work [nfs]
RIP: 0010:refcount_warn_saturate+0x9c/0xe0
PKRU: 55555554
Call Trace:
<TASK>
? __warn+0x9f/0x130
? refcount_warn_saturate+0x9c/0xe0
? report_bug+0xcc/0x150
? handle_bug+0x3d/0x70
? exc_invalid_op+0x16/0x40
? asm_exc_invalid_op+0x16/0x20
? refcount_warn_saturate+0x9c/0xe0
nfs_direct_write_schedule_work+0x237/0x250 [nfs]
process_one_work+0x12f/0x4a0
worker_thread+0x14e/0x3b0
? ZSTD_getCParams_internal+0x220/0x220
kthread+0xdc/0x120
? __btf_name_valid+0xa0/0xa0
ret_from_fork+0x1f/0x30
This is because we're completing the nfs_direct_request twice in a row.
The source of this is when we have our commit requests to submit, we
process them and send them off, and then in the completion path for the
commit requests we have
if (nfs_commit_end(cinfo.mds))
nfs_direct_write_complete(dreq);
However since we're submitting asynchronous requests we sometimes have
one that completes before we submit the next one, so we end up calling
complete on the nfs_direct_request twice.
The only other place we use nfs_generic_commit_list() is in
__nfs_commit_inode, which wraps this call in a
nfs_commit_begin();
nfs_commit_end();
Which is a common pattern for this style of completion handling, one
that is also repeated in the direct code with get_dreq()/put_dreq()
calls around where we process events as well as in the completion paths.
Fix this by using the same pattern for the commit requests.
Before with my 200 node rocksdb stress running this warning would pop
every 10ish minutes. With my patch the stress test has been running for
several hours without popping.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
We want to be able to have our rpc stats handled in a per network
namespace manner, so add an option to rpc_create_args to specify a
different rpc_stats struct instead of using the one on the rpc_program.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
Nothing uses this, and thank goodness, as the syntax looks horrid.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
|
To allow event log info access after boot, EFI boot stub extracts
the event log information and installs it in an EFI configuration
table. Currently, EFI boot stub only supports installation of event
log only for TPM 1.2 and TPM 2.0 protocols. Extend the same support
for CC protocol. Since CC platform also uses TCG2 format, reuse TPM2
support code as much as possible.
Link: https://uefi.org/specs/UEFI/2.10/38_Confidential_Computing.html#efi-cc-measurement-protocol [1]
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lkml.kernel.org/r/0229a87e-fb19-4dad-99fc-4afd7ed4099a%40collabora.com
[ardb: Split out final events table handling to avoid version confusion]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
If the virtual firmware implements TPM support, TCG2 protocol will be
used for kernel measurements and event logging support. But in CC
environment, not all platforms support or enable the TPM feature. UEFI
specification [1] exposes protocol and interfaces used for kernel
measurements in CC platforms without TPM support.
More details about the EFI CC measurements and logging can be found
in [1].
Link: https://uefi.org/specs/UEFI/2.10/38_Confidential_Computing.html#efi-cc-measurement-protocol [1]
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
[ardb: Drop code changes, keep typedefs and #define's only]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
The LINUX_EFI_ GUID identifiers are only intended to be used to refer to
GUIDs that are part of the Linux implementation, and are not considered
external ABI. (Famous last words).
GUIDs that already have a symbolic name in the spec should use that
name, to avoid confusion between firmware components. So use the
official name EFI_TCG2_FINAL_EVENTS_TABLE_GUID for the TCG2 'final
events' configuration table.
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
Support Multi-PF netdev (Socket Direct)
This series adds support for combining multiple devices (PFs) of the
same port under one netdev instance. Passing traffic through different
devices belonging to different NUMA sockets saves cross-numa traffic and
allows apps running on the same netdev from different numas to still
feel a sense of proximity to the device and achieve improved
performance.
We achieve this by grouping PFs together, and creating the netdev only
once all group members are probed. Symmetrically, we destroy the netdev
once any of the PFs is removed.
The channels are distributed between all devices, a proper configuration
would utilize the correct close numa when working on a certain app/cpu.
We pick one device to be a primary (leader), and it fills a special
role. The other devices (secondaries) are disconnected from the network
in the chip level (set to silent mode). All RX/TX traffic is steered
through the primary to/from the secondaries.
Currently, we limit the support to PFs only, and up to two devices
(sockets).
* tag 'mlx5-socket-direct-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
Documentation: networking: Add description for multi-pf netdev
net/mlx5: Enable SD feature
net/mlx5e: Block TLS device offload on combined SD netdev
net/mlx5e: Support per-mdev queue counter
net/mlx5e: Support cross-vhca RSS
net/mlx5e: Let channels be SD-aware
net/mlx5e: Create EN core HW resources for all secondary devices
net/mlx5e: Create single netdev per SD group
net/mlx5: SD, Add debugfs
net/mlx5: SD, Add informative prints in kernel log
net/mlx5: SD, Implement steering for primary and secondaries
net/mlx5: SD, Implement devcom communication and primary election
net/mlx5: SD, Implement basic query and instantiation
net/mlx5: SD, Introduce SD lib
net/mlx5: Add MPIR bit in mcam_access_reg
====================
Link: https://lore.kernel.org/r/20240307084229.500776-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Both AER and DPC RP PIO provide TLP Header Log registers (PCIe r6.1 secs
7.8.4 & 7.9.14) to convey error diagnostics but the struct is named after
AER as the struct aer_header_log_regs. Also, not all places that handle TLP
Header Log use the struct and the struct members are named individually.
Generalize the struct name and members, and use it consistently where TLP
Header Log is being handled so that a pcie_read_tlp_log() helper can be
easily added.
Link: https://lore.kernel.org/r/20240206135717.8565-3-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
[bhelgaas: drop ixgbe changes for now, tidy whitespace]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
|
|
Similar to skb_unref(), add skb_data_unref() to save an expensive
atomic operation (and cache line dirtying) when last reference
on shinfo->dataref is released.
I saw this opportunity on hosts with RAW sockets accidentally
bound to UDP protocol, forcing an skb_clone() on all received packets.
These RAW sockets had their receive queue full, so all clone
packets were immediately dropped.
When UDP recvmsg() consumes later the original skb, skb_release_data()
is hitting atomic_sub_return() quite badly, because skb->clone
has been set permanently.
Note that this patch helps TCP TX performance, because
TCP stack also use (fast) clones.
This means that at least one of the two packets (the main skb or
its clone) will no longer have to perform this atomic operation
in skb_release_data().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240307123446.2302230-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When enabling CONFIG_OF on a platform where 'of_root' is not populated
by firmware, we end up without a root node. In order to apply overlays
and create subnodes of the root node, we need one. Create this root node
by unflattening an empty builtin dtb.
If firmware provides a flattened device tree (FDT) then the FDT is
unflattened via setup_arch(). Otherwise, the call to
unflatten(_and_copy)?_device_tree() will create an empty root node.
We make of_have_populated_dt() return true only if the DTB was loaded by
firmware so that existing callers don't change behavior after this
patch. The call in the of platform code is removed because it prevents
overlays from creating platform devices when the empty root node is
used.
[sboyd@kernel.org: Update of_have_populated_dt() to treat this empty dtb
as not populated. Drop setup_of() initcall]
Signed-off-by: Frank Rowand <frowand.list@gmail.com>
Link: https://lore.kernel.org/r/20230317053415.2254616-2-frowand.list@gmail.com
Cc: Rob Herring <robh+dt@kernel.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Link: https://lore.kernel.org/r/20240217010557.2381548-3-sboyd@kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.9
The fourth "new features" pull request for v6.9 with changes both in
stack and in drivers. The theme in this pull request is to fix sparse
warnings but we still have some left in wireless subsystem. Otherwise
quite normal.
Major changes:
rtw89
* NL80211_EXT_FEATURE_SCAN_RANDOM_SN support
* NL80211_EXT_FEATURE_SET_SCAN_DWELL support
rtw88
* support for more rtw8811cu and rtw8821cu devices
mt76
* mt76x2u: add Netgear WNDA3100v3 USB
* mt7915: newer ADIE version support
* mt7925: radio temperature sensor support
* mt7996: remove GCMP IGTK offload
* tag 'wireless-next-2024-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (125 commits)
wifi: rtw89: wow: move release offload packet earlier for WoWLAN mode
wifi: rtw89: wow: set security engine options for 802.11ax chips only
wifi: rtw89: update suspend/resume for different generation
wifi: rtw89: wow: update config mac function with different generation
wifi: rtw89: update DMA function with different generation
wifi: rtw89: wow: update WoWLAN status register for different generation
wifi: rtw89: wow: update WoWLAN reason register for different chips
wifi: brcm80211: handle pmk_op allocation failure
wifi: rtw89: coex: Add coexistence policy to decrease WiFi packet CRC-ERR
wifi: rtw89: coex: When Bluetooth not available don't set power/gain
wifi: rtw89: coex: add return value to ensure H2C command is success or not
wifi: rtw89: coex: Reorder H2C command index to align with firmware
wifi: rtw89: coex: add BTC ctrl_info version 7 and related logic
wifi: rtw89: coex: add init_info H2C command format version 7
wifi: rtw89: 8922a: add coexistence helpers of SW grant
wifi: rtw89: mac: add coexistence helpers {cfg/get}_plt
wifi: cw1200: restore endian swapping
wifi: wlcore: sdio: Rate limit wl12xx_sdio_raw_{read,write}() failures warns
wifi: rtlwifi: Remove rtl_intf_ops.read_efuse_byte
wifi: rtw88: 8821c: Fix false alarm count
...
====================
Link: https://lore.kernel.org/r/20240308100429.B8EA2C433F1@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We only use the flag for this purpose, so rename it accordingly. This
further prevents various other use cases of it, keeping it clean and
consistent. Then we can also check it in one spot, when it's being
attempted recycled, and remove some dead code in io_kbuf_recycle_ring().
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit 43a7206b0963 ("driver core: class: make class_register() take
a const *"), the driver core allows for struct class to be in read-only
memory, so move the rtc_class structure to be declared at build time
placing it into read-only memory, instead of having to be dynamically
allocated at boot time.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net>
Link: https://lore.kernel.org/r/20240305-class_cleanup-abelloni-v1-1-944c026137c8@marliere.net
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
|
|
softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?
Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.
We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.
This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).
The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.
To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.
Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.
We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
'x86/amd' and 'core' into next
|
|
The ethtool-nl family does a good job exposing various protocol
related and IEEE/IETF statistics which used to get dumped under
ethtool -S, with creative names. Queue stats don't have a netlink
API, yet, and remain a lion's share of ethtool -S output for new
drivers. Not only is that bad because the names differ driver to
driver but it's also bug-prone. Intuitively drivers try to report
only the stats for active queues, but querying ethtool stats
involves multiple system calls, and the number of stats is
read separately from the stats themselves. Worse still when user
space asks for values of the stats, it doesn't inform the kernel
how big the buffer is. If number of stats increases in the meantime
kernel will overflow user buffer.
Add a netlink API for dumping queue stats. Queue information is
exposed via the netdev-genl family, so add the stats there.
Support per-queue and sum-for-device dumps. Latter will be useful
when subsequent patches add more interesting common stats than
just bytes and packets.
The API does not currently distinguish between HW and SW stats.
The expectation is that the source of the stats will either not
matter much (good packets) or be obvious (skb alloc errors).
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240306195509.1502746-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move RPS related structures and helpers from include/linux/netdevice.h
and include/net/sock.h to a new include file.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
skbuff_cache, skbuff_fclone_cache and skb_small_head_cache
are used in rx/tx fast paths.
Move them to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-11-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dev_rx_weight is read from process_backlog().
Move it to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-10-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
dev_tx_weight is used in tx fast path.
Move it to net_hotdata for better cache locality.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-9-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|