summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2022-07-28KVM: nVMX: Attempt to load PERF_GLOBAL_CTRL on nVMX xfer iff it existsSean Christopherson
Attempt to load PERF_GLOBAL_CTRL during nested VM-Enter/VM-Exit if and only if the MSR exists (according to the guest vCPU model). KVM has very misguided handling of VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL and attempts to force the nVMX MSR settings to match the vPMU model, i.e. to hide/expose the control based on whether or not the MSR exists from the guest's perspective. KVM's modifications fail to handle the scenario where the vPMU is hidden from the guest _after_ being exposed to the guest, e.g. by userspace doing multiple KVM_SET_CPUID2 calls, which is allowed if done before any KVM_RUN. nested_vmx_pmu_refresh() is called if and only if there's a recognized vPMU, i.e. KVM will leave the bits in the allow state and then ultimately reject the MSR load and WARN. KVM should not force the VMX MSRs in the first place. KVM taking control of the MSRs was a misguided attempt at mimicking what commit 5f76f6f5ff96 ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled", 2018-10-01) did for MPX. However, the MPX commit was a workaround for another KVM bug and not something that should be imitated (and it should never been done in the first place). In other words, KVM's ABI _should_ be that userspace has full control over the MSRs, at which point triggering the WARN that loading the MSR must not fail is trivial. The intent of the WARN is still valid; KVM has consistency checks to ensure that vmcs12->{guest,host}_ia32_perf_global_ctrl is valid. The problem is that '0' must be considered a valid value at all times, and so the simple/obvious solution is to just not actually load the MSR when it does not exist. It is userspace's responsibility to provide a sane vCPU model, i.e. KVM is well within its ABI and Intel's VMX architecture to skip the loads if the MSR does not exist. Fixes: 03a8871add95 ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: VMX: Add helper to check if the guest PMU has PERF_GLOBAL_CTRLSean Christopherson
Add a helper to check of the guest PMU has PERF_GLOBAL_CTRL, which is unintuitive _and_ diverges from Intel's architecturally defined behavior. Even worse, KVM currently implements the check using two different (but equivalent) checks, _and_ there has been at least one attempt to add a _third_ flavor. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: VMX: Mark all PERF_GLOBAL_(OVF)_CTRL bits reserved if there's no vPMUSean Christopherson
Mark all MSR_CORE_PERF_GLOBAL_CTRL and MSR_CORE_PERF_GLOBAL_OVF_CTRL bits as reserved if there is no guest vPMU. The nVMX VM-Entry consistency checks do not check for a valid vPMU prior to consuming the masks via kvm_valid_perf_global_ctrl(), i.e. may incorrectly allow a non-zero mask to be loaded via VM-Enter or VM-Exit (well, attempted to be loaded, the actual MSR load will be rejected by intel_is_valid_msr()). Fixes: f5132b01386b ("KVM: Expose a version 2 architectural PMU to a guests") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220722224409.1336532-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28Revert "KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"Paolo Bonzini
Since commit 5f76f6f5ff96 ("KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled"), KVM has taken ownership of the "load IA32_BNDCFGS" and "clear IA32_BNDCFGS" VMX entry/exit controls, trying to set these bits in the IA32_VMX_TRUE_{ENTRY,EXIT}_CTLS MSRs if the guest's CPUID supports MPX, and clear otherwise. The intent of the patch was to apply it to L0 in order to work around L1 kernels that lack the fix in commit 691bd4340bef ("kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS", 2017-07-04): by hiding the control bits from L0, L1 hides BNDCFGS from KVM_GET_MSR_INDEX_LIST, and the L1 bug is neutralized even in the lack of commit 691bd4340bef. This was perhaps a sensible kludge at the time, but a horrible idea in the long term and in fact it has not been extended to other CPUID bits like these: X86_FEATURE_LM => VM_EXIT_HOST_ADDR_SPACE_SIZE, VM_ENTRY_IA32E_MODE, VMX_MISC_SAVE_EFER_LMA X86_FEATURE_TSC => CPU_BASED_RDTSC_EXITING, CPU_BASED_USE_TSC_OFFSETTING, SECONDARY_EXEC_TSC_SCALING X86_FEATURE_INVPCID_SINGLE => SECONDARY_EXEC_ENABLE_INVPCID X86_FEATURE_MWAIT => CPU_BASED_MONITOR_EXITING, CPU_BASED_MWAIT_EXITING X86_FEATURE_INTEL_PT => SECONDARY_EXEC_PT_CONCEAL_VMX, SECONDARY_EXEC_PT_USE_GPA, VM_EXIT_CLEAR_IA32_RTIT_CTL, VM_ENTRY_LOAD_IA32_RTIT_CTL X86_FEATURE_XSAVES => SECONDARY_EXEC_XSAVES These days it's sort of common knowledge that any MSR in KVM_GET_MSR_INDEX_LIST must allow *at least* setting it with KVM_SET_MSR to a default value, so it is unlikely that something like commit 5f76f6f5ff96 will be needed again. So revert it, at the potential cost of breaking L1s with a 6 year old kernel. While in principle the L0 owner doesn't control what runs on L1, such an old hypervisor would probably have many other bugs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nVMX: Let userspace set nVMX MSR to any _host_ supported valueSean Christopherson
Restrict the nVMX MSRs based on KVM's config, not based on the guest's current config. Using the guest's config to audit the new config prevents userspace from restoring the original config (KVM's config) if at any point in the past the guest's config was restricted in any way. Fixes: 62cc6b9dc61e ("KVM: nVMX: support restore of VMX capability MSRs") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nVMX: Rename handle_vm{on,off}() to handle_vmx{on,off}()Sean Christopherson
Rename the exit handlers for VMXON and VMXOFF to match the instruction names, the terms "vmon" and "vmoff" are not used anywhere in Intel's documentation, nor are they used elsehwere in KVM. Sadly, the exit reasons are exposed to userspace and so cannot be renamed without breaking userspace. :-( Fixes: ec378aeef9df ("KVM: nVMX: Implement VMXON and VMXOFF") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4Sean Christopherson
Inject a #UD if L1 attempts VMXON with a CR0 or CR4 that is disallowed per the associated nested VMX MSRs' fixed0/1 settings. KVM cannot rely on hardware to perform the checks, even for the few checks that have higher priority than VM-Exit, as (a) KVM may have forced CR0/CR4 bits in hardware while running the guest, (b) there may incompatible CR0/CR4 bits that have lower priority than VM-Exit, e.g. CR0.NE, and (c) userspace may have further restricted the allowed CR0/CR4 values by manipulating the guest's nested VMX MSRs. Note, despite a very strong desire to throw shade at Jim, commit 70f3aac964ae ("kvm: nVMX: Remove superfluous VMX instruction fault checks") is not to blame for the buggy behavior (though the comment...). That commit only removed the CR0.PE, EFLAGS.VM, and COMPATIBILITY mode checks (though it did erroneously drop the CPL check, but that has already been remedied). KVM may force CR0.PE=1, but will do so only when also forcing EFLAGS.VM=1 to emulate Real Mode, i.e. hardware will still #UD. Link: https://bugzilla.kernel.org/show_bug.cgi?id=216033 Fixes: ec378aeef9df ("KVM: nVMX: Implement VMXON and VMXOFF") Reported-by: Eric Li <ercli@ucdavis.edu> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nVMX: Account for KVM reserved CR4 bits in consistency checksSean Christopherson
Check that the guest (L2) and host (L1) CR4 values that would be loaded by nested VM-Enter and VM-Exit respectively are valid with respect to KVM's (L0 host) allowed CR4 bits. Failure to check KVM reserved bits would allow L1 to load an illegal CR4 (or trigger hardware VM-Fail or failed VM-Entry) by massaging guest CPUID to allow features that are not supported by KVM. Amusingly, KVM itself is an accomplice in its doom, as KVM adjusts L1's MSR_IA32_VMX_CR4_FIXED1 to allow L1 to enable bits for L2 based on L1's CPUID model. Note, although nested_{guest,host}_cr4_valid() are _currently_ used if and only if the vCPU is post-VMXON (nested.vmxon == true), that may not be true in the future, e.g. emulating VMXON has a bug where it doesn't check the allowed/required CR0/CR4 bits. Cc: stable@vger.kernel.org Fixes: 3899152ccbf4 ("KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86: Split kvm_is_valid_cr4() and export only the non-vendor bitsSean Christopherson
Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits checks, into a separate helper, __kvm_is_valid_cr4(), and export only the inner helper to vendor code in order to prevent nested VMX from calling back into vmx_is_valid_cr4() via kvm_is_valid_cr4(). On SVM, this is a nop as SVM doesn't place any additional restrictions on CR4. On VMX, this is also currently a nop, but only because nested VMX is missing checks on reserved CR4 bits for nested VM-Enter. That bug will be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is, but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's restrictions on CR0/CR4. The cleanest and most intuitive way to fix the VMXON bug is to use nested_host_cr{0,4}_valid(). If the CR4 variant routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's restrictions if and only if the vCPU is post-VMXON. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220607213604.3346000-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Don't bottom out on leafs when zapping collapsible SPTEsSean Christopherson
When zapping collapsible SPTEs in the TDP MMU, don't bottom out on a leaf SPTE now that KVM doesn't require a PFN to compute the host mapping level, i.e. now that there's no need to first find a leaf SPTE and then step back up. Drop the now unused tdp_iter_step_up(), as it is not the safest of helpers (using any of the low level iterators requires some understanding of the various side effects). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715232107.3775620-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Document the "rules" for using host_pfn_mapping_level()Sean Christopherson
Add a comment to document how host_pfn_mapping_level() can be used safely, as the line between safe and dangerous is quite thin. E.g. if KVM were to ever support in-place promotion to create huge pages, consuming the level is safe if the caller holds mmu_lock and checks that there's an existing _leaf_ SPTE, but unsafe if the caller only checks that there's a non-leaf SPTE. Opportunistically tweak the existing comments to explicitly document why KVM needs to use READ_ONCE(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715232107.3775620-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Don't require refcounted "struct page" to create huge SPTEsSean Christopherson
Drop the requirement that a pfn be backed by a refcounted, compound or or ZONE_DEVICE, struct page, and instead rely solely on the host page tables to identify huge pages. The PageCompound() check is a remnant of an old implementation that identified (well, attempt to identify) huge pages without walking the host page tables. The ZONE_DEVICE check was added as an exception to the PageCompound() requirement. In other words, neither check is actually a hard requirement, if the primary has a pfn backed with a huge page, then KVM can back the pfn with a huge page regardless of the backing store. Dropping the @pfn parameter will also allow KVM to query the max host mapping level without having to first get the pfn, which is advantageous for use outside of the page fault path where KVM wants to take action if and only if a page can be mapped huge, i.e. avoids the pfn lookup for gfns that can't be backed with a huge page. Cc: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220715232107.3775620-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Restrict mapping level based on guest MTRR iff they're usedSean Christopherson
Restrict the mapping level for SPTEs based on the guest MTRRs if and only if KVM may actually use the guest MTRRs to compute the "real" memtype. For all forms of paging, guest MTRRs are purely virtual in the sense that they are completely ignored by hardware, i.e. they affect the memtype only if software manually consumes them. The only scenario where KVM consumes the guest MTRRs is when shadow_memtype_mask is non-zero and the guest has non-coherent DMA, in all other cases KVM simply leaves the PAT field in SPTEs as '0' to encode WB memtype. Note, KVM may still ultimately ignore guest MTRRs, e.g. if the backing pfn is host MMIO, but false positives are ok as they only cause a slight performance blip (unless the guest is doing weird things with its MTRRs, which is extremely unlikely). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Add shadow mask for effective host MTRR memtypeSean Christopherson
Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask instead of relying on TDP being enabled, as NPT doesn't need a non-zero mask. This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero for NPT anyways. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86: Drop unnecessary goto+label in kvm_arch_init()Sean Christopherson
Return directly if kvm_arch_init() detects an error before doing any real work, jumping through a label obfuscates what's happening and carries the unnecessary risk of leaving 'r' uninitialized. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86: Reject loading KVM if host.PAT[0] != WBSean Christopherson
Reject KVM if entry '0' in the host's IA32_PAT MSR is not programmed to writeback (WB) memtype. KVM subtly relies on IA32_PAT entry '0' to be programmed to WB by leaving the PAT bits in shadow paging and NPT SPTEs as '0'. If something other than WB is in PAT[0], at _best_ guests will suffer very poor performance, and at worst KVM will crash the system by breaking cache-coherency expecations (e.g. using WC for guest memory). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220715230016.3762909-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: SVM: Fix x2APIC MSRs interceptionSuravee Suthikulpanit
The index for svm_direct_access_msrs was incorrectly initialized with the APIC MMIO register macros. Fix by introducing a macro for calculating x2APIC MSRs. Fixes: 5c127c85472c ("KVM: SVM: Adding support for configuring x2APIC MSRs interception") Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <20220718083833.222117-1-suravee.suthikulpanit@amd.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Remove underscores from __pte_list_remove()Sean Christopherson
Remove the underscores from __pte_list_remove(), the function formerly known as pte_list_remove() is now named kvm_zap_one_rmap_spte() to show that it zaps rmaps/PTEs, i.e. doesn't just remove an entry from a list. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Rename pte_list_{destroy,remove}() to show they zap SPTEsSean Christopherson
Rename pte_list_remove() and pte_list_destroy() to kvm_zap_one_rmap_spte() and kvm_zap_all_rmap_sptes() respectively to document that (a) they zap SPTEs and (b) to better document how they differ (remove vs. destroy does not exactly scream "one vs. all"). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Rename rmap zap helpers to eliminate "unmap" wrapperSean Christopherson
Rename kvm_unmap_rmap() and kvm_zap_rmap() to kvm_zap_rmap() and __kvm_zap_rmap() respectively to show that what was the "unmap" helper is just a wrapper for the "zap" helper, i.e. that they do the exact same thing, one just exists to deal with its caller passing in more params. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Rename __kvm_zap_rmaps() to align with other nomenclatureSean Christopherson
Rename __kvm_zap_rmaps() to kvm_rmap_zap_gfn_range() to avoid future confusion with a soon-to-be-introduced __kvm_zap_rmap(). Using a plural "rmaps" is somewhat ambiguous without additional context, as it's not obvious whether it's referring to multiple rmap lists, versus multiple rmap entries within a single list. Use kvm_rmap_zap_gfn_range() to align with the pattern established by kvm_rmap_zap_collapsible_sptes(), without losing the information that it zaps only rmap-based MMUs, i.e. don't rename it to __kvm_zap_gfn_range(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Drop the "p is for pointer" from rmap helpersSean Christopherson
Drop the trailing "p" from rmap helpers, i.e. rename functions to simply be kvm_<action>_rmap(). Declaring that a function takes a pointer is completely unnecessary and goes against kernel style. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Directly "destroy" PTE list when recycling rmapsSean Christopherson
Use pte_list_destroy() directly when recycling rmaps instead of bouncing through kvm_unmap_rmapp() and kvm_zap_rmapp(). Calling kvm_unmap_rmapp() is unnecessary and odd as it requires passing dummy parameters; passing NULL for @slot when __rmap_add() already has a valid slot is especially weird and confusing. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: x86/mmu: Return a u64 (the old SPTE) from mmu_spte_clear_track_bits()Sean Christopherson
Return a u64, not an int, from mmu_spte_clear_track_bits(). The return value is the old SPTE value, which is very much a 64-bit value. The sole caller that consumes the return value, drop_spte(), already uses a u64. The only reason that truncating the SPTE value is not problematic is because drop_spte() only queries the shadow-present bit, which is in the lower 32 bits. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220715224226.3749507-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-28KVM: nSVM: Pull CS.Base from actual VMCB12 for soft int/ex re-injectionMaciej S. Szmigiero
enter_svm_guest_mode() first calls nested_vmcb02_prepare_control() to copy control fields from VMCB12 to the current VMCB, then nested_vmcb02_prepare_save() to perform a similar copy of the save area. This means that nested_vmcb02_prepare_control() still runs with the previous save area values in the current VMCB so it shouldn't take the L2 guest CS.Base from this area. Explicitly pull CS.Base from the actual VMCB12 instead in enter_svm_guest_mode(). Granted, having a non-zero CS.Base is a very rare thing (and even impossible in 64-bit mode), having it change between nested VMRUNs is probably even rarer, but if it happens it would create a really subtle bug so it's better to fix it upfront. Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <4caa0f67589ae3c22c311ee0e6139496902f2edc.1658159083.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-27Revert "x86/sev: Expose sev_es_ghcb_hv_call() for use by HyperV"Borislav Petkov
This reverts commit 007faec014cb5d26983c1f86fd08c6539b41392e. Now that hyperv does its own protocol negotiation: 49d6a3c062a1 ("x86/Hyper-V: Add SEV negotiate protocol support in Isolation VM") revert this exposure of the sev_es_ghcb_hv_call() helper. Cc: Wei Liu <wei.liu@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by:Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/r/20220614014553.1915929-1-ltykernel@gmail.com
2022-07-27perf/x86/ibs: Add new IBS register bits into headerRavi Bangoria
IBS support has been enhanced with two new features in upcoming uarch: 1. DataSrc extension and 2. L3 miss filtering. Additional set of bits has been introduced in IBS registers to use these features. Define these new bits into arch/x86/ header. [ bp: Massage commit message. ] Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20220604044519.594-7-ravi.bangoria@amd.com
2022-07-26x86/cyrix: include header linux/isa-dma.hRandy Dunlap
x86/kernel/cpu/cyrix.c now needs to include <linux/isa-dma.h> since the 'isa_dma_bridge_buggy' variable was moved to it. Fixes this build error: ../arch/x86/kernel/cpu/cyrix.c: In function ‘init_cyrix’: ../arch/x86/kernel/cpu/cyrix.c:277:17: error: ‘isa_dma_bridge_buggy’ undeclared (first use in this function) 277 | isa_dma_bridge_buggy = 2; Fixes: abb4970ac335 ("PCI: Move isa_dma_bridge_buggy out of asm/dma.h") Link: https://lore.kernel.org/r/20220725202224.29269-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Stafford Horne <shorne@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org
2022-07-25random: handle archrandom with multiple longsJason A. Donenfeld
The archrandom interface was originally designed for x86, which supplies RDRAND/RDSEED for receiving random words into registers, resulting in one function to generate an int and another to generate a long. However, other architectures don't follow this. On arm64, the SMCCC TRNG interface can return between one and three longs. On s390, the CPACF TRNG interface can return arbitrary amounts, with four longs having the same cost as one. On UML, the os_getrandom() interface can return arbitrary amounts. So change the api signature to take a "max_longs" parameter designating the maximum number of longs requested, and then return the number of longs generated. Since callers need to check this return value and loop anyway, each arch implementation does not bother implementing its own loop to try again to fill the maximum number of longs. Additionally, all existing callers pass in a constant max_longs parameter. Taken together, these two things mean that the codegen doesn't really change much for one-word-at-a-time platforms, while performance is greatly improved on platforms such as s390. Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-07-25x86/purgatory: Omit use of bin2cMasahiro Yamada
The .incbin assembler directive is much faster than bin2c + $(CC). Do similar refactoring as in 4c0f032d4963 ("s390/purgatory: Omit use of bin2c"). Please note the .quad directive matches to size_t in C (both 8 byte) because the purgatory is compiled only for the 64-bit kernel. (KEXEC_FILE depends on X86_64). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220725020812.622255-2-masahiroy@kernel.org
2022-07-25x86/purgatory: Hard-code obj-y in MakefileMasahiro Yamada
arch/x86/Kbuild guards the entire purgatory/ directory, and CONFIG_KEXEC_FILE is bool type. $(CONFIG_KEXEC_FILE) is always 'y' when this directory is being built. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220725020812.622255-1-masahiroy@kernel.org
2022-07-24Merge tag 'perf_urgent_for_v5.19_rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Borislav Petkov: - Reorganize the perf LBR init code so that a TSX quirk is applied early enough in order for the LBR MSR access to not #GP * tag 'perf_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel/lbr: Fix unchecked MSR access error on HSW
2022-07-24Merge tag 'x86_urgent_for_v5.19_rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "A couple more retbleed fallout fixes. It looks like their urgency is decreasing so it seems like we've managed to catch whatever snafus the limited -rc testing has exposed. Maybe we're getting ready... :) - Make retbleed mitigations 64-bit only (32-bit will need a bit more work if even needed, at all). - Prevent return thunks patching of the LKDTM modules as it is not needed there - Avoid writing the SPEC_CTRL MSR on every kernel entry on eIBRS parts - Enhance error output of apply_returns() when it fails to patch a return thunk - A sparse fix to the sev-guest module - Protect EFI fw calls by issuing an IBPB on AMD" * tag 'x86_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation: Make all RETbleed mitigations 64-bit only lkdtm: Disable return thunks in rodata.c x86/bugs: Warn when "ibrs" mitigation is selected on Enhanced IBRS parts x86/alternative: Report missing return thunk details virt: sev-guest: Pass the appropriate argument type to iounmap() x86/amd: Use IBPB for firmware calls
2022-07-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: - Check for invalid flags to KVM_CAP_X86_USER_SPACE_MSR - Fix use of sched_setaffinity in selftests - Sync kernel headers to tools - Fix KVM_STATS_UNIT_MAX * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: Protect the unused bits in MSR exiting flags tools headers UAPI: Sync linux/kvm.h with the kernel sources KVM: selftests: Fix target thread to be migrated in rseq_test KVM: stats: Fix value for KVM_STATS_UNIT_MAX for boolean stats
2022-07-23x86/speculation: Make all RETbleed mitigations 64-bit onlyBen Hutchings
The mitigations for RETBleed are currently ineffective on x86_32 since entry_32.S does not use the required macros. However, for an x86_32 target, the kconfig symbols for them are still enabled by default and /sys/devices/system/cpu/vulnerabilities/retbleed will wrongly report that mitigations are in place. Make all of these symbols depend on X86_64, and only enable RETHUNK by default on X86_64. Fixes: f43b9876e857 ("x86/retbleed: Add fine grained Kconfig knobs") Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/YtwSR3NNsWp1ohfV@decadent.org.uk
2022-07-22Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== bpf-next 2022-07-22 We've added 73 non-merge commits during the last 12 day(s) which contain a total of 88 files changed, 3458 insertions(+), 860 deletions(-). The main changes are: 1) Implement BPF trampoline for arm64 JIT, from Xu Kuohai. 2) Add ksyscall/kretsyscall section support to libbpf to simplify tracing kernel syscalls through kprobe mechanism, from Andrii Nakryiko. 3) Allow for livepatch (KLP) and BPF trampolines to attach to the same kernel function, from Song Liu & Jiri Olsa. 4) Add new kfunc infrastructure for netfilter's CT e.g. to insert and change entries, from Kumar Kartikeya Dwivedi & Lorenzo Bianconi. 5) Add a ksym BPF iterator to allow for more flexible and efficient interactions with kernel symbols, from Alan Maguire. 6) Bug fixes in libbpf e.g. for uprobe binary path resolution, from Dan Carpenter. 7) Fix BPF subprog function names in stack traces, from Alexei Starovoitov. 8) libbpf support for writing custom perf event readers, from Jon Doron. 9) Switch to use SPDX tag for BPF helper man page, from Alejandro Colomar. 10) Fix xsk send-only sockets when in busy poll mode, from Maciej Fijalkowski. 11) Reparent BPF maps and their charging on memcg offlining, from Roman Gushchin. 12) Multiple follow-up fixes around BPF lsm cgroup infra, from Stanislav Fomichev. 13) Use bootstrap version of bpftool where possible to speed up builds, from Pu Lehui. 14) Cleanup BPF verifier's check_func_arg() handling, from Joanne Koong. 15) Make non-prealloced BPF map allocations low priority to play better with memcg limits, from Yafang Shao. 16) Fix BPF test runner to reject zero-length data for skbs, from Zhengchao Shao. 17) Various smaller cleanups and improvements all over the place. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (73 commits) bpf: Simplify bpf_prog_pack_[size|mask] bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch) bpf, x64: Allow to use caller address from stack ftrace: Allow IPMODIFY and DIRECT ops on the same function ftrace: Add modify_ftrace_direct_multi_nolock bpf/selftests: Fix couldn't retrieve pinned program in xdp veth test bpf: Fix build error in case of !CONFIG_DEBUG_INFO_BTF selftests/bpf: Fix test_verifier failed test in unprivileged mode selftests/bpf: Add negative tests for new nf_conntrack kfuncs selftests/bpf: Add tests for new nf_conntrack kfuncs selftests/bpf: Add verifier tests for trusted kfunc args net: netfilter: Add kfuncs to set and change CT status net: netfilter: Add kfuncs to set and change CT timeout net: netfilter: Add kfuncs to allocate and insert CT net: netfilter: Deduplicate code in bpf_{xdp,skb}_ct_lookup bpf: Add documentation for kfuncs bpf: Add support for forcing kfunc args to be trusted bpf: Switch to new kfunc flags infrastructure tools/resolve_btfids: Add support for 8-byte BTF sets bpf: Introduce 8-byte BTF set ... ==================== Link: https://lore.kernel.org/r/20220722221218.29943-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-22PCI: Move isa_dma_bridge_buggy out of asm/dma.hStafford Horne
The isa_dma_bridge_buggy symbol is only used for x86_32, and only x86_32 platforms or quirks ever set it. Add a new linux/isa-dma.h header that #defines isa_dma_bridge_buggy to 0 except on x86_32, where we keep it as a variable, and remove all the arch- specific definitions. [bhelgaas: commit log] Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/r/20220722214944.831438-3-shorne@gmail.com Signed-off-by: Stafford Horne <shorne@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
2022-07-22PCI: Remove pci_get_legacy_ide_irq() and asm-generic/pci.hStafford Horne
pci_get_legacy_ide_irq() is only used on platforms that support PNP, so many architectures define it but never use it. Replace uses of it with ATA_PRIMARY_IRQ() and ATA_SECONDARY_IRQ(), which provide the same functionality. Since pci_get_legacy_ide_irq() is no longer used, remove all the architecture-specific definitions of it as well as asm-generic/pci.h, which only provides pci_get_legacy_ide_irq() [bhelgaas: commit log] Co-developed-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20220722214944.831438-2-shorne@gmail.com Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Stafford Horne <shorne@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Pierre Morel <pmorel@linux.ibm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2022-07-22bpf, x64: Allow to use caller address from stackJiri Olsa
Currently we call the original function by using the absolute address given at the JIT generation. That's not usable when having trampoline attached to multiple functions, or the target address changes dynamically (in case of live patch). In such cases we need to take the return address from the stack. Adding support to retrieve the original function address from the stack by adding new BPF_TRAMP_F_ORIG_STACK flag for arch_prepare_bpf_trampoline function. Basically we take the return address of the 'fentry' call: function + 0: call fentry # stores 'function + 5' address on stack function + 5: ... The 'function + 5' address will be used as the address for the original function to call. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220720002126.803253-4-song@kernel.org
2022-07-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-21mmu_gather: Remove per arch tlb_{start,end}_vma()Peter Zijlstra
Scattered across the archs are 3 basic forms of tlb_{start,end}_vma(). Provide two new MMU_GATHER_knobs to enumerate them and remove the per arch tlb_{start,end}_vma() implementations. - MMU_GATHER_NO_FLUSH_CACHE indicates the arch has flush_cache_range() but does *NOT* want to call it for each VMA. - MMU_GATHER_MERGE_VMAS indicates the arch wants to merge the invalidate across multiple VMAs if possible. With these it is possible to capture the three forms: 1) empty stubs; select MMU_GATHER_NO_FLUSH_CACHE and MMU_GATHER_MERGE_VMAS 2) start: flush_cache_range(), end: empty; select MMU_GATHER_MERGE_VMAS 3) start: flush_cache_range(), end: flush_tlb_range(); default Obviously, if the architecture does not have flush_cache_range() then it also doesn't need to select MMU_GATHER_NO_FLUSH_CACHE. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-21x86/extable: Fix ex_handler_msr() print conditionPeter Zijlstra
On Fri, Jun 17, 2022 at 02:08:52PM +0300, Stephane Eranian wrote: > Some changes to the way invalid MSR accesses are reported by the > kernel is causing some problems with messages printed on the > console. > > We have seen several cases of ex_handler_msr() printing invalid MSR > accesses once but the callstack multiple times causing confusion on > the console. > The problem here is that another earlier commit (5.13): > > a358f40600b3 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality") > > Modifies all the pr_*_once() calls to always return true claiming > that no caller is ever checking the return value of the functions. > > This is why we are seeing the callstack printed without the > associated printk() msg. Extract the ONCE_IF(cond) part into __ONCE_LTE_IF() and use that to implement DO_ONCE_LITE_IF() and fix the extable code. Fixes: a358f40600b3 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality") Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Stephane Eranian <eranian@google.com> Link: https://lkml.kernel.org/r/YqyVFsbviKjVGGZ9@worktop.programming.kicks-ass.net
2022-07-21x86,nospec: Simplify {JMP,CALL}_NOSPECPeter Zijlstra
Have {JMP,CALL}_NOSPEC generate the same code GCC does for indirect calls and rely on the objtool retpoline patching infrastructure. There's no reason these should be alternatives while the vast bulk of compiler generated retpolines are not. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2022-07-20perf/x86/intel/lbr: Fix unchecked MSR access error on HSWKan Liang
The fuzzer triggers the below trace. [ 7763.384369] unchecked MSR access error: WRMSR to 0x689 (tried to write 0x1fffffff8101349e) at rIP: 0xffffffff810704a4 (native_write_msr+0x4/0x20) [ 7763.397420] Call Trace: [ 7763.399881] <TASK> [ 7763.401994] intel_pmu_lbr_restore+0x9a/0x1f0 [ 7763.406363] intel_pmu_lbr_sched_task+0x91/0x1c0 [ 7763.410992] __perf_event_task_sched_in+0x1cd/0x240 On a machine with the LBR format LBR_FORMAT_EIP_FLAGS2, when the TSX is disabled, a TSX quirk is required to access LBR from registers. The lbr_from_signext_quirk_needed() is introduced to determine whether the TSX quirk should be applied. However, the lbr_from_signext_quirk_needed() is invoked before the intel_pmu_lbr_init(), which parses the LBR format information. Without the correct LBR format information, the TSX quirk never be applied. Move the lbr_from_signext_quirk_needed() into the intel_pmu_lbr_init(). Checking x86_pmu.lbr_has_tsx in the lbr_from_signext_quirk_needed() is not required anymore. Both LBR_FORMAT_EIP_FLAGS2 and LBR_FORMAT_INFO have LBR_TSX flag, but only the LBR_FORMAT_EIP_FLAGS2 requirs the quirk. Update the comments accordingly. Fixes: 1ac7fd8159a8 ("perf/x86/intel/lbr: Support LBR format V7") Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220714182630.342107-1-kan.liang@linux.intel.com
2022-07-20lkdtm: Disable return thunks in rodata.cJosh Poimboeuf
The following warning was seen: WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:557 apply_returns (arch/x86/kernel/alternative.c:557 (discriminator 1)) Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc4-00008-gee88d363d156 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014 RIP: 0010:apply_returns (arch/x86/kernel/alternative.c:557 (discriminator 1)) Code: ff ff 74 cb 48 83 c5 04 49 39 ee 0f 87 81 fe ff ff e9 22 ff ff ff 0f 0b 48 83 c5 04 49 39 ee 0f 87 6d fe ff ff e9 0e ff ff ff <0f> 0b 48 83 c5 04 49 39 ee 0f 87 59 fe ff ff e9 fa fe ff ff 48 89 The warning happened when apply_returns() failed to convert "JMP __x86_return_thunk" to RET. It was instead a JMP to nowhere, due to the thunk relocation not getting resolved. That rodata.o code is objcopy'd to .rodata, and later memcpy'd, so relocations don't work (and are apparently silently ignored). LKDTM is only used for testing, so the naked RET should be fine. So just disable return thunks for that file. While at it, disable objtool and KCSAN for the file. Fixes: 0b53c374b9ef ("x86/retpoline: Use -mfunction-return") Reported-by: kernel test robot <oliver.sang@intel.com> Debugged-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/lkml/Ys58BxHxoDZ7rfpr@xsang-OptiPlex-9020/
2022-07-20x86/bugs: Warn when "ibrs" mitigation is selected on Enhanced IBRS partsPawan Gupta
IBRS mitigation for spectre_v2 forces write to MSR_IA32_SPEC_CTRL at every kernel entry/exit. On Enhanced IBRS parts setting MSR_IA32_SPEC_CTRL[IBRS] only once at boot is sufficient. MSR writes at every kernel entry/exit incur unnecessary performance loss. When Enhanced IBRS feature is present, print a warning about this unnecessary performance loss. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/2a5eaf54583c2bfe0edc4fea64006656256cca17.1657814857.git.pawan.kumar.gupta@linux.intel.com
2022-07-20x86/alternative: Report missing return thunk detailsKees Cook
Debugging missing return thunks is easier if we can see where they're happening. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/lkml/Ys66hwtFcGbYmoiZ@hirez.programming.kicks-ass.net/
2022-07-20x86/amd_nb: Add AMD PCI IDs for SMN communicationMario Limonciello
Add support for SMN communication on family 17h model A0h and family 19h models 60h-70h. [ bp: Merge into a single patch. ] Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h Acked-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20220719195256.1516-1-mario.limonciello@amd.com
2022-07-19x86/cpu: Use MSR_IA32_MISC_ENABLE constantsPaolo Bonzini
Instead of the magic numbers 1<<11 and 1<<12 use the constants from msr-index.h. This makes it obvious where those bits of MSR_IA32_MISC_ENABLE are consumed (and in fact that Linux consumes them at all) to simple minds that grep for MSR_IA32_MISC_ENABLE_.*_UNAVAIL. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220719174714.2410374-1-pbonzini@redhat.com
2022-07-19KVM: x86: Protect the unused bits in MSR exiting flagsAaron Lewis
The flags for KVM_CAP_X86_USER_SPACE_MSR and KVM_X86_SET_MSR_FILTER have no protection for their unused bits. Without protection, future development for these features will be difficult. Add the protection needed to make it possible to extend these features in the future. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220714161314.1715227-1-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>