diff options
Diffstat (limited to 'Documentation/virt/kvm/api.rst')
-rw-r--r-- | Documentation/virt/kvm/api.rst | 343 |
1 files changed, 259 insertions, 84 deletions
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index aed52b0fc16e..22d077562149 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -55,7 +55,7 @@ not cause harm to the host, their actual behavior is not guaranteed by the API. See "General description" for details on the ioctl usage model that is supported by KVM. -It is important to note that althought VM ioctls may only be issued from +It is important to note that although VM ioctls may only be issued from the process that created the VM, a VM's lifecycle is associated with its file descriptor, not its creator (process). In other words, the VM and its resources, *including the associated address space*, are not freed @@ -182,6 +182,9 @@ is dependent on the CPU capability and the kernel configuration. The limit can be retrieved using KVM_CAP_ARM_VM_IPA_SIZE of the KVM_CHECK_EXTENSION ioctl() at run-time. +Creation of the VM will fail if the requested IPA size (whether it is +implicit or explicit) is unsupported on the host. + Please note that configuring the IPA size does not affect the capability exposed by the guest CPUs in ID_AA64MMFR0_EL1[PARange]. It only affects size of the address translated by the stage2 level (guest physical to @@ -201,7 +204,7 @@ Errors: ====== ============================================================ EFAULT the msr index list cannot be read from or written to - E2BIG the msr index list is to be to fit in the array specified by + E2BIG the msr index list is too big to fit in the array specified by the user. ====== ============================================================ @@ -1492,7 +1495,8 @@ Fails if any VCPU has already been created. Define which vcpu is the Bootstrap Processor (BSP). Values are the same as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default -is vcpu 0. +is vcpu 0. This ioctl has to be called before vcpu creation, +otherwise it will return EBUSY error. 4.42 KVM_GET_XSAVE @@ -3112,6 +3116,18 @@ optional features it should have. This will cause a reset of the cpu registers to their initial values. If this is not called, KVM_RUN will return ENOEXEC for that vcpu. +The initial values are defined as: + - Processor state: + * AArch64: EL1h, D, A, I and F bits set. All other bits + are cleared. + * AArch32: SVC, A, I and F bits set. All other bits are + cleared. + - General Purpose registers, including PC and SP: set to 0 + - FPSIMD/NEON registers: set to 0 + - SVE registers: set to 0 + - System registers: Reset to their architecturally defined + values as for a warm reset to EL1 (resp. SVC) + Note that because some registers reflect machine topology, all vcpus should be created before this ioctl is invoked. @@ -3331,7 +3347,8 @@ The top 16 bits of the control field are architecture specific control flags which can include the following: - KVM_GUESTDBG_USE_SW_BP: using software breakpoints [x86, arm64] - - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390, arm64] + - KVM_GUESTDBG_USE_HW_BP: using hardware breakpoints [x86, s390] + - KVM_GUESTDBG_USE_HW: using hardware debug events [arm64] - KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86] - KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86] - KVM_GUESTDBG_EXIT_PENDING: trigger an immediate guest exit [s390] @@ -3354,6 +3371,9 @@ indicating the number of supported registers. For ppc, the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability indicates whether the single-step debug event (KVM_GUESTDBG_SINGLESTEP) is supported. +Also when supported, KVM_CAP_SET_GUEST_DEBUG2 capability indicates the +supported KVM_GUESTDBG_* bits in the control field. + When debug events exit the main run loop with the reason KVM_EXIT_DEBUG with the kvm_debug_exit_arch part of the kvm_run structure containing architecture specific debug information. @@ -3686,31 +3706,105 @@ which is the maximum number of possibly pending cpu-local interrupts. Queues an SMI on the thread's vcpu. -4.97 KVM_CAP_PPC_MULTITCE -------------------------- +4.97 KVM_X86_SET_MSR_FILTER +---------------------------- -:Capability: KVM_CAP_PPC_MULTITCE -:Architectures: ppc -:Type: vm +:Capability: KVM_X86_SET_MSR_FILTER +:Architectures: x86 +:Type: vm ioctl +:Parameters: struct kvm_msr_filter +:Returns: 0 on success, < 0 on error -This capability means the kernel is capable of handling hypercalls -H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user -space. This significantly accelerates DMA operations for PPC KVM guests. -User space should expect that its handlers for these hypercalls -are not going to be called if user space previously registered LIOBN -in KVM (via KVM_CREATE_SPAPR_TCE or similar calls). +:: -In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, -user space might have to advertise it for the guest. For example, -IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is -present in the "ibm,hypertas-functions" device-tree property. + struct kvm_msr_filter_range { + #define KVM_MSR_FILTER_READ (1 << 0) + #define KVM_MSR_FILTER_WRITE (1 << 1) + __u32 flags; + __u32 nmsrs; /* number of msrs in bitmap */ + __u32 base; /* MSR index the bitmap starts at */ + __u8 *bitmap; /* a 1 bit allows the operations in flags, 0 denies */ + }; -The hypercalls mentioned above may or may not be processed successfully -in the kernel based fast path. If they can not be handled by the kernel, -they will get passed on to user space. So user space still has to have -an implementation for these despite the in kernel acceleration. + #define KVM_MSR_FILTER_MAX_RANGES 16 + struct kvm_msr_filter { + #define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0) + #define KVM_MSR_FILTER_DEFAULT_DENY (1 << 0) + __u32 flags; + struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES]; + }; -This capability is always enabled. +flags values for ``struct kvm_msr_filter_range``: + +``KVM_MSR_FILTER_READ`` + + Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap + indicates that a read should immediately fail, while a 1 indicates that + a read for a particular MSR should be handled regardless of the default + filter action. + +``KVM_MSR_FILTER_WRITE`` + + Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap + indicates that a write should immediately fail, while a 1 indicates that + a write for a particular MSR should be handled regardless of the default + filter action. + +``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE`` + + Filter both read and write accesses to MSRs using the given bitmap. A 0 + in the bitmap indicates that both reads and writes should immediately fail, + while a 1 indicates that reads and writes for a particular MSR are not + filtered by this range. + +flags values for ``struct kvm_msr_filter``: + +``KVM_MSR_FILTER_DEFAULT_ALLOW`` + + If no filter range matches an MSR index that is getting accessed, KVM will + fall back to allowing access to the MSR. + +``KVM_MSR_FILTER_DEFAULT_DENY`` + + If no filter range matches an MSR index that is getting accessed, KVM will + fall back to rejecting access to the MSR. In this mode, all MSRs that should + be processed by KVM need to explicitly be marked as allowed in the bitmaps. + +This ioctl allows user space to define up to 16 bitmaps of MSR ranges to +specify whether a certain MSR access should be explicitly filtered for or not. + +If this ioctl has never been invoked, MSR accesses are not guarded and the +default KVM in-kernel emulation behavior is fully preserved. + +Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR +filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes +an error. + +As soon as the filtering is in place, every MSR access is processed through +the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff); +x2APIC MSRs are always allowed, independent of the ``default_allow`` setting, +and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base +register. + +If a bit is within one of the defined ranges, read and write accesses are +guarded by the bitmap's value for the MSR index if the kind of access +is included in the ``struct kvm_msr_filter_range`` flags. If no range +cover this particular access, the behavior is determined by the flags +field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW`` +and ``KVM_MSR_FILTER_DEFAULT_DENY``. + +Each bitmap range specifies a range of MSRs to potentially allow access on. +The range goes from MSR index [base .. base+nmsrs]. The flags field +indicates whether reads, writes or both reads and writes are filtered +by setting a 1 bit in the bitmap for the corresponding MSR index. + +If an MSR access is not permitted through the filtering, it generates a +#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that +allows user space to deflect and potentially handle various MSR accesses +into user space. + +If a vCPU is in running state while this ioctl is invoked, the vCPU may +experience inconsistent filtering behavior on MSR accesses. 4.98 KVM_CREATE_SPAPR_TCE_64 ---------------------------- @@ -3856,49 +3950,20 @@ base 2 of the page size in the bottom 6 bits. -EFAULT if struct kvm_reinject_control cannot be read, -EINVAL if the supplied shift or flags are invalid, -ENOMEM if unable to allocate the new HPT, - -ENOSPC if there was a hash collision - -:: - - struct kvm_ppc_rmmu_info { - struct kvm_ppc_radix_geom { - __u8 page_shift; - __u8 level_bits[4]; - __u8 pad[3]; - } geometries[8]; - __u32 ap_encodings[8]; - }; - -The geometries[] field gives up to 8 supported geometries for the -radix page table, in terms of the log base 2 of the smallest page -size, and the number of bits indexed at each level of the tree, from -the PTE level up to the PGD level in that order. Any unused entries -will have 0 in the page_shift field. - -The ap_encodings gives the supported page sizes and their AP field -encodings, encoded with the AP value in the top 3 bits and the log -base 2 of the page size in the bottom 6 bits. - -4.102 KVM_PPC_RESIZE_HPT_PREPARE --------------------------------- - -:Capability: KVM_CAP_SPAPR_RESIZE_HPT -:Architectures: powerpc -:Type: vm ioctl -:Parameters: struct kvm_ppc_resize_hpt (in) -:Returns: 0 on successful completion, - >0 if a new HPT is being prepared, the value is an estimated - number of milliseconds until preparation is complete, - -EFAULT if struct kvm_reinject_control cannot be read, - -EINVAL if the supplied shift or flags are invalid,when moving existing - HPT entries to the new HPT, - -EIO on other error conditions Used to implement the PAPR extension for runtime resizing of a guest's Hashed Page Table (HPT). Specifically this starts, stops or monitors the preparation of a new potential HPT for the guest, essentially implementing the H_RESIZE_HPT_PREPARE hypercall. +:: + + struct kvm_ppc_resize_hpt { + __u64 flags; + __u32 shift; + __u32 pad; + }; + If called with shift > 0 when there is no pending HPT for the guest, this begins preparation of a new pending HPT of size 2^(shift) bytes. It then returns a positive integer with the estimated number of @@ -3926,14 +3991,6 @@ Normally this will be called repeatedly with the same parameters until it returns <= 0. The first call will initiate preparation, subsequent ones will monitor preparation until it completes or fails. -:: - - struct kvm_ppc_resize_hpt { - __u64 flags; - __u32 shift; - __u32 pad; - }; - 4.103 KVM_PPC_RESIZE_HPT_COMMIT ------------------------------- @@ -3956,6 +4013,14 @@ Hashed Page Table (HPT). Specifically this requests that the guest be transferred to working with the new HPT, essentially implementing the H_RESIZE_HPT_COMMIT hypercall. +:: + + struct kvm_ppc_resize_hpt { + __u64 flags; + __u32 shift; + __u32 pad; + }; + This should only be called after KVM_PPC_RESIZE_HPT_PREPARE has returned 0 with the same parameters. In other cases KVM_PPC_RESIZE_HPT_COMMIT will return an error (usually -ENXIO or @@ -3971,14 +4036,6 @@ HPT and the previous HPT will be discarded. On failure, the guest will still be operating on its previous HPT. -:: - - struct kvm_ppc_resize_hpt { - __u64 flags; - __u32 shift; - __u32 pad; - }; - 4.104 KVM_X86_GET_MCE_CAP_SUPPORTED ----------------------------------- @@ -4840,8 +4897,10 @@ If an MSR access is not permitted through the filtering, it generates a allows user space to deflect and potentially handle various MSR accesses into user space. -If a vCPU is in running state while this ioctl is invoked, the vCPU may -experience inconsistent filtering behavior on MSR accesses. +Note, invoking this ioctl with a vCPU is running is inherently racy. However, +KVM does guarantee that vCPUs will see either the previous filter or the new +filter, e.g. MSRs with identical settings in both the old and new filter will +have deterministic behavior. 4.127 KVM_XEN_HVM_SET_ATTR -------------------------- @@ -4886,7 +4945,7 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO KVM_XEN_ATTR_TYPE_UPCALL_VECTOR Sets the exception vector used to deliver Xen event channel upcalls. -4.128 KVM_XEN_HVM_GET_ATTR +4.127 KVM_XEN_HVM_GET_ATTR -------------------------- :Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO @@ -4898,7 +4957,7 @@ KVM_XEN_ATTR_TYPE_UPCALL_VECTOR Allows Xen VM attributes to be read. For the structure and types, see KVM_XEN_HVM_SET_ATTR above. -4.129 KVM_XEN_VCPU_SET_ATTR +4.128 KVM_XEN_VCPU_SET_ATTR --------------------------- :Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO @@ -4915,6 +4974,14 @@ see KVM_XEN_HVM_SET_ATTR above. union { __u64 gpa; __u64 pad[4]; + struct { + __u64 state; + __u64 state_entry_time; + __u64 time_running; + __u64 time_runnable; + __u64 time_blocked; + __u64 time_offline; + } runstate; } u; }; @@ -4927,7 +4994,32 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO Sets the guest physical address of an additional pvclock structure for a given vCPU. This is typically used for guest vsyscall support. -4.130 KVM_XEN_VCPU_GET_ATTR +KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR + Sets the guest physical address of the vcpu_runstate_info for a given + vCPU. This is how a Xen guest tracks CPU state such as steal time. + +KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT + Sets the runstate (RUNSTATE_running/_runnable/_blocked/_offline) of + the given vCPU from the .u.runstate.state member of the structure. + KVM automatically accounts running and runnable time but blocked + and offline states are only entered explicitly. + +KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA + Sets all fields of the vCPU runstate data from the .u.runstate member + of the structure, including the current runstate. The state_entry_time + must equal the sum of the other four times. + +KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST + This *adds* the contents of the .u.runstate members of the structure + to the corresponding members of the given vCPU's runstate data, thus + permitting atomic adjustments to the runstate times. The adjustment + to the state_entry_time must equal the sum of the adjustments to the + other four times. The state field must be set to -1, or to a valid + runstate value (RUNSTATE_running, RUNSTATE_runnable, RUNSTATE_blocked + or RUNSTATE_offline) to set the current accounted state as of the + adjusted state_entry_time. + +4.129 KVM_XEN_VCPU_GET_ATTR --------------------------- :Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO @@ -4939,6 +5031,9 @@ KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO Allows Xen vCPU attributes to be read. For the structure and types, see KVM_XEN_VCPU_SET_ATTR above. +The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used +with the KVM_XEN_VCPU_GET_ATTR ioctl. + 5. The kvm_run structure ======================== @@ -5000,7 +5095,8 @@ local APIC is not used. __u16 flags; More architecture-specific flags detailing state of the VCPU that may -affect the device's behavior. Current defined flags: +affect the device's behavior. Current defined flags:: + /* x86, set if the VCPU is in system management mode */ #define KVM_RUN_X86_SMM (1 << 0) /* x86, set if bus lock detected in VM */ @@ -6217,7 +6313,7 @@ the bus lock vm exit can be preempted by a higher priority VM exit, the exit notifications to userspace can be KVM_EXIT_BUS_LOCK or other reasons. KVM_RUN_BUS_LOCK flag is used to distinguish between them. -7.22 KVM_CAP_PPC_DAWR1 +7.23 KVM_CAP_PPC_DAWR1 ---------------------- :Architectures: ppc @@ -6227,6 +6323,45 @@ KVM_RUN_BUS_LOCK flag is used to distinguish between them. This capability can be used to check / enable 2nd DAWR feature provided by POWER10 processor. +7.24 KVM_CAP_VM_COPY_ENC_CONTEXT_FROM +------------------------------------- + +Architectures: x86 SEV enabled +Type: vm +Parameters: args[0] is the fd of the source vm +Returns: 0 on success; ENOTTY on error + +This capability enables userspace to copy encryption context from the vm +indicated by the fd to the vm this is called on. + +This is intended to support in-guest workloads scheduled by the host. This +allows the in-guest workload to maintain its own NPTs and keeps the two vms +from accidentally clobbering each other with interrupts and the like (separate +APIC/MSRs/etc). + +7.25 KVM_CAP_SGX_ATTRIBUTE +-------------------------- + +:Architectures: x86 +:Target: VM +:Parameters: args[0] is a file handle of a SGX attribute file in securityfs +:Returns: 0 on success, -EINVAL if the file handle is invalid or if a requested + attribute is not supported by KVM. + +KVM_CAP_SGX_ATTRIBUTE enables a userspace VMM to grant a VM access to one or +more priveleged enclave attributes. args[0] must hold a file handle to a valid +SGX attribute file corresponding to an attribute that is supported/restricted +by KVM (currently only PROVISIONKEY). + +The SGX subsystem restricts access to a subset of enclave attributes to provide +additional security for an uncompromised kernel, e.g. use of the PROVISIONKEY +is restricted to deter malware from using the PROVISIONKEY to obtain a stable +system fingerprint. To prevent userspace from circumventing such restrictions +by running an enclave in a VM, KVM prevents access to privileged attributes by +default. + +See Documentation/x86/sgx/2.Kernel-internals.rst for more details. + 8. Other capabilities. ====================== @@ -6702,6 +6837,7 @@ PVHVM guests. Valid flags are:: #define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR (1 << 0) #define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL (1 << 1) #define KVM_XEN_HVM_CONFIG_SHARED_INFO (1 << 2) + #define KVM_XEN_HVM_CONFIG_RUNSTATE (1 << 2) The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG ioctl is available, for the guest to set its hypercall page. @@ -6716,3 +6852,42 @@ KVM_XEN_HVM_SET_ATTR, KVM_XEN_HVM_GET_ATTR, KVM_XEN_VCPU_SET_ATTR and KVM_XEN_VCPU_GET_ATTR ioctls, as well as the delivery of exception vectors for event channel upcalls when the evtchn_upcall_pending field of a vcpu's vcpu_info is set. + +The KVM_XEN_HVM_CONFIG_RUNSTATE flag indicates that the runstate-related +features KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR/_CURRENT/_DATA/_ADJUST are +supported by the KVM_XEN_VCPU_SET_ATTR/KVM_XEN_VCPU_GET_ATTR ioctls. + +8.31 KVM_CAP_PPC_MULTITCE +------------------------- + +:Capability: KVM_CAP_PPC_MULTITCE +:Architectures: ppc +:Type: vm + +This capability means the kernel is capable of handling hypercalls +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user +space. This significantly accelerates DMA operations for PPC KVM guests. +User space should expect that its handlers for these hypercalls +are not going to be called if user space previously registered LIOBN +in KVM (via KVM_CREATE_SPAPR_TCE or similar calls). + +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest, +user space might have to advertise it for the guest. For example, +IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is +present in the "ibm,hypertas-functions" device-tree property. + +The hypercalls mentioned above may or may not be processed successfully +in the kernel based fast path. If they can not be handled by the kernel, +they will get passed on to user space. So user space still has to have +an implementation for these despite the in kernel acceleration. + +This capability is always enabled. + +8.32 KVM_CAP_PTP_KVM +-------------------- + +:Architectures: arm64 + +This capability indicates that the KVM virtual PTP service is +supported in the host. A VMM can check whether the service is +available to the guest on migration. |