CHAPTER 9 Virtualization technologies

One of the most important technologies used for running multiple operating systems on the same physical machine is virtualization. At the time of this writing, there are multiple types of virtualization technologies available from different hardware manufacturers, and they have evolved over the years. Virtualization technologies are not only used for running multiple operating systems on a physical machine; they have also become the basis for important security features like Virtual Secure Mode (VSM) and Hypervisor-Enforced Code Integrity (HVCI), which can’t run without a hypervisor.

In this chapter, we give an overview of the Windows virtualization solution, called Hyper-V. Hyper-V is composed of the hypervisor, which is the component that manages the platform-dependent virtualization hardware, and the virtualization stack. We describe the internal architecture of Hyper-V and provide a brief description of its components (memory manager, virtual processors, intercepts, scheduler, and so on). The virtualization stack is built on top of the hypervisor and provides different services to the root and guest partitions. We describe all the components of the virtualization stack (VM Worker process, virtual machine management service, VID driver, VMBus, and so on) and the different hardware emulation that is supported.

In the last part of the chapter, we describe some technologies based on virtualization, such as VSM and HVCI. We present all the secure services that those technologies provide to the system.

The Windows hypervisor

The Hyper-V hypervisor (also known as the Windows hypervisor) is a type-1 (native or bare-metal) hypervisor: a mini operating system that runs directly on the host’s hardware to manage a single root and one or more guest operating systems. Unlike type-2 (or hosted) hypervisors, which run on top of a conventional OS as normal applications do, the Windows hypervisor abstracts the root OS, which knows about the existence of the hypervisor and communicates with it to allow the execution of one or more guest virtual machines. Because the hypervisor is part of the operating system, managing the guests inside it, as well as interacting with them, is fully integrated in the operating system through standard management mechanisms such as WMI and services. In this case, the root OS contains some enlightenments. Enlightenments are special optimizations in the kernel and possibly device drivers that detect that the code is being run virtualized under a hypervisor, so they perform certain tasks differently, or more efficiently, considering this environment.
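The presence of a hypervisor (and hence whether enlightenments should be activated) can be detected from inside a partition through the CPUID instruction: leaf 1 sets bit 31 of ECX when a hypervisor is present, and leaf 0x40000000 returns the vendor signature, which is “Microsoft Hv” for Hyper-V. The following user-mode sketch, written against the MSVC __cpuid intrinsic, shows the check; the leaf layout comes from the Hypervisor Top Level Functional Specification, and the program itself is only illustrative.

#include <intrin.h>
#include <stdio.h>
#include <string.h>

/* Sketch: detect a hypervisor and read its vendor signature via CPUID.
 * Leaf and bit layout follow the Hypervisor Top Level Functional Specification. */
int main(void)
{
    int regs[4];

    __cpuid(regs, 1);
    if (!((unsigned)regs[2] & (1u << 31))) {   /* ECX[31]: hypervisor-present bit */
        printf("Running on bare metal (no hypervisor detected)\n");
        return 0;
    }

    char vendor[13] = {0};
    __cpuid(regs, 0x40000000);                 /* hypervisor vendor leaf */
    memcpy(vendor + 0, &regs[1], 4);           /* EBX */
    memcpy(vendor + 4, &regs[2], 4);           /* ECX */
    memcpy(vendor + 8, &regs[3], 4);           /* EDX */
    printf("Hypervisor vendor signature: %s\n", vendor);  /* "Microsoft Hv" for Hyper-V */

    return 0;
}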

Figure 9-1 shows the basic architecture of the Windows virtualization stack, which is described in detail later in this chapter.

Image

Figure 9-1 The Hyper-V architectural stack (hypervisor and virtualization stack).

At the bottom of the architecture is the hypervisor, which is launched very early during the system boot and provides its services for the virtualization stack to use (through the use of the hypercall interface). The early initialization of the hypervisor is described in Chapter 12, “Startup and shutdown.” The hypervisor startup is initiated by the Windows Loader, which determines whether to start the hypervisor and the Secure Kernel; if so, the Windows Loader uses the services of Hvloader.dll to detect the correct hardware platform and to load and start the proper version of the hypervisor. Because Intel and AMD (and ARM64) processors have differing implementations of hardware-assisted virtualization, there are different hypervisors. The correct one is selected at boot-up time after the processor has been queried through CPUID instructions. On Intel systems, the Hvix64.exe binary is loaded; on AMD systems, the Hvax64.exe image is used. As of the Windows 10 May 2019 Update (19H1), the ARM64 version of Windows supports its own hypervisor, which is implemented in the Hvaa64.exe image.

At a high level, the hardware virtualization extension used by the hypervisor is a thin layer that resides between the OS kernel and the processor. This layer, which intercepts and emulates in a safe manner sensitive operations executed by the OS, runs at a higher privilege level than the OS kernel. (Intel calls this mode VMX root mode. Most books and literature define this security domain as “Ring -1.”) When an operation executed by the underlying OS is intercepted, the processor stops running the OS code and transfers execution to the hypervisor at the higher privilege level. This operation is commonly referred to as a VMEXIT event. In the same way, when the hypervisor has finished processing the intercepted operation, it needs a way to allow the physical CPU to restart the execution of the OS code. New opcodes have been defined by the hardware virtualization extension, which allow a VMENTER event to happen; the CPU restarts the execution of the OS code at its original privilege level.
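The following simulation sketches the shape of this cycle: a dispatch loop re-enters the guest, waits for the next VMEXIT, handles it, and loops. All names are hypothetical, and the canned array of exit reasons stands in for what would really be a VMLAUNCH/VMRESUME (Intel) or VMRUN (AMD) instruction.

#include <stdio.h>

/* Illustrative simulation of the VMEXIT/VMENTER cycle (all names hypothetical).
 * A real hypervisor would execute VMLAUNCH/VMRESUME or VMRUN instead of
 * pulling canned exit reasons from an array.                                */
typedef enum { EXIT_CPUID, EXIT_IO_PORT, EXIT_HYPERCALL, EXIT_SHUTDOWN } ExitReason;

static const ExitReason g_simulated_exits[] = {
    EXIT_CPUID, EXIT_IO_PORT, EXIT_HYPERCALL, EXIT_SHUTDOWN
};
static int g_next;

static ExitReason vmenter_and_run_guest(void)
{
    /* Stand-in for VMRESUME/VMRUN: the CPU would run guest code at its
     * original privilege level until a sensitive operation causes a VMEXIT. */
    return g_simulated_exits[g_next++];
}

int main(void)
{
    for (;;) {
        ExitReason reason = vmenter_and_run_guest();        /* VMENTER */

        /* Back in VMX root mode ("Ring -1"): handle the intercepted operation. */
        switch (reason) {
        case EXIT_CPUID:     puts("emulate CPUID, advance guest RIP");        break;
        case EXIT_IO_PORT:   puts("emulate port I/O or forward to the root"); break;
        case EXIT_HYPERCALL: puts("process hypercall from enlightened guest");break;
        case EXIT_SHUTDOWN:  puts("tear down the virtual processor");         return 0;
        }
        /* Loop: the next iteration re-enters the guest where it stopped. */
    }
}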

Partitions, processes, and threads

One of the key architectural components behind the Windows hypervisor is the concept of a partition. A partition essentially represents the main isolation unit, an instance of an operating system installation, which can refer either to what’s traditionally called the host or the guest. Under the Windows hypervisor model, these two terms are not used; instead, we talk of either a root partition or a child partition, respectively. A partition is composed of some physical memory and one or more virtual processors (VPs) with their local virtual APICs and timers. (More broadly, a partition also includes a virtual motherboard and multiple virtual peripherals. These are virtualization stack concepts, which do not belong to the hypervisor.)

At a minimum, a Hyper-V system has a root partition—in which the main operating system controlling the machine runs—the virtualization stack, and its associated components. Each operating system running within the virtualized environment represents a child partition, which might contain certain additional tools that optimize access to the hardware or allow management of the operating system. Partitions are organized in a hierarchical way. The root partition has control of each child and receives some notifications (intercepts) for certain kinds of events that happen in the child. The majority of the physical hardware accesses that happen in the root are passed through by the hypervisor; this means that the parent partition is able to talk directly to the hardware (with some exceptions). In contrast, child partitions are usually not able to communicate directly with the physical machine’s hardware (again with some exceptions, which are described later in this chapter in the section “The virtualization stack”). Each I/O is intercepted by the hypervisor and redirected to the root if needed.

One of the main goals behind the design of the Windows hypervisor was to have it be as small and modular as possible, much like a microkernel, rather than supporting drivers inside the hypervisor or providing a full, monolithic module. This means that most of the virtualization work is actually done by a separate virtualization stack (refer to Figure 9-1). The virtualization stack uses the existing Windows driver architecture and talks to actual Windows device drivers. This architecture results in several components that provide and manage this behavior, which are collectively called the virtualization stack. Although the hypervisor is read from the boot disk and executed by the Windows Loader before the root OS (and the parent partition) even exists, it is the parent partition that is responsible for providing the entire virtualization stack. Because these are Microsoft components, only a Windows machine can be a root partition. The Windows OS in the root partition is responsible for providing the device drivers for the hardware on the system, as well as for running the virtualization stack. It’s also the management point for all the child partitions. The main components that the root partition provides are shown in Figure 9-2.

Image

Figure 9-2 Components of the root partition.

Child partitions

A child partition is an instance of any operating system running parallel to the parent partition. (Because you can save or pause the state of any child, it might not necessarily be running.) Unlike the parent partition, which has full access to the APIC, I/O ports, and its physical memory (but not access to the hypervisor’s and Secure Kernel’s physical memory), child partitions are limited for security and management reasons to their own view of address space (the Guest Physical Address, or GPA, space, which is managed by the hypervisor) and have no direct access to hardware (even though they may have direct access to certain kinds of devices; see the “Virtualization stack” section for further details). In terms of hypervisor access, a child partition is also limited mainly to notifications and state changes. For example, a child partition doesn’t have control over other partitions (and can’t create new ones).

Child partitions have many fewer virtualization components than a parent partition because they aren’t responsible for running the virtualization stack—only for communicating with it. These components can also be considered optional because they enhance performance of the environment but aren’t critical to its use. Figure 9-3 shows the components present in a typical Windows child partition.

Image

Figure 9-3 Components of a child partition.

Processes and threads

The Windows hypervisor represents a virtual machine with a partition data structure. A partition, as described in the previous section, is composed of some memory (guest physical memory) and one or more virtual processors (VP). Internally in the hypervisor, each virtual processor is a schedulable entity, and the hypervisor, like the standard NT kernel, includes a scheduler. The scheduler dispatches the execution of virtual processors, which belong to different partitions, to each physical CPU. (We discuss the multiple types of hypervisor schedulers later in this chapter in the “Hyper-V schedulers” section.) A hypervisor thread (TH_THREAD data structure) is the glue between a virtual processor and its schedulable unit. Figure 9-4 shows the data structure, which represents the current physical execution context. It contains the thread execution stack, scheduling data, a pointer to the thread’s virtual processor, the entry point of the thread dispatch loop (discussed later) and, most important, a pointer to the hypervisor process that the thread belongs to.

Image

Figure 9-4 The hypervisor’s thread data structure.

The hypervisor builds a thread for each virtual processor it creates and associates the newborn thread with the virtual processor data structure (VM_VP).

A hypervisor process (TH_PROCESS data structure), shown in Figure 9-5, represents a partition and is a container for its physical (and virtual) address space. It includes the list of the threads (which are backed by virtual processors), scheduling data (the affinity of physical CPUs on which the process is allowed to run), and a pointer to the partition’s basic memory data structures (memory compartment, reserved pages, page directory root, and so on). A process is usually created when the hypervisor builds the partition (VM_PARTITION data structure), which will represent the new virtual machine.

Image

Figure 9-5 The hypervisor’s process data structure.
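The following C fragment is a highly simplified sketch of how the structures described above might relate to each other. The real TH_THREAD, TH_PROCESS, and VM_VP layouts are internal to the hypervisor and far larger; only the fields called out in the text are shown, and their names and types are illustrative.

#include <stdint.h>

/* Simplified relationship between the hypervisor's scheduling and partition
 * objects, as described in the text. Field names and layouts are illustrative. */
typedef struct TH_PROCESS TH_PROCESS;
typedef struct TH_THREAD  TH_THREAD;
typedef struct VM_VP      VM_VP;

struct TH_THREAD {
    void       *stack;               /* thread execution stack                   */
    uint64_t    scheduling_data;     /* time slice, priority, run state, ...     */
    VM_VP      *virtual_processor;   /* the VP this thread backs (NULL for the   */
                                     /* system process' worker threads)          */
    void      (*dispatch_loop)(TH_THREAD *);  /* entry point of the VAL loop     */
    TH_PROCESS *owning_process;      /* the hypervisor process (partition)       */
    TH_THREAD  *next_in_process;     /* link in the process' thread list         */
};

struct TH_PROCESS {
    TH_THREAD  *thread_list;         /* threads backed by the partition's VPs    */
    uint64_t    cpu_affinity;        /* physical CPUs the process may run on     */
    void       *memory_compartment;  /* partition's basic memory data structures */
    void       *partition;           /* back-pointer to the VM_PARTITION         */
};

struct VM_VP {
    uint32_t    vp_index;
    TH_THREAD  *backing_thread;      /* created by the hypervisor for this VP    */
    void       *vplc[2];             /* per-VTL state (VM_VPLC array)            */
};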

Enlightenments

Enlightenments are one of the key performance optimizations that Windows virtualization takes advantage of. They are direct modifications to the standard Windows kernel code that can detect that the operating system is running in a child partition and perform work differently. Usually, these optimizations are highly hardware-specific and result in a hypercall to notify the hypervisor.

An example is notifying the hypervisor of a long busy–wait spin loop. The hypervisor can keep some state on the spin wait and decide to schedule another VP on the same physical processor until the wait can be satisfied. Entering and exiting an interrupt state and access to the APIC can be coordinated with the hypervisor, which can be enlightened to avoid trapping the real access and then virtualizing it.

Another example has to do with memory management, specifically translation lookaside buffer (TLB) flushing. (See Part 1, Chapter 5, “Memory management,” for more information on these concepts.) Usually, the operating system executes a CPU instruction to flush one or more stale TLB entries, which affects only a single processor. In multiprocessor systems, a TLB entry must usually be flushed from every active processor’s cache (the system sends an inter-processor interrupt to every active processor to achieve this goal). However, because a child partition could be sharing physical CPUs with many other child partitions, and some of them could be executing a different VM’s virtual processor at the time the TLB flush is initiated, such an operation would also flush this information for those VMs. Furthermore, a virtual processor that is not currently running would need to be rescheduled just to execute the TLB-flushing IPI, resulting in noticeable performance degradation. If Windows is running under a hypervisor, it instead issues a hypercall to have the hypervisor flush only the specific information belonging to the child partition.
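A sketch of that decision follows: on bare metal the kernel sends a TLB-flush IPI to every active processor, whereas under Hyper-V it can issue a single flush hypercall (HvCallFlushVirtualAddressList in the Top Level Functional Specification) so that only the current partition’s entries are invalidated. The helper functions are hypothetical stand-ins for the NT kernel’s internal routines, and the printed messages merely simulate the two paths.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for internal kernel routines; real enlightened code
 * would issue the HvCallFlushVirtualAddressList hypercall instead of printf. */
static bool hypervisor_present_and_enlightened(void) { return true; }

static void send_tlb_flush_ipi_to_all_processors(const void *va, uint32_t n)
{
    printf("IPI: every active CPU flushes %u entries at %p\n", n, (void *)va);
}

static void hv_flush_virtual_address_list(const void *va, uint32_t n)
{
    printf("Hypercall: flush %u entries at %p, this partition only\n", n, (void *)va);
}

/* Flush stale TLB entries for a range of virtual addresses. */
static void flush_remote_tlb(const void *va, uint32_t count)
{
    if (hypervisor_present_and_enlightened()) {
        /* Enlightened path: one hypercall; the hypervisor flushes only the
         * entries belonging to this partition's VPs, without interrupting
         * VPs of other partitions that share the same physical CPUs.        */
        hv_flush_virtual_address_list(va, count);
    } else {
        /* Bare-metal path: an IPI to every active processor, each of which
         * must stop its current work to invalidate its own TLB entries.     */
        send_tlb_flush_ipi_to_all_processors(va, count);
    }
}

int main(void)
{
    int page;                          /* pretend this address became stale */
    flush_remote_tlb(&page, 1);
    return 0;
}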

Partition’s privileges, properties, and version features

When a partition is initially created (usually by the VID driver), no virtual processors (VPs) are associated with it. At that time, the VID driver is free to add or remove some partition’s privileges. Indeed, when the partition is first created, the hypervisor assigns some default privileges to it, depending on its type.

A partition’s privilege describes which action—usually expressed through hypercalls or synthetic MSRs (model specific registers)—the enlightened OS running inside a partition is allowed to perform on behalf of the partition itself. For example, the Access Root Scheduler privilege allows a child partition to notify the root partition that an event has been signaled and a guest’s VP can be rescheduled (this usually increases the priority of the guest’s VP-backed thread). The Access VSM privilege instead allows the partition to enable VTL 1 and access its properties and configuration (usually exposed through synthetic registers). Table 9-1 lists all the privileges assigned by default by the hypervisor.

Table 9-1 Partition’s privileges

Root and child partition

  •     Read/write a VP’s runtime counter

  •     Read the current partition reference time

  •     Access SynIC timers and registers

  •     Query/set the VP’s virtual APIC assist page

  •     Read/write hypercall MSRs

  •     Request VP IDLE entry

  •     Read a VP’s index

  •     Map or unmap the hypercall’s code area

  •     Read a VP’s emulated TSC (time-stamp counter) and its frequency

  •     Control the partition TSC and re-enlightenment emulation

  •     Read/write VSM synthetic registers

  •     Read/write a VP’s per-VTL registers

  •     Start an AP virtual processor

  •     Enable the partition’s fast hypercall support

Root partition only

  •     Create child partitions

  •     Look up and reference a partition by ID

  •     Deposit/withdraw memory from the partition compartment

  •     Post messages to a connection port

  •     Signal an event in a connection port’s partition

  •     Create/delete and get properties of a partition’s connection port

  •     Connect/disconnect to a partition’s connection port

  •     Map/unmap the hypervisor statistics page (which describes a VP, LP, partition, or hypervisor)

  •     Enable the hypervisor debugger for the partition

  •     Schedule child partitions’ VPs and access SynIC synthetic MSRs

  •     Trigger an enlightened system reset

  •     Read the hypervisor debugger options for a partition

Child partition only

  •     Generate an extended hypercall intercept in the root partition

  •     Notify a root scheduler’s VP-backed thread of an event being signaled

EXO partition

  •     None
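Conceptually, the privileges listed in Table 9-1 behave like a set of flags attached to the partition and checked on the hypercall path. The sketch below illustrates that pattern, together with the rule (described in the next paragraph) that privileges can no longer be changed once a VP has started; the flag names and status values are illustrative, not the hypervisor’s real encoding.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Illustrative privilege bits (the real encoding lives in the hypervisor/TLFS). */
#define PRIV_CREATE_PARTITION      (1ULL << 0)   /* root only                   */
#define PRIV_ACCESS_VSM            (1ULL << 1)   /* enable VTL 1                */
#define PRIV_ACCESS_ROOT_SCHEDULER (1ULL << 2)   /* child: signal root events   */

typedef struct {
    uint64_t privileges;      /* assigned at creation, frozen once a VP starts */
    bool     vp_started;
} Partition;

/* Pattern used on the hypercall path: refuse the request if the calling
 * partition lacks the corresponding privilege.                               */
static long hv_create_partition(Partition *caller)
{
    if (!(caller->privileges & PRIV_CREATE_PARTITION))
        return -1;                 /* access denied, conceptually */
    /* ... actual creation work would go here ... */
    return 0;
}

/* Privileges may only change before the first VP runs. */
static long hv_set_partition_privileges(Partition *p, uint64_t new_privs)
{
    if (p->vp_started)
        return -1;                 /* too late: a VP already executed */
    p->privileges = new_privs;
    return 0;
}

int main(void)
{
    Partition child = { PRIV_ACCESS_ROOT_SCHEDULER, false };
    printf("create from child: %ld\n", hv_create_partition(&child));    /* denied */
    printf("set privileges:    %ld\n",
           hv_set_partition_privileges(&child, PRIV_ACCESS_VSM));
    child.vp_started = true;
    printf("set after VP run:  %ld\n",
           hv_set_partition_privileges(&child, PRIV_CREATE_PARTITION)); /* denied */
    return 0;
}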

Partition privileges can only be set before the partition creates and starts any VPs; the hypervisor won’t allow requests to set privileges after a single VP in the partition starts to execute. Partition properties are similar to privileges but do not have this limitation; they can be set and queried at any time. There are different groups of properties that can be queried or set for a partition. Table 9-2 lists the properties groups.

Table 9-2 Partition’s properties

  •     Scheduling properties Set/query properties related to the classic and core scheduler, like Cap, Weight, and Reserve

  •     Time properties Allow the partition to be suspended/resumed

  •     Debugging properties Change the hypervisor debugger runtime configuration

  •     Resource properties Query virtual hardware platform-specific properties of the partition (like TLB size, SGX support, and so on)

  •     Compatibility properties Query virtual hardware platform-specific properties that are tied to the initial compatibility features

When a partition is created, the VID infrastructure provides a compatibility level (which is specified in the virtual machine’s configuration file) to the hypervisor. Based on that compatibility level, the hypervisor enables or disables specific virtual hardware features that could be exposed by a VP to the underlying OS. There are multiple features that tune how the VP behaves based on the VM’s compatibility level. A good example is the hardware Page Attribute Table (PAT), which is a configurable caching type for virtual memory. Prior to the Windows 10 Anniversary Update (RS1), guest VMs weren’t able to use PAT, so if the compatibility level of a VM specifies an OS version earlier than Windows 10 RS1, the hypervisor does not expose the PAT registers to the guest OS. Otherwise, if the compatibility level is Windows 10 RS1 or higher, the hypervisor exposes PAT support to the OS running in the guest VM. When the root partition is initially created at boot time, the hypervisor enables the highest compatibility level for it. In that way, the root OS can use all the features supported by the physical hardware.
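A minimal sketch of this gating logic, with hypothetical names: the hypervisor compares the VM’s configured compatibility level against the level in which a feature was introduced (here, PAT support in guests, introduced with Windows 10 RS1) and exposes the feature only if the VM is at least that recent.

#include <stdio.h>
#include <stdbool.h>

/* Illustrative compatibility levels, ordered by release. */
typedef enum {
    COMPAT_WIN10_TH1 = 0,    /* Windows 10 1507                      */
    COMPAT_WIN10_RS1 = 1,    /* Windows 10 Anniversary Update (1607) */
    COMPAT_WIN10_RS5 = 2,
} CompatLevel;

/* Feature gate: PAT registers are exposed only to VMs whose compatibility
 * level is Windows 10 RS1 or later, as described in the text.              */
static bool vp_exposes_pat(CompatLevel vm_level)
{
    return vm_level >= COMPAT_WIN10_RS1;
}

int main(void)
{
    printf("TH1 VM sees PAT: %d\n", vp_exposes_pat(COMPAT_WIN10_TH1));  /* 0 */
    printf("RS5 VM sees PAT: %d\n", vp_exposes_pat(COMPAT_WIN10_RS5));  /* 1 */
    return 0;
}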

The hypervisor startup

In Chapter 12, we analyze how a UEFI-based workstation boots up and describe all the components engaged in loading and starting the correct version of the hypervisor binary. In this section, we briefly discuss what happens in the machine after the HvLoader module has transferred execution to the hypervisor, which takes control for the first time.

The HvLoader loads the correct version of the hypervisor binary image (depending on the CPU manufacturer) and creates the hypervisor loader block. It captures a minimal processor context, which the hypervisor needs to start the first virtual processor. The HvLoader then switches to a new, just-created, address space and transfers the execution to the hypervisor image by calling the hypervisor image entry point, KiSystemStartup, which prepares the processor for running the hypervisor and initializes the CPU_PLS data structure. The CPU_PLS represents a physical processor and is the hypervisor’s analogue of the NT kernel’s PRCB data structure; the hypervisor is able to quickly address it (using the GS segment). Unlike in the NT kernel, KiSystemStartup is called only for the boot processor (the application processors’ startup sequence is covered in the “Application Processors (APs) Startup” section later in this chapter), so it defers the real initialization to another function, BmpInitBootProcessor.

BmpInitBootProcessor starts a complex initialization sequence. The function examines the system and queries all the CPU’s supported virtualization features (such as the EPT and VPID; the queried features are platform-specific and vary between the Intel, AMD, or ARM version of the hypervisor). It then determines the hypervisor scheduler, which manages how the hypervisor schedules virtual processors. For Intel and AMD server systems, the default scheduler is the core scheduler, whereas the root scheduler is the default for all client systems (including ARM64). The scheduler type can be manually overridden through the hypervisorschedulertype BCD option (more information about the different hypervisor schedulers is available later in this chapter).

The nested enlightenments are then initialized. Nested enlightenments allow the hypervisor to be executed in nested configurations, where a root hypervisor (called the L0 hypervisor) manages the real hardware and another hypervisor (called the L1 hypervisor) is executed in a virtual machine. After this stage, the BmpInitBootProcessor routine performs the initialization of the following components:

  •     Memory manager (initializes the PFN database and the root compartment).

  •     The hypervisor’s hardware abstraction layer (HAL).

  •     The hypervisor’s process and thread subsystem (which depends on the chosen scheduler type). The system process and its initial thread are created. This process is special; it isn’t tied to any partition and hosts threads that execute the hypervisor code.

  •     The VMX virtualization abstraction layer (VAL). The VAL’s purpose is to abstract differences between all the supported hardware virtualization extensions (Intel, AMD, and ARM64). It includes code that operates on platform-specific features of the machine’s virtualization technology in use by the hypervisor (for example, on the Intel platform the VAL layer manages the “unrestricted guest” support, the EPT, SGX, MBEC, and so on).

  •     The Synthetic Interrupt Controller (SynIC) and I/O Memory Management Unit (IOMMU).

  •     The Address Manager (AM), which is the component responsible for managing the physical memory assigned to a partition (called guest physical memory, or GPA) and its translation to real physical memory (called system physical memory). Although the first implementation of Hyper-V supported shadow page tables (a software technique for address translation), since Windows 8.1, the Address Manager uses platform-dependent code for configuring the hypervisor address translation mechanism offered by the hardware (extended page tables for Intel, nested page tables for AMD). In hypervisor terms, the physical address space of a partition is called an address domain. The platform-independent physical address space translation is commonly called Second Level Address Translation (SLAT). The term refers to Intel’s EPT, AMD’s NPT, or the ARM two-stage address translation mechanism.

The hypervisor can now finish constructing the CPU_PLS data structure associated with the boot processor by allocating the initial hardware-dependent virtual machine control structures (VMCS for Intel, VMCB for AMD) and by enabling virtualization through the first VMXON operation. Finally, the per-processor interrupt mapping data structures are initialized.

The creation of the root partition and the boot virtual processor

The first steps that a fully initialized hypervisor needs to execute are the creation of the root partition and the first virtual processor used for starting the system (called BSP VP). Creating the root partition follows almost the same rules as for child partitions; multiple layers of the partition are initialized one after the other. In particular:

  1. The VM layer initializes the maximum allowed number of VTLs and sets up the partition privileges based on the partition’s type (see the previous section for more details). Furthermore, the VM layer determines the partition’s allowable features based on the partition’s specified compatibility level. The root partition supports the maximum allowable features.

  2. The VP layer initializes the virtualized CPUID data, which all the virtual processors of the partition use when a CPUID is requested from the guest operating system. The VP layer creates the hypervisor process, which backs the partition.

  3. The Address Manager (AM) constructs the partition’s initial physical address space by using machine platform-dependent code (which builds the EPT for Intel, NPT for AMD). The constructed physical address space depends on the partition type. The root partition uses identity mapping, which means that all the guest physical memory corresponds to the system physical memory (more information is provided later in this chapter in the “Partitions’ physical address space” section).

Finally, after the SynIC, IOMMU, and the intercepts’ shared pages are correctly configured for the partition, the hypervisor creates and starts the BSP virtual processor for the root partition, which is the only one used to restart the boot process.

A hypervisor virtual processor (VP) is represented by a big data structure (VM_VP), shown in Figure 9-6. A VM_VP data structure maintains all the data used to track the state of the virtual processor: its platform-dependent register state (such as the general-purpose and debug registers, the XSAVE area, and the stack) and data, the VP’s private address space, and an array of VM_VPLC data structures, which are used to track the state of each Virtual Trust Level (VTL) of the virtual processor. The VM_VP also includes a pointer to the VP’s backing thread and a pointer to the physical processor that is currently executing the VP.

Image

Figure 9-6 The VM_VP data structure representing a virtual processor.

As for partitions, creating the BSP virtual processor is similar to the process of creating normal virtual processors. VmAllocateVp is the function responsible for allocating and initializing the needed memory from the partition’s compartment, used for storing the VM_VP data structure, its platform-dependent part, and the VM_VPLC array (one for each supported VTL). The hypervisor copies the initial processor context, specified by the HvLoader at boot time, into the VM_VP structure and then creates the VP’s private address space and attaches to it (only if address space isolation is enabled). Finally, it creates the VP’s backing thread. This is an important step: the construction of the virtual processor continues in the context of its own backing thread. The hypervisor’s main system thread at this stage waits until the new BSP VP is completely initialized. The wait brings the hypervisor scheduler to select the newly created thread, which executes a routine, ObConstructVp, that constructs the VP in the context of the new backing thread.

ObConstructVp, in a similar way as for partitions, constructs and initializes each layer of the virtual processor—in particular, the following:

  1. The Virtualization Manager (VM) layer attaches the physical processor data structure (CPU_PLS) to the VP and sets VTL 0 as active.

  2. The VAL layer initializes the platform-dependent portions of the VP, like its registers, XSAVE area, stack, and debug data. Furthermore, for each supported VTL, it allocates and initializes the VMCS data structure (VMCB for AMD systems), which is used by the hardware for keeping track of the state of the virtual machine, and the VTL’s SLAT page tables. The latter allow each VTL to be isolated from the others (more details about VTLs are provided later in the “Virtual Trust Levels (VTLs) and Virtual Secure Mode (VSM)” section). Finally, the VAL layer enables and sets VTL 0 as active. The platform-specific VMCS (or VMCB for AMD systems) is entirely compiled, the SLAT table of VTL 0 is set as active, and the real-mode emulator is initialized. The Host-state part of the VMCS is set to target the hypervisor VAL dispatch loop. This routine is the most important part of the hypervisor because it manages all the VMEXIT events generated by each guest.

  3. The VP layer allocates the VP’s hypercall page, and, for each VTL, the assist and intercept message pages. These pages are used by the hypervisor for sharing code or data with the guest operating system.

When ObConstructVp finishes its work, the VP’s dispatch thread activates the virtual processor and its synthetic interrupt controller (SynIC). If the VP is the first one of the root partition, the dispatch thread restores the initial VP’s context stored in the VM_VP data structure by writing each captured register in the platform-dependent VMCS (or VMCB) processor area (the context has been specified by the HvLoader earlier in the boot process). The dispatch thread finally signals the completion of the VP initialization (as a result, the main system thread enters the idle loop) and enters the platform-dependent VAL dispatch loop. The VAL dispatch loop detects that the VP is new, prepares it for the first execution, and starts the new virtual machine by executing a VMLAUNCH instruction. The new VM restarts exactly at the point at which the HvLoader has transferred the execution to the hypervisor. The boot process continues normally but in the context of the new hypervisor partition.

The hypervisor memory manager

The hypervisor memory manager is relatively simple compared to the memory manager for NT or the Secure Kernel. The entity that manages a set of physical memory pages is the hypervisor’s memory compartment. Before the hypervisor startup takes place, the hypervisor loader (Hvloader.dll) allocates the hypervisor loader block and pre-calculates the maximum number of physical pages that will be used by the hypervisor for correctly starting up and creating the root partition. The number depends on the pages used to initialize the IOMMU to store the memory range structures, the system PFN database, SLAT page tables, and HAL VA space. The hypervisor loader preallocates the calculated number of physical pages, marks them as reserved, and attaches the page list array to the loader block. Later, when the hypervisor starts, it creates the root compartment by using the page list that was allocated by the hypervisor loader.

Figure 9-7 shows the layout of the memory compartment data structure. The data structure keeps track of the total number of physical pages “deposited” in the compartment, which can be allocated somewhere or freed. A compartment stores its physical pages in different lists ordered by the NUMA node. Only the head of each list is stored in the compartment. The state of each physical page and its link in the NUMA list are maintained through the entries in the PFN database. A compartment also tracks its relationship with the root. A new compartment can be created using the physical pages that belong to the parent (the root). Similarly, when the compartment is deleted, all its remaining physical pages are returned to the parent.

Image

Figure 9-7 The hypervisor’s memory compartment. Virtual address space for the global zone is reserved from the end of the compartment data structure

When the hypervisor needs some physical memory for any kind of work, it allocates from the active compartment (depending on the partition). This means that the allocation can fail. Two possible scenarios can arise in case of failure:

  •     If the allocation has been requested for a service internal to the hypervisor (usually on behalf of the root partition), the failure should not happen, and if it does, the system crashes. (This explains why the initial calculation of the total number of pages to be assigned to the root compartment needs to be accurate.)

  •     If the allocation has been requested on behalf of a child partition (usually through a hypercall), the hypervisor fails the request with the status INSUFFICIENT_MEMORY. The root partition detects the error and allocates some physical pages (more details are discussed later in the “Virtualization stack” section), which are deposited in the child compartment through the HvDepositMemory hypercall. The original operation can then be reinitiated and usually succeeds; this deposit-and-retry pattern is sketched below.
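In code, the pattern looks roughly like this; the global counter and the helper functions are hypothetical stand-ins for the child’s compartment, the failing hypervisor operation, and the VID driver’s deposit through HvDepositMemory.

#include <stdio.h>

#define HV_STATUS_SUCCESS              0
#define HV_STATUS_INSUFFICIENT_MEMORY  11   /* illustrative value */

/* --- hypothetical stand-ins for the real components ---------------------- */
static int g_child_compartment_pages = 0;

static long hv_child_operation_needing_memory(void)
{
    if (g_child_compartment_pages < 4)            /* not enough deposited pages */
        return HV_STATUS_INSUFFICIENT_MEMORY;
    g_child_compartment_pages -= 4;
    return HV_STATUS_SUCCESS;
}

static void vid_allocate_root_pages_and_deposit(int pages)
{
    /* The VID driver allocates physical pages in the root partition and
     * deposits them into the child's compartment via HvDepositMemory.       */
    g_child_compartment_pages += pages;
    printf("VID: deposited %d pages into the child compartment\n", pages);
}

/* --- the deposit-and-retry pattern ---------------------------------------- */
int main(void)
{
    long status;
    while ((status = hv_child_operation_needing_memory())
                  == HV_STATUS_INSUFFICIENT_MEMORY) {
        vid_allocate_root_pages_and_deposit(8);   /* top up, then retry */
    }
    printf("operation completed with status %ld\n", status);
    return 0;
}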

The physical pages allocated from the compartment are usually mapped in the hypervisor using a virtual address. When a compartment is created, a virtual address range (sized 4 or 8 GB, depending on whether the compartment is a root or a child) is allocated with the goal of mapping the new compartment, its PDE bitmap, and its global zone.

A hypervisor’s zone encapsulates a private VA range, which is not shared with the entire hypervisor address space (see the “Isolated address space” section later in this chapter). The hypervisor executes with a single root page table (unlike the NT kernel, which uses KVA shadowing). Two entries in the root page table page are reserved with the goal of dynamically switching between each zone and the virtual processors’ address spaces.

Partitions’ physical address space

As discussed in the previous section, when a partition is initially created, the hypervisor allocates a physical address space for it. A physical address space contains all the data structures needed by the hardware to translate the partition’s guest physical addresses (GPAs) to system physical addresses (SPAs). The hardware feature that enables the translation is generally referred to as second level address translation (SLAT). The term SLAT is platform-agnostic: hardware vendors use different names: Intel calls it EPT for extended page tables; AMD uses the term NPT for nested page tables; and ARM simply calls it Stage 2 Address Translation.

The SLAT is usually implemented in a way that’s similar to the implementation of the x64 page tables, which use four levels of translation (the x64 virtual address translation has already been discussed in detail in Chapter 5 of Part 1). The OS running inside the partition uses the same virtual address translation as if it were running on bare-metal hardware. In the virtualized case, however, the physical processor actually performs two levels of translation: one from virtual addresses to guest physical addresses, and one from guest physical addresses to system physical addresses. Figure 9-8 shows the SLAT set up for a guest partition. In a guest partition, a GPA is usually translated to a different SPA. This is not true for the root partition.

Image

Figure 9-8 Address translation for a guest partition.
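To make the second translation stage concrete, the following sketch walks a four-level SLAT table in software to turn a GPA into an SPA. It assumes 4-KB pages and the standard x64 nine-bits-per-level layout; the structures are illustrative models, not the hardware’s EPT/NPT entry format.

#include <stdint.h>

/* Software model of a 4-level SLAT walk (EPT/NPT-style), 4-KB pages only.
 * Real hardware performs this walk itself for every guest memory access,
 * on top of the guest's own virtual-to-GPA page tables.                     */
#define ENTRIES_PER_TABLE 512
#define PAGE_SHIFT        12

typedef struct SlatTable {
    /* Each entry either points to the next-level table or, at the last
     * level, holds the system physical frame number plus protection bits.   */
    struct SlatTable *next[ENTRIES_PER_TABLE];
    uint64_t          spa_pfn[ENTRIES_PER_TABLE];
    uint8_t           present[ENTRIES_PER_TABLE];
} SlatTable;

/* Translate a guest physical address to a system physical address.
 * Returns 0 when the GPA is not mapped (which, on real hardware, would
 * raise an EPT violation / nested page fault VMEXIT).                       */
uint64_t translate_gpa_to_spa(const SlatTable *pml4, uint64_t gpa)
{
    const SlatTable *table = pml4;
    for (int level = 3; level >= 1; level--) {         /* PML4 -> PDPT -> PD */
        unsigned idx = (unsigned)((gpa >> (PAGE_SHIFT + 9 * level)) & 0x1FF);
        if (!table->present[idx])
            return 0;
        table = table->next[idx];
    }
    unsigned pt_idx = (unsigned)((gpa >> PAGE_SHIFT) & 0x1FF);  /* final level */
    if (!table->present[pt_idx])
        return 0;
    return (table->spa_pfn[pt_idx] << PAGE_SHIFT) | (gpa & 0xFFF);
}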

When the hypervisor creates the root partition, it builds its initial physical address space by using identity mapping. In this model, each GPA corresponds to the same SPA (for example, guest frame 0x1000 in the root partition is mapped to the bare-metal physical frame 0x1000). The hypervisor preallocates the memory needed for mapping the entire physical address space of the machine (which has been discovered by the Windows Loader using UEFI services; see Chapter 12 for details) into all the allowed root partition’s virtual trust levels (VTLs). (The root partition usually supports two VTLs.) The SLAT page tables of each VTL belonging to the partition include the same GPA and SPA entries but usually with a different protection level set. The protection level applied to each partition’s physical frame allows the creation of different security domains (VTLs), which can be isolated from one another. VTLs are explained in detail in the section “The Secure Kernel” later in this chapter. The hypervisor pages are marked as hardware-reserved and are not mapped in the partition’s SLAT table (actually they are mapped using an invalid entry pointing to a dummy PFN).

Note

For performance reasons, the hypervisor, while building the physical memory mapping, is able to detect large chunks of contiguous physical memory, and, in a similar way as for virtual memory, is able to map those chunks by using large pages. If for some reason the OS running in the partition decides to apply a more granular protection to the physical page, the hypervisor uses the reserved memory to break the large page into smaller pages in the SLAT table.

Earlier versions of the hypervisor also supported another technique for mapping a partition’s physical address space: shadow paging. Shadow paging was used for those machines without the SLAT support. This technique had a very high-performance overhead; as a result, it’s not supported anymore. (The machine must support SLAT; otherwise, the hypervisor would refuse to start.)

The SLAT table of the root is built at partition-creation time, but for a guest partition, the situation is slightly different. When a child partition is created, the hypervisor creates its initial physical address space but allocates only the root page table (PML4) for each partition’s VTL. Before starting the new VM, the VID driver (part of the virtualization stack) reserves the physical pages needed for the VM (the exact number depends on the VM memory size) by allocating them from the root partition. (Remember, we are talking about physical memory; only a driver can allocate physical pages.) The VID driver maintains a list of physical pages, which is analyzed and split into large pages and then is sent to the hypervisor through the HvMapGpaPages Rep hypercall.

Before sending the map request, the VID driver calls into the hypervisor for creating the needed SLAT page tables and internal physical memory space data structures. Each SLAT page table hierarchy is allocated for each available VTL in the partition (this operation is called pre-commit). The operation can fail, for example when the new partition’s compartment does not contain enough physical pages. In this case, as discussed in the previous section, the VID driver allocates more memory from the root partition and deposits it in the child’s partition compartment. At this stage, the VID driver can freely map all the child’s partition physical pages. The hypervisor builds and compiles all the needed SLAT page tables, assigning different protection based on the VTL level. (Large pages require one less indirection level.) This step concludes the child partition’s physical address space creation.

Address space isolation

Speculative execution vulnerabilities discovered in modern CPUs (also known as Meltdown, Spectre, and Foreshadow) allowed an attacker to read secret data located in a more privileged execution context by speculatively reading the stale data located in the CPU cache. This means that software executed in a guest VM could potentially be able to speculatively read private memory that belongs to the hypervisor or to the more privileged root partition. The internal details of the Spectre, Meltdown, and all the side-channel vulnerabilities and how they are mitigated by Windows have been covered in detail in Chapter 8.

The hypervisor has been able to mitigate most of these kinds of attacks by implementing the HyperClear mitigation. The HyperClear mitigation relies on three key components to ensure strong Inter-VM isolation: core scheduler, Virtual-Processor Address Space Isolation, and sensitive data scrubbing. In modern multicore CPUs, often different SMT threads share the same CPU cache. (Details about the core scheduler and symmetric multithreading are provided in the “Hyper-V schedulers” section.) In the virtualization environment, SMT threads on a core can independently enter and exit the hypervisor context based on their activity. For example, events like interrupts can cause an SMT thread to switch out of running the guest virtual processor context and begin executing the hypervisor context. This can happen independently for each SMT thread, so one SMT thread may be executing in the hypervisor context while its sibling SMT thread is still running a VM’s guest virtual processor context. An attacker running code in a less trusted guest VM’s virtual processor context on one SMT thread can then use a side channel vulnerability to potentially observe sensitive data from the hypervisor context running on the sibling SMT thread.

The hypervisor provides strong data isolation to protect against a malicious guest VM by maintaining separate virtual address ranges for each guest SMT thread (each of which backs a virtual processor). When the hypervisor context is entered on a specific SMT thread, no secret data is addressable. The only data that can be brought into the CPU cache is associated with that current guest virtual processor or represents shared hypervisor data. As shown in Figure 9-9, when a VP running on an SMT thread enters the hypervisor, the core scheduler enforces that the sibling LP is running another VP that belongs to the same VM. Furthermore, no shared secrets are mapped in the hypervisor. In case the hypervisor needs to access secret data, it assures that no other VP is scheduled on the other sibling SMT thread.

Image

Figure 9-9 The Hyperclear mitigation.

Unlike the NT kernel, the hypervisor always runs with a single page table root, which creates a single global virtual address space. The hypervisor defines the concept of a private address space, which has a misleading name. Indeed, the hypervisor reserves two global root page table entries (PML4 entries, which generate a 1-TB virtual address range) for mapping or unmapping a private address space. When the hypervisor initially constructs the VP, it allocates two private page table root entries. Those are used to map the VP’s secret data, like its stack and data structures that contain private data. Switching the address space means writing the two entries in the global page table root (which explains why the name is misleading: it is actually a private address range). The hypervisor switches private address spaces in only two cases: when a new virtual processor is created and during thread switches. (Remember, VPs are backed by threads. The core scheduler assures that no sibling SMT threads execute VPs from different partitions.) During runtime, a hypervisor thread has mapped only its own VP’s private data; no other secret data is accessible by that thread.

Mapping secret data in the private address space is achieved by using the memory zone, represented by an MM_ZONE data structure. A memory zone encapsulates a private VA subrange of the private address space, where the hypervisor usually stores per-VP’s secrets.

The memory zone works similarly to the private address space. Instead of mapping root page table entries in the global page table root, a memory zone maps private page directories in the two root entries used by the private address space. A memory zone maintains an array of page directories, which will be mapped and unmapped into the private address space, and a bitmap that keeps track of the used page tables. Figure 9-10 shows the relationship between a private address space and a memory zone. Memory zones can be mapped and unmapped on demand (in the private address space) but are usually switched only at VP creation time. Indeed, the hypervisor does not need to switch them during thread switches; the private address space encapsulates the VA range exposed by the memory zone.

Image

Figure 9-10 The hypervisor’s private address spaces and private memory zones.

In Figure 9-10, the page table’s structures related to the private address space are filled with a pattern, the ones related to the memory zone are shown in gray, and the shared ones belonging to the hypervisor are drawn with a dashed line. Switching private address spaces is a relatively cheap operation that requires the modification of two PML4 entries in the hypervisor’s page table root. Attaching or detaching a memory zone from the private address space requires only the modification of the zone’s PDPTEs (a zone’s VA size is variable; the PDPTEs are always allocated contiguously).
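The cost of that switch can be seen in the following sketch, which uses hypothetical, simplified structures: attaching a VP’s private address range amounts to writing the two reserved PML4 entries in the single global page table root (followed by the appropriate TLB invalidations), while memory zones are attached one level down, through the PDPTEs those entries point to.

#include <stdint.h>

#define ENTRIES_PER_PML4   512
#define PRIVATE_SLOT_FIRST 510      /* illustrative: the two reserved entries */

/* Hypothetical, simplified structures: the hypervisor runs with one global
 * page table root and reserves two PML4 entries (a 1-TB range each) for the
 * currently attached private address space.                                 */
typedef struct {
    uint64_t pml4[ENTRIES_PER_PML4];
} HvPageTableRoot;

typedef struct {
    uint64_t private_pml4e[2];      /* point at this VP's private page tables */
} VpPrivateAddressSpace;

/* Switching private address spaces is cheap: write two PML4 entries in the
 * global root, then invalidate the TLB entries for that range. It happens
 * only at VP creation and on thread switches.                               */
void hv_attach_private_address_space(HvPageTableRoot *root,
                                     const VpPrivateAddressSpace *vas)
{
    root->pml4[PRIVATE_SLOT_FIRST]     = vas->private_pml4e[0];
    root->pml4[PRIVATE_SLOT_FIRST + 1] = vas->private_pml4e[1];
    /* A real implementation would follow with INVLPG/flushes for the range.
     * Memory zones are attached one level down, by writing the PDPTEs that
     * these two entries point to; that normally happens once, at VP creation. */
}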

Dynamic memory

Virtual machines can use a different percentage of their allocated physical memory. For example, some virtual machines use only a small amount of their assigned guest physical memory, keeping a lot of it freed or zeroed. The performance of other virtual machines can instead suffer in high memory-pressure scenarios, where the page file is used too often because the allocated guest physical memory is not enough. To prevent these scenarios, the hypervisor and the virtualization stack support the concept of dynamic memory. Dynamic memory is the ability to dynamically assign physical memory to a virtual machine and remove it. The feature is provided by multiple components:

  •     The NT kernel’s memory manager, which supports hot add and hot removal of physical memory (on bare-metal systems, too)

  •     The hypervisor, through the SLAT (managed by the address manager)

  •     The VM Worker process, which uses the dynamic memory controller module, Vmdynmem.dll, to establish a connection to the VMBus Dynamic Memory VSC driver (Dmvsc.sys), which runs in the child partition

To properly describe dynamic memory, we should quickly introduce how the page frame number (PFN) database is created by the NT kernel. The PFN database is used by Windows to keep track of physical memory. It was discussed in detail in Chapter 5 of Part 1. For creating the PFN database, the NT kernel first calculates the hypothetical size needed to map the highest possible physical address (256 TB on standard 64-bit systems) and then marks the VA space needed to map it entirely as reserved (storing the base address in the MmPfnDatabase global variable). Note that the reserved VA space still has no page tables allocated. The NT kernel cycles through each physical memory descriptor discovered by the boot manager (using UEFI services), coalesces them into the longest possible ranges and, for each range, maps the underlying PFN database entries using large pages. This has an important implication; as shown in Figure 9-11, the PFN database has space for the highest possible amount of physical memory but only a small subset of it is mapped to real physical pages (this technique is called sparse memory).

Image

Figure 9-11 An example of a PFN database where some physical memory has been removed.
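The sparse layout can be expressed as a simple address computation: the virtual address of a PFN entry is always derivable from the physical address it describes, but the page tables backing that virtual address are committed only for ranges that actually contain RAM. The sizes, the base address, and the helper below are illustrative.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define PAGE_SIZE        4096ULL
#define PFN_ENTRY_SIZE   48ULL            /* illustrative size of one PFN entry */
#define MAX_PHYSICAL     (256ULL << 40)   /* 256 TB, highest mappable RAM       */

/* Illustrative stand-in for MmPfnDatabase: the base of the reserved VA range. */
static const uint64_t MmPfnDatabase = 0xFFFFFA8000000000ULL;

/* The VA of the PFN entry for a physical page is a pure computation...        */
static uint64_t pfn_entry_va(uint64_t physical_address)
{
    uint64_t pfn = physical_address / PAGE_SIZE;
    return MmPfnDatabase + pfn * PFN_ENTRY_SIZE;
}

/* ...but touching it only works if the range was mapped, which the kernel
 * does (with large pages) only for physical ranges that actually exist.       */
static bool range_is_mapped(uint64_t physical_address)
{
    /* Illustrative: pretend only the first 8 GB of RAM is present. */
    return physical_address < (8ULL << 30);
}

int main(void)
{
    uint64_t reserved_va = (MAX_PHYSICAL / PAGE_SIZE) * PFN_ENTRY_SIZE;
    printf("VA reserved for the PFN database: %llu GB\n",
           (unsigned long long)(reserved_va >> 30));
    printf("PFN entry for 1 MB: %#llx (mapped: %d)\n",
           (unsigned long long)pfn_entry_va(1ULL << 20), range_is_mapped(1ULL << 20));
    printf("PFN entry for 1 TB: %#llx (mapped: %d)\n",
           (unsigned long long)pfn_entry_va(1ULL << 40), range_is_mapped(1ULL << 40));
    return 0;
}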

Hot add and removal of physical memory works thanks to this principle. When new physical memory is added to the system, the Plug and Play memory driver (Pnpmem.sys) detects it and calls the MmAddPhysicalMemory routine, which is exported by the NT kernel. The latter starts a complex procedure that calculates the exact number of pages in the new range and the NUMA node to which they belong, and then it maps the new PFN entries in the database by creating the necessary page tables in the reserved VA space. The new physical pages are added to the free list (see Chapter 5 in Part 1 for more details).

When some physical memory is hot removed, the system performs an inverse procedure. It checks that the pages belong to the correct physical page list, updates the internal memory counters (like the total number of physical pages), and finally frees the corresponding PFN entries, meaning that they all will be marked as “bad.” The memory manager will never use the physical pages described by them anymore. No actual virtual space is unmapped from the PFN database. The physical memory that was described by the freed PFNs can always be re-added in the future.

When an enlightened VM starts, the dynamic memory driver (Dmvsc.sys) detects whether the child VM supports the hot add feature; if so, it creates a worker thread that negotiates the protocol and connects to the VMBus channel of the VSP. (See the “Virtualization stack” section later in this chapter for details about VSC and VSP.) The VMBus connection channel connects the dynamic memory driver running in the child partition to the dynamic memory controller module (Vmdynmem.dll), which is mapped in the VM Worker process in the root partition. A message exchange protocol is started. Every second, the child partition acquires a memory pressure report by querying different performance counters exposed by the memory manager (global page-file usage; number of available, committed, and dirty pages; number of page faults per second; number of pages in the free and zeroed page lists). The report is then sent to the root partition.
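The following fragment sketches the kind of report such an exchange might carry, mirroring the counters listed above; the structure layout and the derived metric are hypothetical, because the real wire format is private to the Dynamic Memory VSP/VSC protocol.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical shape of the one-second memory pressure report sent by
 * Dmvsc.sys in the child to the worker process in the root. The real wire
 * format is private to the Dynamic Memory VSP/VSC protocol.                 */
typedef struct {
    uint64_t pagefile_usage_pages;     /* global page-file usage              */
    uint64_t available_pages;
    uint64_t committed_pages;
    uint64_t dirty_pages;
    uint64_t page_faults_per_second;
    uint64_t free_and_zeroed_pages;    /* pages in the free and zeroed lists  */
} MemoryPressureReport;

/* Extremely rough pressure metric the balancer could derive from a report:
 * ratio of committed memory to what is actually available.                  */
static uint64_t pressure_percent(const MemoryPressureReport *r)
{
    uint64_t usable = r->available_pages + r->free_and_zeroed_pages;
    return usable ? (r->committed_pages * 100) / usable : 100;
}

int main(void)
{
    MemoryPressureReport r = { 2048, 50000, 120000, 900, 150, 30000 };
    printf("pressure: %llu%%\n", (unsigned long long)pressure_percent(&r));
    return 0;
}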

The VM Worker process in the root partition uses the services exposed by the VMMS balancer, a component of the VmCompute service, to perform the calculation needed to determine whether a hot add operation is possible. If the memory status of the root partition allows a hot add operation, the VMMS balancer calculates the proper number of pages to deposit in the child partition and calls back (through COM) the VM Worker process, which starts the hot add operation with the assistance of the VID driver:

  1. Reserves the proper amount of physical memory in the root partition

  2. Calls the hypervisor to map the system physical pages reserved by the root partition to some guest physical pages mapped in the child VM, with the proper protection

  3. Sends a message to the dynamic memory driver to start a hot add operation on the guest physical pages previously mapped by the hypervisor

The dynamic memory driver in the child partition uses the MmAddPhysicalMemory API exposed by the NT kernel to perform the hot add operation. The latter maps the PFNs describing the new guest physical memory in the PFN database, adding new backing pages to the database if needed.

In a similar way, when the VMMS balancer detects that the child VM has plenty of physical pages available, it may require the child partition (still through the VM Worker process) to hot remove some physical pages. The dynamic memory driver uses the MmRemovePhysicalMemory API to perform the hot remove operation. The NT kernel verifies that each page in the range specified by the balancer is either on the zeroed or free list, or it belongs to a stack that can be safely paged out. If all the conditions apply, the dynamic memory driver sends back the “hot removal” page range to the VM Worker process, which will use services provided by the VID driver to unmap the physical pages from the child partition and release them back to the NT kernel.
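The kernel-mode fragment below sketches how a driver such as Dmvsc.sys might invoke these services. The prototypes shown reflect the commonly documented form of the two exports, but the code is an approximation under that assumption, not the driver’s actual implementation (real callers, for instance, must handle partial success, because NumberOfBytes is updated with the amount actually added or removed).

#include <ntddk.h>

/* Commonly documented prototypes of the hot-add/hot-remove exports; treat
 * them as an approximation (they may not appear in every WDK header).       */
NTKERNELAPI NTSTATUS MmAddPhysicalMemory(PPHYSICAL_ADDRESS StartAddress,
                                         PLARGE_INTEGER NumberOfBytes);
NTKERNELAPI NTSTATUS MmRemovePhysicalMemory(PPHYSICAL_ADDRESS StartAddress,
                                            PLARGE_INTEGER NumberOfBytes);

NTSTATUS HotAddGuestPhysicalRange(ULONG64 StartGpa, ULONG64 Length)
{
    PHYSICAL_ADDRESS start;
    LARGE_INTEGER    bytes;

    start.QuadPart = (LONGLONG)StartGpa;   /* range already mapped by the hypervisor */
    bytes.QuadPart = (LONGLONG)Length;

    /* Creates the PFN entries for the new range (committing backing pages in
     * the sparse PFN database if needed) and puts the pages on the free list. */
    return MmAddPhysicalMemory(&start, &bytes);
}

NTSTATUS HotRemoveGuestPhysicalRange(ULONG64 StartGpa, ULONG64 Length)
{
    PHYSICAL_ADDRESS start;
    LARGE_INTEGER    bytes;

    start.QuadPart = (LONGLONG)StartGpa;
    bytes.QuadPart = (LONGLONG)Length;

    /* Succeeds only if the pages can be taken off the free/zeroed lists (or
     * safely paged out); their PFN entries are then marked "bad".             */
    return MmRemovePhysicalMemory(&start, &bytes);
}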

Note

Dynamic memory is not supported when nested virtualization is enabled.

Hyper-V schedulers

The hypervisor is a kind of micro operating system that runs below the root partition’s OS (Windows). As such, it should be able to decide which thread (backing a virtual processor) is being executed by which physical processor. This is especially true when the system runs multiple virtual machines that, in total, expose more virtual processors than the physical processors installed in the workstation. The hypervisor scheduler’s role is to select the next thread that a physical CPU executes after the allocated time slice of the current one ends. Hyper-V can use three different schedulers. To properly manage all the different schedulers, the hypervisor exposes the scheduler APIs, a set of routines that are the only entries into the hypervisor scheduler. Their sole purpose is to redirect API calls to the particular scheduler implementation.

The classic scheduler

The classic scheduler has been the default scheduler used on all versions of Hyper-V since its initial release. The classic scheduler in its default configuration implements a simple, round-robin policy in which any virtual processor in the current execution state (the execution state depends on the total number of VMs running in the system) is equally likely to be dispatched. The classic scheduler also supports setting a virtual processor’s affinity and makes scheduling decisions considering the physical processor’s NUMA node. The classic scheduler doesn’t know what a guest VP is currently executing. The only exception is defined by the spin-lock enlightenment. When the Windows kernel, which is running in a partition, is going to perform an active wait on a spin-lock, it emits a hypercall to inform the hypervisor (high-IRQL synchronization mechanisms are described in Chapter 8, “System mechanisms”). The classic scheduler can preempt the currently executing virtual processor (which hasn’t expired its allocated time slice yet) and schedule another one. In this way it saves the active CPU spin cycles.

The default configuration of the classic scheduler assigns an equal time slice to each VP. This means that in high-workload oversubscribed systems, where multiple virtual processors attempt to execute, and the physical processors are sufficiently busy, performance can quickly degrade. To overcome the problem, the classic scheduler supports different fine-tuning options (see Figure 9-12), which can modify its internal scheduling decision:

  •     VP reservations A user can reserve the CPU capacity in advance on behalf of a guest machine. The reservation is specified as the percentage of the capacity of a physical processor to be made available to the guest machine whenever it is scheduled to run. As a result, Hyper-V schedules the VP to run only if that minimum amount of CPU capacity is available (meaning that the allocated time slice is guaranteed).

  •     VP limits Similar to VP reservations, a user can limit the percentage of physical CPU usage for a VP. This means reducing the available time slice allocated to a VP in a high workload scenario.

  •     VP weight This controls the probability that a VP is scheduled when the reservations have already been met. In default configurations, each VP has an equal probability of being executed. When the user configures weight on the VPs that belong to a virtual machine, scheduling decisions become based on the relative weighting factor the user has chosen. For example, let’s assume that a system with four CPUs runs three virtual machines at the same time. The first VM has set a weighting factor of 100, the second 200, and the third 300. Assuming that all the system’s physical processors are allocated to a uniform number of VPs, the probability of a VP in the first VM being dispatched is 17%, of a VP in the second VM is 33%, and of a VP in the third one is 50%, as the short calculation below confirms.
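The proportional-share arithmetic behind those percentages is simple; the weights are taken from the example above.

#include <stdio.h>

/* Proportional-share arithmetic behind VP weights: a VM's chance of having
 * one of its VPs dispatched is its weight divided by the sum of all weights. */
int main(void)
{
    const int weights[] = { 100, 200, 300 };     /* the three VMs in the example */
    const int vm_count  = 3;

    int total = 0;
    for (int i = 0; i < vm_count; i++)
        total += weights[i];

    for (int i = 0; i < vm_count; i++)
        printf("VM %d: weight %d -> %.0f%% dispatch probability\n",
               i + 1, weights[i], 100.0 * weights[i] / total);
    /* Prints 17%, 33%, and 50%, matching the text. */
    return 0;
}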

Image

Figure 9-12 The classic scheduler fine-tuning settings property page, which is available only when the classic scheduler is enabled.

The core scheduler

Normally, a classic CPU’s core has a single execution pipeline in which streams of instructions are executed one after the other. An instruction enters the pipe, proceeds through several stages of execution (load data, compute, store data, for example), and is retired from the pipe. Different types of instructions use different parts of the CPU core. A modern CPU core is often able to execute multiple sequential instructions in the stream out of order (with respect to the order in which they entered the pipeline). Modern CPUs, which support out-of-order execution, often implement what is called symmetric multithreading (SMT): a CPU core has two execution pipelines and presents more than one logical processor to the system; thus, two different instruction streams can be executed side by side by a single shared execution engine. (The resources of the core, like its caches, are shared.) The two execution pipelines are exposed to the software as independent processors (CPUs). From now on, with the term logical processor (or simply LP), we will refer to an execution pipeline of an SMT core exposed to Windows as an independent CPU. (SMT is discussed in Chapters 2 and 4 of Part 1.)

This hardware implementation has led to many security problems: an instruction executed by one logical processor can interfere with and affect the instructions executed by its sibling LP. Furthermore, the physical core’s cache memory is shared; an LP can alter the content of the cache. The other sibling CPU can potentially probe the data located in the cache by measuring the time the processor takes to access the memory addressed by the same cache line, thus revealing “secret data” accessed by the other logical processor (as described in the “Hardware side-channel vulnerabilities” section of Chapter 8). The classic scheduler can normally select two threads belonging to different VMs to be executed by two LPs in the same processor core. This is clearly not acceptable because in this context, the first virtual machine could potentially read data belonging to the other one.

To overcome this problem, and to be able to run SMT-enabled VMs with predictable performance, Windows Server 2016 has introduced the core scheduler. The core scheduler leverages the properties of SMT to provide isolation and a strong security boundary for guest VPs. When the core scheduler is enabled, Hyper-V schedules virtual cores onto physical cores. Furthermore, it ensures that VPs belonging to different VMs are never scheduled on sibling SMT threads of a physical core. The core scheduler enables the virtual machine for making use of SMT. The VPs exposed to a VM can be part of an SMT set. The OS and applications running in the guest virtual machine can use SMT behavior and programming interfaces (APIs) to control and distribute work across SMT threads, just as they would when run nonvirtualized.

Figure 9-13 shows an example of an SMT system with four logical processors distributed in two CPU cores. In the figure, three VMs are running. The first and second VMs have four VPs in two groups of two, whereas the third one has only one assigned VP. The groups of VPs in the VMs are labelled A through E. Individual VPs in a group that are idle (have no code to execute) are filled with a darker color.

Image

Figure 9-13 A sample SMT system with two processors’ cores and three VMs running.

Each core has a run list containing groups of VPs that are ready to execute, and a deferred list of groups of VPs that are ready to run but have not been added to the core’s run list yet. The groups of VPs execute on the physical cores. If all VPs in a group become idle, then the VP group is descheduled and does not appear on any run list. (In Figure 9-13, this is the situation for VP group D.) The only VP of group E has recently left the idle state. The VP has been assigned to CPU core 2. In the figure, a dummy sibling VP is shown; this is because the other LP of core 2 never schedules any VP while its sibling is executing a VP belonging to VM 3. In the same way, no other VPs are scheduled on a physical core if one VP in the LP group becomes idle but the other is still executing (such as for group A, for example). Each core executes the VP group that is at the head of its run list. If there are no VP groups to execute, the core becomes idle and waits for a VP group to be deposited onto its deferred run list. When this occurs, the core wakes up from idle and empties its deferred run list, placing the contents onto its run list.

The core scheduler is implemented by different components (see Figure 9-14) that provide strict layering between each other. The heart of the core scheduler is the scheduling unit, which represents a virtual core or group of SMT VPs. (For non-SMT VMs, it represents a single VP.) Depending on the VM’s type, the scheduling unit has either one or two threads bound to it. The hypervisor’s process owns a list of scheduling units, which own the threads backing the VPs belonging to the VM. The scheduling unit is the single unit of scheduling for the core scheduler, to which scheduling settings—such as reservation, weight, and cap—are applied during runtime. A scheduling unit stays active for the duration of a time slice, can be blocked and unblocked, and can migrate between different physical processor cores. An important concept is that the scheduling unit is analogous to a thread in the classic scheduler, but it doesn’t have a stack or VP context in which to run. It’s one of the threads bound to a scheduling unit that runs on a physical processor core. The thread gang scheduler is the arbiter for each scheduling unit. It’s the entity that decides which thread from the active scheduling unit gets run by which LP from the physical processor core. It enforces thread affinities, applies thread scheduling policies, and updates the related counters for each thread.

Image

Figure 9-14 The components of the core scheduler.

Each LP of the physical processor’s core has an instance of a logical processor dispatcher associated with it. The logical processor dispatcher is responsible for switching threads, maintaining timers, and flushing the VMCS (or VMCB, depending on the architecture) for the current thread. Logical processor dispatchers are owned by the core dispatcher, which represents a single physical processor core and owns exactly two SMT LPs. The core dispatcher manages the current (active) scheduling unit. The unit scheduler, which is bound to its own core dispatcher, decides which scheduling unit needs to run next on the physical processor core the unit scheduler belongs to. The last important component of the core scheduler is the scheduler manager, which owns all the unit schedulers in the system and has a global view of all their states. It provides load balancing and ideal core assignment services to the unit scheduler.
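
The hypervisor’s internal data structures are not public, but the layering just described can be visualized with a purely conceptual sketch. Every type and field name below is hypothetical and serves only to show how the components nest:

#include <stdint.h>

/* Purely conceptual sketch of the core scheduler layering; all names are
   hypothetical and do not reflect the real hypervisor implementation. */
typedef struct _SCHEDULING_UNIT {
    struct _VP_THREAD *Threads[2];       /* one or two VP threads (an SMT pair) */
    uint64_t Reservation, Weight, Cap;   /* settings applied to the whole unit  */
} SCHEDULING_UNIT;

typedef struct _CORE_DISPATCHER {
    struct _LP_DISPATCHER *LpDispatchers[2]; /* one per SMT LP of the core      */
    SCHEDULING_UNIT *ActiveUnit;             /* scheduling unit currently run   */
} CORE_DISPATCHER;

typedef struct _UNIT_SCHEDULER {
    CORE_DISPATCHER *CoreDispatcher;     /* picks the next unit for this core   */
} UNIT_SCHEDULER;

typedef struct _SCHEDULER_MANAGER {
    UNIT_SCHEDULER **UnitSchedulers;     /* global view of all unit schedulers: */
    uint32_t UnitSchedulerCount;         /* load balancing, ideal core choice   */
} SCHEDULER_MANAGER;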

The root scheduler

The root scheduler (also known as integrated scheduler) was introduced in the Windows 10 April 2018 Update (RS4) with the goal of allowing the root partition to schedule virtual processors (VPs) belonging to guest partitions. The root scheduler was designed to support the lightweight containers used by Windows Defender Application Guard. Those types of containers (internally called Barcelona or Krypton containers) must be managed by the root partition and should consume a small amount of memory and hard-disk space. (Describing Krypton containers in depth is outside the scope of this book. You can find an introduction to server containers in Part 1, Chapter 3, “Processes and jobs.”) In addition, the root OS scheduler can readily gather metrics about workload CPU utilization inside the container and use this data as input to the same scheduling policy applicable to all other workloads in the system.

The NT scheduler in the root partition’s OS instance manages all aspects of scheduling work to system LPs. To achieve that, the integrated scheduler’s root component inside the VID driver creates a VP-dispatch thread inside of the root partition (in the context of the new VMMEM process) for each guest VP. (VA-backed VMs are discussed later in this chapter.) The NT scheduler in the root partition schedules VP-dispatch threads as regular threads subject to additional VM/VP-specific scheduling policies and enlightenments. Each VP-dispatch thread runs a VP-dispatch loop until the VID driver terminates the corresponding VP.

The VP-dispatch thread is created by the VID driver after the VM Worker Process (VMWP), which is covered in the “Virtualization stack” section later in this chapter, has requested the creation of the partition and its VPs through the SETUP_PARTITION IOCTL. The VID driver communicates with the WinHvr driver, which in turn initializes the hypervisor’s guest partition creation (through the HvCreatePartition hypercall). If the created partition represents a VA-backed VM, or if the root scheduler is active on the system, the VID driver calls into the NT kernel (through a kernel extension) to create the VMMEM minimal process associated with the new guest partition. The VID driver also creates a VP-dispatch thread for each VP belonging to the partition. The VP-dispatch thread executes in the context of the VMMEM process in kernel mode (no user mode code exists in VMMEM) and is implemented in the VID driver (and WinHvr). As shown in Figure 9-15, each VP-dispatch thread runs a VP-dispatch loop until the VID terminates the corresponding VP or an intercept is generated from the guest partition.

Image

Figure 9-15 The root scheduler’s VP-dispatch thread and the associated VMWP worker thread that processes the hypervisor’s messages.

While in the VP-dispatch loop, the VP-dispatch thread is responsible for the following (a conceptual sketch of the loop follows the list):

  1. Call the hypervisor’s new HvDispatchVp hypercall interface to dispatch the VP on the current processor. On each HvDispatchVp hypercall, the hypervisor tries to switch context from the current root VP to the specified guest VP and lets it run the guest code. One of the most important characteristics of this hypercall is that the code that emits it should run at PASSIVE_LEVEL IRQL. The hypervisor lets the guest VP run until either the VP blocks voluntarily, the VP generates an intercept for the root, or there is an interrupt targeting the root VP. Clock interrupts are still processed by the root partition. When the guest VP exhausts its allocated time slice, the VP-backing thread is preempted by the NT scheduler. On any of the three events, the hypervisor switches back to the root VP and completes the HvDispatchVp hypercall. It then returns to the root partition.

  2. Block on the VP-dispatch event if the corresponding VP in the hypervisor is blocked. Anytime the guest VP is blocked voluntarily, the VP-dispatch thread blocks itself on a VP-dispatch event until the hypervisor unblocks the corresponding guest VP and notifies the VID driver. The VID driver signals the VP-dispatch event, and the NT scheduler unblocks the VP-dispatch thread that can make another HvDispatchVp hypercall.

  3. Process all intercepts reported by the hypervisor on return from the dispatch hypercall. If the guest VP generates an intercept for the root, the VP-dispatch thread processes the intercept request on return from the HvDispatchVp hypercall and makes another HvDispatchVp request after the VID completes processing of the intercept. Each intercept is managed differently. If the intercept requires processing from the user mode VMWP process, the WinHvr driver exits the loop and returns to the VID, which signals an event for the backing VMWP thread and waits for the intercept message to be processed by the VMWP process before restarting the loop.
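
The three responsibilities above can be condensed into a conceptual sketch of the dispatch loop. Only HvDispatchVp is a real hypercall name; every other identifier is a hypothetical stand-in for logic implemented by the VID and WinHvr drivers:

#include <stdbool.h>
#include <stdint.h>

typedef enum {
    DispatchVpTerminated,       /* the VID tore down the VP               */
    DispatchVpBlocked,          /* the guest VP blocked voluntarily       */
    DispatchInterceptPending    /* the guest generated an intercept       */
} DISPATCH_RESULT;

/* Hypothetical helpers wrapping the real HvDispatchVp hypercall and the
   VID/VMWP intercept machinery described in the text. */
extern DISPATCH_RESULT DispatchVpAndClassifyExit(uint32_t VpIndex);
extern bool HandleInterceptInVid(uint32_t VpIndex);
extern void SignalVmwpWorkerAndWait(uint32_t VpIndex);
extern void WaitForVpDispatchEvent(uint32_t VpIndex);

void VpDispatchLoop(uint32_t VpIndex)    /* runs at PASSIVE_LEVEL in VMMEM */
{
    for (;;) {
        switch (DispatchVpAndClassifyExit(VpIndex)) {
        case DispatchVpTerminated:
            return;                               /* exit the loop         */
        case DispatchVpBlocked:
            WaitForVpDispatchEvent(VpIndex);      /* signaled by the VID   */
            break;                                /* then dispatch again   */
        case DispatchInterceptPending:
            if (!HandleInterceptInVid(VpIndex))   /* kernel-mode handling  */
                SignalVmwpWorkerAndWait(VpIndex); /* defer to user mode    */
            break;
        }
    }
}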

To properly deliver signals to VP-dispatch threads from the hypervisor to the root, the integrated scheduler provides a scheduler message exchange mechanism. The hypervisor sends scheduler messages to the root partition via a shared page. When a new message is ready for delivery, the hypervisor injects a SINT interrupt into the root, and the root delivers it to the corresponding ISR handler in the WinHvr driver, which routes the message to the VID intercept callback (VidInterceptIsrCallback). The intercept callback tries to handle the intercept message directly from the VID driver. If direct handling is not possible, a synchronization event is signaled, which allows the dispatch loop to exit and allows one of the VmWp worker threads to dispatch the intercept in user mode.

Context switches when the root scheduler is enabled are more expensive compared to other hypervisor scheduler implementations. When the system switches between two guest VPs, for example, it always needs to generate two exits to the root partition. The integrated scheduler treats the hypervisor’s root VP threads and guest VP threads very differently (they are internally represented by the same TH_THREAD data structure, though):

  •     Only the root VP thread can enqueue a guest VP thread to its physical processor. The root VP thread has priority over any guest VP that is running or being dispatched. If the root VP is not blocked, the integrated scheduler tries its best to switch the context to the root VP thread as soon as possible.

  •     A guest VP thread has two sets of states: thread internal states and thread root states. The thread root states reflect the states of the VP-dispatch thread that the hypervisor communicates to the root partition. The integrated scheduler maintains those states for each guest VP thread to know when to send a wake-up signal for the corresponding VP-dispatch thread to the root.

Only the root VP can initiate a dispatch of a guest VP for its processor. It can do that either because of HvDispatchVp hypercalls (in this situation, we say that the hypervisor is processing “external work”), or because of any other hypercall that requires sending a synchronous request to the target guest VP (this is what is defined as “internal work”). If the guest VP last ran on the current physical processor, the scheduler can dispatch the guest VP thread right away. Otherwise, the scheduler needs to send a flush request to the processor on which the guest VP last ran and wait for the remote processor to flush the VP context. The latter case is defined as “migration” and is a situation that the hypervisor needs to track (through the thread internal states and root states, which are not described here).

Hypercalls and the hypervisor TLFS

Hypercalls provide a mechanism to the operating system running in the root or in a child partition to request services from the hypervisor. Hypercalls have a well-defined set of input and output parameters. The hypervisor Top Level Functional Specification (TLFS) is available online (https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs); it defines the different calling conventions used while specifying those parameters. Furthermore, it lists all the publicly available hypervisor features, partition properties, and hypervisor and VSM interfaces.

Hypercalls are available because of a platform-dependent opcode (VMCALL for Intel systems, VMMCALL for AMD, HVC for ARM64) which, when invoked, always causes a VM_EXIT into the hypervisor. VM_EXITs are events that cause the hypervisor to restart executing its own code in the hypervisor privilege level, which is higher than any other software running in the system (except for firmware’s SMM context), while the VP is suspended. VM_EXIT events can be generated for various reasons. In the platform-specific VMCS (or VMCB) opaque data structure, the hardware maintains an index that specifies the exit reason for the VM_EXIT. The hypervisor gets the index and, in case of an exit caused by a hypercall, reads the hypercall input value specified by the caller (generally from a CPU’s general-purpose register—RCX in the case of 64-bit Intel and AMD systems). The hypercall input value (see Figure 9-16) is a 64-bit value that specifies the hypercall code, its properties, and the calling convention used for the hypercall. Three kinds of calling conventions are available:

  •     Standard hypercalls Store the input and output parameters on 8-byte aligned guest physical addresses (GPAs). The OS passes the two addresses via general-purpose registers (RDX and R8 on Intel and AMD 64-bit systems).

  •     Fast hypercalls Usually don’t allow output parameters and employ the two general-purpose registers used in standard hypercalls to pass only input parameters to the hypervisor (up to 16 bytes in size).

  •     Extended fast hypercalls (or XMM fast hypercalls) Similar to fast hypercalls, but these use an additional six floating-point registers to allow the caller to pass input parameters up to 112 bytes in size.

Image

Figure 9-16 The hypercall input value (from the hypervisor TLFS).

There are two classes of hypercalls: simple and rep (which stands for “repeat”). A simple hypercall performs a single operation and has a fixed-size set of input and output parameters. A rep hypercall acts like a series of simple hypercalls. When a caller initially invokes a rep hypercall, it specifies a rep count that indicates the number of elements in the input or output parameter list. Callers also specify a rep start index that indicates the next input or output element that should be consumed.
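
The layout of Figure 9-16 can be expressed as a small encoding helper. The bit positions below follow the public TLFS (call code in the low 16 bits, the fast flag in bit 16, the rep count and rep start index in the upper half); treat this as an illustration rather than a normative definition:

#include <stdint.h>

/* Builds a hypercall input value following the layout described in the TLFS
   and shown in Figure 9-16. Illustrative helper, not a complete definition. */
static inline uint64_t BuildHypercallInput(uint16_t callCode, int fast,
                                           uint16_t repCount, uint16_t repStart)
{
    uint64_t input = 0;
    input |= (uint64_t)callCode;                   /* bits 0-15:  call code      */
    input |= (uint64_t)(fast ? 1 : 0) << 16;       /* bit  16:    fast hypercall */
    input |= ((uint64_t)repCount & 0xFFF) << 32;   /* bits 32-43: rep count      */
    input |= ((uint64_t)repStart & 0xFFF) << 48;   /* bits 48-59: rep start idx  */
    return input;                                  /* remaining bits stay zero   */
}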

All hypercalls return another 64-bit value called the hypercall result value (see Figure 9-17). Generally, the result value describes the operation’s outcome and, for rep hypercalls, the total number of completed repetitions.

Image

Figure 9-17 The hypercall result value (from the hypervisor TLFS).

Hypercalls could take some time to complete. Keeping a physical CPU busy without letting it receive interrupts can be dangerous for the host OS. For example, Windows has a mechanism that detects whether a CPU has not received its clock tick interrupt for a period of time longer than 16 milliseconds. If this condition is detected, the system is suddenly stopped with a BSOD. The hypervisor therefore relies on a hypercall continuation mechanism for some hypercalls, including all rep hypercall forms. If a hypercall isn’t able to complete within the prescribed time limit (usually 50 microseconds), control is returned back to the caller (through an operation called VM_ENTRY), but the instruction pointer is not advanced past the instruction that invoked the hypercall. This allows pending interrupts to be handled and other virtual processors to be scheduled. When the original calling thread resumes execution, it will re-execute the hypercall instruction and make forward progress toward completing the operation.

A driver usually never emits a hypercall directly through the platform-dependent opcode. Instead, it uses services exposed by the Windows hypervisor interface driver, which is available in two different versions:

  •     WinHvr.sys Loaded at system startup if the OS is running in the root partition and exposes hypercalls available in both the root and child partition.

  •     WinHv.sys Loaded only when the OS is running in a child partition. It exposes hypercalls available in the child partition only.

Routines and data structures exported by the Windows hypervisor interface driver are extensively used by the virtualization stack, especially by the VID driver, which, as we have already introduced, plays a key role in the functionality of the entire Hyper-V platform.

Intercepts

The root partition should be able to create a virtual environment that allows an unmodified guest OS, which was written to execute on physical hardware, to run in a hypervisor’s guest partition. Such legacy guests may attempt to access physical devices that do not exist in a hypervisor partition (for example, by accessing certain I/O ports or by writing to specific MSRs). For these cases, the hypervisor provides the host intercepts facility; when a VP of a guest VM executes certain instructions or generates certain exceptions, the authorized root partition can intercept the event and alter the effect of the intercepted instruction such that, to the child, it mirrors the expected behavior in physical hardware.

When an intercept event occurs in a child partition, its VP is suspended, and an intercept message is sent to the root partition by the Synthetic Interrupt Controller (SynIC; see the following section for more details) from the hypervisor. The message is received thanks to the hypervisor’s Synthetic ISR (Interrupt Service Routine), which the NT kernel installs during phase 0 of its startup only in case the system is enlightened and running under the hypervisor (see Chapter 12 for more details). The hypervisor synthetic ISR (KiHvInterrupt), usually installed on vector 0x30, transfers its execution to an external callback, which the VID driver has registered when it started (through the exposed HvlRegisterInterruptCallback NT kernel API).

The VID driver is an intercept driver, meaning that it is able to register host intercepts with the hypervisor and thus receives all the intercept events that occur on child partitions. After the partition is initialized, the VM Worker process registers intercepts for various components of the virtualization stack. (For example, the virtual motherboard registers I/O intercepts for each virtual COM port of the VM.) It sends an IOCTL to the VID driver, which uses the HvInstallIntercept hypercall to install the intercept on the child partition. When the child partition raises an intercept, the hypervisor suspends the VP and injects a synthetic interrupt into the root partition, which is managed by the KiHvInterrupt ISR. The latter routine transfers the execution to the registered VID intercept callback, which manages the event and restarts the VP by clearing the intercept suspend synthetic register of the suspended VP.

The hypervisor supports the interception of the following events in the child partition:

  •     Access to I/O ports (read or write)

  •     Access to the VP’s MSRs (read or write)

  •     Execution of CPUID instruction

  •     Exceptions

  •     Accesses to general-purpose registers

  •     Hypercalls

The synthetic interrupt controller (SynIC)

The hypervisor virtualizes interrupts and exceptions for both the root and guest partitions through the synthetic interrupt controller (SynIC), which is an extension of a virtualized local APIC (see the Intel or AMD software developer manual for more details about the APIC). The SynIC is responsible for dispatching virtual interrupts to virtual processors (VPs). Interrupts delivered to a partition fall into two categories: external and synthetic (also known as internal or simply virtual interrupts). External interrupts originate from other partitions or devices; synthetic interrupts originate from the hypervisor itself and are targeted to a partition’s VP.

When a VP in a partition is created, the hypervisor creates and initializes a SynIC for each supported VTL. It then starts the VTL 0’s SynIC, which means that it enables the virtualization of a physical CPU’s APIC in the VMCS (or VMCB) hardware data structure. The hypervisor supports three kinds of APIC virtualization while dealing with external hardware interrupts:

  •     In standard configuration, the APIC is virtualized through the event injection hardware support. This means that every time a partition accesses the VP’s local APIC registers, I/O ports, or MSRs (in the case of x2APIC), it produces a VMEXIT, causing hypervisor code to dispatch the interrupt through the SynIC, which eventually “injects” an event into the correct guest VP by manipulating VMCS/VMCB opaque fields (after it goes through logic similar to that of a physical APIC, which determines whether the interrupt can be delivered).

  •     The APIC emulation mode works similarly to the standard configuration. Every physical interrupt sent by the hardware (usually through the IOAPIC) still causes a VMEXIT, but the hypervisor does not have to inject any event. Instead, it manipulates a virtual-APIC page used by the processor to virtualize certain accesses to the APIC registers. When the hypervisor wants to inject an event, it simply manipulates some virtual registers mapped in the virtual-APIC page. The event is delivered by the hardware when a VMENTRY happens. At the same time, if a guest VP manipulates certain parts of its local APIC, it does not produce any VMEXIT, but the modification is stored in the virtual-APIC page.

  •     Posted interrupts allow certain kinds of external interrupts to be delivered directly in the guest partition without producing any VMEXIT. This allows direct access devices to be mapped directly in the child partition without incurring any performance penalties caused by the VMEXITs. The physical processor processes the virtual interrupts by directly recording them as pending on the virtual-APIC page. (For more details, consult the Intel or AMD software developer manual.)

When the hypervisor starts a processor, it usually initializes the synthetic interrupt controller module for the physical processor (represented by a CPU_PLS data structure). The SynIC module of the physical processor is an array of interrupt descriptors, which make the connection between a physical interrupt and a virtual interrupt. A hypervisor interrupt descriptor (IDT entry), as shown in Figure 9-18, contains the data needed for the SynIC to correctly dispatch the interrupt, in particular the entity the interrupt is delivered to (a partition, the hypervisor, a spurious interrupt), the target VP (root, a child, multiple VPs, or a synthetic interrupt), the interrupt vector, the target VTL, and some other interrupt characteristics.

Image

Figure 9-18 The hypervisor physical interrupt descriptor.

In default configurations, all the interrupts are delivered to the root partition in VTL 0 or to the hypervisor itself (in the second case, the interrupt entry is Hypervisor Reserved). External interrupts can be delivered to a guest partition only when a direct access device is mapped into a child partition; NVMe devices are a good example.

Every time the thread backing a VP is selected for execution, the hypervisor checks whether one (or more) synthetic interrupt needs to be delivered. As discussed previously, synthetic interrupts aren’t generated by any hardware; they’re usually generated from the hypervisor itself (under certain conditions), and they are still managed by the SynIC, which is able to inject the virtual interrupt to the correct VP. Even though they’re extensively used by the NT kernel (the enlightened clock timer is a good example), synthetic interrupts are fundamental for the Virtual Secure Mode (VSM). We discuss them in the section “The Secure Kernel” later in this chapter.

The root partition can send a customized virtual interrupt to a child by using the HvAssertVirtualInterrupt hypercall (documented in the TLFS).

Inter-partition communication

The synthetic interrupt controller also has the important role of providing inter-partition communication facilities to the virtual machines. The hypervisor provides two principal mechanisms for one partition to communicate with another: messages and events. In both cases, the notifications are sent to the target VP using synthetic interrupts. Messages and events are sent from a source partition to a target partition through a preallocated connection, which is associated with a destination port.

One of the most important components that uses the inter-partition communication services provided by the SynIC is VMBus. (VMBus architecture is discussed in the “Virtualization stack” section later in this chapter.) The VMBus root driver (Vmbusr.sys) in the root allocates a port ID (ports are identified by a 32-bit ID) and creates a port in the child partition by emitting the HvCreatePort hypercall through the services provided by the WinHv driver.

A port is allocated in the hypervisor from the receiver’s memory pool. When a port is created, the hypervisor allocates sixteen message buffers from the port memory. The message buffers are maintained in a queue associated with a SINT (synthetic interrupt source) in the virtual processor’s SynIC. The hypervisor exposes sixteen interrupt sources, which allows the VMBus root driver to manage a maximum of 16 message queues. A synthetic message has a fixed size of 256 bytes and can transfer only 240 bytes of payload (16 bytes are used for the header). The caller of the HvCreatePort hypercall specifies which virtual processor and SINT to target.

To correctly receive messages, the WinHv driver allocates a synthetic interrupt message page (SIMP), which is then shared with the hypervisor. When a message is enqueued for a target partition, the hypervisor copies the message from its internal queue to the SIMP slot corresponding to the correct SINT. The VMBus root driver then creates a connection, which associates the port opened in the child VM to the parent, through the HvConnectPort hypercall. After the child has enabled the reception of synthetic interrupts in the correct SINT slot, the communication can start; the sender can post a message to the client by specifying a target port ID and emitting the HvPostMessage hypercall. The hypervisor injects a synthetic interrupt into the target VP, which can then read the content of the message from the message page (SIMP).
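
The sizes quoted above (256-byte messages, a 16-byte header, and 240 bytes of payload) translate into a layout similar to the following. This is a simplified rendition of the synthetic message format described in the TLFS, shown here only to make the arithmetic visible:

#include <stdint.h>

/* Simplified sketch of a SynIC synthetic message (see the TLFS for the
   authoritative definition): a 16-byte header plus 240 payload bytes. */
typedef struct _SYNTHETIC_MESSAGE_SKETCH {
    uint32_t MessageType;      /* identifies the message (0 = slot empty)      */
    uint8_t  PayloadSize;      /* number of valid payload bytes, up to 240     */
    uint8_t  MessageFlags;     /* for example, a "message pending" indication  */
    uint16_t Reserved;
    uint64_t PortId;           /* port (or sender) the message arrived from    */
    uint64_t Payload[30];      /* 240 bytes of payload                         */
} SYNTHETIC_MESSAGE_SKETCH;    /* total size: 16 + 240 = 256 bytes             */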

The hypervisor supports ports and connections of three types:

  •     Message ports Transmit 240-byte messages from and to a partition. A message port is associated with a single SINT in the parent and child partition. Messages will be delivered in order through a single port message queue. This characteristic makes messages ideal for VMBus channel setup and teardown (further details are provided in the “Virtualization stack” section later in this chapter).

  •     Event ports Receive simple interrupts associated with a set of flags, set by the hypervisor when the opposite endpoint makes a HvSignalEvent hypercall. This kind of port is normally used as a synchronization mechanism. VMBus, for example, uses an event port to notify that a message has been posted on the ring buffer described by a particular channel. When the event interrupt is delivered to the target partition, the receiver knows exactly to which channel the interrupt is targeted thanks to the flag associated with the event.

  •     Monitor ports An optimization of event ports. Causing a VMEXIT and a VM context switch for every single HvSignalEvent hypercall is an expensive operation. Monitor ports are set up by allocating a shared page (between the hypervisor and the partition) that contains a data structure indicating which event port is associated with a particular monitored notification flag (a bit in the page). In that way, when the source partition wants to send a synchronization interrupt, it can just set the corresponding flag in the shared page. Sooner or later the hypervisor will notice the bit set in the shared page and will trigger an interrupt to the event port (see the sketch after this list).
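
The monitor-port optimization boils down to an atomic bit set on the shared page instead of a hypercall. The fragment below is a conceptual illustration only; the real layout of the monitor page is defined by the hypervisor and is not reproduced here:

#include <windows.h>

/* Conceptual illustration of signaling through a monitor port: instead of
   emitting an HvSignalEvent hypercall, the source partition atomically sets
   the notification flag that the hypervisor associated with the event port.
   The page layout used here is hypothetical. */
typedef struct _MONITOR_PAGE_SKETCH {
    volatile LONG TriggerGroups[32];     /* monitored notification flags */
} MONITOR_PAGE_SKETCH;

void SignalMonitoredEvent(MONITOR_PAGE_SKETCH *page, ULONG flagIndex)
{
    /* Sooner or later the hypervisor notices the set bit and signals the
       corresponding event port on the sender's behalf. */
    InterlockedBitTestAndSet(&page->TriggerGroups[flagIndex / 32],
                             (LONG)(flagIndex % 32));
}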

The Windows hypervisor platform API and EXO partitions

Windows increasingly uses Hyper-V’s hypervisor for providing functionality not only related to running traditional VMs. In particular, as we discuss in the second part of this chapter, VSM, an important security component of modern Windows versions, leverages the hypervisor to enforce a higher level of isolation for features that provide critical system services or handle secrets such as passwords. Enabling these features requires that the hypervisor be running by default on a machine.

External virtualization products, like VMware, QEMU, VirtualBox, Android Emulator, and many others, use the virtualization extensions provided by the hardware to build their own hypervisors, which they need in order to run their VMs correctly. This is clearly not compatible with Hyper-V, which launches its hypervisor before the Windows kernel starts up in the root partition (the Windows hypervisor is a native, or bare-metal, hypervisor).

Like Hyper-V, external virtualization solutions are also composed of a hypervisor, which provides generic low-level abstractions for the processor’s execution and memory management of the VM, and a virtualization stack, which refers to the components of the virtualization solution that provide the emulated environment for the VM (like its motherboard, firmware, storage controllers, devices, and so on).

The Windows Hypervisor Platform API, which is documented at https://docs.microsoft.com/en-us/virtualization/api/, has the main goal of enabling third-party virtualization solutions to run on the Windows hypervisor. Specifically, a third-party virtualization product should be able to create, delete, start, and stop VMs with characteristics (firmware, emulated devices, storage controllers) defined by its own virtualization stack. The third-party virtualization stack, with its management interfaces, continues to run on Windows in the root partition, which allows its clients to keep using their VMs unchanged.

As shown in Figure 9-19, all the Windows hypervisor platform’s APIs run in user mode and are implemented on top of the VID and WinHvr drivers in two libraries: WinHvPlatform.dll and WinHvEmulation.dll (the latter implements the instruction emulator for MMIO).

Image

Figure 9-19 The Windows hypervisor platform API architecture.

A user mode application that wants to create a VM and its virtual processors usually performs the following steps (a condensed code sketch follows the list):

  1. Create the partition in the VID library (Vid.dll) with the WHvCreatePartition API.

  2. Configure various internal partition’s properties—like its virtual processor count, the APIC emulation mode, the kind of requested VMEXITs, and so on—using the WHvSetPartitionProperty API.

  3. Create the partition in the VID driver and the hypervisor using the WHvSetupPartition API. (This kind of partition in the hypervisor is called an EXO partition, as described shortly.) The API also creates the partition’s virtual processors, which are created in a suspended state.

  4. Create the corresponding virtual processor(s) in the VID library through the WHvCreateVirtualProcessor API. This step is important because the API sets up and maps a message buffer into the user mode application, which is used for asynchronous communication with the hypervisor and the thread running the virtual CPUs.

  5. Allocate the address space of the partition by reserving a big range of virtual memory with the classic VirtualAlloc function (read more details in Chapter 5 of Part 1) and map it in the hypervisor through the WHvMapGpaRange API. A fine-grained protection of the guest physical memory can be specified when allocating guest physical memory in the guest virtual address space by committing different ranges of the reserved virtual memory.

  6. Create the page-tables and copy the initial firmware code in the committed memory.

  7. Set the initial VP’s registers content using the WHvSetVirtualProcessorRegisters API.

  8. Run the virtual processor by calling the WHvRunVirtualProcessor blocking API. The function returns only when the guest code executes an operation that requires handling in the virtualization stack (a VMEXIT in the hypervisor has been explicitly required to be managed by the third-party virtualization stack) or because of an external request (like the destroying of the virtual processor, for example).
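
The steps above map almost one-to-one onto Windows hypervisor platform API calls. The following condensed user mode sketch (error handling and the firmware/page-table setup of steps 6 and 7 are omitted) shows the general shape of such a client; it is an illustration, not a complete virtual machine monitor:

#include <windows.h>
#include <WinHvPlatform.h>       // link with WinHvPlatform.lib

int main(void)
{
    WHV_PARTITION_HANDLE partition;
    WHvCreatePartition(&partition);                                   // step 1

    WHV_PARTITION_PROPERTY prop = { 0 };                              // step 2
    prop.ProcessorCount = 1;
    WHvSetPartitionProperty(partition, WHvPartitionPropertyCodeProcessorCount,
                            &prop, sizeof(prop));

    WHvSetupPartition(partition);                                     // step 3
    WHvCreateVirtualProcessor(partition, 0, 0);                       // step 4

    SIZE_T ramSize = 0x400000;                                        // step 5
    void *ram = VirtualAlloc(NULL, ramSize, MEM_RESERVE | MEM_COMMIT,
                             PAGE_READWRITE);
    WHvMapGpaRange(partition, ram, 0, ramSize,
                   WHvMapGpaRangeFlagRead | WHvMapGpaRangeFlagWrite |
                   WHvMapGpaRangeFlagExecute);

    // ... steps 6 and 7: build the page tables, copy the firmware, and set the
    //     initial register state with WHvSetVirtualProcessorRegisters ...

    WHV_RUN_VP_EXIT_CONTEXT exitContext;                              // step 8
    WHvRunVirtualProcessor(partition, 0, &exitContext, sizeof(exitContext));
    // exitContext.ExitReason tells the virtualization stack what to emulate.

    WHvDeletePartition(partition);
    return 0;
}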

The Windows hypervisor platform APIs are usually able to call services in the hypervisor by sending different IOCTLs to the \Device\VidExo device object, which is created by the VID driver at initialization time, only if the HKLM\System\CurrentControlSet\Services\Vid\Parameters\ExoDeviceEnabled registry value is set to 1. Otherwise, the system does not enable any support for the hypervisor APIs.

Some performance-sensitive hypervisor platform APIs (a good example is provided by WHvRunVirtualProcessor) can instead call directly into the hypervisor from user mode thanks to the Doorbell page, which is a special invalid guest physical page that, when accessed, always causes a VMEXIT. The Windows hypervisor platform API obtains the address of the doorbell page from the VID driver. It writes to the doorbell page every time it emits a hypercall from user mode. The fault is identified and treated differently by the hypervisor thanks to the doorbell page’s physical address, which is marked as “special” in the SLAT page table. The hypervisor reads the hypercall’s code and parameters from the VP’s registers as per normal hypercalls, and ultimately transfers the execution to the hypercall’s handler routine. When the latter finishes its execution, the hypervisor finally performs a VMENTRY, landing on the instruction following the faulting one. This saves a significant number of clock cycles for the thread backing the guest VP, which no longer needs to enter the kernel to emit a hypercall. Furthermore, the VMCALL and similar opcodes always require kernel privileges to be executed.

The virtual processors of the new third-party VM are dispatched using the root scheduler. If the root scheduler is disabled, none of the hypervisor platform APIs can run. The created partition in the hypervisor is an EXO partition. EXO partitions are minimal partitions that don’t include any synthetic functionality and have certain characteristics ideal for creating third-party VMs:

  •     They are always VA-backed types. (More details about VA-backed or micro VMs are provided later in the “Virtualization stack” section.) The partition’s memory-hosting process is the user mode application that created the VM, not a new instance of the VMMEM process.

  •     They do not have any partition privileges or support any VTL (virtual trust level) other than 0. All of a classical partition’s privileges refer to synthetic functionality, which is usually exposed by the hypervisor to the Hyper-V virtualization stack. EXO partitions are used for third-party virtualization stacks; they do not need the functionality brought by any of the classical partition privileges.

  •     They manually manage timing. The hypervisor does not provide any virtual clock interrupt source for EXO partitions. The third-party virtualization stack must take over the responsibility of providing this. This means that every attempt to read the virtual processor’s time-stamp counter will cause a VMEXIT in the hypervisor, which will route the intercept to the user mode thread that runs the VP.

Image Note

EXO partitions include other minor differences compared to classical hypervisor partitions. For the sake of the discussion, however, those minor differences are irrelevant, so they are not mentioned in this book.

Nested virtualization

Large servers and cloud providers sometimes need to be able to run containers or additional virtual machines inside a guest partition. Figure 9-20 describes this scenario: The hypervisor that runs on top of the bare-metal hardware, identified as the L0 hypervisor (L0 stands for Level 0), uses the virtualization extensions provided by the hardware to create a guest VM. Furthermore, the L0 hypervisor emulates the processor’s virtualization extensions and exposes them to the guest VM (the ability to expose virtualization extensions is called nested virtualization). The guest VM can decide to run another instance of the hypervisor (which, in this case, is identified as the L1 hypervisor, where L1 stands for Level 1) by using the emulated virtualization extensions exposed by the L0 hypervisor. The L1 hypervisor creates the nested root partition and starts the L2 root operating system in it. In the same way, the L2 root can orchestrate with the L1 hypervisor to launch a nested guest VM. The final guest VM in this configuration takes the name of L2 guest.

Image

Figure 9-20 Nested virtualization scheme.

Nested virtualization is a software construction: the hypervisor must be able to emulate and manage virtualization extensions. Each virtualization instruction, while executed by the L1 guest VM, causes a VMEXIT to the L0 hypervisor, which, through its emulator, can reconstruct the instruction and perform the needed work to emulate it. At the time of this writing, only Intel and AMD hardware is supported. The nested virtualization capability should be explicitly enabled for the L1 virtual machine; otherwise, the L0 hypervisor injects a general protection exception in the VM in case a virtualization instruction is executed by the guest operating system.

On Intel hardware, Hyper-V allows nested virtualization to work thanks to two main concepts:

  •     Emulation of the VT-x virtualization extensions

  •     Nested address translation

As discussed previously in this section, for Intel hardware, the basic data structure that describes a virtual machine is the virtual machine control structure (VMCS). Other than the standard physical VMCS representing the L1 VM, when the L0 hypervisor creates a VP belonging to a partition that supports nested virtualization, it allocates some nested VMCS data structures (not to be confused with a virtual VMCS, which is a different concept). The nested VMCS is a software descriptor that contains all the information needed by the L0 hypervisor to start and run a nested VP for a L2 partition. As briefly introduced in the “Hypervisor startup” section, when the L1 hypervisor boots, it detects whether it’s running in a virtualized environment and, if so, enables various nested enlightenments, like the enlightened VMCS or the direct virtual flush (discussed later in this section).

As shown in Figure 9-21, for each nested VMCS, the L0 hypervisor also allocates a Virtual VMCS and a hardware physical VMCS, two similar data structures representing a VP running the L2 virtual machine. The virtual VMCS is important because it has the key role in maintaining the nested virtualized data. The physical VMCS instead is loaded by the L0 hypervisor when the L2 virtual machine is started; this happens when the L0 hypervisor intercepts a VMLAUNCH instruction executed by the L1 hypervisor.

Image

Figure 9-21 An L0 hypervisor running an L2 VM on virtual processor 2.

In the sample picture, the L0 hypervisor has scheduled the VP 2 for running a L2 VM managed by the L1 hypervisor (through the nested virtual processor 1). The L1 hypervisor can operate only on virtualization data replicated in the virtual VMCS.

Emulation of the VT-x virtualization extensions

On Intel hardware, the L0 hypervisor supports both enlightened and nonenlightened L1 hypervisors. The only officially supported configuration is Hyper-V running on top of Hyper-V, though.

In a nonenlightened hypervisor, all the VT-x instructions executed in the L1 guest cause a VMEXIT. After the L1 hypervisor has allocated the guest physical VMCS for describing the new L2 VM, it usually marks it as active (through the VMPTRLD instruction on Intel hardware). The L0 hypervisor intercepts the operation and associates an allocated nested VMCS with the guest physical VMCS specified by the L1 hypervisor. Furthermore, it fills the initial values for the virtual VMCS and sets the nested VMCS as active for the current VP. (It does not switch the physical VMCS, though; the execution context should remain the L1 hypervisor.) Each subsequent read or write to the physical VMCS performed by the L1 hypervisor is always intercepted by the L0 hypervisor and redirected to the virtual VMCS (refer to Figure 9-21).

When the L1 hypervisor launches the VM (performing an operation called VMENTRY), it executes a specific hardware instruction (VMLAUNCH on Intel hardware), which is intercepted by the L0 hypervisor. For nonenlightened scenarios, the L0 hypervisor copies all the guest fields of the virtual VMCS to another physical VMCS representing the L2 VM, writes the host fields by pointing them to L0 hypervisor’s entry points, and sets it as active (by using the hardware VMPTRLD instruction on Intel platforms). In case the L1 hypervisor uses the second level address translation (EPT for Intel hardware), the L0 hypervisor then shadows the currently active L1 extended page tables (see the following section for more details). Finally, it performs the actual VMENTRY by executing the specific hardware instruction. As a result, the hardware executes the L2 VM’s code.

While executing the L2 VM, each operation that causes a VMEXIT switches the execution context back to the L0 hypervisor (instead of the L1). As a response, the L0 hypervisor performs another VMENTRY on the original physical VMCS representing the L1 hypervisor context, injecting a synthetic VMEXIT event. The L1 hypervisor restarts the execution and handles the intercepted event as for regular non-nested VMEXITs. When the L1 completes the internal handling of the synthetic VMEXIT event, it executes a VMRESUME operation, which will be intercepted again by the L0 hypervisor and managed in a way similar to the initial VMENTRY operation described earlier.

Producing a VMEXIT each time the L1 hypervisor executes a virtualization instruction is an expensive operation, which contributes to the general slowdown of the L2 VM. To overcome this problem, the Hyper-V hypervisor supports the enlightened VMCS, an optimization that, when enabled, allows the L1 hypervisor to load, read, and write virtualization data from a memory page shared between the L1 and L0 hypervisors (instead of a physical VMCS). The shared page is called the enlightened VMCS. When the L1 hypervisor manipulates the virtualization data belonging to an L2 VM, instead of using hardware instructions, which cause a VMEXIT into the L0 hypervisor, it directly reads and writes from the enlightened VMCS. This significantly improves the performance of the L2 VM.

In enlightened scenarios, the L0 hypervisor intercepts only VMENTRY and VMEXIT operations (and some others that are not relevant for this discussion). The L0 hypervisor manages VMENTRY in a similar way to the nonenlightened scenario, but, before doing anything described previously, it copies the virtualization data located in the shared enlightened VMCS memory page to the virtual VMCS representing the L2 VM.

Image Note

It is worth mentioning that for nonenlightened scenarios, the L0 hypervisor supports another technique for preventing VMEXITs while managing nested virtualization data, called shadow VMCS. Shadow VMCS is a hardware optimization very similar to the enlightened VMCS.

Nested address translation

As previously discussed in the “Partitions’ physical address space” section, the hypervisor uses the SLAT for providing an isolated guest physical address space to a VM and for translating GPAs to real SPAs. Nested virtual machines would require another hardware layer of translation on top of the two already existing. To support nested virtualization, the new layer would need to translate L2 GPAs to L1 GPAs. Due to the increased complexity in the electronics needed to build a processor’s MMU that manages three layers of translations, the Hyper-V hypervisor adopted another strategy for providing the additional layer of address translation, called shadow nested page tables. Shadow nested page tables use a technique similar to shadow paging (see the previous section) for directly translating L2 GPAs to SPAs.

When a partition that supports nested virtualization is created, the L0 hypervisor allocates and initializes a nested page table shadowing domain. The data structure is used for storing a list of shadow nested page tables associated with the different L2 VMs created in the partition. Furthermore, it stores the partition’s active domain generation number (discussed later in this section) and nested memory statistics.

When the L0 hypervisor performs the initial VMENTRY for starting a L2 VM, it allocates the shadow nested page table associated with the VM and initializes it with empty values (the resulting physical address space is empty). When the L2 VM begins code execution, it immediately produces a VMEXIT to the L0 hypervisor due to a nested page fault (EPT violation in Intel hardware). The L0 hypervisor, instead of injecting the fault in the L1, walks the guest’s nested page tables built by the L1 hypervisor. If it finds a valid entry for the specified L2 GPA, it reads the corresponding L1 GPA, translates it to an SPA, and creates the needed shadow nested page table hierarchy to map it in the L2 VM. It then fills the leaf table entry with the valid SPA (the hypervisor uses large pages for mapping shadow nested pages) and resumes the execution directly to the L2 VM by setting the nested VMCS that describes it as active.

For the nested address translation to work correctly, the L0 hypervisor should be aware of any modifications that happen to the L1 nested page tables; otherwise, the L2 VM could run with stale entries. This implementation is platform specific; usually, hypervisors protect the L2 nested page tables as read-only, so that they can be informed when the L1 hypervisor modifies them. The Hyper-V hypervisor adopts another smart strategy, though. It guarantees that the shadow nested page table describing the L2 VM is always updated because of the following two premises:

  •     When the L1 hypervisor adds new entries in the L2 nested page table, it does not perform any other action for the nested VM (no intercepts are generated in the L0 hypervisor). An entry in the shadow nested page table is added only when a nested page fault causes a VMEXIT in the L0 hypervisor (the scenario described previously).

  •     As for non-nested VMs, when an entry in the nested page table is modified or deleted, the hypervisor should always emit a TLB flush to correctly invalidate the hardware TLB. In case of nested virtualization, when the L1 hypervisor emits a TLB flush, the L0 intercepts the request and completely invalidates the shadow nested page table. The L0 hypervisor maintains a virtual TLB concept thanks to the generation IDs stored in both the shadow VMCS and the nested page table shadowing domain. (Describing the virtual TLB architecture is outside the scope of the book.)

Completely invalidating the shadow nested page table for a single changed address seems redundant, but it’s dictated by the hardware support. (The INVEPT instruction on Intel hardware does not allow specifying which single GPA to remove from the TLB.) In classical VMs, this is not a problem because modifications of the physical address space don’t happen very often. When a classical VM is started, all its memory is already allocated. (The “Virtualization stack” section will provide more details.) This is not true for VA-backed VMs and VSM, though.

To improve performance in nonclassical nested VMs and VSM scenarios (see the next section for details), the hypervisor supports the “direct virtual flush” enlightenment, which provides the L1 hypervisor with two hypercalls to directly invalidate the TLB. In particular, the HvFlushGuestPhysicalAddressList hypercall (documented in the TLFS) allows the L1 hypervisor to invalidate a single entry in the shadow nested page table, removing the performance penalties associated with the flushing of the entire shadow nested page table and the multiple VMEXITs needed to reconstruct it.

The Windows hypervisor on ARM64

Unlike the x86 and AMD64 architectures, where hardware virtualization support was added long after their original design, the ARM64 architecture has been designed with hardware virtualization support. In particular, as shown in Figure 9-22, the ARM64 execution environment has been split into three different security domains (called exception levels). The EL determines the level of privilege; the higher the EL, the more privilege the executing code has. Although all the user mode applications run in EL0, the NT kernel (and kernel mode drivers) usually runs in EL1. In general, a piece of software runs only in a single exception level. The hypervisor (which, in ARM64, is also called the “virtual machine manager”) is an exception to this rule: it runs at EL2, the privilege level designed for running hypervisors, but it can provide virtualization services in the Nonsecure World both in EL2 and EL1. (EL2 does not exist in the Secure World. ARM TrustZone will be discussed later in this section.)

Image

Figure 9-22 The ARM64 execution environment.

Unlike the AMD64 architecture, where the CPU enters root mode (the execution domain in which the hypervisor runs) only from the kernel context and under certain assumptions, when a standard ARM64 device boots, the UEFI firmware and the boot manager begin their execution in EL2. On those devices, the hypervisor loader (or Secure Launcher, depending on the boot flow) is able to start the hypervisor directly and, at a later time, drop the exception level to EL1 (by emitting an exception return instruction, also known as ERET).

On top of the exception levels, TrustZone technology enables the system to be partitioned between two execution security states: secure and non-secure. Secure software can generally access both secure and non-secure memory and resources, whereas normal software can only access non-secure memory and resources. The non-secure state is also referred to as the Normal World. This enables an OS to run in parallel with a trusted OS on the same hardware and provides protection against certain software attacks and hardware attacks. The secure state, also referred to as the Secure World, usually runs secure devices (their firmware and IOMMU ranges) and, in general, everything that requires the processor to be in the secure state.

To correctly communicate with the Secure World, the non-secure OS emits secure monitor calls (SMCs), which provide a mechanism similar to standard OS syscalls. SMCs are managed by TrustZone. TrustZone usually provides separation between the Normal and the Secure Worlds through a thin memory protection layer, which is provided by well-defined hardware memory protection units (Qualcomm calls these XPUs). The XPUs are configured by the firmware to allow only specific execution environments to access specific memory locations. (Secure World memory can’t be accessed by Normal World software.)

In ARM64 server machines, Windows is able to directly start the hypervisor. Client machines often do not have XPUs, even though TrustZone is enabled. (The majority of the ARM64 client devices in which Windows can run are provided by Qualcomm.) In those client devices, the separation between the Secure and Normal Worlds is provided by a proprietary hypervisor, named QHEE, which provides memory isolation using stage-2 memory translation (this layer is the same as the SLAT layer used by the Windows hypervisor). QHEE intercepts each SMC emitted by the running OS: it can forward the SMC directly to TrustZone (after having verified the necessary access rights) or do some work on its behalf. In these devices, TrustZone also has the important responsibility to load and verify the authenticity of the machine firmware and coordinates with QHEE for correctly executing the Secure Launch boot method.

Although in Windows the Secure World is generally not used (a distinction between the Secure and Non-secure worlds is already provided by the hypervisor through VTLs), the Hyper-V hypervisor still runs in EL2. This is not compatible with the QHEE hypervisor, which runs in EL2, too. To solve the problem correctly, Windows adopts a particular boot strategy: the Secure Launch process is orchestrated with the aid of QHEE. When the Secure Launch terminates, the QHEE hypervisor unloads and gives up execution to the Windows hypervisor, which has been loaded as part of the Secure Launch. In later boot stages, after the Secure Kernel has been launched and the SMSS is creating the first user mode session, a new special trustlet is created (Qualcomm calls it “QcExt”). The trustlet acts as the original ARM64 hypervisor: it intercepts all the SMC requests, verifies their integrity, provides the needed memory isolation (through the services exposed by the Secure Kernel), and is able to send and receive commands to and from the Secure Monitor in EL3.

The SMC interception architecture is implemented in both the NT kernel and the ARM64 trustlet and is outside the scope of this book. The introduction of the new trustlet has allowed the majority of the client ARM64 machines to boot with Secure Launch and Virtual Secure Mode enabled by default. (VSM is discussed later in this chapter.)

The virtualization stack

Although the hypervisor provides isolation and the low-level services that manage the virtualization hardware, all the high-level implementation of virtual machines is provided by the virtualization stack. The virtualization stack manages the states of the VMs, provides memory to them, and virtualizes the hardware by providing a virtual motherboard, the system firmware, and multiple kinds of virtual devices (emulated, synthetic, and direct access). The virtualization stack also includes VMBus, an important component that provides a high-speed communication channel between a guest VM and the root partition and can be accessed through the kernel mode client library (KMCL) abstraction layer.

In this section, we discuss some important services provided by the virtualization stack and analyze its components. Figure 9-23 shows the main components of the virtualization stack.

Image

Figure 9-23 Components of the virtualization stack.

Virtual machine manager service and worker processes

The virtual machine manager service (Vmms.exe) is responsible for providing the Windows Management Instrumentation (WMI) interface to the root partition, which allows managing the child partitions through a Microsoft Management Console (MMC) plug-in or through PowerShell. The VMMS service manages the requests received through the WMI interface on behalf of a VM (identified internally through a GUID), like start, power off, shutdown, pause, resume, reboot, and so on. It controls settings such as which devices are visible to child partitions and how the memory and processor allocation for each partition is defined. The VMMS manages the addition and removal of devices. When a virtual machine is started, the VMM Service also has the crucial role of creating a corresponding Virtual Machine Worker Process (VMWP.exe). The VMMS manages the VM snapshots by redirecting the snapshot requests to the VMWP process if the VM is running or by taking the snapshot itself if it is not.

The VMWP performs various virtualization work that a typical monolithic hypervisor would perform (similar to the work of a software-based virtualization solution). This means managing the state machine for a given child partition (to allow support for features such as snapshots and state transitions), responding to various notifications coming in from the hypervisor, performing the emulation of certain devices exposed to child partitions (called emulated devices), and collaborating with the VM service and configuration component. The Worker process has the important role of starting the virtual motherboard and maintaining the state of each virtual device that belongs to the VM. It also includes components responsible for remote management of the virtualization stack, as well as an RDP component that allows using the remote desktop client to connect to any child partition and remotely view its user interface and interact with it. The VM Worker process exposes the COM objects that provide the interface used by the Vmms (and the VmCompute service) to communicate with the VMWP instance that represents a particular virtual machine.

The VM host compute service (implemented in the Vmcompute.exe and Vmcompute.dll binaries) is another important component that hosts most of the computation-intensive operations that are not implemented in the VM Manager Service. Operations like the analysis of a VM’s memory report (for dynamic memory), management of VHD and VHDX files, and creation of the base layers for containers are implemented in the VM host compute service. The Worker Process and Vmms can communicate with the host compute service thanks to the COM objects that it exposes.

The Virtual Machine Manager Service, the Worker Process, and the VM compute service are able to open and parse multiple configuration files that expose a list of all the virtual machines created in the system, and the configuration of each of them. In particular:

  •     The configuration repository stores the list of virtual machines installed in the system, their names, configuration files, and GUIDs in the data.vmcx file located in C:\ProgramData\Microsoft\Windows Hyper-V.

  •     The VM Data Store repository (part of the VM host compute service) is able to open, read, and write the configuration file (usually with “.vmcx” extension) of a VM, which contains the list of virtual devices and the virtual hardware’s configuration.

The VM data store repository is also used to read and write the VM Save State file. The VM Save State file is generated when a VM is paused and contains the saved state of the running VM, which can be restored at a later time (the state of the partition, the content of the VM’s memory, and the state of each virtual device). The configuration files are formatted using an XML representation of key/value pairs. The plain XML data is stored compressed using a proprietary binary format, which adds write-journal logic to make it resilient against power failures. Documenting the binary format is outside the scope of this book.

The VID driver and the virtualization stack memory manager

The Virtual Infrastructure Driver (VID.sys) is probably one of the most important components of the virtualization stack. It provides partition, memory, and processor management services for the virtual machines running in the child partitions, exposing them to the VM Worker process, which runs in the root partition. The VM Worker process and the VMMS service use the VID driver to communicate with the hypervisor, thanks to the interfaces implemented in the Windows hypervisor interface driver (WinHv.sys and WinHvr.sys), which the VID driver imports. These interfaces include all the code to support the hypervisor’s hypercall management and allow the operating system (or generic kernel mode drivers) to access the hypervisor using standard Windows API calls instead of hypercalls.

The VID driver also includes the virtualization stack memory manager. In the previous section, we described the hypervisor memory manager, which manages the physical and virtual memory of the hypervisor itself. The guest physical memory of a VM is allocated and managed by the virtualization stack’s memory manager. When a VM is started, the spawned VM Worker process (VMWP.exe) invokes the services of the memory manager (defined in the IMemoryManager COM interface) for constructing the guest VM’s RAM. Allocating memory for a VM is a two-step process:

  1. The VM Worker process obtains a report of the global system’s memory state (by using services from the Memory Balancer in the VMMS process), and, based on the available system memory, determines the size of the physical memory blocks to request from the VID driver (through the VID_RESERVE IOCTL; block sizes vary from 64 MB up to 4 GB). The blocks are allocated by the VID driver using MDL management functions (MmAllocatePartitionNodePagesForMdlEx in particular). For performance reasons, and to avoid memory fragmentation, the VID driver implements a best-effort algorithm to allocate huge and large physical pages (1 GB and 2 MB) before relying on standard small pages. After the memory blocks are allocated, their pages are deposited to an internal “reserve” bucket maintained by the VID driver. The bucket contains page lists ordered in an array based on their quality of service (QOS). The QOS is determined based on the page type (huge, large, or small) and the NUMA node the pages belong to (a sketch of this layout follows the list). This process is called “reserving physical memory” in the VID nomenclature (not to be confused with the term “reserving virtual memory,” a concept of the NT memory manager).

  2. From the virtualization stack perspective, physical memory commitment is the process of emptying the reserved pages in the bucket and moving them into a VID memory block (VSMM_MEMORY_BLOCK data structure), which is created and owned by the VM Worker process using the VID driver’s services. In the process of creating a memory block, the VID driver first deposits additional physical pages in the hypervisor (through the WinHvr driver and the HvDepositMemory hypercall). The additional pages are needed for creating the SLAT table page hierarchy of the VM. The VID driver then requests that the hypervisor map the physical pages describing the entire guest partition’s RAM. The hypervisor inserts valid entries in the SLAT table and sets their proper permissions; at this point, the guest physical address space of the partition has been created, and the GPA range is inserted in a list belonging to the VID partition. The VID memory block is owned by the VM Worker process; it is also used for tracking guest memory and for DAX file-backed memory blocks. (See Chapter 11, “Caching and file system support,” for more details about DAX volumes and PMEM.) The VM Worker process can later use the memory block for multiple purposes (for example, to access some pages while managing emulated devices).
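
The ordering of the reserve bucket can be pictured with a small C sketch. The real VID structures are not public, so every type and field name below is hypothetical; the sketch only models the grouping by page size and NUMA node described in step 1.

    /* Hypothetical model of the VID "reserve" bucket: the real structures
       are not public. Pages are grouped by QOS, that is, by page size and
       by the NUMA node they belong to. */
    #include <stdint.h>

    typedef enum _PAGE_QOS_SIZE {
        PageQosHuge  = 0,      /* 1 GB pages */
        PageQosLarge = 1,      /* 2 MB pages */
        PageQosSmall = 2,      /* 4 KB pages */
        PageQosMax   = 3
    } PAGE_QOS_SIZE;

    #define MAX_NUMA_NODES 64  /* arbitrary upper bound for the sketch */

    typedef struct _RESERVED_PAGE_LIST {
        uint64_t *PfnArray;    /* page frame numbers reserved at this QOS */
        uint64_t  PageCount;
    } RESERVED_PAGE_LIST;

    typedef struct _VID_RESERVE_BUCKET {
        /* One list per QOS, indexed by page size and NUMA node. When the
           block is later committed, the highest-quality pages available on
           the requested node are consumed first. */
        RESERVED_PAGE_LIST Lists[PageQosMax][MAX_NUMA_NODES];
    } VID_RESERVE_BUCKET;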

The birth of a Virtual Machine (VM)

The process of starting up a virtual machine is managed primarily by the VMMS and the VMWP process. When a request to start a VM (internally identified by a GUID) is delivered to the VMMS service (through PowerShell or the Hyper-V Manager GUI application), the VMMS service begins the startup process by reading the VM’s configuration from the data store repository, which includes the VM’s GUID and the list of all the virtual devices (VDEVs) comprising its virtual hardware. It then verifies that the path containing the VHD (or VHDX) representing the VM’s virtual hard disk has the correct access control list (ACL; more details are provided later). If the ACL is not correct, and the VM configuration allows it, the VMMS service (which runs under the SYSTEM account) rewrites a new one that is compatible with the new VMWP process instance. The VMMS uses COM services to communicate with the Host Compute Service to spawn a new VMWP process instance.

The Host Compute Service gets the path of the VM Worker process by querying its COM registration data located in the Windows registry (HKCU\CLSID\{f33463e0-7d59-11d9-9916-0008744f51f3} key). It then creates the new process using a well-defined access token, which is built using the virtual machine SID as the owner. Indeed, the NT Authority of the Windows security model defines a well-known subauthority value (83) to identify VMs (more information on system security components is available in Part 1, Chapter 7, “Security”). The Host Compute Service waits for the VMWP process to complete its initialization (so that the exposed COM interfaces become ready). The execution returns to the VMMS service, which can finally request the start of the VM from the VMWP process (through the exposed IVirtualMachine COM interface).
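
The well-known subauthority 83 is visible from user mode: the group SID S-1-5-83-0 corresponds to the “NT VIRTUAL MACHINE\Virtual Machines” account, and each per-VM SID appends further subauthorities derived from the VM’s GUID. The short user-mode sketch below, which relies only on documented Win32 APIs, resolves the group SID to its account name; it is just an illustration, not part of the virtualization stack.

    /* Resolve the well-known Virtual Machines group SID (S-1-5-83-0).
       Link with Advapi32.lib; illustration only. */
    #include <windows.h>
    #include <sddl.h>
    #include <stdio.h>

    int main(void)
    {
        PSID sid = NULL;
        WCHAR name[256], domain[256];
        DWORD cchName = 256, cchDomain = 256;
        SID_NAME_USE use;

        if (!ConvertStringSidToSidW(L"S-1-5-83-0", &sid))
            return 1;

        if (LookupAccountSidW(NULL, sid, name, &cchName,
                              domain, &cchDomain, &use)) {
            /* Prints: NT VIRTUAL MACHINE\Virtual Machines */
            wprintf(L"%s\\%s\n", domain, name);
        }

        LocalFree(sid);
        return 0;
    }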

As shown in Figure 9-24, the VM Worker process performs a “cold start” state transition for the VM. In the VM Worker process, the entire VM is managed through services exposed by the “Virtual Motherboard.” The Virtual Motherboard emulates an Intel i440BX motherboard on Generation 1 VMs, whereas on Generation 2, it emulates a proprietary motherboard. It manages and maintains the list of virtual devices and performs the state transitions for each of them. As covered in the next section, each virtual device is implemented as a COM object (exposing the IVirtualDevice interface) in a DLL. The Virtual Motherboard enumerates each virtual device from the VM’s configuration and loads the relative COM object representing the device.

Image

Figure 9-24 The VM Worker process and its interface for performing a “cold start” of a VM.

The VM Worker process begins the startup procedure by reserving the resources needed by each virtual device. It then constructs the VM guest physical address space (virtual RAM) by allocating physical memory from the root partition through the VID driver. At this stage, it can power up the virtual motherboard, which will cycle between each VDEV and power it up. The power-up procedure is different for each device: for example, synthetic devices usually communicate with their own Virtualization Service Provider (VSP) for the initial setup.

One virtual device that deserves a deeper discussion is the virtual BIOS (implemented in the Vmchipset.dll library). Its power-up method injects into the VM the initial firmware executed when the bootstrap VP is started. The BIOS VDEV extracts the correct firmware for the VM (legacy BIOS in the case of Generation 1 VMs; UEFI otherwise) from the resource section of its own backing library, builds the volatile configuration part of the firmware (like the ACPI and SRAT tables), and injects it in the proper guest physical memory by using services provided by the VID driver. The VID driver is indeed able to map memory ranges described by the VID memory block into user mode memory accessible by the VM Worker process (this procedure is internally called “memory aperture creation”).

After all the virtual devices have been successfully powered up, the VM Worker process can start the bootstrap virtual processor of the VM by sending a proper IOCTL to the VID driver, which will start the VP and its message pump (used for exchanging messages between the VID driver and the VM Worker process).

VMBus

VMBus is the mechanism exposed by the Hyper-V virtualization stack to provide interpartition communication between VMs. It is a virtual bus device that sets up channels between the guest and the host. These channels provide the capability to share data between partitions and set up paravirtualized (also known as synthetic) devices.

The root partition hosts Virtualization Service Providers (VSPs) that communicate over VMBus to handle device requests from child partitions. On the other end, child partitions (or guests) use Virtualization Service Consumers (VSCs) to redirect device requests to the VSP over VMBus. Child partitions require VMBus and VSC drivers to use the paravirtualized device stacks (more details on virtual hardware support are provided later in this chapter in the ”Virtual hardware support” section). VMBus channels allow VSCs and VSPs to transfer data primarily through two ring buffers: upstream and downstream. These ring buffers are mapped into both partitions thanks to the hypervisor, which, as discussed in the previous section, also provides interpartition communication services through the SynIC.

One of the first virtual devices (VDEVs) that the Worker process starts while powering up a VM is the VMBus VDEV (implemented in Vmbusvdev.dll). Its power-on routine connects the VM Worker process to the VMBus root driver (Vmbusr.sys) by sending the VMBUS_VDEV_SETUP IOCTL to the VMBus root device (named \Device\RootVmBus). The VMBus root driver orchestrates the parent endpoint of the bidirectional communication to the child VM. Its initial setup routine, which is invoked when the target VM is not yet powered on, has the important role of creating an XPartition data structure, which is used to represent the VMBus instance of the child VM and to connect the needed SynIC synthetic interrupt sources (also known as SINTs; see the “Synthetic Interrupt Controller” section earlier in this chapter for more details). In the root partition, VMBus uses two synthetic interrupt sources: one for the initial message handshaking (which happens before the channel is created) and another one for the synthetic events signaled by the ring buffers. Child partitions use only one SINT, though. The setup routine allocates the main message port in the child VM and the corresponding connection in the root, and, for each virtual processor belonging to the VM, allocates an event port and its connection (used for receiving synthetic events from the child VM).

The two synthetic interrupt sources are mapped using two ISRs, named KiVmbusInterrupt0 and KiVmbusInterrupt1. Thanks to these two routines, the root partition is ready to receive synthetic interrupts and messages from the child VM. When a message (or event) is received, the ISR queues a deferred procedure call (DPC), which checks whether the message is valid; if so, it queues a work item, which is processed later by the system at passive IRQL (this has further implications on the message queue).

Once VMBus in the root partition is ready, each VSP driver in the root can use the services exposed by the VMBus kernel mode client library to allocate and offer a VMBus channel to the child VM. The VMBus kernel mode client library (abbreviated as KMCL) represents a VMBus channel through an opaque KMODE_CLIENT_CONTEXT data structure, which is allocated and initialized at channel creation time (when a VSP calls the VmbChannelAllocate API). The root VSP then normally offers the channel to the child VM by calling the VmbChannelEnable API (in the child, this function instead establishes the actual connection to the root by opening the channel). KMCL is implemented in two drivers: one running in the root partition (Vmbkmclr.sys) and one loaded in child partitions (Vmbkmcl.sys).
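
A VSP typically allocates and offers its channel through the KMCL exports named above. The sketch below shows the root-partition side only; the GUIDs and the parent device object are placeholders, the channel-init callbacks that a real VSP registers are omitted, and the exact prototypes should be checked against the WDK headers.

    /* Minimal sketch of a VSP offering a VMBus channel through KMCL
       (Vmbkmclr.sys in the root). Callbacks and most error paths are
       omitted; see vmbuskernelmodeclientlibapi.h for the exact prototypes. */
    #include <ntddk.h>
    #include <vmbuskernelmodeclientlibapi.h>

    NTSTATUS OfferSampleChannel(_In_ PDEVICE_OBJECT ParentDeviceObject,
                                _In_ const GUID *InterfaceType,
                                _In_ const GUID *InterfaceInstance,
                                _Out_ VMBCHANNEL *Channel)
    {
        NTSTATUS status;

        /* Allocate the server (root partition) endpoint of the channel. */
        status = VmbChannelAllocate(ParentDeviceObject, TRUE, Channel);
        if (!NT_SUCCESS(status))
            return status;

        /* The type and instance GUIDs let the guest PnP manager load the
           matching VSC driver for this channel. */
        VmbChannelInitSetGuids(*Channel, InterfaceType, InterfaceInstance);

        /* Offer the channel to the child partition; the offer is delivered
           immediately or during the initial message handshake. */
        status = VmbChannelEnable(*Channel);
        if (!NT_SUCCESS(status))
            VmbChannelCleanup(*Channel);

        return status;
    }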

Offering a channel in the root is a relatively complex operation that involves the following steps:

  1. The KMCL driver communicates with the VMBus root driver through the file object initialized in the VDEV power-up routine. The VMBus driver obtains the XPartition data structure representing the child partition and starts the channel offering process.

  2. Lower-level services provided by the VMBus driver allocate and initialize a LOCAL_OFFER data structure representing a single “channel offer” and preallocate some SynIC predefined messages. VMBus then creates the synthetic event port in the root, from which the child can connect to signal events after writing data to the ring buffer. The LOCAL_OFFER data structure representing the offered channel is added to an internal server channels list.

  3. After VMBus has created the channel, it tries to send the OfferChannel message to the child with the goal of informing it of the new channel. However, at this stage, VMBus fails because the other end (the child VM) is not ready yet and has not started the initial message handshake.

After all the VSPs have completed the channel offering, and all the VDEVs have been powered up (see the previous section for details), the VM Worker process starts the VM. For channels to be completely initialized, and their relative connections to be started, the guest partition must load and start the VMBus child driver (Vmbus.sys).

Initial VMBus message handshaking

In Windows, the VMBus child driver is a WDF bus driver enumerated and started by the Pnp manager and located in the ACPI root enumerator. (Another version of the VMBus child driver is also available for Linux. VMBus for Linux is not covered in this book, though.) When the NT kernel starts in the child VM, the VMBus driver begins its execution by initializing its own internal state (which means allocating the needed data structure and work items) and by creating the \Device\VmBus root functional device object (FDO). The Pnp manager then calls the VMBus’s resource assignment handler routine. The latter configures the correct SINT source (by emitting a HvSetVpRegisters hypercall on one of the HvRegisterSint registers, with the help of the WinHv driver) and connects it to the KiVmbusInterrupt2 ISR. Furthermore, it obtains the SIMP page, used for sending and receiving synthetic messages to and from the root partition (see the “Synthetic Interrupt Controller” section earlier in this chapter for more details), and creates the XPartition data structure representing the parent (root) partition.

When the request to start the VMBus FDO comes from the PnP manager, the VMBus driver starts the initial message handshaking. At this stage, each message is sent by emitting the HvPostMessage hypercall (with the help of the WinHv driver), which allows the hypervisor to inject a synthetic interrupt into a target partition (in this case, the target is the root partition). The receiver acquires the message by simply reading from the SIMP page; the receiver signals that the message has been read from the queue by setting the new message type to MessageTypeNone. (See the hypervisor TLFS for more details.) The reader can think of the initial message handshake, which is represented in Figure 9-25, as a process divided in two phases.
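
The layout of a synthetic message and the consumption pattern just described come from the hypervisor TLFS; the sketch below reproduces them in C. The end-of-message (EOM) notification that is required when another message is pending is only hinted at in a comment.

    /* Synthetic message layout (from the hypervisor TLFS) and the consumption
       pattern used during the handshake: read the slot, then mark it free by
       writing HvMessageTypeNone. */
    #include <stdint.h>

    #define HV_MESSAGE_PAYLOAD_QWORD_COUNT 30      /* 240-byte payload */
    #define HvMessageTypeNone              0x00000000u

    typedef struct _HV_MESSAGE_HEADER {
        uint32_t MessageType;
        uint8_t  PayloadSize;
        uint8_t  MessageFlags;                     /* bit 0: message pending */
        uint8_t  Reserved[2];
        uint64_t Sender;                           /* partition or port ID */
    } HV_MESSAGE_HEADER;

    typedef struct _HV_MESSAGE {
        HV_MESSAGE_HEADER Header;
        uint64_t Payload[HV_MESSAGE_PAYLOAD_QWORD_COUNT];
    } HV_MESSAGE;                                  /* one 256-byte SIMP slot */

    /* Consume the message queued in the SINT slot of the SIMP page. */
    static int ConsumeSyntheticMessage(volatile HV_MESSAGE *slot, HV_MESSAGE *out)
    {
        if (slot->Header.MessageType == HvMessageTypeNone)
            return 0;                              /* nothing queued */

        *out = *(const HV_MESSAGE *)slot;

        /* Free the slot so the hypervisor can deliver the next message. If
           the message-pending flag was set, the recipient must also write
           the end-of-message (EOM) register; that part is omitted here. */
        slot->Header.MessageType = HvMessageTypeNone;
        return 1;
    }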

Image

Figure 9-25 VMBus initial message handshake.

The first phase is represented by the Initiate Contact message, which is delivered once in the lifetime of the VM. This message is sent from the child VM to the root with the goal to negotiate the VMBus protocol version supported by both sides. At the time of this writing, there are five main VMBus protocol versions, with some additional slight variations. The root partition parses the message, asks the hypervisor to map the monitor pages allocated by the client (if supported by the protocol), and replies by accepting the proposed protocol version. Note that if this is not the case (which happens when the Windows version running in the root partition is lower than the one running in the child VM), the child VM restarts the process by downgrading the VMBus protocol version until a compatible version is established. At this point, the child is ready to send the Request Offers message, which causes the root partition to send the list of all the channels already offered by the VSPs. This allows the child partition to open the channels later in the handshaking protocol.

Figure 9-25 highlights the different synthetic messages delivered through the hypervisor for setting up the VMBus channel or channels. The root partition walks the list of the offered channels located in the server channels list (the LOCAL_OFFER data structures discussed previously), and, for each of them, sends an Offer Channel message to the child VM. The message is the same as the one sent at the final stage of the channel offering protocol, which we discussed previously in the “VMBus” section. So, while the first phase of the initial message handshake happens only once in the lifetime of the VM, the second phase can start any time a channel is offered. The Offer Channel message includes important data used to uniquely identify the channel, like the channel type and instance GUIDs. For VDEV channels, these two GUIDs are used by the PnP manager to properly identify the associated virtual device.

The child responds to the message by allocating the client LOCAL_OFFER data structure representing the channel and the relative XInterrupt object, and by determining whether the channel requires a physical device object (PDO) to be created, which is almost always the case for VDEV channels. In this case, the VMBus driver creates an instance PDO representing the new channel. The created device is protected through a security descriptor that renders it accessible only to system and administrative accounts. The VMBus standard device interface, which is attached to the new PDO, maintains the association between the new VMBus channel (through the LOCAL_OFFER data structure) and the device object. After the PDO is created, the PnP manager is able to identify and load the correct VSC driver through the VDEV type and instance GUIDs included in the Offer Channel message. These interfaces become part of the new PDO and are visible through Device Manager. (See the following experiment for details.) When the VSC driver is then loaded, it usually calls the VmbChannelEnable API (exposed by KMCL, as discussed previously) to “open” the channel and create the final ring buffer.

Opening a VMBus channel and creating the ring buffer

To correctly start the interpartition communication and create the ring buffer, a channel must be opened. Usually VSCs, after having allocated the client side of the channel (still through VmbChannelAllocate), call the VmbChannelEnable API exported from the KMCL driver. As introduced in the previous section, in the child partition this API opens a VMBus channel that has already been offered by the root. The KMCL driver communicates with the VMBus driver, obtains the channel parameters (like the channel’s type, instance GUID, and used MMIO space), and creates a work item for the received packets. It then allocates the ring buffer, which is shown in Figure 9-26. The size of the ring buffer is usually specified by the VSC through a call to the KMCL-exported VmbClientChannelInitSetRingBufferPageCount API.
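
On the guest side, a VSC uses the same KMCL exports. The heavily simplified sketch below opens an already-offered channel with a 16-page ring buffer in each direction (matching the example in Figure 9-26); packet callbacks and most error handling are omitted, and the parameter order of the ring-buffer call is an assumption to be checked against the WDK headers.

    /* Minimal sketch of a VSC opening an offered channel through KMCL
       (Vmbkmcl.sys in the child). Callbacks, GPADL options, and most error
       handling are omitted; the ring-buffer parameter order is an assumption. */
    #include <ntddk.h>
    #include <vmbuskernelmodeclientlibapi.h>

    NTSTATUS OpenOfferedChannel(_In_ PDEVICE_OBJECT ChannelPdo,
                                _Out_ VMBCHANNEL *Channel)
    {
        NTSTATUS status;

        /* Allocate the client endpoint of the channel. */
        status = VmbChannelAllocate(ChannelPdo, FALSE, Channel);
        if (!NT_SUCCESS(status))
            return status;

        /* 16 incoming and 16 outgoing pages, as in the Figure 9-26 example. */
        VmbClientChannelInitSetRingBufferPageCount(*Channel, 16, 16);

        /* Open the channel: this triggers the GPADL transfer and the final
           Open Channel message described later in this section. */
        return VmbChannelEnable(*Channel);
    }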

Image

Figure 9-26 An example of a 16-page ring buffer allocated in the child partition.

The ring buffer is allocated from the child VM’s non-paged pool and is mapped through a memory descriptor list (MDL) using a technique called double mapping. (MDLs are described in Chapter 5 of Part 1.) In this technique, the allocated MDL describes twice the number of physical pages of the incoming (or outgoing) buffer. The PFN array of the MDL is filled by including the physical pages of the buffer twice: one time in the first half of the array and one time in the second half. This creates a “ring buffer.”

For example, in Figure 9-26, the incoming and outgoing buffers are 16 pages (0x10) large. The outgoing buffer is mapped at address 0xFFFFCA803D8C0000. If the sender writes a 1-KB VMBus packet at a position close to the end of the buffer, let’s say at offset 0xFF00, the write succeeds (no access violation exception is raised), but the data is written partially at the end of the buffer and partially at the beginning. In Figure 9-26, only 256 (0x100) bytes are written at the end of the buffer, whereas the remaining 768 (0x300) bytes are written at the start.
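
The arithmetic behind the example is a simple wraparound copy; the double mapping exists precisely so the sender can avoid splitting it. The sketch below shows the equivalent split copy a producer would need without the double mapping, using the same numbers as the example above.

    /* Wraparound write into a 16-page (0x10000-byte) ring buffer without the
       double-mapping trick: the copy must be split in two parts. */
    #include <stdint.h>
    #include <string.h>

    #define RING_SIZE 0x10000u                       /* 16 pages of 4 KB */

    static void RingWrite(uint8_t *ring, uint32_t offset,
                          const uint8_t *data, uint32_t size)
    {
        uint32_t tail = RING_SIZE - offset;          /* room before the end */

        if (size <= tail) {
            memcpy(ring + offset, data, size);       /* no wrap needed */
        } else {
            memcpy(ring + offset, data, tail);       /* end of the buffer */
            memcpy(ring, data + tail, size - tail);  /* rest wraps to the start */
        }
    }

    /* With the numbers above (offset 0xFF00, size 0x400), 0x100 bytes land at
       the end of the buffer and 0x300 bytes at the start. With the double
       mapping, memcpy(doubleMapped + 0xFF00, data, 0x400) achieves the same
       result in a single contiguous copy. */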

Both the incoming and outgoing buffers are surrounded by a control page. The page is shared between the two endpoints and composes the VM ring control block. This data structure is used to keep track of the position of the last packet written in the ring buffer. It furthermore contains some bits to control whether to send an interrupt when a packet needs to be delivered.

After the ring buffer has been created, the KMCL driver sends an IOCTL to VMBus, requesting the creation of a GPA descriptor list (GPADL). A GPADL is a data structure very similar to an MDL and is used for describing a chunk of physical memory. Unlike an MDL, though, the GPADL contains an array of guest physical addresses (GPAs), which are always expressed as 64-bit numbers, unlike the PFNs included in an MDL. The VMBus driver sends different messages to the root partition for transferring the entire GPADL describing both the incoming and outgoing ring buffers. (The maximum size of a synthetic message is 240 bytes, as discussed earlier.) The root partition reconstructs the entire GPADL and stores it in an internal list. The GPADL is mapped in the root when the child VM sends the final Open Channel message. The root VMBus driver parses the received GPADL and maps it in its own physical address space by using services provided by the VID driver (which maintains the list of memory block ranges that compose the VM physical address space).
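
The exact wire format of a GPADL is not documented here, but the open-source Linux VMBus client gives a good approximation of the idea: a byte count, a starting offset, and a flat array of 64-bit guest page numbers. The sketch below is modeled on that public definition and is meant only to contrast a GPADL with an MDL.

    /* Approximate shape of a GPA range carried inside a GPADL, modeled on the
       public Linux VMBus client definitions. Unlike an MDL, the page numbers
       are always 64-bit guest PFNs. */
    #include <stdint.h>

    typedef struct _GPA_RANGE {
        uint32_t ByteCount;      /* length of the described buffer in bytes */
        uint32_t ByteOffset;     /* offset of the data in the first page    */
        uint64_t PfnArray[];     /* guest page frame numbers (64-bit)       */
    } GPA_RANGE;

    /* A GPADL describing the two ring buffers is split across several
       synthetic messages (each carries at most 240 bytes of payload) and is
       reassembled by the VMBus root driver before the VID maps it. */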

At this stage, the channel is ready: the child and the root partition can communicate by simply reading or writing data to the ring buffer. When a sender finishes writing its data, it calls the VmbChannelSendSynchronousRequest API exposed by the KMCL driver. The API invokes VMBus services to signal an event in the monitor page of the XInterrupt object associated with the channel (old versions of the VMBus protocol used an interrupt page, which contained a bit corresponding to each channel). Alternatively, VMBus can signal an event directly in the channel’s event port; which method is used depends only on the required latency.

Other than VSCs, other components use VMBus to implement higher-level interfaces. Good examples are provided by the VMBus pipes, which are implemented in two kernel mode libraries (Vmbuspipe.dll and Vmbuspiper.dll) and rely on services exposed by the VMBus driver (through IOCTLs). Hyper-V Sockets (also known as HvSockets) allow high-speed interpartition communication using standard network interfaces (sockets). A client connects an AF_HYPERV socket type to a target VM by specifying the target VM’s GUID and a GUID of the Hyper-V socket’s service registration (to use HvSockets, both endpoints must be registered in the HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization\GuestCommunicationServices registry key) instead of the target IP address and port. Hyper-V Sockets are implemented in multiple drivers: HvSocket.sys is the transport driver, which exposes low-level services used by the socket infrastructure; HvSocketControl.sys is the provider control driver used to load the HvSocket provider in case the VMBus interface is not present in the system; HvSocket.dll is a library that exposes supplementary socket interfaces (tied to Hyper-V sockets) callable from user mode applications. Describing the internal infrastructure of both Hyper-V Sockets and VMBus pipes is outside the scope of this book, but both are documented in Microsoft Docs.
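
Connecting to a Hyper-V socket from user mode uses the standard Winsock API with the AF_HYPERV address family and the SOCKADDR_HV address structure (declared in hvsocket.h). The sketch below assumes a service GUID already registered under GuestCommunicationServices and a target VM GUID; both values are placeholders.

    /* Connecting to a Hyper-V socket (AF_HYPERV) from user mode. Both GUIDs
       are placeholders: serviceId must already be registered under the
       GuestCommunicationServices key, and vmId identifies the target VM.
       Link with Ws2_32.lib. */
    #include <winsock2.h>
    #include <hvsocket.h>

    SOCKET ConnectHvSocket(const GUID *vmId, const GUID *serviceId)
    {
        WSADATA wsa;
        SOCKADDR_HV addr;
        SOCKET s;

        if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
            return INVALID_SOCKET;

        s = socket(AF_HYPERV, SOCK_STREAM, HV_PROTOCOL_RAW);
        if (s == INVALID_SOCKET)
            return INVALID_SOCKET;

        ZeroMemory(&addr, sizeof(addr));
        addr.Family = AF_HYPERV;
        addr.VmId = *vmId;               /* target partition GUID */
        addr.ServiceId = *serviceId;     /* registered service GUID */

        if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) == SOCKET_ERROR) {
            closesocket(s);
            return INVALID_SOCKET;
        }

        /* The socket now behaves like any stream socket (send/recv). */
        return s;
    }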

Virtual hardware support

To properly run virtual machines, the virtualization stack needs to support virtualized devices. Hyper-V supports different kinds of virtual devices, which are implemented in multiple components of the virtualization stack. I/O to and from virtual devices is orchestrated mainly in the root OS. I/O includes storage, networking, keyboard, mouse, serial ports, and the GPU (graphics processing unit). The virtualization stack exposes three kinds of devices to the guest VMs:

  •     Emulated devices, also known—in industry-standard form—as fully virtualized devices

  •     Synthetic devices, also known as paravirtualized devices

  •     Hardware-accelerated devices, also known as direct-access devices

To perform I/O to physical devices, the processor usually reads and writes data from input and output ports (I/O ports), which belong to a device. The CPU can access I/O ports in two ways (a short sketch of both access methods follows the list):

  •     Through a separate I/O address space, which is distinct from the physical memory address space and, on AMD64 platforms, consists of 65,536 individually addressable I/O ports. This method is old and generally used for legacy devices.

  •     Through memory mapped I/O. Devices that respond like memory components can be accessed through the processor’s physical memory address space. This means that the CPU accesses memory through standard instructions: the underlying physical memory is mapped to a device.
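
The two access methods map to different mechanisms at the instruction level. The kernel-mode sketch below uses the documented HAL accessors; PortAddress and RegisterVa stand in for translated resources that a real driver would receive from the PnP manager.

    /* The two I/O access methods from a kernel-mode driver's point of view.
       PortAddress and RegisterVa are placeholders for resources obtained
       from the driver's translated resource list. */
    #include <ntddk.h>

    VOID AccessDeviceRegisters(_In_ PUCHAR PortAddress, _In_ PULONG RegisterVa)
    {
        UCHAR status;
        ULONG value;

        /* 1) Port-mapped I/O: separate I/O address space (IN/OUT on x86/x64). */
        status = READ_PORT_UCHAR(PortAddress);
        WRITE_PORT_UCHAR(PortAddress, 0x01);

        /* 2) Memory-mapped I/O: device registers mapped in the physical
           address space and accessed with ordinary loads and stores. */
        value = READ_REGISTER_ULONG(RegisterVa);
        WRITE_REGISTER_ULONG(RegisterVa, value | 1);

        /* In a guest VM, both kinds of access to an emulated device cause a
           VMEXIT that is handled as described in the following paragraphs. */
        UNREFERENCED_PARAMETER(status);
    }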

Figure 9-27 shows an example of an emulated device (the virtual IDE controller used in Generation 1 VMs), which uses memory-mapped I/O for transferring data to and from the virtual processor.

Image

Figure 9-27 The virtual IDE controller, which uses emulated I/O to perform data transfer.

In this model, every time the virtual processor reads or writes to the device MMIO space or emits instructions to access the I/O ports, it causes a VMEXIT to the hypervisor. The hypervisor calls the proper intercept routine, which is dispatched to the VID driver. The VID driver builds a VID message and enqueues it in an internal queue. The queue is drained by an internal VMWP’s thread, which waits and dispatches the VP’s messages received from the VID driver; this thread is called the message pump thread and belongs to an internal thread pool initialized at VMWP creation time. The VM Worker process identifies the physical address causing the VMEXIT, which is associated with the proper virtual device (VDEV), and calls into one of the VDEV callbacks (usually read or write callback). The VDEV code uses the services provided by the instruction emulator to execute the faulting instruction and properly emulate the virtual device (an IDE controller in the example).

Image Note

The full instruction emulator located in the VM Worker process is also used for other purposes, such as to speed up cases of intercept-intensive code in a child partition. The emulator in this case allows the execution context to stay in the Worker process between intercepts, as VMEXITs have serious performance overhead. Older versions of the hardware virtualization extensions prohibited executing real-mode code in a virtual machine; for those cases, the virtualization stack used the emulator for executing real-mode code in a VM.

Paravirtualized devices

While emulated devices always produce VMEXITs and are quite slow, Figure 9-28 shows an example of a synthetic or paravirtualized device: the synthetic storage adapter. Synthetic devices know that they are running in a virtualized environment; this reduces the complexity of the virtual device and allows it to achieve higher performance. Some synthetic virtual devices exist only in virtual form and don’t emulate any real physical hardware (an example is the synthetic RDP device).

Image

Figure 9-28 The storage controller paravirtualized device.

Paravirtualized devices generally require three main components:

  •     A virtualization service provider (VSP) driver runs in the root partition and exposes virtualization-specific interfaces to the guest thanks to the services provided by VMBus (see the previous section for details on VMBus).

  •     A synthetic VDEV is mapped in the VM Worker process and usually cooperates only in the start-up, teardown, save, and restore of the virtual device. It is generally not used during the regular work of the device. The synthetic VDEV initializes and allocates device-specific resources (in the example, the SynthStor VDEV initializes the virtual storage adapter), but most importantly allows the VSP to offer a VMBus communication channel to the guest VSC. The channel will be used for communication with the root and for signaling device-specific notifications via the hypervisor.

  •     A virtualization service consumer (VSC) driver runs in the child partition, understands the virtualization-specific interfaces exposed by the VSP, and reads/writes messages and notifications from the shared memory exposed through VMBus by the VSP. This allows the virtual device to run in the child VM faster than an emulated device.

Hardware-accelerated devices

On server SKUs, hardware-accelerated devices (also known as direct-access devices) allow physical devices to be remapped in the guest partition, thanks to the services exposed by the VPCI infrastructure. When a physical device supports technologies like single-root input/output virtualization (SR-IOV) or Discrete Device Assignment (DDA), it can be mapped to a guest partition. The guest partition can directly access the MMIO space associated with the device and can perform DMA to and from the guest memory directly, without any interception by the hypervisor. The IOMMU provides the needed security and ensures that the device can initiate DMA transfers only in the physical memory that belongs to the virtual machine.

Figure 9-29 shows the components responsible for managing the hardware-accelerated devices:

  •     The VPCI VDEV (Vpcievdev.dll) runs in the VM Worker process. Its role is to extract the list of hardware-accelerated devices from the VM configuration file, set up the VPCI virtual bus, and assign a device to the VSP.

  •     The PCI Proxy driver (Pcip.sys) is responsible for dismounting and mounting a DDA-compatible physical device from the root partition. Furthermore, it has the key role of obtaining the list of resources used by the device (through the SR-IOV protocol), like the MMIO space and interrupts. The proxy driver provides access to the physical configuration space of the device and renders an “unmounted” device inaccessible to the host OS.

  •     The VPCI virtual service provider (Vpcivsp.sys) creates and maintains the virtual bus object, which is associated with one or more hardware-accelerated devices (which the VPCI VSP calls virtual devices). The virtual devices are exposed to the guest VM through a VMBus channel created by the VSP and offered to the VSC in the guest partition.

  •     The VPCI virtual service client (Vpci.sys) is a WDF bus driver that runs in the guest VM. It connects to the VMBus channel exposed by the VSP, receives the list of the direct access devices exposed to the VM and their resources, and creates a PDO (physical device object) for each of them. Device drivers can then attach to the created PDOs in the same way as they do in nonvirtualized environments.

Image

Figure 9-29 Hardware-accelerated devices.

When a user wants to map a hardware-accelerated device to a VM, they use some PowerShell commands (see the following experiment for further details), which start by “unmounting” the device from the root partition. This action forces the VMMS service to communicate with the standard PCI driver (through its exposed device, called PciControl). The VMMS service sends a PCIDRIVE_ADD_VMPROXYPATH IOCTL to the PCI driver by providing the device descriptor (in form of bus, device, and function ID). The PCI driver checks the descriptor, and, if the verification succeeds, adds it to the HKLM\System\CurrentControlSet\Control\PnP\Pci\VmProxy registry value. The VMMS then starts a PnP device (re)enumeration by using services exposed by the PnP manager. In the enumeration phase, the PCI driver finds the new proxy device and loads the PCI proxy driver (Pcip.sys), which marks the device as reserved for the virtualization stack and renders it invisible to the host operating system.

The second step requires assigning the device to a VM. In this case, the VMMS writes the device descriptor in the VM configuration file. When the VM is started, the VPCI VDEV (Vpcievdev.dll) reads the direct-access device’s descriptor from the VM configuration and starts a complex configuration phase that is orchestrated mainly by the VPCI VSP (Vpcivsp.sys). Indeed, in its “power on” callback, the VPCI VDEV sends different IOCTLs to the VPCI VSP (which runs in the root partition), with the goal of creating the virtual bus and assigning hardware-accelerated devices to the guest VM.

A “virtual bus” is a data structure used by the VPCI infrastructure as a “glue” to maintain the connection between the root partition, the guest VM, and the direct-access devices assigned to it. The VPCI VSP allocates and starts the VMBus channel offered to the guest VM and encapsulates it in the virtual bus. Furthermore, the virtual bus includes some pointers to important data structures, like some allocated VMBus packets used for the bidirectional communication, the guest power state, and so on. After the virtual bus is created, the VPCI VSP performs the device assignment.

A hardware-accelerated device is internally identified by a LUID and is represented by a virtual device object, which is allocated by the VPCI VSP. Based on the device’s LUID, the VPCI VSP locates the proper proxy driver (also known as the Mux driver; it’s usually Pcip.sys). The VPCI VSP queries the SR-IOV or DDA interfaces from the proxy driver and uses them to obtain the Plug and Play information (hardware descriptor) of the direct-access device and to collect the resource requirements (MMIO space, BAR registers, and DMA channels). At this point, the device is ready to be attached to the guest VM: the VPCI VSP uses the services exposed by the WinHvr driver to emit the HvAttachDevice hypercall to the hypervisor, which reconfigures the system IOMMU for mapping the device’s address space in the guest partition.

The guest VM is aware of the mapped device thanks to the VPCI VSC (Vpci.sys). The VPCI VSC is a WDF bus driver enumerated and launched by the VMBus bus driver located in the guest VM. It is composed of two main components: an FDO (functional device object) created at VM boot time, and one or more PDOs (physical device objects) representing the physical direct-access devices remapped in the guest VM. When the VPCI VSC bus driver is executed in the guest VM, it creates and starts the client part of the VMBus channel used to exchange messages with the VSP. “Send bus relations” is the first message sent by the VPCI VSC through the VMBus channel. The VSP in the root partition responds by sending the list of hardware IDs describing the hardware-accelerated devices currently attached to the VM. When the PnP manager requests the new device relations from the VPCI VSC, the latter creates a new PDO for each discovered direct-access device. The VSC driver sends another message to the VSP with the goal of requesting the resources used by the PDO.

After the initial setup is done, the VSC and VSP are rarely involved in the device management. The specific hardware-accelerated device’s driver in the guest VM attaches to the relative PDO and manages the peripheral as if it had been installed on a physical machine.

VA-backed virtual machines

Virtual machines are being used for multiple purposes. One of them is to properly run traditional software in isolated environments, called containers. (Server and application silos, which are two types of containers, have been introduced in Part 1, Chapter 3, “Processes and jobs.”) Fully isolated containers (internally named Xenon and Krypton) require fast startup, low overhead, and the lowest possible memory footprint. Guest physical memory of this type of VM is generally shared between multiple containers. Good examples of containers are provided by Windows Defender Application Guard, which uses a container to provide full isolation of the browser, or by Windows Sandbox, which uses containers to provide a fully isolated virtual environment. Usually a container shares the same VM’s firmware, operating system, and, often, also some applications running in it (the shared components compose the base layer of a container). Running each container in its private guest physical memory space would not be feasible and would result in a significant waste of physical memory.

To solve the problem, the virtualization stack provides support for VA-backed virtual machines. VA-backed VMs use the host operating system’s memory manager to provide advanced features for the guest partition’s physical memory, like memory deduplication, memory trimming, direct maps, memory cloning and, most important, paging (all these concepts have been extensively covered in Chapter 5 of Part 1). For traditional VMs, guest memory is assigned by the VID driver by statically allocating system physical pages from the host and mapping them in the GPA space of the VM before any virtual processor has the chance to execute, but for VA-backed VMs, a new layer of indirection is added between the GPA space and the SPA space. Instead of mapping SPA pages directly into the GPA space, the VID creates a GPA space that is initially blank, creates a user mode minimal process (called VMMEM) for hosting a VA space, and sets up GPA-to-VA mappings using MicroVM. MicroVM is a new component of the NT kernel, tightly integrated with the NT memory manager, that is ultimately responsible for managing the GPA-to-SPA mapping by composing the GPA-to-VA mapping (maintained by the VID) with the VA-to-SPA mapping (maintained by the NT memory manager).

The new layer of indirection allows VA-backed VMs to take advantage of most memory management features that are exposed to Windows processes. As discussed in the previous section, the VM Worker process, when it starts the VM, asks the VID driver to create the partition’s memory block. If the VM is VA-backed, the VID creates the Memory Block Range GPA mapping bitmap, which is used to keep track of the allocated virtual pages backing the new VM’s RAM. It then creates the partition’s RAM memory, backed by a big range of VA space. The VA space is usually as big as the allocated amount of VM RAM (note that this is not a necessary condition: different VA ranges can be mapped as different GPA ranges) and is reserved in the context of the VMMEM process using the native NtAllocateVirtualMemory API.

If the “deferred commit” optimization is not enabled (see the next section for more details), the VID driver performs another call to the NtAllocateVirtualMemory API with the goal of committing the entire VA range. As discussed in Chapter 5 of Part 1, committing memory charges the system commit limit but still doesn’t allocate any physical page (all the PTE entries describing the entire range are invalid demand-zero PTEs). The VID driver at this stage uses WinHvr to ask the hypervisor to map the entire partition’s GPA space to a special invalid SPA (by using the same HvMapGpaPages hypercall used for standard partitions). When the guest partition accesses guest physical memory that is mapped in the SLAT table by the special invalid SPA, it causes a VMEXIT to the hypervisor, which recognizes the special value and injects a memory intercept into the root partition.
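
The reserve-then-commit pattern applied to the VMMEM VA range can be illustrated from user mode with the documented VirtualAlloc wrapper around NtAllocateVirtualMemory; the sketch below only demonstrates the two steps and is not the VID code path. The size is arbitrary.

    /* Illustration of the reserve-then-commit pattern described above, using
       the documented VirtualAlloc wrapper around NtAllocateVirtualMemory.
       The size is arbitrary; this is not the VID code path. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SIZE_T ramSize = (SIZE_T)512 * 1024 * 1024;   /* pretend 512 MB of RAM */

        /* Step 1: reserve the VA range. No commit charge, no physical pages. */
        void *base = VirtualAlloc(NULL, ramSize, MEM_RESERVE, PAGE_NOACCESS);
        if (base == NULL)
            return 1;

        /* Step 2 (skipped when deferred commit is enabled): commit the range.
           This charges the system commit limit, but physical pages are still
           materialized only on first access, through demand-zero faults. */
        if (VirtualAlloc(base, ramSize, MEM_COMMIT, PAGE_READWRITE) == NULL) {
            VirtualFree(base, 0, MEM_RELEASE);
            return 1;
        }

        printf("Backing VA range reserved and committed at %p\n", base);
        VirtualFree(base, 0, MEM_RELEASE);
        return 0;
    }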

The VID driver finally notifies MicroVM of the new VA-backed GPA range by invoking the VmCreateMemoryRange routine (MicroVM services are exposed by the NT kernel to the VID driver through a Kernel Extension). MicroVM allocates and initializes a VM_PROCESS_CONTEXT data structure, which contains two important RB trees: one describing the allocated GPA ranges in the VM and one describing the corresponding system virtual address (SVA) ranges in the root partition. A pointer to the allocated data structure is then stored in the EPROCESS of the VMMEM instance.
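
A rough picture of the MicroVM per-VM context follows; the real VM_PROCESS_CONTEXT layout is not public, so the structure and field names below are hypothetical, and the red-black tree type is shown as an opaque placeholder.

    /* Hypothetical sketch of the MicroVM per-VM context described above; the
       real VM_PROCESS_CONTEXT layout is not public. */
    #include <stdint.h>

    typedef struct _RB_TREE_PLACEHOLDER {
        void *Root;                     /* stands in for the NT red-black tree */
        void *Min;
    } RB_TREE_PLACEHOLDER;

    typedef struct _VM_PROCESS_CONTEXT_SKETCH {
        RB_TREE_PLACEHOLDER GpaRanges;  /* GPA ranges allocated in the child VM  */
        RB_TREE_PLACEHOLDER SvaRanges;  /* corresponding system VA ranges (root) */
        /* locks, reference counts, and other bookkeeping omitted */
    } VM_PROCESS_CONTEXT_SKETCH;

    /* A pointer to this context is stored in the EPROCESS of the VMMEM
       instance, so the MicroVM page fault handler can find it quickly. */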

When the VM Worker process wants to write into the memory of the VA-backed VM, or when a memory intercept is generated due to an invalid GPA-to-SPA translation, the VID driver calls into the MicroVM page fault handler (VmAccessFault). The handler performs two important operations: first, it resolves the fault by inserting a valid PTE in the page table describing the faulting virtual page (more details in Chapter 5 of Part 1), and then it updates the SLAT table of the child VM (by calling the WinHvr driver, which emits another HvMapGpaPages hypercall). Afterward, the VM’s guest physical pages can be paged out simply because private process memory is normally pageable. This has the important implication that the majority of MicroVM’s functions must operate at passive IRQL.

Multiple services of the NT memory manager can be used for VA-backed VMs. In particular, clone templates allow the memory of two different VA-backed VMs to be quickly cloned; direct map allows shared executable images or data files to have their section objects mapped into the VMMEM process and into a GPA range pointing to that VA region. The underlying physical pages can be shared between different VMs and host processes, leading to improved memory density.

VA-backed VMs optimizations

As introduced in the previous section, the cost of a guest access to dynamically backed memory that isn’t currently backed, or does not grant the required permissions, can be quite expensive: when a guest access attempt is made to inaccessible memory, a VMEXIT occurs, which requires the hypervisor to suspend the guest VP, schedule the root partition’s VP, and inject a memory intercept message to it. The VID’s intercept callback handler is invoked at high IRQL, but processing the request and calling into MicroVM requires running at PASSIVE_LEVEL. Thus, a DPC is queued. The DPC routine sets an event that wakes up the appropriate thread in charge of processing the intercept. After the MicroVM page fault handler has resolved the fault and called the hypervisor to update the SLAT entry (through another hypercall, which produces another VMEXIT), it resumes the guest’s VP.

Large numbers of memory intercepts generated at runtime result in big performance penalties. To avoid them, multiple optimizations have been implemented in the form of guest enlightenments (or simple configurations):

  •     Memory zeroing enlightenments

  •     Memory access hints

  •     Enlightened page fault

  •     Deferred commit and other optimizations

Memory-zeroing enlightenments

To avoid disclosing to a VM memory artifacts previously in use by the root partition or another VM, the memory backing guest RAM is zeroed before being mapped for access by the guest. Typically, an operating system zeroes all physical memory during boot because on a physical system the contents are nondeterministic. For a VM, this means that memory may be zeroed twice: once by the virtualization host and again by the guest operating system. For physically backed VMs, this is at best a waste of CPU cycles. For VA-backed VMs, the zeroing by the guest OS generates costly memory intercepts. To avoid the wasted intercepts, the hypervisor exposes the memory-zeroing enlightenments.

When the Windows Loader loads the main operating system, it uses services provided by the UEFI firmware to get the machine’s physical memory map. When the hypervisor starts a VA-backed VM, it exposes the HvGetBootZeroedMemory hypercall, which the Windows Loader can use to query the list of physical memory ranges that are actually already zeroed. Before transferring the execution to the NT kernel, the Windows Loader merges the obtained zeroed ranges with the list of physical memory descriptors obtained through EFI services and stored in the Loader block (further details on startup mechanisms are available in Chapter 12). The NT kernel inserts the merged descriptor directly in the zeroed pages list by skipping the initial memory zeroing.

In a similar way, the hypervisor supports the hot-add memory zeroing enlightenment with a simple implementation: When the dynamic memory VSC driver (dmvsc.sys) initiates the request to add physical memory to the NT kernel, it specifies the MM_ADD_PHYSICAL_MEMORY_ALREADY_ZEROED flag, which hints the Memory Manager (MM) to add the new pages directly to the zeroed pages list.

Memory access hints

For physically backed VMs, the root partition has very limited information about how guest MM intends to use its physical pages. For these VMs, the information is mostly irrelevant because almost all memory and GPA mappings are created when the VM is started, and they remain statically mapped. For VA-backed VMs, this information can instead be very useful because the host memory manager manages the working set of the minimal process that contains the VM’s memory (VMMEM).

The hot hint allows the guest to indicate that a set of physical pages should be mapped into the guest because they will be accessed soon or frequently. This implies that the pages are added to the working set of the minimal process. The VID handles the hint by telling MicroVM to fault in the physical pages immediately and not to remove them from the VMMEM process’s working set.

In a similar way, the cold hint allows the guest to indicate that a set of physical pages should be unmapped from the guest because it will not be used soon. The VID driver handles the hint by forwarding it to MicroVM, which immediately removes the pages from the working set. Typically, the guest uses the cold hint for pages that have been zeroed by the background zero page thread (see Chapter 5 of Part 1 for more details).

The VA-backed guest partition specifies a memory hint for a page by using the HvMemoryHeatHint hypercall.

Enlightened page fault (EPF)

Enlightened page fault (EPF) handling is a feature that allows the VA-backed guest partition to reschedule threads on a VP that caused a memory intercept for a VA-backed GPA page. Normally, a memory intercept for such a page is handled by synchronously resolving the access fault in the root partition and resuming the VP upon access fault completion. When EPF is enabled and a memory intercept occurs for a VA-backed GPA page, the VID driver in the root partition creates a background worker thread that calls the MicroVM page fault handler and delivers a synchronous exception (not to be confused with an asynchronous interrupt) to the guest’s VP, with the goal of letting it know that the current thread caused a memory intercept.

The guest reschedules the thread; meanwhile, the host is handling the access fault. Once the access fault has been completed, the VID driver will add the original faulting GPA to a completion queue and deliver an asynchronous interrupt to the guest. The interrupt causes the guest to check the completion queue and unblock any threads that were waiting on EPF completion.

Deferred commit and other optimizations

Deferred commit is an optimization that, if enabled, forces the VID driver not to commit each backing page until first access. This potentially allows more VMs to run simultaneously without increasing the size of the page file, but, because the backing VA space is only reserved and not committed, the VMs may crash at runtime if the commit limit is reached in the root partition and no more memory can be committed.

Other optimizations are available to set the size of the pages which will be allocated by the MicroVM page fault handler (small versus large) and to pin the backing pages upon first access. This prevents aging and trimming, generally resulting in more consistent performance, but consumes more memory and reduces the memory density.

The VMMEM process

The VMMEM process exists for two main reasons:

  •     Hosts the VP-dispatch thread loop when the root scheduler is enabled, which represents the guest VP schedulable unit

  •     Hosts the VA space for the VA-backed VMs

The VMMEM process is created by the VID driver while creating the VM’s partition. As for regular partitions (see the previous section for details), the VM Worker process initializes the VM setup through the VID.dll library, which calls into the VID through an IOCTL. If the VID driver detects that the new partition is VA-backed, it calls into the MicroVM (through the VsmmNtSlatMemoryProcessCreate function) to create the minimal process. MicroVM uses the PsCreateMinimalProcess function, which allocates the process, creates its address space, and inserts the process into the process list. It then reserves the bottom 4 GB of address space to ensure that no direct-mapped images end up there (this can reduce the entropy and security for the guest). The VID driver applies a specific security descriptor to the new VMMEM process; only the SYSTEM and the VM Worker process can access it. (The VM Worker process is launched with a specific token; the token’s owner is set to a SID generated from the VM’s unique GUID.) This is important because the virtual address space of the VMMEM process could have been accessible to anyone otherwise. By reading the process virtual memory, a malicious user could read the VM private guest physical memory.

Virtualization-based security (VBS)

As discussed in the previous section, Hyper-V provides the services needed for managing and running virtual machines on Windows systems. The hypervisor guarantees the necessary isolation between each partition. In this way, a virtual machine can’t interfere with the execution of another one. In this section, we describe another important component of the Windows virtualization infrastructure: the Secure Kernel, which provides the basic services for the virtualization-based security.

First, we list the services provided by the Secure Kernel and its requirements, and then we describe its architecture and basic components. Furthermore, we present some of its basic internal data structures. Then we discuss the Secure Kernel and Virtual Secure Mode startup method, describing its high dependency on the hypervisor. We conclude by analyzing the components that are built on the top of Secure Kernel, like the Isolated User Mode, Hypervisor Enforced Code Integrity, the secure software enclaves, secure devices, and Windows kernel hot-patching and microcode services.

Virtual trust levels (VTLs) and Virtual Secure Mode (VSM)

As discussed in the previous section, the hypervisor uses the SLAT to maintain each partition in its own memory space. The operating system that runs in a partition accesses memory in the standard way (guest virtual addresses are translated into guest physical addresses by using page tables). Under the covers, the hardware translates all the partition GPAs to real SPAs and then performs the actual memory access. This last translation layer is maintained by the hypervisor, which uses a separate SLAT table per partition. In a similar way, the hypervisor can use SLAT to create different security domains in a single partition. Thanks to this feature, Microsoft designed the Secure Kernel, which is the base of Virtual Secure Mode.

Traditionally, the operating system has had a single physical address space, and the software running at ring 0 (that is, kernel mode) could have access to any physical memory address. Thus, if any software running in supervisor mode (kernel, drivers, and so on) becomes compromised, the entire system becomes compromised too. Virtual secure mode leverages the hypervisor to provide new trust boundaries for systems software. With VSM, security boundaries (described by the hypervisor using SLAT) can be put in place that limit the resources supervisor mode code can access. Thus, with VSM, even if supervisor mode code is compromised, the entire system is not compromised.

VSM provides these boundaries through the concept of virtual trust levels (VTLs). At its core, a VTL is a set of access protections on physical memory. Each VTL can have a different set of access protections. In this way, VTLs can be used to provide memory isolation. A VTL’s memory access protections can be configured to limit what physical memory a VTL can access. With VSM, a virtual processor is always running at a particular VTL and can access only physical memory that is marked as accessible through the hypervisor SLAT. For example, if a processor is running at VTL 0, it can only access memory as controlled by the memory access protections associated with VTL 0. This memory access enforcement happens at the guest physical memory translation level and thus cannot be changed by supervisor mode code in the partition.

VTLs are organized as a hierarchy. Higher levels are more privileged than lower levels, and higher levels can adjust the memory access protections for lower levels. Thus, software running at VTL 1 can adjust the memory access protections of VTL 0 to limit what memory VTL 0 can access. This allows software at VTL 1 to hide (isolate) memory from VTL 0. This is an important concept that is the basis of the VSM. Currently the hypervisor supports only two VTLs: VTL 0 represents the normal OS execution environment, which the user interacts with; VTL 1 represents the Secure Mode, where the Secure Kernel and Isolated User Mode (IUM) run. Because VTL 0 is the environment in which the standard operating system and applications run, it is often referred to as the normal mode.

Image Note

The VSM architecture was initially designed to support a maximum of 16 VTLs. At the time of this writing, only 2 VTLs are supported by the hypervisor. In the future, it could be possible that Microsoft will add one or more new VTLs. For example, latest versions of Windows Server running in Azure also support Confidential VMs, which run their Host Compatibility Layer (HCL) in VTL 2.

Each VTL has the following characteristics associated with it:

  •     Memory access protection As already discussed, each virtual trust level has a set of guest physical memory access protections, which defines how the software can access memory.

  •     Virtual processor state A virtual processor in the hypervisor shares some registers with each VTL, whereas some other registers are private to each VTL. The private virtual processor state for a VTL cannot be accessed by software running at a lower VTL. This allows for isolation of the processor state between VTLs.

  •     Interrupt subsystem Each VTL has a unique interrupt subsystem (managed by the hypervisor synthetic interrupt controller). A VTL’s interrupt subsystem cannot be accessed by software running at a lower VTL. This allows for interrupts to be managed securely at a particular VTL without risk of a lower VTL generating unexpected interrupts or masking interrupts.

Figure 9-30 shows a scheme of the memory protection provided by the hypervisor to the Virtual Secure Mode. The hypervisor represents each VTL of the virtual processor through a different VMCS data structure (see the previous section for more details), which includes a specific SLAT table. In this way, software that runs in a particular VTL can access just the physical memory pages assigned to its level. The important concept is that the SLAT protection is applied to the physical pages and not to the virtual pages, which are protected by the standard page tables.

Image

Figure 9-30 Scheme of the memory protection architecture provided by the hypervisor to VSM.

Services provided by the VSM and requirements

Virtual Secure Mode, which is built on the top of the hypervisor, provides the following services to the Windows ecosystem:

  •     Isolation IUM provides a hardware-based isolated environment for software that runs in VTL 1. Secure devices managed by the Secure Kernel are isolated from the rest of the system and run in VTL 1 user mode. Software that runs in VTL 1 usually stores secrets that can’t be intercepted or revealed in VTL 0. This service is used heavily by Credential Guard. Credential Guard is the feature that stores all the system credentials in the memory address space of the LsaIso trustlet, which runs in VTL 1 user mode.

  •     Control over VTL 0 The Hypervisor Enforced Code Integrity (HVCI) checks the integrity and the signing of each module that the normal OS loads and runs. The integrity check is done entirely in VTL 1 (which has access to all the VTL 0 physical memory). No VTL 0 software can interfere with the signing check. Furthermore, HVCI guarantees that all the normal mode memory pages that contain executable code are marked as not writable (this feature is called W^X; both HVCI and W^X have been discussed in Chapter 7 of Part 1).

  •     Secure intercepts VSM provides a mechanism to allow a higher VTL to lock down critical system resources and prevent access to them by lower VTLs. Secure intercepts are used extensively by HyperGuard, which provides another protection layer for the VTL 0 kernel by stopping malicious modifications of critical components of the operating systems.

  •     VBS-based enclaves A security enclave is an isolated region of memory within the address space of a user mode process. The enclave memory region is not accessible even to higher privilege levels. The original implementation of this technology used hardware facilities to properly encrypt memory belonging to a process. A VBS-based enclave is a secure enclave whose isolation guarantees are provided using VSM.

  •     Kernel Control Flow Guard VSM, when HVCI is enabled, provides Control Flow Guard (CFG) to each kernel module loaded in the normal world (and to the NT kernel itself). Kernel mode software running in the normal world has read-only access to the bitmap, so an exploit can’t modify it. For this reason, kernel CFG in Windows is also known as Secure Kernel CFG (SKCFG).

Image Note

CFG is the Microsoft implementation of Control Flow Integrity, a technique that prevents a wide variety of malicious attacks from redirecting the flow of execution of a program. Both user mode and kernel mode CFG have been discussed extensively in Chapter 7 of Part 1.

  •     Secure devices Secure devices are a new kind of device that is mapped and managed entirely by the Secure Kernel in VTL 1. Drivers for these kinds of devices work entirely in VTL 1 user mode and use services provided by the Secure Kernel to map the device I/O space.

To be properly enabled and work correctly, VSM has some hardware requirements. The host system must support virtualization extensions (Intel VT-x, AMD SVM, or ARM TrustZone) and SLAT. VSM won’t work if any of these hardware features is missing from the system processor. Some other hardware features are not strictly necessary, but if they are not present, some of the security guarantees of VSM cannot be ensured:

  •     An IOMMU is needed to protect against physical device DMA attacks. If the system processors don’t have an IOMMU, VSM can still work but is vulnerable to these physical device attacks.

  •     A UEFI BIOS with Secure Boot enabled is needed for protecting the boot chain that leads to the startup of the hypervisor and the Secure Kernel. If Secure Boot is not enabled, the system is vulnerable to boot attacks, which can modify the integrity of the hypervisor and Secure Kernel before they have a chance to execute.

Some other components are optional, but when they’re present they increase the overall security and responsiveness of the system. The TPM presence is a good example. It is used by the Secure Kernel to store the Master Encryption key and to perform Secure Launch (also known as DRTM; see Chapter 12 for more details). Another hardware component that can improve VSM responsiveness is the processor’s Mode-Based Execute Control (MBEC) hardware support: MBEC is used when HVCI is enabled to protect the execution state of user mode pages in kernel mode. With hardware MBEC, the hypervisor can set the executable state of a physical memory page based on the CPL (kernel or user) domain of the specific VTL. In this way, memory that belongs to user mode applications can be marked executable only for user mode code (kernel exploits can no longer execute their own code located in the memory of a user mode application). If hardware MBEC is not present, the hypervisor needs to emulate it by using two different SLAT tables for VTL 0 and switching them when the code execution changes the CPL security domain (going from user mode to kernel mode and vice versa produces a VMEXIT in this case). More details on HVCI have been discussed in Chapter 7 of Part 1.
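As a rough illustration of the emulation path just described, the following C sketch (purely hypothetical names and structures) models the idea of keeping two SLAT views for VTL 0, one per CPL security domain, and swapping the active one whenever the guest transitions between user mode and kernel mode.

#include <stdio.h>

/* Hypothetical model: without hardware MBEC, the hypervisor keeps two
   SLAT tables for VTL 0 and swaps the active one on CPL transitions. */

enum cpl_domain { CPL_KERNEL = 0, CPL_USER = 3 };

struct slat_table {
    const char *name;
    /* ...a real table maps guest-physical to host-physical pages, with
       execute permissions that depend on the CPL security domain... */
};

static struct slat_table vtl0_kernel_slat = { "VTL0-kernel (user pages NX)" };
static struct slat_table vtl0_user_slat   = { "VTL0-user (user pages X)"   };

static const struct slat_table *active_slat = &vtl0_kernel_slat;

/* Emulated VMEXIT handler: invoked when the guest changes CPL domain. */
static void on_cpl_change(enum cpl_domain new_cpl)
{
    active_slat = (new_cpl == CPL_USER) ? &vtl0_user_slat : &vtl0_kernel_slat;
    printf("CPL -> %d, active SLAT: %s\n", new_cpl, active_slat->name);
}

int main(void)
{
    on_cpl_change(CPL_USER);    /* e.g., SYSRET back to user mode  */
    on_cpl_change(CPL_KERNEL);  /* e.g., SYSCALL into kernel mode  */
    return 0;
}

With hardware MBEC no table swap (and therefore no VMEXIT) is needed, because the execute permission is evaluated per CPL domain directly by the processor.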

The Secure Kernel

The Secure Kernel is implemented mainly in the securekernel.exe file and is launched by the Windows Loader after the hypervisor has already been successfully started. As shown in Figure 9-31, the Secure Kernel is a minimal OS that works strictly with the normal kernel, which resides in VTL 0. As for any normal OS, the Secure Kernel runs in CPL 0 (also known as ring 0 or kernel mode) of VTL 1 and provides services (the majority of them through system calls) to the Isolated User Mode (IUM), which lives in CPL 3 (also known as ring 3 or user mode) of VTL 1. The Secure Kernel has been designed to be as small as possible with the goal of reducing the external attack surface. It’s not extensible with external device drivers like the normal kernel. The only kernel modules that extend its functionality are loaded by the Windows Loader before VSM is launched and are imported by securekernel.exe:

  •     Skci.dll Implements the Hypervisor Enforced Code Integrity part of the Secure Kernel

  •     Cng.sys Provides the cryptographic engine to the Secure Kernel

  •     Vmsvcext.sys Provides support for the attestation of the Secure Kernel components in Intel TXT (Trusted Boot) environments (more information about Trusted Boot is available in Chapter 12)

Image

Figure 9-31 Virtual Secure Mode Architecture scheme, built on top of the hypervisor.

While the Secure Kernel is not extensible, the Isolated User Mode includes specialized processes called Trustlets. Trustlets are isolated from each other and have specialized digital signature requirements. They can communicate with the Secure Kernel through syscalls and with the normal world through Mailslots and ALPC. Isolated User Mode is discussed later in this chapter.

Virtual interrupts

When the hypervisor configures the underlying virtual partitions, it requires that the physical processors produce a VMEXIT every time an external interrupt is raised by the CPU physical APIC (Advanced Programmable Interrupt Controller). The hardware’s virtual machine extensions allow the hypervisor to inject virtual interrupts to the guest partitions (more details are in the Intel, AMD, and ARM user manuals). Thanks to these two facts, the hypervisor implements the concept of a Synthetic Interrupt Controller (SynIC). A SynIC can manage two kinds of interrupts. Virtual interrupts are interrupts delivered to a guest partition’s virtual APIC. A virtual interrupt can represent and be associated with a physical hardware interrupt, which is generated by the real hardware. Otherwise, a virtual interrupt can represent a synthetic interrupt, which is generated by the hypervisor itself in response to certain kinds of events. The SynIC can map physical interrupts to virtual ones. A VTL has a SynIC associated with each virtual processor in which the VTL runs. At the time of this writing, the hypervisor has been designed to support 16 different synthetic interrupt vectors (only 2 are actually in use, though).

When the system starts (phase 1 of the NT kernel’s initialization) the ACPI driver maps each interrupt to the correct vector using services provided by the HAL. The NT HAL is enlightened and knows whether it’s running under VSM. In that case, it calls into the hypervisor for mapping each physical interrupt to its own VTL. Even the Secure Kernel could do the same. At the time of this writing, though, no physical interrupts are associated with the Secure Kernel (this can change in the future; the hypervisor already supports this feature). The Secure Kernel instead asks the hypervisor to receive only the following virtual interrupts: Secure Timers, Virtual Interrupt Notification Assist (VINA), and Secure Intercepts.

Image Note

It’s important to understand that the hypervisor requires the underlying hardware to produce a VMEXIT only for interrupts of external types. Exceptions are still managed in the same VTL the processor is executing at (no VMEXIT is generated). If an instruction causes an exception, the latter is still managed by the structured exception handling (SEH) code located in the current VTL.

To understand the three kinds of virtual interrupts, we must first introduce how interrupts are managed by the hypervisor.

In the hypervisor, each VTL has been designed to securely receive interrupts from devices associated with its own VTL, to have a secure timer facility which can’t be interfered with by less secure VTLs, and to be able to prevent interrupts directed to lower VTLs while executing code at a higher VTL. Furthermore, a VTL should be able to send IPI interrupts to other processors. This design produces the following scenarios:

  •     When running at a particular VTL, reception of interrupts targeted at the current VTL results in standard interrupt handling (as determined by the virtual APIC controller of the VP).

  •     When an interrupt is received that is targeted at a higher VTL, receipt of the interrupt results in a switch to the higher VTL to which the interrupt is targeted if the IRQL value for the higher VTL would allow the interrupt to be presented. If the IRQL value of the higher VTL does not allow the interrupt to be delivered, the interrupt is queued without switching the current VTL. This behavior allows a higher VTL to selectively mask interrupts when returning to a lower VTL. This could be useful if the higher VTL is running an interrupt service routine and needs to return to a lower VTL for assistance in processing the interrupt.

  •     When an interrupt is received that is targeted at a lower VTL than the current executing VTL of a virtual processor, the interrupt is queued for future delivery to the lower VTL. An interrupt targeted at a lower VTL will never preempt execution of the current VTL. Instead, the interrupt is presented when the virtual processor next transitions to the targeted VTL.

Preventing interrupts directed to lower VTLs is not always a great solution. In many cases, it could slow down the normal OS execution (especially in mission-critical or game environments). To better manage these conditions, the VINA has been introduced. As part of its normal event dispatch loop, the hypervisor checks whether there are pending interrupts queued to a lower VTL. If so, the hypervisor injects a VINA interrupt to the current executing VTL. The Secure Kernel has a handler registered for the VINA vector in its virtual IDT. The handler (the ShvlVinaHandler function) executes a normal call (NORMALKERNEL_VINA) to VTL 0 (Normal and Secure Calls are discussed later in this chapter). This call forces the hypervisor to switch to the normal kernel (VTL 0). As long as the VTL is switched, all the queued interrupts are correctly dispatched. The normal kernel will reenter VTL 1 by emitting a SECUREKERNEL_RESUMETHREAD Secure Call.
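The following sketch (hypothetical names, heavily simplified logic) captures the VINA idea: the hypervisor’s event dispatch loop notices interrupts queued for a lower VTL and, instead of preempting the higher VTL, injects a notification interrupt into it so that the higher VTL can voluntarily switch back to VTL 0.

#include <stdbool.h>
#include <stdio.h>

/* Simplified model of the VINA mechanism (not real hypervisor code). */

#define VINA_VECTOR 0x30           /* arbitrary vector chosen for the example */

static int  current_vtl           = 1;     /* the VP is running secure code   */
static bool vtl0_interrupt_queued = true;  /* e.g., a clock tick has arrived  */

static void inject_interrupt(int vtl, int vector)
{
    printf("inject vector 0x%x into VTL %d\n", vector, vtl);
}

/* Called periodically by the hypervisor's event dispatch loop. */
static void dispatch_pending_events(void)
{
    if (current_vtl > 0 && vtl0_interrupt_queued) {
        /* Don't preempt the higher VTL directly; just notify it so that it
           can voluntarily switch back to VTL 0 (the NORMALKERNEL_VINA call). */
        inject_interrupt(current_vtl, VINA_VECTOR);
    }
}

int main(void)
{
    dispatch_pending_events();
    return 0;
}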

Secure IRQLs

The VINA handler will not always be executed in VTL 1. As in the NT kernel, this depends on the IRQL at which the code is executing. The current executing code’s IRQL masks all the interrupts that are associated with an IRQL that’s less than or equal to it. The mapping between an interrupt vector and the IRQL is maintained by the Task Priority Register (TPR) of the virtual APIC, as in the case of real physical APICs (consult the Intel Architecture Manual for more information). As shown in Figure 9-32, the Secure Kernel supports different levels of IRQL compared to the normal kernel. Those IRQLs are called secure IRQLs.

Image

Figure 9-32 Secure Kernel interrupts request levels (IRQL).

The first three secure IRQLs are managed by the Secure Kernel in a way similar to the normal world. Normal APCs and DPCs (targeting VTL 0) still can’t preempt code executing in VTL 1 through the hypervisor, but the VINA interrupt is still delivered to the Secure Kernel (the operating system manages the three software interrupts by writing to the target processor’s APIC Task-Priority Register, an operation that causes a VMEXIT to the hypervisor. For more information about the APIC TPR, see the Intel, AMD, or ARM manuals). This means that if a normal-mode DPC is targeted at a processor while it is executing VTL 1 code (at a compatible secure IRQL, which should be less than Dispatch), the VINA interrupt is delivered and switches the execution context to VTL 0. In effect, this executes the DPC in the normal world and temporarily raises the normal kernel’s IRQL to dispatch level. When the DPC queue is drained, the normal kernel’s IRQL drops. Execution flow returns to the Secure Kernel thanks to the VSM communication loop code that is located in the VslpEnterIumSecureMode routine. The loop processes each normal call originated from the Secure Kernel.

The Secure Kernel maps the first three secure IRQLs to the same IRQL of the normal world. When a Secure call is made from code executing at a particular IRQL (still less than or equal to dispatch) in the normal world, the Secure Kernel switches its own secure IRQL to the same level. Vice versa, when the Secure Kernel executes a normal call to enter the NT kernel, it switches the normal kernel’s IRQL to the same level as its own. This works only for the first three levels.

The normal raised level is used when the NT kernel enters the secure world at an IRQL higher than the DPC level. In those cases, the Secure Kernel maps all of the normal-world IRQLs, which are above DPC, to its normal raised secure level. Secure Kernel code executing at this level can’t receive any VINA for any kind of software IRQLs in the normal kernel (but it can still receive a VINA for hardware interrupts). Every time the NT kernel enters the secure world at a normal IRQL above DPC, the Secure Kernel raises its secure IRQL to normal raised.
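A minimal sketch of this IRQL mirroring, with illustrative values only, could look like the following; the first three levels map one-to-one, and everything above DPC collapses into the single normal raised secure level.

#include <stdio.h>

/* Illustrative IRQL values, mirroring the NT convention on x64. */
#define PASSIVE_LEVEL   0
#define APC_LEVEL       1
#define DISPATCH_LEVEL  2
#define NORMAL_RAISED   3   /* hypothetical value for the "normal raised" level */

/* Map the normal-world IRQL to the secure IRQL used while in VTL 1. */
static int secure_irql_for_entry(int normal_irql)
{
    /* The first three levels map one-to-one; anything above DISPATCH
       collapses into the single "normal raised" secure level. */
    return (normal_irql <= DISPATCH_LEVEL) ? normal_irql : NORMAL_RAISED;
}

int main(void)
{
    for (int irql = PASSIVE_LEVEL; irql <= 5; irql++)
        printf("normal IRQL %d -> secure IRQL %d\n",
               irql, secure_irql_for_entry(irql));
    return 0;
}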

Secure IRQLs equal to or higher than VINA can never be preempted by any code in the normal world. This explains why the Secure Kernel supports the concept of secure, nonpreemptable timers and Secure Intercepts. Secure timers are generated from the hypervisor’s clock interrupt service routine (ISR). This ISR, before injecting a synthetic clock interrupt to the NT kernel, checks whether there are one or more secure timers that are expired. If so, it injects a synthetic secure timer interrupt to VTL 1. Then it proceeds to forward the clock tick interrupt to the normal VTL.

Secure intercepts

There are cases where the Secure Kernel may need to prevent the NT kernel, which executes at a lower VTL, from accessing certain critical system resources. For example, writes to some processor’s MSRs could potentially be used to mount an attack that would disable the hypervisor or subvert some of its protections. VSM provides a mechanism to allow a higher VTL to lock down critical system resources and prevent access to them by lower VTLs. The mechanism is called secure intercepts.

Secure intercepts are implemented in the Secure Kernel by registering a synthetic interrupt, which is provided by the hypervisor (remapped in the Secure Kernel to vector 0xF0). The hypervisor, when certain events cause a VMEXIT, injects a synthetic interrupt to the higher VTL on the virtual processor that triggered the intercept. At the time of this writing, the Secure Kernel registers with the hypervisor for the following types of intercepted events:

  •     Write to some vital processor’s MSRs (Star, Lstar, Cstar, Efer, Sysenter, Ia32Misc, and APIC base on AMD64 architectures) and special registers (GDT, IDT, LDT)

  •     Write to certain control registers (CR0, CR4, and XCR0)

  •     Write to some I/O ports (ports 0xCF8 and 0xCFC are good examples; the intercept manages the reconfiguration of PCI devices)

  •     Invalid access to protected guest physical memory

When VTL 0 software causes an intercept that will be raised in VTL 1, the Secure Kernel needs to recognize the intercept type from its interrupt service routine. For this purpose, the Secure Kernel uses the message queue allocated by the SynIC for the “Intercept” synthetic interrupt source (see the “Inter-partition communication” section earlier in this chapter for more details about the SynIC and SINT). The Secure Kernel is able to discover and map the physical memory page by checking the SIMP synthetic MSR, which is virtualized by the hypervisor. The mapping of the physical page is executed at Secure Kernel initialization time in VTL 1. The Secure Kernel’s startup is described later in this chapter.

Intercepts are used extensively by HyperGuard with the goal of protecting sensitive parts of the normal NT kernel. If a malicious rootkit installed in the NT kernel tries to modify the system by writing a particular value to a protected register (for example, to the syscall handler MSRs, CSTAR and LSTAR, or other model-specific registers), the Secure Kernel intercept handler (ShvlpInterceptHandler) filters the new register’s value, and, if it discovers that the value is not acceptable, it injects a General Protection Fault (GPF) nonmaskable exception into the NT kernel in VTL 0. This causes an immediate bugcheck resulting in the system being stopped. If the value is acceptable, the Secure Kernel writes the new value of the register through the hypervisor by using the HvSetVpRegisters hypercall (in this case, the Secure Kernel is proxying the access to the register).
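The following C sketch illustrates only the filtering idea (the register choice, the trusted value, and the policy are placeholders, not HyperGuard’s actual rules): the intercept handler either rejects the write by injecting a fault into VTL 0 or proxies the legitimate write through the hypervisor.

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Simplified model of an MSR-write intercept filter (illustrative only). */

#define MSR_LSTAR 0xC0000082u   /* the x64 64-bit syscall entry point MSR */

static uint64_t trusted_lstar = 0xFFFFF80000200000ull;  /* example value */

static void inject_gp_fault_into_vtl0(void)
{
    printf("value rejected: injecting #GP into VTL 0 (bugcheck follows)\n");
}

static void hv_set_vp_register(uint32_t msr, uint64_t value)
{
    /* Stand-in for the HvSetVpRegisters hypercall: the Secure Kernel
       proxies the legitimate write on behalf of VTL 0. */
    printf("proxying write: MSR 0x%x <- 0x%llx\n", msr,
           (unsigned long long)value);
}

static void on_msr_write_intercept(uint32_t msr, uint64_t new_value)
{
    bool acceptable = (msr == MSR_LSTAR) && (new_value == trusted_lstar);

    if (!acceptable)
        inject_gp_fault_into_vtl0();
    else
        hv_set_vp_register(msr, new_value);
}

int main(void)
{
    on_msr_write_intercept(MSR_LSTAR, trusted_lstar);      /* allowed  */
    on_msr_write_intercept(MSR_LSTAR, 0x4141414141ull);    /* rejected */
    return 0;
}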

Control over hypercalls

The last intercept type that the Secure Kernel registers with the hypervisor is the hypercall intercept. The hypercall intercept’s handler checks that the hypercall emitted by the VTL 0 code to the hypervisor is legitimate and originates from the operating system itself, and not from some external module. Every time a hypercall is emitted in any VTL, it causes a VMEXIT in the hypervisor (by design). Hypercalls are the base service used by kernel components of each VTL to request services from each other (and from the hypervisor itself). The hypervisor injects a synthetic intercept interrupt to the higher VTL only for hypercalls used to request services directly from the hypervisor, skipping all the hypercalls used for secure and normal calls to and from the Secure Kernel.

If the hypercall is not recognized as valid, it won’t be executed: the Secure Kernel in this case updates the lower VTL’s registers with the goal of signaling the hypercall error. The system is not crashed (although this behavior can change in the future); the calling code can decide how to manage the error.

VSM system calls

As introduced in the previous sections, VSM uses hypercalls to exchange service requests between the normal kernel and the Secure Kernel. Hypercalls were originally designed as a way to request services from the hypervisor, but in VSM the model has been extended to support new types of system calls:

  •     Secure calls are emitted by the normal NT kernel in VTL 0 to request services from the Secure Kernel.

  •     Normal calls are requested by the Secure Kernel in VTL 1 when it needs services provided by the NT kernel, which runs in VTL 0. Furthermore, some of them are used by secure processes (trustlets) running in Isolated User Mode (IUM) to request services from the Secure Kernel or the normal NT kernel.

These kinds of system calls are implemented in the hypervisor, the Secure Kernel, and the normal NT kernel. The hypervisor defines two hypercalls for switching between different VTLs: HvVtlCall and HvVtlReturn. The Secure Kernel and NT kernel define the dispatch loop used for dispatching Secure and Normal Calls.

Furthermore, the Secure Kernel implements another type of system call: secure system calls. They provide services only to secure processes (trustlets), which run in IUM. These system calls are not exposed to the normal NT kernel. The hypervisor is not involved at all while processing secure system calls.

Virtual processor state

Before delving into the Secure and Normal calls architecture, it is necessary to analyze how the virtual processor manages the VTL transition. Secure VTLs always operate in long mode (which is the execution model of AMD64 processors where the CPU accesses 64-bit-only instructions and registers), with paging enabled. Any other execution model is not supported. This simplifies launch and management of secure VTLs and also provides an extra level of protection for code running in secure mode. (Some other important implications are discussed later in the chapter.)

For efficiency, a virtual processor has some registers that are shared between VTLs and some other registers that are private to each VTL. The state of the shared registers does not change when switching between VTLs. This allows a quick passing of a small amount of information between VTLs, and it also reduces the context switch overhead when switching between VTLs. Each VTL has its own instance of private registers, which could only be accessed by that VTL. The hypervisor handles saving and restoring the contents of private registers when switching between VTLs. Thus, when entering a VTL on a virtual processor, the state of the private registers contains the same values as when the virtual processor last ran that VTL.

Most of a virtual processor’s register state is shared between VTLs. Specifically, general purpose registers, vector registers, and floating-point registers are shared between all VTLs with a few exceptions, such as the RIP and the RSP registers. Private registers include some control registers, some architectural registers, and hypervisor virtual MSRs. The secure intercept mechanism (see the previous section for details) is used to allow the Secure environment to control which MSR can be accessed by the normal mode environment. Table 9-3 summarizes which registers are shared between VTLs and which are private to each VTL.

Table 9-3 Virtual processor per-VTL register states

Shared

  •     General registers: Rax, Rbx, Rcx, Rdx, Rsi, Rdi, Rbp, CR2, R8 – R15, DR0 – DR5, X87 floating point state, XMM registers, AVX registers, XCR0 (XFEM), DR6 (processor-dependent)

  •     MSRs: HV_X64_MSR_TSC_FREQUENCY, HV_X64_MSR_VP_INDEX, HV_X64_MSR_VP_RUNTIME, HV_X64_MSR_RESET, HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_GUEST_IDLE, HV_X64_MSR_DEBUG_DEVICE_OPTIONS, HV_X64_MSR_BELOW_1MB_PAGE, HV_X64_MSR_STATS_PARTITION_RETAIL_PAGE, HV_X64_MSR_STATS_VP_RETAIL_PAGE, MTRRs and PAT, MCG_CAP, MCG_STATUS

Private

  •     General registers: RIP, RSP, RFLAGS, CR0, CR3, CR4, DR7, IDTR, GDTR, CS, DS, ES, FS, GS, SS, TR, LDTR, TSC, DR6 (processor-dependent)

  •     MSRs: SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP, STAR, LSTAR, CSTAR, SFMASK, EFER, KERNEL_GSBASE, FS.BASE, GS.BASE, HV_X64_MSR_HYPERCALL, HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_REFERENCE_TSC, HV_X64_MSR_APIC_FREQUENCY, HV_X64_MSR_EOI, HV_X64_MSR_ICR, HV_X64_MSR_TPR, HV_X64_MSR_APIC_ASSIST_PAGE, HV_X64_MSR_NPIEP_CONFIG, HV_X64_MSR_SIRBP, HV_X64_MSR_SCONTROL, HV_X64_MSR_SVERSION, HV_X64_MSR_SIEFP, HV_X64_MSR_SIMP, HV_X64_MSR_EOM, HV_X64_MSR_SINT0 – HV_X64_MSR_SINT15, HV_X64_MSR_STIMER0_CONFIG – HV_X64_MSR_STIMER3_CONFIG, HV_X64_MSR_STIMER0_COUNT – HV_X64_MSR_STIMER3_COUNT, Local APIC registers (including CR8/TPR)

Secure calls

When the NT kernel needs services provided by the Secure Kernel, it uses a special function, VslpEnterIumSecureMode. The routine accepts a 104-byte data structure (called SKCALL), which is used to describe the kind of operation (invoke service, flush TB, resume thread, or call enclave), the secure call number, and a maximum of twelve 8-byte parameters. The function raises the processor’s IRQL, if necessary, and determines the value of the Secure Thread cookie. This value communicates to the Secure Kernel which secure thread will process the request. It then (re)starts the secure calls dispatch loop. The executability state of each VTL is a state machine that depends on the other VTL.
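Based on this description, a plausible C sketch of the SKCALL data structure could look like the following; the exact field layout is internal and undocumented, so this is a reconstruction for illustration only.

#include <stdint.h>
#include <assert.h>
#include <stdio.h>

/* Reconstruction for illustration: the real SKCALL layout is not public. */
typedef enum {
    SkCallInvokeService = 0,
    SkCallFlushTb       = 1,
    SkCallResumeThread  = 2,
    SkCallCallEnclave   = 3
} SKCALL_OPERATION;

typedef struct _SKCALL {
    uint32_t Operation;        /* one of SKCALL_OPERATION                  */
    uint32_t SecureCallNumber; /* which secure service is being requested  */
    uint64_t Parameters[12];   /* up to twelve 8-byte parameters           */
} SKCALL;

int main(void)
{
    /* 4 + 4 + 12 * 8 = 104 bytes, matching the size quoted in the text. */
    static_assert(sizeof(SKCALL) == 104, "unexpected SKCALL size");
    printf("sizeof(SKCALL) = %zu bytes\n", sizeof(SKCALL));
    return 0;
}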

The loop described by the VslpEnterIumSecureMode function manages all the operations shown on the left side of Figure 9-33 in VTL 0 (except the case of Secure Interrupts). The NT kernel can decide to enter the Secure Kernel, and the Secure Kernel can decide to enter the normal NT kernel. The loop starts by entering the Secure Kernel through the HvlSwitchToVsmVtl1 routine (specifying the operation requested by the caller). The latter function, which returns only if the Secure Kernel requests a VTL switch, saves all the shared registers and copies the entire SKCALL data structure in some well-defined CPU registers: RBX and the SSE registers XMM10 through XMM15. Finally, it emits an HvVtlCall hypercall to the hypervisor. The hypervisor switches to the target VTL (by loading the saved per-VTL VMCS) and writes a VTL secure call entry reason to the VTL control page. Indeed, to be able to determine why a secure VTL was entered, the hypervisor maintains an informational memory page that is shared by each secure VTL. This page is used for bidirectional communication between the hypervisor and the code running in a secure VTL on a virtual processor.

Image

Figure 9-33 The VSM dispatch loop.

The virtual processor restarts the execution in VTL 1 context, in the SkCallNormalMode function of the Secure Kernel. The code reads the VTL entry reason; if it’s not a Secure Interrupt, it loads the current processor SKPRCB (Secure Kernel processor control block), selects a thread on which to run (starting from the secure thread cookie), and copies the content of the SKCALL data structure from the CPU shared registers to a memory buffer. Finally, it calls the IumInvokeSecureService dispatcher routine, which will process the requested secure call, by dispatching the call to the correct function (and implements part of the dispatch loop in VTL 1).

An important concept to understand is that the Secure Kernel can map and access VTL 0 memory, so there’s no need to marshal and copy any data structures pointed to by the parameters into VTL 1 memory. This does not apply to normal calls, as we discuss in the next section.

As we have seen in the previous section, Secure Interrupts (and intercepts) are dispatched by the hypervisor, which preempts any code executing in VTL 0. In this case, when the VTL 1 code starts the execution, it dispatches the interrupt to the right ISR. After the ISR finishes, the Secure Kernel immediately emits a HvVtlReturn hypercall. As a result, the code in VTL 0 restarts the execution at the point in which it has been previously interrupted, which is not located in the secure calls dispatch loop. Therefore, Secure Interrupts are not part of the dispatch loop even if they still produce a VTL switch.

Normal calls

Normal calls are managed similarly to the secure calls (with an analogous dispatch loop located in VTL 1, called normal calls loop), but with some important differences:

  •     All the shared VTL registers are securely cleaned up by the Secure Kernel before emitting the HvVtlReturn to the hypervisor for switching the VTL. This prevents leaking any kind of secure data to normal mode.

  •     The normal NT kernel can’t read secure VTL 1 memory. To correctly pass the syscall parameters and data structures needed for the normal call, a memory buffer that both the Secure Kernel and the normal kernel can share is required. The Secure Kernel allocates this shared buffer using the ALLOCATE_VM normal call (which does not require passing any pointer as a parameter). The latter is dispatched to the MmAllocateVirtualMemory function in the NT normal kernel. The allocated memory is remapped in the Secure Kernel at the same virtual address and becomes part of the secure process’s shared memory pool.

  •     As we will discuss later in the chapter, the Isolated User Mode (IUM) was originally designed to be able to execute special Win32 executables, which should have been capable of running in either the normal world or the secure world. The standard unmodified Ntdll.dll and KernelBase.dll libraries are mapped even in IUM. This fact has the important consequence of requiring almost all the native NT APIs (which Kernel32.dll and many other user mode libraries depend on) to be proxied by the Secure Kernel.

To correctly deal with the described problems, the Secure Kernel includes a marshaler, which identifies and correctly copies the data structures pointed by the parameters of an NT API in the shared buffer. The marshaler is also able to determine the size of the shared buffer, which will be allocated from the secure process memory pool. The Secure Kernel defines three types of normal calls:

  •     A disabled normal call is not implemented in the Secure Kernel and, if called from IUM, it simply fails with a STATUS_INVALID_SYSTEM_SERVICE exit code. This kind of call can’t be called directly by the Secure Kernel itself.

  •     An enabled normal call is implemented only in the NT kernel and is callable from IUM in its original Nt or Zw version (through Ntdll.dll). Even the Secure Kernel can request an enabled normal call—but only through a small stub that loads the normal call number, sets the highest bit in the number, and calls the normal call dispatcher (the IumGenericSyscall routine). The highest bit identifies the normal call as being requested by the Secure Kernel itself and not by the Ntdll.dll module loaded in IUM.

  •     A special normal call is implemented partially or completely in the Secure Kernel (VTL 1), which can filter the original function’s results or entirely redesign its code.

Enabled and special normal calls can be marked as KernelOnly. In the latter case, the normal call can be requested only from the Secure Kernel itself (and not from secure processes). We’ve already provided the list of enabled and special normal calls (which are callable from software running in VSM) in Chapter 3 of Part 1, in the section named “Trustlet-accessible system calls.”

Figure 9-34 shows an example of a special normal call. In the example, the LsaIso trustlet has called the NtQueryInformationProcess native API to request information of a particular process. The Ntdll.dll mapped in IUM prepares the syscall number and executes a SYSCALL instruction, which transfers the execution flow to the KiSystemServiceStart global system call dispatcher, residing in the Secure Kernel (VTL 1). The global system call dispatcher recognizes that the system call number belongs to a normal call and uses the number to access the IumSyscallDispatchTable array, which represents the normal calls dispatch table.

Image

Figure 9-34 A trustlet performing a special normal call to the NtQueryInformationProcess API.

The normal calls dispatch table contains an array of compacted entries, which are generated in phase 0 of the Secure Kernel startup (discussed later in this chapter). Each entry contains an offset to a target function (calculated relative to the table itself) and the number of its arguments (with some flags). All the offsets in the table are initially calculated to point to the normal call dispatcher routine (IumGenericSyscall). After the first initialization cycle, the Secure Kernel startup routine patches each entry that represents a special call. The new offset points to the code that implements the normal call in the Secure Kernel.

As a result, in Figure 9-34, the global system calls dispatcher transfers execution to the NtQueryInformationProcess function’s part implemented in the Secure Kernel. The latter checks whether the requested information class is one of the small subsets exposed to the Secure Kernel and, if so, uses a small stub code to call the normal call dispatcher routine (IumGenericSyscall).
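The compacted, self-relative dispatch table described above can be sketched in C as follows; the field widths, the flag meanings, and the helper names are invented for the example and do not reflect the Secure Kernel’s actual encoding.

#include <stdint.h>
#include <stdio.h>

/* Illustrative encoding: the offset is relative to the table itself, so the
   table is position independent; real entries pack the offset, argument
   count, and flags much more tightly. */
typedef struct {
    intptr_t TargetOffset;   /* self-relative offset to the handler        */
    uint8_t  ArgumentCount;  /* number of 8-byte arguments                 */
    uint8_t  Flags;          /* e.g., marks a special or KernelOnly call   */
} DISPATCH_ENTRY;

typedef void (*HANDLER)(void);

static void IumGenericSyscall(void) { printf("generic normal call dispatcher\n"); }
static void SpecialCallImpl(void)   { printf("special call implemented in VTL 1\n"); }

static DISPATCH_ENTRY table[2];

static intptr_t self_relative(HANDLER h)
{
    return (intptr_t)((uintptr_t)h - (uintptr_t)table);
}

static HANDLER resolve(uint32_t index)
{
    return (HANDLER)((uintptr_t)table + (uintptr_t)table[index].TargetOffset);
}

int main(void)
{
    /* Initially, every entry points to the generic dispatcher... */
    for (int i = 0; i < 2; i++)
        table[i] = (DISPATCH_ENTRY){ self_relative(IumGenericSyscall), 4, 0 };

    /* ...and the startup code later patches special-call entries so that
       they point to the in-Secure-Kernel implementation. */
    table[1].TargetOffset = self_relative(SpecialCallImpl);

    resolve(0)();   /* enabled normal call: generic dispatcher      */
    resolve(1)();   /* special normal call: VTL 1 implementation    */
    return 0;
}

Storing offsets relative to the table itself keeps the entries small and position independent, which is what allows the table to be compacted at startup.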

Figure 9-35 shows the syscall selector number for the NtQueryInformationProcess API. Note that the stub sets the highest bit (N bit) of the syscall number to indicate that the normal call is requested by the Secure Kernel. The normal call dispatcher checks the parameters and calls the marshaler, which is able to marshal each argument and copy it in the right offset of the shared buffer. There is another bit in the selector that further differentiates between a normal call or a secure system call, which is discussed later in this chapter.

Image

Figure 9-35 The Syscall selector number of the Secure Kernel.
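The selector just described can be modeled as a plain 32-bit value with a few discriminating bits, roughly as in this sketch; the N bit follows the text’s description of the highest bit, the secure-system-call bit is the twenty-eighth bit used on AMD64 (discussed in the next section), and the index mask width is an assumption made for illustration.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative decoding of the Secure Kernel syscall selector. */
#define SELECTOR_N_BIT           (1u << 31)   /* call requested by the Secure Kernel */
#define SELECTOR_SECURE_SYSCALL  (1u << 27)   /* "twenty-eighth bit" on AMD64        */
#define SELECTOR_NUMBER_MASK     0x0000FFFFu  /* assumed width of the call index     */

static void decode_selector(uint32_t selector)
{
    bool from_secure_kernel = (selector & SELECTOR_N_BIT) != 0;
    bool secure_syscall     = (selector & SELECTOR_SECURE_SYSCALL) != 0;

    printf("index=%u, %s, %s\n",
           selector & SELECTOR_NUMBER_MASK,
           secure_syscall ? "secure system call" : "normal call",
           from_secure_kernel ? "requested by the Secure Kernel" : "requested from IUM");
}

int main(void)
{
    decode_selector(0x19);                            /* trustlet normal call        */
    decode_selector(0x19 | SELECTOR_N_BIT);           /* Secure Kernel's own request */
    decode_selector(0x42 | SELECTOR_SECURE_SYSCALL);  /* secure system call          */
    return 0;
}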

The marshaler works thanks to two important arrays that describe each normal call: the descriptors array (shown in the right side of Figure 9-34) and the arguments descriptors array. From these arrays, the marshaler can fetch all the information that it needs: normal call type, marshalling function index, argument type, size, and type of data pointed to (if the argument is a pointer).

After the shared buffer has been correctly filled by the marshaler, the Secure Kernel compiles the SKCALL data structure and enters the normal call dispatcher loop (SkCallNormalMode). This part of the loop saves and clears all the shared virtual CPU registers, disables interrupts, and moves the thread context to the PRCB thread (more about thread scheduling later in the chapter). It then copies the content of the SKCALL data structure into the shared registers. As a final stage, it calls the hypervisor through the HvVtlReturn hypercall.

Then the code execution resumes in the secure call dispatch loop in VTL 0. If there are some pending interrupts in the queue, they are processed as normal (only if the IRQL allows it). The loop recognizes the normal call operation request and calls the NtQueryInformationProcess function implemented in VTL 0. After the latter function has finished its processing, the loop restarts and reenters the Secure Kernel again (as for Secure Calls), still through the HvlSwitchToVsmVtl1 routine, but with a different operation request: Resume thread. This, as the name implies, allows the Secure Kernel to switch to the original secure thread and to continue the execution that had been preempted for executing the normal call.

The implementation of enabled normal calls is the same except for the fact that those calls have their entries in the normal calls dispatch table, which point directly to the normal call dispatcher routine, IumGenericSyscall. In this way, the code will transfer directly to the handler, skipping any API implementation code in the Secure Kernel.

Secure system calls

The last type of system call available in the Secure Kernel is similar to the standard system calls provided by the NT kernel to VTL 0 user mode software. The secure system calls are used for providing services only to secure processes (trustlets). VTL 0 software can’t emit secure system calls in any way. As we will discuss in the “Isolated User Mode” section later in this chapter, every trustlet maps the IUM Native Layer Dll (Iumdll.dll) in its address space. Iumdll.dll has the same job as its counterpart in VTL 0, Ntdll.dll: implementing the native syscall stub functions for user mode applications. The stub copies the syscall number into a register and emits the SYSCALL instruction (the instruction uses different opcodes depending on the platform).

Secure system call numbers always have the twenty-eighth bit set to 1 (on AMD64 architectures, whereas ARM64 uses the sixteenth bit). In this way, the global system call dispatcher (KiSystemServiceStart) recognizes that the syscall number belongs to a secure system call (and not a normal call) and switches to the SkiSecureServiceTable, which represents the secure system calls dispatch table. As in the case of normal calls, the global dispatcher verifies that the call number is within the limit, allocates stack space for the arguments (if needed), calculates the system call’s final address, and transfers the code execution to it.

Overall, the code execution remains in VTL 1, but the current privilege level of the virtual processor raises from 3 (user mode) to 0 (kernel mode). The dispatch table for secure system calls is compacted—similarly to the normal calls dispatch table—at phase 0 of the Secure Kernel startup. However, entries in this table are all valid and point to functions implemented in the Secure Kernel.

Secure threads and scheduling

As we will describe in the “Isolated User Mode” section, the execution units in VSM are the secure threads, which live in the address space described by a secure process. Secure threads can be kernel mode or user mode threads. VSM maintains a strict correspondence between each user mode secure thread and normal thread living in VTL 0.

Indeed, the Secure Kernel thread scheduling depends completely on the normal NT kernel; the Secure Kernel doesn’t include a proprietary scheduler (by design, the Secure Kernel attack surface needs to be small). In Chapter 3 of Part 1, we described how the NT kernel creates a process and the relative initial thread. In the section that describes Stage 4, “Creating the initial thread and its stack and context,” we explain that a thread creation is performed in two parts:

  •     The executive thread object is created; its kernel and user stack are allocated. The KeInitThread routine is called for setting up the initial thread context for user mode threads. KiStartUserThread is the first routine that will be executed in the context of the new thread, which will lower the thread’s IRQL and call PspUserThreadStartup.

  •     The execution control is then returned to NtCreateUserProcess, which, at a later stage, calls PspInsertThread to complete the initialization of the thread and insert it into the object manager namespace.

As a part of its work, when PspInsertThread detects that the thread belongs to a secure process, it calls VslCreateSecureThread, which, as the name implies, uses the Create Thread secure service call to ask the Secure Kernel to create an associated secure thread. The Secure Kernel verifies the parameters and gets the process’s secure image data structure (more details about this later in this chapter). It then allocates the secure thread object and its TEB, creates the initial thread context (the first routine that will run is SkpUserThreadStartup), and finally makes the thread schedulable. Furthermore, the secure service handler in VTL 1, after marking the thread as ready to run, returns a specific thread cookie, which is stored in the ETHREAD data structure.

The new secure thread still starts in VTL 0. As described in the “Stage 7” section of Chapter 3 of Part 1, PspUserThreadStartup performs the final initialization of the user thread in the new context. In case it determines that the thread’s owning process is a trustlet, PspUserThreadStartup calls the VslStartSecureThread function, which invokes the secure calls dispatch loop through the VslpEnterIumSecureMode routine in VTL 0 (passing the secure thread cookie returned by the Create Thread secure service handler). The first operation that the dispatch loop requests to the Secure Kernel is to resume the execution of the secure thread (still through the HvVtlCall hypercall).

The Secure Kernel, before the switch to VTL 0, was executing code in the normal call dispatcher loop (SkCallNormalMode). The hypercall executed by the normal kernel restarts the execution in the same loop routine. The VTL 1 dispatcher loop recognizes the new thread resume request; it switches its execution context to the new secure thread, attaches to its address space, and makes it runnable. As part of the context switch, a new stack is selected (which has been previously initialized by the Create Thread secure call). The latter contains the address of the first secure thread system function, SkpUserThreadStartup, which, similarly to the case of normal NT threads, sets up the initial thunk context to run the image-loader initialization routine (LdrInitializeThunk in Ntdll.dll).

After it has started, the new secure thread can return to normal mode for two main reasons: it emits a normal call, which needs to be processed in VTL 0, or a VINA interrupt preempts the code execution. Even though the two cases are processed in slightly different ways, they both result in executing the normal call dispatcher loop (SkCallNormalMode).

As previously discussed in Part 1, Chapter 4, “Threads,” the NT scheduler works thanks to the processor clock, which generates an interrupt every time the system clock fires (usually every 15.6 milliseconds). The clock interrupt service routine updates the processor times and calculates whether the thread quantum expires. The interrupt is targeted to VTL 0, so, when the virtual processor is executing code in VTL 1, the hypervisor injects a VINA interrupt to the Secure Kernel, as shown in Figure 9-36. The VINA interrupt preempts the current executing code, lowers the IRQL to the previous preempted code’s IRQL value, and emits the VINA normal call for entering VTL 0.

Image

Figure 9-36 Secure threads scheduling scheme.

As in the standard process of normal call dispatching, before the Secure Kernel emits the HvVtlReturn hypercall, it deselects the current execution thread from the virtual processor’s PRCB. This is important: The VP in VTL 1 is not tied to any thread context anymore and, on the next loop cycle, the Secure Kernel can switch to a different thread or decide to reschedule the execution of the current one.

After the VTL switch, the NT kernel resumes the execution in the secure calls dispatch loop, still in the context of the new thread. Before it has any chance to execute any code, the code is preempted by the clock interrupt service routine, which can calculate the new quantum value and, if the latter has expired, switch execution to another thread. When a context switch occurs, and another thread enters VTL 1, the normal call dispatch loop schedules a different secure thread depending on the value of the secure thread cookie:

  •     A secure thread from the secure thread pool if the normal NT kernel has entered VTL 1 for dispatching a secure call (in this case, the secure thread cookie is 0).

  •     The newly created secure thread if the thread has been rescheduled for execution (the secure thread cookie is a valid value). As shown in Figure 9-36, the new thread can be also rescheduled by another virtual processor (VP 3 in the example).

With the described scheme, all the scheduling decisions are performed only in VTL 0. The secure call loop and normal call loop cooperate to correctly switch the secure thread context in VTL 1. All secure threads have an associated thread in the normal kernel. The opposite is not true, though; if a normal thread in VTL 0 decides to emit a secure call, the Secure Kernel dispatches the request by using an arbitrary thread context from a thread pool.
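In pseudo-C, the selection that the normal call loop performs when the virtual processor reenters VTL 1 can be sketched as follows; the types and helper names are hypothetical.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of the secure-thread selection on VTL 1 entry. */
typedef struct { uint64_t Cookie; const char *Name; } SECURE_THREAD;

static SECURE_THREAD pool_thread = { 0,      "secure thread pool thread" };
static SECURE_THREAD user_thread = { 0x1234, "newly created secure thread" };

static SECURE_THREAD *select_secure_thread(uint64_t secure_thread_cookie)
{
    if (secure_thread_cookie == 0) {
        /* Entered VTL 1 to dispatch a secure call: any thread from the
           secure thread pool will do. */
        return &pool_thread;
    }
    /* Entered VTL 1 to (re)schedule a specific secure thread. */
    return (secure_thread_cookie == user_thread.Cookie) ? &user_thread : NULL;
}

int main(void)
{
    printf("%s\n", select_secure_thread(0)->Name);
    printf("%s\n", select_secure_thread(0x1234)->Name);
    return 0;
}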

The Hypervisor Enforced Code Integrity

Hypervisor Enforced Code Integrity (HVCI) is the feature that powers Device Guard and provides the W^X (pronounced double-you xor ex) characteristic of the VTL 0 kernel memory. The NT kernel can’t map and execute any kind of executable memory in kernel mode without the aid of the Secure Kernel. The Secure Kernel allows only properly digitally signed drivers to run in the machine’s kernel. As we discuss in the next section, the Secure Kernel keeps track of every virtual page allocated in the normal NT kernel; memory pages marked as executable in the NT kernel are considered privileged pages. Only the Secure Kernel can write to them after the SKCI module has correctly verified their content.

You can read more about HVCI in Chapter 7 of Part 1, in the “Device Guard” and “Credential Guard” sections.

UEFI runtime virtualization

Another service provided by the Secure Kernel (when HVCI is enabled) is the ability to virtualize and protect the UEFI runtime services. As we discuss in Chapter 12, the UEFI firmware services are mainly implemented by using a big table of function pointers. Part of the table will be deleted from memory after the OS takes control and calls the ExitBootServices function, but another part of the table, which represents the Runtime services, will remain mapped even after the OS has already taken full control of the machine. Indeed, this is necessary because sometimes the OS needs to interact with the UEFI configuration and services.

Every hardware vendor implements its own UEFI firmware. With HVCI, the firmware should cooperate to provide the nonwritable state of each of its executable memory pages (no firmware page can be mapped in VTL 0 with read, write, and execute state). The memory range in which the UEFI firmware resides is described by multiple MEMORY_DESCRIPTOR data structures located in the EFI memory map. The Windows Loader parses this data with the goal to properly protect the UEFI firmware’s memory. Unfortunately, in the original implementation of UEFI, the code and data were stored mixed in a single section (or multiple sections) and were described by relative memory descriptors. Furthermore, some device drivers read or write configuration data directly from the UEFI’s memory regions. This clearly was not compatible with HVCI.

For overcoming this problem, the Secure Kernel employs the following two strategies:

  •     New versions of the UEFI firmware (which adhere to UEFI 2.6 and higher specifications) maintain a new configuration table (linked in the boot services table), called memory attribute table (MAT). The MAT defines fine-grained sections of the UEFI Memory region, which are subsections of the memory descriptors defined by the EFI memory map. Each section never has both the executable and writable protection attribute.

  •     For old firmware, the Secure Kernel maps in VTL 0 the entire UEFI firmware region’s physical memory with a read-only access right.

In the first strategy, at boot time, the Windows Loader merges the information found both in the EFI memory map and in the MAT, creating an array of memory descriptors that precisely describes the entire firmware region. It then copies them into a reserved buffer located in VTL 1 (used in the hibernation path) and verifies that no firmware section violates the W^X assumption. If the verification succeeds, when the Secure Kernel starts, it applies a proper SLAT protection for every page that belongs to the underlying UEFI firmware region. The physical pages are protected by the SLAT, but their virtual address space in VTL 0 is still entirely marked as RWX. Keeping the virtual memory’s RWX protection is important because the Secure Kernel must support resume-from-hibernation in a scenario where the protection applied in the MAT entries can change. Furthermore, this maintains compatibility with older drivers, which read or write directly from the UEFI memory region, assuming that the write is performed in the correct sections. (Also, the UEFI code should be able to write in its own memory, which is mapped in VTL 0.) This strategy allows the Secure Kernel to avoid mapping any firmware code in VTL 1; the only part of the firmware that remains in VTL 1 is the Runtime function table itself. Keeping the table in VTL 1 allows the resume-from-hibernation code to update the UEFI runtime services’ function pointers directly.
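A minimal sketch of the W^X verification performed over the merged descriptors could look like the following; the descriptor layout is deliberately simplified and is not the loader’s real EFI or MAT format.

#include <stdbool.h>
#include <stdio.h>

/* Simplified firmware-region descriptor (not the real EFI/MAT format). */
typedef struct {
    const char *Name;
    bool        Writable;
    bool        Executable;
} FW_DESCRIPTOR;

/* Verify that no merged descriptor is both writable and executable. */
static bool verify_wx(const FW_DESCRIPTOR *desc, int count)
{
    for (int i = 0; i < count; i++) {
        if (desc[i].Writable && desc[i].Executable) {
            printf("W^X violation in firmware section %s\n", desc[i].Name);
            return false;
        }
    }
    return true;
}

int main(void)
{
    FW_DESCRIPTOR merged[] = {
        { "RT_CODE", false, true  },
        { "RT_DATA", true,  false },
    };
    printf("firmware W^X check passed: %d\n", verify_wx(merged, 2));
    return 0;
}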

The second strategy is not optimal and is used only for allowing old systems to run with HVCI enabled. When the Secure Kernel doesn’t find any MAT in the firmware, it has no choice except to map the entire UEFI runtime services code in VTL 1. Historically, multiple bugs have been discovered in the UEFI firmware code (in SMM especially). Mapping the firmware in VTL 1 could be dangerous, but it’s the only solution compatible with HVCI. (New systems, as stated before, never map any UEFI firmware code in VTL 1.) At startup time, the NT Hal detects that HVCI is on and that the firmware is entirely mapped in VTL 1. So, it switches its internal EFI service table’s pointer to a new table, called UEFI wrapper table. Entries of the wrapper table contain stub routines that use the INVOKE_EFI_RUNTIME_SERVICE secure call to enter in VTL 1. The Secure Kernel marshals the parameters, executes the firmware call, and yields the results to VTL 0. In this case, all the physical memory that describes the entire UEFI firmware is still mapped in read-only mode in VTL 0. The goal is to allow drivers to correctly read information from the UEFI firmware memory region (like ACPI tables, for example). Old drivers that directly write into UEFI memory regions are not compatible with HVCI in this scenario.

When the Secure Kernel resumes from hibernation, it updates the in-memory UEFI service table to point to the new services’ location. Furthermore, in systems that have the new UEFI firmware, the Secure Kernel reapplies the SLAT protection on each memory region mapped in VTL 0 (the Windows Loader is able to change the regions’ virtual addresses if needed).

VSM startup

Although we describe the entire Windows startup and shutdown mechanism in Chapter 12, this section describes the way in which the Secure Kernel and all the VSM infrastructure are started. The Secure Kernel is dependent on the hypervisor, the Windows Loader, and the NT kernel to properly start up. We discuss the Windows Loader, the hypervisor loader, and the preliminary phases by which the Secure Kernel is initialized in VTL 0 by these two modules in Chapter 12. In this section, we focus on the VSM startup method, which is implemented in the securekernel.exe binary.

The first code executed by the securekernel.exe binary is still running in VTL 0; the hypervisor already has been started, and the page tables used for VTL 1 have been built. The Secure Kernel initializes the following components in VTL 0:

  •     The memory manager’s initialization function stores the PFN of the VTL 0 root-level page-level structure, saves the code integrity data, and enables HVCI, MBEC (Mode-Based Execution Control), kernel CFG, and hot patching.

  •     Shared architecture-specific CPU components, like the GDT and IDT.

  •     Normal calls and secure system calls dispatch tables (initialization and compaction).

  •     The boot processor. The process of starting the boot processor requires the Secure Kernel to allocate its kernel and interrupt stacks; initialize the architecture-specific components, which can’t be shared between different processors (like the TSS); and finally allocate the processor’s SKPRCB. The latter is an important data structure, which, like the PRCB data structure in VTL 0, is used to store important information associated to each CPU.

The Secure Kernel initialization code is ready to enter VTL 1 for the first time. The hypervisor subsystem initialization function (the ShvlInitSystem routine) connects to the hypervisor (through the hypervisor CPUID classes; see the previous section for more details) and checks the supported enlightenments. Then it saves the VTL 1 page table (previously created by the Windows Loader) and the allocated hypercall pages (used for holding hypercall parameters). It finally initializes and enters VTL 1 in the following way (a simplified sketch of this sequence appears after the list):

  1. Enables VTL 1 for the current hypervisor partition through the HvEnablePartitionVtl hypercall. The hypervisor copies the existing SLAT table of normal VTL to VTL 1 and enables MBEC and the new VTL 1 for the partition.

  2. Enables VTL 1 for the boot processor through the HvEnableVpVtl hypercall. The hypervisor initializes a new per-VTL VMCS data structure, compiles it, and sets the SLAT table.

  3. Asks the hypervisor for the location of the platform-dependent VtlCall and VtlReturn hypercall code. The CPU opcodes needed for performing VSM calls are hidden from the Secure Kernel implementation. This allows most of the Secure Kernel’s code to be platform-independent. Finally, the Secure Kernel executes the transition to VTL 1, through the HvVtlCall hypercall. The hypervisor loads the VMCS for the new VTL and switches to it (making it active). This basically renders the new VTL runnable.
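Put together, the sequence can be sketched as follows; the hypercall wrappers are placeholders that only mirror the hypercall names mentioned above, and real hypercalls pass their parameters through the hypercall page rather than as plain C arguments.

#include <stdio.h>

/* Placeholder hypercall wrappers: the names follow the hypercalls mentioned
   in the text, but the signatures here are invented for the sketch. */
static void HvEnablePartitionVtl(int vtl)  { printf("partition VTL %d enabled\n", vtl); }
static void HvEnableVpVtl(int vp, int vtl) { printf("VP %d: VTL %d enabled\n", vp, vtl); }
static void HvVtlCall(void)                { printf("switched to VTL 1\n"); }

static void enter_vtl1_first_time(void)
{
    HvEnablePartitionVtl(1);   /* 1. copy the SLAT table to VTL 1, enable MBEC   */
    HvEnableVpVtl(0, 1);       /* 2. build the per-VTL VMCS for the boot VP      */
    /* 3. query the platform-dependent VtlCall/VtlReturn code, then...           */
    HvVtlCall();               /*    ...transition the boot processor into VTL 1 */
}

int main(void)
{
    enter_vtl1_first_time();
    return 0;
}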

The Secure Kernel starts a complex initialization procedure in VTL 1, which still depends on the Windows Loader and also on the NT kernel. It is worth noting that, at this stage, VTL 1 memory is still identity-mapped in VTL 0; the Secure Kernel and its dependent modules are still accessible to the normal world. After the switch to VTL 1, the Secure Kernel initializes the boot processor:

  1. Gets the virtual address of the Synthetic Interrupt Controller shared page, TSC page, and VP assist page, which are provided by the hypervisor for sharing data between the hypervisor and VTL 1 code. It also maps the hypercall page in VTL 1.

  2. Blocks the possibility for other system virtual processors to be started by a lower VTL and asks the hypervisor to zero-fill the memory on reboot.

  3. Initializes and fills the boot processor Interrupt Descriptor Table (IDT). Configures the IPI, callbacks, and secure timer interrupt handlers and sets the current secure thread as the default SKPRCB thread.

  4. Starts the VTL 1 secure memory manager, which creates the boot table mapping and maps the boot loader’s memory in VTL 1, creates the secure PFN database and system hyperspace, initializes the secure memory pool support, and reads the VTL 0 loader block to copy the module descriptors of the Secure Kernel’s imported images (Skci.dll, Cng.sys, and Vmsvcext.sys). It finally walks the NT loaded module list to establish each driver’s state, creating a NAR (normal address range) data structure for each one and compiling a Normal Table Entry (NTE) for every page composing the boot driver’s sections. Furthermore, the secure memory manager initialization function applies the correct VTL 0 SLAT protection to each driver’s sections.

  5. Initializes the HAL, the secure threads pool, the process subsystem, the synthetic APIC, Secure PNP, and Secure PCI.

  6. Applies a read-only VTL 0 SLAT protection for the Secure Kernel pages, configures MBEC, and enables the VINA virtual interrupt on the boot processor.

When this part of the initialization ends, the Secure Kernel unmaps the boot-loaded memory. The secure memory manager, as we discuss in the next section, depends on the VTL 0 memory manager for being able to allocate and free VTL 1 memory. VTL 1 does not own any physical memory; at this stage, it relies on some previously allocated (by the Windows Loader) physical pages for being able to satisfy memory allocation requests. When the NT kernel later starts, the Secure Kernel performs normal calls for requesting memory services to the VTL 0 memory manager. As a result, some parts of the Secure Kernel initialization must be deferred after the NT kernel is started. Execution flow returns to the Windows Loader in VTL 0. The latter loads and starts the NT kernel. The last part of the Secure Kernel initialization happens in phase 0 and phase 1 of the NT kernel initialization (see Chapter 12 for further details).

Phase 0 of the NT kernel initialization still has no memory services available, but this is the last moment in which the Secure Kernel fully trusts the normal world. Boot-loaded drivers still have not been initialized, and the initial boot process should have been already protected by Secure Boot. The PHASE3_INIT secure call handler modifies the SLAT protections of all the physical pages belonging to the Secure Kernel and to its dependent modules, rendering them inaccessible to VTL 0. Furthermore, it applies a read-only protection to the kernel CFG bitmaps. At this stage, the Secure Kernel enables the support for pagefile integrity, creates the initial system process and its address space, and saves all the “trusted” values of the shared CPU registers (like IDT, GDT, Syscall MSR, and so on). The data structures that the shared registers point to are verified (thanks to the NTE database). Finally, the secure thread pool is started and the object manager, the secure code integrity module (Skci.dll), and HyperGuard are initialized (more details on HyperGuard are available in Chapter 7 of Part 1).

When the execution flow is returned to VTL 0, the NT kernel can then start all the other application processors (APs). When the Secure Kernel is enabled, the AP’s initialization happens in a slightly different way (we discuss AP initialization in the next section).

As part of phase 1 of the NT kernel initialization, the system starts the I/O manager. The I/O manager, as discussed in Part 1, Chapter 6, “I/O system,” is the core of the I/O system and defines the model within which I/O requests are delivered to device drivers. One of the duties of the I/O manager is to initialize and start the boot-loaded and ELAM drivers. Before creating the special sections for mapping the user mode system DLLs, the I/O manager initialization function emits a PHASE4_INIT secure call to start the last initialization phase of the Secure Kernel. At this stage, the Secure Kernel does not trust the VTL 0 anymore, but it can use the services provided by the NT memory manager. The Secure Kernel initializes the content of the Secure User Shared data page (which is mapped both in VTL 1 user mode and kernel mode) and finalizes the executive subsystem initialization. It reclaims any resources that were reserved during the boot process and calls each of its own dependent modules’ entry points (in particular, cng.sys and vmsvcext.sys, which start before any normal boot drivers). It allocates the resources necessary for the encryption of the hibernation, crash-dump, and paging files and for memory-page integrity. It finally reads and maps the API set schema file in VTL 1 memory. At this stage, VSM is completely initialized.

Application processors (APs) startup

One of the security features provided by the Secure Kernel is the startup of the application processors (APs), which are the ones not used to boot up the system. When the system starts, the Intel and AMD specifications of the x86 and AMD64 architectures define a precise algorithm that chooses the bootstrap processor (BSP) in multiprocessor systems. The boot processor always starts in 16-bit real mode (where it’s able to access only 1 MB of physical memory) and usually executes the machine’s firmware code (UEFI in most cases), which needs to be located at a specific physical memory location (called the reset vector). The boot processor executes almost all of the initialization of the OS, hypervisor, and Secure Kernel. For starting other non-boot processors, the system needs to send a special IPI (inter-processor interrupt) to the local APICs belonging to each processor. The startup IPI (SIPI) vector contains the physical memory address of the processor start block, a block of code that includes the instructions for performing the following basic operations:

  1. Load a GDT and switch from 16-bit real mode to 32-bit protected mode (with no paging enabled).

  2. Set a basic page table, enable paging, and enter 64-bit long mode.

  3. Load the 64-bit IDT and GDT, set the proper processor registers, and jump to the OS startup function (KiSystemStartup).

This process is vulnerable to malicious attacks: the processor startup code could be modified by external entities while it is executing on the AP (the NT kernel has no control at this point). In this case, all the security guarantees provided by VSM could easily be defeated. When the hypervisor and the Secure Kernel are enabled, the application processors are still started by the NT kernel, but with the cooperation of the Secure Kernel and the hypervisor.

KeStartAllProcessors, which is the function called by phase 1 of the NT kernel initialization (see Chapter 12 for more details) with the goal of starting all the APs, builds a shared IDT and enumerates all the available processors by consulting the Multiple APIC Description Table (MADT) ACPI table. For each discovered processor, it allocates memory for the PRCB and all the private CPU data structures for the kernel and DPC stack. If VSM is enabled, it then starts the AP by sending a START_PROCESSOR secure call to the Secure Kernel. The latter validates that all the data structures allocated and filled for the new processor are valid, including the initial values of the processor registers and the startup routine (KiSystemStartup), and ensures that AP startups happen sequentially and only once per processor. It then initializes the VTL 1 data structures needed for the new application processor (the SKPRCB in particular). The PRCB thread, which is used for dispatching the Secure Calls in the context of the new processor, is started, and the VTL 0 CPU data structures are protected by using the SLAT. The Secure Kernel finally enables VTL 1 for the new application processor and starts it by using the HvStartVirtualProcessor hypercall. The hypervisor starts the AP in a way similar to the one described at the beginning of this section (by sending the startup IPI). In this case, however, the AP starts its execution in the hypervisor context, switches to 64-bit long mode execution, and returns to VTL 1.

The first function executed by the application processor resides in VTL 1. The Secure Kernel’s CPU initialization routine maps the per-processor VP assist page and SynIC control page, configures MBEC, and enables the VINA. It then returns to VTL 0 through the HvVtlReturn hypercall. The first routine executed in VTL 0 is KiSystemStartup, which initializes the data structures needed by the NT kernel to manage the AP, initializes the HAL, and jumps to the idle loop (read more details in Chapter 12). The Secure Call dispatch loop is initialized later by the normal NT kernel when the first secure call is executed.

With the described secure AP startup, an attacker can’t modify the processor start block or any initial value of the CPU’s registers and data structures; any modification would be detected by the Secure Kernel, which would bug check the system to defeat the attack.

The Secure Kernel memory manager

The Secure Kernel memory manager heavily depends on the NT memory manager (and on the Windows Loader memory manager for its startup code). Entirely describing the Secure Kernel memory manager is outside the scope of this book. Here we discuss only the most important concepts and data structures used by the Secure Kernel.

As mentioned in the previous section, the Secure Kernel memory manager initialization is divided into three phases. In phase 1, the most important, the memory manager performs the following:

  1. Maps the boot loader firmware memory descriptor list in VTL 1, scans the list, and determines the first physical page that it can use for allocating the memory needed for its initial startup (this memory type is called SLAB). Maps the VTL 0 page tables at a virtual address located exactly 512 GB below the VTL 1 page tables. This allows the Secure Kernel to perform a fast conversion between an NT virtual address and a Secure Kernel one (see the sketch after this list).

  2. Initializes the PTE range data structures. A PTE range contains a bitmap that describes each chunk of allocated virtual address range and helps the Secure Kernel to allocate PTEs for its own address space.

  3. Creates the Secure PFN database and initializes the Memory pool.

  4. Initializes the sparse NT address table. For each boot-loaded driver, it creates and fills a NAR, verifies the integrity of the binary, fills the hot patch information, and, if HVCI is on, protects each executable section of the driver using the SLAT. It then cycles through each PTE of the memory image and writes an NT Address Table Entry (NTE) in the NT address table.

  5. Initializes the page bundles.
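
The fixed 512-GB distance between the two page-table mappings mentioned in step 1 of this list can be illustrated with a short sketch. The routine name and the self-map base constant below are hypothetical (the real base is randomized and undocumented); only the 512-GB offset and the 8-bytes-per-PTE arithmetic come from the description above.

    #include <stdint.h>

    #define GB              (1ULL << 30)
    #define VTL1_PTE_BASE   0xFFFFF68000000000ULL        /* assumed VTL 1 self-map base */
    #define VTL0_PTE_BASE   (VTL1_PTE_BASE - 512 * GB)   /* VTL 0 page tables, 512 GB below */

    /* Hypothetical helper: return the VTL 1 virtual address of the VTL 0 (NT)
       PTE that maps the given virtual address. With 4-level paging a VA has 36
       PTE-index bits, and 2^36 PTEs * 8 bytes = 512 GB, which is why a single
       fixed offset is enough to reach the parallel mapping of the NT page tables. */
    static uint64_t *SkpGetVtl0PteAddress(uint64_t va)
    {
        uint64_t pteIndex = (va >> 12) & ((1ULL << 36) - 1);
        return (uint64_t *)(VTL0_PTE_BASE + pteIndex * sizeof(uint64_t));
    }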

The Secure Kernel keeps track of the memory that the normal NT kernel uses. The Secure Kernel memory manager uses the NAR data structure to describe a kernel virtual address range that contains executable code. The NAR contains some information about the range (such as its base address and size) and a pointer to a SECURE_IMAGE data structure, which is used for describing runtime drivers (in general, images verified using Secure HVCI, including user mode images used for trustlets) loaded in VTL 0. Boot-loaded drivers do not use the SECURE_IMAGE data structure because they are treated by the NT memory manager as private pages that contain executable code. The SECURE_IMAGE data structure contains information regarding a loaded image in the NT kernel (which is verified by SKCI), like the address of its entry point, a copy of its relocation tables (used also for dealing with Retpoline and Import Optimization), the pointer to its shared prototype PTEs, hot-patch information, and a data structure that specifies the authorized use of its memory pages. The SECURE_IMAGE data structure is very important because it’s used by the Secure Kernel to track and verify the shared memory pages that are used by runtime drivers.

For tracking VTL 0 kernel private pages, the Secure Kernel uses the NTE data structure. An NTE exists for every virtual page in the VTL 0 address space that requires supervision from the Secure Kernel; it’s often used for private pages. An NTE tracks a VTL 0 virtual page’s PTE and stores the page state and protection. When HVCI is enabled, the NTE table divides all the virtual pages between privileged and nonprivileged. A privileged page represents a memory page that the NT kernel is not able to touch on its own because it’s protected through SLAT and usually corresponds to an executable page or to a kernel CFG read-only page. A nonprivileged page represents all the other types of memory pages that the NT kernel has full control over. The Secure Kernel uses invalid NTEs to represent nonprivileged pages. When HVCI is off, all the private pages are nonprivileged (indeed, the NT kernel has full control of all its pages).

In HVCI-enabled systems, the NT memory manager can’t modify any protected pages; otherwise, an EPT violation exception is raised in the hypervisor, resulting in a system crash. After such a system completes its boot phase, the Secure Kernel has already processed all the nonexecutable physical pages, SLAT-protecting them for read and write access only. In this scenario, new executable pages can be allocated only if the target code has been verified by Secure HVCI.

When the system, an application, or the Plug and Play manager requires the loading of a new runtime driver, a complex procedure starts that involves the NT and Secure Kernel memory managers, summarized here:

  1. The NT memory manager creates a section object, allocates and fills a new control area (more details about the NT memory manager are available in Chapter 5 of Part 1), reads the first page of the binary, and calls the Secure Kernel with the goal of creating the corresponding secure image, which describes the newly loaded module.

  2. The Secure Kernel creates the SECURE_IMAGE data structure, parses all the sections of the binary file, and fills the secure prototype PTEs array.

  3. The NT kernel reads the entire binary into nonexecutable shared memory (pointed to by the prototype PTEs of the control area) and calls the Secure Kernel, which, using Secure HVCI, cycles through each section of the binary image and calculates the final image hash.

  4. If the calculated file hash matches the one stored in the digital signature, the NT memory manager walks the entire image and for each page calls the Secure Kernel, which validates the page (each page’s hash has already been calculated in the previous phase), applies the needed relocations (ASLR, Retpoline, and Import Optimization), and applies the new SLAT protection, allowing the page to be executable but not writable anymore.

  5. At this point, the section object has been created, and the NT memory manager needs to map the driver in its address space. It calls the Secure Kernel to allocate the privileged PTEs needed for describing the driver’s virtual address range. The Secure Kernel creates the NAR data structure. The NT memory manager then maps the physical pages of the driver, which have been previously verified, using the MiMapSystemImage routine.

Image Note

When a NAR is initialized for a runtime driver, part of the NTE table is filled for describing the new driver’s address space. The NTEs are not used for keeping track of a runtime driver’s virtual address range (its virtual pages are shared and not private), so the relative part of the NT address table is filled with invalid “reserved” NTEs.

While VTL 0 kernel virtual address ranges are represented using the NAR data structure, the Secure Kernel uses secure VADs (virtual address descriptors) to track user-mode virtual addresses in VTL 1. Secure VADs are created every time a new private virtual allocation is made, a binary image is mapped in the address space of a trustlet (secure process), a VBS enclave is created, or a module is mapped into an enclave’s address space. A secure VAD is similar to the NT kernel VAD and contains a descriptor of the VA range, a reference counter, some flags, and a pointer to the secure section, which has been created by SKCI. (The secure section pointer is set to 0 in case of secure VADs describing private virtual allocations.) More details about trustlets and VBS-based enclaves are discussed later in this chapter.

Page identity and the secure PFN database

After a driver is loaded and mapped correctly into VTL 0 memory, the NT memory manager needs to be able to manage its memory pages (for various reasons, like the paging out of a pageable driver’s section, the creation of private pages, the application of private fixups, and so on; see Chapter 5 in Part 1 for more details). Every time the NT memory manager operates on protected memory, it needs the cooperation of the Secure Kernel. Two main kinds of secure services are offered to the NT memory manager for operating with privileged memory: protected pages copy and protected pages removal.

A PAGE_IDENTITY data structure is the glue that allows the Secure Kernel to keep track of all the different kinds of pages. The data structure is composed of two fields: an Address Context and a Virtual Address. Every time the NT kernel calls the Secure Kernel for operating on privileged pages, it needs to specify the physical page number along with a valid PAGE_IDENTITY data structure describing what the physical page is used for. Through this data structure, the Secure Kernel can verify the requested page usage and decide whether to allow the request.
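
The exact layout of the PAGE_IDENTITY data structure is not documented; the following C sketch is only a plausible rendering of the two fields described above, with the per-page-type meaning of each field taken from Table 9-4.

    // Hypothetical layout of the PAGE_IDENTITY descriptor described in the text.
    // The address context is a secure image handle for kernel shared pages, a
    // secure process handle for trustlet/enclave pages, and 0 for kernel private
    // and placeholder pages (see Table 9-4).
    typedef struct _PAGE_IDENTITY {
        unsigned long long AddressContext;   // secure image/process handle, or 0
        unsigned long long VirtualAddress;   // RVA of the page, VA of the page, or 0
    } PAGE_IDENTITY;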

Table 9-4 shows the PAGE_IDENTITY data structure (second and third columns), and all the types of verification performed by the Secure Kernel on different memory pages:

  •     If the Secure Kernel receives a request to copy or to release a shared executable page of a runtime driver, it validates the secure image handle (specified by the caller) and gets its corresponding data structure (SECURE_IMAGE). It then uses the relative virtual address (RVA) as an index into the secure prototype array to obtain the page frame number (PFN) of the driver’s shared page. If the found PFN is equal to the caller’s specified one, the Secure Kernel allows the request; otherwise it blocks it.

  •     In a similar way, if the NT kernel requests to operate on a trustlet or an enclave page (more details about trustlets and secure enclaves are provided later in this chapter), the Secure Kernel uses the caller’s specified virtual address to verify that the secure PTE in the secure process page table contains the correct PFN.

  •     As introduced earlier in the section “The Secure Kernel memory manager,” for private kernel pages, the Secure Kernel locates the NTE starting from the caller’s specified virtual address and verifies that it contains a valid PFN, which must be the same as the caller’s specified one.

  •     Placeholder pages are free pages that are SLAT protected. The Secure Kernel verifies the state of a placeholder page by using the PFN database.

Table 9-4 Different page identities managed by the Secure Kernel

Page Type | Address Context | Virtual Address | Verification Structure
Kernel Shared | Secure Image Handle | RVA of the page | Secure Prototype PTE
Trustlet/Enclave | Secure Process Handle | Virtual Address of the Secure Process | Secure PTE
Kernel Private | 0 | Kernel Virtual Address of the page | NT Address Table Entry (NTE)
Placeholder | 0 | 0 | PFN entry

The Secure Kernel memory manager maintains a PFN database to represent the state of each physical page. A PFN entry in the Secure Kernel is much smaller compared to its NT equivalent; it basically contains the page state and the share counter. A physical page, from the Secure Kernel perspective, can be in one of the following states: invalid, free, shared, I/O, secured, or image (secured NT private).

The secured state is used for physical pages that are private to the Secure Kernel (the NT kernel can never claim them) or for physical pages that have been allocated by the NT kernel and later SLAT-protected by the Secure Kernel for storing executable code verified by Secure HVCI. Only secured nonprivate physical pages have a page identity.

When the NT kernel is going to page out a protected page, it asks the Secure Kernel for a page removal operation. The Secure Kernel analyzes the specified page identity and performs its verification (as explained earlier). In case the page identity refers to an enclave or a trustlet page, the Secure Kernel encrypts the page’s content before releasing it to the NT kernel, which will then store the page in the paging file. In this way, the NT kernel still has no way to read the real content of the private memory.

Secure memory allocation

As discussed in previous sections, when the Secure Kernel initially starts, it parses the firmware’s memory descriptor lists, with the goal of being able to allocate physical memory for its own use. In phase 1 of its initialization, the Secure Kernel can’t use the memory services provided by the NT kernel (the NT kernel indeed is still not initialized), so it uses free entries of the firmware’s memory descriptor lists for reserving 2-MB SLABs. A SLAB is a 2-MB block of contiguous physical memory, which is mapped by a single nested page table directory entry in the hypervisor. All the SLAB pages have the same SLAT protection. SLABs have been designed for performance considerations: by mapping a 2-MB chunk of physical memory using a single nested page entry in the hypervisor, the additional hardware memory address translation is faster and results in fewer cache misses on the SLAT table.

The first Secure Kernel page bundle is filled with 1 MB of the allocated SLAB memory. A page bundle is the data structure shown in Figure 9-37, which contains a list of contiguous free physical page frame numbers (PFNs). When the Secure Kernel needs memory for its own purposes, it allocates physical pages from a page bundle by removing one or more free page frames from the tail of the bundle’s PFN array. In this case, the Secure Kernel doesn’t need to check the firmware memory descriptors list until the bundle has been entirely consumed. When phase 3 of the Secure Kernel initialization is done, memory services of the NT kernel become available, and so the Secure Kernel frees any boot memory descriptor lists, retaining the physical memory pages previously located in bundles.

Image

Figure 9-37 A secure page bundle with 80 available pages. A bundle is composed of a header and a free PFNs array.
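
A possible C rendering of the page bundle shown in Figure 9-37 follows. The structure and field names are hypothetical; only the header-plus-free-PFN-array layout and the 256-page (1-MB) capacity come from the description in the text.

    #define BUNDLE_MAX_PFNS 256     // 1 MB of 4-KB pages, per the text

    // Hypothetical page bundle layout consistent with Figure 9-37: a small header
    // followed by an array of free page frame numbers. Allocation removes PFNs
    // from the tail; freeing appends PFNs until the array is full.
    typedef struct _SK_PAGE_BUNDLE {
        unsigned long      FreeCount;                  // valid entries in FreePfns
        unsigned long      SlatProtection;             // No access, Read-only, or Read-Execute
        unsigned long long FreePfns[BUNDLE_MAX_PFNS];  // free physical page frame numbers
    } SK_PAGE_BUNDLE;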

Future secure memory allocations use normal calls provided by the NT kernel. Page bundles have been designed to minimize the number of normal calls needed for memory allocation. When a bundle gets fully allocated, it contains no free pages (all its pages have been assigned), and a new one is generated by asking the NT kernel for 1 MB of contiguous physical pages (through the ALLOC_PHYSICAL_PAGES normal call). The physical memory is allocated by the NT kernel from the proper SLAB.

In the same way, every time the Secure Kernel frees some of its private memory, it stores the corresponding physical pages in the correct bundle by growing its PFN array up to the limit of 256 free pages. When the array is entirely filled and the bundle becomes free, a new work item is queued. The work item zeroes out all the pages and emits a FREE_PHYSICAL_PAGES normal call, which ends up executing the MmFreePagesFromMdl function of the NT memory manager.

Every time enough pages are moved into and out of a bundle, they are fully protected in VTL 0 by using the SLAT (this procedure is called “securing the bundle”). The Secure Kernel supports three kinds of bundles, which all allocate memory from different SLABs: No access, Read-only, and Read-Execute.

Hot patching

Several years ago, the 32-bit versions of Windows supported hot patching of the operating system’s components. Patchable functions contained a redundant 2-byte opcode in their prolog and some padding bytes located before the function itself. This allowed the NT kernel to dynamically replace the initial opcode with an indirect jump, which uses the free space provided by the padding, to divert the code to a patched function residing in a different module. The feature was heavily used by Windows Update, which allowed the system to be updated without the need for an immediate reboot of the machine. When moving to 64-bit architectures, this was no longer possible due to various problems. Kernel patch protection was a good example: there was no longer a reliable way to modify a protected kernel mode binary and to allow PatchGuard to be updated without exposing some of its private interfaces, and exposed PatchGuard interfaces could easily have been exploited by an attacker with the goal of defeating the protection.

The Secure Kernel has solved all the problems related to 64-bit architectures and has reintroduced to the OS the ability to hot patch kernel binaries. When the Secure Kernel is enabled, the following types of executable images can be hot patched:

  •     VTL 0 user-mode modules (both executables and libraries)

  •     Kernel mode drivers, HAL, and the NT kernel binary, protected or not by PatchGuard

  •     The Secure Kernel binary and its dependent modules, which run in VTL 1 Kernel mode

  •     The hypervisor (the Intel, AMD, and ARM64 versions)

Patch binaries created for targeting software running in VTL 0 are called normal patches, whereas the others are called secure patches. If the Secure Kernel is not enabled, only user mode applications can be patched.

A hot patch image is a standard Portable Executable (PE) binary that includes the hot patch table, the data structure used for tracking the patch functions. The hot patch table is linked in the binary through the image load configuration data directory. It contains one or more descriptors that describe each patchable base image, which is identified by its checksum and time date stamp. (In this way, a hot patch is compatible only with the correct base images. The system can’t apply a patch to the wrong image.) The hot patch table also includes a list of functions or global data chunks that need to be updated in the base or in the patch image; we describe the patch engine shortly. Every entry in this list contains the functions’ offsets in the base and patch images and the original bytes of the base function that will be replaced.

Multiple patches can be applied to a base image, but the patch application is idempotent. The same patch may be applied multiple times, or different patches may be applied in sequence. Regardless, the last applied patch will be the active patch for the base image. When the system needs to apply a hot patch, it uses the NtManageHotPatch system call, which is employed to install, remove, or manage hot patches. (The system call supports different “patch information” classes for describing all the possible operations.) A hot patch can be installed globally for the entire system, or, if a patch is for user mode code (VTL 0), for all the processes that belong to a specific user session.

When the system requests the application of a patch, the NT kernel locates the hot patch table in the patch binary and validates it. It then uses the DETERMINE_HOT_PATCH_TYPE secure call to securely determine the type of patch. In the case of a secure patch, only the Secure Kernel can apply it, so the APPLY_HOT_PATCH secure call is used; no other processing by the NT kernel is needed. In all the other cases, the NT kernel first tries to apply the patch to a kernel driver. It cycles through each loaded kernel module, searching for a base image that has the same checksum described by one of the patch image’s hot patch descriptors.

Hot patching is enabled only if the HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\HotPatchTableSize registry value is a multiple of a standard memory page size (4,096). Indeed, when hot patching is enabled, every image that is mapped in the virtual address space needs to have a certain amount of virtual address space reserved immediately after the image itself. This reserved space is used for the image’s hot patch address table (HPAT, not to be confused with the hot patch table). The HPAT is used to minimize the amount of padding necessary for each function to be patched by storing the address of the new function in the patched image. When patching a function, the HPAT location will be used to perform an indirect jump from the original function in the base image to the patched function in the patch image (note that for Retpoline compatibility, another kind of Retpoline routine is used instead of an indirect jump).
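
As a quick way to check whether a system satisfies the registry precondition just described, something like the following user-mode snippet can be used. The value name and key path come from the text; interpreting the value as a simple REG_DWORD is an assumption for illustration purposes.

    #include <windows.h>
    #include <stdio.h>

    // Link with advapi32.lib. Reads HotPatchTableSize and checks whether it is a
    // nonzero multiple of the standard 4,096-byte page size.
    int main(void)
    {
        DWORD size = 0, cb = sizeof(size);
        LSTATUS status = RegGetValueW(
            HKEY_LOCAL_MACHINE,
            L"SYSTEM\\CurrentControlSet\\Control\\Session Manager\\Memory Management",
            L"HotPatchTableSize",
            RRF_RT_REG_DWORD, NULL, &size, &cb);

        if (status != ERROR_SUCCESS)
            printf("HotPatchTableSize is not set (hot patching disabled).\n");
        else
            printf("HotPatchTableSize = %lu -> hot patching %s\n", size,
                   (size != 0 && size % 4096 == 0) ? "enabled" : "disabled");
        return 0;
    }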

When the NT kernel finds a kernel mode driver suitable for the patch, it loads and maps the patch binary in the kernel address space and creates the related loader data table entry (for more details, see Chapter 12). It then scans each memory page of both the base and the patch images and locks in memory the ones involved in the hot patch (this is important; in this way, the pages can’t be paged out to disk while the patch application is in progress). It finally emits the APPLY_HOT_PATCH secure call.

The real patch application process starts in the Secure Kernel. The latter captures and verifies the hot patch table of the patch image (by remapping the patch image also in VTL 1) and locates the base image’s NAR (see the previous section, “The Secure Kernel memory manager” for more details about the NARs), which also tells the Secure Kernel whether the image is protected by PatchGuard. The Secure Kernel then verifies whether enough reserved space is available in the image HPAT. If so, it allocates one or more free physical pages (getting them from the secure bundle or using the ALLOC_PHYSICAL_PAGES normal call) that will be mapped in the reserved space. At this point, if the base image is protected, the Secure Kernel starts a complex process that updates the PatchGuard’s internal state for the new patched image and finally calls the patch engine.

The kernel’s patch engine performs the following high-level operations, which are all described by a different entry type in the hot patch table:

  1. Patches all calls from patched functions in the patch image so that they jump to the corresponding functions in the base image. This ensures that all unpatched code always executes in the original base image. For example, if function A calls B in the base image and the patch changes function A but not function B, then the patch engine will update function B in the patch to jump to function B in the base image.

  2. Patches the necessary references to global variables in patched functions to point to the corresponding global variables in the base image.

  3. Patches the necessary import address table (IAT) references in the patch image by copying the corresponding IAT entries from the base image.

  4. Atomically patches the necessary functions in the base image to jump to the corresponding function in the patch image. As soon as this is done for a given function in the base image, all new invocations of that function will execute the new patched function code in the patch image. When the patched function returns, it will return to the caller of the original function in the base image.

Since the pointers of the new functions are 64 bits (8 bytes) wide, the patch engine inserts each pointer in the HPAT, which is located at the end of the binary. In this way, it needs only 5 bytes for placing the indirect jump in the padding space located at the beginning of each function. (The process has been simplified here; Retpoline-compatible hot patches require a compatible Retpoline, and the HPAT is split into code and data pages.)

As shown in Figure 9-38, the patch engine is compatible with different kinds of binaries. If the NT kernel has not found any patchable kernel mode module, it restarts the search through all the user mode processes and applies a similar procedure to properly hot patch a compatible user mode executable or library.

Image

Figure 9-38 A schema of the hot patch engine executing on different types of binaries.

Isolated User Mode

Isolated User Mode (IUM), the services provided by the Secure Kernel to its secure processes (trustlets), and the trustlets’ general architecture are covered in Chapter 3 of Part 1. In this section, we continue the discussion starting from there, and we move on to describe some services provided by Isolated User Mode, like secure devices and VBS enclaves.

As introduced in Chapter 3 of Part 1, when a trustlet is created in VTL 1, it usually maps in its address space the following libraries:

  •     Iumdll.dll The IUM Native Layer DLL implements the secure system call stub. It’s the equivalent of Ntdll.dll of VTL 0.

  •     Iumbase.dll The IUM Base Layer DLL is the library that implements most of the secure APIs that can be consumed exclusively by VTL 1 software. It provides various services to each secure process, like secure identification, communication, cryptography, and secure memory management. Trustlets do not usually call secure system calls directly, but they pass through Iumbase.dll, which is the equivalent of kernelbase.dll in VTL 0.

  •     IumCrypt.dll Exposes public/private key encryption functions used for signing and integrity verification. Most of the crypto functions exposed to VTL 1 are implemented in Iumbase.dll; only a small number of specialized encryption routines are implemented in IumCrypt. LsaIso is the main consumer of the services exposed by IumCrypt, which is not loaded in many other trustlets.

  •     Ntdll.dll, Kernelbase.dll, and Kernel32.dll A trustlet can be designed to run both in VTL 1 and VTL 0. In that case, it should only use routines implemented in the standard VTL 0 API surface. Not all the services available to VTL 0 are also implemented in VTL 1. For example, a trustlet can never perform any registry or file I/O, but it can use synchronization routines, ALPC, thread APIs, and structured exception handling, and it can manage virtual memory and section objects. Almost all the services offered by the kernelbase and kernel32 libraries perform system calls through Ntdll.dll. In VTL 1, these kinds of system calls are “translated” into normal calls and redirected to the VTL 0 kernel. (We discussed normal calls in detail earlier in this chapter.) Normal calls are often used by IUM functions and by the Secure Kernel itself. This explains why ntdll.dll is always mapped in every trustlet.

  •     Vertdll.dll The VSM enclave runtime DLL manages the lifetime of a VBS enclave. Only limited services are available to software executing in a secure enclave; this library implements all the enclave services exposed to the software running in the enclave and is normally not loaded for standard VTL 1 processes.

With this knowledge in mind, let’s look at what is involved in the trustlet creation process, starting from the CreateProcess API in VTL 0, whose execution flow has already been described in detail in Chapter 3.

Trustlets creation

As discussed multiple times in the previous sections, the Secure Kernel depends on the NT kernel for performing various operations. Creating a trustlet follows the same rule: It is an operation that is managed by both the Secure Kernel and NT kernel. In Chapter 3 of Part 1, we presented the trustlet structure and its signing requirement, and we described its important policy metadata. Furthermore, we described the detailed flow of the CreateProcess API, which is still the starting point for the trustlet creation.

To properly create a trustlet, an application should specify the CREATE_SECURE_PROCESS creation flag when calling the CreateProcess API. Internally, the flag is converted to the PS_CP_SECURE_PROCESS NT attribute and passed to the NtCreateUserProcess native API. After NtCreateUserProcess has successfully opened the image to be executed, it creates the section object of the image by specifying a special flag, which instructs the memory manager to use Secure HVCI to validate its content. This allows the Secure Kernel to create the SECURE_IMAGE data structure used to describe the PE image verified through Secure HVCI.
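
A minimal sketch of requesting a secure process follows. The image path is purely illustrative (the call succeeds only for properly signed trustlet images on a VBS-enabled system), and the CREATE_SECURE_PROCESS definition is supplied only in case the SDK headers in use do not already declare it.

    #include <windows.h>

    #ifndef CREATE_SECURE_PROCESS
    #define CREATE_SECURE_PROCESS 0x00400000    // assumed value from newer SDK headers
    #endif

    int main(void)
    {
        STARTUPINFOW si = { sizeof(si) };
        PROCESS_INFORMATION pi = { 0 };
        WCHAR cmdline[] = L"C:\\MyTrustlet.exe";   // hypothetical trustlet image

        // Ask the kernel to create the process as a trustlet (VTL 1 secure process).
        // The call fails unless VBS is enabled and the image meets the trustlet
        // signing and policy requirements described in the text.
        if (CreateProcessW(NULL, cmdline, NULL, NULL, FALSE,
                           CREATE_SECURE_PROCESS, NULL, NULL, &si, &pi))
        {
            CloseHandle(pi.hThread);
            CloseHandle(pi.hProcess);
        }
        return 0;
    }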

The NT kernel creates the required process’s data structures and initial VTL 0 address space (page directories, hyperspace, and working set) as for normal processes, and if the new process is a trustlet, it emits a CREATE_PROCESS secure call. The Secure Kernel manages the latter by creating the secure process object and relative data structure (named SEPROCESS). The Secure Kernel links the normal process object (EPROCESS) with the new secure one and creates the initial secure address space by allocating the secure page table and duplicating the root entries that describe the kernel portion of the secure address space in the upper half of it.

The NT kernel concludes the setup of the empty process address space and maps the Ntdll library into it (see Stage 3D of Chapter 3 of Part 1 for more details). When doing so for secure processes, the NT kernel invokes the INITIALIZE_PROCESS secure call to finish the setup in VTL 1. The Secure Kernel copies the trustlet identity and trustlet attributes specified at process creation time into the new secure process, creates the secure handle table, and maps the secure shared page into the address space.

The last step needed for the secure process is the creation of the secure thread. The initial thread object is created as for normal processes in the NT kernel: When the NtCreateUserProcess calls PspInsertThread, it has already allocated the thread kernel stack and inserted the necessary data to start from the KiStartUserThread kernel function (see Stage 4 in Chapter 3 of Part 1 for further details). If the process is a trustlet, the NT kernel emits a CREATE_THREAD secure call for performing the final secure thread creation. The Secure Kernel attaches to the new secure process’s address space and allocates and initializes a secure thread data structure, a thread’s secure TEB, and a kernel stack. The Secure Kernel fills the thread’s kernel stack by inserting the thread’s first initial kernel routine: SkpUserThreadStart. It then initializes the machine-dependent hardware context for the secure thread, which specifies the actual image start address and the address of the first user mode routine. Finally, it associates the normal thread object with the newly created secure one, inserts the thread into the secure threads list, and marks the thread as runnable.

When the normal thread object is selected to run by the NT kernel scheduler, the execution still starts in the KiStartUserThread function in VTL 0. The latter lowers the thread’s IRQL and calls the system initial thread routine (PspUserThreadStartup). The execution proceeds as for normal threads until the point where the NT kernel would normally set up the initial thunk context. Instead of doing that, it starts the Secure Kernel dispatch loop by calling the VslpEnterIumSecureMode routine and specifying the RESUMETHREAD secure call. The loop will exit only when the thread is terminated. The initial secure call is processed by the normal call dispatcher loop in VTL 1, which identifies the “resume thread” entry reason to VTL 1, attaches to the new process’s address space, and switches to the new secure thread stack. The Secure Kernel in this case does not call the IumInvokeSecureService dispatcher function because it knows that the initial thread function is on the stack, so it simply returns to the address located in the stack, which points to the VTL 1 secure initial routine, SkpUserThreadStart.

SkpUserThreadStart, similarly to standard VTL 0 threads, sets up the initial thunk context to run the image loader initialization routine (LdrInitializeThunk in Ntdll.dll), as well as the system-wide thread startup stub (RtlUserThreadStart in Ntdll.dll). These steps are done by editing the context of the thread in place and then issuing an exit from system service operation, which loads the specially crafted user context and returns to user mode. The newborn secure thread initialization proceeds as for normal VTL 0 threads; the LdrInitializeThunk routine initializes the loader and its needed data structures. Once the function returns, NtContinue restores the new user context. Thread execution now truly starts: RtlUserThreadStart uses the address of the actual image entry point and the start parameter and calls the application’s entry point.

Image Note

A careful reader may have noticed that the Secure Kernel doesn’t do anything to protect the new trustlet’s binary image. This is because the shared memory that describes the trustlet’s base binary image is still accessible to VTL 0 by design.

Let’s assume that a trustlet wants to write private data located in the image’s global data. The PTEs that map the writable data section of the image global data are marked as copy-on-write, so an access fault is generated by the processor when the write happens. The fault belongs to a user mode address range (remember that no NARs are used to track shared pages). The Secure Kernel page fault handler transfers the execution to the NT kernel (through a normal call), which will allocate a new page, copy the content of the old one in it, and protect it through the SLAT (using a protected copy operation; see the section “The Secure Kernel memory manager” earlier in this chapter for further details).

Secure devices

VBS provides the ability for drivers to run part of their code in the secure environment. The Secure Kernel itself can’t be extended to support kernel drivers; its attack surface would become too large. Furthermore, Microsoft wouldn’t allow external companies to introduce possible bugs in a component used primarily for security purposes.

The User-Mode Driver Framework (UMDF) solves the problem by introducing the concept of driver companions, which can run in user mode in either VTL 0 or VTL 1. When they run in VTL 1, they take the name of secure companions. A secure companion takes the subset of the driver’s code that needs to run in a different mode (in this case IUM) and loads it as an extension, or companion, of the main KMDF driver. Standard WDM drivers are also supported, though. The main driver, which still runs in VTL 0 kernel mode, continues to manage the device’s PnP and power state, but it needs the ability to reach out to its companion to perform tasks that must be performed in IUM.

Although the Secure Driver Framework (SDF) mentioned in Chapter 3 is deprecated, Figure 9-39 shows the architecture of the new UMDF secure companion model, which is still built on top of the same UMDF core framework (Wudfx02000.dll) used in VTL 0 user mode. The latter leverages services provided by the UMDF secure companion host (WUDFCompanionHost.exe) for loading and managing the driver companion, which is distributed through a DLL. The UMDF secure companion host manages the lifetime of the secure companion and encapsulates many UMDF functions that deal specifically with the IUM environment.

Image

Figure 9-39 The WDF driver’s secure companion architecture.

A secure companion usually comes associated with the main driver that runs in the VTL 0 kernel. It must be properly signed (including the IUM EKU in the signature, as for every trustlet) and must declare its capabilities in its metadata section. A secure companion has full ownership of its managed device (this explains why the device is often called a secure device). A secure device controlled by a secure companion supports the following features:

  •     Secure DMA The driver can instruct the device to perform DMA transfer directly in protected VTL 1 memory, which is not accessible to VTL 0. The secure companion can process the data sent or received through the DMA interface and can then transfer part of the data to the VTL 0 driver through the standard KMDF communication interface (ALPC). The IumGetDmaEnabler and IumDmaMapMemory secure system calls, exposed through Iumbase.dll, allow the secure companion to map physical DMA memory ranges directly in VTL 1 user mode.

  •     Memory mapped IO (MMIO) The secure companion can request the device to map its accessible MMIO range in VTL 1 (user mode). It can then access the memory-mapped device’s registers directly in IUM. MapSecureIo and the ProtectSecureIo APIs expose this feature.

  •     Secure sections The companion can create (through the CreateSecureSection API) and map secure sections, which represent memory that can be shared between trustlets and the main driver running in VTL 0. Furthermore, the secure companion can specify a different type of SLAT protection in case the memory is accessed through the secure device (via DMA or MMIO).

A secure companion can’t directly respond to device interrupts, which need to be mapped and managed by the associated kernel mode driver in VTL 0. In the same way, the kernel mode driver still needs to act as the high-level interface for the system and user mode applications by managing all the received IOCTLs. The main driver communicates with its secure companion by sending WDF tasks using the UMDF Task Queue object, which internally uses the ALPC facilities exposed by the WDF framework.

A typical KMDF driver registers its companion via INF directives. WDF automatically starts the driver’s companion in the context of the driver’s call to WdfDeviceCreate (which, for Plug and Play drivers, usually happens in the AddDevice callback) by sending an ALPC message to the UMDF driver manager service, which spawns a new WUDFCompanionHost.exe trustlet by calling the NtCreateUserProcess native API. The UMDF secure companion host then loads the secure companion DLL in its address space. Another ALPC message is sent from the UMDF driver manager to the WUDFCompanionHost, with the goal of actually starting the secure companion. The DriverEntry routine of the companion performs the driver’s secure initialization and creates the WDFDRIVER object through the classic WdfDriverCreate API.

The framework then calls the AddDevice callback routine of the companion in VTL 1, which usually creates the companion’s device through the new WdfDeviceCompanionCreate UMDF API. The latter transfers the execution to the Secure Kernel (through the IumCreateSecureDevice secure system call), which creates the new secure device. From this point on, the secure companion has full ownership of its managed device. Usually, the first thing that the companion does after the creation of the secure device is to create the task queue object (WDFTASKQUEUE) used to process any incoming tasks delivered by its associated VTL 0 driver. The execution control returns to the kernel mode driver, which can now send new tasks to its secure companion.

This model is also supported by WDM drivers. WDM drivers can use the KMDF’s miniport mode to interact with a special filter driver, WdmCompanionFilter.sys, which is attached at a lower-level position of the device’s stack. The WdmCompanionFilter driver allows WDM drivers to use the task queue object for sending tasks to the secure companion.

VBS-based enclaves

In Chapter 5 of Part 1, we discuss Software Guard Extensions (SGX), a hardware technology that allows the creation of protected memory enclaves, which are secure zones in a process address space where code and data are protected (encrypted) by the hardware from code running outside the enclave. The technology, which was first introduced in the sixth generation Intel Core processors (Skylake), has suffered from some problems that prevented its broad adoption. (Furthermore, AMD released another technology called Secure Encrypted Virtualization, which is not compatible with SGX.)

To overcome these issues, Microsoft released VBS-based enclaves, which are secure enclaves whose isolation guarantees are provided using the VSM infrastructure. Code and data inside a VBS-based enclave are visible only to the enclave itself (and the VSM Secure Kernel) and are inaccessible to the NT kernel, VTL 0 processes, and secure trustlets running in the system.

A secure VBS-based enclave is created by establishing a single virtual address range within a normal process. Code and data are then loaded into the enclave, after which the enclave is entered for the first time by transferring control to its entry point via the Secure Kernel. The Secure Kernel first verifies that all code and data are authentic and are authorized to run inside the enclave by using image signature verification on the enclave image. If the signature checks pass, then the execution control is transferred to the enclave entry point, which has access to all of the enclave’s code and data. By default, the system only supports the execution of enclaves that are properly signed. This precludes the possibility that unsigned malware can execute on a system outside the view of anti-malware software, which is incapable of inspecting the contents of any enclave.

During execution, control can transfer back and forth between the enclave and its containing process. Code executing inside of an enclave has access to all data within the virtual address range of the enclave. Furthermore, it has read and write access to the containing unsecure process’s address space. All memory within the enclave’s virtual address range is inaccessible to the containing process. If multiple enclaves exist within a single host process, each enclave can access only its own memory and the memory that is accessible to the host process.

As for hardware enclaves, when code is running in an enclave, it can obtain a sealed enclave report, which can be used by a third-party entity to validate that the code is running with the isolation guarantees of a VBS enclave, and which can further be used to validate the specific version of code running. This report includes information about the host system, the enclave itself, and all DLLs that may have been loaded into the enclave, as well as information indicating whether the enclave is executing with debugging capabilities enabled.

A VBS-based enclave is distributed as a DLL, which has certain specific characteristics:

  •     It is signed with an Authenticode signature, and the leaf certificate includes a valid EKU that permits the image to be run as an enclave. The root authority that issued the digital certificate should be Microsoft, or a third-party signing authority covered by a certificate manifest that’s countersigned by Microsoft. This implies that third-party companies can sign and run their own enclaves. Valid digital signature EKUs are the IUM EKU (1.3.6.1.4.1.311.10.3.37) for internal Windows-signed enclaves or the Enclave EKU (1.3.6.1.4.1.311.10.3.42) for all third-party enclaves.

  •     It includes an enclave configuration section (represented by an IMAGE_ENCLAVE_CONFIG data structure), which describes information about the enclave and which is linked to its image’s load configuration data directory.

  •     It includes the correct Control Flow Guard (CFG) instrumentation.

The enclave’s configuration section is important because it includes the information needed to properly run and seal the enclave: the unique family ID and image ID, which are specified by the enclave’s author and identify the enclave binary; the secure version number; and the enclave’s policy information (like the expected virtual size, the maximum number of threads that can run, and the debuggability of the enclave). Furthermore, the enclave’s configuration section includes the list of images that may be imported by the enclave, along with their identity information. An enclave’s imported module can be identified by a combination of the family ID and image ID, or by a combination of the generated unique ID, which is calculated starting from the hash of the binary, and the author ID, which is derived from the certificate used to sign the enclave. (This value expresses the identity of who has constructed the enclave.) The imported module descriptor must also include the minimum secure version number.

The Secure Kernel offers some basic system services to enclaves through the VBS enclave runtime DLL, Vertdll.dll, which is mapped in the enclave address space. These services include: a limited subset of the standard C runtime library, the ability to allocate or free secure memory within the address range of the enclave, synchronization services, structured exception handling support, basic cryptographic functions, and the ability to seal data.

Enclave lifecycle

In Chapter 5 of Part 1, we discussed the lifecycle of a hardware enclave (SGX-based). The lifecycle of a VBS-based enclave is similar; Microsoft has enhanced the already available enclave APIs to support the new type of VBS-based enclaves.

Step 1: Creation

An application creates a VBS-based enclave by specifying the ENCLAVE_TYPE_VBS flag to the CreateEnclave API. The caller should specify an owner ID, which identifies the owner of the enclave. The enclave creation code, in the same way as for hardware enclaves, ends up calling NtCreateEnclave in the kernel. The latter checks the parameters, copies the passed-in structures, and attaches to the target process in case the enclave is to be created in a different process than the caller’s. The MiCreateEnclave function allocates an enclave-type VAD describing the enclave virtual memory range and selects a base virtual address if not specified by the caller. The kernel allocates the memory manager’s VBS enclave data structure and the per-process enclave hash table, used for fast lookup of the enclave starting from its number. If the enclave is the first created for the process, the system also creates an empty secure process (which acts as a container for the enclaves) in VTL 1 by using the CREATE_PROCESS secure call (see the earlier section “Trustlets creation” for further details).

The CREATE_ENCLAVE secure call handler in VTL 1 performs the actual work of the enclave creation: it allocates the secure enclave key data structure (SKMI_ENCLAVE), sets the reference to the container secure process (which has just been created by the NT kernel), and creates the secure VAD describing the entire enclave virtual address space (the secure VAD contains similar information to its VTL 0 counterpart). This VAD is inserted in the containing process’s VAD tree (and not in the enclave itself). An empty virtual address space for the enclave is created in the same way as for its containing process: the page table root is filled by system entries only.
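
The user-mode side of this step can be sketched with the public CreateEnclave API. The enclave size, the owner ID value, and the lack of error handling are purely illustrative.

    #include <windows.h>    // CreateEnclave is declared in enclaveapi.h, pulled in by windows.h
    #include <string.h>

    // Create a VBS enclave in the current process. The owner ID is an arbitrary
    // 32-byte value chosen by the caller, as described in the text.
    LPVOID CreateVbsEnclave(void)
    {
        ENCLAVE_CREATE_INFO_VBS info = { 0 };
        DWORD enclaveError = 0;

        info.Flags = 0;                                   // no special flags for this sketch
        memset(info.OwnerID, 0xA5, sizeof(info.OwnerID)); // illustrative owner ID

        return CreateEnclave(GetCurrentProcess(),
                             NULL,              // let the system choose the base address
                             0x10000000,        // virtual size reserved for the enclave (example)
                             0,                 // initial commitment (not used for VBS enclaves)
                             ENCLAVE_TYPE_VBS,
                             &info, sizeof(info),
                             &enclaveError);
    }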

Step 2: Loading modules into the enclave

Unlike hardware-based enclaves, the parent process can load only modules into the enclave, not arbitrary data. This causes each page of the image to be copied into the address space in VTL 1; each image page in the VTL 1 enclave will be a private copy. At least one module (which acts as the main enclave image) needs to be loaded into the enclave; otherwise, the enclave can’t be initialized. After the VBS enclave has been created, an application calls the LoadEnclaveImage API, specifying the enclave base address and the name of the module that must be loaded in the enclave. The Windows Loader code (in Ntdll.dll) searches the specified DLL name, opens and validates its binary file, and creates a section object that is mapped with read-only access rights in the calling process.

After the loader maps the section, it parses the image’s import address table with the goal of creating a list of the dependent modules (imported, delay loaded, and forwarded). For each found module, the loader checks whether there is enough space in the enclave for mapping it and calculates the correct image base address. As shown in Figure 9-40, which represents the System Guard Runtime Attestation enclave, modules in the enclave are mapped using a top-down strategy. This means that the main image is mapped at the highest possible virtual address, and all the dependent ones are mapped at lower addresses, one next to the other. At this stage, for each module, the Windows Loader calls the NtLoadEnclaveData kernel API.

Image

Figure 9-40 The System Guard Runtime Attestation secure enclave (note the empty space at the base of the enclave).
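
From the application’s point of view, this step boils down to one call per module through the LoadEnclaveImage API; per the description above, the Windows Loader then resolves the module’s dependencies (such as the platform DLL, Vertdll.dll) on its own. The module name in this sketch is illustrative.

    #include <windows.h>    // LoadEnclaveImageW is declared in enclaveapi.h

    // Load the primary enclave module into an enclave previously created with
    // CreateEnclave. The DLL must satisfy the signing requirements described in
    // the text; "MyEnclave.dll" is a hypothetical name.
    BOOL LoadEnclaveModules(LPVOID enclaveBase)
    {
        return LoadEnclaveImageW(enclaveBase, L"MyEnclave.dll");
    }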

For loading the specified image in the VBS enclave, the kernel starts a complex process that allows the shared pages of its section object to be copied in the private pages of the enclave in VTL 1. The MiMapImageForEnclaveUse function gets the control area of the section object and validates it through SKCI. If the validation fails, the process is interrupted, and an error is returned to the caller. (All the enclave’s modules should be correctly signed as discussed previously.) Otherwise, the system attaches to the secure system process and maps the image’s section object in its address space in VTL 0. The shared pages of the module at this time could be valid or invalid; see Chapter 5 of Part 1 for further details. It then commits the virtual address space of the module in the containing process. This creates private VTL 0 paging data structures for demand-zero PTEs, which will be later populated by the Secure Kernel when the image is loaded in VTL 1.

The LOAD_ENCLAVE_MODULE secure call handler in VTL 1 obtains the SECURE_IMAGE of the new module (created by SKCI) and verifies whether the image is suitable for use in a VBS-based enclave (by verifying the digital signature characteristics). It then attaches to the secure system process in VTL 1 and maps the secure image at the same virtual address previously mapped by the NT kernel. This allows the sharing of the prototype PTEs from VTL 0. The Secure Kernel then creates the secure VAD that describes the module and inserts it into the VTL 1 address space of the enclave. It finally cycles through each of the module’s section prototype PTEs. For each nonpresent prototype PTE, it attaches to the secure system process and uses the GET_PHYSICAL_PAGE normal call to invoke the NT page fault handler (MmAccessFault), which brings the shared page into memory. The Secure Kernel performs a similar process for the private enclave pages, which have been previously committed by the NT kernel in VTL 0 through demand-zero PTEs. The NT page fault handler in this case allocates zeroed pages. The Secure Kernel copies the content of each shared physical page into each new private page and applies the private relocations if needed.

The loading of the module in the VBS-based enclave is complete. The Secure Kernel applies the SLAT protection to the private enclave pages (the NT kernel has no access to the image’s code and data in the enclave), unmaps the shared section from the secure system process, and yields the execution to the NT kernel. The Loader can now proceed with the next module.

Step 3: Enclave initialization

After all the modules have been loaded into the enclave, an application initializes the enclave using the InitializeEnclave API, and specifies the maximum number of threads supported by the enclave (which will be bound to threads able to perform enclave calls in the containing process). The Secure Kernel’s INITIALIZE_ENCLAVE secure call’s handler verifies that the policies specified during enclave creation are compatible with the policies expressed in the configuration information of the primary image, verifies that the enclave’s platform library is loaded (Vertdll.dll), calculates the final 256-bit hash of the enclave (used for generating the enclave sealed report), and creates all the secure enclave threads. When the execution control is returned to the Windows Loader code in VTL 0, the system performs the first enclave call, which executes the initialization code of the platform DLL.
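
In user mode, the initialization step looks like the following sketch; the thread count is an arbitrary example value.

    #include <windows.h>    // InitializeEnclave is declared in enclaveapi.h

    // Initialize a VBS enclave after all its modules have been loaded. ThreadCount
    // bounds how many host threads can call into the enclave, as described above.
    BOOL InitVbsEnclave(LPVOID enclaveBase)
    {
        ENCLAVE_INIT_INFO_VBS initInfo = { 0 };
        DWORD enclaveError = 0;

        initInfo.Length = sizeof(initInfo);
        initInfo.ThreadCount = 4;       // example: up to four enclave threads

        return InitializeEnclave(GetCurrentProcess(), enclaveBase,
                                 &initInfo, sizeof(initInfo), &enclaveError);
    }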

Step 4: Enclave calls (inbound and outbound)

After the enclave has been correctly initialized, an application can make an arbitrary number of calls into the enclave. All the callable functions in the enclave need to be exported. An application can call the standard GetProcAddress API to get the address of the enclave’s function and then use the CallEnclave routine for transferring the execution control to the secure enclave. In this scenario, which describes an inbound call, the NtCallEnclave kernel routine performs the thread selection algorithm, which binds the calling VTL 0 thread to an enclave thread, according to the following rules:

  •     If the normal thread has not previously been called by the enclave (enclaves support nested calls), an arbitrary idle enclave thread is selected for execution. If no idle enclave thread is available, the call blocks until one becomes available (if the caller requested waiting; otherwise, the call simply fails).

  •     If the normal thread was previously called by the enclave, the call into the enclave is made on the same enclave thread that issued the previous call to the host.

A list of enclave thread descriptors is maintained by both the NT kernel and the Secure Kernel. When a normal thread is bound to an enclave thread, the enclave thread is inserted into another list, called the bound threads list. Enclave threads tracked by the bound threads list are currently running and are no longer available.

After the thread selection algorithm succeeds, the NT kernel emits the CALLENCLAVE secure call. The Secure Kernel creates a new stack frame for the enclave and returns to user mode. The first user-mode function executed in the context of the enclave is RtlEnclaveCallDispatcher. If the enclave call is the first one ever emitted, the dispatcher transfers execution to the initialization routine of the VSM enclave runtime DLL (Vertdll.dll), which initializes the CRT, the loader, and all the services provided to the enclave; it finally calls the DllMain function of the enclave’s main module and of all its dependent images (specifying a DLL_PROCESS_ATTACH reason).

In normal situations, where the enclave platform DLL has already been initialized, the enclave dispatcher invokes the DllMain of each module with a DLL_THREAD_ATTACH reason, verifies whether the specified address of the target enclave’s function is valid, and, if so, finally calls the target function. When the target enclave’s routine finishes its execution, it returns to VTL 0 by calling back into the containing process. To do this, it still relies on the enclave platform DLL, which again calls the NtCallEnclave kernel routine. Even though the latter is implemented slightly differently in the Secure Kernel, it adopts a similar strategy for returning to VTL 0. The enclave itself can also emit enclave calls to execute functions in the context of the unsecure containing process. In this scenario (which describes an outbound call), the enclave code uses the CallEnclave routine and specifies the address of an exported function in the containing process’s main module.
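The following minimal sketch shows an inbound call from the containing process. It assumes that the primary enclave image is mapped at the enclave’s base address, so that the base can be passed to GetProcAddress as a module handle, and it uses a hypothetical exported routine named SecureComputation. An outbound call from enclave code uses the same CallEnclave API, passing the address of a function exported by the containing process’s main module:

#include <windows.h>

// Minimal sketch of an inbound enclave call. "SecureComputation" is a
// hypothetical routine exported by the enclave's primary module.
LPVOID CallIntoEnclave(PVOID enclaveBase, LPVOID parameter)
{
    // Enclave routines must be exported; here the enclave base is treated as a
    // module handle because the primary image is assumed to be mapped there.
    FARPROC routine = GetProcAddress((HMODULE)enclaveBase, "SecureComputation");
    if (routine == NULL)
        return NULL;

    LPVOID returnValue = NULL;
    // fWaitForThread = TRUE: if no idle enclave thread is available, wait for
    // one instead of failing the call.
    if (!CallEnclave((LPENCLAVE_ROUTINE)routine, parameter, TRUE, &returnValue))
        return NULL;
    return returnValue;
}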

Step 5: Termination and destruction

When termination of an entire enclave is requested through the TerminateEnclave API, all threads executing inside the enclave will be forced to return to VTL 0. Once termination of an enclave is requested, all further calls into the enclave will fail. As threads terminate, their VTL 1 thread state (including thread stacks) is destroyed. Once all threads have stopped executing, the enclave can be destroyed. When the enclave is destroyed, all remaining VTL 1 state associated with the enclave is destroyed, too (including the entire enclave address space), and all pages are freed in VTL 0. Finally, the enclave VAD is deleted and all committed enclave memory is freed. Destruction is triggered when the containing process calls VirtualFree with the base of the enclave’s address range. Destruction is not possible unless the enclave has been terminated or was never initialized.
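A containing process might drive termination and destruction as in the following sketch, which simply combines the two operations described in this step:

#include <windows.h>

// Minimal sketch of Step 5: force every thread executing in the enclave back to
// VTL 0, then release the enclave's address range, which triggers destruction of
// all the remaining VTL 1 state as described above.
void DestroyVbsEnclave(PVOID enclaveBase)
{
    // fWait = TRUE: return only after all enclave threads have stopped executing.
    TerminateEnclave(enclaveBase, TRUE);

    // Freeing the base of the enclave's address range triggers the destruction.
    VirtualFree(enclaveBase, 0, MEM_RELEASE);
}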

Note

As we have discussed previously, all the memory pages that are mapped into the enclave address space are private. This has an important implication: no memory pages belonging to the VTL 0 containing process are mapped in the enclave address space (and no VADs describing the containing process’s allocations are present either). So how can the enclave access all the memory pages of the containing process?

The answer is in the Secure Kernel page fault handler (SkmmAccessFault). In its code, the fault handler checks whether the faulting process is an enclave. If it is, the fault handler checks whether the fault happened because the enclave tried to execute some code outside its region. In this case, it raises an access violation error. If the fault is due to a read or write access outside the enclave’s address space, the secure page fault handler emits a GET_PHYSICAL_PAGE normal service, which causes the VTL 0 access fault handler to be called. The VTL 0 handler checks the containing process’s VAD tree, obtains the PFN of the page from its PTE (bringing it into memory if needed), and returns it to VTL 1. At this stage, the Secure Kernel can create the necessary paging structures to map the physical page at the same virtual address (which is guaranteed to be available thanks to the properties of the enclave itself) and resumes execution. The page is now valid in the context of the secure enclave.

Sealing and attestation

VBS-based enclaves, like hardware-based enclaves, support both sealing and attestation of data. The term sealing refers to the encryption of arbitrary data using one or more encryption keys that aren’t visible to the enclave’s code but are managed by the Secure Kernel and tied to the machine and to the enclave’s identity. Enclaves will never have access to those keys; instead, the Secure Kernel offers services for sealing and unsealing arbitrary contents (through the EnclaveSealData and EnclaveUnsealData APIs) using an appropriate key designated by the enclave. At the time the data is sealed, a set of parameters is supplied that controls which enclaves are permitted to unseal the data. The following policies are supported (a usage sketch follows the list):

  •     Security version number (SVN) of the Secure Kernel and of the primary image: No enclave can unseal any data that was sealed by a later version of the enclave or the Secure Kernel.

  •     Exact code: The data can be unsealed only by an enclave that maps the same identical modules of the enclave that has sealed it. The Secure Kernel verifies the hash of the Unique ID of every image mapped in the enclave to allow a proper unsealing.

  •     Same image, family, or author: The data can be unsealed only by an enclave that has the same author ID, family ID, and/or image ID.

  •     Runtime policy: The data can be unsealed only if the unsealing enclave has the same debugging policy as the original one (debuggable versus nondebuggable).
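The following sketch shows how enclave code might seal and unseal a secret under the same author policy from the list above. The sealing APIs are exported by Vertdll.dll and can be called only from inside the enclave; the runtime policy value of 0 is assumed here to mean that no debugging allowances are granted:

#include <windows.h>
#include <winenclaveapi.h>    // Sealing services exported by Vertdll.dll (vertdll.lib)

// Minimal sketch, running inside the enclave: seal a secret so that only
// enclaves signed by the same author can unseal it. Buffer management is
// reduced to the bare minimum.
BOOL SealSecret(const void* secret, UINT32 secretSize,
                void* blob, UINT32 blobCapacity, UINT32* blobSize)
{
    HRESULT hr = EnclaveSealData(secret,
                                 secretSize,
                                 ENCLAVE_IDENTITY_POLICY_SEAL_SAME_AUTHOR,
                                 0,          // Runtime policy: no debugging allowances (assumed)
                                 blob,
                                 blobCapacity,
                                 blobSize);
    return SUCCEEDED(hr);
}

BOOL UnsealSecret(const void* blob, UINT32 blobSize,
                  void* secret, UINT32 secretCapacity, UINT32* secretSize)
{
    // The Secure Kernel refuses the operation if the sealing policies
    // (SVN, code identity, debug policy) are not satisfied.
    HRESULT hr = EnclaveUnsealData(blob, blobSize,
                                   secret, secretCapacity, secretSize,
                                   NULL,     // Optional: identity of the sealing enclave
                                   NULL);    // Optional: unsealing flags
    return SUCCEEDED(hr);
}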

It is possible for every enclave to attest to any third party that it is running as a VBS enclave with all the protections offered by the VBS-enclave architecture. An enclave attestation report provides proof that a specific enclave is running under the control of the Secure Kernel. The attestation report contains the identity of all code loaded into the enclave as well as policies controlling how the enclave is executing.

Describing the internal details of the sealing and attestation operations is outside the scope of this book. An enclave can generate an attestation report through the EnclaveGetAttestationReport API. The memory buffer returned by the API can be transmitted to another enclave, which can “attest” the integrity of the environment in which the original enclave ran by verifying the attestation report through the EnclaveVerifyAttestationReport function.
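A minimal sketch of how enclave code might use these two APIs follows; like the sealing services, they are exported by Vertdll.dll and are callable only from within an enclave:

#include <windows.h>
#include <winenclaveapi.h>    // Attestation services exported by Vertdll.dll (vertdll.lib)

// Minimal sketch, running inside a VBS enclave: generate an attestation report
// that embeds 64 bytes of caller-supplied data (for example, the hash of a
// public key), and verify a report received from another enclave.
BOOL GetMyAttestationReport(const UINT8 reportData[ENCLAVE_REPORT_DATA_LENGTH],
                            void* report, UINT32 capacity, UINT32* reportSize)
{
    return SUCCEEDED(EnclaveGetAttestationReport(reportData, report,
                                                 capacity, reportSize));
}

BOOL VerifyPeerReport(const void* report, UINT32 reportSize)
{
    // Succeeds only if the report was generated by code running in a genuine
    // VBS enclave under the control of the Secure Kernel.
    return SUCCEEDED(EnclaveVerifyAttestationReport(ENCLAVE_TYPE_VBS,
                                                    report, reportSize));
}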

System Guard runtime attestation

System Guard runtime attestation (SGRA) is an operating system integrity component that leverages the aforementioned VBS enclaves, together with a remote attestation service component, to provide strong guarantees about its execution environment. This environment is used to assert sensitive system properties at runtime and allows a relying party to observe violations of the security promises that the system provides. The first implementation of this new technology was introduced in the Windows 10 April 2018 Update (RS4).

SGRA allows an application to view a statement about the security posture of the device. This statement is composed of three parts:

  •     A session report, which includes a security level describing the attestable boot-time properties of the device

  •     A runtime report, which describes the runtime state of the device

  •     A signed session certificate, which can be used to verify the reports

The SGRA service, SgrmBroker.exe, hosts a component (SgrmEnclave_secure.dll) that runs in VTL 1 as a VBS enclave and continually checks the system for runtime violations of security features. These assertions are surfaced in the runtime report, which can be verified on the backend by a relying party. Because the assertions run in a separate domain of trust, directly attacking the contents of the runtime report becomes difficult.

SGRA internals

Figure 9-41 shows a high-level overview of the architecture of Windows Defender System Guard runtime attestation, which consists of the following client-side components:

  •     The VTL-1 assertion engine: SgrmEnclave_secure.dll

  •     A VTL-0 kernel mode agent: SgrmAgent.sys

  •     A VTL-0 WinTCB Protected broker process hosting the assertion engine: SgrmBroker.exe

  •     A VTL-0 LPAC process used by the WinTCBPP broker process to interact with the networking stack: SgrmLpac.exe

Image

Figure 9-41 Windows Defender System Guard runtime attestation’s architecture.

To be able to rapidly respond to threats, SGRA includes a dynamic scripting engine (Lua) forming the core of the assertion mechanism that executes in a VTL 1 enclave—an approach that allows frequent assertion logic updates.

Due to the isolation provided by the VBS enclave, threads executing in VTL 1 are limited in terms of their ability to access VTL 0 NT APIs. Therefore, for the runtime component of SGRA to perform meaningful work, a way of working around the limited VBS enclave API surface is necessary.

An agent-based approach is implemented to expose VTL 0 facilities to the logic running in VTL 1; these facilities are termed assists and are serviced by the SgrmBroker user-mode component or by an agent driver running in VTL 0 kernel mode (SgrmAgent.sys). The VTL 1 logic running in the enclave can call out to these VTL 0 components to request assists that provide a range of facilities, including NT kernel synchronization primitives, page-mapping capabilities, and so on.

As an example of how this mechanism works, SGRA allows the VTL 1 assertion engine to directly read VTL 0–owned physical pages. The enclave requests a mapping of an arbitrary page via an assist. The page is then locked and mapped into the SgrmBroker VTL 0 address space (making it resident). Because VBS enclaves have direct access to the host process’s address space, the secure logic can read directly from the mapped virtual addresses. These reads must be synchronized with the VTL 0 kernel itself; the VTL 0 resident broker agent (the SgrmAgent.sys driver) is also used to perform this synchronization.

Assertion logic

As mentioned earlier, SGRA asserts system security properties at runtime. These assertions are executed within the assertion engine hosted in the VBS-based enclave. Signed Lua bytecode describing the assertion logic is provided to the assertion engine during startup.

Assertions are run periodically. When a violation of an asserted property is discovered (that is, when the assertion “fails”), the failure is recorded and stored within the enclave. This failure will be exposed to a relying party in the runtime report that is generated and signed (with the session certificate) within the enclave.

An example of the assertion capabilities provided by SGRA is the set of assertions surrounding various executive process object attributes: for example, the periodic enumeration of running processes and the assertion of the state of a process’s protection bits that govern protected process policies.

The flow for the assertion engine performing this check can be approximated to the following steps:

  1. The assertion engine running within VTL 1 calls into its VTL 0 host process (SgrmBroker) to request that an executive process object be referenced by the kernel.

  2. The broker process forwards this request to the kernel mode agent (SgrmAgent), which services the request by obtaining a reference to the requested executive process object.

  3. The agent notifies the broker that the request has been serviced and passes any necessary metadata down to the broker.

  4. The broker forwards this response to the requesting VTL 1 assertion logic.

  5. The logic can then elect to have the physical page backing the referenced executive process object locked and mapped into its accessible address space; this is done by calling out of the enclave using a similar flow as steps 1 through 4.

  6. Once the page is mapped, the VTL 1 engine can read it directly and check the executive process object protection bit against its internally held context.

  7. The VTL 1 logic again calls out to VTL 0 to unwind the page mapping and kernel object reference.

Reports and trust establishment

A WinRT-based API is exposed to allow relying parties to obtain the SGRA session certificate and the signed session and runtime reports. This API is not public and is available under NDA to vendors that are part of the Microsoft Virus Initiative (note that Microsoft Defender Advanced Threat Protection is currently the only in-box component that interfaces directly with SGRA via this API).

The flow for obtaining a trusted statement from SGRA is as follows:

  1. A session is created between the relying party and SGRA. Establishment of the session requires a network connection. The SgrmEnclave assertion engine (running in VTL-1) generates a public-private key pair, and the SgrmBroker protected process retrieves the TCG log and the VBS attestation report, sending them to Microsoft’s System Guard attestation service with the public component of the key generated in the previous step.

  2. The attestation service verifies the TCG log (from the TPM) and the VBS attestation report (as proof that the logic is running within a VBS enclave) and generates a session report describing the attested boot time properties of the device. It signs the public key with an SGRA attestation service intermediate key to create a certificate that will be used to verify runtime reports.

  3. The session report and the certificate are returned to the relying party. From this point, the relying party can verify the validity of the session report and runtime certificate.

  4. Periodically, the relying party can request a runtime report from SGRA using the established session: the SgrmEnclave assertion engine generates a runtime report describing the state of the assertions that have been run. The report will be signed using the paired private key generated during session creation and returned to the relying party (the private key never leaves the enclave).

  5. The relying party can verify the validity of the runtime report against the runtime certificate obtained earlier and make a policy decision based on both the contents of the session report (boot-time attested state) and the runtime report (asserted state).

SGRA provides an API that relying parties can use to attest to the state of the device at a point in time. The API returns a runtime report that details the claims that Windows Defender System Guard runtime attestation makes about the security posture of the system. These claims include assertions, which are runtime measurements of sensitive system properties. For example, an app could ask Windows Defender System Guard to measure the security of the system from the hardware-backed enclave and return a report. The details in this report can be used by the app to decide whether to perform a sensitive financial transaction or display personal information.

As discussed in the previous section, a VBS-based enclave can also expose an enclave attestation report signed by a VBS-specific signing key. If Windows Defender System Guard can obtain proof that the host system is running with VSM active, it can use this proof with a signed session report to ensure that the particular enclave is running. Establishing the trust necessary to guarantee that the runtime report is authentic, therefore, requires the following:

  1. Attesting to the boot state of the machine; the OS, hypervisor, and Secure Kernel (SK) binaries must be signed by Microsoft and configured according to a secure policy.

  2. Binding trust between the TPM and the health of the hypervisor to allow trust in the Measured Boot Log.

  3. Extracting the needed keys (VSM IDKs) from the Measured Boot Log and using these to verify the VBS enclave signature (see Chapter 12 for further details).

  4. Signing of the public component of an ephemeral key-pair generated within the enclave with a trusted Certificate Authority to issue a session certificate.

  5. Signing of the runtime report with the ephemeral private key.

Networking calls between the enclave and the Windows Defender System Guard attestation service are made from VTL 0. However, the design of the attestation protocol ensures that it is resilient against tampering even over untrusted transport mechanisms.

Numerous underlying technologies are required before the chain of trust described earlier can be sufficiently established. To inform a relying party of the level of trust in the runtime report that they can expect on any particular configuration, a security level is assigned to each Windows Defender System Guard attestation service-signed session report. The security level reflects the underlying technologies enabled on the platform and attributes a level of trust based on the capabilities of the platform. Microsoft is mapping the enablement of various security technologies to security levels and will share this when the API is published for third-party use. The highest level of trust is likely to require the following features, at the very least:

  •     VBS-capable hardware and OEM configuration.

  •     Dynamic root-of-trust measurements at boot.

  •     Secure boot to verify hypervisor, NT, and SK images.

  •     Secure policy ensuring Hypervisor Enforced Code Integrity (HVCI) and kernel mode code integrity (KMCI) are enabled, test-signing is disabled, and kernel debugging is disabled.

  •     The ELAM driver is present.

Conclusion

Windows is able to manage and run multiple virtual machines thanks to the Hyper-V hypervisor and its virtualization stack, which, combined, support different operating systems running in a VM. Over the years, the two components have evolved to provide more optimizations and advanced features for the VMs, like nested virtualization, multiple schedulers for the virtual processors, different types of virtual hardware support, VMBus, VA-backed VMs, and so on.

Virtualization-based security provides the root operating system with a new level of protection against malware and stealthy rootkits, which are no longer able to steal private and confidential information from the root operating system’s memory. The Secure Kernel uses the services supplied by the Windows hypervisor to create a new execution environment (VTL 1) that is protected and not accessible to the software running in the main OS. Furthermore, the Secure Kernel delivers multiple services to the Windows ecosystem that help maintain a more secure environment.

The Secure Kernel also defines the Isolated User Mode, allowing user mode code to be executed in the new protected environment through trustlets, secure devices, and enclaves. The chapter ended with the analysis of System Guard Runtime Attestation, a component that uses the services exposed by the Secure Kernel to measure the workstation’s execution environment and to provide strong guarantees about its integrity.

In the next chapter, we look at the management and diagnostics components of Windows and discuss important mechanisms involved with their infrastructure: the registry, services, Task scheduler, Windows Management Instrumentation (WMI), kernel Event Tracing, and so on.
