# Architecture

The principal characteristics of crosvm are:

- A process per virtual device, made using fork
- Each process is sandboxed using [minijail]
- Takes full advantage of KVM and low-level Linux syscalls, and so only runs on Linux
- Written in Rust for security and safety

A typical session of crosvm starts in `main.rs` where command line parsing is done to build up a
`Config` structure. The `Config` is used by `run_config` in `linux/mod.rs` to set up and execute a
VM. Broken down into rough steps:

1. Load the Linux kernel from an ELF file.
1. Create a handful of control sockets used by the virtual devices.
1. Invoke the architecture specific VM builder `Arch::build_vm` (located in `x86_64/src/lib.rs` or
   `aarch64/src/lib.rs`).
1. `Arch::build_vm` will itself invoke the provided `create_devices` function from `linux/mod.rs`.
1. `create_devices` creates every PCI device, including the virtio devices, that were configured in
   `Config`, along with matching [minijail] configs for each.
1. `Arch::generate_pci_root`, using a list of every PCI device with optional `Minijail`, will
   finally jail the PCI devices and construct a `PciRoot` that communicates with them.
1. Once the VM has been built, it's contained within a `RunnableLinuxVm` object that is used by the
   VCPUs and control loop to service requests until shutdown.

## Forking

During the device creation routine, each device will be created and then wrapped in a `ProxyDevice`
which will internally `fork` (but not `exec`) and [minijail] the device, while dropping it for the
main process. The only interaction that the device is capable of having with the main process is via
the proxied trait methods of `BusDevice`, shared memory mappings such as the guest memory, and file
descriptors that were specifically allowed by that device's security policy. This can lead to some
surprising behavior, such as file descriptors that were once valid becoming invalid inside the
sandboxed device.

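The proxying idea can be illustrated with a toy model. The sketch below is not crosvm's `ProxyDevice`: it uses a thread and `std::sync::mpsc` channels where crosvm uses a forked, jailed process and a socket, and every name in it (`ProxySketch`, `Command`, `Scratch`, and the two-method `BusDevice` trait) is hypothetical. What it does show is that the caller keeps only a message channel while the device itself lives on the other side.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical, trimmed-down stand-in for crosvm's `BusDevice` trait.
trait BusDevice: Send {
    fn read(&mut self, offset: u64, data: &mut [u8]);
    fn write(&mut self, offset: u64, data: &[u8]);
}

/// Messages that cross the boundary between the caller and the device.
enum Command {
    Read { offset: u64, len: usize, reply: Sender<Vec<u8>> },
    Write { offset: u64, data: Vec<u8> },
}

/// Hypothetical proxy: it owns a channel, never the device itself.
struct ProxySketch {
    tx: Sender<Command>,
}

impl ProxySketch {
    fn new<D: BusDevice + 'static>(mut device: D) -> Self {
        let (tx, rx) = channel::<Command>();
        // The device moves into the worker. crosvm would fork and jail here;
        // a thread is used only to keep the sketch free of dependencies.
        thread::spawn(move || {
            for cmd in rx {
                match cmd {
                    Command::Read { offset, len, reply } => {
                        let mut buf = vec![0u8; len];
                        device.read(offset, &mut buf);
                        let _ = reply.send(buf);
                    }
                    Command::Write { offset, data } => device.write(offset, &data),
                }
            }
        });
        ProxySketch { tx }
    }
}

/// The proxy implements the same trait, so callers cannot tell the difference.
impl BusDevice for ProxySketch {
    fn read(&mut self, offset: u64, data: &mut [u8]) {
        let (reply, rx) = channel();
        self.tx
            .send(Command::Read { offset, len: data.len(), reply })
            .unwrap();
        data.copy_from_slice(&rx.recv().unwrap());
    }

    fn write(&mut self, offset: u64, data: &[u8]) {
        self.tx
            .send(Command::Write { offset, data: data.to_vec() })
            .unwrap();
    }
}

/// Toy device: a 16-byte scratch area.
struct Scratch([u8; 16]);

impl BusDevice for Scratch {
    fn read(&mut self, offset: u64, data: &mut [u8]) {
        let start = offset as usize;
        data.copy_from_slice(&self.0[start..start + data.len()]);
    }

    fn write(&mut self, offset: u64, data: &[u8]) {
        let start = offset as usize;
        self.0[start..start + data.len()].copy_from_slice(data);
    }
}
```

Wrapping a device is then a matter of `ProxySketch::new(Scratch([0; 16]))`. Unlike a thread, a forked process does not share its descriptor table with the parent after the fork, which is where the descriptor surprises mentioned above come from.
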
## Sandboxing Policy

Every sandbox is made with [minijail] and starts with `create_base_minijail` in
`linux/jail_helpers.rs`, which sets some very restrictive settings. Linux namespaces and seccomp
filters are used extensively. Each seccomp policy can be found under
`seccomp/{arch}/{device}.policy` and should start by `@include`-ing the `common_device.policy`. With
the exception of architecture specific devices (such as `Pl030` on ARM or `I8042` on x86_64), every
device will need a different policy for each supported architecture.

## The VM Control Sockets

For the operations that devices need to perform on the global VM state, such as mapping into guest
memory address space, there are the VM control sockets. There are a few kinds, split by the type of
request and response that the socket will process. This also provides basic security privilege
separation in case a device becomes compromised by a malicious guest. For example, a rogue device
that is able to allocate MSI routes would not be able to use the same socket to (de)register guest
memory. During the device initialization stage, each device that requires some aspect of VM control
will have a constructor that requires the corresponding control socket. The control socket will get
preserved when the device is sandboxed and the other side of the socket will be waited on in the
main process's control loop.

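The privilege separation described above can be sketched with one typed channel per kind of request. This is an illustration only: the request enums, `MsiControl`, `MemoryControl`, and `InterruptOnlyDevice` are hypothetical names, and `std::sync::mpsc` channels stand in for the real sockets. A device that was handed only the MSI handle has no way to express a memory request.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical request for (de)registering guest memory.
#[derive(Debug, PartialEq)]
enum MemoryRequest {
    Register { guest_addr: u64, size: u64 },
    Unregister { slot: u32 },
}

/// Hypothetical request for allocating an MSI route.
#[derive(Debug, PartialEq)]
enum MsiRequest {
    AllocateRoute { vector: u32 },
}

/// One device-side handle per kind of request.
struct MemoryControl(Sender<MemoryRequest>);
struct MsiControl(Sender<MsiRequest>);

impl MemoryControl {
    fn register(&self, guest_addr: u64, size: u64) {
        self.0
            .send(MemoryRequest::Register { guest_addr, size })
            .unwrap();
    }

    fn unregister(&self, slot: u32) {
        self.0.send(MemoryRequest::Unregister { slot }).unwrap();
    }
}

/// A device whose constructor asks for the MSI handle and nothing else.
struct InterruptOnlyDevice {
    msi: MsiControl,
}

impl InterruptOnlyDevice {
    fn new(msi: MsiControl) -> Self {
        InterruptOnlyDevice { msi }
    }

    fn request_route(&self, vector: u32) {
        // Sending a `MemoryRequest` through `self.msi` would not type-check,
        // so a compromised device cannot reuse this handle for memory changes.
        self.msi
            .0
            .send(MsiRequest::AllocateRoute { vector })
            .unwrap();
    }
}
```
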
The socket exposed by crosvm with the `--socket` command line argument is another form of the VM
control socket. Because the protocol of the control socket is internal and unstable, the only
supported way of using that resulting named unix domain socket is via crosvm command line
subcommands such as `crosvm stop`.

## GuestMemory

`GuestMemory` and its friends `VolatileMemory`, `VolatileSlice`, `MemoryMapping`, and
`SharedMemory` are common types used throughout crosvm to interact with guest memory. The following
guidelines help decide which one to use where:

- `GuestMemory` is for sending around references to all of the guest memory. It can be cloned
  freely, but the underlying guest memory is always the same. Internally, it's implemented using
  `MemoryMapping` and `SharedMemory`. Note that `GuestMemory` is mapped into the host address space,
  but it is non-contiguous. Device memory, such as mapped DMA-Bufs, is not present in
  `GuestMemory`.
- `SharedMemory` wraps a `memfd` and can be mapped using `MemoryMapping` to access its data.
  `SharedMemory` can't be cloned.
- `VolatileMemory` is a trait that exposes generic access to non-contiguous memory. `GuestMemory`
  implements this trait. Use this trait for functions that operate on a memory space but don't
  necessarily need it to be guest memory.
- `VolatileSlice` is analogous to a Rust slice, but unlike a slice, a `VolatileSlice` has data that
  can be changed asynchronously by anything that references it. Exclusive mutability and data
  synchronization are not available when it comes to a `VolatileSlice`. This type is useful for
  functions that operate on contiguous shared memory, such as a single entry from a scatter gather
  table, or for safe wrappers around functions which operate on pointers, such as a `read` or
  `write` syscall.
- `MemoryMapping` is a safe wrapper around anonymous and file mappings. It provides RAII and calls
  `munmap` after use. Access via Rust references is forbidden, but indirect reading and writing is
  available via `VolatileSlice` and several convenience functions. This type is most useful for
  mapping memory unrelated to `GuestMemory`.

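To make the `VolatileSlice` guideline more concrete, here is a minimal sketch of the idea, assuming nothing about the real type's API. `VolatileSliceSketch` and its methods are hypothetical. The point is that the handle is freely copyable, never hands out Rust references to the bytes, and performs every access as a bounds-checked volatile read or write through a raw pointer.

```rust
use std::marker::PhantomData;
use std::ptr;

/// Hypothetical, simplified stand-in for a volatile view of shared memory.
#[derive(Clone, Copy)]
struct VolatileSliceSketch<'a> {
    addr: *mut u8,
    len: usize,
    // Ties the view to the lifetime of the backing memory.
    _backing: PhantomData<&'a [u8]>,
}

impl<'a> VolatileSliceSketch<'a> {
    /// Builds a view over `buf`; the buffer stays borrowed for `'a`.
    fn new(buf: &'a mut [u8]) -> Self {
        VolatileSliceSketch {
            addr: buf.as_mut_ptr(),
            len: buf.len(),
            _backing: PhantomData,
        }
    }

    /// Writes one byte, or returns `None` if `offset` is out of range.
    fn write_u8(&self, offset: usize, value: u8) -> Option<()> {
        if offset >= self.len {
            return None;
        }
        // SAFETY: `offset` was bounds-checked and the backing memory is
        // borrowed for `'a`, so the pointer is valid for this write.
        unsafe { ptr::write_volatile(self.addr.add(offset), value) };
        Some(())
    }

    /// Reads one byte, or returns `None` if `offset` is out of range.
    fn read_u8(&self, offset: usize) -> Option<u8> {
        if offset >= self.len {
            return None;
        }
        // SAFETY: same argument as in `write_u8`.
        Some(unsafe { ptr::read_volatile(self.addr.add(offset)) })
    }
}
```

Because both methods take `&self`, two copies of the same view can write to the same bytes, which mirrors the lack of exclusive mutability described above. The sketch gives no synchronization either; in real shared memory the other writer may be the guest.
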
## Device Model

### `Bus`/`BusDevice`

The root of the crosvm device model is the `Bus` structure and its friend the `BusDevice` trait. The
`Bus` structure is a virtual computer bus used to emulate the memory-mapped I/O bus and also I/O
ports for x86 VMs. On a read or write to an address on a VM's bus, the corresponding `Bus` object is
queried for a `BusDevice` that occupies that address. `Bus` will then forward the read/write to the
`BusDevice`. Because of this behavior, only one `BusDevice` may exist at any given address. However,
a `BusDevice` may be placed at more than one address range. Depending on how a `BusDevice` was
inserted into the `Bus`, the forwarded read/write will be relative to 0 or to the start of the
address range that the `BusDevice` occupies (which would be ambiguous if the `BusDevice` occupied
more than one range).

Only the base address of a multi-byte read/write is used to search for a device, so a device
implementation should be aware that the last address of a single read/write may be outside its
address range. For example, if a `BusDevice` was inserted at base address 0x1000 with a length of
0x40, a 4-byte read by a VCPU at 0x103e would be forwarded to that `BusDevice`, even though the last
two bytes of the access (0x1040 and 0x1041) lie outside the device's range.

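The lookup rule can be sketched as follows. This is a simplified model, not crosvm's `Bus`: `BusSketch`, `Recorder`, and the one-method `BusDevice` trait are hypothetical, and the sketch always forwards offsets relative to the start of the range. It rejects overlapping ranges and searches by the base address of the access only, so a 4-byte read at 0x103e reaches a device occupying 0x1000..0x1040 even though the access runs past the end of that range.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Hypothetical, trimmed-down stand-in for crosvm's `BusDevice` trait.
trait BusDevice: Send {
    fn read(&mut self, offset: u64, data: &mut [u8]);
}

type SharedDevice = Arc<Mutex<dyn BusDevice>>;

/// Hypothetical bus keyed by the base address of each occupied range.
struct BusSketch {
    ranges: BTreeMap<u64, (u64, SharedDevice)>,
}

impl BusSketch {
    fn new() -> Self {
        BusSketch { ranges: BTreeMap::new() }
    }

    /// Inserts a device; fails if the range is empty or overlaps another.
    fn insert(&mut self, base: u64, len: u64, device: SharedDevice) -> bool {
        let end = base + len;
        let overlaps = self
            .ranges
            .iter()
            .any(|(b, (l, _))| base < *b + *l && *b < end);
        if len == 0 || overlaps {
            return false;
        }
        self.ranges.insert(base, (len, device));
        true
    }

    /// Forwards a read using only the first address of the access.
    fn read(&self, addr: u64, data: &mut [u8]) -> bool {
        // Find the range with the greatest base that is <= addr.
        match self.ranges.range(..=addr).next_back() {
            Some((base, (len, device))) if addr - *base < *len => {
                // The offset is relative to the start of the range.
                device.lock().unwrap().read(addr - *base, data);
                true
            }
            _ => false,
        }
    }
}

/// Toy device that remembers the last access it saw.
struct Recorder {
    last: Option<(u64, usize)>,
}

impl BusDevice for Recorder {
    fn read(&mut self, offset: u64, data: &mut [u8]) {
        self.last = Some((offset, data.len()));
        data.fill(0);
    }
}
```
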
Each `BusDevice` is reference counted and wrapped in a mutex, so implementations of `BusDevice` need
not worry about synchronizing their access across multiple VCPUs and threads. Each VCPU will get a
complete copy of the `Bus`, so there is no contention for querying the `Bus` about an address. Once
the `BusDevice` is found, the `Bus` will acquire an exclusive lock to the device and forward the
VCPU's read/write. The implementation of the `BusDevice` will block execution of the VCPU that
invoked it, as well as any other VCPU attempting access, until it returns from its method.

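A small model of that sharing scheme, with hypothetical names (`BusSketch`, `Counter`, `run_vcpus`) and plain threads standing in for VCPUs: each thread gets its own clone of the bus, while the device behind it is a single reference-counted, mutex-protected object, so the device code itself needs no extra synchronization.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy device state; it has no internal synchronization of its own.
struct Counter {
    writes: u64,
}

/// Hypothetical bus holding one shared device. Cloning the bus copies the
/// `Arc`, not the device.
#[derive(Clone)]
struct BusSketch {
    device: Arc<Mutex<Counter>>,
}

impl BusSketch {
    fn new() -> Self {
        BusSketch { device: Arc::new(Mutex::new(Counter { writes: 0 })) }
    }

    /// Takes the exclusive lock, then forwards the access to the device.
    fn write(&self) {
        let mut device = self.device.lock().unwrap();
        device.writes += 1;
    }
}

/// Runs `vcpus` threads, each with its own bus clone, and returns the total
/// number of writes the device observed.
fn run_vcpus(bus: &BusSketch, vcpus: usize, writes_per_vcpu: u64) -> u64 {
    let handles: Vec<_> = (0..vcpus)
        .map(|_| {
            let bus = bus.clone();
            thread::spawn(move || {
                for _ in 0..writes_per_vcpu {
                    bus.write();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = bus.device.lock().unwrap().writes;
    total
}
```
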
Most devices in crosvm do not implement `BusDevice` directly, but some examples that do are `i8042`
and `Serial`. With the exception of PCI devices, all devices are inserted by architecture specific
code (which may call into the architecture-neutral `arch` crate). A `BusDevice` can be proxied to a
sandboxed process using `ProxyDevice`, which will create the second process using a fork, with no
exec.

### `PciConfigIo`/`PciConfigMmio`

In order to use the more complex PCI bus, there are a couple of adapters that implement `BusDevice`
and call into a `PciRoot` with higher level calls to `config_space_read`/`config_space_write`. The
`PciConfigMmio` is a `BusDevice` for insertion into the MMIO `Bus` for ARM devices. For x86_64,
`PciConfigIo` is inserted into the I/O port `Bus`. There is only one implementation of `PciRoot`
that is used by either of the `PciConfig*` structures. Because these devices are very simple, they
have very little code or state. They aren't sandboxed and are run as part of the main process.

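As background for what an I/O port adapter has to translate, the sketch below decodes the conventional x86 PCI configuration address (the 32-bit value a guest writes to port 0xCF8 before accessing the data port at 0xCFC). The bit layout is the standard one; whether `PciConfigIo` decodes it in exactly this shape is an assumption, and `ConfigAddress` and `decode_config_address` are hypothetical names.

```rust
/// Bit 31 of the address register enables configuration access.
const CONFIG_ENABLE: u32 = 1 << 31;

/// Decoded fields of a configuration address.
#[derive(Debug, PartialEq)]
struct ConfigAddress {
    bus: u8,
    device: u8,
    function: u8,
    /// Index of the 32-bit configuration register (byte offset / 4).
    register: u8,
}

/// Decodes a value written to the configuration address port.
/// Returns `None` when the enable bit is clear.
fn decode_config_address(value: u32) -> Option<ConfigAddress> {
    if (value & CONFIG_ENABLE) == 0 {
        return None;
    }
    Some(ConfigAddress {
        bus: ((value >> 16) & 0xff) as u8,      // bits 23..16
        device: ((value >> 11) & 0x1f) as u8,   // bits 15..11
        function: ((value >> 8) & 0x07) as u8,  // bits 10..8
        register: ((value >> 2) & 0x3f) as u8,  // bits 7..2
    })
}
```

An adapter in this style would pass the decoded fields on to the PCI root's `config_space_read` or `config_space_write`. The MMIO variant typically encodes the same fields in the accessed address instead of a separate address register.
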
### `PciRoot`/`PciDevice`/`VirtioPciDevice`

The `PciRoot`, analogous to `Bus` for `BusDevice`s, contains all the `PciDevice` trait objects.
Because of a shortcut (or hack), the `ProxyDevice` only supports jailing `BusDevice` traits.
Therefore, `PciRoot` only contains `BusDevice`s, even though they also implement `PciDevice`. In
fact, every `PciDevice` also implements `BusDevice` because of a blanket implementation
(`impl<T: PciDevice> BusDevice for T { … }`). There are a few PCI related methods in `BusDevice` to
allow the `PciRoot` to still communicate with the underlying `PciDevice` (yes, this abstraction is
very leaky). Most devices will not implement `PciDevice` directly, instead using the
`VirtioPciDevice` implementation for virtio devices, but the xHCI (USB) controller is an example
that implements `PciDevice` directly. The `VirtioPciDevice` is an implementation of `PciDevice` that
wraps a `VirtioDevice`, which is how the virtio specified PCI transport is adapted to a transport
agnostic `VirtioDevice` implementation.

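The blanket implementation pattern can be shown in isolation. The traits below are heavily reduced and their method names (`read_bar`, `config_register`, `config_register_read`) are assumptions made for this sketch, as are `ToyController` and `RootSketch`. Only the shape `impl<T: PciDevice> BusDevice for T` comes from the text above.

```rust
/// Hypothetical, reduced PCI device trait.
trait PciDevice {
    fn read_bar(&mut self, offset: u64, data: &mut [u8]);
    fn config_register(&self, index: usize) -> u32;
}

/// Hypothetical, reduced bus device trait. The PCI-related hook has a
/// default so that non-PCI devices can ignore it.
trait BusDevice {
    fn read(&mut self, offset: u64, data: &mut [u8]);
    fn config_register_read(&self, _index: usize) -> u32 {
        0
    }
}

/// Every `PciDevice` automatically becomes a `BusDevice`.
impl<T: PciDevice> BusDevice for T {
    fn read(&mut self, offset: u64, data: &mut [u8]) {
        self.read_bar(offset, data)
    }

    fn config_register_read(&self, index: usize) -> u32 {
        self.config_register(index)
    }
}

/// Toy device that implements only `PciDevice`.
struct ToyController;

impl PciDevice for ToyController {
    fn read_bar(&mut self, _offset: u64, data: &mut [u8]) {
        data.fill(0x5a);
    }

    fn config_register(&self, index: usize) -> u32 {
        if index == 0 { 0x1234_5678 } else { 0 }
    }
}

/// Hypothetical root that stores `BusDevice` trait objects only, yet can
/// still reach PCI behaviour through the extra hook.
struct RootSketch {
    devices: Vec<Box<dyn BusDevice>>,
}
```

A `ToyController` can be boxed as `Box<dyn BusDevice>` without writing a second `impl`. The cost is the leak mentioned above: the `BusDevice` trait has to carry PCI-specific methods.
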
### `VirtioDevice`

The `VirtioDevice` is the most widely implemented trait among the device traits. Each of the
different virtio devices (block, rng, net, etc.) implements this trait directly and they follow a
similar pattern. Most of the trait methods are easily filled in with basic information about the
specific device, but `activate` will be the heart of the implementation. It's called by the virtio
transport after the guest's driver has indicated the device has been configured and is ready to run.
The virtio device implementation will receive the run time related resources (`GuestMemory`,
`Interrupt`, etc.) for processing virtio queues and associated interrupts via the arguments to
`activate`, but `activate` can't spend its time actually processing the queues. A VCPU will be
blocked as long as `activate` is running. Every device uses `activate` to launch a worker thread
that takes ownership of run time resources to do the actual processing. There is some subtlety in
dealing with virtio queues, so the smart thing to do is copy a simpler device and adapt it, such as
the rng device (`rng.rs`).

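The "return quickly from `activate`, do the work on a thread" pattern looks roughly like this. Everything here is a stand-in: `VirtioDeviceSketch`, `Resources`, and `ToyRng` are hypothetical, a channel of byte buffers plays the role of a virtio queue, and a second channel plays the role of the interrupt.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical bundle of run time resources handed over at activation.
struct Resources {
    /// Stand-in for a virtio queue: buffers offered by the guest.
    queue: Receiver<Vec<u8>>,
    /// Stand-in for the interrupt: reports how many bytes were processed.
    interrupt: Sender<usize>,
}

/// Hypothetical, reduced device trait.
trait VirtioDeviceSketch {
    fn device_type(&self) -> u32;
    fn activate(&mut self, resources: Resources);
}

struct ToyRng {
    worker: Option<JoinHandle<()>>,
}

impl VirtioDeviceSketch for ToyRng {
    fn device_type(&self) -> u32 {
        4 // virtio device ID of the entropy source
    }

    fn activate(&mut self, resources: Resources) {
        let Resources { queue, interrupt } = resources;
        // Hand ownership of the resources to a worker and return at once,
        // so the caller is not blocked while queues are processed.
        self.worker = Some(thread::spawn(move || {
            for buffer in queue {
                // A real device would fill the buffer here.
                if interrupt.send(buffer.len()).is_err() {
                    break;
                }
            }
        }));
    }
}
```

The worker exits when the sending side of the queue channel is dropped, which gives the sketch a simple shutdown path. Real devices need an explicit way to stop and to hand resources back on reset.
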
## Communication Framework

Because of the multi-process nature of crosvm, communication is done over several IPC primitives.
The common ones are shared memory pages, unix sockets, anonymous pipes, and various other file
descriptor variants (DMA-buf, eventfd, etc.). Standard methods (`read`/`write`) of using these
primitives may be used, but crosvm has developed some helpers which should be used where applicable.

### `PollContext`/`EpollContext`

Most threads in crosvm will have a wait loop using a `PollContext`, which is a wrapper around
Linux's `epoll` primitive for selecting over file descriptors. `EpollContext` is very similar: it
has slightly fewer features, but it is usable by multiple threads at once. In either case, each FD
is added to the context along with an associated token, whose type is the type parameter of
`PollContext`. This token must be convertible to and from a `u64`, which is a limitation imposed by
how `epoll` works. There is a custom derive `#[derive(PollToken)]` which can be applied to an `enum`
declaration that makes it easy to use your own enum in a `PollContext`.

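To show what "convertible to and from a `u64`" means for an enum token, here is a hand-written conversion. The trait `RawToken`, the enum `Token`, and the chosen bit layout (variant number in the low 32 bits, payload in the high 32 bits) are all assumptions of this sketch and not necessarily what `#[derive(PollToken)]` generates.

```rust
/// Hypothetical conversion trait for tokens stored in epoll's 64-bit
/// user data field.
trait RawToken: Sized {
    fn as_raw_token(&self) -> u64;
    fn from_raw_token(raw: u64) -> Option<Self>;
}

/// Example token enum for a device worker's wait loop.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Token {
    Exit,
    QueueReady { index: u32 },
}

impl RawToken for Token {
    fn as_raw_token(&self) -> u64 {
        match *self {
            Token::Exit => 0,
            // Variant number in the low half, payload in the high half.
            Token::QueueReady { index } => 1 | (u64::from(index) << 32),
        }
    }

    fn from_raw_token(raw: u64) -> Option<Self> {
        match raw & 0xffff_ffff {
            0 => Some(Token::Exit),
            1 => Some(Token::QueueReady { index: (raw >> 32) as u32 }),
            _ => None,
        }
    }
}
```
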
Note that the limitations of `PollContext` are the same as the limitations of `epoll`. The same FD
cannot be inserted more than once, and the FD will be automatically removed if the process runs out
of references to that FD. A `dup`/`fork` call will increment that reference count, so closing the
original FD will not actually remove it from the `PollContext`. It is possible to receive tokens
from `PollContext` for an FD that was closed because of a race condition in which an event was
registered in the background before the `close` happened. Best practice is to remove an FD before
closing it so that events associated with it can be reliably eliminated.

### `serde` with Descriptors

Using raw sockets and pipes to communicate is very inconvenient for rich data types. To help make
this easier and less error prone, crosvm uses the `serde` crate. To allow transmitting types with
embedded descriptors (FDs on Linux or HANDLEs on Windows), a module is provided for sending and
receiving descriptors alongside the plain old bytes that serde consumes.

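The reason descriptors travel alongside the bytes, and not inside them, is that a descriptor number only has meaning within one process. The toy encoder below illustrates the general idea without using `serde` or crosvm's module: the byte payload stores an index into a separate descriptor list, and the receiver substitutes whatever descriptor it was given at that index. `Message`, `Wire`, `encode`, and `decode` are hypothetical, and plain `i32` values stand in for real descriptors.

```rust
/// A message with an embedded descriptor (an `i32` stands in for an FD).
#[derive(Debug, PartialEq)]
struct Message {
    name: String,
    descriptor: i32,
}

/// What actually crosses the process boundary: plain bytes plus a list of
/// descriptors that the transport passes out of band.
struct Wire {
    bytes: Vec<u8>,
    descriptors: Vec<i32>,
}

/// Encodes `msg`, replacing the descriptor with its index in the list.
fn encode(msg: &Message) -> Wire {
    let mut bytes = vec![0u8]; // index 0 into `descriptors`
    bytes.extend_from_slice(msg.name.as_bytes());
    Wire { bytes, descriptors: vec![msg.descriptor] }
}

/// Decodes a message, resolving the index against the received list.
fn decode(wire: &Wire) -> Option<Message> {
    let (&index, name) = wire.bytes.split_first()?;
    let descriptor = *wire.descriptors.get(usize::from(index))?;
    let name = String::from_utf8(name.to_vec()).ok()?;
    Some(Message { name, descriptor })
}
```

If the transport hands the receiver a different number for the same underlying object, for example 42 where the sender had 5, `decode` yields a message containing 42. That is the behaviour wanted when descriptors are duplicated into another process.
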
## Code Map

Source code is organized into crates, each with its own unit tests.

- `./src/` - The top-level binary front-end for using crosvm.
- `aarch64` - Support code specific to 64-bit ARM architectures.
- `base` - Safe wrappers for small system facilities that provide cross-platform-compatible
  interfaces. For Linux, this is basically a thin wrapper of `sys_util`.
- `bin` - Scripts for code health such as wrappers of `rustfmt` and `clippy`.
- `ci` - Scripts for continuous integration.
- `cros_async` - Runtime for async/await programming. This crate provides a `Future` executor based
  on `io_uring` and one based on `epoll`.
- `devices` - Virtual devices exposed to the guest OS.
- `disk` - Library to create and manipulate several types of disks such as raw disk, [qcow], etc.
- `hypervisor` - Abstract layer to interact with hypervisors. For Linux, this crate is a wrapper of
  `kvm`.
- `integration_tests` - End-to-end tests that run a crosvm VM.
- `kernel_loader` - Loads elf64 kernel files to a slice of memory.
- `kvm_sys` - Low-level (mostly) auto-generated structures and constants for using KVM.
- `kvm` - Unsafe, low-level wrapper code for using `kvm_sys`.
- `media/libvda` - Safe wrapper of [libvda], a Chrome OS HW-accelerated video decoding/encoding
  library.
- `net_sys` - Low-level (mostly) auto-generated structures and constants for creating TUN/TAP
  devices.
- `net_util` - Wrapper for creating TUN/TAP devices.
- `qcow_util` - A library and a binary to manipulate [qcow] disks.
- `seccomp` - Contains minijail seccomp policy files for each sandboxed device. Because some
  syscalls vary by architecture, the seccomp policies are split by architecture.
- `sync` - Our version of `std::sync::Mutex` and `std::sync::Condvar`.
- `sys_util` - Mostly safe wrappers for small system facilities such as `eventfd` or `syslog`.
- `third_party` - Third-party libraries which we are maintaining on the Chrome OS tree or the AOSP
  tree.
- `vfio_sys` - Low-level (mostly) auto-generated structures, constants and ioctls for [VFIO].
- `vhost` - Wrappers for creating vhost based devices.
- `virtio_sys` - Low-level (mostly) auto-generated structures and constants for interfacing with
  kernel vhost support.
- `vm_control` - IPC for the VM.
- `vm_memory` - VM-specific memory objects.
- `x86_64` - Support code specific to 64-bit Intel machines.

[libvda]: https://chromium.googlesource.com/chromiumos/platform2/+/refs/heads/main/arc/vm/libvda/
[minijail]: https://android.googlesource.com/platform/external/minijail
[qcow]: https://en.wikipedia.org/wiki/Qcow
[vfio]: https://www.kernel.org/doc/html/latest/driver-api/vfio.html