# Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


## Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

### Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

#### Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are (the poorly named) CONFIG_DEBUG_RODATA and
CONFIG_DEBUG_SET_MODULE_RONX, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

#### Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

For variables that are initialized once at __init time, these can
be marked with the (new and under development) __ro_after_init
attribute.

What remains are variables that are updated rarely (e.g. GDT).
These will need another infrastructure (similar to the temporary
exceptions made to kernel code mentioned above) that allows them to spend
the rest of their lifetime read-only. (For example, when being updated,
only the CPU thread performing the update would be given uninterruptible
write access to the memory.)

#### Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

### Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building
without CONFIG_COMPAT. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set of entry points normally available to
unprivileged userspace.

### Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface.
(The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


## Memory integrity

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

### Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
the function returns. Other defenses include things like shadow stacks.

### Stack depth overflow

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures.
Two important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

### Heap memory integrity

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

### Counter integrity

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

### Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.


## Statistical defenses

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

### Canaries, blinding, and other secrets

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used must be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?)
in order to maximize their success.

### Kernel Address Space Layout Randomization (KASLR)

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

#### Text and module base

By relocating the physical and virtual base address of the kernel at
boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

#### Stack base

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

#### Dynamic memory base

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

#### Structure layout

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.
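
As a rough userspace illustration of why structure layout randomization
frustrates an attack, the sketch below models two builds that order the
same fields differently. This is not kernel code: the structure names,
their fields, and the poke_uid() helper are all hypothetical, and a real
build would pick the order from a per-build seed rather than have it
written out by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structure as one build might lay it out. */
struct obj_build_a {
	unsigned int uid;
	unsigned long flags;
};

/* Same fields, shuffled, as a differently seeded build might lay it out. */
struct obj_build_b {
	unsigned long flags;
	unsigned int uid;
};

/*
 * Write a uid at a raw byte offset, the way code relying on a
 * hard-coded offset (instead of the field name) would.
 */
void poke_uid(void *obj, size_t obj_size, size_t offset, unsigned int uid)
{
	assert(offset + sizeof(uid) <= obj_size);	/* stay inside the object */
	memcpy((char *)obj + offset, &uid, sizeof(uid));
}
```

Calling poke_uid() on a struct obj_build_b with
offsetof(struct obj_build_a, uid) lands on the flags field and leaves
uid untouched, which is the kind of miss an attack tuned to the wrong
build would hit.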


## Preventing Information Exposures

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

### Unique identifiers

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.

### Memory initialization

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

### Memory poisoning

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.

### Destination tracking

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed /proc files),
it should automatically censor sensitive values.
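
One way to picture destination tracking is a formatting helper that is
told where its output is headed and censors addresses accordingly. The
following is only a userspace sketch of that idea under that assumption:
enum dest and format_addr() are hypothetical names invented for this
example and are not existing kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical tag describing where a formatted buffer will end up. */
enum dest { DEST_KERNEL_LOG, DEST_USERSPACE };

/*
 * Format an address into buf. If the buffer is destined for userspace,
 * emit a fixed placeholder instead of the real value. Returns the
 * number of characters snprintf() wanted to write.
 */
int format_addr(char *buf, size_t len, const void *addr, enum dest d)
{
	assert(buf != NULL && len > 0);

	if (d == DEST_USERSPACE)
		return snprintf(buf, len, "%s", "(censored)");

	return snprintf(buf, len, "0x%016llx",
			(unsigned long long)(uintptr_t)addr);
}
```

The point of routing every address through one such helper is that the
censoring decision is made from the destination, not left to each
caller to remember.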