.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And don't forget about NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the three memory models:
FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

.. note::
   At the time of this writing, DISCONTIGMEM is considered deprecated,
   although it is still in use by several architectures.

All the memory models track the status of physical page frames using
:c:type:`struct page` arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.
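
As a rough illustration of a flat memory map, the following user-space
sketch models a single global array indexed by PFN. It is not kernel
code: `struct page` is reduced to a stub, and `FIRST_PFN`, `NR_FRAMES`
and the `toy_` helpers are hypothetical names chosen for this example.
`FIRST_PFN` stands in for the first page frame number of a system whose
memory does not start at address 0.

```c
#include <assert.h>
#include <stddef.h>

/* Stub standing in for the kernel's struct page (assumption). */
struct page {
    unsigned long flags;
};

#define FIRST_PFN 0x100UL   /* hypothetical first valid PFN */
#define NR_FRAMES 64UL      /* hypothetical number of page frames */

/* One global array covers every page frame, holes included. */
static struct page mem_map[NR_FRAMES];

/* PFN -> struct page: subtract the first PFN and index the array. */
static struct page *toy_pfn_to_page(unsigned long pfn)
{
    assert(pfn >= FIRST_PFN && pfn - FIRST_PFN < NR_FRAMES);
    return &mem_map[pfn - FIRST_PFN];
}

/* struct page -> PFN: the offset into the array plus the first PFN. */
static unsigned long toy_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - mem_map) + FIRST_PFN;
}
```

For any PFN in range, `toy_page_to_pfn(toy_pfn_to_page(pfn))` returns
`pfn` again, which is the one-to-one mapping described earlier.
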

To allocate the `mem_map` array, architecture specific setup code
should call the :c:func:`free_area_init_node` function or its convenience
wrapper :c:func:`free_area_init`. Yet, the mappings array is not
usable until the call to :c:func:`memblock_free_all` that hands all
the memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.

DISCONTIGMEM
============

The DISCONTIGMEM model treats the physical memory as a collection of
`nodes` similarly to how Linux NUMA support does. For each node Linux
constructs an independent memory management subsystem represented by
`struct pglist_data` (or `pg_data_t` for short). Among other
things, `pg_data_t` holds the `node_mem_map` array that maps
physical pages belonging to that node. The `node_start_pfn` field of
`pg_data_t` is the number of the first page frame belonging to that
node.

The architecture setup code should call :c:func:`free_area_init_node` for
each node in the system to initialize the `pg_data_t` object and its
`node_mem_map`.

Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
every physical page frame in a node has a `struct page` entry in the
`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
`flags` field of the `struct page` encodes the node number of the
node hosting that page.
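
The per-node layout described above can be sketched in user space as
follows. This is a toy model under stated assumptions, not the kernel
implementation: `struct toy_pgdat` keeps only the two fields named in
the text plus a hypothetical frame count, the node number is stored in
the low bits of `flags` (the real bit layout differs), and
`toy_pfn_to_nid` is a hypothetical linear search used only to make the
example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Stub standing in for the kernel's struct page (assumption). */
struct page {
    unsigned long flags;
};

/* Toy counterpart of pg_data_t with only the fields needed here. */
struct toy_pgdat {
    struct page *node_mem_map;     /* per-node array of struct page */
    unsigned long node_start_pfn;  /* first PFN belonging to the node */
    unsigned long nr_frames;       /* hypothetical: frames in the node */
};

#define TOY_NR_NODES 2
#define TOY_NID_MASK 0xffUL        /* hypothetical: node id in low bits */

static struct page node0_map[16], node1_map[16];

static struct toy_pgdat toy_nodes[TOY_NR_NODES] = {
    { node0_map, 0x000UL, 16 },
    { node1_map, 0x800UL, 16 },
};

/* Record the hosting node number in the flags of every page. */
static void toy_init_nodes(void)
{
    for (int nid = 0; nid < TOY_NR_NODES; nid++)
        for (unsigned long i = 0; i < toy_nodes[nid].nr_frames; i++)
            toy_nodes[nid].node_mem_map[i].flags = (unsigned long)nid;
}

/* PFN -> node number, or -1 if the PFN falls into a hole. */
static int toy_pfn_to_nid(unsigned long pfn)
{
    for (int nid = 0; nid < TOY_NR_NODES; nid++) {
        unsigned long start = toy_nodes[nid].node_start_pfn;

        if (pfn >= start && pfn - start < toy_nodes[nid].nr_frames)
            return nid;
    }
    return -1;
}

/* PFN -> struct page: find the node, then index its node_mem_map. */
static struct page *toy_pfn_to_page(unsigned long pfn)
{
    int nid = toy_pfn_to_nid(pfn);

    if (nid < 0)
        return NULL;
    return toy_nodes[nid].node_mem_map +
           (pfn - toy_nodes[nid].node_start_pfn);
}

/* struct page -> node number, read back from the flags field. */
static int toy_page_to_nid(const struct page *page)
{
    return (int)(page->flags & TOY_NID_MASK);
}

/* struct page -> PFN: offset in node_mem_map plus node_start_pfn. */
static unsigned long toy_page_to_pfn(const struct page *page)
{
    const struct toy_pgdat *node = &toy_nodes[toy_page_to_nid(page)];

    return (unsigned long)(page - node->node_mem_map) +
           node->node_start_pfn;
}
```

`toy_init_nodes()` has to run before `toy_page_to_pfn()` is used,
because the reverse conversion depends on the node number stored in
`flags`.
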

The conversion between a PFN and the `struct page` in the
DISCONTIGMEM model is slightly more complex as it has to determine
which node hosts the physical page and which `pg_data_t` object
holds the `struct page`.

Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
to convert PFN to the node number. The opposite conversion helper
:c:func:`page_to_nid` is generic as it uses the node number encoded in
page->flags.

Once the node number is known, the PFN, taken relative to the node's
`node_start_pfn`, can be used to index the appropriate `node_mem_map`
array to access the `struct page`. Conversely, the offset of the
`struct page` from the `node_mem_map` plus `node_start_pfn` is the PFN
of that page.

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with :c:type:`struct mem_section`
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the sections management. The section size and maximal number
of sections are specified using `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is an actual width of a
physical address that an architecture supports, the
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`.

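
To make the formula concrete, the sketch below computes the number of
sections and splits a PFN into a section number and an offset within
the section. The constant values are assumptions picked for
illustration (4 KB pages, 128 MB sections, 46 physical address bits);
real values are architecture specific. The `toy_` helpers are
hypothetical names for this example.

```c
#include <assert.h>

/* Example values only; each architecture defines its own. */
#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 27
#define MAX_PHYSMEM_BITS  46

/* NR_MEM_SECTIONS = 2 ^ (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) */
#define SECTIONS_SHIFT    (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#define NR_MEM_SECTIONS   (1UL << SECTIONS_SHIFT)

/* Number of PFN bits that select a page inside one section. */
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)

/* The high bits of a PFN select the section. */
static unsigned long toy_pfn_to_section_nr(unsigned long pfn)
{
    unsigned long nr = pfn >> PFN_SECTION_SHIFT;

    assert(nr < NR_MEM_SECTIONS);
    return nr;
}

/* The low bits of a PFN select the page inside the section. */
static unsigned long toy_pfn_to_section_offset(unsigned long pfn)
{
    return pfn & (PAGES_PER_SECTION - 1);
}
```

With these example values there are 2^19 sections of 2^15 pages each.
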
The size and placement of the `mem_sections` array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.

The architecture setup code should call :c:func:`memory_present` for
each active memory range or use :c:func:`memblocks_present` or
:c:func:`sparse_memory_present_with_active_regions` wrappers to
initialize the memory sections. Next, the actual memory maps should be
set up using :c:func:`sparse_init`.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page` - "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index to the array of pages.

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range.
In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with :c:type:`struct vmem_altmap`
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`altmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========
The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`.
Given that
`ZONE_DEVICE` memory is never marked online, its memory ranges are
never exposed through the sysfs memory hotplug API on memory block
boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/vm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.
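
As a small illustration of the 2MB sub-section granularity mentioned
earlier in this section, the sketch below checks whether a physical
address range is aligned to that granularity. `TOY_SUBSECTION_SHIFT`
and `toy_range_is_subsection_aligned` are hypothetical names for this
example and do not claim to mirror the checks performed by the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2MB granularity, as stated in the text: 2^21 bytes. */
#define TOY_SUBSECTION_SHIFT 21
#define TOY_SUBSECTION_SIZE  (UINT64_C(1) << TOY_SUBSECTION_SHIFT)

static_assert(TOY_SUBSECTION_SIZE == 2u * 1024u * 1024u,
              "sub-section size must be 2MB");

/* True if both the start and the size are non-zero multiples of 2MB
 * (a start address of 0 is also accepted). */
static bool toy_range_is_subsection_aligned(uint64_t start, uint64_t size)
{
    uint64_t mask = TOY_SUBSECTION_SIZE - 1;

    return size != 0 && (start & mask) == 0 && (size & mask) == 0;
}
```

A range starting at 0x200000 with a size of 0x400000 passes the check,
while a range starting at 0x201000 does not.
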