.. SPDX-License-Identifier: GPL-2.0-only

================
Design of dm-vdo
================

The dm-vdo (virtual data optimizer) target provides inline deduplication,
compression, zero-block elimination, and thin provisioning. A dm-vdo target
can be backed by up to 256TB of storage, and can present a logical size of
up to 4PB. This target was originally developed at Permabit Technology
Corp., was first released in 2013, and has been used in production
environments ever since. It was made open-source in 2017 after Permabit was
acquired by Red Hat. This document describes the design of dm-vdo. For
usage, see vdo.rst in the same directory as this file.

Because deduplication rates fall drastically as the block size increases, a
vdo target has a maximum block size of 4K. However, it can achieve
deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
reference a single 4K of actual storage. It can achieve compression rates
of 14:1, and zero blocks consume no storage at all.

Theory of Operation
===================

The design of dm-vdo is based on the idea that deduplication is a two-part
problem. The first part is to recognize duplicate data. The second part is
to avoid storing multiple copies of those duplicates. Therefore, dm-vdo has
two main parts: a deduplication index (called UDS) that is used to discover
duplicate data, and a data store with a reference counted block map that
maps from logical block addresses to the actual storage location of the
data.

Zones and Threading
-------------------

Because vdo must operate on small block sizes in order to achieve good
deduplication rates, acceptable performance can only be achieved through
parallelism. Therefore, vdo's design attempts to be lock-free.

Most of vdo's data structures are designed to be easily divided into
"zones," such that any given bio must only access a single zone of any
zoned structure, and each zone is handled by a single thread. The zone
divisions are not reflected in the on-disk representation of each data
structure. Therefore, the number of zones for each structure, and hence the
amount of parallelism, can be configured each time a vdo target is started.

The Deduplication Index
-----------------------

Because deduplication is a trade-off between the storage saved and the
resources expended to achieve those savings, vdo does not attempt to find
every last duplicate block; finding and eliminating most of the duplicates
is sufficient.

Each block of data is hashed to produce a 16-byte block name. An index
record consists of this block name paired with the presumed location of
that data on the underlying storage. However, it is not possible to
guarantee that the index is accurate. Most commonly, this is
because it is too costly to update the index when a block is over-written
or discarded. Doing so would require either storing the block name along
with the blocks, which is difficult to do efficiently in block-based
storage, or reading and rehashing each block before overwriting it.

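For illustration, an index record can be sketched in userspace C as
follows. This is only a sketch: the names are invented, and the toy FNV-1a
hash stands in for the production-quality 128-bit hash that a real
deduplication index would need::

  #include <stdint.h>
  #include <stddef.h>

  #define BLOCK_SIZE 4096

  struct index_record {
      uint8_t block_name[16];  /* hash of the block's contents */
      uint64_t physical_block; /* presumed location on storage (a hint) */
  };

  static uint64_t fnv1a64(const uint8_t *data, size_t len, uint64_t hash)
  {
      for (size_t i = 0; i < len; i++) {
          hash ^= data[i];
          hash *= 0x100000001b3ULL;
      }
      return hash;
  }

  static void make_record(const uint8_t block[BLOCK_SIZE], uint64_t pbn,
                          struct index_record *rec)
  {
      /* Two differently seeded 64-bit hashes stand in for a single
       * 128-bit hash of the block's data. */
      uint64_t lo = fnv1a64(block, BLOCK_SIZE, 0xcbf29ce484222325ULL);
      uint64_t hi = fnv1a64(block, BLOCK_SIZE, 0x9e3779b97f4a7c15ULL);

      for (int i = 0; i < 8; i++) {
          rec->block_name[i] = (uint8_t)(lo >> (8 * i));
          rec->block_name[8 + i] = (uint8_t)(hi >> (8 * i));
      }
      rec->physical_block = pbn;
  }
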
For these reasons, vdo treats the locations in the index as hints, and
reads each indicated block to verify that it is indeed a duplicate before
sharing the existing block with a new one.

Records that are added to the index at about the same time are stored
together, so requests with good temporal locality should be on a small
number of pages, reducing the I/O required to retrieve them.

The chapter index uses only a subset of the block name as its key, so it
cannot guarantee that an index entry refers to the desired block name. It
can only guarantee that if there is a record for this name, it will be on
the indicated page. Closed chapters are read-only structures and their
contents are never altered in any way.

When a request finds a matching record in an older chapter, that record is
copied into the open chapter. This ensures that useful block names remain
available in the index, while unreferenced block names are forgotten over
time.

The volume index is a structure mapping each block name to the chapter
containing its newest record. This mapping is updated as records for the
block name are copied or updated, ensuring that only the newest record for
a given block name can be found. An older record for a block name will no
longer be found even though it has not been deleted from its chapter. Like
the chapter index, the volume index uses only a subset of the block name as
its key and can not definitively say that a record exists for a name.

From the viewpoint of a request for a particular block name, it will first
look up the name in the volume index, which indicates either that the name
is new or which chapter to search. If a chapter is indicated, the request
looks up the name in that chapter's index, which indicates either that the
name is new or exactly which record page to search. Recently accessed index
and record pages are cached, so the page reads required for a lookup can be
amortized across many block name requests.

To save memory, the volume index and the chapter indexes are implemented
using a memory-efficient structure called a delta index. Instead of storing
the entire block name (the key) for each entry, the entries are sorted by
name and only the difference between adjacent keys (the delta) is stored.
Because we expect the hashes to be randomly distributed, the size of the
deltas follows an exponential distribution, so the deltas can be encoded
compactly. However, looking up an entry requires summing deltas from the
beginning of a list, so the delta index reduces this cost by splitting its
key space into many sub-lists, each starting at a fixed key value, so that
each individual list is short.

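The delta-list idea itself is small enough to sketch. The toy version below
stores each delta as a plain integer and scans a single list; the real
index additionally compresses the deltas and keeps many short lists, and
all names here are invented::

  #include <stdint.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct delta_list {
      uint32_t *deltas;  /* deltas[0] is the first key itself */
      size_t count;
  };

  /* Walk the list, summing deltas until we reach or pass the target key. */
  static bool delta_list_contains(const struct delta_list *list, uint32_t key)
  {
      uint32_t sum = 0;

      for (size_t i = 0; i < list->count; i++) {
          sum += list->deltas[i];
          if (sum == key)
              return true;
          if (sum > key)
              return false; /* keys are sorted; we passed it */
      }
      return false;
  }
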
The default index size can hold 64 million records, corresponding to about
256GB of data. This means the index can detect a duplicate if the original
copy was written within the most recent 256GB of writes; this span is
called the deduplication window.

There are two ways to widen this window. One way is to increase the memory
size of the index, which also increases the amount of backing storage
required. Doubling the size of the index will double the length of the
deduplication window, at the expense of doubling the storage size and the
memory requirements.

The other option is sparse indexing, which increases the deduplication
window by a factor of 10, at the expense of also increasing the storage
size by a factor of 10. However, with sparse indexing, the memory
requirements do not increase. The trade-off is slightly more computation
per request and a slight decrease in the amount of deduplication detected:
for most workloads with significant amounts of duplicate data, sparse
indexing will detect 97-99% of the deduplication that a standard index
would detect.

The vio and data_vio Structures
-------------------------------

A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
fields and data to track vdo-specific information. A struct vio maintains a
pointer to a bio along with other fields specific to the operation of vdo.
A data_vio is a larger structure that tracks all the state needed to
process a single 4K block of application data. data_vios come from a
fixed-size pool; benchmarks have indicated that increasing the size of the
pool does not significantly improve performance.

The Data Store
--------------

The data store is implemented by three main data structures: the slab
depot, the block map, and the recovery journal. These work in concert to
amortize metadata updates over as many data writes as possible.

*The Slab Depot*

Most of the vdo volume belongs to the slab depot, which is divided into
"slabs." Most of a slab consists of a linear sequence of 4K blocks, used
either to store data or to hold portions of the block map (see below). In
addition to the data blocks, each slab has a set of reference counters,
using 1 byte for each data block. Finally, each slab has a journal, which
amortizes updates to the reference count blocks. The reference counters are
kept in memory and are written out, a block at a time in
oldest-dirtied-order, only when there is a need to reclaim slab journal
space. These writes are performed in the background as needed, so they do
not add latency to particular I/O operations.

249 zones" in round-robin fashion. If there are P physical zones, then slab n
*The Block Map*

The block map contains the logical to physical mapping. It can be thought
of as an array with one entry per logical address. Each entry is 5 bytes,
36 bits of which contain the physical block number which holds the data for
the given logical address. The other 4 bits represent the state of the
mapping. Of the 16 possible states, one represents a logical address
which is unmapped (i.e. it has never been written, or has been discarded),
one represents an uncompressed block, and the other 14 states are used to
indicate that the mapped data is compressed, and which of the 14 possible
slots in the compressed block contains the data for this logical address.

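One plausible way to pack and unpack such a 5-byte entry is sketched below;
the bit layout and names are assumptions for illustration, not vdo's actual
on-disk format::

  #include <stdint.h>

  enum mapping_state {
      MAPPING_UNMAPPED = 0,     /* never written, or discarded */
      MAPPING_UNCOMPRESSED = 1, /* a full 4K block */
      /* states 2-15: compressed, in slot (state - 2) of the block */
  };

  struct packed_entry {
      uint8_t bytes[5]; /* 36-bit physical block number + 4-bit state */
  };

  static struct packed_entry pack_entry(uint64_t pbn, uint8_t state)
  {
      struct packed_entry e;
      uint64_t raw = ((pbn & 0xFFFFFFFFFULL) << 4) | (state & 0xF);

      for (int i = 0; i < 5; i++)
          e.bytes[i] = (uint8_t)(raw >> (8 * i));
      return e;
  }

  static void unpack_entry(struct packed_entry e, uint64_t *pbn,
                           uint8_t *state)
  {
      uint64_t raw = 0;

      for (int i = 0; i < 5; i++)
          raw |= (uint64_t)e.bytes[i] << (8 * i);
      *state = raw & 0xF;
      *pbn = raw >> 4;
  }
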
In practice, the array of mapping entries is divided into "block map
pages," each of which fits in a single 4K block. Each block map page
consists of a header and 812 mapping entries. Each mapping page is actually
a leaf of a radix tree which consists of block map pages at each level.

There are 60 radix trees, assigned to "logical zones" in round-robin
fashion. The trees are interleaved, so logical addresses 0-811 belong to
tree 0, logical addresses 812-1623 belong to tree 1, and so on. Only the
storage for the 60 tree roots is allocated at format time. All other block
map pages are allocated out of the slabs as they are needed. This flexible
allocation avoids the need to pre-allocate space for the entire set of
logical mappings and also makes growing the logical size of a vdo
relatively easy.

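The interleaving can be expressed as simple arithmetic. In this sketch
(with invented names), logical addresses are grouped into 812-entry pages
and the pages are dealt out to the 60 trees round-robin::

  #include <stdint.h>

  #define ENTRIES_PER_PAGE 812
  #define TREE_COUNT 60

  struct leaf_location {
      unsigned int tree;  /* which of the 60 radix trees */
      uint64_t page;      /* which leaf page within that tree */
      unsigned int slot;  /* entry index within the leaf page */
  };

  static struct leaf_location locate_leaf(uint64_t logical_block)
  {
      uint64_t page_number = logical_block / ENTRIES_PER_PAGE;

      return (struct leaf_location) {
          .tree = page_number % TREE_COUNT,
          .page = page_number / TREE_COUNT,
          .slot = logical_block % ENTRIES_PER_PAGE,
      };
  }

For example, logical address 812 falls on page 1, which belongs to tree 1,
matching the interleaving described above.
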
In operation, the block map maintains two caches. It is prohibitive to keep
the entire leaf level of the trees in memory, so each logical zone
maintains its own cache of leaf pages. The size of this cache is
configurable at target start time. The second cache is allocated at start
time, and is large enough to hold all the non-leaf pages of the entire
block map. This cache is populated as pages are needed.

*The Recovery Journal*

The recovery journal is used to amortize updates across the block map and
slab depot. Each write request causes an entry to be made in the journal.
Entries are either "data remappings" or "block map remappings." For a data
remapping, the journal records the logical address affected and its old and
new physical mappings. For a block map remapping, the journal records the
block map page number and the physical block allocated for it. Block map
pages are never reclaimed, so the old mapping is always 0.

The recovery journal issues a flush before each journal block write to
ensure that the physical data for the new block mappings in that block are
stable on storage, and journal block writes are all issued with the FUA bit
set to ensure that the journal entries themselves are stable.

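The two entry types described above might be pictured as a single struct;
the field names and widths here are illustrative assumptions rather than
vdo's journal format::

  #include <stdbool.h>
  #include <stdint.h>

  struct journal_entry {
      bool is_block_map_remapping;  /* false for a data remapping */
      uint64_t address;             /* logical address, or block map page number */
      uint64_t old_mapping;         /* always 0 for a block map remapping */
      uint64_t new_mapping;         /* newly allocated physical block */
  };
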
*Write Path*

All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
as vdo has done enough work to guarantee that it can complete the write
eventually. Generally, the data for acknowledged but unflushed write I/O
can be treated as though it is cached in memory. If an application requires
data to be stable on storage, it must issue a flush or write the data with
the FUA bit set like any other asynchronous I/O. Shutting down the vdo
target will also flush any remaining I/O.

Application write bios follow the steps outlined below.

1. A data_vio is obtained from the data_vio pool and associated with the
   application bio. If there are no data_vios available, the incoming bio
   will block until a data_vio is available. This provides back pressure
   to the application.

   The bio's data is copied into the data_vio, because the application bio
   can be acknowledged before the data_vio processing is complete. The
   application bio may also be smaller than 4K, in which case the data_vio
   will have already read the underlying block and the data is instead
   copied over the relevant portion of the larger block.

2. The data_vio places a claim (the "logical block lock") on the logical
   address of the bio, so that other data_vios operating on the same
   logical address must wait for it. This lock is an entry in a hashtable,
   and the hashtable is covered by the implicit lock on the logical zone.

3. The data_vio traverses the block map tree to ensure that all the
   tree nodes on the path to the leaf for its logical address have been
   allocated.

   a. If any page-node in the tree has not yet been allocated, it must be
      allocated before the write can continue. This step requires the
      data_vio to lock the page-node that needs to be allocated. This
      lock, like the logical block lock in step 2, is a hashtable entry
      that causes other data_vios interested in the same node to wait for
      the allocation to complete. Once a free block has been obtained for
      the node, it is added to the tree using a similar process to adding
      a new data block mapping: the data_vio journals the intent to add
      the new node to the block map tree (step 10), updates the reference
      count of the new block (step 11), and then continues down the tree.

   b. In the steady-state case, the block map tree nodes will already be
      allocated, so the data_vio just traverses the tree until it finds
      the required leaf node. The location of the mapping (the "block map
      slot") is recorded so that later steps do not need to repeat the
      traversal.

4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
   made to allocate a free data block. This allocation ensures that the
   data_vio can write its data somewhere even if deduplication and
   compression are not possible. The allocation searches the slabs of one
   physical zone until it either finds a free block or decides there are
   none. If the first zone has no free space, the search continues through
   the other zones until the data_vio either finds a free block or runs
   out of zones to search. The data_vio will acquire a struct pbn_lock
   (the "physical block lock") on the free block. The pbn_lock is added to
   a hashtable like the logical block locks in step 2 (this lock-table
   pattern is sketched after these steps). This hashtable is covered by
   the implicit lock on the physical zone. The reference count of the free
   block is updated to prevent any other data_vio from considering the
   block free. The reference counters are a sub-component of the slab and
   are thus also covered by the implicit physical zone lock.

5. If an allocation was obtained, the data_vio has all the resources it
   needs to complete the write, so the application bio can be
   acknowledged.

6. The data_vio's data is hashed to produce its block name, and the
   data_vio is dispatched to the hash zone responsible for that name.

7. The data_vio reserves or joins a struct hash_lock, which manages all of
   the data_vios currently writing the same data. Active hash locks are
   tracked in a hashtable similar to the way logical block locks are
   tracked in step 2. This hashtable is covered by the implicit lock on
   the hash zone. One data_vio in each hash lock acts as its "agent" and
   performs deduplication and compression on behalf of all the data_vios
   in the lock.

8. The agent attempts to deduplicate its data. These are the steps taken:

   a. The agent queries the UDS index with its block name. If the index
      has a record for the name, it returns the presumed location of a
      duplicate.

   b. If advice was returned, the agent must
      obtain a physical block lock on the indicated physical address, in
      order to prevent the address from being modified while the agent
      reads the block and verifies that it is a true duplicate.

   c. If the data matches and the physical block can add references, the
      agent and any other data_vios waiting on it will record this
      physical block as their new physical address and proceed to step 9
      to record their new mappings.

   d. If no usable duplicate block was found, the agent first checks that
      it has an allocated physical block (from step 4) that it can write
      to. If the agent does not have an allocation, some other data_vio in
      the hash lock that does have an allocation takes over as agent. If
      none of the data_vios have an allocated physical block, these writes
      are out of space, so they proceed to step 13 for cleanup.

   e. If no duplicate was found, the agent attempts to compress its data.
      If the compressed size is small enough, the agent will release the
      implicit hash zone lock and go to the packer (struct packer), where
      it is placed in a bin along with other data_vios waiting to be
      written out together.

      The packer combines compressed data from multiple data_vios into a
      single 4K data block. Compression is only helpful if vdo can pack at
      least 2 data_vios into a single data block. This means that a
      data_vio may wait in the packer for an arbitrarily long time for
      other data_vios to fill out the compressed block. There is a
      mechanism for vdo to evict waiting data_vios when continuing to wait
      would cause problems, such as when an application flush arrives or a
      subsequent write is issued to the same logical block address. A
      data_vio may also be evicted from the packer if it cannot be paired
      with any other compressed block before more compressible blocks need
      to use its bin.

   f. When a bin fills, or when one of its data_vios must be evicted, the
      bin is written out using the allocated physical block from one of
      its data_vios. Steps 8g and 8i are then performed for each data_vio
      packed into the block.

   g. Each data_vio sets the compressed block as its new physical address.
      The data_vio acquires the struct pbn_lock for the compressed block,
      which is modified to be a shared lock, releases the implicit hash
      zone lock, and proceeds to step 8i.

   h. If the data_vio's data could not be deduplicated or compressed, it
      still has the allocation from step 4. It will write its data to that
      allocated physical block.

   i. After the data is written, if the data_vio is the agent of a hash
      lock, the other data_vios in the hash lock may now deduplicate
      against the newly written block. All of the data_vios then proceed
      to step 9 to record their new mappings.

9. The data_vio determines the previous mapping of its logical address.
   There is a cache for block map leaf pages (the "block map cache"),
   because there are usually too many block map leaf nodes to store
   entirely in memory. If the desired leaf page is not in the cache, the
   data_vio will reserve a slot in the cache and load the desired page
   into it, possibly evicting an older cached page. The data_vio then
   reads the old physical mapping for its logical address from the leaf
   page. This step requires a lock on the block map cache structures,
   covered by the implicit logical zone lock.

10. The data_vio makes an entry in the recovery journal containing the
    logical block address, the old physical mapping, and the new physical
    mapping, and waits for that journal entry to be committed to storage.

11. Once the journal entry is stable, the data_vio makes the corresponding
    entries in the slab journals and updates the reference counts: the new
    physical block's count is incremented, and the old physical mapping's
    count, if there was one, is decremented.

12. Finally, the data_vio updates the logical-to-physical mapping in the
    block map to point to the new physical block. At this point the write
    operation is complete.

13. The data_vio then performs cleanup. It releases the struct pbn_lock it
    holds for its allocated block. If it had an allocation that it did not
    use, it also sets the reference count for that block back to zero to
    free it for use by subsequent data_vios. Finally, the data_vio
    releases the logical block lock acquired in step 2 and returns to the
    pool.

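Steps 2, 4, and 7 all rely on the same construct: a hashtable of locks
keyed by an address, where the first claimant becomes the holder, later
claimants queue behind it, and the table itself needs no extra locking
because it is only accessed from its zone's thread. A minimal userspace
sketch of the pattern, with invented names, might look like::

  #include <stdint.h>
  #include <stddef.h>

  #define BUCKET_COUNT 1024

  struct lock_entry {
      uint64_t key;            /* logical address, physical address, or hash */
      void *holder;            /* the data_vio that currently owns the lock */
      struct lock_entry *next; /* hash chain */
  };

  struct lock_table {
      struct lock_entry *buckets[BUCKET_COUNT];
  };

  /*
   * Try to take the lock for 'key'. On success, installs 'new_entry' as
   * the holder's entry and returns NULL; otherwise returns the existing
   * entry, and the caller queues itself behind the current holder.
   */
  static struct lock_entry *try_lock(struct lock_table *table, uint64_t key,
                                     void *holder,
                                     struct lock_entry *new_entry)
  {
      struct lock_entry **chain = &table->buckets[key % BUCKET_COUNT];

      for (struct lock_entry *e = *chain; e != NULL; e = e->next) {
          if (e->key == key)
              return e;
      }
      new_entry->key = key;
      new_entry->holder = holder;
      new_entry->next = *chain;
      *chain = new_entry;
      return NULL;
  }
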
*Read Path*

An application read bio follows a simpler set of steps. It performs steps
1 and 2 of the write path to obtain a data_vio and lock its logical
address. It then looks up the logical-to-physical mapping by traversing
the block map tree as in step 3, and reads and possibly decompresses the
data at the indicated physical block address. A read data_vio will not
allocate block map tree nodes if they are missing. If the interior block
map nodes do not exist yet, the logical block address must still be
unmapped and the read data_vio will return all zeroes.

*Small Writes*

All storage within vdo is managed as 4K blocks, but writes as small as 512
bytes are accepted. Processing a write that is smaller than 4K requires
a read-modify-write operation that reads the relevant 4K block, copies the
new data over the appropriate sectors of the block, and then launches a
write operation for the modified data block. The read and write stages of
this operation are handled like normal reads and writes, and a single
data_vio is used for the entire operation.

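The modify stage of this operation amounts to a sector-aligned copy into
the previously read block, roughly (names invented)::

  #include <stdint.h>
  #include <string.h>

  #define SECTOR_SIZE 512
  #define BLOCK_SIZE 4096

  /*
   * Copy the new sectors over the corresponding region of the 4K block
   * that was previously read from storage. The caller guarantees that
   * first_sector + sector_count does not exceed 8 (the sectors in 4K).
   */
  static void merge_partial_write(uint8_t block[BLOCK_SIZE],
                                  const uint8_t *new_data,
                                  unsigned int first_sector,
                                  unsigned int sector_count)
  {
      memcpy(block + (size_t)first_sector * SECTOR_SIZE, new_data,
             (size_t)sector_count * SECTOR_SIZE);
  }
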
*Recovery*

When a vdo is restarted after a crash, it will attempt to recover from the
recovery journal. During the pre-resume phase of the next start, the
recovery journal is read and its valid entries are replayed into the block
map. Next, valid entries are played, in order as required, into the slab
journals, so that the reference counts can be brought back up to date
before the vdo resumes normal operation.

*Read-only Rebuild*

If a vdo encounters an unrecoverable error, it will enter read-only mode.
This mode indicates that some previously acknowledged data may have been
lost. The vdo may be instructed to attempt a rebuild in order to return to
a writable state, but this is never done automatically, due to the
possibility that data has been lost. During a read-only rebuild, the block
map is recovered from the recovery journal as before. However, the
reference counts are not rebuilt from the slab journals. Instead, the
reference counts are zeroed, the entire block map is traversed, and the
reference counts are updated from the block mappings. While this may lose
some data, it ensures that the block map and reference counts are
consistent with each other, which allows the vdo to come back online and
accept further writes.

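The recounting pass can be sketched as follows, assuming, purely for
illustration, a flat array of mappings in which 0 means unmapped::

  #include <stdint.h>
  #include <string.h>

  /*
   * Zero all reference counts, then walk every logical mapping and
   * re-increment the count of each mapped physical block. The caller
   * guarantees that every pbn is less than block_count; the cap below
   * only prevents the 1-byte counters from overflowing.
   */
  static void rebuild_reference_counts(uint8_t *ref_counts,
                                       size_t block_count,
                                       const uint64_t *mappings,
                                       size_t logical_count)
  {
      memset(ref_counts, 0, block_count);

      for (size_t lbn = 0; lbn < logical_count; lbn++) {
          uint64_t pbn = mappings[lbn];

          if (pbn != 0 && ref_counts[pbn] < UINT8_MAX)
              ref_counts[pbn]++;
      }
  }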