# Introduction

This directory contains SystemZ deflate hardware acceleration support.
It can be enabled using the following build commands:

    $ ./configure --with-dfltcc-deflate --with-dfltcc-inflate
    $ make

or

    $ cmake -DWITH_DFLTCC_DEFLATE=1 -DWITH_DFLTCC_INFLATE=1 .
    $ make

When built like this, zlib-ng compresses using hardware on level 1,
and using software on all other levels. Decompression always happens
in hardware. To enable hardware compression for levels 1-6
(i.e. to have it used by default), add
`-DDFLTCC_LEVEL_MASK=0x7e` to CFLAGS when building zlib-ng.

SystemZ deflate hardware acceleration is available on
[IBM z15](https://www.ibm.com/products/z15) and newer machines under the
name ["Integrated Accelerator for zEnterprise Data Compression"](https://www.ibm.com/support/z-content-solutions/compression/).
The programming interface to it is a machine instruction called DEFLATE
CONVERSION CALL (DFLTCC). It is documented in Chapter 26 of [Principles
of Operation](http://publibfp.dhe.ibm.com/epubs/pdf/a227832c.pdf). Both
the code and the rest of this document refer to this feature simply as
"DFLTCC".

# Performance

Performance figures are published
[here](https://github.com/iii-i/zlib-ng/wiki/Performance-with-dfltcc-patch-applied-and-dfltcc-support-built-on-dfltcc-enabled-machine).
The compression speed-up can be as high as 110x and the decompression
speed-up can be as high as 15x.

# Limitations

Two DFLTCC compression calls with identical inputs are not guaranteed to
produce identical outputs. Therefore care should be taken when using
hardware compression when reproducible results are desired. In
particular, the zlib-ng-specific `zng_deflateSetParams` call allows
setting the `Z_DEFLATE_REPRODUCIBLE` parameter, which disables DFLTCC
support for a particular stream.

DFLTCC does not support every single zlib-ng feature, in particular:

* `inflate(Z_BLOCK)` and `inflate(Z_TREES)`
* `inflateMark()`
* `inflatePrime()`

When used, these functions will either switch to software, or, in case
this is not possible, gracefully fail.

# Code structure

All SystemZ-specific code lives in the `arch/s390` directory and is
integrated with the rest of zlib-ng using hook macros.

## Hook macros

DFLTCC takes as arguments a parameter block, an input buffer, an output
buffer and a window. The `ZALLOC_STATE()`, `ZFREE_STATE()`,
`ZCOPY_STATE()`, `ZALLOC_WINDOW()` and `TRY_FREE_WINDOW()` macros
encapsulate allocation details for the parameter block (which is
allocated alongside zlib-ng state) and the window (which must be
page-aligned).

While inflate software and hardware window formats match, this is not
the case for deflate. Therefore, `deflateSetDictionary()` and
`deflateGetDictionary()` need special handling, which is triggered using
the `DEFLATE_SET_DICTIONARY_HOOK()` and `DEFLATE_GET_DICTIONARY_HOOK()`
macros.

`deflateResetKeep()` and `inflateResetKeep()` update the DFLTCC
parameter block using the `DEFLATE_RESET_KEEP_HOOK()` and
`INFLATE_RESET_KEEP_HOOK()` macros.

The `INFLATE_PRIME_HOOK()` and `INFLATE_MARK_HOOK()` macros make the
unsupported `inflatePrime()` and `inflateMark()` calls fail gracefully.

`DEFLATE_PARAMS_HOOK()` implements switching between hardware and
software compression mid-stream using `deflateParams()`. Switching
normally entails flushing the current block, which might not be possible
in low memory situations. `deflateParams()` uses the `DEFLATE_DONE()`
hook in order to detect and gracefully handle such situations.

The algorithm implemented in hardware has a different compression ratio
than the one implemented in software.
The `DEFLATE_BOUND_ADJUST_COMPLEN()`
and `DEFLATE_NEED_CONSERVATIVE_BOUND()` macros make `deflateBound()`
return the correct results for the hardware implementation.

Actual compression and decompression are handled by the `DEFLATE_HOOK()`
and `INFLATE_TYPEDO_HOOK()` macros. Since inflation with DFLTCC manages
the window on its own, calling `updatewindow()` is suppressed using the
`INFLATE_NEED_UPDATEWINDOW()` macro.

In addition to compression, DFLTCC computes CRC-32 and Adler-32
checksums. Therefore, whenever it is used, software checksumming is
suppressed using the `DEFLATE_NEED_CHECKSUM()` and
`INFLATE_NEED_CHECKSUM()` macros.

While software always produces reproducible compression results, this
is not the case for DFLTCC. Therefore, zlib-ng users are given the
ability to specify whether or not reproducible compression results
are required. While it is always possible to specify this setting
before the compression begins, it is not always possible to do so in
the middle of a deflate stream - the exact conditions for that are
determined by the `DEFLATE_CAN_SET_REPRODUCIBLE()` macro.

## SystemZ-specific code

When zlib-ng is built with DFLTCC, the hooks described above are
converted to calls to functions, which are implemented in
`arch/s390/dfltcc_*` files. The functions can be grouped into three
broad categories:

* Base DFLTCC support, e.g. wrapping the machine instruction -
  `dfltcc()` and allocating aligned memory - `dfltcc_alloc_state()`.
* Translating between software and hardware data formats, e.g.
  `dfltcc_deflate_set_dictionary()`.
* Translating between software and hardware state machines, e.g.
  `dfltcc_deflate()` and `dfltcc_inflate()`.

The functions from the first two categories are fairly simple. However,
various quirks in both software and hardware state machines make the
functions from the third category quite complicated.

### `dfltcc_deflate()` function

This function is called by `deflate()` and has the following
responsibilities:

* Checking whether DFLTCC can be used with the current stream. If this
  is not the case, then it returns `0`, making `deflate()` use some
  other function in order to compress in software. Otherwise it returns
  `1`.
* Block management and Huffman table generation. DFLTCC ends blocks only
  when explicitly instructed to do so by the software. Furthermore,
  whether to use fixed or dynamic Huffman tables must also be determined
  by the software. Since looking at data in order to gather statistics
  would negate performance benefits, the following approach is used: the
  first `DFLTCC_FIRST_FHT_BLOCK_SIZE` bytes are placed into a fixed
  block, and each subsequent run of `DFLTCC_BLOCK_SIZE` bytes is placed
  into a dynamic block.
* Writing EOBS (the end-of-block symbol). The Block Closing Control bit
  in the parameter block instructs DFLTCC to write EOBS. However,
  certain conditions need to be met: input data length must be non-zero
  or Continuation Flag must be set. To put this in simpler terms, DFLTCC
  will silently refuse to write EOBS if this is the only thing that it
  is asked to do. Since the code has to be able to emit EOBS in software
  anyway, Block Closing Control is never used, in order to avoid tricky
  corner cases. Whether to write EOBS is instead controlled by the
  `soft_bcc` variable.
* Triggering block post-processing. Depending on flush mode, `deflate()`
  must perform various additional actions when a block or a stream ends.
  `dfltcc_deflate()` informs `deflate()` about this using the
  `block_state *result` parameter.
* Converting software state fields into hardware parameter block fields,
  and vice versa. For example, `wrap` and Check Value Type or `bi_valid`
  and Sub-Byte Boundary.
  Certain fields cannot be translated and must
  persist untouched in the parameter block between calls, for example,
  Continuation Flag or Continuation State Buffer.
* Handling flush modes and low-memory situations. These aspects are
  quite intertwined and pervasive. The general idea here is that, when
  Continuation Flag is set, the code must not do anything in software,
  whether explicitly (e.g. by calling `send_eobs()`) or implicitly (by
  returning to `deflate()` with certain return and `*result` values).
* Ending streams. When a new block is started and flush mode is
  `Z_FINISH`, the Block Header Final parameter block bit is used to mark
  this block as final. However, sometimes an empty final block is
  needed, and, unfortunately, just like with EOBS, DFLTCC will silently
  refuse to do this. The general idea of the DFLTCC implementation is to
  rely as much as possible on the existing code. Here, in order to do
  this, the code pretends that it does not support DFLTCC, which makes
  `deflate()` call a software compression function, which writes an
  empty final block. Whether this is required is controlled by the
  `need_empty_block` variable.
* Error handling. This simply converts the
  Operation-Ending-Supplemental Code to a string. Errors can only happen
  due to things like memory corruption, and therefore they don't affect
  the `deflate()` return code.

### `dfltcc_inflate()` function

This function is called by `inflate()` from the `TYPEDO` state (that is,
when all the metadata is parsed and the stream is positioned at the type
bits of the deflate block header) and it is responsible for the
following:

* Falling back to software when flush mode is `Z_BLOCK` or `Z_TREES`.
  Unfortunately, there is no way to ask DFLTCC to stop decompressing on
  a block or tree boundary.
* `inflate()` decompression loop management.
  This is controlled using
  the return value, which can be either `DFLTCC_INFLATE_BREAK` or
  `DFLTCC_INFLATE_CONTINUE`.
* Converting software state fields into hardware parameter block fields,
  and vice versa. For example, `whave` and History Length or `wnext` and
  History Offset.
* Ending streams. This instructs `inflate()` to return `Z_STREAM_END`
  and is controlled by the `last` state field.
* Error handling. As with deflate, error handling consists of converting
  the Operation-Ending-Supplemental Code to a string. Unlike deflate,
  errors may happen due to bad inputs, therefore they are propagated to
  `inflate()` by setting the `mode` field to `MEM` or `BAD`.

# Testing

Given the complexity of the DFLTCC machine instruction, it is not clear
whether QEMU TCG will ever support it. At the time of writing, one has
to have access to an IBM z15+ VM or LPAR in order to test DFLTCC
support. Since DFLTCC is a non-privileged instruction, neither special
VM/LPAR configuration nor root access is required.

Still, zlib-ng CI has a few QEMU TCG-based configurations that check
whether fallback to software is working.