# 4.x series change log

This page summarizes the major functional and performance changes in each
release of the 4.x series.

All performance data on this page is measured on an Intel Core i5-9600K
clocked at 4.2 GHz, running `astcenc` using AVX2 and 6 threads.

<!-- ---------------------------------------------------------------------- -->
## 4.7.0

**Status:** January 2024

The 4.7.0 release is a major maintenance release, fixing rounding behavior in
the decompressor to match the Khronos specification. This fix includes the
addition of explicit support for optimizing for `decode_unorm8` rounding.

Reminder - the codec library API is not designed to be binary compatible across
versions. We always recommend rebuilding your client-side code using the updated
`astcenc.h` header.

* **General:**
  * **Bug-fix:** sRGB LDR decompression now uses the correct endpoint expansion
    method to create the 16-bit RGB endpoint colors, and removes the previous
    correction code from the interpolation function. This bug could result in
    LSB bit flips relative to the standard specification.
  * **Bug-fix:** Decompressing to an 8-bit per component output image now matches
    the `decode_unorm8` extension rounding rules. This bug could result in
    LSB bit flips relative to the standard specification.
  * **Bug-fix:** Code now avoids using `alignas()` in the reference C
    implementation, as the default `alignas(16)` is narrower than the
    native minimum alignment requirement on some CPUs.
  * **Feature:** Library configuration supports a new flag,
    `ASTCENC_FLG_USE_DECODE_UNORM8`. This flag indicates that the image will be
    used with the `decode_unorm8` decode mode. When set during compression
    this allows the compressor to use the correct rounding when determining the
    best encoding.
  * **Feature:** Command line tool supports a new option, `-decode_unorm8`.
    This option indicates that the image will be used with the `decode_unorm8`
    decode mode. This option will automatically be set for decompression
    (`-d*`) and trial (`-t*`) tool operation if the decompressed output image
    is stored to an 8-bit per component file format. This option must be set
    manually for compression (`-c*`) tool operation, as the desired decode mode
    cannot be reliably determined.
  * **Feature:** Library configuration supports a new optional progress
    reporting callback. This is called during compression to allow
    interactive tooling use cases to display incremental progress. The
    command line tool uses this feature to show compression progress unless
    `-silent` is used.

<!-- ---------------------------------------------------------------------- -->
## 4.6.1

**Status:** November 2023

The 4.6.1 release is a minor maintenance release to fix a scaling bug on
large core count Windows systems.

* **General:**
  * **Optimization:** Windows builds of the `astcenc` command line tool can now
    use more than 64 cores on large core count systems. This change doubled
    command line performance for `-exhaustive` compression when testing on a
    96 core/192 thread system.
  * **Feature:** Windows Arm64 native builds of the `astcenc` command line tool
    are now included in the prebuilt release binaries.

<!-- ---------------------------------------------------------------------- -->
## 4.6.0

**Status:** November 2023

The 4.6.0 release retunes the compressor heuristics to give improvements to
performance for trivial losses to image quality. It also includes some minor
bug fixes and code quality improvements.

Reminder - the codec library API is not designed to be binary compatible across
versions. We always recommend rebuilding your client-side code using the updated
`astcenc.h` header.

* **General:**
  * **Bug-fix:** Fixed context allocation for contexts allocated with the
    `ASTCENC_FLG_DECOMPRESS_ONLY` flag.
  * **Bug-fix:** Reduced use of `reinterpret_cast` in the core codec to
    avoid strict aliasing violations.
  * **Optimization:** `-medium` search quality no longer tests 4 partition
    encodings for block sizes between 25 and 83 texels (inclusive). This
    improves performance for a tiny drop in image quality.
  * **Optimization:** `-thorough` and higher search qualities no longer test the
    mode0 first search for block sizes between 25 and 83 texels (inclusive).
    This improves performance for a tiny drop in image quality.
  * **Optimization:** `TUNE_MAX_PARTITIONING_CANDIDATES` reduced from 32 to 8
    to reduce the size of stack allocated data structures. This causes a tiny
    drop in image quality for the `-verythorough` and `-exhaustive` presets.

<!-- ---------------------------------------------------------------------- -->
## 4.5.0

**Status:** June 2023

The 4.5.0 release is a maintenance release with small image quality
improvements, and a number of build system quality of life improvements.

* **General:**
  * **Bug-fix:** Improved handling of compiler arguments in CMake, including
    consistent use of MSVC-style command line arguments for ClangCL.
  * **Bug-fix:** Invariant Clang builds now use `-ffp-model=precise` with
    `-ffp-contract=off` which is needed to restore invariance due to recent
    changes in compiler defaults.
  * **Change:** macOS binary releases are now distributed as a single universal
    binary for all platforms.
  * **Change:** Windows binary releases are now compiled with VS2022.
  * **Change:** Invariant MSVC builds for VS2022 now use `/fp:precise` instead
    of `/fp:strict`, which is now possible because precise no longer implies
    contraction. This should improve performance for MSVC builds.
  * **Change:** Non-invariant Clang builds now use `-ffp-model=precise` with
    `-ffp-contract=on`. This should improve performance on older Clang
    versions which defaulted to no contraction.
  * **Change:** Non-invariant MSVC builds for VS2022 now use `/fp:precise`
    with `/fp:contract`. This should improve performance for MSVC builds.
  * **Change:** CMake config variables now use an `ASTCENC_` prefix to add a
    namespace and group options when the library is used in a larger project.
  * **Change:** CMake config `ASTCENC_UNIVERSAL_BUILD` for building macOS
    universal binaries has been improved to include the `x86_64h` slice for
    AVX2 builds. Universal builds are now on by default for macOS, and always
    include NEON (arm64), SSE4.1 (x86_64), and AVX2 (x86_64h) variants.
  * **Change:** CMake config `ASTCENC_NO_INVARIANCE` has been inverted to
    remove the negated option, and is now `ASTCENC_INVARIANCE` with a default
    of `ON`. Disabling this option can substantially improve performance, but
    images can differ across platforms and compilers.
  * **Optimization:** Color quantization and packing for LDR RGB and RGBA has
    been vectorized to improve performance.
  * **Change:** Color quantization for LDR RGB and RGBA endpoints will now try
    multiple quantization packing methods, and pick the one with the lowest
    endpoint encoding error. This gives a minor image quality improvement, for
    no significant performance impact when combined with the vectorization
    optimizations.

<!-- ---------------------------------------------------------------------- -->
## 4.4.0

**Status:** March 2023

The 4.4.0 release is a minor release with image quality improvements, a small
performance boost, and a few new quality-of-life features.

* **General:**
  * **Change:** Core library no longer checks availability of required
    instruction set extensions, such as SSE4.1 or AVX2.
    Checking compatibility is now the responsibility of the caller. See
    `astcenccli_entry.cpp` for an example of code performing this check.
  * **Change:** Core library can be built as a shared object by setting the
    `-DSHAREDLIB=ON` CMake option, resulting in e.g. `libastcenc-avx2-shared.so`.
    Note that the command line tool is always statically linked.
  * **Change:** Decompressed 3D images will now write one output file per
    slice, if the target format is a 2D image format.
  * **Change:** Command line errors print to stderr instead of stdout.
  * **Change:** Color encoding uses new quantization tables, that now factor
    in floating-point rounding if a distance tie is found when using the
    integer quant256 value. This improves image quality for 4x4 and 5x5 block
    sizes.
  * **Optimization:** Partition selection uses a simplified line calculation
    with a faster approximation. This improves performance for all block sizes.
  * **Bug-fix:** Fixed missing symbol error in decompressor-only builds.
  * **Bug-fix:** Fixed infinity handling in debug trace JSON files.

### Performance:

Key for charts:

* Color = block size (see legend).
* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).

**Relative performance vs 4.3 release:**

<!-- ---------------------------------------------------------------------- -->
## 4.3.1

**Status:** January 2023

The 4.3.1 release is a minor maintenance release. No performance or image
quality changes are expected.

* **General:**
  * **Bug-fix:** Fixed typo in `-2/3/4partitioncandidatelimit` CLI options.
  * **Bug-fix:** Fixed handling for `-3/4partitionindexlimit` CLI options.
  * **Bug-fix:** Updated to `stb_image.h` v2.28, which includes multiple fixes
    and improvements for image loading.

<!-- ---------------------------------------------------------------------- -->
## 4.3.0

**Status:** January 2023

The 4.3.0 release is an optimization release. There are minor performance
and image quality improvements in this release.

Reminder - the codec library API is not designed to be binary compatible across
versions. We always recommend rebuilding your client-side code using the updated
`astcenc.h` header.

* **General:**
  * **Bug-fix:** Use lower case `windows.h` include for MinGW compatibility.
  * **Change:** The `-mask` command line option, `ASTCENC_FLG_MAP_MASK` in the
    library API, has been removed.
  * **Optimization:** Always skip blue-contraction for `QUANT_256` encodings.
    This gives a small image quality improvement for the 4x4 block size.
  * **Optimization:** Always skip RGBO vector calculation for LDR encodings.
  * **Optimization:** Defer color packing and scrambling to physical layer.
  * **Optimization:** Remove folded `decimation_info` lookup tables. This
    significantly reduces compressor memory footprint and improves context
    creation time. Impact increases with the active block size.
  * **Optimization:** Increased trial and refinement pruning by using stricter
    target errors when determining whether to skip iterations.

### Performance:

Key for charts:

* Color = block size (see legend).
* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).

**Relative performance vs 4.2 release:**

<!-- ---------------------------------------------------------------------- -->
## 4.2.0

**Status:** November 2022

The 4.2.0 release is an optimization release. There are significant performance
improvements, minor image quality improvements, and library interface changes in
this release.

Reminder - the codec library API is not designed to be binary compatible across
versions.
We always recommend rebuilding your client-side code using the updated
`astcenc.h` header.

* **General:**
  * **Bug-fix:** Compression for RGB and RGBA base+offset encodings no
    longer generates endpoints with the incorrect blue-contract behavior.
  * **Bug-fix:** Lowest channel correlation calculation now correctly ignores
    constant color channels for the purposes of filtering 2 plane encodings.
    On average this improves both performance and image quality.
  * **Bug-fix:** ISA compatibility now checked in `config_init()` as well as
    in `context_alloc()`.
  * **Change:** Removed the low-weight count optimization, as more recent
    changes had significantly reduced its performance benefit. Option removed
    from both command line and configuration structure.
  * **Feature:** The `-exhaustive` mode now runs full trials on more
    partitioning candidates and block candidates. This improves image quality
    by 0.1 to 0.25 dB, but slows down compression by 3x. The `-verythorough`
    and `-thorough` modes also test more candidates.
  * **Feature:** A new preset, `-verythorough`, has been introduced to provide
    a standard performance point between `-thorough` and the re-tuned
    `-exhaustive` mode. This new mode is faster and higher quality than the
    `-exhaustive` preset in the 4.1 release.
  * **Feature:** The compressor can now independently vary the number of
    partitionings considered for error estimation for 2/3/4 partitions. This
    allows heuristics to put more effort into 2 partitions, and less into
    3/4 partitions.
  * **Feature:** The compressor can now run trials on a variable number of
    candidate partitionings, allowing high quality modes to explore more of the
    search space at the expense of slower compression. The number of trials is
    independently configurable for 2/3/4 partition cases.
  * **Optimization:** Introduce early-out threshold for 2/3/4 partition
    searches based on the results after 1 of 2 trials. This significantly
    improves performance for `-medium` and `-thorough` searches, for a minor
    loss in image quality.
  * **Optimization:** Reduce early-out threshold for 3/4 partition searches
    based on 2/3 partition results. This significantly improves performance,
    especially for `-thorough` searches, for a minor loss in image quality.
  * **Optimization:** Use direct vector compare to create a SIMD mask instead
    of a scalar compare that is broadcast to a vector mask.
  * **Optimization:** Remove obsolete partition validity masks from the
    partition selection algorithm.
  * **Optimization:** Removed obsolete channel scaling from partition
    `avgs_and_dirs()` calculation.

### Performance:

Key for charts:

* Color = block size (see legend).
* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).

**Relative performance vs 4.0 and 4.1 release:**

<!-- ---------------------------------------------------------------------- -->
## 4.1.0

**Status:** August 2022

The 4.1.0 release is a maintenance release. There is no performance or image
quality change in this release.

* **General:**
  * **Change:** Command line decompressor no longer uses the legacy
    `GL_LUMINANCE` or `GL_LUMINANCE_ALPHA` format enums when writing KTX
    output files. Luminance textures now use the `GL_RED` format and
    luminance_alpha textures now use the `GL_RG` format.
  * **Change:** Command line tool gains a new `-dimage` option to generate
    diagnostic images showing aspects of the compression encoding. The output
    file name with its extension stripped is used as the stem of the diagnostic
    image file names.
  * **Bug-fix:** Library decompressor builds for SSE no longer use masked store
    `maskmovdqu` instructions, as they can generate faults on masked lanes.
  * **Bug-fix:** Command line decompressor now correctly uses sized type enums
    for the internal format when writing output KTX files.
  * **Bug-fix:** Command line compressor now correctly loads 16 and 32-bit per
    component input KTX files.
  * **Bug-fix:** Fixed GCC9 compiler warnings on Arm aarch64.

<!-- ---------------------------------------------------------------------- -->
## 4.0.0

**Status:** July 2022

The 4.0.0 release introduces some major performance enhancements, and a number
of larger changes to the heuristics used in the codec to find a more effective
cost:quality trade off.

* **General:**
  * **Change:** The `-array` option for specifying the number of image planes
    for ASTC 3D volumetric block compression has been renamed to `-zdim`.
  * **Change:** The build root package directory is now `bin` instead of
    `astcenc`, allowing the CMake install step to write binaries into
    `/usr/local/bin` if the user wishes to do so.
  * **Feature:** A new `-ssw` option for specifying the shader sampling swizzle
    has been added as a convenience alternative to the `-cw` option. This is
    needed to correct error weighting during compression if not all components
    are read in the shader. For example, to extract and compress two components
    from an RGBA input image, weighting the two components equally when
    sampling through `.ra` in the shader, use `-esw ggga -ssw ra`. In this
    example `-ssw ra` is equivalent to the alternative `-cw 1 0 0 1` encoding.
  * **Feature:** The `-a` alpha weighting option has been re-enabled in the
    backend, and now again applies alpha scaling to the RGB error metrics when
    encoding.
    This is based on the maximum alpha in each block, not the
    individual texel alpha values used in the earlier implementation.
  * **Feature:** The command line tool now has `-repeats <count>` for testing,
    which will iterate around compression and decompression `count` times.
    Reported performance metrics also now separate compression and
    decompression scores.
  * **Feature:** The core codec is now warning clean up to /W4 for both MSVC
    `cl.exe` and `clangcl.exe` compilers.
  * **Feature:** The core codec now supports arm64 for both MSVC `cl.exe` and
    `clangcl.exe` compilers.
  * **Feature:** `NO_INVARIANCE` builds will enable the `-ffp-contract=fast`
    option for all targets when using Clang or GCC. In addition AVX2 targets
    will also set the `-mfma` option. This reduces image quality by up to 0.2dB
    (normally much less), but improves performance by up to 5-20%.
  * **Optimization:** Angular endpoint min/max weight selection is restricted
    to weight `QUANT_11` or lower. Higher quantization levels assume default
    0-1 range, which is less accurate but much faster.
  * **Optimization:** Maximum weight quantization for later trials is selected
    based on the weight quantization of the best encoding from the 1 plane 1
    partition trial. This significantly reduces the search space for the later
    trials with more planes or partitions.
  * **Optimization:** Small data tables now use in-register SIMD permutes
    rather than gathers (AVX2) or unrolled scalar lookups (SSE/NEON). This can
    be a significant optimization for paths that are load unit limited.
  * **Optimization:** Decompressed image block writes in the decompressor now
    use a vectorized approach to writing each row of texels in the block,
    including the ability to exploit masked stores if the target supports them.
  * **Optimization:** Weight scrambling has been moved into the physical layer;
    the rest of the codec now uses linear order weights.
  * **Optimization:** Weight packing has been moved into the physical layer;
    the rest of the codec now uses unpacked weights in the 0-64 range.
  * **Optimization:** Consistently vectorize the creation of unquantized weight
    grids when they are needed.
  * **Optimization:** Remove redundant per-decimation mode copies of endpoint
    and weight structures, which were really read-only duplicates.
  * **Optimization:** Early-out the same endpoint mode color calculation if it
    cannot be applied.
  * **Optimization:** Numerous type size reductions applied to arrays to reduce
    both context working buffer size usage and stack usage.

### Performance:

Key for charts:

* Color = block size (see legend).
* Letter = image format (N = normal map, G = grayscale, L = LDR, H = HDR).

**Relative performance vs 3.7 release:**

- - -

_Copyright © 2022-2024, Arm Limited and contributors. All rights reserved._