1The basis transforms used for FFT and various other derived functions are based 2on the following unrollings. 3The functions can be easily adapted to double precision floats as well. 4 5# Parity permutation 6The basis transforms described here all use the following permutation: 7 8``` C 9void ff_tx_gen_split_radix_parity_revtab(int *revtab, int len, int inv, 10 int basis, int dual_stride); 11``` 12Parity means even and odd complex numbers will be split, e.g. the even 13coefficients will come first, after which the odd coefficients will be 14placed. For example, a 4-point transform's coefficients after reordering: 15`z[0].re, z[0].im, z[2].re, z[2].im, z[1].re, z[1].im, z[3].re, z[3].im` 16 17The basis argument is the length of the largest non-composite transform 18supported, and also implies that the basis/2 transform is supported as well, 19as the split-radix algorithm requires it to be. 20 21The dual_stride argument indicates that both the basis, as well as the 22basis/2 transforms support doing two transforms at once, and the coefficients 23will be interleaved between each pair in a split-radix like so (stride == 2): 24`tx1[0], tx1[2], tx2[0], tx2[2], tx1[1], tx1[3], tx2[1], tx2[3]` 25A non-zero number switches this on, with the value indicating the stride 26(how many values of 1 transform to put first before switching to the other). 27Must be a power of two or 0. Must be less than the basis. 28Value will be clipped to the transform size, so for a basis of 16 and a 29dual_stride of 8, dual 8-point transforms will be laid out as if dual_stride 30was set to 4. 31Usually you'll set this to half the complex numbers that fit in a single 32register or 0. This allows to reuse SSE functions as dual-transform 33functions in AVX mode. 34If length is smaller than basis/2 this function will not do anything. 35 36# 4-point FFT transform 37The only permutation this transform needs is to swap the `z[1]` and `z[2]` 38elements when performing an inverse transform, which in the assembly code is 39hardcoded with the function itself being templated and duplicated for each 40direction. 41 42``` C 43static void fft4(FFTComplex *z) 44{ 45 FFTSample r1 = z[0].re - z[2].re; 46 FFTSample r2 = z[0].im - z[2].im; 47 FFTSample r3 = z[1].re - z[3].re; 48 FFTSample r4 = z[1].im - z[3].im; 49 /* r5-r8 second transform */ 50 51 FFTSample t1 = z[0].re + z[2].re; 52 FFTSample t2 = z[0].im + z[2].im; 53 FFTSample t3 = z[1].re + z[3].re; 54 FFTSample t4 = z[1].im + z[3].im; 55 /* t5-t8 second transform */ 56 57 /* 1sub + 1add = 2 instructions */ 58 59 /* 2 shufs */ 60 FFTSample a3 = t1 - t3; 61 FFTSample a4 = t2 - t4; 62 FFTSample b3 = r1 - r4; 63 FFTSample b2 = r2 - r3; 64 65 FFTSample a1 = t1 + t3; 66 FFTSample a2 = t2 + t4; 67 FFTSample b1 = r1 + r4; 68 FFTSample b4 = r2 + r3; 69 /* 1 add 1 sub 3 shufs */ 70 71 z[0].re = a1; 72 z[0].im = a2; 73 z[2].re = a3; 74 z[2].im = a4; 75 76 z[1].re = b1; 77 z[1].im = b2; 78 z[3].re = b3; 79 z[3].im = b4; 80} 81``` 82 83# 8-point AVX FFT transform 84Input must be pre-permuted using the parity lookup table, generated via 85`ff_tx_gen_split_radix_parity_revtab`. 86 87``` C 88static void fft8(FFTComplex *z) 89{ 90 FFTSample r1 = z[0].re - z[4].re; 91 FFTSample r2 = z[0].im - z[4].im; 92 FFTSample r3 = z[1].re - z[5].re; 93 FFTSample r4 = z[1].im - z[5].im; 94 95 FFTSample r5 = z[2].re - z[6].re; 96 FFTSample r6 = z[2].im - z[6].im; 97 FFTSample r7 = z[3].re - z[7].re; 98 FFTSample r8 = z[3].im - z[7].im; 99 100 FFTSample q1 = z[0].re + z[4].re; 101 FFTSample q2 = z[0].im + z[4].im; 102 FFTSample q3 = z[1].re + z[5].re; 103 FFTSample q4 = z[1].im + z[5].im; 104 105 FFTSample q5 = z[2].re + z[6].re; 106 FFTSample q6 = z[2].im + z[6].im; 107 FFTSample q7 = z[3].re + z[7].re; 108 FFTSample q8 = z[3].im + z[7].im; 109 110 FFTSample s3 = q1 - q3; 111 FFTSample s1 = q1 + q3; 112 FFTSample s4 = q2 - q4; 113 FFTSample s2 = q2 + q4; 114 115 FFTSample s7 = q5 - q7; 116 FFTSample s5 = q5 + q7; 117 FFTSample s8 = q6 - q8; 118 FFTSample s6 = q6 + q8; 119 120 FFTSample e1 = s1 * -1; 121 FFTSample e2 = s2 * -1; 122 FFTSample e3 = s3 * -1; 123 FFTSample e4 = s4 * -1; 124 125 FFTSample e5 = s5 * 1; 126 FFTSample e6 = s6 * 1; 127 FFTSample e7 = s7 * -1; 128 FFTSample e8 = s8 * 1; 129 130 FFTSample w1 = e5 - e1; 131 FFTSample w2 = e6 - e2; 132 FFTSample w3 = e8 - e3; 133 FFTSample w4 = e7 - e4; 134 135 FFTSample w5 = s1 - e5; 136 FFTSample w6 = s2 - e6; 137 FFTSample w7 = s3 - e8; 138 FFTSample w8 = s4 - e7; 139 140 z[0].re = w1; 141 z[0].im = w2; 142 z[2].re = w3; 143 z[2].im = w4; 144 z[4].re = w5; 145 z[4].im = w6; 146 z[6].re = w7; 147 z[6].im = w8; 148 149 FFTSample z1 = r1 - r4; 150 FFTSample z2 = r1 + r4; 151 FFTSample z3 = r3 - r2; 152 FFTSample z4 = r3 + r2; 153 154 FFTSample z5 = r5 - r6; 155 FFTSample z6 = r5 + r6; 156 FFTSample z7 = r7 - r8; 157 FFTSample z8 = r7 + r8; 158 159 z3 *= -1; 160 z5 *= -M_SQRT1_2; 161 z6 *= -M_SQRT1_2; 162 z7 *= M_SQRT1_2; 163 z8 *= M_SQRT1_2; 164 165 FFTSample t5 = z7 - z6; 166 FFTSample t6 = z8 + z5; 167 FFTSample t7 = z8 - z5; 168 FFTSample t8 = z7 + z6; 169 170 FFTSample u1 = z2 + t5; 171 FFTSample u2 = z3 + t6; 172 FFTSample u3 = z1 - t7; 173 FFTSample u4 = z4 + t8; 174 175 FFTSample u5 = z2 - t5; 176 FFTSample u6 = z3 - t6; 177 FFTSample u7 = z1 + t7; 178 FFTSample u8 = z4 - t8; 179 180 z[1].re = u1; 181 z[1].im = u2; 182 z[3].re = u3; 183 z[3].im = u4; 184 z[5].re = u5; 185 z[5].im = u6; 186 z[7].re = u7; 187 z[7].im = u8; 188} 189``` 190 191As you can see, there are 2 independent paths, one for even and one for odd coefficients. 192This theme continues throughout the document. Note that in the actual assembly code, 193the paths are interleaved to improve unit saturation and CPU dependency tracking, so 194to more clearly see them, you'll need to deinterleave the instructions. 195 196# 8-point SSE/ARM64 FFT transform 197Input must be pre-permuted using the parity lookup table, generated via 198`ff_tx_gen_split_radix_parity_revtab`. 199 200``` C 201static void fft8(FFTComplex *z) 202{ 203 FFTSample r1 = z[0].re - z[4].re; 204 FFTSample r2 = z[0].im - z[4].im; 205 FFTSample r3 = z[1].re - z[5].re; 206 FFTSample r4 = z[1].im - z[5].im; 207 208 FFTSample j1 = z[2].re - z[6].re; 209 FFTSample j2 = z[2].im - z[6].im; 210 FFTSample j3 = z[3].re - z[7].re; 211 FFTSample j4 = z[3].im - z[7].im; 212 213 FFTSample q1 = z[0].re + z[4].re; 214 FFTSample q2 = z[0].im + z[4].im; 215 FFTSample q3 = z[1].re + z[5].re; 216 FFTSample q4 = z[1].im + z[5].im; 217 218 FFTSample k1 = z[2].re + z[6].re; 219 FFTSample k2 = z[2].im + z[6].im; 220 FFTSample k3 = z[3].re + z[7].re; 221 FFTSample k4 = z[3].im + z[7].im; 222 /* 2 add 2 sub = 4 */ 223 224 /* 2 shufs, 1 add 1 sub = 4 */ 225 FFTSample s1 = q1 + q3; 226 FFTSample s2 = q2 + q4; 227 FFTSample g1 = k3 + k1; 228 FFTSample g2 = k2 + k4; 229 230 FFTSample s3 = q1 - q3; 231 FFTSample s4 = q2 - q4; 232 FFTSample g4 = k3 - k1; 233 FFTSample g3 = k2 - k4; 234 235 /* 1 unpack + 1 shuffle = 2 */ 236 237 /* 1 add */ 238 FFTSample w1 = s1 + g1; 239 FFTSample w2 = s2 + g2; 240 FFTSample w3 = s3 + g3; 241 FFTSample w4 = s4 + g4; 242 243 /* 1 sub */ 244 FFTSample h1 = s1 - g1; 245 FFTSample h2 = s2 - g2; 246 FFTSample h3 = s3 - g3; 247 FFTSample h4 = s4 - g4; 248 249 z[0].re = w1; 250 z[0].im = w2; 251 z[2].re = w3; 252 z[2].im = w4; 253 z[4].re = h1; 254 z[4].im = h2; 255 z[6].re = h3; 256 z[6].im = h4; 257 258 /* 1 shuf + 1 shuf + 1 xor + 1 addsub */ 259 FFTSample z1 = r1 + r4; 260 FFTSample z2 = r2 - r3; 261 FFTSample z3 = r1 - r4; 262 FFTSample z4 = r2 + r3; 263 264 /* 1 mult */ 265 j1 *= M_SQRT1_2; 266 j2 *= -M_SQRT1_2; 267 j3 *= -M_SQRT1_2; 268 j4 *= M_SQRT1_2; 269 270 /* 1 shuf + 1 addsub */ 271 FFTSample l2 = j1 - j2; 272 FFTSample l1 = j2 + j1; 273 FFTSample l4 = j3 - j4; 274 FFTSample l3 = j4 + j3; 275 276 /* 1 shuf + 1 addsub */ 277 FFTSample t1 = l3 - l2; 278 FFTSample t2 = l4 + l1; 279 FFTSample t3 = l1 - l4; 280 FFTSample t4 = l2 + l3; 281 282 /* 1 add */ 283 FFTSample u1 = z1 - t1; 284 FFTSample u2 = z2 - t2; 285 FFTSample u3 = z3 - t3; 286 FFTSample u4 = z4 - t4; 287 288 /* 1 sub */ 289 FFTSample o1 = z1 + t1; 290 FFTSample o2 = z2 + t2; 291 FFTSample o3 = z3 + t3; 292 FFTSample o4 = z4 + t4; 293 294 z[1].re = u1; 295 z[1].im = u2; 296 z[3].re = u3; 297 z[3].im = u4; 298 z[5].re = o1; 299 z[5].im = o2; 300 z[7].re = o3; 301 z[7].im = o4; 302} 303``` 304 305Most functions here are highly tuned to use x86's addsub instruction to save on 306external sign mask loading. 307 308# 16-point AVX FFT transform 309This version expects the output of the 8 and 4-point transforms to follow the 310even/odd convention established above. 311 312``` C 313static void fft16(FFTComplex *z) 314{ 315 FFTSample cos_16_1 = 0.92387950420379638671875f; 316 FFTSample cos_16_3 = 0.3826834261417388916015625f; 317 318 fft8(z); 319 fft4(z+8); 320 fft4(z+10); 321 322 FFTSample s[32]; 323 324 /* 325 xorps m1, m1 - free 326 mulps m0 327 shufps m1, m1, m0 328 xorps 329 addsub 330 shufps 331 mulps 332 mulps 333 addps 334 or (fma3) 335 shufps 336 shufps 337 mulps 338 mulps 339 fma 340 fma 341 */ 342 343 s[0] = z[8].re*( 1) - z[8].im*( 0); 344 s[1] = z[8].im*( 1) + z[8].re*( 0); 345 s[2] = z[9].re*( 1) - z[9].im*(-1); 346 s[3] = z[9].im*( 1) + z[9].re*(-1); 347 348 s[4] = z[10].re*( 1) - z[10].im*( 0); 349 s[5] = z[10].im*( 1) + z[10].re*( 0); 350 s[6] = z[11].re*( 1) - z[11].im*( 1); 351 s[7] = z[11].im*( 1) + z[11].re*( 1); 352 353 s[8] = z[12].re*( cos_16_1) - z[12].im*( -cos_16_3); 354 s[9] = z[12].im*( cos_16_1) + z[12].re*( -cos_16_3); 355 s[10] = z[13].re*( cos_16_3) - z[13].im*( -cos_16_1); 356 s[11] = z[13].im*( cos_16_3) + z[13].re*( -cos_16_1); 357 358 s[12] = z[14].re*( cos_16_1) - z[14].im*( cos_16_3); 359 s[13] = z[14].im*( -cos_16_1) + z[14].re*( -cos_16_3); 360 s[14] = z[15].re*( cos_16_3) - z[15].im*( cos_16_1); 361 s[15] = z[15].im*( -cos_16_3) + z[15].re*( -cos_16_1); 362 363 s[2] *= M_SQRT1_2; 364 s[3] *= M_SQRT1_2; 365 s[5] *= -1; 366 s[6] *= M_SQRT1_2; 367 s[7] *= -M_SQRT1_2; 368 369 FFTSample w5 = s[0] + s[4]; 370 FFTSample w6 = s[1] - s[5]; 371 FFTSample x5 = s[2] + s[6]; 372 FFTSample x6 = s[3] - s[7]; 373 374 FFTSample w3 = s[4] - s[0]; 375 FFTSample w4 = s[5] + s[1]; 376 FFTSample x3 = s[6] - s[2]; 377 FFTSample x4 = s[7] + s[3]; 378 379 FFTSample y5 = s[8] + s[12]; 380 FFTSample y6 = s[9] - s[13]; 381 FFTSample u5 = s[10] + s[14]; 382 FFTSample u6 = s[11] - s[15]; 383 384 FFTSample y3 = s[12] - s[8]; 385 FFTSample y4 = s[13] + s[9]; 386 FFTSample u3 = s[14] - s[10]; 387 FFTSample u4 = s[15] + s[11]; 388 389 /* 2xorps, 2vperm2fs, 2 adds, 2 vpermilps = 8 */ 390 391 FFTSample o1 = z[0].re + w5; 392 FFTSample o2 = z[0].im + w6; 393 FFTSample o5 = z[1].re + x5; 394 FFTSample o6 = z[1].im + x6; 395 FFTSample o9 = z[2].re + w4; //h 396 FFTSample o10 = z[2].im + w3; 397 FFTSample o13 = z[3].re + x4; 398 FFTSample o14 = z[3].im + x3; 399 400 FFTSample o17 = z[0].re - w5; 401 FFTSample o18 = z[0].im - w6; 402 FFTSample o21 = z[1].re - x5; 403 FFTSample o22 = z[1].im - x6; 404 FFTSample o25 = z[2].re - w4; //h 405 FFTSample o26 = z[2].im - w3; 406 FFTSample o29 = z[3].re - x4; 407 FFTSample o30 = z[3].im - x3; 408 409 FFTSample o3 = z[4].re + y5; 410 FFTSample o4 = z[4].im + y6; 411 FFTSample o7 = z[5].re + u5; 412 FFTSample o8 = z[5].im + u6; 413 FFTSample o11 = z[6].re + y4; //h 414 FFTSample o12 = z[6].im + y3; 415 FFTSample o15 = z[7].re + u4; 416 FFTSample o16 = z[7].im + u3; 417 418 FFTSample o19 = z[4].re - y5; 419 FFTSample o20 = z[4].im - y6; 420 FFTSample o23 = z[5].re - u5; 421 FFTSample o24 = z[5].im - u6; 422 FFTSample o27 = z[6].re - y4; //h 423 FFTSample o28 = z[6].im - y3; 424 FFTSample o31 = z[7].re - u4; 425 FFTSample o32 = z[7].im - u3; 426 427 /* This is just deinterleaving, happens separately */ 428 z[0] = (FFTComplex){ o1, o2 }; 429 z[1] = (FFTComplex){ o3, o4 }; 430 z[2] = (FFTComplex){ o5, o6 }; 431 z[3] = (FFTComplex){ o7, o8 }; 432 z[4] = (FFTComplex){ o9, o10 }; 433 z[5] = (FFTComplex){ o11, o12 }; 434 z[6] = (FFTComplex){ o13, o14 }; 435 z[7] = (FFTComplex){ o15, o16 }; 436 437 z[8] = (FFTComplex){ o17, o18 }; 438 z[9] = (FFTComplex){ o19, o20 }; 439 z[10] = (FFTComplex){ o21, o22 }; 440 z[11] = (FFTComplex){ o23, o24 }; 441 z[12] = (FFTComplex){ o25, o26 }; 442 z[13] = (FFTComplex){ o27, o28 }; 443 z[14] = (FFTComplex){ o29, o30 }; 444 z[15] = (FFTComplex){ o31, o32 }; 445} 446``` 447 448# AVX split-radix synthesis 449To create larger transforms, the following unrolling of the C split-radix 450function is used. 451 452``` C 453#define BF(x, y, a, b) \ 454 do { \ 455 x = (a) - (b); \ 456 y = (a) + (b); \ 457 } while (0) 458 459#define BUTTERFLIES(a0,a1,a2,a3) \ 460 do { \ 461 r0=a0.re; \ 462 i0=a0.im; \ 463 r1=a1.re; \ 464 i1=a1.im; \ 465 BF(q3, q5, q5, q1); \ 466 BF(a2.re, a0.re, r0, q5); \ 467 BF(a3.im, a1.im, i1, q3); \ 468 BF(q4, q6, q2, q6); \ 469 BF(a3.re, a1.re, r1, q4); \ 470 BF(a2.im, a0.im, i0, q6); \ 471 } while (0) 472 473#undef TRANSFORM 474#define TRANSFORM(a0,a1,a2,a3,wre,wim) \ 475 do { \ 476 CMUL(q1, q2, a2.re, a2.im, wre, -wim); \ 477 CMUL(q5, q6, a3.re, a3.im, wre, wim); \ 478 BUTTERFLIES(a0, a1, a2, a3); \ 479 } while (0) 480 481#define CMUL(dre, dim, are, aim, bre, bim) \ 482 do { \ 483 (dre) = (are) * (bre) - (aim) * (bim); \ 484 (dim) = (are) * (bim) + (aim) * (bre); \ 485 } while (0) 486 487static void recombine(FFTComplex *z, const FFTSample *cos, 488 unsigned int n) 489{ 490 const int o1 = 2*n; 491 const int o2 = 4*n; 492 const int o3 = 6*n; 493 const FFTSample *wim = cos + o1 - 7; 494 FFTSample q1, q2, q3, q4, q5, q6, r0, i0, r1, i1; 495 496#if 0 497 for (int i = 0; i < n; i += 4) { 498#endif 499 500#if 0 501 TRANSFORM(z[ 0 + 0], z[ 0 + 4], z[o2 + 0], z[o2 + 2], cos[0], wim[7]); 502 TRANSFORM(z[ 0 + 1], z[ 0 + 5], z[o2 + 1], z[o2 + 3], cos[2], wim[5]); 503 TRANSFORM(z[ 0 + 2], z[ 0 + 6], z[o2 + 4], z[o2 + 6], cos[4], wim[3]); 504 TRANSFORM(z[ 0 + 3], z[ 0 + 7], z[o2 + 5], z[o2 + 7], cos[6], wim[1]); 505 506 TRANSFORM(z[o1 + 0], z[o1 + 4], z[o3 + 0], z[o3 + 2], cos[1], wim[6]); 507 TRANSFORM(z[o1 + 1], z[o1 + 5], z[o3 + 1], z[o3 + 3], cos[3], wim[4]); 508 TRANSFORM(z[o1 + 2], z[o1 + 6], z[o3 + 4], z[o3 + 6], cos[5], wim[2]); 509 TRANSFORM(z[o1 + 3], z[o1 + 7], z[o3 + 5], z[o3 + 7], cos[7], wim[0]); 510#else 511 FFTSample h[8], j[8], r[8], w[8]; 512 FFTSample t[8]; 513 FFTComplex *m0 = &z[0]; 514 FFTComplex *m1 = &z[4]; 515 FFTComplex *m2 = &z[o2 + 0]; 516 FFTComplex *m3 = &z[o2 + 4]; 517 518 const FFTSample *t1 = &cos[0]; 519 const FFTSample *t2 = &wim[0]; 520 521 /* 2 loads (tabs) */ 522 523 /* 2 vperm2fs, 2 shufs (im), 2 shufs (tabs) */ 524 /* 1 xor, 1 add, 1 sub, 4 mults OR 2 mults, 2 fmas */ 525 /* 13 OR 10ish (-2 each for second passovers!) */ 526 527 w[0] = m2[0].im*t1[0] - m2[0].re*t2[7]; 528 w[1] = m2[0].re*t1[0] + m2[0].im*t2[7]; 529 w[2] = m2[1].im*t1[2] - m2[1].re*t2[5]; 530 w[3] = m2[1].re*t1[2] + m2[1].im*t2[5]; 531 w[4] = m3[0].im*t1[4] - m3[0].re*t2[3]; 532 w[5] = m3[0].re*t1[4] + m3[0].im*t2[3]; 533 w[6] = m3[1].im*t1[6] - m3[1].re*t2[1]; 534 w[7] = m3[1].re*t1[6] + m3[1].im*t2[1]; 535 536 j[0] = m2[2].im*t1[0] + m2[2].re*t2[7]; 537 j[1] = m2[2].re*t1[0] - m2[2].im*t2[7]; 538 j[2] = m2[3].im*t1[2] + m2[3].re*t2[5]; 539 j[3] = m2[3].re*t1[2] - m2[3].im*t2[5]; 540 j[4] = m3[2].im*t1[4] + m3[2].re*t2[3]; 541 j[5] = m3[2].re*t1[4] - m3[2].im*t2[3]; 542 j[6] = m3[3].im*t1[6] + m3[3].re*t2[1]; 543 j[7] = m3[3].re*t1[6] - m3[3].im*t2[1]; 544 545 /* 1 add + 1 shuf */ 546 t[1] = j[0] + w[0]; 547 t[0] = j[1] + w[1]; 548 t[3] = j[2] + w[2]; 549 t[2] = j[3] + w[3]; 550 t[5] = j[4] + w[4]; 551 t[4] = j[5] + w[5]; 552 t[7] = j[6] + w[6]; 553 t[6] = j[7] + w[7]; 554 555 /* 1 sub + 1 xor */ 556 r[0] = (w[0] - j[0]); 557 r[1] = -(w[1] - j[1]); 558 r[2] = (w[2] - j[2]); 559 r[3] = -(w[3] - j[3]); 560 r[4] = (w[4] - j[4]); 561 r[5] = -(w[5] - j[5]); 562 r[6] = (w[6] - j[6]); 563 r[7] = -(w[7] - j[7]); 564 565 /* Min: 2 subs, 2 adds, 2 vperm2fs (OPTIONAL) */ 566 m2[0].re = m0[0].re - t[0]; 567 m2[0].im = m0[0].im - t[1]; 568 m2[1].re = m0[1].re - t[2]; 569 m2[1].im = m0[1].im - t[3]; 570 m3[0].re = m0[2].re - t[4]; 571 m3[0].im = m0[2].im - t[5]; 572 m3[1].re = m0[3].re - t[6]; 573 m3[1].im = m0[3].im - t[7]; 574 575 m2[2].re = m1[0].re - r[0]; 576 m2[2].im = m1[0].im - r[1]; 577 m2[3].re = m1[1].re - r[2]; 578 m2[3].im = m1[1].im - r[3]; 579 m3[2].re = m1[2].re - r[4]; 580 m3[2].im = m1[2].im - r[5]; 581 m3[3].re = m1[3].re - r[6]; 582 m3[3].im = m1[3].im - r[7]; 583 584 m0[0].re = m0[0].re + t[0]; 585 m0[0].im = m0[0].im + t[1]; 586 m0[1].re = m0[1].re + t[2]; 587 m0[1].im = m0[1].im + t[3]; 588 m0[2].re = m0[2].re + t[4]; 589 m0[2].im = m0[2].im + t[5]; 590 m0[3].re = m0[3].re + t[6]; 591 m0[3].im = m0[3].im + t[7]; 592 593 m1[0].re = m1[0].re + r[0]; 594 m1[0].im = m1[0].im + r[1]; 595 m1[1].re = m1[1].re + r[2]; 596 m1[1].im = m1[1].im + r[3]; 597 m1[2].re = m1[2].re + r[4]; 598 m1[2].im = m1[2].im + r[5]; 599 m1[3].re = m1[3].re + r[6]; 600 m1[3].im = m1[3].im + r[7]; 601 602 /* Identical for below, but with the following parameters */ 603 m0 = &z[o1]; 604 m1 = &z[o1 + 4]; 605 m2 = &z[o3 + 0]; 606 m3 = &z[o3 + 4]; 607 t1 = &cos[1]; 608 t2 = &wim[-1]; 609 610 w[0] = m2[0].im*t1[0] - m2[0].re*t2[7]; 611 w[1] = m2[0].re*t1[0] + m2[0].im*t2[7]; 612 w[2] = m2[1].im*t1[2] - m2[1].re*t2[5]; 613 w[3] = m2[1].re*t1[2] + m2[1].im*t2[5]; 614 w[4] = m3[0].im*t1[4] - m3[0].re*t2[3]; 615 w[5] = m3[0].re*t1[4] + m3[0].im*t2[3]; 616 w[6] = m3[1].im*t1[6] - m3[1].re*t2[1]; 617 w[7] = m3[1].re*t1[6] + m3[1].im*t2[1]; 618 619 j[0] = m2[2].im*t1[0] + m2[2].re*t2[7]; 620 j[1] = m2[2].re*t1[0] - m2[2].im*t2[7]; 621 j[2] = m2[3].im*t1[2] + m2[3].re*t2[5]; 622 j[3] = m2[3].re*t1[2] - m2[3].im*t2[5]; 623 j[4] = m3[2].im*t1[4] + m3[2].re*t2[3]; 624 j[5] = m3[2].re*t1[4] - m3[2].im*t2[3]; 625 j[6] = m3[3].im*t1[6] + m3[3].re*t2[1]; 626 j[7] = m3[3].re*t1[6] - m3[3].im*t2[1]; 627 628 /* 1 add + 1 shuf */ 629 t[1] = j[0] + w[0]; 630 t[0] = j[1] + w[1]; 631 t[3] = j[2] + w[2]; 632 t[2] = j[3] + w[3]; 633 t[5] = j[4] + w[4]; 634 t[4] = j[5] + w[5]; 635 t[7] = j[6] + w[6]; 636 t[6] = j[7] + w[7]; 637 638 /* 1 sub + 1 xor */ 639 r[0] = (w[0] - j[0]); 640 r[1] = -(w[1] - j[1]); 641 r[2] = (w[2] - j[2]); 642 r[3] = -(w[3] - j[3]); 643 r[4] = (w[4] - j[4]); 644 r[5] = -(w[5] - j[5]); 645 r[6] = (w[6] - j[6]); 646 r[7] = -(w[7] - j[7]); 647 648 /* Min: 2 subs, 2 adds, 2 vperm2fs (OPTIONAL) */ 649 m2[0].re = m0[0].re - t[0]; 650 m2[0].im = m0[0].im - t[1]; 651 m2[1].re = m0[1].re - t[2]; 652 m2[1].im = m0[1].im - t[3]; 653 m3[0].re = m0[2].re - t[4]; 654 m3[0].im = m0[2].im - t[5]; 655 m3[1].re = m0[3].re - t[6]; 656 m3[1].im = m0[3].im - t[7]; 657 658 m2[2].re = m1[0].re - r[0]; 659 m2[2].im = m1[0].im - r[1]; 660 m2[3].re = m1[1].re - r[2]; 661 m2[3].im = m1[1].im - r[3]; 662 m3[2].re = m1[2].re - r[4]; 663 m3[2].im = m1[2].im - r[5]; 664 m3[3].re = m1[3].re - r[6]; 665 m3[3].im = m1[3].im - r[7]; 666 667 m0[0].re = m0[0].re + t[0]; 668 m0[0].im = m0[0].im + t[1]; 669 m0[1].re = m0[1].re + t[2]; 670 m0[1].im = m0[1].im + t[3]; 671 m0[2].re = m0[2].re + t[4]; 672 m0[2].im = m0[2].im + t[5]; 673 m0[3].re = m0[3].re + t[6]; 674 m0[3].im = m0[3].im + t[7]; 675 676 m1[0].re = m1[0].re + r[0]; 677 m1[0].im = m1[0].im + r[1]; 678 m1[1].re = m1[1].re + r[2]; 679 m1[1].im = m1[1].im + r[3]; 680 m1[2].re = m1[2].re + r[4]; 681 m1[2].im = m1[2].im + r[5]; 682 m1[3].re = m1[3].re + r[6]; 683 m1[3].im = m1[3].im + r[7]; 684#endif 685 686#if 0 687 z += 4; // !!! 688 cos += 2*4; 689 wim -= 2*4; 690 } 691#endif 692} 693``` 694 695The macros used are identical to those in the generic C version, only with all 696variable declarations exported to the function body. 697An important point here is that the high frequency registers (m2 and m3) have 698their high and low halves swapped in the output. This is intentional, as the 699inputs must also have the same layout, and therefore, the input swapping is only 700performed once for the bottom-most basis transform, with all subsequent combinations 701using the already swapped halves. 702 703Also note that this function requires a special iteration way, due to coefficients 704beginning to overlap, particularly `[o1]` with `[0]` after the second iteration. 705To iterate further, set `z = &z[16]` via `z += 8` for the second iteration. After 706the 4th iteration, the layout resets, so repeat the same. 707