Target Independent Opportunities:

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

We should recognize various "overflow detection" idioms and translate them into
llvm.uadd.with.overflow and similar intrinsics.  Here is a multiply idiom:

unsigned int mul(unsigned int a, unsigned int b) {
  if ((unsigned long long)a*b > 0xffffffff)
    exit(0);
  return a*b;
}

The legalization code for mul-with-overflow needs to be made more robust before
this can be implemented, though.

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (-ffast-math).  Misc/mandel will like this. :)  This
isn't safe in general, even on darwin.  See the libm implementation of hypot
for examples (which special case when x/y are exactly zero to get signed zeros
etc right).

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
  return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.

//===---------------------------------------------------------------------===//

Interesting test case for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x + y + x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above).  The
issue is that we end up getting t = 2*X; s = t*t and don't turn this into
4*X*X, which is the same number of multiplies and is canonical, because the
2*X has multiple uses.
Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}

//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4;
void f () {  /* this can be optimized to four additions... */
  b4 = a4 + a3 + a2 + a1 + a0;
  b3 = a3 + a2 + a1 + a0;
  b2 = a2 + a1 + a0;
  b1 = a1 + a0;
}

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.

//===---------------------------------------------------------------------===//

This function (derived from GCC PR19988):

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x + -0.1234 * y));
}

compiles to:

_foo:
        movapd  %xmm1, %xmm2
        mulsd   LCPI1_1(%rip), %xmm1
        mulsd   LCPI1_0(%rip), %xmm2
        addsd   %xmm0, %xmm1
        addsd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        mulsd   %xmm2, %xmm0
        ret

Reassociate should be able to turn it into:

double foo(double x, double y) {
  return ((x + 0.1234 * y) * (x - 0.1234 * y));
}

which allows the multiply by constant to be CSE'd, producing:

_foo:
        mulsd   LCPI1_0(%rip), %xmm1
        movapd  %xmm1, %xmm2
        addsd   %xmm0, %xmm2
        subsd   %xmm1, %xmm0
        mulsd   %xmm2, %xmm0
        ret

This doesn't need -ffast-math support at all.  This is particularly bad because
the llvm-gcc frontend is canonicalizing the latter into the former, but clang
doesn't have this problem.

//===---------------------------------------------------------------------===//

These two functions should generate the same code on big-endian systems:

int g(int *j, int *l) { return memcmp(j, l, 4); }
int h(int *j, int *l) { return *j - *l; }

This could be done in SelectionDAGISel.cpp, along with other special cases,
for 1, 2, 4, and 8 bytes.

//===---------------------------------------------------------------------===//

It would be nice to revert this patch:
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20060213/031986.html

and teach the dag combiner enough to simplify the code expanded before
legalize.  It seems plausible that this knowledge would let it simplify other
stuff too.

//===---------------------------------------------------------------------===//

For vector types, TargetData.cpp::getTypeInfo() returns alignment that is equal
to the type size.  It works but can be overly conservative, as the alignment of
specific vector types is target dependent.

//===---------------------------------------------------------------------===//

We should produce an unaligned load from code like this:

v4sf example(float *P) {
  return (v4sf){P[0], P[1], P[2], P[3] };
}

//===---------------------------------------------------------------------===//

Add support for conditional increments, and other related patterns.  Instead
of:

        movl 136(%esp), %eax
        cmpl $0, %eax
        je LBB16_2      #cond_next
LBB16_1:        #cond_true
        incl _foo
LBB16_2:        #cond_next

emit:
        movl    _foo, %eax
        cmpl    $1, %edi
        sbbl    $-1, %eax
        movl    %eax, _foo

//===---------------------------------------------------------------------===//

Combine: a = sin(x), b = cos(x) into a,b = sincos(x).

Expand these to calls of sin/cos and stores:
      double sincos(double x, double *sin, double *cos);
      float sincosf(float x, float *sin, float *cos);
      long double sincosl(long double x, long double *sin, long double *cos);

Doing so could allow SROA of the destination pointers.  See also:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=17687

This is now easily doable with MRVs.  We could even make an intrinsic for this
if anyone cared enough about sincos.

//===---------------------------------------------------------------------===//

quantum_sigma_x in 462.libquantum contains the following loop:

      for(i=0; i<reg->size; i++)
        {
          /* Flip the target bit of each basis state */
          reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
        }

Where MAX_UNSIGNED/state is a 64-bit int.  On a 32-bit platform it would be
just so cool to turn it into something like:

   long long Res = ((MAX_UNSIGNED) 1 << target);
   if (target < 32) {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFFULL;
   } else {
     for(i=0; i<reg->size; i++)
       reg->node[i].state ^= Res & 0xFFFFFFFF00000000ULL;
   }

... which would only do one 32-bit XOR per loop iteration instead of two.

It would also be nice to recognize that reg->size doesn't alias reg->node[i],
but this requires TBAA.

//===---------------------------------------------------------------------===//

This isn't recognized as bswap by instcombine (yes, it really is bswap):

unsigned long reverse(unsigned v) {
    unsigned t;
    t = v ^ ((v << 16) | (v >> 16));
    t &= ~0xff0000;
    v = (v << 24) | (v >> 8);
    return v ^ (t >> 8);
}

//===---------------------------------------------------------------------===//

[LOOP DELETION]

We don't delete this output-free loop, because trip count analysis doesn't
realize that it is finite (if it were infinite, it would be undefined).  Not
having this blocks Loop Idiom from matching strlen and friends.

void foo(char *C) {
  int x = 0;
  while (*C)
    ++x, ++C;
}

//===---------------------------------------------------------------------===//

[LOOP RECOGNITION]

These idioms should be recognized as popcount (see PR1488):

unsigned countbits_slow(unsigned v) {
  unsigned c;
  for (c = 0; v; v >>= 1)
    c += v & 1;
  return c;
}
unsigned countbits_fast(unsigned v){
  unsigned c;
  for (c = 0; v; c++)
    v &= v - 1; // clear the least significant bit set
  return c;
}

BITBOARD = unsigned long long
int PopCnt(register BITBOARD a) {
  register int c=0;
  while(a) {
    c++;
    a &= a - 1;
  }
  return c;
}
unsigned int popcount(unsigned int input) {
  unsigned int count = 0;
  for (unsigned int i = 0; i < 4 * 8; i++)
    count += (input >> i) & i;
  return count;
}

This should be recognized as CLZ:  rdar://8459039

unsigned clz_a(unsigned a) {
  int i;
  for (i=0;i<32;i++)
    if (a & (1<<(31-i)))
      return i;
  return 32;
}

This sort of thing should be added to the loop idiom pass.

//===---------------------------------------------------------------------===//

These should turn into single 16-bit (unaligned?) loads on little/big endian
processors.

unsigned short read_16_le(const unsigned char *adr) {
  return adr[0] | (adr[1] << 8);
}
unsigned short read_16_be(const unsigned char *adr) {
  return (adr[0] << 8) | adr[1];
}

//===---------------------------------------------------------------------===//

-instcombine should handle this transform:
   icmp pred (sdiv X / C1 ), C2
when X, C1, and C2 are unsigned.  Similarly for udiv and signed operands.

Currently InstCombine avoids this transform but will do it when the signs of
the operands and the sign of the divide match.  See the FIXME in
InstructionCombining.cpp in the visitSetCondInst method after the switch case
for Instruction::UDiv (around line 4447) for more details.

The SingleSource/Benchmarks/Shootout-C++/hash and hash2 tests have examples of
this construct.

//===---------------------------------------------------------------------===//

[LOOP OPTIMIZATION]

SingleSource/Benchmarks/Misc/dt.c shows several interesting optimization
opportunities in its double_array_divs_variable function: it needs loop
interchange, memory promotion (which LICM already does), vectorization and
variable trip count loop unrolling (since it has a constant trip count). ICC
apparently produces this very nice code with -ffast-math:

..B1.70:                        # Preds ..B1.70 ..B1.69
        mulpd     %xmm0, %xmm1                    #108.2
        mulpd     %xmm0, %xmm1                    #108.2
        mulpd     %xmm0, %xmm1                    #108.2
        mulpd     %xmm0, %xmm1                    #108.2
        addl      $8, %edx                        #
        cmpl      $131072, %edx                   #108.2
        jb        ..B1.70       # Prob 99%        #108.2

It would be better to count down to zero, but this is a lot better than what we
do.

//===---------------------------------------------------------------------===//

Consider:

typedef unsigned U32;
typedef unsigned long long U64;
int test (U32 *inst, U64 *regs) {
    U64 effective_addr2;
    U32 temp = *inst;
    int r1 = (temp >> 20) & 0xf;
    int b2 = (temp >> 16) & 0xf;
    effective_addr2 = temp & 0xfff;
    if (b2) effective_addr2 += regs[b2];
    b2 = (temp >> 12) & 0xf;
    if (b2) effective_addr2 += regs[b2];
    effective_addr2 &= regs[4];
    if ((effective_addr2 & 3) == 0)
        return 1;
    return 0;
}

Note that only the low 2 bits of effective_addr2 are used.  On 32-bit systems,
we don't eliminate the computation of the top half of effective_addr2 because
we don't have whole-function selection dags.  On x86, this means we use one
extra register for the function when effective_addr2 is declared as U64 than
when it is declared U32.

PHI Slicing could be extended to do this.

//===---------------------------------------------------------------------===//

Tail call elim should be more aggressive, checking to see if the call is
followed by an uncond branch to an exit block.

; This testcase is due to tail-duplication not wanting to copy the return
; instruction into the terminating blocks because there was other code
; optimized out of the function after the taildup happened.
; RUN: llvm-as < %s | opt -tailcallelim | llvm-dis | not grep call

define i32 @t4(i32 %a) {
entry:
  %tmp.1 = and i32 %a, 1                          ; <i32> [#uses=1]
  %tmp.2 = icmp ne i32 %tmp.1, 0                  ; <i1> [#uses=1]
  br i1 %tmp.2, label %then.0, label %else.0

then.0:                                           ; preds = %entry
  %tmp.5 = add i32 %a, -1                         ; <i32> [#uses=1]
  %tmp.3 = call i32 @t4( i32 %tmp.5 )             ; <i32> [#uses=1]
  br label %return

else.0:                                           ; preds = %entry
  %tmp.7 = icmp ne i32 %a, 0                      ; <i1> [#uses=1]
  br i1 %tmp.7, label %then.1, label %return

then.1:                                           ; preds = %else.0
  %tmp.11 = add i32 %a, -2                        ; <i32> [#uses=1]
  %tmp.9 = call i32 @t4( i32 %tmp.11 )            ; <i32> [#uses=1]
  br label %return

return:                                           ; preds = %then.1, %else.0, %then.0
  %result.0 = phi i32 [ 0, %else.0 ], [ %tmp.3, %then.0 ],
                      [ %tmp.9, %then.1 ]
  ret i32 %result.0
}

//===---------------------------------------------------------------------===//

Tail recursion elimination should handle:

int pow2m1(int n) {
 if (n == 0)
   return 0;
 return 2 * pow2m1 (n - 1) + 1;
}

Also, multiplies can be turned into SHL's, so they should be handled as if
they were associative.  "return foo() << 1" can be tail recursion eliminated.

//===---------------------------------------------------------------------===//

Argument promotion should promote arguments for recursive functions, like
this:

; RUN: llvm-as < %s | opt -argpromotion | llvm-dis | grep x.val

define internal i32 @foo(i32* %x) {
entry:
  %tmp = load i32* %x                             ; <i32> [#uses=0]
  %tmp.foo = call i32 @foo( i32* %x )             ; <i32> [#uses=1]
  ret i32 %tmp.foo
}

define i32 @bar(i32* %x) {
entry:
  %tmp3 = call i32 @foo( i32* %x )                ; <i32> [#uses=1]
  ret i32 %tmp3
}

//===---------------------------------------------------------------------===//

We should investigate an instruction sinking pass.  Consider this silly
example in pic mode:

#include <assert.h>
void foo(int x) {
  assert(x);
  //...
}

we compile this to:
_foo:
        subl    $28, %esp
        call    "L1$pb"
"L1$pb":
        popl    %eax
        cmpl    $0, 32(%esp)
        je      LBB1_2  # cond_true
LBB1_1: # return
        # ...
        addl    $28, %esp
        ret
LBB1_2: # cond_true
...

The PIC base computation (call+popl) is only used on one path through the
code, but is currently always computed in the entry block.  It would be
better to sink the picbase computation down into the block for the
assertion, as it is the only one that uses it.  This happens for a lot of
code with early outs.

Another example is loads of arguments, which are usually emitted into the
entry block on targets like x86.  If not used in all paths through a
function, they should be sunk into the ones that do.

In this case, whole-function-isel would also handle this.

//===---------------------------------------------------------------------===//

Investigate lowering of sparse switch statements into perfect hash tables:
http://burtleburtle.net/bob/hash/perfect.html

//===---------------------------------------------------------------------===//

We should turn things like "load+fabs+store" and "load+fneg+store" into the
corresponding integer operations.
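
For the fneg case on a double, the integer operation is just a sign-bit flip.
A hand-written sketch of the idea (the @negate_inplace name is made up for
illustration; this is not the output of any current pass):

define void @negate_inplace(double* %P) {
  ; Reinterpret the memory as i64 and flip the sign bit instead of using FP ops.
  %ip = bitcast double* %P to i64*
  %v = load i64* %ip
  %neg = xor i64 %v, -9223372036854775808         ; 1 << 63, the sign bit
  store i64 %neg, i64* %ip
  ret void
}

For fabs, the xor would instead be an 'and' that clears the sign bit.
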
On a yonah, this loop:

double a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] = -a[i];
}

is twice as slow as this loop:

long long a[256];
void foo() {
  int i, b;
  for (b = 0; b < 10000000; b++)
    for (i = 0; i < 256; i++)
      a[i] ^= (1ULL << 63);
}

and I suspect other processors are similar.  On X86 in particular this is a
big win because doing this with integers allows the use of read/modify/write
instructions.

//===---------------------------------------------------------------------===//

DAG Combiner should try to combine small loads into larger loads when
profitable.  For example, we compile this C++ example:

struct THotKey { short Key; bool Control; bool Shift; bool Alt; };
extern THotKey m_HotKey;
THotKey GetHotKey () { return m_HotKey; }

into (-m64 -O3 -fno-exceptions -static -fomit-frame-pointer):

__Z9GetHotKeyv:                         ## @_Z9GetHotKeyv
        movq    _m_HotKey@GOTPCREL(%rip), %rax
        movzwl  (%rax), %ecx
        movzbl  2(%rax), %edx
        shlq    $16, %rdx
        orq     %rcx, %rdx
        movzbl  3(%rax), %ecx
        shlq    $24, %rcx
        orq     %rdx, %rcx
        movzbl  4(%rax), %eax
        shlq    $32, %rax
        orq     %rcx, %rax
        ret

//===---------------------------------------------------------------------===//

We should add an FRINT node to the DAG to model targets that have legal
implementations of ceil/floor/rint.

//===---------------------------------------------------------------------===//

Consider:

int test() {
  long long input[8] = {1,0,1,0,1,0,1,0};
  foo(input);
}

Clang compiles this into:

  call void @llvm.memset.p0i8.i64(i8* %tmp, i8 0, i64 64, i32 16, i1 false)
  %0 = getelementptr [8 x i64]* %input, i64 0, i64 0
  store i64 1, i64* %0, align 16
  %1 = getelementptr [8 x i64]* %input, i64 0, i64 2
  store i64 1, i64* %1, align 16
  %2 = getelementptr [8 x i64]* %input, i64 0, i64 4
  store i64 1, i64* %2, align 16
  %3 = getelementptr [8 x i64]* %input, i64 0, i64 6
  store i64 1, i64* %3, align 16

Which gets codegen'd into:

        pxor    %xmm0, %xmm0
        movaps  %xmm0, -16(%rbp)
        movaps  %xmm0, -32(%rbp)
        movaps  %xmm0, -48(%rbp)
        movaps  %xmm0, -64(%rbp)
        movq    $1, -64(%rbp)
        movq    $1, -48(%rbp)
        movq    $1, -32(%rbp)
        movq    $1, -16(%rbp)

It would be better to have 4 movq's of 0 instead of the movaps's.

//===---------------------------------------------------------------------===//

http://llvm.org/PR717:

The following code should compile into "ret int undef".  Instead, LLVM
produces "ret int 0":

int f() {
  int x = 4;
  int y;
  if (x == 3) y = 0;
  return y;
}

//===---------------------------------------------------------------------===//

The loop unroller should partially unroll loops (instead of peeling them)
when code growth isn't too bad and when an unroll count allows simplification
of some code within the loop.  One trivial example is:

#include <stdio.h>
int main() {
  int nRet = 17;
  int nLoop;
  for ( nLoop = 0; nLoop < 1000; nLoop++ ) {
    if ( nLoop & 1 )
      nRet += 2;
    else
      nRet -= 1;
  }
  return nRet;
}

Unrolling by 2 would eliminate the '&1' in both copies, leading to a net
reduction in code size.  The resultant code would then also be suitable for
exit value computation.
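
For illustration, a hand-written sketch of what unrolling by 2 and folding the
now-constant '&1' test in each copy would leave behind (not the output of any
current pass):

int main() {
  int nRet = 17;
  int nLoop;
  /* Each unrolled pair is one even iteration (nRet -= 1) followed by one odd
     iteration (nRet += 2), so the pair folds to a net nRet += 1. */
  for ( nLoop = 0; nLoop < 1000; nLoop += 2 )
    nRet += 1;
  return nRet;
}

With the body reduced to a constant increment, computing the exit value
(17 + 500) directly is then straightforward.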

//===---------------------------------------------------------------------===//

We miss a bunch of rotate opportunities on various targets, including ppc, x86,
etc.  On X86, we miss a bunch of 'rotate by variable' cases because the rotate
matching code in dag combine doesn't look through truncates aggressively
enough.  Here are some test cases reduced from GCC PR17886:

unsigned long long f5(unsigned long long x, unsigned long long y) {
  return (x << 8) | ((y >> 48) & 0xffull);
}
unsigned long long f6(unsigned long long x, unsigned long long y, int z) {
  switch(z) {
  case 1:
    return (x << 8) | ((y >> 48) & 0xffull);
  case 2:
    return (x << 16) | ((y >> 40) & 0xffffull);
  case 3:
    return (x << 24) | ((y >> 32) & 0xffffffull);
  case 4:
    return (x << 32) | ((y >> 24) & 0xffffffffull);
  default:
    return (x << 40) | ((y >> 16) & 0xffffffffffull);
  }
}

//===---------------------------------------------------------------------===//

This (and similar related idioms):

unsigned int foo(unsigned char i) {
  return i | (i<<8) | (i<<16) | (i<<24);
}

compiles into:

define i32 @foo(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %shl5 = shl i32 %conv, 16
  %shl9 = shl i32 %conv, 24
  %or = or i32 %shl9, %conv
  %or6 = or i32 %or, %shl5
  %or10 = or i32 %or6, %shl
  ret i32 %or10
}

it would be better as:

unsigned int bar(unsigned char i) {
  unsigned int j = i | (i << 8);
  return j | (j << 16);
}

aka:

define i32 @bar(i8 zeroext %i) nounwind readnone ssp noredzone {
entry:
  %conv = zext i8 %i to i32
  %shl = shl i32 %conv, 8
  %or = or i32 %shl, %conv
  %shl5 = shl i32 %or, 16
  %or6 = or i32 %shl5, %or
  ret i32 %or6
}

or even i*0x01010101, depending on the speed of the multiplier.  The best way
to handle this is to canonicalize it to a multiply in IR and have codegen
handle lowering multiplies to shifts on cpus where shifts are faster.

//===---------------------------------------------------------------------===//

We do a number of simplifications in simplify libcalls to strength reduce
standard library functions, but we don't currently merge them together.  For
example, it is useful to merge memcpy(a,b,strlen(b)) -> strcpy.  This can only
be done safely if "b" isn't modified between the strlen and memcpy, of course.

//===---------------------------------------------------------------------===//

We compile this program (from GCC PR11680):
http://gcc.gnu.org/bugzilla/attachment.cgi?id=4487

into code that runs the same speed in fast/slow modes, but both modes run 2x
slower than when compiled with GCC (either 4.0 or 4.2):

$ llvm-g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
1.821u 0.003s 0:01.82 100.0%    0+0k 0+0io 0pf+0w

$ g++ perf.cpp -O3 -fno-exceptions
$ time ./a.out fast
0.821u 0.001s 0:00.82 100.0%    0+0k 0+0io 0pf+0w

It looks like we are making the same inlining decisions, so this may be raw
codegen badness or something else (haven't investigated).

//===---------------------------------------------------------------------===//

Divisibility by constant can be simplified (according to GCC PR12849) from
being a mulhi to being a mul lo (cheaper).
Testcase:

void bar(unsigned n) {
  if (n % 3 == 0)
    true();
}

This is equivalent to the following, where 2863311531 is the multiplicative
inverse of 3, and 1431655766 is ((2^32)-1)/3+1:
void bar(unsigned n) {
  if (n * 2863311531U < 1431655766U)
    true();
}

The same transformation can work with an even modulo with the addition of a
rotate: rotate the result of the multiply to the right by the number of bits
which need to be zero for the condition to be true, and shrink the compare RHS
by the same amount.  Unless the target supports rotates, though, that
transformation probably isn't worthwhile.

The transformation can also easily be made to work with non-zero equality
comparisons: just transform, for example, "n % 3 == 1" to "(n-1) % 3 == 0".

//===---------------------------------------------------------------------===//

Better mod/ref analysis for scanf would allow us to eliminate the vtable and a
bunch of other stuff from this example (see PR1604):

#include <cstdio>
struct test {
  int val;
  virtual ~test() {}
};

int main() {
  test t;
  std::scanf("%d", &t.val);
  std::printf("%d\n", t.val);
}

//===---------------------------------------------------------------------===//

These functions perform the same computation, but produce different assembly.

define i8 @select(i8 %x) readnone nounwind {
  %A = icmp ult i8 %x, 250
  %B = select i1 %A, i8 0, i8 1
  ret i8 %B
}

define i8 @addshr(i8 %x) readnone nounwind {
  %A = zext i8 %x to i9
  %B = add i9 %A, 6       ;; 256 - 250 == 6
  %C = lshr i9 %B, 8
  %D = trunc i9 %C to i8
  ret i8 %D
}

//===---------------------------------------------------------------------===//

From gcc bug 24696:
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) || ((b & (c - 1)) != 0);
}
int
f (unsigned long a, unsigned long b, unsigned long c)
{
  return ((a & (c - 1)) != 0) | ((b & (c - 1)) != 0);
}
Both should combine to ((a|b) & (c-1)) != 0.  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 20192:
#define PMD_MASK    (~((1UL << 23) - 1))
void clear_pmd_range(unsigned long start, unsigned long end)
{
   if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK))
      f();
}
The expression should optimize to something like
"!((start|end)&~PMD_MASK)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned int f(unsigned int i, unsigned int n) {++i; if (i == n) ++i; return i;}
unsigned int f2(unsigned int i, unsigned int n) {++i; i += i == n; return i;}
These should combine to the same thing.  Currently, the first function
produces better code on X86.

//===---------------------------------------------------------------------===//

From GCC Bug 15784:
#define abs(x) x>0?x:-x
int f(int x, int y)
{
 return (abs(x)) >= 0;
}
This should optimize to x == INT_MIN.  (With -fwrapv.)  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 14753:
void
rotate_cst (unsigned int a)
{
  a = (a << 10) | (a >> 22);
  if (a == 123)
    bar ();
}
void
minus_cst (unsigned int a)
{
  unsigned int tem;

  tem = 20 - a;
  if (tem == 5)
    bar ();
}
void
mask_gt (unsigned int a)
{
  /* This is equivalent to a > 15.  */
  if ((a & ~7) > 8)
    bar ();
}
void
rshift_gt (unsigned int a)
{
  /* This is equivalent to a > 23.  */
  if ((a >> 2) > 5)
    bar ();
}

All should simplify to a single comparison.  All of these are
currently not optimized with "clang -emit-llvm-bc | opt
-std-compile-opts".

//===---------------------------------------------------------------------===//

From GCC Bug 32605:
int c(int* x) {return (char*)x+2 == (char*)x;}
Should combine to 0.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts" (although llc can optimize it).

//===---------------------------------------------------------------------===//

int a(unsigned b) {return ((b << 31) | (b << 30)) >> 31;}
Should be combined to "((b >> 1) | b) & 1".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned x, unsigned y) { return x | (y & 1) | (y & 2);}
Should combine to "x | (y & 3)".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (~a & c) | ((c|a) & b);}
Should fold to "(~a & c) | (a & b)".  Currently not optimized with
"clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a,int b) {return (~(a|b))|a;}
Should fold to "a|~b".  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b) {return (a&&b) || (a&&!b);}
Should fold to "a".  Currently not optimized with "clang -emit-llvm-bc
| opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (!a&&c);}
Should fold to "a ? b : c", or at least something sane.  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int a, int b, int c) {return (a&&b) || (a&&c) || (a&&b&&c);}
Should fold to a && (b || c).  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x | ((x & 8) ^ 8);}
Should combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return x ^ ((x & 8) ^ 8);}
Should also combine to x | 8.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

int a(int x) {return ((x | -9) ^ 8) & x;}
Should combine to x & -9.  Currently not optimized with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned a) {return a * 0x11111111 >> 28 & 1;}
Should combine to "a * 0x88888888 >> 31".  Currently not optimized
with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(char* x) {if ((*x & 32) == 0) return b();}
There's an unnecessary zext in the generated code with "clang
-emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

unsigned a(unsigned long long x) {return 40 * (x >> 1);}
Should combine to "20 * (((unsigned)x) & -2)".  Currently not
optimized with "clang -emit-llvm-bc | opt -std-compile-opts".

//===---------------------------------------------------------------------===//

This was noticed in the entryblock for grokdeclarator in 403.gcc:

  %tmp = icmp eq i32 %decl_context, 4
  %decl_context_addr.0 = select i1 %tmp, i32 3, i32 %decl_context
  %tmp1 = icmp eq i32 %decl_context_addr.0, 1
  %decl_context_addr.1 = select i1 %tmp1, i32 0, i32 %decl_context_addr.0

tmp1 should be simplified to something like:
  (!tmp || decl_context == 1)

This allows recursive simplifications; tmp1 is used all over the place in
the function, e.g. by:

  %tmp23 = icmp eq i32 %decl_context_addr.1, 0    ; <i1> [#uses=1]
  %tmp24 = xor i1 %tmp1, true                     ; <i1> [#uses=1]
  %or.cond8 = and i1 %tmp23, %tmp24               ; <i1> [#uses=1]

later.

//===---------------------------------------------------------------------===//

[STORE SINKING]

Store sinking: This code:

void f (int n, int *cond, int *res) {
    int i;
    *res = 0;
    for (i = 0; i < n; i++)
        if (*cond)
            *res ^= 234; /* (*) */
}

On this function GVN hoists the fully redundant value of *res, but nothing
moves the store out.  This gives us this code:

bb:             ; preds = %bb2, %entry
  %.rle = phi i32 [ 0, %entry ], [ %.rle6, %bb2 ]
  %i.05 = phi i32 [ 0, %entry ], [ %indvar.next, %bb2 ]
  %1 = load i32* %cond, align 4
  %2 = icmp eq i32 %1, 0
  br i1 %2, label %bb2, label %bb1

bb1:            ; preds = %bb
  %3 = xor i32 %.rle, 234
  store i32 %3, i32* %res, align 4
  br label %bb2

bb2:            ; preds = %bb, %bb1
  %.rle6 = phi i32 [ %3, %bb1 ], [ %.rle, %bb ]
  %indvar.next = add i32 %i.05, 1
  %exitcond = icmp eq i32 %indvar.next, %n
  br i1 %exitcond, label %return, label %bb

DSE should sink partially dead stores to get the store out of the loop.

Here's another partially dead case:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=12395

//===---------------------------------------------------------------------===//

Scalar PRE hoists the mul in the common block up to the else:

int test (int a, int b, int c, int g) {
  int d, e;
  if (a)
    d = b * c;
  else
    d = b - c;
  e = b * c + g;
  return d + e;
}

It would be better to do the mul once to reduce codesize above the if.
This is GCC PR38204.
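
A hand-written sketch of the preferred shape, just to show where the single
multiply would live (this is not compiler output):

int test (int a, int b, int c, int g) {
  int d, e;
  int bc = b * c;        /* one multiply, above the if */
  if (a)
    d = bc;
  else
    d = b - c;
  e = bc + g;
  return d + e;
}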

//===---------------------------------------------------------------------===//

This simple function from 179.art:

int winner, numf2s;
struct { double y; int reset; } *Y;

void find_match() {
   int i;
   winner = 0;
   for (i=0;i<numf2s;i++)
       if (Y[i].y > Y[winner].y)
              winner = i;
}

Compiles into (with clang TBAA):

for.body:                                         ; preds = %for.inc, %bb.nph
  %indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.inc ]
  %i.01718 = phi i32 [ 0, %bb.nph ], [ %i.01719, %for.inc ]
  %tmp4 = getelementptr inbounds %struct.anon* %tmp3, i64 %indvar, i32 0
  %tmp5 = load double* %tmp4, align 8, !tbaa !4
  %idxprom7 = sext i32 %i.01718 to i64
  %tmp10 = getelementptr inbounds %struct.anon* %tmp3, i64 %idxprom7, i32 0
  %tmp11 = load double* %tmp10, align 8, !tbaa !4
  %cmp12 = fcmp ogt double %tmp5, %tmp11
  br i1 %cmp12, label %if.then, label %for.inc

if.then:                                          ; preds = %for.body
  %i.017 = trunc i64 %indvar to i32
  br label %for.inc

for.inc:                                          ; preds = %for.body, %if.then
  %i.01719 = phi i32 [ %i.01718, %for.body ], [ %i.017, %if.then ]
  %indvar.next = add i64 %indvar, 1
  %exitcond = icmp eq i64 %indvar.next, %tmp22
  br i1 %exitcond, label %for.cond.for.end_crit_edge, label %for.body

It is good that we hoisted the reloads of numf2s and Y out of the loop and
sunk the store to winner out.

However, this is awful on several levels: the conditional truncate in the loop
(-indvars at fault? why can't we completely promote the IV to i64?).

Beyond that, we have a partially redundant load in the loop: if "winner" (aka
%i.01718) isn't updated, we reload Y[winner].y the next time through the loop.
Similarly, the addressing that feeds it (including the sext) is redundant.  In
the end we get this generated assembly:

LBB0_2:                                 ## %for.body
                                        ## =>This Inner Loop Header: Depth=1
        movsd   (%rdi), %xmm0
        movslq  %edx, %r8
        shlq    $4, %r8
        ucomisd (%rcx,%r8), %xmm0
        jbe     LBB0_4
        movl    %esi, %edx
LBB0_4:                                 ## %for.inc
        addq    $16, %rdi
        incq    %rsi
        cmpq    %rsi, %rax
        jne     LBB0_2

All things considered this isn't too bad, but we shouldn't need the movslq or
the shlq instruction, or the load folded into ucomisd every time through the
loop.

On an x86-specific topic, if the loop can't be restructured, the movl should
be a cmov.

//===---------------------------------------------------------------------===//

[STORE SINKING]

GCC PR37810 is an interesting case where we should sink the load/store reload
into the if block and outside the loop, so we don't reload/store it on the
non-call path.

for () {
  *P += 1;
  if ()
    call();
  else
    ...
->
tmp = *P
for () {
  tmp += 1;
  if () {
    *P = tmp;
    call();
    tmp = *P;
  } else ...
}
*P = tmp;

We now hoist the reload after the call (Transforms/GVN/lpre-call-wrap.ll), but
we don't sink the store.  We need partially dead store sinking.

//===---------------------------------------------------------------------===//

[LOAD PRE CRIT EDGE SPLITTING]

GCC PR37166: Sinking of loads prevents SROA'ing the "g" struct on the stack,
leading to excess stack traffic.  This could be handled by GVN with some crazy
symbolic phi translation.
The code we get looks like (g is on the stack):

bb2:            ; preds = %bb1
..
  %9 = getelementptr %struct.f* %g, i32 0, i32 0
  store i32 %8, i32* %9, align 4
  br label %bb3

bb3:            ; preds = %bb1, %bb2, %bb
  %c_addr.0 = phi %struct.f* [ %g, %bb2 ], [ %c, %bb ], [ %c, %bb1 ]
  %b_addr.0 = phi %struct.f* [ %b, %bb2 ], [ %g, %bb ], [ %b, %bb1 ]
  %10 = getelementptr %struct.f* %c_addr.0, i32 0, i32 0
  %11 = load i32* %10, align 4

%11 is partially redundant, and in BB2 it should have the value %8.

GCC PR33344 and PR35287 are similar cases.

//===---------------------------------------------------------------------===//

[LOAD PRE]

There are many load PRE testcases in testsuite/gcc.dg/tree-ssa/loadpre* in the
GCC testsuite; the ones we don't get yet (checked through loadpre25) are:

[CRIT EDGE BREAKING]
loadpre3.c predcom-4.c

[PRE OF READONLY CALL]
loadpre5.c

[TURN SELECT INTO BRANCH]
loadpre14.c loadpre15.c

actually a conditional increment: loadpre18.c loadpre19.c

//===---------------------------------------------------------------------===//

[LOAD PRE / STORE SINKING / SPEC HACK]

This is a chunk of code from 456.hmmer:

int f(int M, int *mc, int *mpp, int *tpmm, int *ip, int *tpim, int *dpp,
      int *tpdm, int xmb, int *bp, int *ms) {
 int k, sc;
 for (k = 1; k <= M; k++) {
     mc[k] = mpp[k-1]   + tpmm[k-1];
     if ((sc = ip[k-1]  + tpim[k-1]) > mc[k])  mc[k] = sc;
     if ((sc = dpp[k-1] + tpdm[k-1]) > mc[k])  mc[k] = sc;
     if ((sc = xmb  + bp[k])         > mc[k])  mc[k] = sc;
     mc[k] += ms[k];
 }
}

It is very profitable for this benchmark to turn the conditional stores to mc[k]
into a conditional move (select instr in IR) and allow the final store to do the
store.  See GCC PR27313 for more details.  Note that this is valid to xform even
with the new C++ memory model, since mc[k] is previously loaded and later
stored.

//===---------------------------------------------------------------------===//

[SCALAR PRE]
There are many PRE testcases in testsuite/gcc.dg/tree-ssa/ssa-pre-*.c in the
GCC testsuite.

//===---------------------------------------------------------------------===//

There are some interesting cases in testsuite/gcc.dg/tree-ssa/pred-comm* in the
GCC testsuite.  For example, we get the first example in predcom-1.c, but
miss the second one:

unsigned fib[1000];
unsigned avg[1000];

__attribute__ ((noinline))
void count_averages(int n) {
  int i;
  for (i = 1; i < n; i++)
    avg[i] = (((unsigned long) fib[i - 1] + fib[i] + fib[i + 1]) / 3) & 0xffff;
}

which compiles into two loads instead of one in the loop.

predcom-2.c is the same as predcom-1.c

predcom-3.c is very similar but needs loads feeding each other instead of
store->load.

//===---------------------------------------------------------------------===//

[ALIAS ANALYSIS]

Type based alias analysis:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=14705

We should do better analysis of posix_memalign.  At the least it should
no-capture its pointer argument; at best, we should know that the out-value
result doesn't point to anything (like malloc).
One example of this is in SingleSource/Benchmarks/Misc/dt.c.

//===---------------------------------------------------------------------===//

Interesting missed case because of control flow flattening (should be 2 loads):
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26629
With: llvm-gcc t2.c -S -o - -O0 -emit-llvm | llvm-as |
      opt -mem2reg -gvn -instcombine | llvm-dis
we miss it because we need 1) CRIT EDGE 2) MULTIPLE DIFFERENT
VALS PRODUCED BY ONE BLOCK OVER DIFFERENT PATHS

//===---------------------------------------------------------------------===//

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19633
We could eliminate the branch condition here, since loading from null is
undefined:

struct S { int w, x, y, z; };
struct T { int r; struct S s; };
void bar (struct S, int);
void foo (int a, struct T b)
{
  struct S *c = 0;
  if (a)
    c = &b.s;
  bar (*c, a);
}

//===---------------------------------------------------------------------===//

simplifylibcalls should do several optimizations for strspn/strcspn:

strcspn(x, "a") -> inlined loop for up to 3 letters (similarly for strspn):

size_t __strcspn_c3 (__const char *__s, int __reject1, int __reject2,
                     int __reject3) {
  register size_t __result = 0;
  while (__s[__result] != '\0' && __s[__result] != __reject1 &&
         __s[__result] != __reject2 && __s[__result] != __reject3)
    ++__result;
  return __result;
}

This should turn into a switch on the character.  See PR3253 for some notes on
codegen.

456.hmmer apparently uses strcspn and strspn a lot.  471.omnetpp uses strspn.

//===---------------------------------------------------------------------===//

simplifylibcalls should turn these snprintf idioms into memcpy (GCC PR47917):

char buf1[6], buf2[6], buf3[4], buf4[4];
int i;

int foo (void) {
  int ret = snprintf (buf1, sizeof buf1, "abcde");
  ret += snprintf (buf2, sizeof buf2, "abcdef") * 16;
  ret += snprintf (buf3, sizeof buf3, "%s", i++ < 6 ? "abc" : "def") * 256;
  ret += snprintf (buf4, sizeof buf4, "%s", i++ > 10 ? "abcde" : "defgh")*4096;
  return ret;
}

//===---------------------------------------------------------------------===//

"gas" uses this idiom:
  else if (strchr ("+-/*%|&^:[]()~", *intel_parser.op_string))
..
  else if (strchr ("<>", *intel_parser.op_string)

Those should be turned into a switch.

//===---------------------------------------------------------------------===//

252.eon contains this interesting code:

        %3072 = getelementptr [100 x i8]* %tempString, i32 0, i32 0
        %3073 = call i8* @strcpy(i8* %3072, i8* %3071) nounwind
        %strlen = call i32 @strlen(i8* %3072)    ; uses = 1
        %endptr = getelementptr [100 x i8]* %tempString, i32 0, i32 %strlen
        call void @llvm.memcpy.i32(i8* %endptr,
          i8* getelementptr ([5 x i8]* @"\01LC42", i32 0, i32 0), i32 5, i32 1)
        %3074 = call i32 @strlen(i8* %endptr) nounwind readonly

This is interesting for a couple of reasons.  First, in this code, the
memcpy+strlen strlen can be replaced with:

        %3074 = call i32 @strlen([5 x i8]* @"\01LC42") nounwind readonly

because the destination was just copied into the specified memory buffer.
This, in turn, can be constant folded to "4".

In other code, it contains:

        %endptr6978 = bitcast i8* %endptr69 to i32*
        store i32 7107374, i32* %endptr6978, align 1
        %3167 = call i32 @strlen(i8* %endptr69) nounwind readonly

Which could also be constant folded.  Whatever is producing this should
probably be fixed to leave this as a memcpy from a string.

Further, eon also has an interesting partially redundant strlen call:

bb8:            ; preds = %_ZN18eonImageCalculatorC1Ev.exit
  %682 = getelementptr i8** %argv, i32 6          ; <i8**> [#uses=2]
  %683 = load i8** %682, align 4                  ; <i8*> [#uses=4]
  %684 = load i8* %683, align 1                   ; <i8> [#uses=1]
  %685 = icmp eq i8 %684, 0                       ; <i1> [#uses=1]
  br i1 %685, label %bb10, label %bb9

bb9:            ; preds = %bb8
  %686 = call i32 @strlen(i8* %683) nounwind readonly
  %687 = icmp ugt i32 %686, 254                   ; <i1> [#uses=1]
  br i1 %687, label %bb10, label %bb11

bb10:           ; preds = %bb9, %bb8
  %688 = call i32 @strlen(i8* %683) nounwind readonly

This could be eliminated by doing the strlen once in bb8, saving code size and
improving perf on the bb8->9->10 path.

//===---------------------------------------------------------------------===//

I see an interesting fully redundant call to strlen left in 186.crafty:InputMove
which looks like:
  %movetext11 = getelementptr [128 x i8]* %movetext, i32 0, i32 0

bb62:           ; preds = %bb55, %bb53
  %promote.0 = phi i32 [ %169, %bb55 ], [ 0, %bb53 ]
  %171 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1
  %172 = add i32 %171, -1                         ; <i32> [#uses=1]
  %173 = getelementptr [128 x i8]* %movetext, i32 0, i32 %172

... no stores ...
  br i1 %or.cond, label %bb65, label %bb72

bb65:           ; preds = %bb62
  store i8 0, i8* %173, align 1
  br label %bb72

bb72:           ; preds = %bb65, %bb62
  %trank.1 = phi i32 [ %176, %bb65 ], [ -1, %bb62 ]
  %177 = call i32 @strlen(i8* %movetext11) nounwind readonly align 1

Note that on the bb62->bb72 path, the %177 strlen call is partially redundant
with the %171 call.  At worst, we could shove the %177 strlen call up into the
bb65 block, moving it out of the bb62->bb72 path.  However, note that bb65
stores to the string, zeroing out the last byte.  This means that on that path
the value of %177 is actually just %171-1.  A sub is cheaper than a strlen!

This pattern repeats several times, basically doing:

  A = strlen(P);
  P[A-1] = 0;
  B = strlen(P);
  where it is "obvious" that B = A-1.

//===---------------------------------------------------------------------===//

186.crafty has this interesting pattern with the "out.4543" variable:

call void @llvm.memcpy.i32(
        i8* getelementptr ([10 x i8]* @out.4543, i32 0, i32 0),
        i8* getelementptr ([7 x i8]* @"\01LC28700", i32 0, i32 0), i32 7, i32 1)
%101 = call @printf(i8* ...   @out.4543, i32 0, i32 0)) nounwind

It is basically doing:

  memcpy(globalarray, "string");
  printf(...,  globalarray);

Anyway, by knowing that printf just reads the memory and forward substituting
the string directly into the printf, this eliminates reads from globalarray.
Since this pattern occurs frequently in crafty (due to the "DisplayTime" and
other similar functions) there are many stores to "out".  Once all the printfs
stop using "out", all that is left is the memcpy's into it.
This should allow globalopt to remove the "stored only" global.

//===---------------------------------------------------------------------===//

This code:

define inreg i32 @foo(i8* inreg %p) nounwind {
  %tmp0 = load i8* %p
  %tmp1 = ashr i8 %tmp0, 5
  %tmp2 = sext i8 %tmp1 to i32
  ret i32 %tmp2
}

could be dagcombine'd to a sign-extending load with a shift.
For example, on x86 this currently gets this:

        movb    (%eax), %al
        sarb    $5, %al
        movsbl  %al, %eax

while it could get this:

        movsbl  (%eax), %eax
        sarl    $5, %eax

//===---------------------------------------------------------------------===//

GCC PR31029:

int test(int x) { return 1-x == x; }     // --> return false
int test2(int x) { return 2-x == x; }    // --> return x == 1 ?

Always foldable for odd constants; what is the rule for even?

//===---------------------------------------------------------------------===//

PR 3381: GEP to field of size 0 inside a struct could be turned into GEP
for next field in struct (which is at same address).

For example: store of float into { {{}}, float } could be turned into a store
to the float directly.

//===---------------------------------------------------------------------===//

The arg promotion pass should make use of nocapture to make its alias analysis
stuff much more precise.

//===---------------------------------------------------------------------===//

The following functions should be optimized to use a select instead of a
branch (from gcc PR40072):

char char_int(int m) {if(m>7) return 0; return m;}
int int_char(char m) {if(m>7) return 0; return m;}

//===---------------------------------------------------------------------===//

int func(int a, int b) { if (a & 0x80) b |= 0x80; else b &= ~0x80; return b; }

Generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %a, 128                            ; <i32> [#uses=1]
  %1 = icmp eq i32 %0, 0                          ; <i1> [#uses=1]
  %2 = or i32 %b, 128                             ; <i32> [#uses=1]
  %3 = and i32 %b, -129                           ; <i32> [#uses=1]
  %b_addr.0 = select i1 %1, i32 %3, i32 %2        ; <i32> [#uses=1]
  ret i32 %b_addr.0
}

However, it's functionally equivalent to:

         b = (b & ~0x80) | (a & 0x80);

Which generates this:

define i32 @func(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = and i32 %b, -129                           ; <i32> [#uses=1]
  %1 = and i32 %a, 128                            ; <i32> [#uses=1]
  %2 = or i32 %0, %1                              ; <i32> [#uses=1]
  ret i32 %2
}

This can be generalized for other forms:

     b = (b & ~0x80) | (a & 0x40) << 1;

//===---------------------------------------------------------------------===//

These two functions produce different code.
They shouldn't:

#include <stdint.h>

uint8_t p1(uint8_t b, uint8_t a) {
  b = (b & ~0xc0) | (a & 0xc0);
  return (b);
}

uint8_t p2(uint8_t b, uint8_t a) {
  b = (b & ~0x40) | (a & 0x40);
  b = (b & ~0x80) | (a & 0x80);
  return (b);
}

define zeroext i8 @p1(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %1 = and i8 %a, -64                             ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  ret i8 %2
}

define zeroext i8 @p2(i8 zeroext %b, i8 zeroext %a) nounwind readnone ssp {
entry:
  %0 = and i8 %b, 63                              ; <i8> [#uses=1]
  %.masked = and i8 %a, 64                        ; <i8> [#uses=1]
  %1 = and i8 %a, -128                            ; <i8> [#uses=1]
  %2 = or i8 %1, %0                               ; <i8> [#uses=1]
  %3 = or i8 %2, %.masked                         ; <i8> [#uses=1]
  ret i8 %3
}

//===---------------------------------------------------------------------===//

IPSCCP does not currently propagate argument-dependent constants through
functions where it does not know all of the callers.  This includes functions
with normal external linkage as well as templates, C99 inline functions etc.
Specifically, it does nothing to:

define i32 @test(i32 %x, i32 %y, i32 %z) nounwind {
entry:
  %0 = add nsw i32 %y, %z
  %1 = mul i32 %0, %x
  %2 = mul i32 %y, %z
  %3 = add nsw i32 %1, %2
  ret i32 %3
}

define i32 @test2() nounwind {
entry:
  %0 = call i32 @test(i32 1, i32 2, i32 4) nounwind
  ret i32 %0
}

It would be interesting to extend IPSCCP to be able to handle simple cases like
this, where all of the arguments to a call are constant.  Because IPSCCP runs
before inlining, trivial templates and inline functions are not yet inlined.
The results for a function + set of constant arguments should be memoized in a
map.

//===---------------------------------------------------------------------===//

The libcall constant folding stuff should be moved out of SimplifyLibcalls into
libanalysis' constantfolding logic.  This would allow IPSCCP to be able to
handle simple things like this:

static int foo(const char *X) { return strlen(X); }
int bar() { return foo("abcd"); }

//===---------------------------------------------------------------------===//

functionattrs doesn't know much about memcpy/memset.  This function should be
marked readnone rather than readonly, since it only twiddles local memory, but
functionattrs doesn't handle memset/memcpy/memmove aggressively:

struct X { int *p; int *q; };
int foo() {
  int i = 0, j = 1;
  struct X x, y;
  int **p;
  y.p = &i;
  x.q = &j;
  p = __builtin_memcpy (&x, &y, sizeof (int *));
  return **p;
}

This can be seen at:
$ clang t.c -S -o - -mkernel -O0 -emit-llvm | opt -functionattrs -S

//===---------------------------------------------------------------------===//

Missed instcombine transformation:
define i1 @a(i32 %x) nounwind readnone {
entry:
  %cmp = icmp eq i32 %x, 30
  %sub = add i32 %x, -30
  %cmp2 = icmp ugt i32 %sub, 9
  %or = or i1 %cmp, %cmp2
  ret i1 %or
}
This should be optimized to a single compare.  Testcase derived from gcc.
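
One possible single-compare form, derived by hand (instcombine might pick a
different but equivalent canonical form):

define i1 @a(i32 %x) nounwind readnone {
entry:
  ; The original is false exactly when %x is in [31, 39], i.e. when
  ; (%x - 31) is unsigned-less-than 9.
  %sub = add i32 %x, -31
  %or = icmp ugt i32 %sub, 8
  ret i1 %or
}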

//===---------------------------------------------------------------------===//

Missed instcombine or reassociate transformation:
int a(int a, int b) { return (a==12)&(b>47)&(b<58); }

The sgt and slt should be combined into a single comparison.  Testcase derived
from gcc.

//===---------------------------------------------------------------------===//

Missed instcombine transformation:

  %382 = srem i32 %tmp14.i, 64                    ; [#uses=1]
  %383 = zext i32 %382 to i64                     ; [#uses=1]
  %384 = shl i64 %381, %383                       ; [#uses=1]
  %385 = icmp slt i32 %tmp14.i, 64                ; [#uses=1]

The srem can be transformed to an and because if %tmp14.i is negative, the
shift is undefined.  Testcase derived from 403.gcc.

//===---------------------------------------------------------------------===//

This is a range comparison on a divided result (from 403.gcc):

  %1337 = sdiv i32 %1336, 8                       ; [#uses=1]
  %.off.i208 = add i32 %1336, 7                   ; [#uses=1]
  %1338 = icmp ult i32 %.off.i208, 15             ; [#uses=1]

We already catch this (removing the sdiv) if there isn't an add; we should
handle the 'add' as well.  This is a common idiom with its builtin_alloca code.
C testcase:

int a(int x) { return (unsigned)(x/16+7) < 15; }

Another similar case involves truncations on 64-bit targets:

  %361 = sdiv i64 %.046, 8                        ; [#uses=1]
  %362 = trunc i64 %361 to i32                    ; [#uses=2]
...
  %367 = icmp eq i32 %362, 0                      ; [#uses=1]

//===---------------------------------------------------------------------===//

Missed instcombine/dagcombine transformation:
define void @lshift_lt(i8 zeroext %a) nounwind {
entry:
  %conv = zext i8 %a to i32
  %shl = shl i32 %conv, 3
  %cmp = icmp ult i32 %shl, 33
  br i1 %cmp, label %if.then, label %if.end

if.then:
  tail call void @bar() nounwind
  ret void

if.end:
  ret void
}
declare void @bar() nounwind

The shift should be eliminated.  Testcase derived from gcc.

//===---------------------------------------------------------------------===//

These compile into different code: one gets recognized as a switch and the
other doesn't, due to phase ordering issues (PR6212):

int test1(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  else if (mainType == 9)
    subType = 6;
  else if (mainType == 11)
    subType = 9;
  return subType;
}

int test2(int mainType, int subType) {
  if (mainType == 7)
    subType = 4;
  if (mainType == 9)
    subType = 6;
  if (mainType == 11)
    subType = 9;
  return subType;
}

//===---------------------------------------------------------------------===//

The following test case (from PR6576):

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %cond1 = icmp eq i32 %b, 0                      ; <i1> [#uses=1]
  br i1 %cond1, label %exit, label %bb.nph
bb.nph:                                           ; preds = %entry
  %tmp = mul i32 %b, %a                           ; <i32> [#uses=1]
  ret i32 %tmp
exit:                                             ; preds = %entry
  ret i32 0
}

could be reduced to:

define i32 @mul(i32 %a, i32 %b) nounwind readnone {
entry:
  %tmp = mul i32 %b, %a
  ret i32 %tmp
}

//===---------------------------------------------------------------------===//

We should use DSE + llvm.lifetime.end to delete dead vtable pointer updates.
1732See GCC PR34949 1733 1734Another interesting case is that something related could be used for variables 1735that go const after their ctor has finished. In these cases, globalopt (which 1736can statically run the constructor) could mark the global const (so it gets put 1737in the readonly section). A testcase would be: 1738 1739#include <complex> 1740using namespace std; 1741const complex<char> should_be_in_rodata (42,-42); 1742complex<char> should_be_in_data (42,-42); 1743complex<char> should_be_in_bss; 1744 1745Where we currently evaluate the ctors but the globals don't become const because 1746the optimizer doesn't know they "become const" after the ctor is done. See 1747GCC PR4131 for more examples. 1748 1749//===---------------------------------------------------------------------===// 1750 1751In this code: 1752 1753long foo(long x) { 1754 return x > 1 ? x : 1; 1755} 1756 1757LLVM emits a comparison with 1 instead of 0. 0 would be equivalent 1758and cheaper on most targets. 1759 1760LLVM prefers comparisons with zero over non-zero in general, but in this 1761case it choses instead to keep the max operation obvious. 1762 1763//===---------------------------------------------------------------------===// 1764 1765define void @a(i32 %x) nounwind { 1766entry: 1767 switch i32 %x, label %if.end [ 1768 i32 0, label %if.then 1769 i32 1, label %if.then 1770 i32 2, label %if.then 1771 i32 3, label %if.then 1772 i32 5, label %if.then 1773 ] 1774if.then: 1775 tail call void @foo() nounwind 1776 ret void 1777if.end: 1778 ret void 1779} 1780declare void @foo() 1781 1782Generated code on x86-64 (other platforms give similar results): 1783a: 1784 cmpl $5, %edi 1785 ja LBB2_2 1786 cmpl $4, %edi 1787 jne LBB2_3 1788.LBB0_2: 1789 ret 1790.LBB0_3: 1791 jmp foo # TAILCALL 1792 1793If we wanted to be really clever, we could simplify the whole thing to 1794something like the following, which eliminates a branch: 1795 xorl $1, %edi 1796 cmpl $4, %edi 1797 ja .LBB0_2 1798 ret 1799.LBB0_2: 1800 jmp foo # TAILCALL 1801 1802//===---------------------------------------------------------------------===// 1803 1804We compile this: 1805 1806int foo(int a) { return (a & (~15)) / 16; } 1807 1808Into: 1809 1810define i32 @foo(i32 %a) nounwind readnone ssp { 1811entry: 1812 %and = and i32 %a, -16 1813 %div = sdiv i32 %and, 16 1814 ret i32 %div 1815} 1816 1817but this code (X & -A)/A is X >> log2(A) when A is a power of 2, so this case 1818should be instcombined into just "a >> 4". 
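
A quick source-level check of that claim (illustrative only; it assumes the
usual two's-complement representation and arithmetic right shift of negative
values, which is what the IR-level transform relies on):

#include <stdio.h>

int main(void) {
  /* (a & ~15) rounds a down to a multiple of 16, and dividing an exact
     multiple of 16 by 16 matches an arithmetic shift right by 4. */
  for (int a = -1000; a <= 1000; ++a)
    if ((a & ~15) / 16 != a >> 4) {
      printf("mismatch at %d\n", a);
      return 1;
    }
  printf("(a & ~15)/16 == a >> 4 on [-1000,1000]\n");
  return 0;
}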

//===---------------------------------------------------------------------===//

This code (from GCC PR28685):

int test(int a, int b) {
  int lt = a < b;
  int eq = a == b;
  if (lt)
    return 1;
  return eq;
}

Is compiled to:

define i32 @test(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %cmp = icmp slt i32 %a, %b
  br i1 %cmp, label %return, label %if.end

if.end:                                     ; preds = %entry
  %cmp5 = icmp eq i32 %a, %b
  %conv6 = zext i1 %cmp5 to i32
  ret i32 %conv6

return:                                     ; preds = %entry
  ret i32 1
}

it could be:

define i32 @test__(i32 %a, i32 %b) nounwind readnone ssp {
entry:
  %0 = icmp sle i32 %a, %b
  %retval = zext i1 %0 to i32
  ret i32 %retval
}

//===---------------------------------------------------------------------===//

This code can be seen in viterbi:

  %64 = call noalias i8* @malloc(i64 %62) nounwind
...
  %67 = call i64 @llvm.objectsize.i64(i8* %64, i1 false) nounwind
  %68 = call i8* @__memset_chk(i8* %64, i32 0, i64 %62, i64 %67) nounwind

llvm.objectsize.i64 should be taught about malloc/calloc, allowing it to
fold to %62.  This is a security win (overflows of malloc will get caught)
and also a performance win by exposing more memsets to the optimizer.

This occurs several times in viterbi.

Note that this would change the semantics of @llvm.objectsize which by its
current definition always folds to a constant.  We also should make sure that
we remove checking in code like

  char *p = malloc(strlen(s)+1);
  __strcpy_chk(p, s, __builtin_objectsize(p, 0));

//===---------------------------------------------------------------------===//

This code (from Benchmarks/Dhrystone/dry.c):

define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
entry:
  %sext = shl i32 %0, 24
  %conv = ashr i32 %sext, 24
  %sext6 = shl i32 %1, 24
  %conv4 = ashr i32 %sext6, 24
  %cmp = icmp eq i32 %conv, %conv4
  %. = select i1 %cmp, i32 10000, i32 0
  ret i32 %.
}

Should be simplified into something like:

define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
entry:
  %sext = shl i32 %0, 24
  %conv = and i32 %sext, 0xFF000000
  %sext6 = shl i32 %1, 24
  %conv4 = and i32 %sext6, 0xFF000000
  %cmp = icmp eq i32 %conv, %conv4
  %. = select i1 %cmp, i32 10000, i32 0
  ret i32 %.
}

and then to:

define i32 @Func1(i32, i32) nounwind readnone optsize ssp {
entry:
  %conv = and i32 %0, 0xFF
  %conv4 = and i32 %1, 0xFF
  %cmp = icmp eq i32 %conv, %conv4
  %. = select i1 %cmp, i32 10000, i32 0
  ret i32 %.
}

//===---------------------------------------------------------------------===//

clang -O3 currently compiles this code

int g(unsigned int a) {
  unsigned int c[100];
  c[10] = a;
  c[11] = a;
  unsigned int b = c[10] + c[11];
  if(b > a*2) a = 4;
  else a = 8;
  return a + 7;
}

into

define i32 @g(i32 %a) nounwind readnone {
  %add = shl i32 %a, 1
  %mul = shl i32 %a, 1
  %cmp = icmp ugt i32 %add, %mul
  %a.addr.0 = select i1 %cmp, i32 11, i32 15
  ret i32 %a.addr.0
}

The icmp should fold to false.  This CSE opportunity is only available
after GVN and InstCombine have run.

//===---------------------------------------------------------------------===//

memcpyopt should turn this:

define i8* @test10(i32 %x) {
  %alloc = call noalias i8* @malloc(i32 %x) nounwind
  call void @llvm.memset.p0i8.i32(i8* %alloc, i8 0, i32 %x, i32 1, i1 false)
  ret i8* %alloc
}

into a call to calloc.  We should make sure that we analyze calloc as
aggressively as malloc though.

//===---------------------------------------------------------------------===//

clang -O3 doesn't optimize this:

void f1(int* begin, int* end) {
  std::fill(begin, end, 0);
}

into a memset.  This is PR8942.

//===---------------------------------------------------------------------===//

clang -O3 -fno-exceptions currently compiles this code:

void f(int N) {
  std::vector<int> v(N);

  extern void sink(void*); sink(&v);
}

into

define void @_Z1fi(i32 %N) nounwind {
entry:
  %v2 = alloca [3 x i32*], align 8
  %v2.sub = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 0
  %tmpcast = bitcast [3 x i32*]* %v2 to %"class.std::vector"*
  %conv = sext i32 %N to i64
  store i32* null, i32** %v2.sub, align 8, !tbaa !0
  %tmp3.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 1
  store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
  %tmp4.i.i.i.i.i = getelementptr inbounds [3 x i32*]* %v2, i64 0, i64 2
  store i32* null, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
  %cmp.i.i.i.i = icmp eq i32 %N, 0
  br i1 %cmp.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i, label %cond.true.i.i.i.i

_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.thread.i.i: ; preds = %entry
  store i32* null, i32** %v2.sub, align 8, !tbaa !0
  store i32* null, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
  %add.ptr.i5.i.i = getelementptr inbounds i32* null, i64 %conv
  store i32* %add.ptr.i5.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
  br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit

cond.true.i.i.i.i:                          ; preds = %entry
  %cmp.i.i.i.i.i = icmp slt i32 %N, 0
  br i1 %cmp.i.i.i.i.i, label %if.then.i.i.i.i.i, label %_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i

if.then.i.i.i.i.i:                          ; preds = %cond.true.i.i.i.i
  call void @_ZSt17__throw_bad_allocv() noreturn nounwind
  unreachable

_ZNSt12_Vector_baseIiSaIiEEC2EmRKS0_.exit.i.i: ; preds = %cond.true.i.i.i.i
  %mul.i.i.i.i.i = shl i64 %conv, 2
  %call3.i.i.i.i.i = call noalias i8* @_Znwm(i64 %mul.i.i.i.i.i) nounwind
  %0 = bitcast i8* %call3.i.i.i.i.i to i32*
  store i32* %0, i32** %v2.sub, align 8, !tbaa !0
  store i32* %0, i32** %tmp3.i.i.i.i.i, align 8, !tbaa !0
  %add.ptr.i.i.i = getelementptr inbounds i32* %0, i64 %conv
  store i32* %add.ptr.i.i.i, i32** %tmp4.i.i.i.i.i, align 8, !tbaa !0
  call void @llvm.memset.p0i8.i64(i8* %call3.i.i.i.i.i, i8 0, i64 %mul.i.i.i.i.i, i32 4, i1 false)
  br label %_ZNSt6vectorIiSaIiEEC1EmRKiRKS0_.exit

This is just the handling of the construction of the vector.  Most surprising
here is the fact that all three null stores in %entry are dead (because we do
no cross-block DSE).

Also surprising is that %conv isn't simplified to 0 in %....exit.thread.i.i.
This is because the client of LazyValueInfo doesn't simplify all instruction
operands, just selected ones.

//===---------------------------------------------------------------------===//

clang -O3 -fno-exceptions currently compiles this code:

void f(char* a, int n) {
  __builtin_memset(a, 0, n);
  for (int i = 0; i < n; ++i)
    a[i] = 0;
}

into:

define void @_Z1fPci(i8* nocapture %a, i32 %n) nounwind {
entry:
  %conv = sext i32 %n to i64
  tail call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %conv, i32 1, i1 false)
  %cmp8 = icmp sgt i32 %n, 0
  br i1 %cmp8, label %for.body.lr.ph, label %for.end

for.body.lr.ph:                             ; preds = %entry
  %tmp10 = add i32 %n, -1
  %tmp11 = zext i32 %tmp10 to i64
  %tmp12 = add i64 %tmp11, 1
  call void @llvm.memset.p0i8.i64(i8* %a, i8 0, i64 %tmp12, i32 1, i1 false)
  ret void

for.end:                                    ; preds = %entry
  ret void
}

This shouldn't need the ((zext (%n - 1)) + 1) game, and it should ideally fold
the two memset's together.

The issue with the addition only occurs in 64-bit mode, and appears to be at
least partially caused by Scalar Evolution not keeping its cache updated: it
returns the "wrong" result immediately after indvars runs, but figures out the
expected result if it is run from scratch on IR resulting from running indvars.

//===---------------------------------------------------------------------===//

clang -O3 -fno-exceptions currently compiles this code:

struct S {
  unsigned short m1, m2;
  unsigned char m3, m4;
};

void f(int N) {
  std::vector<S> v(N);
  extern void sink(void*); sink(&v);
}

into poor code for zero-initializing 'v' when N is >0.  The problem is that
S is only 6 bytes, but each element is 8 byte-aligned.  We generate a loop and
4 stores on each iteration.  If the struct were 8 bytes, this gets turned into
a memset.

In order to handle this we have to:
  A) Teach clang to generate metadata for memsets of structs that have holes in
     them.
  B) Teach clang to use such a memset for zero init of this struct (since it has
     a hole), instead of doing elementwise zeroing.

//===---------------------------------------------------------------------===//

clang -O3 currently compiles this code:

extern const int magic;
double f() { return 0.0 * magic; }

into

@magic = external constant i32

define double @_Z1fv() nounwind readnone {
entry:
  %tmp = load i32* @magic, align 4, !tbaa !0
  %conv = sitofp i32 %tmp to double
  %mul = fmul double %conv, 0.000000e+00
  ret double %mul
}

We should be able to fold away this fmul to 0.0.  More generally, fmul(x,0.0)
can be folded to 0.0 if we can prove that the LHS is not -0.0, not a NaN, and
not an INF.  The CannotBeNegativeZero predicate in value tracking should be
extended to support general "fpclassify" operations that can return
yes/no/unknown for each of these predicates.

In this predicate, we know that uitofp is trivially never NaN or -0.0, and
we know that it isn't +/-Inf if the floating point type has enough exponent bits
to represent the largest integer value as < inf.
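
For reference, the three cases that block an unconditional fmul(x, 0.0) -> 0.0
fold can be seen directly at the source level (illustration only, not part of
the original testcase):

#include <math.h>
#include <stdio.h>

int main(void) {
  /* 0.0 * -1.0 is -0.0, and 0.0 times a NaN or an infinity is NaN, so the
     fold needs the not-negative/not-NaN/not-Inf predicate described above. */
  double blockers[] = { -1.0, NAN, INFINITY };
  for (int i = 0; i < 3; ++i)
    printf("0.0 * %g = %g\n", blockers[i], 0.0 * blockers[i]);
  return 0;
}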

//===---------------------------------------------------------------------===//

When optimizing a transformation that can change the sign of 0.0 (such as the
0.0*val -> 0.0 transformation above), it might be provable that the sign of the
expression doesn't matter.  For example, by the above rules, we can't transform
fmul(sitofp(x), 0.0) into 0.0, because x might be -1 and the result of the
expression is defined to be -0.0.

If we look at the uses of the fmul, for example, we might be able to prove that
all uses don't care about the sign of zero.  For example, if we have:

  fadd(fmul(sitofp(x), 0.0), 2.0)

Since we know that x+2.0 doesn't care about the sign of any zeros in X, we can
transform the fmul to 0.0, and then the fadd to 2.0.

//===---------------------------------------------------------------------===//

We should enhance memcpy/memmove/memset to allow a metadata node on them
indicating that some bytes of the transfer are undefined.  This is useful for
frontends like clang when lowering struct copies, when some elements of the
struct are undefined.  Consider something like this:

struct x {
  char a;
  int b[4];
};
void foo(struct x*P);
struct x testfunc() {
  struct x V1, V2;
  foo(&V1);
  V2 = V1;

  return V2;
}

We currently compile this to:
$ clang t.c -S -o - -O0 -emit-llvm | opt -scalarrepl -S


%struct.x = type { i8, [4 x i32] }

define void @testfunc(%struct.x* sret %agg.result) nounwind ssp {
entry:
  %V1 = alloca %struct.x, align 4
  call void @foo(%struct.x* %V1)
  %tmp1 = bitcast %struct.x* %V1 to i8*
  %0 = bitcast %struct.x* %V1 to i160*
  %srcval1 = load i160* %0, align 4
  %tmp2 = bitcast %struct.x* %agg.result to i8*
  %1 = bitcast %struct.x* %agg.result to i160*
  store i160 %srcval1, i160* %1, align 4
  ret void
}

This happens because SRoA sees that the temp alloca is being memcpy'd into and
out of, and it has holes, so it has to be conservative.  If we knew about the
holes, then this could be much much better.

Having information about these holes would also improve memcpy (etc) lowering
at llc time when it gets inlined, because we can use smaller transfers.  This
also avoids partial register stalls in some important cases.

//===---------------------------------------------------------------------===//

We don't fold (icmp (add) (add)) unless the two adds only have a single use.
There are a lot of cases that we're refusing to fold in (e.g.) 256.bzip2, for
example:

  %indvar.next90 = add i64 %indvar89, 1     ;; Has 2 uses
  %tmp96 = add i64 %tmp95, 1                ;; Has 1 use
  %exitcond97 = icmp eq i64 %indvar.next90, %tmp96

We don't fold this because we don't want to introduce an overlapped live range
of the ivar.  However, we can make this more aggressive without causing
performance issues in two ways:

1. If *either* the LHS or RHS has a single use, we can definitely do the
   transformation.  In the overlapping liverange case we're trading one register
   use for one fewer operation, which is a reasonable trade.  Before doing this
   we should verify that the llc output actually shrinks for some benchmarks.
2. If both ops have multiple uses, we can still fold it if the operations are
   both sinkable to *after* the icmp (e.g. in a subsequent block) which doesn't
   increase register pressure.

There are a ton of icmp's we aren't simplifying because of the reg pressure
concern.  Care is warranted here though because many of these are induction
variables and other cases that matter a lot to performance, like the above.
Here's a blob of code that you can drop into the bottom of visitICmp to see some
missed cases:

  { Value *A, *B, *C, *D;
    if (match(Op0, m_Add(m_Value(A), m_Value(B))) &&
        match(Op1, m_Add(m_Value(C), m_Value(D))) &&
        (A == C || A == D || B == C || B == D)) {
      errs() << "OP0 = " << *Op0 << "  U=" << Op0->getNumUses() << "\n";
      errs() << "OP1 = " << *Op1 << "  U=" << Op1->getNumUses() << "\n";
      errs() << "CMP = " << I << "\n\n";
    }
  }

//===---------------------------------------------------------------------===//

define i1 @test1(i32 %x) nounwind {
  %and = and i32 %x, 3
  %cmp = icmp ult i32 %and, 2
  ret i1 %cmp
}

Can be folded to (x & 2) == 0.

define i1 @test2(i32 %x) nounwind {
  %and = and i32 %x, 3
  %cmp = icmp ugt i32 %and, 1
  ret i1 %cmp
}

Can be folded to (x & 2) != 0.

SimplifyDemandedBits shrinks the "and" constant to 2 but instcombine misses the
icmp transform.

//===---------------------------------------------------------------------===//

This code:

typedef struct {
int f1:1;
int f2:1;
int f3:1;
int f4:29;
} t1;

typedef struct {
int f1:1;
int f2:1;
int f3:30;
} t2;

t1 s1;
t2 s2;

void func1(void)
{
s1.f1 = s2.f1;
s1.f2 = s2.f2;
}

Compiles into this IR (on x86-64 at least):

%struct.t1 = type { i8, [3 x i8] }
@s2 = global %struct.t1 zeroinitializer, align 4
@s1 = global %struct.t1 zeroinitializer, align 4
define void @func1() nounwind ssp noredzone {
entry:
  %0 = load i32* bitcast (%struct.t1* @s2 to i32*), align 4
  %bf.val.sext5 = and i32 %0, 1
  %1 = load i32* bitcast (%struct.t1* @s1 to i32*), align 4
  %2 = and i32 %1, -4
  %3 = or i32 %2, %bf.val.sext5
  %bf.val.sext26 = and i32 %0, 2
  %4 = or i32 %3, %bf.val.sext26
  store i32 %4, i32* bitcast (%struct.t1* @s1 to i32*), align 4
  ret void
}

The two or/and's should be merged into one each.
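
At the source level, the merged form would copy both one-bit fields with a
single and/or pair, along the lines of this hand-written sketch (s1_word and
s2_word stand in for the i32 accesses to @s1 and @s2 above; they are not from
the original testcase):

#include <stdio.h>

static unsigned s1_word, s2_word;

/* One load of each word, one mask/merge that moves both low bits at once,
   one store -- the shape the combined or/and's should take. */
static void func1_merged(void) {
  s1_word = (s1_word & ~3u) | (s2_word & 3u);
}

int main(void) {
  s1_word = 0xFFFFFFFFu;
  s2_word = 0x00000002u;                    /* f1 = 0, f2 = 1 */
  func1_merged();
  printf("0x%08X\n", s1_word);              /* low two bits come from s2_word */
  return 0;
}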

//===---------------------------------------------------------------------===//

Machine level code hoisting can be useful in some cases.  For example, PR9408
is about:

typedef union {
  void (*f1)(int);
  void (*f2)(long);
} funcs;

void foo(funcs f, int which) {
  int a = 5;
  if (which) {
    f.f1(a);
  } else {
    f.f2(a);
  }
}

which we compile to:

foo:                                    # @foo
# BB#0:                                 # %entry
        pushq   %rbp
        movq    %rsp, %rbp
        testl   %esi, %esi
        movq    %rdi, %rax
        je      .LBB0_2
# BB#1:                                 # %if.then
        movl    $5, %edi
        callq   *%rax
        popq    %rbp
        ret
.LBB0_2:                                # %if.else
        movl    $5, %edi
        callq   *%rax
        popq    %rbp
        ret

Note that bb1 and bb2 are the same.  This doesn't happen at the IR level
because one call is passing an i32 and the other is passing an i64.

//===---------------------------------------------------------------------===//

I see this sort of pattern in 176.gcc in a few places (e.g. the start of
store_bit_field).  The rem should be replaced with a multiply and subtract:

  %3 = sdiv i32 %A, %B
  %4 = srem i32 %A, %B

Similarly for udiv/urem.  Note that this shouldn't be done on X86 or ARM,
which can do this in a single operation (instruction or libcall).  It is
probably best to do this in the code generator.

//===---------------------------------------------------------------------===//

unsigned foo(unsigned x, unsigned y) { return (x & y) == 0 || x == 0; }
should fold to (x & y) == 0.

//===---------------------------------------------------------------------===//

unsigned foo(unsigned x, unsigned y) { return x > y && x != 0; }
should fold to x > y.

//===---------------------------------------------------------------------===//
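
Both of the folds above are easy to spot-check at the source level (sketch
only):

#include <stdio.h>

static int f1(unsigned x, unsigned y) { return (x & y) == 0 || x == 0; }
static int f2(unsigned x, unsigned y) { return x > y && x != 0; }

int main(void) {
  /* x == 0 implies (x & y) == 0, and x > y implies x != 0 for unsigned x,
     so the second operand of each || and && is redundant. */
  for (unsigned x = 0; x < 64; ++x)
    for (unsigned y = 0; y < 64; ++y)
      if (f1(x, y) != ((x & y) == 0) || f2(x, y) != (x > y)) {
        printf("mismatch at %u,%u\n", x, y);
        return 1;
      }
  printf("folds agree on [0,64)^2\n");
  return 0;
}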