Sunday, January 8, 2017

GCC code generation for C++ Weekly Ep 43 example

Episode 43 of “C++ Weekly” talks about evaluating and eliminating code at compile time, and the example is fun as it triggers a few different deficiencies in the GCC optimization passes (using the -O3 optimization level with GCC trunk r243987 built for x86_64-linux-gnu).

The example
#include <type_traits>
#include <numeric>
#include <iterator>

template<typename ... T>
int sum(T ... t)
{
  std::common_type_t<T...> array[sizeof...(T)]{ t... };

  return std::accumulate(std::begin(array), std::end(array), 0);
}

int main()
{
  return sum(5,4,3,2,1);
}
is meant to be optimized to return a constant value, and it does
main:
    movl    $15, %eax
    ret
But the behavior varies a lot depending on how many arguments are passed to sum1–5 arguments work as expected, while calling sum with 6 or 7 arguments is generated as
main:
 movdqa .LC0(%rip), %xmm0
 movaps %xmm0, -40(%rsp)
 movl -32(%rsp), %edx
 movl -28(%rsp), %eax
 leal 14(%rdx,%rax), %eax
 ret
where .LC0 is an array consisting of four constants. 9–12 arguments are similar (but with more code). 

13–28 arguments are generated as a constant again
main:
    movl    $91, %eax
    ret

29–64 arguments are optimized to a constant, but with some redundant stack adjustments when the number of arguments is not divisible by four
main:
    subq    $16, %rsp
    movl    $435, %eax
    addq    $16, %rsp
    ret

Finally, 65 and more arguments are generated as a vectorized monstrosity in a separate function, called from main by pushing all the arguments to the stack, one at a time
main:
    subq    $16, %rsp
    movl    $60, %r9d
    movl    $61, %r8d
    pushq   $1
    pushq   $2
    movl    $62, %ecx
    pushq   $3
    ushq    $4
    ...
This is essentially as far from generating a constant as you can come. 😀

The rest of the blog post will look at how GCC is reasoning when trying to optimize this, by examining GCC's internal representation of the program at different points in the optimization pipeline. The IR works essentially as a restricted version of the C language, and you can get GCC to write the IR to a file after each pass by using the command-line option -fdump-tree-all.

1–5 arguments

The use of std::accumulate and iterators expand to five functions, and the compiler starts by inlining and simplifying this to
int main() ()
{
  common_type_t array[5];
  int __init;
  int * __first;
  int _3;

  <bb 2> [16.67%]:
  array[0] = 5;
  array[1] = 4;
  array[2] = 3;
  array[3] = 2;
  array[4] = 1;

  <bb 3> [100.00%]:
  # __first_2 = PHI <&array(2), __first_6(4)>
  # __init_4 = PHI <0 __init_5>
  if (&MEM[(void *)&array + 20B] == __first_2)
    goto <bb 5>; [16.67%]
  else
    goto <bb 4>; [83.33%]

  <bb 4> [83.33%]:
  _3 = *__first_2;
  __init_5 = _3 + __init_4;
  __first_6 = __first_2 + 4;
  goto <bb 3>; [100.00%]

  <bb 5> [16.67%]:
  array ={v} {CLOBBER};
  return __init_4;
}
The loop is immediately unrolled and simplified to
int main() ()
{
  common_type_t array[5];
  int __init;
  int * __first;
  int _15;
  int _20;
  int _25;
  int _30;
  int _35;

  <bb 2> [16.70%]:
  array[0] = 5;
  array[1] = 4;
  array[2] = 3;
  array[3] = 2;
  array[4] = 1;
  _15 = MEM[(int *)&array];
  __init_16 = _15;
  _20 = MEM[(int *)&array + 4B];
  __init_21 = _15 + _20;
  _25 = MEM[(int *)&array + 8B];
  __init_26 = __init_21 + _25;
  _30 = MEM[(int *)&array + 12B];
  __init_31 = __init_26 + _30;
  _35 = MEM[(int *)&array + 16B];
  __init_36 = __init_31 + _35;
  array ={v} {CLOBBER};
  return __init_36;
}
that is then optimized to a constant by the fre3 (“Full Redundancy Elimination”) pass
int main() ()
{
  <bb 2> [16.70%]:
  return 15;
}

6–12 arguments

The early optimizations that handled the previous case are there mostly to get rid of noise before the heavier optimizations (such as loop optimizations) kicks in, and the loop doing 6 iterations is considered too big to be unrolled by the early unroller.

The “real” loop optimizers determine that the loop is not iterating enough for vectorization to be profitable, so it is just unrolled. The “SLP vectorizer” that vectorizes straight line code is run right after the loop optimizations, and it sees that we are copying constants into consecutive addresses, so it combines four of them to a vector assignment
MEM[(int *)&array] = { 6, 5, 4, 3 };
array[4] = 2;
array[5] = 1;
This is now simplified by the dom3 pass that does SSA dominator optimizations (jump threading, redundancy elimination, and const/copy propagation), but it does not understand that a scalar initialized by a constant vector is a constant, so it only propagates the constants for array[4] and array[5] that were initialized as scalars, and the code passed to the backend looks like
int main() ()
{
  common_type_t array[6];
  int __init;
  int _15;
  int _22;
  int _27;
  int _32;

  <bb 2> [14.31%]:
  MEM[(int *)&array] = { 6, 5, 4, 3 };
  _15 = MEM[(int *)&array];
  _22 = MEM[(int *)&array + 4B];
  __init_23 = _15 + _22;
  _27 = MEM[(int *)&array + 8B];
  __init_28 = __init_23 + _27;
  _32 = MEM[(int *)&array + 12B];
  __init_33 = __init_28 + _32;
  __init_43 = __init_33 + 3;
  array ={v} {CLOBBER};
  return __init_43;
}

13–28 arguments

The loop is now iterated enough times that the compiler determines that vectorization is profitable. The idea behind the vectorization is to end up with something like
tmp = { array[0], array[1], array[2], array[3] }
    + { array[4], array[5], array[6], array[7] }
    + { array[8], array[9], array[10], array[11] };
sum = tmp[0] + tmp[1] + tmp[2] + tmp[3] + array[12];
and the vectorizer generates two loops — one that consumes four elements at a time as long as possible, and one that consumes the remaining elements one at a time. The rest of the loop optimizers know how many times the loops are iterating, so the loops can then be unrolled etc. as appropriate.

The vectorizer is, unfortunately, generating somewhat strange code for the checks that there are enough elements
_35 = (unsigned long) &MEM[(void *)&array + 52B];
_36 = &array + 4;
_37 = (unsigned long) _36;
_38 = _35 - _37;
_39 = _38 /[ex] 4;
_40 = _39 & 4611686018427387903;
if (_40 <= 4)
  goto ; [10.00%]
else
  goto ; [90.00%]
that confuse the rest of the loop optimizations, with the result that the IR contains lots of conditional code of this form. This is not the first time I have seen GCC having problems with the pointer arithmetics from iterators (see bug 78847), and I believe this is the same problem (as the bitwise and should be optimized away when the pointer arithmetics has been evaluated to a constant).

The subsequent passes mostly manage to clean up these conditionals, and dom3 optimizes the vector operations to a constant. But it does not understand the expression used to decide how many scalar elements need to be handled if the iteration count is not a multiple of four (that check is eliminated by the value range propagation pass after dom3 is run), so the scalar additions are kept in the code given to the backend
int main() ()
{
  common_type_t array[13];
  int __init;
  int _23;

   [7.14%]:
  array[12] = 1;
  _23 = MEM[(int *)&array + 48B];
  __init_3 = _23 + 90;
  array ={v} {CLOBBER};
  return __init_3;
}
This is, however, not much of a problem for this program, as the backend manages to optimize this to
main:
    movl    $91, %eax
    ret
when generating the code.

2964 arguments

The backend eliminates the memory accesses to the array in the previous case, but the array seems to be kept on the stack. That makes sense — the code is supposed to be optimized before being passed to the backend, so the backend should not be able to eliminate variables, and there is no need to implement code doing this.

Leaf functions do not need to adjust the stack, but GCC does some stack adjustment on leaf functions too when more than 112 bytes are placed on the stack. You can see this for the meaningless function
void foo()
{
    volatile char a[113];
    a[0] = 0;
}
where the stack is adjusted when the array size is larger than 112.
foo:
    subq    $16, %rsp
    movb    $0, -120(%rsp)
    addq    $16, %rsp
    ret
I do not understand what GCC is trying to do here...

Anyway, passing 29 arguments to sum makes the array large enough that GCC adds the stack adjustments.

65– arguments

The sequence of assignments initializing the array is now large enough that sum is not inlined into main.

9 comments:

  1. Very interesting, thanks for writing/sharing.
    Did you tried the same kind of investigations with clang++ ?

    ReplyDelete
    Replies
    1. Thanks! I have not looked much into clang optimizations — the blog posts are mostly observations from my attempts to improve gcc...

      Delete
    2. clang does better
      https://godbolt.org/g/9BO2b3

      Delete
    3. This comment has been removed by the author.

      Delete
    4. This comment has been removed by the author.

      Delete
    5. Third time lucky! Clang falls over in other ways:
      https://godbolt.org/g/V1FMMW

      Delete
  2. "I do not understand what GCC is trying to do here..."

    If you pass -mno-red-zone to the compilation, the stack is also adjusted for 112 bytes. So I guess for stack sizes <= 112 GCC doesn't have to adjust the stack because it still fits in the red zone (it has a size of 128 on x86_64).

    I am not sure why it stops for 112 and not 128?

    ReplyDelete
    Replies
    1. Thanks! I had totally forgotten the red zone (I'm usually working on embedded devices, and x86 is alien to me...)

      Delete
  3. I tested how clang 3.9 behaves (on godbolt.org). Clang folds to constant up to 45 arguments, for a longer list it generates a function. The function, depending on argument count, is either scalar or use SIMD instructions.

    ReplyDelete