## Monday, October 17, 2016

### Inlining — main is special

I wrote in my previous post that I assumed Jason compiled using the -O2 optimization level for the example in his CppCon 2016 talk, as it worked for me when I used -O3. But I was wrong — Jason used -O3, but his example was slightly different compared to mine. I used the function
int bar()
{
return std::string("a").size() + std::string("b").size();
}

that optimizes to just returning a constant, while Jason used
int main()
{
return std::string("a").size() + std::string("b").size();
}

which does not optimize. The only difference is the name of the function...

The reason for the difference is that GCC knows that main is special — it has the property that it is called only once. That is, it is a cold function, which makes the inlining less aggressive.

GCC is propagating the "called only once" information where possible, so functions for which the compiler can see all callers (such as static functions, functions in an anonymous namespace, or when compiling using -fwhole-program) are also marked as "called only once" if they are called exactly once from such a function. For example, the compiler can see that bar is called only once in
static int bar()
{
return std::string("a").size() + std::string("b").size();
}

int main()
{
return bar();
}

and the string code is not inlined. Removing the static prevents GCC from marking bar as called only once (as it could be called from outside the translation unit), and the string code will now be inlined.

On the other hand, if the "called only once" function contains loops, then GCC can infer that it makes sense to optimize the content of the loop as it is executed multiple times. So for example
int main()
{
int k = 0;
for (int i = 0; i < 10000; i++)
k += std::string("a").size() + std::string("b").size();
return k;
}

gets the strings inlined, and it optimizes to
main:
mov     eax, 20000
ret


## Sunday, October 16, 2016

### C++ and code inlining

Jason Turner mentions in his CppCon 2016 talk that GCC optimizes the function
int foo(void)
{
return std::string("a").size();
}

to just a return of a constant value
_Z3foov:
movl    \$1, %eax
ret

while
int bar(void)
{
return std::string("a").size() + std::string("b").size();
}

generates a less optimized result involving calls to constructors, destructors, etc. The difference between the two cases comes from how GCC handles inlining.

The basic idea behind the GCC inliner is to inline greedily as long as it estimates that the code size does not increase too much (where the growth limit depends on optimization level, if the function is hot/cold, etc.). One important special case is functions that are called exactly once — they do not increase code size if they are inlined, as the inlining just moves the code from the callee into the caller.

STL usage often expands to a large amount of code, for example, the string constructor used in the examples above calls
template<typename _CharT, typename _Traits, typename _Alloc>
template<typename _InIterator>
void
basic_string<_CharT, _Traits, _Alloc>::
_M_construct(_InIterator __beg, _InIterator __end,
std::forward_iterator_tag)
{
// NB: Not required, but considered best practice.
if (__gnu_cxx::__is_null_pointer(__beg) && __beg != __end)
std::__throw_logic_error(__N("basic_string::"
"_M_construct null not valid"));

size_type __dnew =
        static_cast<size_type>(std::distance(__beg, __end));

if (__dnew > size_type(_S_local_capacity))
{
_M_data(_M_create(__dnew, size_type(0)));
_M_capacity(__dnew);
}

// Check for out_of_range and length_error exceptions.
__try
{ this->_S_copy_chars(_M_data(), __beg, __end); }
__catch(...)
{
_M_dispose();
__throw_exception_again;
}

_M_set_length(__dnew);
}

This in turn calls several other functions, and more than 50 function calls need to be inlined in order to fully optimize foo. The temporary code increase of inlinling before the code gets optimized is over the limit allowed by -O2, but foo has exactly one string, so the compiler can inline this without increasing code size, and the compiler can eventually eliminate everything. But the bar function does not get the constructors inlined, as it constructs two strings...

This can have surprising effects when trying to understand how well code optimize. For example, we saw that foo optimizes to return a constant value, so let us add one more identical function
int foo(void)
{
return std::string("a").size();
}

int foo2(void)
{
return std::string("b").size();
}

We now construct two strings, which prevents inlining. So neither foo nor foo2 will be fully optimized when compiled with -O2.

The -O3 optimization level allows more inlining, so all examples in this blog post optimize fully when compiled using -O3 (so I assume Jason used -O2 for his examples). In general, -O2 does optimizations that do not involve a space-speed tradeoff, while -O3 allows the code size to increase if the resulting code is faster (which is good if you have enough cache).

To conclude:
• It is not enough to look at trivial code snippets if you want to verify that your complex templated code is being optimized as you expect.
• You may want to use -O3 rather than -O2 when compiling code for modern architectures.

## Sunday, September 25, 2016

### The cost of clearing memory

The technical report "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V" looks at how the RISC-V ISA compares to ARM and Intel by analyzing the result from running SPEC CINT2006. One thing that surprised me was that 30% of the instructions executed when running the 403.gcc benchmark (which compiles a few files using a modified GCC 3.2) is from doing memset!

The RISC-V memset loop writes 16 bytes in 4 instructions
// RV64G, 4 instructions to move 16 bytes
4a3814:   sd    a1, 0(a4)
4a3818:   sd    a1, 8(a4)
4a3820:   bltu  a4, a3, 4a3814

which is somewhat less efficient compared to ARM and Intel that writes more data per instruction
// armv8, 6 instructions to move 64 bytes
6f0928:   stp   x7, x7, [x8,#16]
6f092c:   stp   x7, x7, [x8,#32]
6f0930:   stp   x7, x7, [x8,#48]
6f0934:   stp   x7, x7, [x8,#64]!
6f0938:   subs  x2, x2, #0x40
6f093c:   b.ge  6f0928

so this should translate to about 10% of the executed ARM instructions doing memset. But that is still much more than what I would have guessed.

I do not have access to SPEC, and I have been too lazy to try to replicate with other data, but a quick literature search indicates that this is not as insane as I thought. The papers I have found look at the cost of clearing data in garbage collection implementations, and they seem to get a similar result for the cost. For example "Why Nothing Matters: The Impact of Zeroing" says
We show that existing approaches of zero initialization are surprisingly expensive. On three modern IA32 architectures, the direct cost is around 2.7-4.5% on average and as much as 12.7% of all cycles, in a high-performance Java Virtual Machine (JVM), without accounting for indirect costs due to cache displacement and memory bandwidth consumption.

## Monday, September 12, 2016

### Mobile GPUs — power, performance, area

I read an IWOCL 2016 conference paper "OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges" that gives some good advice on how to measure GPU performance on mobile devices. The paper focuses on micro benchmarks, and it is hard to use these to draw relevant conclusions — especially as mobile GPUs are rather different compared to the desktop CPUs that most developers are used to. Below are some of the things that were unclear to me when I started working on the shader compiler for a mobile GPU.

#### Power consumption and heat

The major design constraint for mobile devices is power consumption as it is hard to get rid of heat without large heatsinks and fans. The amount of heat that can be handled varies a lot between devices depending on the size and how they are built, but it corresponds to somewhere between 1W for a low-end phone and 7W for a tablet — and that includes power for CPU, GPU, memory, radio, etc.

It takes a while for the temperature to rise, so the hardware may temporarily run hotter (and faster), but the device will eventually throttle the performance if it is running too hot. This means that benchmarking results are not that interesting unless you know how long that performance can be sustained, and measurements such as peak performance tend to say just how over-dimensioned the hardware is.

I find it amusing that the discussion in the paper suggest that you want high performance numbers
[...] benchmarks with long running time may trigger thermal gating in certain cases, resulting in lower performance result.
I usually want realistic results, but wanting high numbers makes sense if you (as the authors) work at a hardware vendor and want a high number in the marketing material. They also suggest
Running the benchmark in a temperature-controlled environment is one option; if such an option is not available, adding idle periods between workloads may reduce chances of high system temperature.
which is especially important if your SoC has a heat problem. 😏

#### Dynamic Voltage and Frequency Scaling

Power consumption increases roughly quadratically with clock frequency1 — for example, raising the frequency from 600MHz to 700MHz on Exynos 5433 increases power consumption by 42%. This means it is better to lower the clock frequency and keep the GPU running 100% of the time instead of, for example, running it on full speed 75% of the time and being idle the remaining 25%.

This performance tuning is done by Dynamic Voltage and Frequency Scaling (DVFS). It is hard to make a good implementation as different applications have different requirements, and there is no "correct" tradeoff. For example, should a high end game run at full speed, and be throttled (i.e. reduced to unplayable frame rate) when the device overheats, or should it run on a lower sustainable speed from the beginning? Different device vendors implement DVFS in different ways, so two phones with the same GPU may behave differently.

Different operations need a different amount of power, and a good DVFS implementation uses this when adjusting the voltage and frequency. For example, memory operations consumes much more power than arithmetic operations, and this is used in Exynos to use lower voltage/frequency for shaders using more memory operations. This is "fun" when optimizing shaders, as a faster shader (as measured in number of clock cycles) does not need to run faster in reality if it uses more power hungry instructions and thus get a lower clock frequency.

#### Power- and area-efficiency

GPU workloads are embarrassingly parallel, so it is easy to double the performance of the hardware if you are allowed to increase the power and chip area — just place two identical GPUs in the package! In the same way, you can get much improved power efficiency by using two GPUs and running them with halved frequency. This means that you need to look at metrics such as "performance per area power" when comparing GPU architectures.

This is annoying when developing GPUs, as most ideas for improving the performance means that some hardware block becomes more complicated, with the result that the size increases and it consumes more power. And it does not make much sense making the GPU 10% faster if it also need 10% more area and power...

1. The dynamic power consumption is actually $$P_{dyn}=CV^{2}f$$ where $$C$$ is capacitance, $$V$$ is voltage, and $$f$$ is frequency. This varies linearly with the frequency, but increased frequency need a higher voltage, and the power consumption thus varies superlinearly with the frequency.

## Sunday, August 14, 2016

### Surprising properties of C null pointer constants

A previous blog post mentioned two properties of NULL that many developers find surprising:
• NULL does not need to be 0 when stored in memory — the compiler is allowed to transform the value when casting between integer and pointer types, so
union {
uintptr_t u;
void *p;
} u;
u.p = NULL;
printf("0x%" PRIxPTR "\n", u.u);

does not need to print 0, even though u.p==(void*)0 evaluates to true.
• NULL does not need to be a pointer — it is valid for an implementation to define NULL as
#define NULL 0

so it is not portable to pass NULL to a va_arg function, as this will fail for architectures where sizeof(int)!=sizeof(void*).

Today I found a third case in the paper "Beyond the PDP-11: Architectural support for a memory-safe C abstract machine" — casting an integer variable having the value 0 to a pointer type is not guaranteed to produce a null pointer. For example, it is implementation-defined if the comparison
int zero = 0;
bool b = (void*)0 == (void*)zero;

evaluates to true or false. Both (void*)0 and (void*)zero cast the value 0 to void*, so I had naively thought that both would produce a null pointer, but the standard does only guarantee this for (void*)0 as it is a constant expression
An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant. If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function.
while the result is implementation-defined when casting a variable
An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.
So (void*)0 is a null pointer while (void*)zero is converted in an implementation-defined manner,  which may or may not produce a null pointer...

## Sunday, July 24, 2016

### Code behaving differently in C90, C99, C11, C++98, and C++11

There are some subtle differences between the revisions of the C standard that makes it possible to create programs that behaves differently depending on if they are compiled as C90, C99, or C11. Similarly, C++ is mostly a superset of C, but there are constructs that produce different results for C and C++.

This is used in Don Yang's contribution to the 2015 International Obfuscated C Code Contest to create a program that produces different output depending on if it is compiled as C89, C99, C11, C++98, C++11. For C90 it prints stars of the form
*********************************************              ***               **
***********************     *****************             ******             **
*********************        ****************           **********           **
********************         ****************         **************         **
******        ******         *****************     ****************************
******           **          **************************************************
******                      *************************(O************************
*******                     *     *********************************************
*********                              ****************************************
***********                             ***************************************
************                            ***************************************
*********                             *****************************************
*******                     ***************************************************
******                       ****************)d));o((d=************************
******           **          **************************************************
******         *****         *********************      ***********************
********************(         O******************        **********************
**********************       *******************          *********************
************************     *******************          **************)p)-p);
o(d-(p=*****************************************          *********************

For C99 the stars have eyes, C++11 prints circles, etc. (There is more to this. The program reads text from standard input, and the output is obfuscated C90 source code that prints that text — all the * characters are pointer dereferences!)

The source code for the program is a little bit hard to read
                                           #define r(R) R"()"
/*[*/#include  /**/<stdio.h>
#include<math.h>/*!![crc=0f527cd2]*/
float I,bu,k,i,F,u,U,K,O;char o[5200];int
#define R(U) (sizeof('U')==1||sizeof(U"1"[0])==1)
h=0,t=-1,m=80,n=26,d,g,p=0,q=0,v=0,y=112,x=40;  float
N(float/*x*/_){g=1<<30;d=-~d*1103515245&--g;return  d*_
/g;}void/**/w(int/**/_){if(t<0){for(g=0;g<5200;o[g++   ]=
0);for(;g;o[g+79]=10)g-=80;for(t=37;g<62;o[80+g++]=32)   ;
}if(m&&o[h*80+m-1]==10){for(g=0;g<79;o[t*80+g++]=0){}o[t
++*80+g]=10;t%=64;n+=2;I=N(70)+5;if(n>30&&(I-x)*(I-x)+n*
n>1600&&R()){O=0;F=(x=0x1!=sizeof(' '))?k=1+N(2),i=12-k+N(
8),N(4):(k=17+N(5),i=0,r()[0]?O=.1:  0);for(u=U=-.05;u<32;
U=k+i+i*.5*sin((u+=.05)+F))for( K=0   ;K< U;K+=.1)if((bu=K*
sin(u/5),g=I+cos( u/5) *K)>=0&&g  <     79  )o[g+(int)(t+44+
bu*(.5-(bu>0?3*O:  O)   ) )%64*  80      ]  =32;x*=02//* */2
-1;n=O+x?n=I+(x?0   :N     (k)-   k           /2),g=(t+42  )%
64,m=-~g%64,x?g=m          =-~        m%64:0  ,n>5?o[g*80   +
n-3]=o[m*80+n-3]=       0:   0              ,n <75?o[g*80+n
+2]=o[m*80+n+2]=0   :0:0;                      x=I;}h=-~h%64
;m=0;}putchar((g=o [h*                          80+m++])?g:_);
if(g){w(_);}}void W                               (const char*_
){for(;*_;w(*_++));}                               int main(int a
,char**_){while(a--)d              +=_[a          ]-(char*)0;W( \
"#include<stdio.h>typed"             "e"         "f\40int\40O;v"
"oid o(O _){putchar(_);}O"                    "\40main(){O"  ""
"*_[512],**p=_,**d,b,q;for(b=0;b"        "++<512;p=_+q)_[q"    \
"=(p-_+1)*9%512]=(O*)p;") ;      for(;(g= getchar())-EOF;p=
q){q=p;for(v=512;p-q-g&&q-p-              g;  v--)q=-~q*9%512
;W("o(");if(p>q)w(y),w(45);w(                      40);w(y^=20
);w(075);for(a=0;a<v;a++)w(42);                      for(W("(O**"
);a--;w(42)){}w(41);w(y^024);w(                      41);if(p<=q)w(
45),w(y^20);W(");");}for(a=7;a-6                      ;W(a<6?"{;}":""
))for(a  =0;a  <6 &&   !o[h*80+m                       +a];a++){}W("r"
"etu"  /*J   */       "rn+0;}\n"                             );return
/*                      "#*/0                                   ;}

but it is, as far as I can tell, using three tricks in order to detect which C or C++ dialect is used:

• // comments
C90 does not have // comments, so constructs of the form
int i = 2 //**/2
;

can be used to differentiate it from the other C and C++ revisions, as C90 compiles this as
int i = 2 //**/2
;

while C++ and the more recent revisions of C compiles this as
int i = 2 //**/2
;

• Type of character constants
Character constants, such as 'a', have type int in C while C++ use the type char. This means that sizeof('a') evaluates to a different value for C and C++.
• Wide string literals
C11 and C++11 have wide string literals where for example U"hello!" is a string with characters of type char32_t. This can be used with a macro
#define R(U) sizeof(U"a"[0])

that is used as R(""). For C11 and C++ this expands to
sizeof(U"a"[0])

which evaluates to 4, while the older revisions of the standards treat U and "a" as two tokens, and the macro expands to
sizeof("""a"[0])

which evaluates to 1.

## Wednesday, July 20, 2016

### Using MathJax in Blogger

I have added MathJax support to this blog so that I can typeset mathematics using a large subset of $$\LaTeX$$ commands. This functionality works in comments too, which has the nice side effect that it can be used to work around Blogger's limited formatting. In particular, formatting source code in comments can be done as
$$\verb|int foo(void)| \\ \verb|{| \\ \verb| return 0;| \\ \verb|}| \\$$

which is rendered with most of the formatting intact:
$$\verb|int foo(void)| \\ \verb|{| \\ \verb| return 0;| \\ \verb|}|$$

One annoying thing is that the $$\LaTeX$$ commands are not processed when writing or previewing the comment, but they are rendered correctly when it is published...

#### How to enable MathJax in Blogger

MathJax need to be loaded and configured when the page is loaded. This is done by adding the following
<script type="text/javascript" async="async"
src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML,Safe">
</script>

in the <head> block by editing the blog's template (choose "Template", and click the "Edit HTML" button). This only adds the support to the desktop template, so you need to enable it separately for mobile by pressing the gear button, and choosing the "custom" mobile template (that generates a mobile template from your desktop template).

There are several configurations to choose between, with support for $$\LaTeX$$, MathML, and AsciiMath notation, and control over how much functionality is loaded up front, and how much is loaded on-demand. I have chosen the TeX-AMS_CHTML which enables all $$\LaTeX$$ support, but avoids MathML and AsciiMath.

Note the ,Safe modifier added after the configuration name. This disables unsafe constructs such as running javascript from $$\LaTeX$$ commands
$E \href{javascript:alert("Einstein says so!")}{=} mc^2$

This is needed in order to prevent commenters messing up the blog by adding evil constructs in comments.

#### Testing the functionality

Below are some random formulas, just to verify that the rendering works as intended:
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)^{2}}$
\left\{ \begin{aligned} a_1x+b_1y+c_1z &=d_1+e_1 \\ a_2x+b_2y&=d_2 \\ a_3x+b_3y+c_3z &=d_3 \end{aligned} \right.
$\require{AMScd} \begin{CD} \pi(X, x_{0}) @>\phi_{*}>> \pi(Y, \phi(x_{0}))\\ @VVuV @VVvV\\ \pi(X, x_{1}) @>\phi_{*}>> \pi(Y, \phi(x_{1})) \end{CD}$