Thursday, December 15, 2016

Pointer comparison — an invalid optimization in GCC

I wrote in an earlier blog post that GCC has a somewhat aggressive interpretation of the standard so that it compiles p == q to false if p and q are pointers derived from different objects. I'm now convinced GCC is wrong.

The problem

C does not permit pointers to point at any memory — a pointer must point to an address within an object or on the address immediately following the object, and pointer arithmetic cannot make it point outside that range. When comparing pointers, the C standard says (emphasis added)
Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.
That is, the comparison in the printf call
#include <stdio.h>

int main(void)
{
int x[4], y[4];
int *p = &x[4];
int *q = &y[0];
printf("%p %p %d\n", (void*)p, (void*)q, p == q);
return 0;
}

evaluates to true if the compiler happens to place y directly after x in memory. GCC does, however, substitute address comparison to false if the pointers are derived from different objects, and the program above may give output such as
0x7fff66339290 0x7fff66339290 0

where the comparison evaluates to false even though both pointers are the same. The argument in bug 61502 is essentially that the standard is unclear what "immediately follow" means, and applications cannot rely on memory layout anyway (so there is no real use case). I believe both are wrong.

Use case

One use case for this kind of comparison in firmware for small embedded devices — you have a framework that supports different hardware configurations, but you want to avoid adding an #ifdef for every optional functionality. Instead, each implemented functionality exports a structure containing configuration data, and the linker collects them into an address range and generates a symbol __start for the start address, and __end for the address immediately following the array. The good thing about this is that the array contains exactly what is present at link time — elements are added to the array by just adding files to the link command, and functionality can be included/excluded by just relinking, without any need for additional configuration.

C does not have a convenient way of representing the __end label, so it is typically expressed as
extern struct my_struct __start[];
extern struct my_struct __end[];

where the linker guarantees that __end is immediately following __start. The array is processed as
void foo(void)
{
for (struct my_struct *x = __start; x != __end; x++)
do_something(x);
}

which is compiled to an infinite loop if GCC optimizes comparisons of pointers derived from two objects as always being different. The workaround suggested in the GCC mailing lists is to insert some inline assembly to confuse the compiler
void foo(void)
{
struct my_struct *x = __start;
struct my_struct *y = __end;
asm("":"+r"(x));
asm("":"+r"(y));
for (; x != y; x++)
do_something(x);
}

but this should not be needed for the natural reading of the standard excerpt quoted above.

This particular use case seems to work now as a side-effect by the "missing" optimization of global constant addresses, so the nasty workaround is not needed. But some GCC developers still claim the code is invalid...

The meaning of "following"

Comment #3 of the bug report argues that GCC is correct:
Except within a larger object, I'm not aware of any reason the cases of two objects following or not following each other in memory must be mutually exclusive. (If the implementation can track the origins of bit-patterns and where copies of those bit-patterns have got to, it might have a compacting garbage collector that relocates objects and changes what's adjacent to what, for example — I think such implementations are within the scope of what the C standard is intended to support. Or if you're concerned about how this changes bit-patterns of pointers, imagine that a C pointer is a (object key, offset) pair, and that comparison first converts the C pointer into a hardware address, where it's the mapping from object keys to hardware addresses that changes as a result of garbage collection rather than anything about the representation of the pointer.)
I do not think this is a correct interpretation. For example, DR #077 asks the question "Is the address of an object constant throughout its lifetime?", and the committee answers (emphasis added)
The C Standard does not explicitly state that the address of an object remains constant throughout the life of the object. That this is the intent of the Committee can be inferred from the fact that an address constant is considered to be a constant expression. The framers of the C Standard considered that it was so obvious that addresses should remain constant that they neglected to craft words to that effect.
That is, compacting garbage collection is not within the scope of what the C standard supports. I'm even less convinced by the key/offset example — this is the same situation as "normal" hardware, as you can view the virtual address as a pair (page, offset), and it is still the case that x[16] is immediately following x[15] for an array
uint32_t x[32];

even if x crosses a page boundary so that x[15] and x[16] are placed on different pages.

Friday, December 9, 2016

GCC __attribute__((pure)) and C++ exceptions

John Regehr tweeted the example
int foo(void) __attribute__((pure));

int bar(void) {
int acc = 0;
for (int i = 0; i < 100; ++i)
acc += foo();
return acc;
}

that is optimized to
_Z3barv:
pushq   %rax
callq   _Z3foov
imull   $100, %eax, %eax popq %rcx retq  by clang, but is not optimized by GCC. The obvious question is "why is it not optimized?" GCC does actually optimize this if it is compiled as C, but not if compiled as C++. The reason is that GCC allows pure functions to throw exceptions in C++ which prevents some optimizations — bar is optimized as expected in C++ too if foo is marked as noexcept, or if exceptions are disabled by passing the command-line option -fno-exceptions to the compiler. There are additional restrictions on how pure functions may be optimized. The description of the pure attribute in the GCC manual says Many functions have no effects except the return value and their return value depends only on the parameters and/or global variables. Such a function can be subject to common subexpression elimination and loop optimization just as an arithmetic operator would be. These functions should be declared with the attribute pure. The possible use of global variables limits how calls to the pure functions may be moved. For example, int foo(void) __attribute__((pure)); int a; int baz(void) { int acc = 0; acc += foo(); a = 0; acc += foo(); return acc; }  cannot be optimized to the equivalent of int baz(void) { a = 0; return 2 * foo(); }  as it is possible that the result of foo() depends on the global variable a. There is another attribute, const, for functions without side effects that do not depend on global variables Many functions do not examine any values except their arguments, and have no effects except the return value. Basically this is just slightly more strict class than the pure attribute below, since function is not allowed to read global memory. Decorating foo with the const attribute would optimize the function baz above, but the compiler is still restricted in how it can move functions, even if they are known not to throw exceptions and are marked as const — they may take arbitrarily long time to execute, so the compiler should not move a function call so that it is executed on a path where it was not executed in the original code. This means that the optimization opportunities of pure and const functions are rather limited, and the possibility of throwing exceptions does not affect it that much — a pure or const function that throws an exception will always do it for the same input, so they can still be optimized by subexpression elimination, memoization, etc. The only extra restriction is that the first call of a function must not be moved over some other construct that can throw an exception. So GCC is overly conservative when handling the loop in bar — the optimization is valid even if foo may throw — but most more complex cases cannot be handled. For example, int foo() __attribute__((pure)); int foo2(int) __attribute__((pure)); int bar2() { int acc = 0; int acc2 = 0; for (int i = 0; i < 100; i++) { acc2 += foo2(i); acc += foo(); } return acc + acc2; }  cannot be optimized to the equivalent of int bar2() { int acc2 = 0; for (int i = 0; i < 100; i++) { acc2 += foo2(i); } return 100 * foo() + acc2; }  as this will give a different result if both foo() and foo2(1) throws (while foo2(0) does not throw). There are ways to handle some of these more complex cases (e.g. peeling the first iteration of the loop to get the first calls in the correct order, and then optimizing the rest of the loop), but this is not common enough that it is worth complicating the optimizers by special-casing the few cases that can be optimized. Wednesday, November 30, 2016 Loop optimization, and microbenchmarking I read a blog post "Reducing the Performance Impact of Debug Code in C Libraries" that describes a way to handle overhead of debug code (compile two versions with/without debug code enabled, and load the right version dynamically). The blog post's approach is a reasonable way to solve the problem, but the example illustrating the overhead of debug code surprised me: #include <stdio.h> #include <stdlib.h> #define N 100000000 // Adjust for your machine int main(void) { int debug_on = !!getenv("DEBUG"); unsigned long sum = 0; for (int i = 0 ; i < N ; i++) { for (int j = 0 ; j < N ; j++) { #ifdef WITH_DEBUG if (debug_on) printf("Debug print.\n"); #endif sum += i; } } printf("Sum: %lu\n", sum); return 0; }  The blog post claims that the version compiled with WITH_DEBUG defined (introducing the overhead of "if (false)") is 50% slower than the one compiled without it — I had thought that both versions would have the same running time, so I tested the program compiled with "gcc -O3", and the difference was even bigger on my computer... The reason I thought both versions would have identical performance is that GCC has an optimization -funswitch-loops (enabled at -O3) that optimizes loops having if-statements where the condition is unchanged within the loop, such as for (int j = 0 ; j < N ; j++) { if (debug_on) printf("Debug print.\n"); sum += i; }  The optimization duplicates the loop, optimizes one as if the condition is true and the other as if it is false, and selects the correct version at runtime. That is, the loop above is transformed to if (debug_on) { for (int j = 0 ; j < N ; j++) { printf("Debug print.\n"); sum += i; } } else { for (int j = 0 ; j < N ; j++) { sum += i; } }  so the performance should be the same with/without WITH_DEBUG defined — both should end up running code resulting from the loop without the if-statement! I looked some into the details of what is happening. Compiling without defining WITH_DEBUG makes GCC determine that the inner loop for (int j = 0 ; j < N ; j++) { sum += i; }  calculates sum += (unsigned long)i * N;  This, in turn, makes the outer loop calculate the sum of an arithmetic sequence, which the compiler knows how to turn into a constant. The result is that the whole program is transformed to the equivalent of int main(void) { getenv("DEBUG"); unsigned long sum = 996882102603448320ul; printf("Sum: %lu\n", sum); return 0; }  Compiling with WITH_DEBUG defined determines that the inner loop does not change debug_on, and the loop is "unswitched". The sums are determined to be "i * N" as in the previous case, and the compiler sees that both branches do the same calculation that can be moved out and combined. The result is that the inner loop is transformed to if (debug_on) { for (int j = 0 ; j < N ; j++) printf("Debug print.\n"); } sum += (unsigned long)i * N;  The outer loop could now be unswitched but that is not happening, so the compiler continues by noticing that the sum can be directly calculated as in the previous case, and the resulting optimized program is equivalent to int main(void) { int debug_on = !!getenv("DEBUG"); for (int i = 0 ; i < N ; i++) if (debug_on) for (int j = 0 ; j < N ; j++) printf("Debug print.\n"); unsigned long sum = 996882102603448320ul; printf("Sum: %lu\n", sum); return 0; }  The reason the program is not fully optimized is that the optimization passes are not run first on the inner loop as in my description above — they are (mostly) run over the whole program, one optimization pass at a time, and each pass requires the code to be simple enough for the optimizations to trigger. In this case, the outer loop would have been unswitched if sum had been optimized before the unswitching pass was run on the outer loop. Dealing with this is one of the "fun" parts of compiler development — you implement an amazing optimization pass that can do wonders, but the stupid real-world code need to be simplified by other passes (later in the chain) before your pass can do its magic. It is, of course, easy to fix this by adding additional optimization passes before (and in the middle of) your new pass, but the users will complain that the compiler is too slow and switch to a faster, less optimizing, compiler... So one important part of compiler development is to ensure that the pass order is good enough to handle most reasonable cases with a limited number of optimization passes. This also means that simple micro-benchmarking may not give a true and fair view of how code is optimized in reality — I have seen many cases where micro-benchmarks in the compiler's test suite are optimized as expected, while the optimization is almost guaranteed not to trigger for real world code due to pass ordering issues. So your micro-benchmark may show that a code construct X is faster than code construct Y (or that compiler Z is better than compiler W), but the behavior in more complex real-world usage may be very different... Thursday, November 24, 2016 Tentative variable definitions, and -fno-common A comment in a Reddit thread on my previous blog post claims that the code is optimized if the variables are declared in the same translation unit as the usage. That is, the claim is that int a; int b; int foo(void) { return &a != &b; }  compiles to code returning a constant value. This is true for C++ _Z3foov: movl$1, %eax
ret

but compiling as C still generates
foo:
movl    $a, %eax cmpq$b, %rax
setne   %al
movzbl  %al, %eax
ret

The reason is that an external declaration without an initializer is what the C standard calls a tentative definition. The tentative definition works in essentially the same way as if it were defined extern (but with the difference that the linker creates the variable if it is not defined in any translation unit), so GCC must assume that the linker may assign the same address for a and b in the same way as for normal extern-declared variables. It is possible to make the variables "real" definitions by initializing them with a value, so changing the declarations to
int a = 0;
int b = 0;

enables the compiler to optimize the function in C too.

One other way to make GCC optimize this is to specify the command-line option -fno-common that disables generation of "common symbols" which are used for emitting tentative definitions. This has the benefit of improving code generation for some CPU architectures, such as ARM. As an example, consider the function
int a;
int b;

int foo(void)
{
return a + b;
}

The ARM architecture need the variables' addresses placed in registers in order to access the values, so the compiler will generate code similar to
foo:
ldr     r0, [r2]        /* Load a */
ldr     r3, [r3]        /* Load b */
bx      lr
.L2:
.word   a
.word   b

The compiler could do a better job if it knew how the variables are laid out in memory, as it then only needs to load one address and use relative addressing for the other. But that is not possible when a and b are tentative definitions, as the linker may place them wherever it chooses. Using -fno-common will, however, place a and b in the data (or BSS) segment exactly as they are laid out by the compiler, and the code can be optimized based on knowledge that b is placed right after a in memory1
foo:
ldr     r0, [r2]        /* Load a */
ldr     r3, [r2, #4]    /* Load b */
bx      lr
.L2:
.word   a


1. A real ARM compiler would use an ldmia instruction in this case, which is even better. But I choose to use this naive code generation in the example in order to show the general principle, without discussing too many ARM-specific details.

Thursday, November 10, 2016

"missing" optimizations — constant address comparison

Sometimes the compiler intentionally fails to optimize certain constructs for arguably good reasons. For example, compiling the following with GCC
extern int a;
extern int b;

int foo(void)
{
return &a != &b;
}

generates code doing a comparison
foo:
movl    $a, %eax cmpq$b, %rax
setne   %al
movzbl  %al, %eax
ret

even though the C standard ensures the addresses of a and b are different.

It seems to be a bit unclear why GCC keeps this comparison, but the discussion in the bug 78035 mentions the C defect report DR #078, and the expressiveness of the ELF format. DR #078 notes that
unsigned int h(void)
{
return memcpy != memmove;
}

may return 0, which happens on implementations where the C standard library uses the same code for memcpy and memmove (the C language cannot do that, but the standard library does not need to be written in C). This does not mean that the compiler must be able to handle different symbols mapping to the same address — it only says that C programs must not assume too much about the standard library. But ELF supports exporting multiple symbols for the same address, and GCC tries to honor ELF possibilities (such as the symbol interposition that is limiting optimizations for shared libraries).

I'm not convinced it makes sense for GCC to keep these comparisons in the generated code — other optimizations, such as alias analysis, treats global symbols as having different addresses, so it is likely that other optimizations will make the code fail if it has two symbols with the same address. For example,
extern int a;
extern int b;

int foo(void)
{
a = 1;
b = 5;
a++;
return &a != &b;
}

optimizes the accesses to a and b as if they have different addresses, even though the comparison is emitted:
foo:
movl    $a, %eax movl$5, b(%rip)
movl    $2, a(%rip) cmpq$b, %rax
setne   %al
movzbl  %al, %eax
ret

This missing optimization does probably not make any difference in reality (although I could imagine some macro or template that relies on this being optimized), but this inconsistency in what is optimized annoys me...

Sunday, November 6, 2016

Inlining — shared libraries are special

The way shared libraries work affect how the code can be optimized, so GCC must be more conservative with inlining when building shared libraries (i.e. when compiling with -fpic or -fPIC).

Consider the functions
int foo(void)
{
return 23;
}

int bar(void)
{
return 19 + foo();
}

Compiling this with "gcc -O3" inlines foo into bar
foo:
movl    $23, %eax ret bar: movl$42, %eax
ret

but that is not the case when compiling using "gcc -O3 -fPIC"
foo:
movl    $23, %eax ret bar: subq$8, %rsp
call    foo@PLT
addq    $8, %rsp addl$19, %eax
ret

The reason is that ELF permits symbols in shared libraries to be overridden by the dynamic linker — a typical use case is to use LD_PRELOAD to load a debug library that contains logging versions of some functions. This has the effect that GCC cannot know that it is the foo above that really is called by bar, and thus cannot inline it. It is only exported symbols that can be overridden, so anonymous namespaces and static functions are optimized as usual, as are functions defined as "extern inline" (the compiler is told to inline, so it may assume the function will not be overridden).

The missed optimizations from this are especially noticeable when doing link-time optimization — the benefit of LTO is that the compiler can see the whole library and inline between files, but this is not possible if those functions may be replaced. This problem makes all interprocedural optimizations (such as devirtualization) ineffective, not only inlining.

There are two ways to get GCC to optimize shared libraries in the same way as normal code
• Use the command line option -fno-semantic-interposition
• Avoid exporting unnecessary symbols by passing -fvisibility=hidden to the compiler, and manually export the needed symbols using a linker export map or by decorating the functions with
__attribute__((__visibility__("default")))


Monday, October 17, 2016

Inlining — main is special

I wrote in my previous post that I assumed Jason compiled using the -O2 optimization level for the example in his CppCon 2016 talk, as it worked for me when I used -O3. But I was wrong — Jason used -O3, but his example was slightly different compared to mine. I used the function
int bar()
{
return std::string("a").size() + std::string("b").size();
}

that optimizes to just returning a constant, while Jason used
int main()
{
return std::string("a").size() + std::string("b").size();
}

which does not optimize. The only difference is the name of the function...

The reason for the difference is that GCC knows that main is special — it has the property that it is called only once. That is, it is a cold function, which makes the inlining less aggressive.

GCC is propagating the "called only once" information where possible, so functions for which the compiler can see all callers (such as static functions, functions in an anonymous namespace, or when compiling using -fwhole-program) are also marked as "called only once" if they are called exactly once from such a function. For example, the compiler can see that bar is called only once in
static int bar()
{
return std::string("a").size() + std::string("b").size();
}

int main()
{
return bar();
}

and the string code is not inlined. Removing the static prevents GCC from marking bar as called only once (as it could be called from outside the translation unit), and the string code will now be inlined.

On the other hand, if the "called only once" function contains loops, then GCC can infer that it makes sense to optimize the content of the loop as it is executed multiple times. So for example
int main()
{
int k = 0;
for (int i = 0; i < 10000; i++)
k += std::string("a").size() + std::string("b").size();
return k;
}

gets the strings inlined, and it optimizes to
main:
mov     eax, 20000
ret


Sunday, October 16, 2016

C++ and code inlining

Jason Turner mentions in his CppCon 2016 talk that GCC optimizes the function
int foo()
{
return std::string("a").size();
}

to just a return of a constant value
_Z3foov:
movl    \$1, %eax
ret

while
int bar()
{
return std::string("a").size() + std::string("b").size();
}

generates a less optimized result involving calls to constructors, destructors, etc. The difference between the two cases comes from how GCC handles inlining.

The basic idea behind the GCC inliner is to inline greedily as long as it estimates that the code size does not increase too much (where the growth limit depends on optimization level, if the function is hot/cold, etc.). One important special case is functions that are called exactly once — they do not increase code size if they are inlined, as the inlining just moves the code from the callee into the caller.

STL usage often expands to a large amount of code, for example, the string constructor used in the examples above calls
template<typename _CharT, typename _Traits, typename _Alloc>
template<typename _InIterator>
void
basic_string<_CharT, _Traits, _Alloc>::
_M_construct(_InIterator __beg, _InIterator __end,
std::forward_iterator_tag)
{
// NB: Not required, but considered best practice.
if (__gnu_cxx::__is_null_pointer(__beg) && __beg != __end)
std::__throw_logic_error(__N("basic_string::"
"_M_construct null not valid"));

size_type __dnew =
        static_cast<size_type>(std::distance(__beg, __end));

if (__dnew > size_type(_S_local_capacity))
{
_M_data(_M_create(__dnew, size_type(0)));
_M_capacity(__dnew);
}

// Check for out_of_range and length_error exceptions.
__try
{ this->_S_copy_chars(_M_data(), __beg, __end); }
__catch(...)
{
_M_dispose();
__throw_exception_again;
}

_M_set_length(__dnew);
}

This in turn calls several other functions, and more than 50 function calls need to be inlined in order to fully optimize foo. The temporary code increase of inlinling before the code gets optimized is over the limit allowed by -O2, but foo has exactly one string, so the compiler can inline this without increasing code size, and the compiler can eventually eliminate everything. But the bar function does not get the constructors inlined, as it constructs two strings...

This can have surprising effects when trying to understand how well code optimize. For example, we saw that foo optimizes to return a constant value, so let us add one more identical function
int foo()
{
return std::string("a").size();
}

int foo2()
{
return std::string("b").size();
}

We now construct two strings, which prevents inlining. So neither foo nor foo2 will be fully optimized when compiled with -O2.

The -O3 optimization level allows more inlining, so all examples in this blog post optimize fully when compiled using -O3 (so I assume Jason used -O2 for his examples). In general, -O2 does optimizations that do not involve a space-speed tradeoff, while -O3 allows the code size to increase if the resulting code is faster (which is good if you have enough cache).

To conclude:
• It is not enough to look at trivial code snippets if you want to verify that your complex templated code is being optimized as you expect.
• You may want to use -O3 rather than -O2 when compiling code for modern architectures.

Sunday, September 25, 2016

The cost of clearing memory

The technical report "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V" looks at how the RISC-V ISA compares to ARM and Intel by analyzing the result from running SPEC CINT2006. One thing that surprised me was that 30% of the instructions executed when running the 403.gcc benchmark (which compiles a few files using a modified GCC 3.2) is from doing memset!

The RISC-V memset loop writes 16 bytes in 4 instructions
// RV64G, 4 instructions to move 16 bytes
4a3814:   sd    a1, 0(a4)
4a3818:   sd    a1, 8(a4)
4a3820:   bltu  a4, a3, 4a3814

which is somewhat less efficient compared to ARM and Intel that writes more data per instruction
// armv8, 6 instructions to move 64 bytes
6f0928:   stp   x7, x7, [x8,#16]
6f092c:   stp   x7, x7, [x8,#32]
6f0930:   stp   x7, x7, [x8,#48]
6f0934:   stp   x7, x7, [x8,#64]!
6f0938:   subs  x2, x2, #0x40
6f093c:   b.ge  6f0928

so this should translate to about 10% of the executed ARM instructions doing memset. But that is still much more than what I would have guessed.

I do not have access to SPEC, and I have been too lazy to try to replicate with other data, but a quick literature search indicates that this is not as insane as I thought. The papers I have found look at the cost of clearing data in garbage collection implementations, and they seem to get a similar result for the cost. For example "Why Nothing Matters: The Impact of Zeroing" says
We show that existing approaches of zero initialization are surprisingly expensive. On three modern IA32 architectures, the direct cost is around 2.7-4.5% on average and as much as 12.7% of all cycles, in a high-performance Java Virtual Machine (JVM), without accounting for indirect costs due to cache displacement and memory bandwidth consumption.

Monday, September 12, 2016

Mobile GPUs — power, performance, area

I read an IWOCL 2016 conference paper "OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges" that gives some good advice on how to measure GPU performance on mobile devices. The paper focuses on micro benchmarks, and it is hard to use these to draw relevant conclusions — especially as mobile GPUs are rather different compared to the desktop CPUs that most developers are used to. Below are some of the things that were unclear to me when I started working on the shader compiler for a mobile GPU.

Power consumption and heat

The major design constraint for mobile devices is power consumption as it is hard to get rid of heat without large heatsinks and fans. The amount of heat that can be handled varies a lot between devices depending on the size and how they are built, but it corresponds to somewhere between 1W for a low-end phone and 7W for a tablet — and that includes power for CPU, GPU, memory, radio, etc.

It takes a while for the temperature to rise, so the hardware may temporarily run hotter (and faster), but the device will eventually throttle the performance if it is running too hot. This means that benchmarking results are not that interesting unless you know how long that performance can be sustained, and measurements such as peak performance tend to say just how over-dimensioned the hardware is.

I find it amusing that the discussion in the paper suggest that you want high performance numbers
[...] benchmarks with long running time may trigger thermal gating in certain cases, resulting in lower performance result.
I usually want realistic results, but wanting high numbers makes sense if you (as the authors) work at a hardware vendor and want a high number in the marketing material. They also suggest
Running the benchmark in a temperature-controlled environment is one option; if such an option is not available, adding idle periods between workloads may reduce chances of high system temperature.
which is especially important if your SoC has a heat problem. 😏

Dynamic Voltage and Frequency Scaling

Power consumption increases roughly quadratically with clock frequency1 — for example, raising the frequency from 600MHz to 700MHz on Exynos 5433 increases power consumption by 42%. This means it is better to lower the clock frequency and keep the GPU running 100% of the time instead of, for example, running it on full speed 75% of the time and being idle the remaining 25%.

This performance tuning is done by Dynamic Voltage and Frequency Scaling (DVFS). It is hard to make a good implementation as different applications have different requirements, and there is no "correct" tradeoff. For example, should a high end game run at full speed, and be throttled (i.e. reduced to unplayable frame rate) when the device overheats, or should it run on a lower sustainable speed from the beginning? Different device vendors implement DVFS in different ways, so two phones with the same GPU may behave differently.

Different operations need a different amount of power, and a good DVFS implementation uses this when adjusting the voltage and frequency. For example, memory operations consumes much more power than arithmetic operations, and this is used in Exynos to use lower voltage/frequency for shaders using more memory operations. This is "fun" when optimizing shaders, as a faster shader (as measured in number of clock cycles) does not need to run faster in reality if it uses more power hungry instructions and thus get a lower clock frequency.

Power- and area-efficiency

GPU workloads are embarrassingly parallel, so it is easy to double the performance of the hardware if you are allowed to increase the power and chip area — just place two identical GPUs in the package! In the same way, you can get much improved power efficiency by using two GPUs and running them with halved frequency. This means that you need to look at metrics such as "performance per area power" when comparing GPU architectures.

This is annoying when developing GPUs, as most ideas for improving the performance means that some hardware block becomes more complicated, with the result that the size increases and it consumes more power. And it does not make much sense making the GPU 10% faster if it also need 10% more area and power...

1. The dynamic power consumption is actually $$P_{dyn}=CV^{2}f$$ where $$C$$ is capacitance, $$V$$ is voltage, and $$f$$ is frequency. This varies linearly with the frequency, but increased frequency need a higher voltage, and the power consumption thus varies superlinearly with the frequency.

Sunday, August 14, 2016

Surprising properties of C null pointer constants

A previous blog post mentioned two properties of NULL that many developers find surprising:
• NULL does not need to be 0 when stored in memory — the compiler is allowed to transform the value when casting between integer and pointer types, so
union {
uintptr_t u;
void *p;
} u;
u.p = NULL;
printf("0x%" PRIxPTR "\n", u.u);

does not need to print 0, even though u.p==(void*)0 evaluates to true.
• NULL does not need to be a pointer — it is valid for an implementation to define NULL as
#define NULL 0

so it is not portable to pass NULL to a va_arg function, as this will fail for architectures where sizeof(int)!=sizeof(void*).

Today I found a third case in the paper "Beyond the PDP-11: Architectural support for a memory-safe C abstract machine" — casting an integer variable having the value 0 to a pointer type is not guaranteed to produce a null pointer. For example, it is implementation-defined if the comparison
int zero = 0;
bool b = (void*)0 == (void*)zero;

evaluates to true or false. Both (void*)0 and (void*)zero cast the value 0 to void*, so I had naively thought that both would produce a null pointer, but the standard does only guarantee this for (void*)0 as it is a constant expression
An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant. If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function.
while the result is implementation-defined when casting a variable
An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.
So (void*)0 is a null pointer while (void*)zero is converted in an implementation-defined manner,  which may or may not produce a null pointer...

Sunday, July 24, 2016

Code behaving differently in C90, C99, C11, C++98, and C++11

There are some subtle differences between the revisions of the C standard that makes it possible to create programs that behaves differently depending on if they are compiled as C90, C99, or C11. Similarly, C++ is mostly a superset of C, but there are constructs that produce different results for C and C++.

This is used in Don Yang's contribution to the 2015 International Obfuscated C Code Contest to create a program that produces different output depending on if it is compiled as C89, C99, C11, C++98, C++11. For C90 it prints stars of the form
*********************************************              ***               **
***********************     *****************             ******             **
*********************        ****************           **********           **
********************         ****************         **************         **
******        ******         *****************     ****************************
******           **          **************************************************
******                      *************************(O************************
*******                     *     *********************************************
*********                              ****************************************
***********                             ***************************************
************                            ***************************************
*********                             *****************************************
*******                     ***************************************************
******                       ****************)d));o((d=************************
******           **          **************************************************
******         *****         *********************      ***********************
********************(         O******************        **********************
**********************       *******************          *********************
************************     *******************          **************)p)-p);
o(d-(p=*****************************************          *********************

For C99 the stars have eyes, C++11 prints circles, etc. (There is more to this. The program reads text from standard input, and the output is obfuscated C90 source code that prints that text — all the * characters are pointer dereferences!)

The source code for the program is a little bit hard to read
                                           #define r(R) R"()"
/*[*/#include  /**/<stdio.h>
#include<math.h>/*!![crc=0f527cd2]*/
float I,bu,k,i,F,u,U,K,O;char o[5200];int
#define R(U) (sizeof('U')==1||sizeof(U"1"[0])==1)
h=0,t=-1,m=80,n=26,d,g,p=0,q=0,v=0,y=112,x=40;  float
N(float/*x*/_){g=1<<30;d=-~d*1103515245&--g;return  d*_
/g;}void/**/w(int/**/_){if(t<0){for(g=0;g<5200;o[g++   ]=
0);for(;g;o[g+79]=10)g-=80;for(t=37;g<62;o[80+g++]=32)   ;
}if(m&&o[h*80+m-1]==10){for(g=0;g<79;o[t*80+g++]=0){}o[t
++*80+g]=10;t%=64;n+=2;I=N(70)+5;if(n>30&&(I-x)*(I-x)+n*
n>1600&&R()){O=0;F=(x=0x1!=sizeof(' '))?k=1+N(2),i=12-k+N(
8),N(4):(k=17+N(5),i=0,r()[0]?O=.1:  0);for(u=U=-.05;u<32;
U=k+i+i*.5*sin((u+=.05)+F))for( K=0   ;K< U;K+=.1)if((bu=K*
sin(u/5),g=I+cos( u/5) *K)>=0&&g  <     79  )o[g+(int)(t+44+
bu*(.5-(bu>0?3*O:  O)   ) )%64*  80      ]  =32;x*=02//* */2
-1;n=O+x?n=I+(x?0   :N     (k)-   k           /2),g=(t+42  )%
64,m=-~g%64,x?g=m          =-~        m%64:0  ,n>5?o[g*80   +
n-3]=o[m*80+n-3]=       0:   0              ,n <75?o[g*80+n
+2]=o[m*80+n+2]=0   :0:0;                      x=I;}h=-~h%64
;m=0;}putchar((g=o [h*                          80+m++])?g:_);
if(g){w(_);}}void W                               (const char*_
){for(;*_;w(*_++));}                               int main(int a
,char**_){while(a--)d              +=_[a          ]-(char*)0;W( \
"#include<stdio.h>typed"             "e"         "f\40int\40O;v"
"oid o(O _){putchar(_);}O"                    "\40main(){O"  ""
"*_[512],**p=_,**d,b,q;for(b=0;b"        "++<512;p=_+q)_[q"    \
"=(p-_+1)*9%512]=(O*)p;") ;      for(;(g= getchar())-EOF;p=
q){q=p;for(v=512;p-q-g&&q-p-              g;  v--)q=-~q*9%512
;W("o(");if(p>q)w(y),w(45);w(                      40);w(y^=20
);w(075);for(a=0;a<v;a++)w(42);                      for(W("(O**"
);a--;w(42)){}w(41);w(y^024);w(                      41);if(p<=q)w(
45),w(y^20);W(");");}for(a=7;a-6                      ;W(a<6?"{;}":""
))for(a  =0;a  <6 &&   !o[h*80+m                       +a];a++){}W("r"
"etu"  /*J   */       "rn+0;}\n"                             );return
/*                      "#*/0                                   ;}

but it is, as far as I can tell, using three tricks in order to detect which C or C++ dialect is used:

• // comments
C90 does not have // comments, so constructs of the form
int i = 2 //**/2
;

can be used to differentiate it from the other C and C++ revisions, as C90 compiles this as
int i = 2 //**/2
;

while C++ and the more recent revisions of C compiles this as
int i = 2 //**/2
;

• Type of character constants
Character constants, such as 'a', have type int in C while C++ use the type char. This means that sizeof('a') evaluates to a different value for C and C++.
• Wide string literals
C11 and C++11 have wide string literals where for example U"hello!" is a string with characters of type char32_t. This can be used with a macro
#define R(U) sizeof(U"a"[0])

that is used as R(""). For C11 and C++ this expands to
sizeof(U"a"[0])

which evaluates to 4, while the older revisions of the standards treat U and "a" as two tokens, and the macro expands to
sizeof("""a"[0])

which evaluates to 1.

Wednesday, July 20, 2016

Using MathJax in Blogger

I have added MathJax support to this blog so that I can typeset mathematics using a large subset of $$\LaTeX$$ commands. This functionality works in comments too, which has the nice side effect that it can be used to work around Blogger's limited formatting. In particular, formatting source code in comments can be done as
$$\verb|int foo(void)| \\ \verb|{| \\ \verb| return 0;| \\ \verb|}| \\$$

which is rendered with most of the formatting intact:
$$\verb|int foo(void)| \\ \verb|{| \\ \verb| return 0;| \\ \verb|}|$$

One annoying thing is that the $$\LaTeX$$ commands are not processed when writing or previewing the comment, but they are rendered correctly when it is published...

How to enable MathJax in Blogger

MathJax needs to be loaded and configured when the page is loaded. This is done by adding the following
<script type="text/javascript" async="async"
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS_CHTML,Safe">
</script>

in the <head> block by editing the blog's template (choose "Template", and click the "Edit HTML" button). This only adds the support to the desktop template, so you need to enable it separately for mobile by pressing the gear button, and choosing the "custom" mobile template (that generates a mobile template from your desktop template).

There are several configurations to choose between, with support for $$\LaTeX$$, MathML, and AsciiMath notation, and control over how much functionality is loaded up front, and how much is loaded on-demand. I have chosen the TeX-AMS_CHTML which enables all $$\LaTeX$$ support but avoids MathML and AsciiMath.

Note the ,Safe modifier added after the configuration name. This disables unsafe constructs such as running javascript from $$\LaTeX$$ commands
$E \href{javascript:alert("Einstein says so!")}{=} mc^2$

This is needed in order to prevent commenters messing up the blog by adding evil constructs in comments.

Testing the functionality

Below are some random formulas, just to verify that the rendering works as intended:
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)^{2}}$
\left\{ \begin{aligned} a_1x+b_1y+c_1z &=d_1+e_1 \\ a_2x+b_2y&=d_2 \\ a_3x+b_3y+c_3z &=d_3 \end{aligned} \right.
$\require{AMScd} \begin{CD} \pi(X, x_{0}) @>\phi_{*}>> \pi(Y, \phi(x_{0}))\\ @VVuV @VVvV\\ \pi(X, x_{1}) @>\phi_{*}>> \pi(Y, \phi(x_{1})) \end{CD}$

The post was updated 2017-04-20 with a new link to fetch MathJax.js as the old MathJax CDN is shutting down.
The post was updated 2017-05-12 to use the bugfixed 2.7.1MathJax release.

Thursday, July 7, 2016

Checked C

Microsoft Research recently announced a work-in-progress project, Checked C, that adds bounds checking to pointers and arrays:
Checked C adds new pointer types and array types that are bounds-checked, yet layout-compatible with existing pointer and array types. In keeping with the low-level nature of C, programmers control the placement of bounds information in data structures and the flow of bounds information through programs. Static checking enforces the integrity of the bounds information and allows the eliding of some dynamic checking. Dynamic checking enforces the integrity of memory accesses at runtime when static checking cannot. Checked C is backwards-compatible: existing C programs work “as is”. Programmers incrementally opt-in to bounds checking, while maintaining binary compatibility.
Below are my summary of, and comments on, the Checked C version 0.5 specification.

Overview

Checked C extends C with checked arrays and pointers where memory accesses are checked at runtime, and a runtime error is produced if the memory access is out of range. The compiler is allowed to (and expected to) eliminate checks that it knows are always true.

Checked arrays are created using the checked keyword, so
int x checked[5];

creates a checked array x having 5 elements. There are three new checked pointer types that are declared using syntax borrowed from C++:
• array_ptr<T> — A pointer to an element of an array of type T values. This pointer type works as a normal C pointer (but with bounds checking).
• ptr<T> — A pointer to a value of type T. Pointer arithmetic is not allowed on this pointer type.
• span<T> — The span pointer works in the same way as the array_ptr, but it is represented differently in the generated code (see below).
As an example
ptr<int> p;

declares a checked pointer to an int.

The checked pointer types can have const and volatile modifiers, so a pointer to a constant integer is written as
ptr<const int> p;

while
int x;
const ptr<int> p = &x;

defines a pointer that cannot be modified.

The checked arrays and pointers are used in the same way as normal C arrays and pointers, so a checked pointer p can be dereferenced as *p, and an array_ptr<T> or span<T> pointer p can be dereferenced using expressions such as *(p+4) or p[4].

The array_ptr<T> and span<T> pointers need bounds to be defined before they may be dereferenced. Defining the bounds for a pointer p is done using the count and bounds keywords
• p : count(len) — the number of elements that are accessible beginning at p
• p : bounds(low, high) — the range of memory that can be accessed through p
The bounds are placed on the declaration, such as
array_ptr<int> p : count(n) = malloc((sizeof(int) * n);

Using this pointer as p[i] will conceptually add a check
dynamic_check(0 <= i && i < n);

right before the access.1 Pointer assignment transfer the bounds in the natural way, so q will get the bound from p in
array_ptr<int> q = p;

The array_ptr<T> and ptr<T> pointers have the same size as normal C pointers (such that sizeof(array_ptr<int>) is equal to sizeof(int*)) so checked pointers can be used without changing the layout of strucures. This means that the bounds are maintained locally by the compiler. The span<T> pointers do however keep the bounds within the type, so sizeof(span<int>) is larger than sizeof(int*).

It is possible to add additional constraints to handle things like this aligned memcpy
int aligned_memcpy(array_ptr<char> dest : count(len) where aligned(dest, 4),
array_ptr<char> src : count(len) where aligned(src, 4),
int len where len % 4 == 0);

although the constraint specifications seems to be a bit under-specified in the document, and I am not completely sure how they work in detail...

Undefined behavior

Doing all of this is a bit meaningless if code such as a[i+j] can make anything happen because of undefined behavior from overflow of i+j, so Checked C defines the behavior for some things that invokes undefined behavior in standard C.

Checked C requires that signed integer overflow produces a value or a runtime error:
• To be able to maintain pointer bounds safety, it is important that signed integer overflow produce a defined value. When a signed integer expression produces an out-of-range value, either (1) the operation must convert that value to an in-range integer value or (2) the expression shall produce a runtime error. The conversion must be a function of only the input values of the expression.
• Integer division by 0 shall also produce a runtime error or produce a defined value.
Checked C does also define pointers to work more like hardware pointers than what is the case in standard C. The checked pointers are treated in the same way as unsigned integers (all values are valid, even if they do not point at an object), but they produce runtime errors for pointer wrap and pointer arithmetic involving NULL. The rules for undefined behavior for unchecked pointers are modified in a similar way:
Unchecked pointers shall be treated as addresses of locations in memory, just as checked pointers are treated as addresses. The addresses shall be unsigned integers with a defined range of 0 to UINTPTR_MAX:
• Comparison of pointers for all different kinds of pointers shall be defined as the corresponding integer comparison.
• Subtraction p - r of two pointers p and r of type T where one pointer is a checked pointer and the other is an unchecked pointer shall be done following the rules for subtraction of checked pointers, treating the unchecked pointer as a checked pointer in those rules.

Bounds evaluation

The bounds are evaluated each time a pointer is checked, so the program need to be careful when updating variables used in a bounds declaration. The compiler must report an error when the bound is extended
int sum(array_ptr<int> start : bounds(start, end), array_ptr<int> end)
{
end = end + 1; // bounds(start, end) does not hold after this,
// so program is rejected
start[5] = 0;
...
}

but Checked C allows modifying bounds in this way, so for example
array_ptr<int> x : bounds(x, high) = ...
int sum = 0;
while (x < high) {
sum += *x;
x++;
}

is fine as the bound is reduced when x is incremented.

And there are more problems... For example, let e be an array_ref<T> with bound(x+1, x+5). This will not work when assigning
x = e;

as the range depends on x. Or consider this example
w = ...
where w : bounds(x, x + y);
int t = *w + (y = tmp);

The bounds for w depends on y, but y is modified in the same expression that dereferences w, and it is unclear if y is updated before or after w is checked. The compiler must reject the code for both of these examples.

A big part of the specification deals with this, and there are rules for which expressions are valid in bounds declarations, and how to do data flow analysis to verify that variables are allowed to be changed. But data flow analysis is expensive, so there are restriction that limit how much the compiler need to check, with the result that small changes to the code may push the program over the limit and thus fail to compile.

This would be so much simpler if the bounds were evaluated where declared. The compiler could place the bounds in hidden temporary variables, but this is rejected in the rationale:
We considered eager evaluation, but rejected it because it would turn array_ptr types into span types. When bounds expressions are always eagerly evaluated, the results need to be stored somewhere so that they can be used when v is used. For local variables, hidden temporary variables could be introduced. This breaks the design principle of not introducing hidden costs, though.
I do not understand what they mean by this... I would say that the current specification adds hidden costs as the bounds may be evaluated each time the pointer is used, while keeping the bounds in hidden variables will only evaluate them once. Hidden variables may increase the register pressure, but the current specification will most likely increase the live ranges for the variables used in the bounds, which also increases register pressure, so I do not expect a difference in reality.

Doing eager evaluation would however cause problems for array_ptr<T> pointers in structures. They are currently handled as
struct S {
array_ptr<int> arr : count(len);
int len;
}

where the variables used in the bounds calculations lives in the structure. I have not thought this through in detail, but I think it would make sense to forbid derefencing such pointers, and require the program to copy them to a local variable in order to use them. I do not think this is a big problem, as I would guess that most pointers in arrays are of the ptr<T> type, which can be used directly as they do not have bounds.

1. The real checking is somewhat more complex, and it also checks that the count(n) is valid (i.e. that n is less than (INTPTR_MAX/sizeof(int)), etc.)

Updated 2016-07-07: Added clarification in note 1.

Sunday, June 19, 2016

Evolution of C programming practices – Unix 1973–2015

I found a recent paper that also looks at how the BSD code base has evolved, but from a very different perspective compared to my code-size investigation.

The paper "The Evolution of C Programming Practices: A Study of the Unix Operating System 1973–2015" investigates coding style, and tests seven hypotheses by looking at metrics (line length, number of volatile in the source code, etc.) in 66 releases of Unix from 1973 to 2014. The hypotheses are
1. Programming practices reflect technology affordances (e.g. developers may be more liberal with screen space when using high resolution displays)
2. Modularity increases with code size
3. New language features are increasingly used to saturation point
4. Programmers trust the compiler for register allocation
5. Code formatting practices converge to a common standard
6. Software complexity evolution follows self correction feedback mechanisms
and the result is that they seem to be true, as interpreted through the metrics.

There are several things that annoys me with this paper...

Source code change over time

The earliest source code releases in the study are the "Research systems" from Bell Laboratories, which are followed by Bell-32V, different BSD versions from University of California, BSD386, and FreeBSD. Of these, it is only FreeBSD that continues after 1995, so the early range 1973–1995 consists of multiple variants of the OS developed by different groups in parallel, while the data 1996–2014 consists of FreeBSD only. This may affect the interpretation of the data...

Consider for example this plot from the paper, showing the mean comment size
 CMCHAR — mean comment size
The first part of the plot that consists of multiple variants of the OS have points all over the place, while the second part is FreeBSD only, and is rather constant. Several of the metrics have this kind of distribution, and I am not convinced that the data can be interpreted as evolution (rather than different projects have different coding style), or that fitting a cubic spline to the data have much value for this data set.

Language change over time

Some of the metrics are for new functionality in the C language, and are used for testing hypothesis 3 — "new language features are increasingly used to saturation point". But that hypothesis is in some sense obviously true for most of the metrics. For example, K&R C did not have volatile, but C90 requires volatile in order to prevent the compiler from optimizing away memory accesses to memory mapped hardware registers. Furthermore, there are only a finite number of places where it makes sense to use it, so it obviously saturates.

So the metrics mostly measure that the projects switched from K&R C to standardized C, and most of the graphs looks like

 DVOL — volatile keyword density
That is, first no usage, then the feature is introduced in all relevant places when the project updates the compiler.

One other issue here is that the header files are excluded in the calculated metrics. I do not think it affects the result of the conclusion (as it is a rather weak statement anyway), but I have often seen things like
#define HW_REG volatile unsigned int

being defined in header files, and such usages are not included in the data (and this skews the data in the discussion of restrict, where the code uses it as __restrict in order to be able to define it in a suitable way for old compilers that do not handle restrict).

Release dates

The data points for the releases have somewhat random dates. One issue is that the paper use each release's mean file date (the average of the files' last modification time) instead of the release date (that is why the graphs stop at November 2010, even though FreeBSD 10 was released in 2014). The idea is that this better reflects the age of the code base, but this has the effect of compressing some of the data points (especially the clustering around 1993-1994), and it makes the spline fitting even more suspect.

One other problem is that the original data used by the researchers seems to have incorrect timestamps. For example, 4.3BSD Net/1 was released in 1989, but is listed as 1993-12-25 in the paper. The same is true for at least the Net/2 release too, which was released in 1991, but the paper list it as 1993-07-02.

Sunday, May 29, 2016

20 years of NetBSD code bloat

I started with NetBSD in the mid-nineties, on a Sun SPARC ELC with 32Mbyte of memory, where I used GCC and Emacs on X11 with FVWM as the window manager. I'm still using GCC, Emacs, and FVWM with the same configuration files (updated for pointless changes in Emacs and FVWM), but I now need much more memory and CPU performance... I thought it would be interesting to investigate why.

Much of this is likely due to GCC — the GCC 2.7.2 cc1 from NetBSD 1.2 has a text segment size of 1073152 bytes, while GCC 4.8.5 used for today's NetBSD has a size of 11012557, and I believe the dynamic memory and instruction count used during compilation has increased much more than that. But the operating system has also increased in size (for example, NetBSD 1.2 libc.so is 331776 bytes, compared to 1390980 today), and I plan to start my investigation by looking at the OS.

Building NetBSD

One thing I like about NetBSD is that it avoids pointless changes, so the build process has been the same for a long time. In addition, is is easy to cross-compile it for different architectures — in fact, NetBSD is always cross-compiled, even when building e.g. a x86 distribution on a x86 architecture.

There is a script build.sh that sets up everything, so building a release for e.g. Sun SPARC can be done as
./build.sh -u -U -m sparc release

This first compiles the tools needed for the build, then builds the full release of NetBSD, and packages the result. This works on "any" operating system — you can build NetBSD on e.g. Linux and Mac OS X.

Cross-compiling individual components are also trivial. The tools are per default placed in a directory named after the host platform, in my case obj/tooldir.NetBSD-6.1.5-i386/bin. The most useful tool here is a wrapper for make that sets up the environment for cross-compiling. This can be used to compile any component, e.g.
cd lib/libc
../../obj/tooldir.NetBSD-6.1.5-i386/bin/nbmake-sparc

will build libc.so.

Some results

As a first step, I have built a NetBSD release from trunk 1 January each year 2002–2016, mostly just in order to see that it was as easy to build as I thought, and to try to get a feel for what can be interesting to investigate closer. I stopped at 2002, as the build.sh was not available earlier...

I built i386 (which is the NetBSD name for x86, even though it does not support i386 CPUs any longer) and sun3. The reason I choose them is that they represent two extremes of architectures — sun3 is a hardware platform that was obsolete already in the mid-nineties, so software size "should not" increase for it unless new functionality is added, while the i386 hardware is still used, and it has got more features, bigger caches etc., during this time, which may increase code size.

And the code size has increased during this time. For example, this is how libc.so, libm.so, and the kernel have grown:

These plots do not say much, and code increase is not necessarily "bad", even on constrained platforms, if unused functionality never get paged in. My plan is to look into the details of the reason for these increases (such as new functionality, support for more hardware, compiler changes, careless developers, etc.), to get a feel for how much each reason contributes. Please let me know if there are some specific questions you want me to investigate!

Thursday, May 19, 2016

Type-based alias analysis in C

There was some concern that type-based alias analysis could cause problems when it was implemented in GCC, but the conclusion (as expressed in the GCC development list) was that all other compilers already did this optimization,1 so most code had already been fixed:
There's less and less of this code IMHO — modern compilers have been doing this kind of alias analysis for some time and as a result folks have been forced to fix their code. Of course this doesn't apply to Linux and some other free software projects since they only use gcc.
I guess that mail was a bit too optimistic — it was written in 1998, and I still see lots of broken code and complaints about -fstrict-aiasing...

What type-based alias analysis means

An informal description of the type-based alias analysis rules is that every memory location has a type, and you are only allowed to access the memory using the "correct" type (the compiler can, therefore, assume that two accesses with incompatible types do not alias). The C11 standard describes this in 6.5 "Expressions":
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
• a type compatible with the effective type of the object,
• a qualified version of a type compatible with the effective type of the object,
• a type that is the signed or unsigned type corresponding to the effective type of the object,
• a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
• an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
• a character type.
As a concrete example, a variable of type int can only be accessed through an int* (including unsigned int*, const int*, etc.) or through a char* (including unsigned char*, const char*, etc.).

These restrictions allow the compiler to reorder operations when the types ensure that they access different objects. Consider the function
int i;

void foo(float *f)
{
i = 23;
*f = 0.0;
i = i + 19;
}

*f cannot modify i as it has a different type, so the compiler is allowed to move the store to i over the store to *f, and the function is optimized to
void foo(float *f)
{
*f = 0.0;
i = 42;
}


Note that the type-based aliasing rules only talk about how to access objects, and not about pointer casting — it is allowed to cast pointers between incompatible types (as long as you follow the rules for pointer casts) such as
int i;
float *f = (float *)&i;

But accessing this as *f will now invoke undefined behavior, as the object pointed to is an int, and it is not allowed to access it by an expression having a float type.

type punning — union

There are cases where you must access data using an incorrect type, and there are a few ways to do this. The usual cases are where you need to get the bitwise representation of some value, and we will consider the example from the Wikipedia article on type punning that negates a floating point number by changing the most significant bit. The examples will assume 32-bit int and float.

The naive version of just casting the a pointer to int* does not work
bool is_negative(float x)
{
unsigned int *ui = (unsigned int *)&x;
return (*ui & 0x80000000u) != 0; // Undef behavior reading float as int
}

as it breaks the type-based aliasing rules. The best way to solve this is in most cases to use a union2
bool is_negative(float x)
{
union
{
unsigned int ui;
float f;
} u;
u.f = x;
return (u.ui & 0x80000000u) != 0;
}

Both GCC and LLVM are smart enough to generate as efficient code as you would have expected from the invalid version above.

The union trick requires that all accesses are done through the union — the result is not defined when accessing through a pointer, even if the pointer has the "correct" type
bool is_negative(float x)
{
union
{
unsigned int ui;
float f;
} u;
u.f = x;
unsigned int *ui = &u.ui;
return (*ui & 0x80000000u) != 0;  // Undefined behavior
}


type punning — character pointer

Character pointers can be used to access any type, so the is_negative function can be implemented as
bool is_negative(float x)
{
unsigned char *p = (unsigned char *)&x;
return (p[3] & 0x80) != 0;
}

assuming a little-endian architecture.

Note that int8_t is not guaranteed to be of character type. That is, the following function may be invalid
bool is_negative(float x)
{
uint8_t *p = (uint8_t *)&x;
return (p[3] & 0x80) != 0;  // Possible undefined behavior
}

Treating int8_t as a character type is the reasonable thing to do, and I would assume all compilers do this. But there are developers that think this is a bug — see the discussion in GCC bug 66110...

type punning — memcpy

A third way to do the type punning is using memcpy
bool is_negative(float x)
{
unsigned int ui;
memcpy(&ui, &x, 4);
return (ui & 0x80000000u) != 0;
}

Both GCC and LLVM are smart enough to optimize away the memcpy, and generate similar code as the version using a union.

This type punning does only work if the destination is a variable — you cannot use malloc:ed memory for this. The reason is that memcpy copies the effective type from its source when writing to allocated memory
bool is_negative(float x)
{
unsigned int *p = malloc(4);
if (p == NULL)
abort();
memcpy(p, &x, 4);                // Effective type of *p is now float
return (*p & 0x80000000u) != 0;  // Undef behavior reading float as int
}


allocated memory

Memory returned from malloc does not have a type, so each memory location gets an effective type when it is written. Subsequent reads must then be done according to the type-based aliasing rules as usual.

The type of the allocated memory can be updated by writing with a new type
void *p = malloc(4);
if (p == NULL)
abort();
*(float *)p = 1.0;      // Effective type is float
do_something(p);
*(int *)p = 0;          // Effective type is now int

which allows the buffer being used for different things over its lifetime.

This may have some surprising effects, such as the examples in GCC bug 69776 and 70484. Consider the function
int f(int *pi, long *pl)
{
*pi = 1;
*pl = 0;
return *(char *)pi;
}

The type-based aliasing rules say that int and long cannot alias, and this function can be optimized to
int f(int *pi, long *pl)
{
*pi = 1;
*pl = 0;
return 1;
}

But it is possible for the pointers to alias if both point to the same malloc:ed memory, so the following will print a different value depending on if the optimization is done or not
int main(void)
{
void *p = malloc(sizeof(long));
if (p == NULL)
abort();
printf("%d\n", f(p, p));
return 0;
}

It is a bit unclear exactly what the C standard requires in these cases, and you can argue that the optimization is invalid in this case. Recent versions of GCC are conservative and do not optimize this, but older versions are more aggressive, so it prudent to try to avoid playing tricks with type changes in allocated memory regardless of what the standard says...

1 Many developers seem to think that type-based aliasing was introduced by C99, but that is not true; C90 has essentially the same aliasing rules (although C99 contain some minor clarifications/improvements). I guess the reason for the belief is that GCC added -fstrict-aliasing at roughly the same time as it implemented C99.
2 Many discussions of type punning (such as the Wikipedia article) says that type punning through a union is a GCC extension. The background of this is that C90 said
[...] if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined.
I believe the committee intended it to work, but made it "implementation-defined" as the concrete result depends on the implementation (byte order, trap representations, etc.). But "implementation-defined" lets the compiler do whatever it wants, as long as the behavior is documented (and the original implementation in GCC did, in fact, have additional restrictions, although that was fixed before the release). GCC documents this works for C90 too, so it is in some sense a GCC extension...