Tuesday, December 12, 2017

This is part five of a series “Writing a GCC back end”.

The previous post about GCC’s low-level IR contained only the minimum needed to get started – this post continues with more of the functionality used in machine descriptions.

define_expand

The define_insn definition describes an insn implementing the named functionality (used when converting the higher level IR to RTL), but there are cases where the named functionality must expand to more than one insn, or when the RTL should be generated differently depending on the operands – this is handled by define_expand.

The general format of define_expand is essentially the same as for define_insn – it consists of a name, an RTL template, a condition string containing C++ code to enable/disable the instruction pattern, and a section of C++ code (called preparation statements) that can be used to generate RTL.

Our first example of how to use define_expand is from the Arm handling of rotlsi3 (“rotate left”) – Arm has only a “rotate right” instruction, but rotating left by x is the same as rotating right by 32-x, and we can generate this as:
(define_expand "rotlsi3"
  [(set (match_operand:SI 0 "s_register_operand" "")
        (rotatert:SI (match_operand:SI 1 "s_register_operand" "")
                     (match_operand:SI 2 "reg_or_int_operand" "")))]
  "TARGET_32BIT"
{
  if (CONST_INT_P (operands[2]))
    operands[2] = GEN_INT ((32 - INTVAL (operands[2])) % 32);
  else
    {
      rtx reg = gen_reg_rtx (SImode);
      emit_insn (gen_subsi3 (reg, GEN_INT (32), operands[2]));
      operands[2] = reg;
    }
})

The preparation statements are used to modify the second operand given to rotlsi3 to calculate the correct value for the code generated by the RTL template.

define_expand may choose to skip the generation of the RTL template by executing DONE in the preparation statements. An example of how this is used is the Moxie implementation of mulsidi3, which creates a 64-bit result from multiplying two 32-bit values. The Moxie back end generates this pattern as two instructions (one that calculates the upper 32 bits, and one that calculates the lower 32 bits), where all the work is done by the preparation statements:
(define_expand "mulsidi3"
  [(set (match_operand:DI 0 "register_operand" "")
        (mult:DI (sign_extend:DI (match_operand:SI 1 "register_operand" "0"))
                 (sign_extend:DI (match_operand:SI 2 "register_operand" "r"))))]
  ""
{
  rtx hi = gen_reg_rtx (SImode);
  rtx lo = gen_reg_rtx (SImode);

  emit_insn (gen_mulsi3_highpart (hi, operands[1], operands[2]));
  emit_insn (gen_mulsi3 (lo, operands[1], operands[2]));
  emit_move_insn (gen_lowpart (SImode, operands[0]), lo);
  emit_move_insn (gen_highpart (SImode, operands[0]), hi);
  DONE;
})

Note: This define_expand contains an RTL template, but it is not needed as RTL will not be generated from it. It would have been enough to specify the operands, as in
(define_expand "mulsidi3"
  [(match_operand:DI 0 "register_operand")
   (match_operand:SI 1 "register_operand")
   (match_operand:SI 2 "register_operand")]
  ""
{
  rtx hi = gen_reg_rtx (SImode);
  rtx lo = gen_reg_rtx (SImode);

  emit_insn (gen_mulsi3_highpart (hi, operands[1], operands[2]));
  emit_insn (gen_mulsi3 (lo, operands[1], operands[2]));
  emit_move_insn (gen_lowpart (SImode, operands[0]), lo);
  emit_move_insn (gen_highpart (SImode, operands[0]), hi);
  DONE;
})


DONE works as a return statement, and it can be used in conditional code to let the preparation statements override the RTL template for special cases while using the RTL template for the general case. The Arm smaxsi3 instruction pattern (calculating the maximum of two signed integers) uses this to handle two special cases more efficiently:
(define_expand "smaxsi3"
  [(parallel [
     (set (match_operand:SI 0 "s_register_operand" "")
          (smax:SI (match_operand:SI 1 "s_register_operand" "")
                   (match_operand:SI 2 "arm_rhs_operand" "")))
     (clobber (reg:CC CC_REGNUM))])]
  "TARGET_32BIT"
{
  if (operands[2] == const0_rtx || operands[2] == constm1_rtx)
    {
      /* No need for a clobber of the condition code register here.  */
      emit_insn (gen_rtx_SET (operands[0],
                              gen_rtx_SMAX (SImode, operands[1],
                                            operands[2])));
      DONE;
    }
})

The problem it solves is that the general case translates to the instructions
cmp   r1, r2
movge r0, r2
movlt r0, r1

which clobbers the condition code, but the special cases max(x,0) and max(x,-1) can be generated as
bic   r0, r1, r1, asr #31

and
orr   r0, r1, r1, asr #31

which do not clobber CC. This optimization could be handled by later optimization passes, but doing it as early as possible (i.e. when generating the RTL) gives more freedom to all later passes to optimize cases that otherwise would have been prevented by the clobbering.

Note: The RTL always has the constant as the second operand for commutative binary operations, so the code does not need to check the first operand.

One other way to return from the preparation statements is to execute FAIL, which has the effect of ignoring the instruction pattern in the same way as if the pattern had been disabled by the condition string. An example of this is the AArch64 movmemdi instruction pattern implementing a memory block move:
(define_expand "movmemdi"
  [(match_operand:BLK 0 "memory_operand")
   (match_operand:BLK 1 "memory_operand")
   (match_operand:DI 2 "immediate_operand")
   (match_operand:DI 3 "immediate_operand")]
  "!STRICT_ALIGNMENT"
{
  if (aarch64_expand_movmem (operands))
    DONE;
  FAIL;
})

This calls the target-specific aarch64_expand_movmem function that checks if it makes sense to expand the block move inline (that is, if the result will be relatively small) and generates a sequence of move insns if that is the case. If not, it just returns false, which makes this pattern call FAIL, and GCC will ignore this instruction pattern and generate a call to memcpy instead.

The unspec and unspec_volatile expression codes

The RTL template in define_insn contains expressions describing the functionality of the instruction pattern, which enables the optimizers to reason about the insn. Many architectures have instructions that cannot be described by this (or where it does not make sense to describe the functionality, such as instructions for AES encryption – the optimizers cannot take advantage of this description anyway). These cases are handled by describing the instructions using an unspec or unspec_volatile expression which the compiler treats as a black box – the only knowledge the compiler has is what is described by the predicates and register constraints.

One example of how this is used is the AArch64 set_fpsr insn that writes to the floating-point status register:
(define_insn "set_fpsr"
[(unspec_volatile [(match_operand:SI 0 "register_operand" "r")] UNSPECV_SET_FPSR)]
""
"msr\\tfpsr, %0")

This describes a volatile instruction (i.e. an instruction with side effects) that takes a register operand. The last operand to the unspec and unspec_volatile expressions is an integer that identifies the instruction (the back end may have several different unspec instructions, and each gets a different number) – these are by convention defined in enumerations called unspec and unspecv:
(define_c_enum "unspecv" [
UNSPECV_EH_RETURN           ; Represent EH_RETURN
UNSPECV_GET_FPCR            ; Represent fetch of FPCR content.
UNSPECV_SET_FPCR            ; Represent assign of FPCR content.
UNSPECV_GET_FPSR            ; Represent fetch of FPSR content.
UNSPECV_SET_FPSR            ; Represent assign of FPSR content.
UNSPECV_BLOCKAGE            ; Represent a blockage
UNSPECV_PROBE_STACK_RANGE   ; Represent stack range probing.
])


Attributes

It is possible to add extra information to an insn (such as the length or scheduling constraints) that the back end may take advantage of when generating the code. This is done by defining attributes
(define_attr name list-of-values default)

where
• name is a string containing the name of the attribute.
• list-of-values is either a string that specifies a comma-separated list of values that can be assigned to the attribute, or an empty string to specify that the attribute takes numeric values.
• default is an expression that gives the value of this attribute for insns whose definition does not include an explicit value for the attribute.
For example,
(define_attr "length" "" (const_int 2))

defines a numeric attribute “length” having the default value 2. The back end may define attributes with any name, but a few names have a specific usage in GCC. For example, the length attribute contains the instruction’s length measured in bytes, and is used when calculating branch distances for architectures where different instructions are used for “short” and “long” branches.

Attribute values are assigned to insns by attaching a set_attr to the define_insn as in
(define_insn "*call"
  [(call (mem:QI (match_operand:SI 0 "nonmemory_operand" "i,r"))
         (match_operand 1 "" ""))]
  ""
  "@
   jsra\\t%0
   jsr\\t%0"
  [(set_attr "length" "6,2")])

This gives the length attribute the value 6 if the first alternative was matched, and 2 if the second alternative was matched.

The attributes can be accessed from C++ code by calling the auto-generated function get_attr_name, as in
int len = get_attr_length (insn);

The return type of get_attr_name for attributes defined with a list-of-values is an enum of the possible values.

All functionality is described in “GNU Compiler Collection Internals”.

Sunday, October 29, 2017

Excessive GCC memory usage for large std::bitset arrays

The C++ slack channel had a discussion last week about the code
#include <array>
#include <bitset>
#include <cstddef>

constexpr std::size_t N = 100000;
std::array<std::bitset<N>, N> elems;

int main() {}

that makes GCC consume about 9 gigabytes of memory in the parsing phase of the compilation. This does not happen when using C-style arrays, so changing the definition of elems to
std::bitset<N> elems[N];

makes the code compile without needing an excessive amount of memory. So why does GCC consume all this memory while parsing, and only when using std::array?

The reason has to do with deficiencies in GCC’s implementation of constexpr. To see what is happening, we start by expanding the include files and removing everything not needed for the program:
namespace std
{
  typedef long unsigned int size_t;
}

namespace std __attribute__ ((__visibility__ ("default")))
{
  template<typename _Tp, std::size_t _Nm>
  struct __array_traits
  {
    typedef _Tp _Type[_Nm];
  };

  template<typename _Tp, std::size_t _Nm>
  struct array
  {
    typedef std::__array_traits<_Tp, _Nm> _AT_Type;
    typename _AT_Type::_Type _M_elems;
  };
}

namespace std __attribute__ ((__visibility__ ("default")))
{
  template<size_t _Nw>
  struct _Base_bitset
  {
    typedef unsigned long _WordT;

    _WordT _M_w[_Nw];

    constexpr _Base_bitset() noexcept
    : _M_w() { }
  };

  template<size_t _Nb>
  class bitset
    : private _Base_bitset<((_Nb) / (8 * 8) + ((_Nb) % (8 * 8) == 0 ? 0 : 1))>
  {
  };
}

constexpr std::size_t N = 100000;
std::array<std::bitset<N>, N> elems;

int main() {}

_Base_bitset has a constexpr constructor, and this makes GCC end up building the whole elems array in memory at compile time. The array is large (1,250,400,000 bytes), and GCC needs to use even more memory as it represents the array by building AST nodes for the elements.

The constexpr keyword does not mean that the compiler must evaluate at compile time – it only means that it can be evaluated at compile time if the result is used where only constant expressions are allowed. I had assumed that compile-time evaluation is not needed in this case, but GCC seems to always evaluate constexpr at compile time when instantiating templates. Anyway, GCC could use a more efficient representation that does not need to keep the whole array expanded in memory...

The C-style array is not expanded in memory at compile time as it is not defined as a template. The bitset class is still expanded, but it is small, and the compiler only wastes 12,504 bytes of memory by expanding it.

Saturday, September 16, 2017

Useful GCC warning options not enabled by -Wall -Wextra

GCC can warn about questionable constructs in the source code, but most such warnings are not enabled by default – developers need to use the options -Wall and -Wextra to get all generally useful warnings. There are many additional warning options that are not enabled by -Wall -Wextra as they may produce too many false positive warnings or be targeted to a specific obscure use case, but I think a few of them (listed below) may be useful for general use.

-Wduplicated-cond

Warn about duplicated condition in if-else-if chains, such as
int foo(int a)
{
  int b;
  if (a == 0)
    b = 42;
  else if (a == 0)
    b = 43;
  return b;
}

Note: -Wduplicated-cond was added in GCC 6.

-Wduplicated-branches

Warn when an if-else has identical branches, such as
int foo(int a)
{
  int b;
  if (a == 0)
    b = 42;
  else
    b = 42;
  return b;
}

It also warns for conditional operators having identical second and third expressions
int foo(int a)
{
  int b;
  b = (a == 0) ? 42 : 42;
  return b;
}

Note: -Wduplicated-branches was added in GCC 7.

-Wlogical-op

Warn about use of logical operations where a bitwise operation probably was intended, such as
int foo(int a)
{
  a = a || 0xf0;
  return a;
}

It also warns when the operands of logical operations are the same
int foo(int a)
{
  if (a < 0 && a < 0)
    return 0;
  return 1;
}

Note: -Wlogical-op was added in GCC 4.3.

-Wrestrict

Warn when the compiler detects that an argument passed to a restrict or __restrict qualified parameter aliases another parameter, such as
void bar(char * __restrict, char * __restrict);

void foo(char *p)
{
  bar(p, p);
}

Note: -Wrestrict was added in GCC 7.

-Wnull-dereference

Warn when the compiler detects paths that dereference a null pointer.
void foo(int *p, int a)
{
  int *q = 0;
  if (0 <= a && a < 10)
    q = p + a;
  *q = 1;  // q may be NULL
}

Note: -Wnull-dereference was added in GCC 6.

-Wold-style-cast

Warn if a C-style cast to a non-void type is used within a C++ program.
int *foo(void *p)
{
  return (int *)p;
}

Note: -Wold-style-cast was added before GCC 3.
Note: -Wold-style-cast is only available for C++.

-Wuseless-cast

Warn when an expression is cast to its own type within a C++ program.
int *foo(int *p)
{
  return static_cast<int *>(p);
}

Note: -Wuseless-cast was added in GCC 4.8.
Note: -Wuseless-cast is only available for C++.

-Wjump-misses-init

Warn if a goto statement or a switch statement jumps forward across the initialization of a variable, or jumps backward to a label after the variable has been initialized.
int foo(int a)
{
  int b;
  switch (a)
  {
  case 0:
    b = 0;
    int c = 42;
    break;
  default:
    b = c;  // c not initialized here
  }
  return b;
}

Note: -Wjump-misses-init was added in GCC 4.5.
Note: -Wjump-misses-init is only available for C – jumping over variable initialization is an error in C++.

-Wdouble-promotion

Warn when a value of type float is implicitly promoted to double.

Floating point constants have the type double, which makes it easy to accidentally compute in a higher precision than intended. For example,
float area(float radius)
{
  return 3.14159 * radius * radius;
}

does all the computation in double precision instead of float, as the unsuffixed constant 3.14159 has type double. There is normally no difference in performance between float and double for scalar x86 code (although there may be a big difference for small embedded CPUs), but double may be much slower after vectorization, as only half as many elements fit in the vectors compared to float.

Note: -Wdouble-promotion was added in GCC 4.6.

-Wshadow

Warn when a local variable or type declaration shadows another variable, parameter, type, or class member.
int result;

int foo(int *p, int len)
{
  int result = 0;  // Shadows the global variable
  for (int i = 0; i < len; i++)
    result += p[i];
  return result;
}

Note: -Wshadow was added before GCC 3.

-Wformat=2

The -Wformat option warns when calls to printf, scanf, and similar functions have an incorrect format string or when the arguments do not have the correct type for the format string. The option is enabled by -Wall, but it can be made more aggressive by adding -Wformat=2 which adds security-related warnings. For example, it warns for
#include <stdio.h>

void foo(char *p)
{
  printf(p);
}

that may be a security hole if the format string came from untrusted input and contains ‘%n’.

Note: -Wformat=2 was added in GCC 3.0.

Friday, September 8, 2017

Follow-up on “Why undefined behavior may call a never-called function”

I have received several questions on the previous blog post about what happens in more complex cases, such as
#include <cstdlib>

typedef int (*Function)();

static Function Do;

static int EraseAll() {
  return system("rm -rf /");
}

static int LsAll() {
  return system("ls /");
}

void NeverCalled() {
  Do = EraseAll;
}

void NeverCalled2() {
  Do = LsAll;
}

int main() {
  return Do();
}

where the compiler will find three possible values for Do: EraseAll, LsAll, and 0.

The value 0 is eliminated from the set of possible values for the call in main, in the same way as for the simpler case, but the compiler cannot change the indirect call to a direct call as there are still two possible values for the function pointer, and clang generates the expected
main:
jmpq    *Do(%rip)

But a compiler could transform the line
return Do();

to
if (Do == LsAll)
  return LsAll();
else
  return EraseAll();

that has the same surprising effect of calling a never-called function. This transformation would be silly in this case, as the cost of the extra comparison is similar to the cost of the eliminated indirect call, but it may be a good optimization when the compiler can determine that the result will be faster (for example, if the functions can be simplified after inlining). I don’t know if this is implemented in clang/LLVM – I could not get this to happen when writing some small test programs. But, for example, GCC’s implementation of devirtualization can do it if -fdevirtualize-speculatively is enabled, so this is not a hypothetical optimization (GCC does not, however, take advantage of undefined behavior in this case, so it will not insert calls to never-called functions).

Monday, September 4, 2017

Why undefined behavior may call a never-called function

My twitter feed has recently been filled with discussions about the following program
#include <cstdlib>

typedef int (*Function)();

static Function Do;

static int EraseAll() {
  return system("rm -rf /");
}

void NeverCalled() {
  Do = EraseAll;
}

int main() {
  return Do();
}

that clang compiles to
main:
movl    $.L.str, %edi
jmp     system
.L.str:
.asciz  "rm -rf /"

That is, the compiled program executes “rm -rf /” even though the original program never calls EraseAll! Clang is allowed to do this – the function pointer Do is initialized to 0 as it is a static variable, and calling 0 invokes undefined behavior – but it may seem strange that the compiler chooses to generate this code. It does, however, follow naturally from how compilers analyze programs...

Eliminating function pointers can give big performance improvements – especially for C++, as virtual functions are implemented as function pointers, and changing these to direct calls enables optimizations such as inlining. It is in general hard to track the possible pointer values through the code, but it is easy in this program – Do is static and its address is not taken, so the compiler can trivially see all writes to it and determine that Do must have either the value 0 or the value EraseAll (as NeverCalled may have been called from, for example, a global constructor in another file before main is run).

The compiler can remove 0 from the set of possible values when processing the call to Do, as calling 0 would invoke undefined behavior, so the only possible value is EraseAll, and the compiler changes
return Do();

to
return EraseAll();

I’m not too happy with taking advantage of undefined behavior in order to eliminate possible pointer values, as this has a tendency to affect unrelated code, but there may be good reasons for clang/LLVM doing this (for example, it may be common that devirtualization is prevented because the set of possible pointer values contains a 0 when the compiler finds a spurious pure virtual function).

Update: I wrote a follow-up post discussing a slightly more complex case.

Tuesday, August 29, 2017

GCC target description macros and functions

This is part four of a series “Writing a GCC back end”.
The functionality that cannot be handled by the machine description in machine.md is implemented using target macros and functions in machine.h and machine.c. Starting from an existing back end that is similar to your target will give you a reasonable implementation for most of these, but you will need to update the implementation related to the register file and addressing modes to correctly describe your architecture. This blog post describes the relevant macros and functions, with examples from the Moxie back end.

Inspecting the RTL

Some of the macros and functions are given RTL expressions that they need to inspect in order to do what is expected. RTL expressions are represented by the type rtx, which is accessed using macros. Assume we have a variable
rtx x;

containing the expression
(plus:SI (reg:SI 100) (const_int 42))

We can now access the different parts of it as
• GET_CODE (x) returns the operation PLUS
• GET_MODE (x) returns the operation’s machine mode SImode
• XEXP (x, 0) returns the first operand – an rtx corresponding to (reg:SI 100)
• XEXP (x, 1) returns the second operand – an rtx corresponding to (const_int 42)
XEXP() returns an rtx, which is not right for accessing the value of a leaf expression (such as reg) – those need to be accessed using a type-specific macro. For example, the integer value in (const_int 42) is accessed using INTVAL(), and the register number in (reg:SI 100) is accessed using REGNO().

For an example of how this is used in the back end, suppose our machine has load and store instructions that accept an address that is a constant, a register, or a register plus a constant in the range [-32768, 32767], and we need a function that, given an address expression, tells if it is a valid address expression for the target.
We could write that function as
bool
valid_address_p (rtx x)
{
  if (GET_CODE (x) == SYMBOL_REF
      || GET_CODE (x) == LABEL_REF
      || GET_CODE (x) == CONST)
    return true;
  if (REG_P (x))
    return true;
  if (GET_CODE (x) == PLUS
      && REG_P (XEXP (x, 0))
      && CONST_INT_P (XEXP (x, 1))
      && IN_RANGE (INTVAL (XEXP (x, 1)), -32768, 32767))
    return true;
  return false;
}

Registers

The register names are specified as
#define REGISTER_NAMES              \
{                                   \
  "$fp", "$sp", "$r0", "$r1",       \
  "$r2", "$r3", "$r4", "$r5",       \
  "$r6", "$r7", "$r8", "$r9",       \
  "$r10", "$r11", "$r12", "$r13",   \
  "?fp", "?ap", "$pc", "?cc"        \
}

The names are used for generating the assembly files, and for the GCC extension letting programmers place variables in specified registers
register int *foo asm ("$r12");

The Moxie architecture has only 18 registers, but the back end adds two fake registers ?fp and ?ap for the frame pointer and the argument pointer (used to access the function’s argument list). This is a common strategy in GCC back ends, as it simplifies code generation and the elimination of unneeded frame and argument pointers, and there is a mechanism (ELIMINABLE_REGS) that rewrites these fake registers to real registers (such as the stack pointer).

The GCC RTL contains virtual registers (which are called pseudo registers in GCC) before the real registers (called hard registers) are allocated. A register is represented as an integer, where 0 is the first hard register, 1 is the second hard register, etc., and the pseudo registers follow after the last hard register. The back end specifies the number of hard registers by defining the number of the first pseudo register
#define FIRST_PSEUDO_REGISTER 20

Some of the registers, such as the program counter $pc, cannot be used by the register allocator, so we need to specify which registers the register allocator must avoid
#define FIXED_REGISTERS             \
{                                 \
1, 1, 0, 0,                     \
0, 0, 0, 0,                     \
0, 0, 0, 0,                     \
0, 0, 0, 1,                     \
1, 1, 1, 1                      \
}

where 1 means that it cannot be used by the register allocator. Similarly, the register allocator needs to know which registers may be changed by calling a function
#define CALL_USED_REGISTERS         \
{                                 \
1, 1, 1, 1,                     \
1, 1, 1, 1,                     \
0, 0, 0, 0,                     \
0, 0, 1, 1,                     \
1, 1, 1, 1                      \
}

where 1 means that the register is clobbered by function calls.

It is possible to specify the order in which the register allocator allocates the registers. The Moxie back end does not do this, but it could have done it with something like the code below that would use register 2 ($r0) for the first allocated register, register 3 ($r1) for the second allocated register, etc.
#define REG_ALLOC_ORDER                     \
{                                         \
/* Call-clobbered registers */          \
2, 3, 4, 5, 6, 7, 14,                   \
/* Call-saved registers */              \
8, 9, 10, 11, 12, 13,                   \
/* Registers not for general use.  */   \
0, 1, 15, 16, 17, 18, 19                \
}

For an example of where this can be used, consider an architecture with 16 registers and “push multiple” and “pop multiple” instructions that push/pop the last n registers. The instructions are designed to store call-saved registers when calling a function, and it makes sense to allocate the call-saved registers in reverse order so that the push/pop instructions save as few registers as possible
#define REG_ALLOC_ORDER                         \
{                                             \
/* Call-clobbered registers */              \
0, 1, 2, 3,                                 \
/* Call-saved registers */                  \
15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4    \
}


There are usually restrictions on how registers can be used (e.g. the Moxie $pc register cannot be used in arithmetic instructions), and these restrictions are described by register classes. Each register class defines a set of registers that can be used in the same way (for example, registers that can be used in arithmetic instructions). There are three standard register classes that must be defined
• ALL_REGS – containing all of the registers.
• NO_REGS – containing no registers.
• GENERAL_REGS – used for the ‘r’ and ‘g’ constraints.
and the back end can add its own as needed. The register classes are implemented as
enum reg_class
{
  NO_REGS,
  GENERAL_REGS,
  SPECIAL_REGS,
  CC_REGS,
  ALL_REGS,
  LIM_REG_CLASSES
};

#define N_REG_CLASSES LIM_REG_CLASSES

#define REG_CLASS_NAMES   \
{                         \
  "NO_REGS",              \
  "GENERAL_REGS",         \
  "SPECIAL_REGS",         \
  "CC_REGS",              \
  "ALL_REGS"              \
}

#define REG_CLASS_CONTENTS                          \
{                                                   \
  { 0x00000000 }, /* Empty */                       \
  { 0x0003FFFF }, /* $fp, $sp, $r0 to $r13, ?fp */  \
  { 0x00040000 }, /* $pc */                         \
{ 0x00080000 }, /* ?cc */                        \
{ 0x000FFFFF }  /* All registers */              \
}

REG_CLASS_CONTENTS consists of a bit mask for each register class, telling which registers are in the set. Architectures having more than 32 registers use several values for each register class, such as
    { 0xFFFFFFFF, 0x000FFFFF }  /* All registers */  \


The back end also needs to implement the REGNO_REG_CLASS macro that returns the register class for a register. There are in general several register classes containing a given register, and the macro should return a minimal class (i.e. one for which no smaller class contains the register)
#define REGNO_REG_CLASS(REGNO) moxie_regno_to_class[(REGNO)]

const enum reg_class moxie_regno_to_class[FIRST_PSEUDO_REGISTER] = {
GENERAL_REGS, GENERAL_REGS, GENERAL_REGS, GENERAL_REGS,
GENERAL_REGS, GENERAL_REGS, GENERAL_REGS, GENERAL_REGS,
GENERAL_REGS, GENERAL_REGS, GENERAL_REGS, GENERAL_REGS,
GENERAL_REGS, GENERAL_REGS, GENERAL_REGS, GENERAL_REGS,
GENERAL_REGS, GENERAL_REGS, SPECIAL_REGS, CC_REGS
};


The back end may also need to set a few target macros detailing how values fit in registers, such as
/* A C expression that is nonzero if it is permissible to store a
value of mode MODE in hard register number REGNO (or in several
registers starting with that one).  */
#define HARD_REGNO_MODE_OK(REGNO,MODE) 1

See section 17.7.2 in “GNU Compiler Collection Internals” for a list of those macros.

Addresses

Addressing modes vary a bit between architectures – most can do “base + index” addressing, but there are often restrictions on the format of the index, etc. The general functionality is described by five macros:
/* Specify the machine mode that pointers have.
After generation of rtl, the compiler makes no further distinction
between pointers and any other objects of this machine mode.  */
#define Pmode SImode

/* Maximum number of registers that can appear in a valid memory
   address.  */
#define MAX_REGS_PER_ADDRESS 1

/* A macro whose definition is the name of the class to which a
valid base register must belong.  A base register is one used in
an address which is the register value plus a displacement.  */
#define BASE_REG_CLASS GENERAL_REGS

/* A macro whose definition is the name of the class to which a
valid index register must belong.  An index register is one used
in an address where its value is either multiplied by a scale
factor or added to another register (as well as added to a
displacement).  */
#define INDEX_REG_CLASS NO_REGS

/* A C expression which is nonzero if register number NUM is suitable
for use as a base register in operand addresses.  */
#ifdef REG_OK_STRICT
#define REGNO_OK_FOR_BASE_P(NUM)   \
(HARD_REGNO_OK_FOR_BASE_P(NUM)    \
|| HARD_REGNO_OK_FOR_BASE_P(reg_renumber[(NUM)]))
#else
#define REGNO_OK_FOR_BASE_P(NUM)   \
((NUM) >= FIRST_PSEUDO_REGISTER || HARD_REGNO_OK_FOR_BASE_P(NUM))
#endif

/* A C expression which is nonzero if register number NUM is suitable
for use as an index register in operand addresses.  */
#define REGNO_OK_FOR_INDEX_P(NUM) 0

where
#define HARD_REGNO_OK_FOR_BASE_P(NUM) \
((unsigned) (NUM) < FIRST_PSEUDO_REGISTER \
&& (REGNO_REG_CLASS(NUM) == GENERAL_REGS \
|| (NUM) == HARD_FRAME_POINTER_REGNUM))

is a helper macro the Moxie back end has introduced.

The macros REGNO_OK_FOR_BASE_P and REGNO_OK_FOR_INDEX_P have two versions – the normal version that accepts all pseudo registers as well as the valid hard registers, and a “strict” version that accepts only the valid hard registers. The strict version is used from within the register allocator, so it needs to check pseudo registers against the reg_renumber array to accept cases where a pseudo register is in the process of being allocated to a hard register.

The back end also needs to implement the TARGET_ADDR_SPACE_LEGITIMATE_ADDRESS_P hook that the compiler uses to check that an address expression is valid (this is needed as the hardware may have more detailed constraints than can be expressed by the macros above). This is done similarly to the valid_address_p example in “Inspecting the RTL” above, but it needs to handle the “strict” case too. The Moxie back end implements this as
static bool
moxie_reg_ok_for_base_p (const_rtx reg, bool strict_p)
{
  int regno = REGNO (reg);

  if (strict_p)
    return HARD_REGNO_OK_FOR_BASE_P (regno)
           || HARD_REGNO_OK_FOR_BASE_P (reg_renumber[regno]);
  else
    return !HARD_REGISTER_NUM_P (regno)
           || HARD_REGNO_OK_FOR_BASE_P (regno);
}

static bool
moxie_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
                            rtx x, bool strict_p,
                            addr_space_t as ATTRIBUTE_UNUSED)
{
  if (GET_CODE (x) == PLUS
      && REG_P (XEXP (x, 0))
      && moxie_reg_ok_for_base_p (XEXP (x, 0), strict_p)
      && CONST_INT_P (XEXP (x, 1))
      && IN_RANGE (INTVAL (XEXP (x, 1)), -32768, 32767))
    return true;
  if (REG_P (x) && moxie_reg_ok_for_base_p (x, strict_p))
    return true;
  if (GET_CODE (x) == SYMBOL_REF
      || GET_CODE (x) == LABEL_REF
      || GET_CODE (x) == CONST)
    return true;
  return false;
}



Finally, TARGET_PRINT_OPERAND_ADDRESS must be implemented to output the correct syntax for the address expressions in the assembly file generated by the compiler:
static void
moxie_print_operand_address (FILE *file, machine_mode, rtx x)
{
  switch (GET_CODE (x))
    {
    case REG:
      fprintf (file, "(%s)", reg_names[REGNO (x)]);
      break;

    case PLUS:
      switch (GET_CODE (XEXP (x, 1)))
        {
        case CONST_INT:
          fprintf (file, "%ld(%s)",
                   INTVAL (XEXP (x, 1)), reg_names[REGNO (XEXP (x, 0))]);
          break;
        case SYMBOL_REF:
          fprintf (file, "(%s)", reg_names[REGNO (XEXP (x, 0))]);
          break;
        case CONST:
          {
            rtx plus = XEXP (XEXP (x, 1), 0);
            if (GET_CODE (XEXP (plus, 0)) == SYMBOL_REF
                && CONST_INT_P (XEXP (plus, 1)))
              {
                fprintf (file, "+%ld(%s)", INTVAL (XEXP (plus, 1)),
                         reg_names[REGNO (XEXP (x, 0))]);
              }
            else
              abort ();
          }
          break;
        default:
          abort ();
        }
      break;

    default:
      break;
    }
}



All functionality is described in “GNU Compiler Collection Internals”:
• Chapter 13 describes the RTL representation, including macros and functions for accessing it.
• Chapter 17 describes all target description macros and functions.

Tuesday, August 22, 2017

GCC low-level IR and basic code generation

This is part three of a series “Writing a GCC back end”.

Compilers are usually described as having three parts – a front end that handles the source language, a middle that optimizes the code using a target-independent representation, and a back end doing the target-specific code generation. GCC is no different in this regard – the front end generates a target-independent representation (GIMPLE) that is used when optimizing the code, and the result is passed to the back end that converts it to its own representation (RTL) and generates the code for the program.

The back end IR

The back end’s internal representation for the code consists of a linked list of objects called insns. An insn corresponds roughly to an assembly instruction, but there are insns representing labels, dispatch tables for switch statements, etc. Each insn is constructed as a tree of expressions, and is usually written in a LISP-like syntax. For example,
(reg:m n)

is an expression representing a register access, and
(plus:m x y)

represents adding the expressions x and y. An insn adding two registers may combine these as
(set (reg:m r0) (plus:m (reg:m r1) (reg:m r2)))

The m in the expressions denotes a machine mode that defines the size and representation of the data object or operation in the expression. There are lots of machine modes, but the most common are
• QI – “Quarter-Integer” mode represents a single byte treated as an integer.
• HI – “Half-Integer” mode represents a two-byte integer.
• SI – “Single Integer” mode represents a four-byte integer.
• DI – “Double Integer” mode represents an eight-byte integer.
• CC – “Condition Code” mode represents the value of a condition code (used to represent the result of a comparison operation).
GCC supports architectures where a byte is not 8 bits, but this blog series will assume 8-bit bytes (mostly as I often find it clearer to talk about a 32-bit value in examples, instead of a more abstract SImode value).

Overview of the back end operation

The back end runs a number of passes over the IR, and GCC will output the resulting RTL for each pass when -fdump-rtl-all is passed to the compiler.

The back end starts by converting GIMPLE to RTL, and a small GIMPLE function such as
foo (int a)
{
int _2;

<bb 2> [100.00%]:
_2 = a_1(D) * 42;
return _2;
}

is expanded to
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 4 3 2 (set (reg/v:SI 73 [ a ])
(reg:SI 10 a0 [ a ])) "foo.c":2 -1
(nil))
(note 3 2 6 2 NOTE_INSN_FUNCTION_BEG)
(insn 6 3 7 2 (set (reg:SI 75)
(const_int 42 [0x2a])) "foo.c":3 -1
(nil))
(insn 7 6 8 2 (set (reg:SI 74)
(mult:SI (reg/v:SI 73 [ a ])
(reg:SI 75))) "foo.c":3 -1
(nil))
(insn 8 7 12 2 (set (reg:SI 72 [ <retval> ])
(reg:SI 74)) "foo.c":3 -1
(nil))
(insn 12 8 13 2 (set (reg/i:SI 10 a0)
(reg:SI 72 [ <retval> ])) "foo.c":4 -1
(nil))
(insn 13 12 0 2 (use (reg/i:SI 10 a0)) "foo.c":4 -1
(nil))

The generated RTL corresponds mostly to real instructions even at this early stage in the back end, but the generated code is inefficient and registers are still virtual.

The next step is to run optimization passes on the RTL. This is the same kind of optimization passes that have already been run on GIMPLE (constant folding, dead code elimination, simple loop optimizations, etc.), but they can do a better job with knowledge of the target architecture. For example, loop optimizations may transform loops to better take advantage of loop instructions and addressing modes, and dead code elimination may see that the operations working on the upper part of
foo (long long int a, long long int b)
{
long long int _1;
long long int _4;

<bb 2> [100.00%]:
_1 = a_2(D) + b_3(D);
_4 = _1 & 255;
return _4;
}

on a 32-bit architecture are not needed (as the returned value is always 0) and can thus be eliminated.

After this, instructions are combined and split into better instruction sequences by peephole optimizations, registers are allocated, and the instructions are scheduled. The resulting RTL dump contains all the information about which instructions were selected and which registers were allocated
(note 1 0 4 NOTE_INSN_DELETED)
(note 4 1 17 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 17 4 2 NOTE_INSN_PROLOGUE_END)
(note 2 17 3 NOTE_INSN_DELETED)
(note 3 2 7 NOTE_INSN_FUNCTION_BEG)
(note 7 3 15 NOTE_INSN_DELETED)
(insn 15 7 12 (set (reg:SI 15 a5 [75])
(const_int 42 [0x2a])) "foo.c":4 132 {*movsi_internal}
(expr_list:REG_EQUIV (const_int 42 [0x2a])
(nil)))
(insn 12 15 13 (set (reg/i:SI 10 a0)
(mult:SI (reg:SI 10 a0 [ a ])
(reg:SI 15 a5 [75]))) "foo.c":4 15 {mulsi3}
(nil))
(insn 13 12 21 (use (reg/i:SI 10 a0)) "foo.c":4 -1
(nil))
(note 21 13 19 NOTE_INSN_EPILOGUE_BEG)
(jump_insn 19 21 20 (simple_return) "foo.c":4 211 {simple_return}
(nil)
-> simple_return)
(barrier 20 19 16)
(note 16 20 0 NOTE_INSN_DELETED)


Instruction patterns

The target architecture’s instructions are described in machine.md using instruction patterns. A simple instruction pattern looks like
(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))]
""
"mul\t%0,%1,%2")

which defines an insn with a name mulsi3 that generates a mul instruction for a 32-bit multiplication.

The first operand in the instruction pattern is a name that is used in debug dumps and when writing C++ code that generates RTL. For example,
emit_insn (gen_mulsi3 (dst, src1, src2));

generates a mulsi3 insn. The back end does not in general need to generate RTL, but the rest of GCC does, and the name tells GCC that it can use the pattern to accomplish a certain task. It is therefore important that the named patterns implement the functionality that GCC expects, but names starting with * are ignored and thus safe to use for non-standard instruction patterns. The back end should only implement the named patterns that make sense for the architecture – GCC will do its best to emit code for missing patterns using other strategies. For example, a 32-bit multiplication will be generated as a call to __mulsi3 in libgcc if the target does not have a mulsi3 instruction pattern.

The next part of the instruction pattern is the RTL template that describes the semantics of the instruction that is generated by the insn
[(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))]

This says that the instruction multiplies two registers and places the result in a register. This example is taken from RISC-V, which has a multiplication instruction without any side effects, but some architectures (such as x86_64) set condition flags as part of the operation, and that needs to be expressed in the RTL template, such as
[(parallel [(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))
(clobber (reg:CC FLAGS_REG))])]

where the parallel expresses that the two operations are done as a unit.

The instruction’s operands are specified with expressions of the form
(match_operand:SI 1 "register_operand" "r")
consisting of four parts:
• match_operand: followed by the machine mode of the operand.
• The operand number used as an identifier when referring to the operand.
• A predicate telling what kind of operands are valid ("register_operand" means that it must be a register).
• A constraint string describing the details of the operand ("r" means it must be a general register). These are the same constraints as are used in inline assembly (but the instruction patterns support additional constraints that are not allowed in inline assembly).
The predicate and constraint string contain similar information, but they are used in different ways:
• The predicate is used when generating the RTL. As an example, when generating RTL for a GIMPLE function
foo (int a)
{
int _2;

<bb 2> [100.00%]:
_2 = a_1(D) * 42;
return _2;
}

then the GIMPLE to RTL converter will generate the multiplication as a mulsi3 insn. The predicates will be checked for the result _2 and the operands a_1(D) and 42, and it is determined that 42 is not valid as only registers are allowed. GCC will, therefore, insert a movsi insn that moves the constant into a register.
• The constraints are used when doing the register allocation and final instruction selection. Many architectures have constraints on the operands, such as m68k that has 16 registers (a0–a7 and d0–d7), but only d0–d7 are allowed in a muls instruction. This is expressed by a constraint telling the register allocator that it must use one of the registers d0–d7.

The string after the RTL template contains C++ code that disables the instruction pattern if it evaluates to false (an empty string is treated as true). This is used when there are different versions of the architecture, such as one for small CPUs that do not have multiplication instructions, and one for more advanced cores that can do multiplication. This is typically handled by letting machine.opt generate a global variable TARGET_MUL that can be set by options such as -mmul and -march, and this global variable is placed in the condition string
(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))]
"TARGET_MUL"
"mul\t%0,%1,%2")

so that the instruction pattern is disabled (and the compiler will thus generate a call to libgcc) when multiplication is not available.

The resulting instruction is emitted using the output template
"mul\t%0,%1,%2"

where %0, %1, ..., are substituted with the corresponding operands.

More about define_insn

Many architectures need more complex instruction handling than the RISC-V mul instruction described above, but define_insn is flexible enough to handle essentially all cases that occur for real CPUs.

Let’s say our target can multiply a register and an immediate integer and that this requires the first operand to be an even-numbered register, while multiplying two registers requires that the first operand is an odd-numbered register (this is not as strange as it may seem – some CPU designs use such tricks to save one bit in the instruction encoding). This is easily handled by defining a new predicate in predicates.md
(define_predicate "reg_or_int_operand"
(ior (match_code "const_int")
(match_operand 0 "register_operand")))

accepting a register or an integer, and new constraints in constraints.md that require an odd or an even register
(define_register_constraint "W" "ODD_REGS"
"An odd-numbered register.")

(define_register_constraint "D" "EVEN_REGS"
"An even-numbered register.")

where ODD_REGS and EVEN_REGS are register classes (see part four in this series). We can now use this in the instruction pattern
(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r,r")
(mult:SI (match_operand:SI 1 "register_operand" "%W,D")
(match_operand:SI 2 "reg_or_int_operand" "r,i")))]
""
"mul\t%0,%1,%2")

The constraint strings now list two alternatives – one for the register/register case and one for the register/integer case. A % character has also been added to tell the back end that the operation is commutative (so that the code generation may switch the order of the operands if the integer is the first operand, or to help the register allocation for cases such as
_1 = _2 * 42;
_3 = _2 * _4;

to let it change the order of arguments on the second line to avoid inserting an extra move to make _2 an even register on the first line and an odd register on the second line).

Sometimes the different alternatives need to generate different instructions, such as the instruction multiplying two registers being called mulr and the multiplication with an integer being called muli. This can be handled by starting the output template string with an @ character and listing the different alternatives in the same order as in the constraint strings
(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r,r")
(mult:SI (match_operand:SI 1 "register_operand" "%W,D")
(match_operand:SI 2 "reg_or_int_operand" "r,i")))]
""
"@
mulr\t%0,%1,%2
muli\t%0,%1,%2")

Finally, it is possible to write general C++ code that is run when outputting the instruction, so the previous example could have been written as
(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r,r")
(mult:SI (match_operand:SI 1 "register_operand" "%W,D")
(match_operand:SI 2 "reg_or_int_operand" "r,i")))]
""
{
return which_alternative == 0 ? "mulr\t%0,%1,%2" : "muli\t%0,%1,%2";
})

This is usually not that useful for define_insn but may reduce the number of instruction patterns when the instruction names depend on the configuration. For example, mulsi3 in the RISC-V back end must generate mulw in 64-bit mode and mul in 32-bit mode, which is implemented as
(define_insn "mulsi3"
[(set (match_operand:SI 0 "register_operand" "=r")
(mult:SI (match_operand:SI 1 "register_operand" "r")
(match_operand:SI 2 "register_operand" "r")))]
""
{ return TARGET_64BIT ? "mulw\t%0,%1,%2" : "mul\t%0,%1,%2"; })

where TARGET_64BIT is a global variable defined in riscv.opt.

This blog post has only scratched the surface of the RTL and machine description functionality, but everything is documented in “GNU Compiler Collection Internals”:

Sunday, August 13, 2017

Getting started with a GCC back end

This is part two of a series “Writing a GCC back end”.

Most CPU architectures have a common subset – they have instructions doing arithmetics and bit operations on a few general registers, an instruction that can write a register to memory, and an instruction that can read from memory and place the result in a register. It is therefore easy to make a compiler that can compile simple straight-line functions by taking an existing back end and restricting it to this common subset. This is enough to start running the test suite, and it is then straightforward to address one deficiency at a time (adding additional instructions, addressing modes, ABI, etc.).

My original thought was that the RISC-V back end would be a good choice as a starting point – the architecture is fully documented, and it is a new, actively maintained, backend that does not use legacy APIs. But the RISC-V back end has lots of functionality (such as support for multiple ISA profiles, 32- and 64-bit modes, and features such as position-independent code, exception handling and debug information) and the work of reducing it became unnecessarily complicated when I tried...

I now think it is better to start from one of the minimal back ends, such as the back end for the Moxie architecture. Moxie seems to be a good choice as there is also a blog series “How To Retarget the GNU Toolchain in 21 Patches” describing step-by-step how it was developed. The blog series is old, but GCC has a very stable API, so it is essentially the same now (I once updated a GCC backend from GCC 4.3 to GCC 4.9, which were released 6 years apart, and only a few lines needed to be modified...).

One thing missing from the Moxie blog series is how to build the compiler and how to configure and run the test-suite, but I blogged about that a while back in “Running the GCC test-suite for epiphany-sim”.

Sunday, August 6, 2017

The structure of a GCC back end

This is part one of a series “Writing a GCC back end”.

The GCC back end is configured in gcc/config.gcc and the implementation is placed in directories machine under gcc/config and gcc/common/config where “machine” is the name of the back end (for example, i386 for the x86 architecture).

The back end places some functionality in libgcc. For example, architectures that do not have an instruction for integer division will instead generate a call to a function __divsi3 in libgcc. libgcc is configured in libgcc/config.host and target-specific files are located in a directory machine under libgcc/config.

gcc/config.gcc

config.gcc is a shell script that parses the target string (e.g. x86_64-linux-gnu) and sets variables pointing out where to find the rest of the back end and how to compile it. The variables that can be set are documented at the top of the config.gcc file.

The only variable that must be set is cpu_type that specifies machine. Most targets also set extra_objs that specifies extra object files that should be linked into the compiler, tmake_file that contains makefile fragments that compiles those extra objects (or sets makefile variables modifying the build), and tm_file that adds header files containing target-specific information.

A typical configuration for a simple target (such as ft32-unknown-elf) looks something like
cpu_type=ft32
extra_parts="$extra_parts crti.o crtn.o crtbegin.o crtend.o"

libgcc/config/machine

The libgcc/config/machine directory contains extra files that may be needed for the target architecture. Simple implementations typically only contain a crti.S and crtn.S (crtbegin/crtend and the makefile support for building all of these have default implementations) and a file sfp-machine.h containing defaults for the soft-float implementation.

1. GCC is written in C++03 these days, but the structure has not been changed since it was written in C.

Friday, August 4, 2017

Writing a GCC back end

It is surprisingly easy to design a CPU (see for example Colin Riley’s blog series) and I was recently asked how hard it is to write a GCC back end for your new architecture. That too is easy – provided you have done it once before. But the first time is quite painful...

I plan to write some blog posts the coming weeks that will try to ease the pain by showing what is involved in creating a “working” back end that is capable of compiling simple functions, give some pointers on how to proceed to make this production-ready, and in general provide the overview I would have liked before I started developing my back end (GCC has a good reference manual, “GNU Compiler Collection Internals”, describing everything you need to know, but it is a bit overwhelming when you start...)

The series covers the following (I’ll update the list with links to the posts as they become available)
1. The structure of a GCC back end
• Which files you need to write/modify
2. Getting started with a GCC back end
• Pointers to resources describing how to set up the initial back end
3. Low-level IR and basic code generation
• How the low-level IR works
• How the IR is lowered to instructions
• How to write simple instruction patterns
4. Target description macros and functions
• Working with the RTL
• Describing the registers (names, register classes, allocation order, ...)
• Addressing modes
5. More about instruction patterns
• define_expand
• The unspec and unspec_volatile expression codes
• Attributes
6. Improving the performance
• Cost model
• Peephole optimization
• Tuning the optimization passes
7. Instruction scheduling
• Describing the processor pipeline

Monday, July 24, 2017

Phoronix SciMark benchmarking results

Phoronix recently published an article “Ryzen Compiler Performance: Clang 4/5 vs. GCC 6/7/8 Benchmarks”, and there are many results in that article that surprise me... One such is the result for SciMark that shows that GCC generates much slower code than LLVM – there is a big difference in several tests, and the composite score is 25% lower. I do not have any Ryzen CPU to test on, but my testing on Broadwell shows very little difference between GCC and LLVM when SciMark is compiled with -O3 -march=x86-64 as in the article, and the Ryzen microarchitecture should not introduce that big a difference. And the reported numbers seem low...

The Phoronix testing also shows strange performance variations between different GCC versions that I don’t see in my testing – I see a performance increase for each newer version of the compiler. The published test results are produced by running scripts available at OpenBenchmarking.org, and looking at the build script for SciMark shows that it is built as
cc $CFLAGS -o scimark2 -O *.c -lm

Note the -O – this overrides the optimization level set by $CFLAGS and explains at least some of the discrepancies in the test results.1 GCC maps -O to the -O1 optimization level that is meant to be a good choice to use while developing – it optimizes the code, but focuses as much on fast compilation time and good debug information as on producing fast code. LLVM maps -O to -O2, which is a “real” optimization level that prioritizes performance, so it is not surprising that LLVM produces faster code in this case.

So the benchmarking result does not show what is intended, and both compilers can do better than what the test results show...

1. I get similar results as the article when I use -O, but my result for FFT is very different...

Thursday, July 20, 2017

I have recently seen a number of “is X faster than Y?” discussions where micro benchmarks are used to determine the truth. But performance measuring is hard and may depend on seemingly irrelevant details...

Consider for example this code calculating a histogram
int histogram[256];

void calculate_histogram(unsigned char *p, int len)
{
memset(histogram, 0, sizeof(histogram));
for (int i = 0; i < len; i++)
histogram[p[i]]++;
}

The performance “should not” depend on the distribution of the values in the buffer p, but running this on a buffer with all bytes set to 0 and one buffer with random values gives me the result (using the Google benchmark library and this code)
Benchmark                Time           CPU Iterations
------------------------------------------------------
BM_cleared/4096       7226 ns       7226 ns      96737
BM_random/4096        2049 ns       2049 ns     343001

That is, running on random data is 3.5x faster compared to running on all-zero data! The reason for this is that loads and stores are slow, and the CPU tries to improve performance by executing later instructions out of order. But it cannot proceed with a load before the previous store to that address has been done,1 which slows down progress when all loop iterations read and write the same memory address histogram[0].

This is usually not much of a problem for normal programs as they have more instructions that can be executed out of order, but it is easy to trigger this kind of CPU corner case when trying to measure the performance of small code fragments, which results in the benchmark measuring something other than what was intended. Do not trust benchmark results unless you can explain the performance and know how it applies to your use case...

1. The CPU does “store to load forwarding” that saves cycles by enabling the load to obtain the data directly from the store operation instead of through memory, but it still comes with a cost of a few cycles.

Tuesday, July 4, 2017

Strict aliasing in C90 vs. C99 – and how to read the C standard

I often see claims that the strict aliasing rules were introduced in C99, but that is not true – the relevant part of the standard is essentially the same for C90 and C99. Some compilers used the strict aliasing rules for optimization well before 1999 as was noted in this 1998 post to the GCC mailing list (that argues that enabling strict aliasing will not cause many problems as most software already has fixed their strict aliasing bugs to work with those other compilers...)

C99 – 6.5 Expressions

The C standard does not talk about “strict aliasing rules”, but they follow from the text in “6.5 Expressions”:
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:73
• a type compatible with the effective type of the object,
• a qualified version of a type compatible with the effective type of the object,
• a type that is the signed or unsigned type corresponding to the effective type of the object,
• a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
• an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
• a character type.

73 The intent of this list is to specify those circumstances in which an object may or may not be aliased.
Note the footnote that says that the intention of these rules is to let the compiler determine that objects are not aliased (and thus be able to optimize more aggressively).

C90 – 6.3 Expressions

The corresponding text in C90 is located in “6.3 Expressions”:
An object shall have its stored value accessed only by an lvalue that has one of the following types:36
• the declared type of the object,
• a qualified version of the declared type of the object,
• a type that is the signed or unsigned type corresponding to the declared type of the object,
• a type that is the signed or unsigned type corresponding to a qualified version of the declared type of the object,
• an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
• a character type.

36 The intent of this list is to specify those circumstances in which an object may or may not be aliased.
It is similar to the text in C99, and it even has the footnote that says it is meant to be used to determine if an object may be aliased or not, so C90 allows optimizations using the strict aliasing rules.

But standards have bugs, and those can be patched by publishing technical corrigenda, so it is not enough to read the published standard to see what is/is not allowed. There are two technical corrigenda published for C90 (ISO/IEC 9899 TCOR1 and ISO/IEC 9899 TCOR2), and TCOR1 updates the first two bullet points. The corrected version of the standard says
An object shall have its stored value accessed only by an lvalue that has one of the following types:36
• a type compatible with the declared type of the object,
• a qualified version of a type compatible with the declared type of the object,
• a type that is the signed or unsigned type corresponding to the declared type of the object,
• a type that is the signed or unsigned type corresponding to a qualified version of the declared type of the object,
• an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
• a character type.

36 The intent of this list is to specify those circumstances in which an object may or may not be aliased.
The only difference compared to C99 is that it does not talk about effective type, which makes it unclear how malloc:ed memory is handled as it does not have a declared type. This is discussed in the defect report DR 28 that asks if it is allowed to optimize
void f(int *x, double *y) {
*x = 0;
*y = 3.14;
*x = *x + 2;
}

to
void f(int *x, double *y) {
*x = 0;
*y = 3.14;
*x = 2; /* *x known to be zero */
}

if x and y point to malloc:ed memory, and the committee answered (citing the bullet point list from 6.3)
We must take recourse to intent. The intent is clear from the above two citations and from Footnote 36 on page 38: The intent of this list is to specify those circumstances in which an object may or may not be aliased.
Therefore, this alias is not permitted and the optimization is allowed.
In summary, yes, the rules do apply to dynamically allocated objects.
That is, the allocated memory gets its declared type when written and the subsequent reads must be done following the rules in the bullet-point list, which is essentially the same as what C99 says.

One difference between C90 and C99

There is one difference between the C90 and C99 strict aliasing rules in how unions are handled – C99 allows type-punning using code such as
union a_union {
int i;
float f;
};

int f() {
union a_union t;
t.f = 3.0;
return t.i;
}

while this is implementation-defined in C90 per 6.3.2.3
[...] if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined.

And the committee is pragmatic; DR 464 is a case where the defect report asks to add an example for a construct involving the #line directive that some compilers get wrong, but the committee thought it was better to make it unspecified behavior
So just because the standard says something does not mean that it is the specified behavior. One other fun example of this is DR 476 where the standard does not make sense with respect to the behavior of volatile: