Sunday, September 25, 2016

The cost of clearing memory

The technical report "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V" looks at how the RISC-V ISA compares to ARM and Intel by analyzing the results from running SPEC CINT2006. One thing that surprised me was that 30% of the instructions executed when running the 403.gcc benchmark (which compiles a few files using a modified GCC 3.2) are from doing memset!

The RISC-V memset loop writes 16 bytes in 4 instructions
// RV64G, 4 instructions to move 16 bytes
4a3814:   sd    a1, 0(a4)
4a3818:   sd    a1, 8(a4)
4a381c:   addi  a4, a4, 16
4a3820:   bltu  a4, a3, 4a3814
which is somewhat less efficient than ARM and Intel, which write more data per instruction
// armv8, 6 instructions to move 64 bytes
6f0928:   stp   x7, x7, [x8,#16]
6f092c:   stp   x7, x7, [x8,#32]
6f0930:   stp   x7, x7, [x8,#48]
6f0934:   stp   x7, x7, [x8,#64]!
6f0938:   subs  x2, x2, #0x40
6f093c:   b.ge  6f0928
so this should translate to about 10% of the executed ARM instructions doing memset. But that is still much more than I would have guessed.
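
My back-of-the-envelope estimate behind that number (assuming the non-memset instruction count is roughly the same on both architectures): the ARM loop needs \(6/64 \approx 0.094\) instructions per byte against RISC-V's \(4/16 = 0.25\), so the memset work shrinks to
\[
  0.30 \times \frac{6/64}{4/16} \approx 0.11
\]
of the original instruction count.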

I do not have access to SPEC, and I have been too lazy to try to replicate with other data, but a quick literature search indicates that this is not as insane as I thought. The papers I have found look at the cost of clearing data in garbage collection implementations, and they seem to get a similar result for the cost. For example "Why Nothing Matters: The Impact of Zeroing" says
We show that existing approaches of zero initialization are surprisingly expensive. On three modern IA32 architectures, the direct cost is around 2.7-4.5% on average and as much as 12.7% of all cycles, in a high-performance Java Virtual Machine (JVM), without accounting for indirect costs due to cache displacement and memory bandwidth consumption.

Monday, September 12, 2016

Mobile GPUs — power, performance, area

I read an IWOCL 2016 conference paper "OpenCL-Based Mobile GPGPU Benchmarking: Methods and Challenges" that gives some good advice on how to measure GPU performance on mobile devices. The paper focuses on micro-benchmarks, and it is hard to use these to draw relevant conclusions — especially as mobile GPUs are rather different from the desktop GPUs that most developers are used to. Below are some of the things that were unclear to me when I started working on the shader compiler for a mobile GPU.

Power consumption and heat

The major design constraint for mobile devices is power consumption as it is hard to get rid of heat without large heatsinks and fans. The amount of heat that can be handled varies a lot between devices depending on the size and how they are built, but it corresponds to somewhere between 1W for a low-end phone and 7W for a tablet — and that includes power for CPU, GPU, memory, radio, etc.

It takes a while for the temperature to rise, so the hardware may temporarily run hotter (and faster), but the device will eventually throttle the performance if it is running too hot. This means that benchmarking results are not that interesting unless you know how long that performance can be sustained, and measurements such as peak performance tend to say just how over-dimensioned the hardware is.

I find it amusing that the discussion in the paper suggests that you want high performance numbers
[...] benchmarks with long running time may trigger thermal gating in certain cases, resulting in lower performance result.
I usually want realistic results, but wanting high numbers makes sense if you (like the authors) work at a hardware vendor and want a high number in the marketing material. They also suggest
Running the benchmark in a temperature-controlled environment is one option; if such an option is not available, adding idle periods between workloads may reduce chances of high system temperature. 
which is especially important if your SoC has a heat problem. 😏

Dynamic Voltage and Frequency Scaling

Power consumption increases roughly quadratically with clock frequency1 — for example, raising the frequency from 600MHz to 700MHz on Exynos 5433 increases power consumption by 42%. This means it is better to lower the clock frequency and keep the GPU running 100% of the time instead of, for example, running it on full speed 75% of the time and being idle the remaining 25%.
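
As a sanity check (using the \(P_{dyn}=CV^{2}f\) relation from the footnote, and assuming the effective capacitance stays the same), the reported 42% increase for the 600MHz to 700MHz step corresponds to a voltage increase of roughly 10%:
\[
  \frac{P_2}{P_1} = \left(\frac{V_2}{V_1}\right)^{2}\frac{f_2}{f_1} = 1.42
  \quad\Rightarrow\quad
  \frac{V_2}{V_1} = \sqrt{\frac{1.42}{700/600}} \approx 1.10
\]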

This performance tuning is done by Dynamic Voltage and Frequency Scaling (DVFS). It is hard to make a good implementation as different applications have different requirements, and there is no "correct" tradeoff. For example, should a high-end game run at full speed and be throttled (i.e. reduced to an unplayable frame rate) when the device overheats, or should it run at a lower, sustainable speed from the beginning? Different device vendors implement DVFS in different ways, so two phones with the same GPU may behave differently.

Different operations need different amounts of power, and a good DVFS implementation takes this into account when adjusting the voltage and frequency. For example, memory operations consume much more power than arithmetic operations, and the Exynos DVFS implementation uses this by selecting a lower voltage/frequency for shaders that do many memory operations. This is "fun" when optimizing shaders, as a shader that is faster when measured in clock cycles does not need to be faster in reality if it uses more power-hungry instructions and therefore gets a lower clock frequency.

Power- and area-efficiency

GPU workloads are embarrassingly parallel, so it is easy to double the performance of the hardware if you are allowed to increase the power and chip area — just place two identical GPUs in the package! In the same way, you can get much improved power efficiency by using two GPUs and running them at halved frequency. This means that you need to look at metrics that take both area and power into account, such as performance per area and per watt, when comparing GPU architectures.

This is annoying when developing GPUs, as most ideas for improving performance mean that some hardware block becomes more complicated, with the result that its size increases and it consumes more power. And it does not make much sense to make the GPU 10% faster if it also needs 10% more area and power...


1. The dynamic power consumption is actually \(P_{dyn}=CV^{2}f\) where \(C\) is capacitance, \(V\) is voltage, and \(f\) is frequency. This varies linearly with the frequency, but an increased frequency needs a higher voltage, so the power consumption varies superlinearly with the frequency.

Sunday, August 14, 2016

Surprising properties of C null pointer constants

A previous blog post mentioned two properties of NULL that many developers find surprising:
  • NULL does not need to be 0 when stored in memory — the compiler is allowed to transform the value when casting between integer and pointer types, so
    union {
        uintptr_t u;
        void *p;
    } u;
    u.p = NULL;
    printf("0x%" PRIxPTR "\n", u.u);
    
    does not need to print 0, even though u.p==(void*)0 evaluates to true.
  • NULL does not need to be a pointer — it is valid for an implementation to define NULL as
    #define NULL 0
    
    so it is not portable to pass NULL as an argument to a variadic function, as this will fail on architectures where sizeof(int)!=sizeof(void*) (see the sketch after this list).
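
As an illustration of the second point, here is a small sketch (print_strings is a hypothetical helper written just for this example, not something from the original post):
#include <stdarg.h>
#include <stdio.h>

/* Prints its arguments until it sees a null pointer sentinel. */
static void print_strings(char *first, ...)
{
    va_list ap;
    char *s;

    va_start(ap, first);
    for (s = first; s != NULL; s = va_arg(ap, char *))
        puts(s);
    va_end(ap);
}

int main(void)
{
    /* Portable: the sentinel is explicitly a pointer. */
    print_strings("a", "b", (char *)NULL);

    /* Not portable: if NULL is defined as plain 0, an int is passed here,
       while va_arg reads a char*, which breaks when the two are passed
       differently or have different sizes. */
    /* print_strings("a", "b", NULL); */

    return 0;
}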

Today I found a third case in the paper "Beyond the PDP-11: Architectural support for a memory-safe C abstract machine" — casting an integer variable having the value 0 to a pointer type is not guaranteed to produce a null pointer. For example, it is implementation-defined whether the comparison
int zero = 0;
bool b = (void*)0 == (void*)zero;
evaluates to true or false. Both (void*)0 and (void*)zero cast the value 0 to void*, so I had naively thought that both would produce a null pointer, but the standard only guarantees this for (void*)0, as it is a constant expression
An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant. If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function.
while the result is implementation-defined when casting a variable
An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation.
So (void*)0 is a null pointer while (void*)zero is converted in an implementation-defined manner, which may or may not produce a null pointer...
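
A small test program makes the difference concrete; a typical implementation will print true, but the standard does not require it:
#include <stdio.h>

int main(void)
{
    int zero = 0;
    void *a = (void *)0;     /* null pointer constant -> guaranteed null pointer */
    void *b = (void *)zero;  /* implementation-defined conversion */

    printf("(void*)0 == (void*)zero evaluates to %s\n",
           a == b ? "true" : "false");
    return 0;
}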

Sunday, July 24, 2016

Code behaving differently in C90, C99, C11, C++98, and C++11

There are some subtle differences between the revisions of the C standard that make it possible to create programs that behave differently depending on whether they are compiled as C90, C99, or C11. Similarly, C++ is mostly a superset of C, but there are constructs that produce different results for C and C++.

This is used in Don Yang's contribution to the 2015 International Obfuscated C Code Contest to create a program that produces different output depending on whether it is compiled as C90, C99, C11, C++98, or C++11. For C90 it prints stars of the form
*********************************************              ***               **
***********************     *****************             ******             **
*********************        ****************           **********           **
********************         ****************         **************         **
******        ******         *****************     ****************************
******           **          **************************************************
******                      *************************(O************************
*******                     *     *********************************************
*********                              ****************************************
***********                             ***************************************
************                            ***************************************
*********                             *****************************************
*******                     ***************************************************
******                       ****************)d));o((d=************************
******           **          **************************************************
******         *****         *********************      ***********************
********************(         O******************        **********************
**********************       *******************          *********************
************************     *******************          **************)p)-p);
o(d-(p=*****************************************          *********************
For C99 the stars have eyes, C++11 prints circles, etc. (There is more to this. The program reads text from standard input, and the output is obfuscated C90 source code that prints that text — all the * characters are pointer dereferences!)

The source code for the program is a little bit hard to read
                                           #define r(R) R"()"
                          /*[*/#include  /**/<stdio.h>
                      #include<math.h>/*!![crc=0f527cd2]*/
                   float I,bu,k,i,F,u,U,K,O;char o[5200];int
              #define R(U) (sizeof('U')==1||sizeof(U"1"[0])==1)
            h=0,t=-1,m=80,n=26,d,g,p=0,q=0,v=0,y=112,x=40;  float
           N(float/*x*/_){g=1<<30;d=-~d*1103515245&--g;return  d*_
          /g;}void/**/w(int/**/_){if(t<0){for(g=0;g<5200;o[g++   ]=
          0);for(;g;o[g+79]=10)g-=80;for(t=37;g<62;o[80+g++]=32)   ;
         }if(m&&o[h*80+m-1]==10){for(g=0;g<79;o[t*80+g++]=0){}o[t
         ++*80+g]=10;t%=64;n+=2;I=N(70)+5;if(n>30&&(I-x)*(I-x)+n*
        n>1600&&R()){O=0;F=(x=0x1!=sizeof(' '))?k=1+N(2),i=12-k+N(
        8),N(4):(k=17+N(5),i=0,r()[0]?O=.1:  0);for(u=U=-.05;u<32;
        U=k+i+i*.5*sin((u+=.05)+F))for( K=0   ;K< U;K+=.1)if((bu=K*
       sin(u/5),g=I+cos( u/5) *K)>=0&&g  <     79  )o[g+(int)(t+44+
       bu*(.5-(bu>0?3*O:  O)   ) )%64*  80      ]  =32;x*=02//* */2
      -1;n=O+x?n=I+(x?0   :N     (k)-   k           /2),g=(t+42  )%
      64,m=-~g%64,x?g=m          =-~        m%64:0  ,n>5?o[g*80   +
     n-3]=o[m*80+n-3]=       0:   0              ,n <75?o[g*80+n
     +2]=o[m*80+n+2]=0   :0:0;                      x=I;}h=-~h%64
    ;m=0;}putchar((g=o [h*                          80+m++])?g:_);
   if(g){w(_);}}void W                               (const char*_
  ){for(;*_;w(*_++));}                               int main(int a
  ,char**_){while(a--)d              +=_[a          ]-(char*)0;W( \
 "#include<stdio.h>typed"             "e"         "f\40int\40O;v"
 "oid o(O _){putchar(_);}O"                    "\40main(){O"  ""
"*_[512],**p=_,**d,b,q;for(b=0;b"        "++<512;p=_+q)_[q"    \
"=(p-_+1)*9%512]=(O*)p;") ;      for(;(g= getchar())-EOF;p=
q){q=p;for(v=512;p-q-g&&q-p-              g;  v--)q=-~q*9%512
;W("o(");if(p>q)w(y),w(45);w(                      40);w(y^=20
);w(075);for(a=0;a<v;a++)w(42);                      for(W("(O**"
 );a--;w(42)){}w(41);w(y^024);w(                      41);if(p<=q)w(
   45),w(y^20);W(");");}for(a=7;a-6                      ;W(a<6?"{;}":""
      ))for(a  =0;a  <6 &&   !o[h*80+m                       +a];a++){}W("r"
         "etu"  /*J   */       "rn+0;}\n"                             );return
             /*                      "#*/0                                   ;}
but it is, as far as I can tell, using three tricks in order to detect which C or C++ dialect is used (a small stand-alone sketch combining them follows the list):

  • // comments
    C90 does not have // comments, so constructs of the form
    int i = 2 //**/2
        ;
    
    can be used to differentiate it from the other C and C++ revisions, as C90 treats the // as a division operator followed by a /**/ comment, and compiles this as
    int i = 2 / 2
        ;
    
    while C++ and the more recent revisions of C treat // as the start of a line comment, and compile this as
    int i = 2
        ;
    
  • Type of character constants
    Character constants, such as 'a', have type int in C while C++ uses the type char. This means that sizeof('a') evaluates to a different value for C and C++.
  • Wide string literals
    C11 and C++11 have wide string literals where for example U"hello!" is a string with characters of type char32_t. This can be used with a macro
    #define R(U) sizeof(U"a"[0])
    
    that is used as R(""). For C11 and C++11 this expands to
    sizeof(U"a"[0])
    
    which evaluates to 4, while the older revisions of the standards treat U and "a" as two tokens, and the macro expands to
    sizeof("""a"[0])
    
    which evaluates to 1.
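
The three tricks can be combined into a small stand-alone program (my own sketch, not how the IOCCC entry does it); the comments state the expected results on a typical target where int is larger than one byte:
#include <stdio.h>

/* Expands to sizeof(U"a"[0]) == 4 when U"a" is a char32_t string literal
   (C11/C++11), and to sizeof("" "a"[0]) == 1 otherwise, where U is just
   a macro parameter. */
#define R(U) sizeof(U"a"[0])

int main(void)
{
    int line_comments = 2 //**/2
        ;                                      /* 1 in C90, 2 in C99/C11/C++ */
    int char_const_is_int = sizeof('a') != 1;  /* 1 in C, 0 in C++           */
    int u32_literals = R("") == 4;             /* 1 in C11 and C++11         */

    printf("// comments:       %s\n", line_comments == 2 ? "yes" : "no");
    printf("'a' has type int:  %s\n", char_const_is_int ? "yes" : "no");
    printf("U\"\" literals:      %s\n", u32_literals ? "yes" : "no");
    return 0;
}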

Wednesday, July 20, 2016

Using MathJax in Blogger

I have added MathJax support to this blog so that I can typeset mathematics using a large subset of \(\LaTeX\) commands. This functionality works in comments too, which has the nice side effect that it can be used to work around Blogger's limited formatting. In particular, formatting source code in comments can be done as
\(
  \verb|int foo(void)| \\
  \verb|{| \\
  \verb|    return 0;| \\
  \verb|}| \\
\)
which is rendered with most of the formatting intact:
\(
  \verb|int foo(void)| \\
  \verb|{| \\
  \verb|    return 0;| \\
  \verb|}|
\)

One annoying thing is that the \(\LaTeX\) commands are not processed when writing or previewing the comment, but they are rendered correctly when it is published...

How to enable MathJax in Blogger

MathJax needs to be loaded and configured when the page is loaded. This is done by adding the following
<script type="text/javascript" async="async"
  src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML,Safe">
</script>
in the <head> block by editing the blog's template (choose "Template", and click the "Edit HTML" button). This only adds support to the desktop template, so you need to enable it separately for mobile by pressing the gear button and choosing the "custom" mobile template (which generates a mobile template from your desktop template).

There are several configurations to choose between, with support for \(\LaTeX\), MathML, and AsciiMath notation, and with control over how much functionality is loaded up front and how much is loaded on demand. I have chosen TeX-AMS_CHTML, which enables all \(\LaTeX\) support but avoids MathML and AsciiMath.

Note the ,Safe modifier added after the configuration name. This disables unsafe constructs such as running javascript from \(\LaTeX\) commands
\[E \href{javascript:alert("Einstein says so!")}{=} mc^2\]
This is needed in order to prevent commenters from messing up the blog by adding evil constructs in comments.

Testing the functionality

Below are some random formulas, just to verify that the rendering works as intended:
\[
  \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu)^{2}}
\]
\[
  \left\{
  \begin{aligned}
    a_1x+b_1y+c_1z &=d_1+e_1 \\
    a_2x+b_2y&=d_2 \\
    a_3x+b_3y+c_3z &=d_3
  \end{aligned}
  \right.
\]
\[
  \require{AMScd}
  \begin{CD}
    \pi(X, x_{0}) @>\phi_{*}>> \pi(Y, \phi(x_{0}))\\
    @VVuV @VVvV\\
    \pi(X, x_{1}) @>\phi_{*}>> \pi(Y, \phi(x_{1}))
  \end{CD}
\]

Thursday, July 7, 2016

Checked C

Microsoft Research recently announced a work-in-progress project, Checked C, that adds bounds checking to pointers and arrays:
Checked C adds new pointer types and array types that are bounds-checked, yet layout-compatible with existing pointer and array types. In keeping with the low-level nature of C, programmers control the placement of bounds information in data structures and the flow of bounds information through programs. Static checking enforces the integrity of the bounds information and allows the eliding of some dynamic checking. Dynamic checking enforces the integrity of memory accesses at runtime when static checking cannot. Checked C is backwards-compatible: existing C programs work “as is”. Programmers incrementally opt-in to bounds checking, while maintaining binary compatibility.
Below is my summary of, and comments on, the Checked C version 0.5 specification.

Overview

Checked C extends C with checked arrays and pointers where memory accesses are checked at runtime, and a runtime error is produced if the memory access is out of range. The compiler is allowed to (and expected to) eliminate checks that it knows are always true.

Checked arrays are created using the checked keyword, so
int x checked[5];
creates a checked array x having 5 elements. There are three new checked pointer types that are declared using syntax borrowed from C++:
  • array_ptr<T> — A pointer to an element of an array of type T values. This pointer type works as a normal C pointer (but with bounds checking).
  • ptr<T> — A pointer to a value of type T. Pointer arithmetic is not allowed on this pointer type.
  • span<T> — The span pointer works in the same way as the array_ptr, but it is represented differently in the generated code (see below).
As an example
ptr<int> p;
declares a checked pointer to an int.

The checked pointer types can have const and volatile modifiers, so a pointer to a constant integer is written as
ptr<const int> p;
while
int x;
const ptr<int> p = &x;
defines a pointer that cannot be modified.

The checked arrays and pointers are used in the same way as normal C arrays and pointers, so a checked pointer p can be dereferenced as *p, and an array_ptr<T> or span<T> pointer p can be dereferenced using expressions such as *(p+4) or p[4].

The array_ptr<T> and span<T> pointers need bounds to be defined before they may be dereferenced. Defining the bounds for a pointer p is done using the count and bounds keywords
  • p : count(len) — the number of elements that are accessible beginning at p
  • p : bounds(low, high) — the range of memory that can be accessed through p
The bounds are placed on the declaration, such as
array_ptr<int> p : count(n) = malloc(sizeof(int) * n);
Using this pointer as p[i] will conceptually add a check
dynamic_check(0 <= i && i < n);
right before the access.1 Pointer assignment transfers the bounds in the natural way, so q will get the bounds from p in
array_ptr<int> q = p;
The array_ptr<T> and ptr<T> pointers have the same size as normal C pointers (so sizeof(array_ptr<int>) is equal to sizeof(int*)), which means checked pointers can be used without changing the layout of structures. This means that the bounds are maintained locally by the compiler. The span<T> pointers do however keep the bounds within the type, so sizeof(span<int>) is larger than sizeof(int*).
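
My mental model of the difference (the specification does not mandate this particular representation, so it is just an illustration) is that a span<int> behaves roughly like a C struct that carries its own bounds, while an array_ptr<int> is a plain int* whose bounds the compiler tracks in separate local variables:
struct span_int {
    int *ptr;    /* the pointer itself              */
    int *low;    /* lowest accessible address       */
    int *high;   /* one past the last valid element */
};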

It is possible to add additional constraints to handle things like this aligned memcpy
int aligned_memcpy(array_ptr<char> dest : count(len) where aligned(dest, 4),
                   array_ptr<char> src : count(len) where aligned(src, 4),
                   int len where len % 4 == 0);
although the constraint specifications seem to be a bit under-specified in the document, and I am not completely sure how they work in detail...

Undefined behavior

Doing all of this is a bit meaningless if code such as a[i+j] can make anything happen because of undefined behavior from overflow of i+j, so Checked C defines the behavior for some things that invoke undefined behavior in standard C.

Checked C requires that signed integer overflow produces a value or a runtime error:
  • To be able to maintain pointer bounds safety, it is important that signed integer overflow produce a defined value. When a signed integer expression produces an out-of-range value, either (1) the operation must convert that value to an in-range integer value or (2) the expression shall produce a runtime error. The conversion must be a function of only the input values of the expression.
  • Integer division by 0 shall also produce a runtime error or produce a defined value.
Checked C also defines pointers to work more like hardware pointers than is the case in standard C. The checked pointers are treated in the same way as unsigned integers (all values are valid, even if they do not point at an object), but they produce runtime errors for pointer wrap and for pointer arithmetic involving NULL. The rules for undefined behavior for unchecked pointers are modified in a similar way:
Unchecked pointers shall be treated as addresses of locations in memory, just as checked pointers are treated as addresses. The addresses shall be unsigned integers with a defined range of 0 to UINTPTR_MAX:
  • Comparison of pointers for all different kinds of pointers shall be defined as the corresponding integer comparison.
  • Subtraction p - r of two pointers p and r of type T where one pointer is a checked pointer and the other is an unchecked pointer shall be done following the rules for subtraction of checked pointers, treating the unchecked pointer as a checked pointer in those rules.

Bounds evaluation

The bounds are evaluated each time a pointer is checked, so the program needs to be careful when updating variables used in a bounds declaration. The compiler must report an error when the bounds are extended
int sum(array_ptr<int> start : bounds(start, end), array_ptr<int> end)
{
    end = end + 1; // bounds(start, end) does not hold after this,
                   // so program is rejected
    start[5] = 0;
    ...
}
but Checked C allows modifications that narrow the bounds, so for example
array_ptr<int> x : bounds(x, high) = ...
int sum = 0;
while (x < high) {
    sum += *x;
    x++;
}
is fine as the bound is reduced when x is incremented.

And there are more problems... For example, let e be an array_ptr<T> with bounds(x+1, x+5). This will not work when assigning
x = e;
as the range depends on x. Or consider this example
w = ...
where w : bounds(x, x + y);
int t = *w + (y = tmp);
The bounds for w depend on y, but y is modified in the same expression that dereferences w, and it is unclear whether y is updated before or after w is checked. The compiler must reject the code for both of these examples.

A big part of the specification deals with this, and there are rules for which expressions are valid in bounds declarations, and for how to do the data flow analysis that verifies which variables are allowed to change. But data flow analysis is expensive, so there are restrictions that limit how much the compiler needs to check, with the result that small changes to the code may push the program over the limit and thus fail to compile.

This would be so much simpler if the bounds were evaluated where declared. The compiler could place the bounds in hidden temporary variables, but this is rejected in the rationale:
We considered eager evaluation, but rejected it because it would turn array_ptr types into span types. When bounds expressions are always eagerly evaluated, the results need to be stored somewhere so that they can be used when v is used. For local variables, hidden temporary variables could be introduced. This breaks the design principle of not introducing hidden costs, though.
I do not understand what they mean by this... I would say that the current specification adds hidden costs, as the bounds may be evaluated each time the pointer is used, while keeping the bounds in hidden variables would only evaluate them once. Hidden variables may increase register pressure, but the current specification will most likely increase the live ranges of the variables used in the bounds, which also increases register pressure, so I do not expect much of a difference in reality.

Doing eager evaluation would however cause problems for array_ptr<T> pointers in structures. They are currently handled as
struct S {
    array_ptr<int> arr : count(len);
    int len;
};
where the variables used in the bounds calculations live in the structure. I have not thought this through in detail, but I think it would make sense to forbid dereferencing such pointers, and require the program to copy them to a local variable in order to use them. I do not think this is a big problem, as I would guess that most pointers in structures are of the ptr<T> type, which can be used directly as they do not have bounds.
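
A rough sketch of what I mean, reusing struct S from above (I have not verified that the Checked C compiler accepts exactly this):
// Copy the member pointer and its bounds to local variables, and do the
// checked accesses through the locals instead of through the structure.
int sum_S(ptr<struct S> s)
{
    int len = s->len;
    array_ptr<int> arr : count(len) = s->arr;
    int sum = 0;
    for (int i = 0; i < len; i++)
        sum += arr[i];
    return sum;
}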


1. The real checking is somewhat more complex, and it also checks that the count(n) is valid (i.e. that n is less than (INTPTR_MAX/sizeof(int)), etc.)

Updated 2016-07-07: Added clarification in note 1.

Sunday, June 19, 2016

Evolution of C programming practices – Unix 1973–2015

I found a recent paper that also looks at how the BSD code base has evolved, but from a very different perspective compared to my code-size investigation.

The paper "The Evolution of C Programming Practices: A Study of the Unix Operating System 1973–2015" investigates coding style, and tests seven hypotheses by looking at metrics (line length, number of volatile in the source code, etc.) in 66 releases of Unix from 1973 to 2014. The hypotheses are
  1. Programming practices reflect technology affordances (e.g. developers may be more liberal with screen space when using high resolution displays)
  2. Modularity increases with code size
  3. New language features are increasingly used to saturation point
  4. Programmers trust the compiler for register allocation
  5. Code formatting practices converge to a common standard
  6. Software complexity evolution follows self correction feedback mechanisms
  7. Code readability increases
and the result is that they seem to be true, as interpreted through the metrics.

There are several things that annoy me with this paper...

Source code change over time

The earliest source code releases in the study are the "Research systems" from Bell Laboratories, which are followed by Bell-32V, different BSD versions from University of California, BSD386, and FreeBSD. Of these, it is only FreeBSD that continues after 1995, so the early range 1973–1995 consists of multiple variants of the OS developed by different groups in parallel, while the data 1996–2014 consists of FreeBSD only. This may affect the interpretation of the data...

Consider for example this plot from the paper, showing the mean comment size
[Plot from the paper: CMCHAR — mean comment size]
The first part of the plot, which consists of multiple variants of the OS, has points all over the place, while the second part is FreeBSD only and is rather constant. Several of the metrics have this kind of distribution, and I am not convinced that the data can be interpreted as evolution (rather than different projects having different coding styles), or that fitting a cubic spline to the data has much value for this data set.

Language change over time

Some of the metrics are for new functionality in the C language, and are used for testing hypothesis 3 — "new language features are increasingly used to saturation point". But that hypothesis is in some sense obviously true for most of the metrics. For example, K&R C did not have volatile, but C90 requires volatile in order to prevent the compiler from optimizing away memory accesses to memory mapped hardware registers. Furthermore, there are only a finite number of places where it makes sense to use it, so it obviously saturates.

So the metrics mostly measure that the projects switched from K&R C to standardized C, and most of the graphs look like

[Plot from the paper: DVOL — volatile keyword density]
That is, first no usage, then the feature is introduced in all relevant places when the project updates the compiler.

One other issue here is that the header files are excluded from the calculated metrics. I do not think it affects the conclusion (as it is a rather weak statement anyway), but I have often seen things like
#define HW_REG volatile unsigned int
being defined in header files, and such usages are not included in the data (and this skews the data in the discussion of restrict, where the code uses it as __restrict in order to be able to define it in a suitable way for old compilers that do not handle restrict).

Release dates

The data points for the releases have somewhat arbitrary dates. One issue is that the paper uses each release's mean file date (the average of the files' last modification times) instead of the release date (which is why the graphs stop at November 2010, even though FreeBSD 10 was released in 2014). The idea is that this better reflects the age of the code base, but it has the effect of compressing some of the data points (especially the clustering around 1993-1994), and it makes the spline fitting even more suspect.

One other problem is that the original data used by the researchers seems to have incorrect timestamps. For example, 4.3BSD Net/1 was released in 1989, but is listed as 1993-12-25 in the paper. The same is true for at least the Net/2 release, which was released in 1991 but is listed as 1993-07-02.