[Jump straight to results from January 2010.]
This project aims to help compiler developers improve their products by giving them actionable test cases that pinpoint missed optimizations. As the results make clear, there is plenty of room for improvement: no compiler is able to consistently generate smaller code than any other compiler, even when considering substantially similar compilers such as adjacent minor versions of gcc.
The idea is to compile hundreds of thousands of functions with as many compilers as possible and then, for each pair of compilers, rank the output by an "embarrassment factor" which is the ratio of the object code sizes. As a very simple example, not all compilers can statically predict the result of this kind of function:
int host_is_big_endian (void) {
int pattern = 0xfeedface;
unsigned char *bytewise = (unsigned char *)&pattern;
if (bytewise[0] == 0xfe) return 1;
return 0;
}
One optimizing compiler that we tested turned this function into 48 bytes of x86: more than 10 times larger than the optimal code. This optimization failure could matter for an application that uses the result of the endianness test in some other computation.
Even the best compilers have shortcomings in their ability to generate fast and compact object code. Difficulties include:
Arch Robison's paper The Impact of Economics on Compiler Optimization is the best overall summary of these issues that I know of.
Do these kinds of small-scale optimizations matter? Of course they do, particularly for people developing embedded software. When resources are tight, good developers will only create readable, modular source code if compilers can be trusted to turn this into fast and compact object code. The days of fanatically hand-optimized C code are almost over, even for embedded software. We're just trying to help put the last nails in the coffin.
We test three kinds of functions.
We wrote a CIL extension to harvest functions from many open source projects written in C. It works like this:
Each harvested function is placed into a file named by computing a hash over the function's AST. This has the useful side effect of eliminating duplicates. The hash function is deliberately a bit fuzzy, so that for example we don't get multiple functions that differ only by choice of constants.
Usually, you can find out what file and package a function came from by looking at the comments at the bottom of the harvested C file. If you see a harvested function that you believe is semantically different than the corresponding function in a source package (modulo the transformations described above) please let us know. If the changes were caused by a bug in our infrastructure we'll try to fix it.
The point of the harvesting procedure is (1) to create a level playing field for various compilers and (2) to create functions that are reasonably representative of the code that C compilers actually encounter in practice. The function inlining step is needed to prevent compilers' always-fickle inlining heuristics from distorting the results.
Here we simply harvest functions from randomly generated C programs.
We generate code using a technique called bounded exhaustive testing which means exploring the program space up to some maximum program size. In practice the entire program space cannot be explored in a meaningful way, so we explore small parts. The function at the top of this page is an example. The point of generating these silly-looking functions is to try to uncover interesting mathematical relationships exploited by some optimizers and missed by others. The current crop of generated functions is extremely simple, but we're working on generalizing it.
Each generated function is tested on about 100,000 inputs. The "correct" answer for each input is determined by majority vote across all compilers being tested. Deviations from the correct answer are characterized as miscompilations. Note that (1) we may mislabel miscompiled functions as being correct due to the highly incomplete nature of testing and (2) if one of our safe math functions contains a bug we may mislabel a correct function as being miscompiled. Please let us know if you find a bug in our utility functions.
We use the GNU binutils size program to measure code size. Like all methods for disambiguating code and data, it is imperfect. Let us know if you find a particularly egregious mistake. The root cause of these errors seems to generally be a difference in convention between compilers about whether, for example, a jump table or an exception-handling frame is considered to be code.
If you spend much time looking through compilers' output, you'll eventually see examples where a compiler's output is too good. In other words, it creates small code by emitting the wrong code. This is particularly common for functions containing the volatile qualifier. In the future we may attempt to automatically find wrong code problems, but this is non-trivial.