C++ is a divisive force. For every discussion about it, a certain percentage of people will love it and another substantial group will hate it. Above all, one feature highlights this divide: templates. Templates have many problems: they slow compile times, force implementations to be specified in headers and can bloat generated code. However they're also a powerful mechanism for programming against generic types. In the first article in this series, we saw how C++ templates have significant limitations, most notably that they can't be used across API boundaries. In the second article, we explored how templates can be implemented by using 'template generators', which are trees of compiler-generated functions that compute template arguments for a function given its path in the tree. In this final part we'll see how template generators can be optimised to minimise, or in many cases eliminate, their overhead.
Pass-through path reduction
Consider the following: int f() { return g<int, float, double>(1.0f, 2.0); } template <typename A, typename B, typename C> A g(B b, C c) { return h<A, B, C>(b, c); } template <typename A, typename B, typename C> A h(B b, C c) { // ...some code... } By the approach described above, we'd have the following paths:  • To get arguments of g(): '1'
 • To get arguments of h(): '10'
Here are the template generators: void root_template_generator(type* args, path_t path) { args[0] = { int_vtable }; args[1] = { float_vtable }; args[2] = { double_vtable }; g_template_generator(args, path, get_path_start(path)); } void g_template_generator(type* args, path_t path, size_t position) { if (position == 0) { // Path end. return; } h_template_generator(args, path, position—1); } void h_template_generator(type* args, path_t path, size_t position) { return; } However in this case the template arguments to g() are the same as the arguments it passes to h(). Hence we can merge these two transitions, giving only one transition (continue to h_template_generator), which means no bits are required in the path for g(). This gives us:  • To get arguments of g(): '1'
 • To get arguments of h(): '1'
This simplifies the template generators to: void root_template_generator(type* args, path_t path) { args[0] = { int_vtable }; args[1] = { float_vtable }; args[2] = { double_vtable }; g_template_generator(args, path, get_path_start(path)); } void g_template_generator(type* args, path_t path, size_t position) { h_template_generator(args, path, position); } void h_template_generator(type* args, path_t path, size_t position) { return; } This works because g_template_generator() knows that h_template_generator() must return the template arguments we gave it since the path ends at that point. I refer to this situation as 'pass-through' and it's a very common case. In particular it has the nice property that it is a local (non-ABI) change, hence it is an optimisation rather than a specified requirement. There's an explanation about how we handle pathological cases, such as mutually recursive functions, on the Loci website. In short, direct function calls (calls via function pointers or interfaces cannot be templated) between modules must form a directed acyclic graph (DAG), so templated recursion can only occur within a module; this allows the compiler to perform cycle detection on the templated functions within that module.
Prefix pass-through
Pass-through also applies if the template arguments of the caller are a prefix of the callee, or vice versa, such as: int f() { return g<int, float, double>(1.0f, 2.0); } template <typename A, typename B, typename C> A g(B b, C c) { return h<A, B>(b); } template <typename A, typename B> A h(B b) { // ...some code... } Here's how the template generators look after applying the pass-through optimisation: void root_template_generator(type* args, path_t path) { args[0] = { int_vtable }; args[1] = { float_vtable }; args[2] = { double_vtable }; g_template_generator(args, path, get_path_start(path)); } void g_template_generator(type* args, path_t path, size_t position) { h_template_generator(args, path, position); } void h_template_generator(type* args, path_t path, size_t position) { return; } When h() queries its template arguments, it gets an extra argument it didn't expect. Fortunately, it completely ignores that, so this works. Of course, pass-though works for the original example as well. void root_template_generator(type* args, path_t path) { args[0] = { int_vtable }; g_template_generator(args, path, get_path_start(path)); } void g_template_generator(type* args, path_t path, size_t position) { args[1] = { float_vtable }; h_template_generator(args, path, position); } void h_template_generator(type* args, path_t path, size_t position) { return; } In this case it's the other way around; g() is getting an argument it didn't expect. But again, it ignores it.
Compiler optimisations
The Loci compiler uses the LLVM compiler infrastructure as its backend, which means it has access to a large set of production-quality optimisation passes and many of these passes are effective at reducing, and in many cases eliminating, the overhead of template generators.
Example
For example, consider our template generators after pass-through path reduction: void root_template_generator(type* args, path_t path) { args[0] = { int_vtable }; g_template_generator(args, path, get_path_start(path)); } void g_template_generator(type* args, path_t path, size_t position) { args[1] = { float_vtable }; h_template_generator(args, path, position); } void h_template_generator(type* args, path_t path, size_t position) { return; } We'll then have the actual generated code for f(), g() and h(): void f_impl() { template_generator_t template_generator; template_generator.root_fn = root_template_generator; template_generator.path = 1; g_impl(template_generator); } void g_impl(template_generator_t template_generator) { // This is how we get our arguments. type_t template_args[MAX_TEMPLATE_ARGS]; template_generator.root_fn(&template_args[0], template_generator.path); h_impl(template_generator); } void h_impl(template_generator_t template_generator) { type_t template_args[MAX_TEMPLATE_ARGS]; template_generator.root_fn(&template_args[0], template_generator.path); } There are few things we can do here, but the most obvious is that the optimiser is strongly encouraged to inline templated functions whenever possible. So it'll typically just inline h() straight into g() and then g() straight into f(), giving: void f_impl() { template_generator_t template_generator; template_generator.root_fn = root_template_generator; template_generator.path = 1; type_t template_args0[MAX_TEMPLATE_ARGS]; template_generator.root_fn(&template_args0[0], template_generator.path); type_t template_args1[MAX_TEMPLATE_ARGS]; template_generator.root_fn(&template_args1[0], template_generator.path); } Then it can propagate the constants in the template generator struct. void f_impl() { type_t template_args[MAX_TEMPLATE_ARGS]; root_template_generator(&template_args[0], 1); type_t template_args[MAX_TEMPLATE_ARGS]; root_template_generator(&template_args[0], 1); } Template generators are also strongly encouraged to be inlined, giving us: void root_template_generator(type* args, path_t path) { args[0] = { int_vtable }; args[1] = { float_vtable }; } We'll then further inline this straight into f_impl(): void f_impl() { type_t template_args0[MAX_TEMPLATE_ARGS]; template_args0[0] = { int_vtable }; template_args0[1] = { float_vtable }; type_t template_args1[MAX_TEMPLATE_ARGS]; template_args1[0] = { int_vtable }; template_args1[1] = { float_vtable }; } If both g() and h() were using their arguments, they'll just be accessing members in the template_args array, such as: void f_impl() { type_t template_args0[MAX_TEMPLATE_ARGS]; template_args0[0] = { int_vtable }; template_args0[1] = { float_vtable }; type_t template_args1[MAX_TEMPLATE_ARGS]; template_args1[0] = { int_vtable }; template_args1[1] = { float_vtable }; type_t first_arg = template_args1[0]; // ...do a virtual call with the first type argument... } So, propagate constants again: void f_impl() { type_t first_arg = int_vtable; // ...do a virtual call with the first type argument... } Our virtual call will be loading something from the vtable and calling it. But we know which vtable we're using, so there'll be more inlining and more constant propagation, until we're directly calling methods of 'int' (which themselves will be inlined). As demonstrated by this example, there's a bit of churn in the optimiser as it eliminates unnecessary code, which is similar to C++ code. This occurs because abstractions such as interfaces generate code that's tailored for the most difficult case (i.e. we have no idea who we're calling) and then optimisations use local information to improve the code quality. As another abstraction, templates in Loci add to the work that needs to be done by the optimiser. This is in contrast to C++, which shifts a lot of this work to its Semantic Analysis stage. As mentioned previously this means Loci can produce debug builds a lot quicker than C++ but those builds would tend to be slower; it's up the user to select an appropriate trade-off, and runtime templates provide a lot of flexibility in that respect.
Partial evaluation
While inlining gives us a good outcome in this case, and indeed in most cases, sometimes it will generate duplicate code unnecessarily. For example, if multiple functions both call function() they could all end up with their own inline versions of function(). What we really want is to generate a version of function(), such as 'function_int()' that only operates on 'int', and then we can make a separate decision about inlining. This can be resolved using something called 'partial evaluation', which is to generate a copy of a function with an argument fixed to a specific value. The current compiler relies heavily on inlining, but it shouldn't be difficult to add extra optimisation passes to achieve this (there are also partial evaluators available for LLVM) and this will be part of future work on the compiler.
Summary
Template generators are an excellent way to implement templates because they:  • Allow templated code to be put into implementation files rather than header files.
 • Ensure that if the template implementation changes, dependent modules don't need to be recompiled.
 • Reduce compile times for debug builds.
 • Are extremely amenable to optimisation.
Templates across API boundaries fits in well within Loci, with its focus on defining and implementing clear and modular APIs, but the approaches described here are very likely to be useful within the scope of other languages.
Extra Meta-programming
C++ templates are used for meta-programming as well as type/value substitution, since they happen to be Turing complete at compile-time. This usage isn't appropriate for run-time templates, since they're no longer a compile-time mechanism. There are many solutions to this problem, including a combination of compile-time and run-time templates, however the most compelling is to use constexpr or similar. constexpr is a C++11 keyword to mark apparently normal code as being evaluated at compile-time, which leads to significantly clearer code; Loci is heading down a very similar direction.
Path length
The length of a path is [math notation] size(nodes) * SUM(for each n in nodes: log2 transitions(n)), where nodes is the list of nodes in the path from the root and transitions(n) is the number of transitions out of node n. In other words, increasing the number of levels (nodes) increases the length of the path more than increasing the number of transitions. As an example, imagine a function 'foo()' that called and passed its template arguments to 10000 functions. log2(10000 + 1) = 13.29, which means we'll need 14 bits in the path to represent its transitions. On the other hand, if you had a chain of 10000 functions, each of which uses its template arguments, and each of which passes different arguments between different modules (see 'Pass-through path reduction'), then you'll need 10000 bits in the path. One of the original constraints was to avoid heap allocation. However a path can theoretically be arbitrarily long, so you'd think it needs heap allocation. We get around this by setting an upper bound, that is so unlikely that one can be certain that it won't be exceeded in practice. For Loci a 64bit integer was chosen. This isn't just throwing the kitchen sink at the problem, we're throwing several thousand sinks. Let's look at the numbers. In this discussion, I'll talk about the 'nesting level' of templated functions. For example, a nesting level of 0 means we have a templated function that doesn't pass its template arguments to any other functions; a level of 1 means we have a templated function that passes its template arguments to other functions, but those functions don't pass their template arguments any further. Etc. Nesting level is in fact essentially the number of modules, because modules must form a DAG (directed acyclic graph); all the functions in one module can be merged into one nesting level with lots of transitions. As mentioned above, the path length is log(number of transitions), so this is an enormous reduction in path size. For a nesting level of 0, we could never exceed the upper limit. The path will always be 1 bit long (the '1' bit to indicate the path length). For a nesting level of 1, we can exceed the upper limit by having over 2^63 unique transitions in the templated function. That would be a very very long function. For a nesting level of 2, we can exceed the upper limit by having over 2^(63/2) (= 3 billion) unique transitions at both levels. For a nesting level of 3, we can exceed the upper limit by having over 2^(63/3) (= 2 million) unique transitions at all three levels. For a nesting level of 4, we can exceed the upper limit by having over 2^(63/4) (= 55109) unique transitions at all four levels. Let's speed this up. For a nesting level of 20, we can exceed the upper limit by having over 2^(63/20) (= 8.9) unique transitions at all 20 levels. For a nesting level of 32, we can exceed the upper limit by having over 2^(63/32) (= 3.9) unique transitions at all 32 levels. For a nesting level of 64, we can exceed the upper limit by having over 2^(63/64) (= 1.98) unique transitions at all 64 levels. You can see that the number of transitions per function is beginning to become reasonable, but the nesting depth is very unreasonable. Remember that each templated function mustn't be amenable to pass-through, so you'd have to be doing things like rotating the arguments every time or adding new arguments. Furthermore, as previously mentioned, the compiler can merge nesting levels for functions in the same module.
About the author
Stephen Cross is a software engineer at UK start-up, Undo Software, where he works on reversible debugging technology designed to improve software quality, robustness and security. He's very enthusiastic about program analysis and correctness, as well as efficient abstractions and low level tricks to implement these abstractions. In his own time he develops Loci, a systems programming language.