Oh, hash it all

A long running problem with C++ development, especially for games where we should try to balance performance with architecture, is that of identifiers. The problem is, in a nutshell, that C++ is a compile time driven language.  It works best for things that you bind together at compile time. As an example, when you call a function, it is expected to exist. In fact the linker will not generally create an executable for you unless it exists. It is also true that you cannot access a variable that does not exist… and in fact C++ makes it (with its class constructs) impossible (or very difficult) to access anything at all unless it is in the program database system known as “headers”.

It’s pretty much always the same story when you analyse C++. It feels like every problem always comes back to those damn header files. See, you may think your main job as a programmer is to write code that manages data… you know, create algorithms and stuff.  But this is not true. Your main job as a C++ programmer is in fact to maintain by hand a database of functions, classes and variables in a bunch of text files. Yes, sorry if the news shocks you. But it is true.

C++ programming would be a lot easier if the compiler actually maintained and optimised the database of classes, functions and variables for you. You know, little things like only “including” headers which are actually used. Of course there are many languages now that avoid this problem, such as Java and C#. The problem is that they were designed starting from C++ with it’s header file mentality, so their design is still influenced by the C++ way of thinking. In fact I will go so far as to say that many of our Object Orientated Programming principles, now copied throughout many alternative languages, have nothing to do what-so-ever to do with solving actual practical programming problems. Instead they have to do with getting around an archaic compiler-header file-linker architecture inherited from the earliest version of C.

C++, is quite simply, a HOP language. That is, a header file orientated programming language.

We have, I am afraid to say, been chasing our tails. Architectural, language and programming design decisions made for the last 20 years or whatever of C++ have all been focused on working around the inherent limitations of that ancient model, or remain influenced by its legacy. Had someone had the foresight to replace those header files with an SQL or similar database, things would have turned out very differently.

As it is, we currently stuck with C++, because it manages to straddle two difficult to reconcile things: performance and architecture.  Other languages seem to not care about performance as much as C++ does. However I believe that we can’t really go on with the mess of C++, a language that was not so much designed as congealed. C++ use naturally creates beautiful code in the same way that COBOL doesn’t. And may the late Douglas Adams forgive me.

I could go on, but I actually wanted to focus on identifiers, so I’ll attempt to get back on track…

The problem with a compile time orientated language is that it makes it hard to make things data driven. Specifically, if you want to have a configuration file, you will generally name things in this file. These things will have to have some kind of meaning to the running code. A simple example is loading an animation, or accessing some entry with a tunable parameter. In my Light engine, I can specify a scene graph style hierarchy in a configuration file. Obviously, this means my code must access different nodes by name. It is also good not to forget that imported models from any 3D art package may have nodes that require access by code. You naturally will want to be able to specify these things in a legible way.

However, you can’t just reference such things by an identifier name, because to do that you have to tell the compiler somehow at compile time what these identifiers are (as a set of enums for example). This of course defeats the idea of being data driven. The point of a configuration file is to avoid having to recompile the code, after all.

The standard solution is, of course, to use strings. However, although strings are really flexible, they are also rather inefficient, especially where you using a string as way to name something. What you want is a unique ID that is given a name by the compiler; that would be fast, but you have to compile it in. Surely there must be a way of getting the best of both worlds?

Alas, this is an intractable problem in C++. There is no elegant solution. When I found that my Light Engine prototype was choking really badly, I found that the cause was my string identifiers: no real surprise and I had taken measures  to design it so that I could retro fit faster solutions when building the system. But it did mean I had to look at an implementation to solve this problem sooner than I had hoped.

Here is what I came up with.

The initialisation file (which I call a property tree) represents a set of hierarchical data (with inheritance). In code I need to be able to access this data, which is loaded into what I call a database, because that’s what it is. It’s a storage system for data, which unlike computer memory is abstracted.

This does raise a very interesting question of course: why on earth do I need to make a database when I have memory? The answer is that memory on computer systems are addressable by index. That’s the way we build them, and I have come to realise that this is actually a rather massive limitation.  When we store something in memory, we have to remember where we put it. This is actually a major inconvenience, which adds a lot of complication to the art of programming. After all, we have to store that address of the data at some address too, and so on and on. As C++ programmers, we actually carry the burden of keeping track of all this. Indeed, this is also something that languages with built in memory management try to help with, with garbage collection and so on. There is a reason that Java does not give you access to pointers.

These thoughts lead me to an interesting idea: one of making a CPU that, in hardware, implements memory which is accessed by name, rather than index. We could call this “associative memory”. Want to create an instance of some “class”? Then instead of doing:

FooClass *foo = new FooClass();
// oh must remember to keep foo hanging around somewhere

We do something like:

new FooClass[Foo1234];
// we know what we called it, so we can access it anywhere

But I digress…

My database is pretty simple, and every value is accessed via a key in a similar manner to the windows repository. The problem is that I do not want to use strings for these keys, but nor do I want to have some kind of tool that processes the data file to enumerate the keys either. Why would I not want that second solution? Well first of all, it means that edits to the property tree file could trigger recompiles, which as stated is not ideal. But secondly, if an enumerated list is generated this will be included in most of the files in my project, so it means that such recompiles will be major recompiles taking a long time.

One trick I used a long time back was to have a compile switch so that I could change between strings and enums. This allowed the creation of different builds: either completely data driven, or partially data driven. However this is still somewhat messy (it worked very well for cross platform development between PC and Playstation 1 though).

So this time, decided to use hash keys. This method is quite common, and works well as long as you are not too afraid of possible hash collisions (which are rare enough to not be an issue… it’s a trade off and I decided after much initial resistance to go with it).

The principle of a hash key is to simply take a string and turn it into a 32 bit value in some way so that any string creates always the same ID, but different strings are unlikely to create the same ID. If they do (analysis suggests that any code base with one million keys and a 32 bit hash might have a one in 4000 chance of of a clash; not bad odds) then one can change the hash algorithm parameters until the problems goes away, or rename the problematic key.

Now we get to the tricky part: you want to say:


And you want this to create some unique value. To keep flexibility, you can make that a macro:

#define xKey(a) #a    // Use a string key
#define xKey(a) keyenums_##a // use an enum (which has to compiled in)
#define xKey(a) hashString(#a)  // turn string into a hash

It is the last of the above that I want to use: I get an ID that’s a 32 bit value, but I do not have to list that ID in some common header file. However as you can see this means that every time I want to use that key I must hash it. That’s infuriating. I really want the compiler to use the hash value, not the string. That is, if xKey(HELLO) were to result in the value 0x3F4A7D3E, then Key k = xKey(HELLO) ought to, ideally, simply store the hash value directly without having to calculate the key.

Alas, there is no way without resorting to external tools.

One solution is to have a precompile step: perhaps run through the source code before compiling. This tool would create the hash values and perhaps replace the xKey(HELLO) text with 0x3F4A7D3E and so on. More sophisticated (but essentially the same idea) is to locate every one of these macro uses and generate an include file per source file on the fly that containts enums so that the macro simply reads off the value (using the second of the three macros listed above).

This is actually quite doable, but there are a few reasons why I don’t like it as a solution. First of all, the tool that scans the source code will slow down the compile time slightly. Second, unless it parses the C++ carefully there is potential for confusion: a raw text substitution is potentially unsafe (although I admit in practice unlikely to be a problem).  Thirdly, when building in environments in clumsy IDEs such as Visual Studio, custom build rules can be quite a pain to set up and maintain (they tend to break). And forth, if you want to use enumeration rather than hashes, there is no easy way to coordinate the enumeration across all the source files.

So I have at this time experimented with a different solution to this problem, which is perhaps unconventional. Instead of preprocessing the source files, I post process the generated executable. This is actually quite elegant from the source code point of view, although clearly hacking the executable is ugly. Here’s how it works.

// a marker for key string constants

#define __token_KEYTAG "\x1f\xaa\x55\x69"

// prefix the string with the 4 byte tag, and the macro returns the tag value

#define xKey(a) ((Key)(*((volatile int *)(__token_KEYTAG #a))))
So, what does this do? Every time that you use the macro, it creates a string constant. Compilers these days will, incidentally, pool such string constants so that if you use them multiple times, they are only stored once. This string constant is prefixed with a 4 byte sequence. I have chosen the sequence to consist of non printable characters, so it is quite unlikely that any ordinary string constant will be confused with these.

After the executable has been created by the linker, I run a custom tool which I call the enumerator. It reads in the executable and figures out where the strings are stored from the exectuable headers (I have done this now for Windows and Mac executables – it’s not very hard to do), and within this initialised data section locates all the keys by searching for the four byte tag. For each located string, it applies the hash algorithm and overwrites the tag.

When the executable runs, every use of the xKey() macro will magically return the hash value for that string, which is pretty neat. This system also has the advantage that actual enumeration rather than hashing can also be used, for extremely high speed look ups: xKey(x) can be turned into a specific offset.

Don’t get me wrong, this solution is not without its problems, but currently it works fine for two platforms. Because this is an optimisation, I am not relying on it in order to make the code function; I can always replace the macro with a simple call to a hash function as a fall back, or figure out some other less flexible solution (such as compiling in the keys) for the purpose of porting.

One problem is possible confusion if the key appears by chance somewhere else other than in the use of xKey (this is unlikely; I only search the string constants if it is possible to determine where they are; at the very least I limit the search to the initialised data section. But it is something to be aware of). There may also be alignment issues (the cast to into an int of 4 bytes not on a 4 byte boundary) which can cause a performance loss (there are workarounds for this).

However, this is one of those problems that having a number of possible solutions that can be interchanged without modifying the calling code is a good idea, so it’s good to have another option available.

How about using the pre-processor to create the hash? Or using templates? As far as I can tell, there is no solution for this that uses actual string constants.  Once again we hit limitations of the C++ language, which as my students are aware, I am very fond of pointing out… it seems to me very important to understand the limitations of the tools we use.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s

%d bloggers like this: