String Interning: Future Ryan Delivers

January 15, 2018

As I mentioned in my Conditioning Script post, the next step in comfortably using numeric IDs as replacements for strings was to build a reverse lookup, so that the original string that generated an ID can be retrieved.

At the Highest Level

It’s a pretty simple lookup table mapping IDs to strings. In C/C++ one might imagine all of the wonderful memory fragmentation that would occur. One might even start down a path of blocking out a chunk of memory and using a stack allocator or similar to store the strings. “Or, even better,” One might think, “I’ll read them in directly as a block of memory, detect the string boundaries, and store off pointers!”

And certainly, One would be correct, but, honestly, this is debug data and won’t be used that frequently. It could safely be removed from release builds and, generally speaking, the allocations for the list of strings will happen sequentially and will likely not be all that fragmented when it’s all said and done.

You figured out that I’m talking about myself here, right? That I almost totally over-engineered this…
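For the curious, here is a rough sketch of what that block-of-memory idea could look like. It is only an illustration of the approach I talked myself out of: StringBlock and Index are hypothetical names, and it assumes the strings arrive newline-separated in a single buffer.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: one contiguous buffer plus pointers into it.
// The pointers refer to _buffer, so a StringBlock should not be copied
// or moved after Index has run.
struct StringBlock
{
    std::string _buffer;                 // whole input; newlines become terminators
    std::vector< const char* > _strings; // start of each non-empty string
};

// Takes ownership of the contents, splits on '\n' in place, and records
// a pointer to the start of every non-empty line.
inline void Index( StringBlock &block, std::string contents )
{
    block._buffer = std::move( contents );
    block._strings.clear();

    char *data = &block._buffer[ 0 ];
    const size_t size = block._buffer.size();
    size_t start = 0;
    for ( size_t i = 0; i <= size; ++i )
    {
        if ( i == size || data[ i ] == '\n' )
        {
            if ( i > start )
                block._strings.push_back( data + start );
            if ( i < size )
                data[ i ] = '\0'; // the final line relies on std::string's own terminator
            start = i + 1;
        }
    }
}
```

It makes a single allocation for all of the string data, at the cost of pointer lifetime rules that a plain map of std::string does not have.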

Here’s what I arrived at in terms of the interface.

namespace debugging
{
    struct InternedStrings
    {
        eastl::hash_map< uint32_t, std::string > _interns;
    };

    namespace strings
    {
        uint32_t Intern( InternedStrings &interns, const char *str );
        const char* GetString( InternedStrings &interns, uint32_t id );
        bool LoadFromDisk( InternedStrings &interns, const char *path );
    }
}

InternedStrings is defined as its own struct, rather than as a bare eastl::hash_map, in case the storage concerns I had do in fact become an issue, or things need to be extended for other reasons.

Those three functions do what you might expect: inserting strings, retrieving them by uint32_t ID, and loading a list of strings from disk.
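A minimal sketch of how Intern and GetString could be implemented is below. It rests on a few assumptions that are not from the engine: std::unordered_map stands in for eastl::hash_map so the example builds without EASTL, and Hash32 is assumed to be 32-bit FNV-1a, since the actual hash function isn't shown in this post.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

namespace debugging
{
    struct InternedStrings
    {
        // std::unordered_map stands in for eastl::hash_map in this sketch.
        std::unordered_map< uint32_t, std::string > _interns;
    };

    namespace strings
    {
        // Assumed hash: 32-bit FNV-1a. Substitute the engine's real hash.
        inline uint32_t Hash32( const char *str )
        {
            uint32_t hash = 2166136261u;
            for ( ; *str != '\0'; ++str )
            {
                hash ^= static_cast< uint8_t >( *str );
                hash *= 16777619u;
            }
            return hash;
        }

        // Records the string under its hash and returns the hash.
        // The first string stored for an ID wins; a later collision is ignored.
        inline uint32_t Intern( InternedStrings &interns, const char *str )
        {
            const uint32_t id = Hash32( str );
            interns._interns.emplace( id, str );
            return id;
        }

        // Returns the original string, or nullptr if the ID was never interned.
        inline const char* GetString( InternedStrings &interns, uint32_t id )
        {
            const auto it = interns._interns.find( id );
            return it != interns._interns.end() ? it->second.c_str() : nullptr;
        }
    }
}
```

Returning nullptr for an unknown ID is a choice made for this sketch; it lets the caller decide how to present an ID that was never captured.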

An Aside

The organization I’ve used here is something I’ve been moving towards as my personal style: data structures defined, as much as is possible and sensible, as just data (operators and constructors still have to live in the class/struct), with the functions that work against those structures organized into namespaces.

I started playing with this after reading a compelling argument from the BitSquid guy and it’s really been quite nice. It helps minimize compile times as well, since you don’t have to have all function declarations in the same header, and it forces you away from writing code that has implicit side effects. All operations against a data structure are very plainly laid out in the functions.

Content Pipeline Scrubbing

There are several specific areas of raw content where strings exist that are known to be hashed, either at runtime or as part of content pipeline processing:

Literals being hashed in scripts
Tag names in prefabs and scene instances
All string property values of instances

In the engine, prefabs are scene instances defined à la carte. They get used as standalone instances or as templates to be augmented by other scene instance data.

In the previous post about Conditioning Scripts, I talked about how I scrub scripts. So in that case, I had a convenient spot carved out to also record the input string as something to be interned.

For the other two areas, it was fairly trivial to modify their processing to do something similar. Because this data is used for debugging purposes, I decided to cast a pretty wide net. The property values, for example, are in many cases not used as hashed values, but the amount of memory at runtime and the time it takes to extract those values are small, and it’s much better to have strings you don’t use than to puzzle over a hash value as it flits through the logs.
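To make the logging payoff concrete, here is a small sketch of rendering an ID for log output. DescribeId is a hypothetical helper, not part of the interface above, and it takes a plain std::unordered_map rather than the engine's container so the example stands alone.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>

// Hypothetical helper: returns the interned string for an ID when one is
// known, and falls back to the raw hash in hex when it is not.
inline std::string DescribeId( const std::unordered_map< uint32_t, std::string > &interns,
                               uint32_t id )
{
    const auto it = interns.find( id );
    if ( it != interns.end() )
        return it->second;

    char buffer[ 11 ]; // "0x" + 8 hex digits + terminator
    std::snprintf( buffer, sizeof( buffer ), "0x%08X", static_cast< unsigned int >( id ) );
    return buffer;
}
```

With a wide net cast during content processing, the hex fallback should only show up for strings generated at runtime.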

Runtime

At runtime, the core of the engine loads the strings file (one string per line) via debugging::strings::LoadFromDisk. Each string is hashed and interned in order. Functions are exposed via the Lua scripting layer to intern more strings or retrieve one by ID. This is how I intend to bridge the gap between strings I was able to extract during resource processing and hashes that are generated at runtime. Lua code can be run to generate known variants of strings that were not able to be captured.

Runtime Strings

Runtime string generation + hashing is something I’m not completely happy about, so at some point in the future I’ll address it. The worst culprit so far is the weapons system, which builds its IDs by concatenating the slot ID with the rest of the name, such as:

GameObject.SetInt( handle, 'weapon', slot, 'mask', val );

That function results in something like ‘weapon0mask’ or ‘weapon1mask’ being created and hashed. Generating all known variants up front could help to address this. At that point the data functions would be changed to take hashed IDs instead of strings. How exactly to cleanly handle that on the scripting side is an open question. Maybe each weapon script would generate all IDs it would need and then look them up?

-- One entry per weapon slot, keyed by slot index.
_ID = {
  [0] = { mask = Hash32( 'weapon0mask' );
          fire_time = Hash32( 'weapon0fire_time' );
        };

  [1] = { mask = Hash32( 'weapon1mask' );
          fire_time = Hash32( 'weapon1fire_time' );
        };
}

GameObject.SetInt( handle, _ID[ slot ].mask, val );
GameObject.SetFloat( handle, _ID[ slot ].fire_time, val );

There are relatively few places where this pattern is used, so perhaps that makes sense. It would provide a nice central place to acquire generated hash IDs from.
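The same idea on the C++ side might look something like the sketch below, which generates the per-slot names, hashes them, and records each one for reverse lookup. None of this exists in the engine: WeaponSlotIds and BuildWeaponSlotIds are hypothetical names, Hash32 is assumed to be 32-bit FNV-1a, and a plain std::unordered_map stands in for the interned strings table.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed hash (32-bit FNV-1a); swap in the engine's real hash.
inline uint32_t Hash32( const std::string &str )
{
    uint32_t hash = 2166136261u;
    for ( const char c : str )
    {
        hash ^= static_cast< uint8_t >( c );
        hash *= 16777619u;
    }
    return hash;
}

// Hypothetical per-slot ID set, mirroring the Lua table above.
struct WeaponSlotIds
{
    uint32_t mask;
    uint32_t fire_time;
};

// Builds the IDs for slots [0, slotCount) and records every generated
// name in `interns` so the hashes can be turned back into strings.
inline std::vector< WeaponSlotIds > BuildWeaponSlotIds(
    uint32_t slotCount,
    std::unordered_map< uint32_t, std::string > &interns )
{
    std::vector< WeaponSlotIds > ids;
    ids.reserve( slotCount );
    for ( uint32_t slot = 0; slot < slotCount; ++slot )
    {
        const std::string prefix = "weapon" + std::to_string( slot );
        const std::string maskName = prefix + "mask";
        const std::string fireTimeName = prefix + "fire_time";

        const WeaponSlotIds entry = { Hash32( maskName ), Hash32( fireTimeName ) };
        interns.emplace( entry.mask, maskName );
        interns.emplace( entry.fire_time, fireTimeName );
        ids.push_back( entry );
    }
    return ids;
}
```

Generating the table in a loop sidesteps the copy-paste mistakes that a hand-written table per slot invites.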

Conclusion

I’ve been very happy having excised strings from the bulk of the system and this next step brings back some of the more comfortable elements of why people often use strings for identifiers in the first place.

It’s allowed a lot of my code to be far more concise and in all cases performant. A hashmap of uint32_t to uint32_t can fit an absurd amount of data into a small space, passing by value where it makes sense feels sane, comparisons are dead easy, etc.

⁹⁄₁₀, enjoyed it, would intern strings again

-r