Hash tables are one of the most fundamental data structures in computing. They power everything from database indices to in-memory caches, and at the scale of companies like Google, they consume a significant share of total CPU and RAM. Even small improvements to hash table performance can yield massive, compounding benefits across an entire codebase.
This article explores the design principles behind Google's SwissTable — the engine behind absl::flat_hash_map and absl::flat_hash_set — and how it leverages modern hardware features like SIMD instructions and CPU cache architecture to dramatically outperform traditional hash table implementations.
The Problem with Traditional Hash Tables
The C++ standard library's std::unordered_map uses a separate chaining strategy: each bucket contains a pointer to a linked list of entries that hash to that bucket. While conceptually simple, this design has several critical performance drawbacks on modern hardware:
- Pointer Chasing: Following linked-list pointers causes frequent, unpredictable cache misses. Each node in the chain may reside in an entirely different cache line, defeating the CPU's prefetcher.
- Memory Overhead: Every entry requires at least one pointer (often two, for a doubly-linked list), increasing per-element memory consumption and reducing the number of elements that fit in cache.
- Poor Cache Locality: Because nodes are individually heap-allocated, they are scattered across memory. Iterating over the map requires touching many disparate memory locations.
These issues are not merely theoretical. In practice, they make std::unordered_map significantly slower than what the hardware is capable of, particularly for the most common operations: find() and insert().
The SwissTable Design
SwissTable addresses these problems by fundamentally rethinking the hash table's memory layout and probing strategy. The design rests on three pillars: a flat layout, split hashing with metadata fingerprints, and SIMD-accelerated probing.
1. Flat Layout with Open Addressing
SwissTable abandons separate chaining in favor of open addressing. All key-value pairs are stored directly in a single, contiguous array. When a collision occurs, the algorithm probes for the next available slot within the same array.
This has profound implications for performance:
- Cache Friendliness: Sequential slots reside in the same or adjacent cache lines, meaning the CPU prefetcher can anticipate access patterns.
- Reduced Memory Overhead: There are no per-entry pointers for chaining. The overhead per slot is reduced to a single byte of metadata.
- Allocation Simplicity: The entire table is a single allocation, reducing pressure on the memory allocator.
The trade-off is that elements do not have pointer stability — their addresses can change when the table resizes. For use cases that require stable pointers, alternatives like absl::node_hash_map are provided.
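The probing idea can be illustrated with a minimal sketch. This is plain linear probing over a flat array, not SwissTable's actual group-wise probe sequence, and the type names are illustrative:

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Minimal flat-layout map: all entries live in one contiguous array,
// and collisions are resolved by probing the next slot in that same array.
struct FlatMap {
    std::vector<std::optional<std::pair<std::string, int>>> slots;

    explicit FlatMap(std::size_t capacity) : slots(capacity) {}

    void insert(const std::string& key, int value) {
        std::size_t i = std::hash<std::string>{}(key) % slots.size();
        while (slots[i] && slots[i]->first != key) {
            i = (i + 1) % slots.size();  // linear probe: try the next slot
        }
        slots[i] = {key, value};
    }

    const int* find(const std::string& key) const {
        std::size_t i = std::hash<std::string>{}(key) % slots.size();
        while (slots[i]) {  // an empty slot terminates the probe sequence
            if (slots[i]->first == key) return &slots[i]->second;
            i = (i + 1) % slots.size();
        }
        return nullptr;  // hit an empty slot: key is definitively absent
    }
};
```

Note how both insert and find walk the same contiguous array: consecutive probes touch adjacent memory, which is exactly the cache-friendly access pattern the bullet points above describe.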
2. Split Hashing: H1 and H2 Fingerprints
SwissTable splits each element's hash value into two components:
- H1 (upper 57 bits): Determines which group of slots to probe first.
- H2 (lower 7 bits): A compact "fingerprint" stored as one byte of metadata alongside each slot.
The metadata array stores one byte per slot, which can be in one of three states: Empty, Deleted (tombstone), or Full (containing the H2 fingerprint). This split-hash approach acts as a fast, cheap pre-filter. During a lookup, the algorithm first compares the stored H2 fingerprint against the query's H2. Only if there's a match does it proceed to the more expensive full key comparison. Since the H2 comparison filters out roughly 99% of false candidates (7 bits of entropy), the number of expensive key comparisons is drastically reduced.
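The split itself is just a shift and a mask. The sketch below follows the 57/7 split described above; the specific sentinel byte values are illustrative (Abseil's actual control bytes all have the high bit set, which is what distinguishes them from 7-bit fingerprints):

```cpp
#include <cstdint>

// Split a 64-bit hash into H1 (group selector) and H2 (1-byte fingerprint).
inline uint64_t H1(uint64_t hash) { return hash >> 7; }    // upper 57 bits
inline uint8_t  H2(uint64_t hash) { return hash & 0x7F; }  // lower 7 bits

// Metadata encoding: a Full slot stores H2 directly (high bit clear),
// while control states use bytes with the high bit set.
constexpr uint8_t kEmpty   = 0x80;  // illustrative sentinel values
constexpr uint8_t kDeleted = 0xFE;  // (tombstone)
```

Because H2 has 7 bits, two different keys collide on their fingerprint only about 1 time in 128, which is where the ~99% pre-filtering rate comes from.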
3. SIMD-Accelerated Metadata Probing
This is the signature innovation of SwissTable. Slots are organized into groups of 16 (matching the width of an SSE register). The corresponding 16 metadata bytes for a group are stored contiguously, fitting perfectly into a single SIMD register.
```mermaid
graph LR
  subgraph "Hash Lookup Flow"
    A["Compute hash(key)"] --> B["Split into H1, H2"]
    B --> C["H1 → Select group of 16 slots"]
    C --> D["SIMD: Compare H2 against 16 metadata bytes in parallel"]
    D --> E{"Any H2 matches?"}
    E -- Yes --> F["Compare full keys for matched slots"]
    E -- No --> G{"Any Empty slot in group?"}
    G -- Yes --> H["Key not found"]
    G -- No --> I["Advance to next group, repeat"]
    F --> J{"Key equals?"}
    J -- Yes --> K["Found!"]
    J -- No --> G
  end
```

When looking up a key:
- H1 identifies the starting group.
- A single SIMD instruction compares the query's H2 against all 16 metadata bytes in the group simultaneously.
- The result is a bitmask indicating which slots (if any) have a matching fingerprint. Only those slots undergo a full key comparison.
- If no match is found and the group contains an Empty slot, the key is definitively absent. If all slots are Full or Deleted, the algorithm moves to the next group.
This means that in the common case, a lookup touches only one or two cache lines and performs a single SIMD comparison — a dramatic improvement over chaining-based designs.
Performance Results and Impact
The SwissTable design delivers substantial improvements over prior art:
- Lookups: Typically require only 1–2 cache misses, compared to several for chaining-based maps.
- Memory Footprint: The flat, pointer-less layout uses roughly 1 byte of overhead per slot, compared to 8–16 bytes of pointer overhead per entry in std::unordered_map.
- Scalability: The design handles both dense and sparse tables efficiently, with predictable performance across a wide range of load factors.
The one trade-off is iteration: because the table uses open addressing, iterating requires scanning the entire capacity (including empty slots), making it O(capacity) rather than O(size). In practice, however, iteration-heavy workloads are rare compared to find() and insert().
SwissTable's influence extends far beyond C++. The design has been adopted in Go's standard library (as of Go 1.24), Rust's hashbrown crate (which powers HashMap), and numerous other language ecosystems. It represents a paradigm shift in how hash tables are engineered for modern CPUs.
Conclusion
SwissTable demonstrates that to build truly high-performance data structures, you must design with the hardware in mind. By combining a cache-friendly flat layout, a split-hash fingerprinting scheme for cheap pre-filtering, and SIMD instructions for parallel metadata probing, it achieves a level of performance that traditional designs simply cannot match. For any engineer working with hash-heavy workloads, understanding these principles is essential.
Reference: Designing the Fastest Hash Table