2.6 Fast hash map for symbols - naver/lispe GitHub Wiki
I'm building a Lisp dialect called LispE, and I've implemented a slick way to manage symbols using a custom structure, binHash. It's fast and pretty efficient...
(see mapbin.h)
Lisp lives on symbols (x, foo, etc.), and you need quick checks and value lookups during evaluation. Regular hash tables are fine, but I wanted something snappier for LispE.
binHash is a bitmap-based hash table for integer keys:
- Keys: 16-bit integers (up to 65,535 symbols—plenty!).
 - 
Storage: Each bucket uses a 64-bit bitmap (
uint64_t) and an array of values. - 
How: Symbol ID splits into a bucket (
id >> 6) and a bit position (id & 63). Bitmap flags presence; array stores the value. 
template <class Z> class binHash {
    Z** table;           // Array of pointers to value arrays
    uint64_t* indexes;   // Bitmaps tracking presence
    uint16_t tsize;      // Number of buckets
    int16_t base;        // Offset to skip empty leading buckets
public:
    bool check(uint16_t r) {
        uint16_t i = (r >> binBits) - base;  // Bucket index, adjusted by base
        return (i < tsize && (indexes[i] & binVal64[r & binMin]));  // Check bit in bitmap
    }
    
    Z& operator[](uint16_t r) {
        // Bucket ID = r / 64; slot = r % 64. Since 64 = 2^6, divide is a shift right by 6,
        // remainder is a mask with first 6 bits set (63 = 0b111111)
        uint16_t i = r >> binBits;
        r &= binMin;
        if (base == -1) {
            base = i;
            i = 0;
        }
        else {
            if (i < base) {
                insert(i);
                i = 0;
            }
            else {
                i -= base;
                if (i >= tsize)
                    resize(i + (i >> 1) + 1);
            }
        }
        if (table[i] == NULL)
            table[i] = new Z[binSize];
        indexes[i] |= binVal64[r];
        return table[i][r];
    }
};- Lookup: Shift, AND—O(1), no sweat.
 - Insert: Set a bit, stash the value. Grows dynamically.
 
Symbols get IDs from strings:
unordered_map<string, int16_t> string_to_code;  // Maps symbol names to IDs
int16_t get_id(const string& sym) {
    auto it = string_to_code.find(sym);  // Look up existing ID
    if (it != string_to_code.end()) return it->second;  // Return if found
    int16_t id = string_to_code.size();  // New ID = current size
    string_to_code[sym] = id;            // Store new mapping
    return id;
}Values go into a binHash<Element*>, where Element* is a Lisp value:
binHash<Element*> variables;  // Holds symbol-to-value mappings
void store(string sym, Element* e) {
    int16_t id = get_id(sym);  // Get or create ID
    variables[id] = e;         // Store value in binHash
}
Element* lookup(string sym) {
    int16_t id = get_id(sym);  // Get ID
    return variables.check(id) ? variables[id] : nullptr;  // Return value or null
}The binHash template includes an iterator class that makes it easy to traverse all elements in the hash table:
// Iterate through all variables in a binHash
binHash<Element*> variables;
binHash<Element*>::iterator a(variables);
for (; !a.end(); a++) {
    cout << a->first << ": " << a->second << endl;
}- 
Iterator Creation: 
binHash<Element*>::iterator a(variables)creates an iterator positioned at the first element. - 
Traversal: The loop continues until 
a.end()returns true, indicating we've reached the end. - 
Access: Each iteration gives you:
- 
a->first: The key (symbol ID) - 
a->second: The value (Element pointer) 
 - 
 
This iterator is particularly efficient because it uses the bitmap structure of binHash to skip over empty slots. It first finds the next non-zero bit in the bitmap (indexes[i]), then uses bit manipulation to quickly locate the position of that bit, corresponding to a stored element.
- Speed: Bitwise operations outpace hash table lookups.
 - 
Memory Smarts: binHash keeps it lean:
- Bucket Size: Each bucket holds 64 values—nice and tidy.
 - Base Offset: Since variable IDs are close together, a base skips empty buckets before the first ID.
 - 
Clustering: IDs are pretty sequential, so one or two buckets often cover all variables. For instance, when executing a function, the local variables are stored in a 
binHash<Element*>map for fast access. Hence, 12 symbols often fit in one or two buckets (~520 bytes on 64-bit). No waste! 
 
binSet complements binHash by providing a bitmap-only version for storing sets of integers:
- Purpose: Efficiently represents collections of 16-bit integer values without associated data.
 - 
Implementation: Uses the same bitmap indexing scheme as 
binHashbut without the value arrays. - Memory Efficiency: Only stores the bitmap portion, making it extremely space-efficient for representing sets.
 
class binSet {
public:
    uint64_t* indexes;  // Array of bitmaps (64 bits each)
    uint16_t tsize;     // Number of buckets
    int16_t base;       // Base offset
    
    bool check(uint16_t r) {
        uint16_t i = (r >> binBits) - base;  // Compute bucket index
        return (i < tsize && (indexes[i] & binVal64[r & binMin]));  // Check bit
    }
...
};- Symbol Checking: Used to check if a symbol is available for a given function.
 - Flag Collections: Efficiently stores sets of flags or options.
 - Set Operations: The implementation includes operations like clear(), erase(), and the iterator (binSetIter) for traversal.
 
binHash and binSet offer the following features:
- Fast: O(1) with minimal overhead.
 - Compact: Bitmaps + clustering = low memory footprint.
 - Simple: Clean and effective implementation.
 - Flexible: Supports both value storage (binHash) and set operations (binSet).
 
These data structures have been critical in making LispE a high-performance Lisp implementation, demonstrating how careful algorithm selection and implementation can significantly impact language performance.