Hash Tables: The Workhorse of Computer Science

Introduction

If there’s one data structure that every programmer uses daily without thinking about it, it’s the hash table. Whether you’re using a Python dictionary, a JavaScript object, a Java HashMap, or a C++ unordered_map—you’re using a hash table. They’re the backbone of databases, caches, and countless applications.

Understanding hash tables deeply helps you make better architectural decisions and debug performance issues. This article explores how hash tables work, different implementations, and practical considerations.

The Core Idea

A hash table stores key-value pairs, allowing fast lookup by key. The magic is converting keys (strings, numbers, objects) into array indices through a hash function.

In the ideal case: Key → Hash Function → Index → Value (O(1) lookup)

The challenge is handling collisions—when two different keys hash to the same index.

Hash Functions

A good hash function distributes keys uniformly across the array. Properties:

Deterministic: Same key always produces same hash
Fast: Computing the hash should be quick
Uniform: Keys spread evenly across all buckets

Common hash functions:

Division method: h(k) = k mod m (choose m as prime, not power of 2)
Multiplication method: h(k) = floor(m(kA mod 1)) where A is constant
Universal hashing: Randomize hash function to prevent worst-case behavior

For strings, algorithms like FNV or MurmurHash perform well.

Collision Resolution

When two keys collide, we need a strategy:

Chaining (Separate Linking)

Each bucket is a linked list. Collisions just add to the list.

Pros: Simple, handles arbitrary load factor, delete is easy Cons: Cache performance poor, worst-case O(n)

Load factor (α): n/m = number of elements / number of buckets. Keep α < 0.75 for good performance.

Open Addressing

All elements stored in the table itself. On collision, probe for another slot.

Linear probing: Check next slot: h(k), h(k)+1, h(k)+2… Quadratic probing: Check slots with quadratic spacing Double hashing: Use second hash function for probe sequence

Pros: Better cache performance, no pointers Cons: Clustering issues, delete is tricky

Resizing and Rehashing

When load factor exceeds threshold (typically 0.7), resize the table:

Create new, larger table (typically 2x size)
Rehash all elements into new table
Use new hash function

This is expensive (O(n)) but happens infrequently. Amortized cost remains O(1).

Practical Implementations

Python dict: Uses open addressing with randomized seeds. Keys must be hashable.

Java HashMap: Uses chaining with treeify threshold (JDK 8+ converts linked list to red-black tree for long chains).

Go map: Uses open addressing with random iteration order.

Rust HashMap: Uses SipHash for DoS resistance.

Handling Unhashable Keys

Some keys can’t be hashed (lists, dicts). Solutions:

Convert to immutable equivalent (tuple)
Use custom hash based on content
Use identity hash

Advanced Topics

Consistent hashing: Used in distributed systems (Cassandra, DynamoDB). Minimizes reshuffling when adding/removing servers.

Bloom filters: Use multiple hash functions for membership testing (covered in earlier article).

Perfect hashing: Two-level scheme guaranteeing O(1) worst-case for static sets.

When to Use Hash Tables

Use when: Fast lookup by key is primary operation, keys are simple types, memory is available.

Avoid when: Ordered iteration matters (use tree), keys are complex objects, memory is tight.

Performance Considerations

Time complexity: Average O(1) lookup/insert/delete, worst O(n) with poor hash function
Space complexity: O(n) for stored elements plus overhead
Cache: Chaining is cache-unfriendly; open addressing is better
Thread safety: Not thread-safe; use concurrent implementations

Hash Function Design Deep Dive

Division Method

h(k) = k mod m

Choosing m is critical. Prime numbers far from powers of 2 distribute keys best. For example, m = 10007 (a prime) typically distributes integer keys more uniformly than m = 10000.

def division_hash(key, table_size):
    return key % table_size

Drawback: Certain patterns in keys (e.g., all multiples of m) produce worst-case clustering.

Multiplication Method

h(k) = floor(m (kA mod 1)) where A is a constant in (0, 1)

Knuth recommends A ≈ (√5 - 1) / 2 ≈ 0.6180339 (the golden ratio conjugate). This method is less sensitive to key patterns.

import math

GOLDEN_RATIO = (math.sqrt(5) - 1) / 2

def multiplication_hash(key, table_size):
    fractional = (key * GOLDEN_RATIO) % 1
    return int(table_size * fractional)

Universal Hashing

A family of hash functions H is universal if for any distinct keys x, y, the probability that h(x) = h(y) is at most 1/m.

import random

class UniversalHash:
    def __init__(self, table_size):
        self.p = 10**9 + 7  # Large prime
        self.a = random.randint(1, self.p - 1)
        self.b = random.randint(0, self.p - 1)
        self.m = table_size

    def hash(self, key):
        return ((self.a * key + self.b) % self.p) % self.m

Universal hashing guarantees expected O(1) operations regardless of input distribution.

Collision Resolution Deep Dive

Chaining Analysis

With chaining, the load factor α = n/m represents the average chain length. Search complexity is O(1 + α). For α < 1, average chain length is less than 1, giving near-constant performance.

class HashTableChaining:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.size = 0
        self.table = [[] for _ in range(capacity)]

    def _hash(self, key):
        return hash(key) % self.capacity

    def insert(self, key, value):
        idx = self._hash(key)
        for i, (k, v) in enumerate(self.table[idx]):
            if k == key:
                self.table[idx][i] = (key, value)
                return
        self.table[idx].append((key, value))
        self.size += 1

    def get(self, key):
        idx = self._hash(key)
        for k, v in self.table[idx]:
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        idx = self._hash(key)
        for i, (k, v) in enumerate(self.table[idx]):
            if k == key:
                del self.table[idx][i]
                self.size -= 1
                return
        raise KeyError(key)

Open Addressing: Linear Probing

Probe sequence: h(k), h(k)+1, h(k)+2, … mod m

class HashTableLinearProbing:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _hash(self, key):
        return hash(key) % self.capacity

    def insert(self, key, value):
        idx = self._hash(key)
        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                self.values[idx] = value
                return
            idx = (idx + 1) % self.capacity
        self.keys[idx] = key
        self.values[idx] = value
        self.size += 1

    def get(self, key):
        idx = self._hash(key)
        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                return self.values[idx]
            idx = (idx + 1) % self.capacity
        raise KeyError(key)

Primary clustering: Consecutive occupied slots form clusters, degrading performance. Once a cluster forms, it grows faster because new insertions are more likely to land in the cluster.

Open Addressing: Quadratic Probing

Probe sequence: h(k), h(k)+1², h(k)+2², h(k)+3², … mod m

Quadratic probing avoids primary clustering but suffers from secondary clustering—keys that hash to the same initial position follow the same probe sequence.

class HashTableQuadraticProbing:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def insert(self, key, value):
        idx = self._hash(key)
        i = 1
        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                self.values[idx] = value
                return
            idx = (idx + i * i) % self.capacity
            i += 1
        self.keys[idx] = key
        self.values[idx] = value
        self.size += 1

Open Addressing: Double Hashing

Probe sequence: h₁(k), h₁(k) + h₂(k), h₁(k) + 2h₂(k), … mod m

Double hashing uses a second hash function for the step size, eliminating both primary and secondary clustering.

class HashTableDoubleHashing:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.size = 0
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _hash1(self, key):
        return hash(key) % self.capacity

    def _hash2(self, key):
        return 1 + (hash(key) % (self.capacity - 1))

    def insert(self, key, value):
        idx = self._hash1(key)
        step = self._hash2(key)
        while self.keys[idx] is not None:
            if self.keys[idx] == key:
                self.values[idx] = value
                return
            idx = (idx + step) % self.capacity
        self.keys[idx] = key
        self.values[idx] = value
        self.size += 1

Performance Comparison

Method	Cluster Type	Cache Performance	Deletion Complexity	Worst-Case Probe Count
Chaining	None	Poor (pointer chasing)	Simple	α
Linear Probing	Primary	Excellent (sequential)	Tricky (tombstones)	α/(1-α)
Quadratic Probing	Secondary	Good	Tricky	1/(1-α)
Double Hashing	None	Moderate	Tricky	1/(1-α)

Load Factor and Resizing

Load Factor Analysis

The load factor α directly impacts performance:

α < 0.5: Very fast, but memory wasteful
α ≈ 0.7: Good balance (Java HashMap default threshold)
α ≈ 0.75: Python dict default threshold
α > 0.9: Performance degrades significantly for open addressing
α > 1.0: Only possible with chaining

Expected probes for successful search in open addressing:

Linear probing: 0.5(1 + 1/(1-α))
Quadratic probing: 1 - ln(1-α) - α/2
Double hashing: -(ln(1-α))/α

Resizing Strategies

When the load factor exceeds the threshold, the table must grow:

Double the capacity: Most common choice. Ensures amortized O(1) inserts.
Rehash all entries: O(n) operation, but infrequent.
Incremental resizing: Spread the rehashing cost across multiple operations (used in some DB systems).

class ResizableHashTable(HashTableChaining):
    GROWTH_FACTOR = 2
    LOAD_FACTOR_THRESHOLD = 0.75

    def insert(self, key, value):
        super().insert(key, value)
        if self.size / self.capacity > self.LOAD_FACTOR_THRESHOLD:
            self._resize()

    def _resize(self):
        old_table = self.table
        self.capacity *= self.GROWTH_FACTOR
        self.table = [[] for _ in range(self.capacity)]
        self.size = 0
        for bucket in old_table:
            for key, value in bucket:
                self._insert_no_resize(key, value)

    def _insert_no_resize(self, key, value):
        idx = self._hash(key)
        self.table[idx].append((key, value))
        self.size += 1

Python Dict Internals

Python’s dict uses a specialized open-addressing scheme. Key design decisions:

Randomized hash seeds: Per-process randomization prevents DoS attacks
Compact table: Stores indices in a sparse array, entries in a dense array
Preserves insertion order: Since Python 3.7, dict maintains insertion order using a combined array structure
String interning: String keys are interned for faster hashing

# CPython's hash function for strings
# Uses a modified FNV (Fowler-Noll-Vo) hash
def _py_str_hash(s):
    h = 0
    for c in s:
        h = (h * 1000003) ^ ord(c)
    return h ^ len(s)

The Python dict starts with capacity 8 and resists at α ≈ 2/3. With 50% of slots reserved for growth, memory overhead is roughly 33%.

Java HashMap Internals

Java HashMap (JDK 8+) uses chaining with a critical optimization:

Initial capacity: 16 by default
Load factor: 0.75 by default
Treeify threshold: When a bucket’s chain length exceeds 8, the linked list is converted to a red-black tree
Untreeify threshold: When a bucket’s chain length drops below 6, it reverts to a linked list

// Simplified Java HashMap put (JDK 17)
final V putVal(int hash, K key, V value, boolean onlyIfAbsent) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        if (p.hash == hash && ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1)
                        treeifyBin(tab, hash);
                    break;
                }
                if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
    }
}

The bitwise AND (n - 1) & hash replaces the modulo operation when capacity is a power of 2.

Real-World Use Cases with Code

LRU Cache

Combining a hash table with a doubly linked list provides O(1) get and put:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, key):
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

Word Frequency Counter

from collections import defaultdict

def word_frequencies(text):
    freqs = defaultdict(int)
    for word in text.lower().split():
        freqs[word] += 1
    return dict(freqs)

Two-Sum Problem

def two_sum(nums, target):
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return []

External Resources

Conclusion

Hash tables are foundational data structures with nearly constant-time operations. Understanding their trade-offs—chaining vs open addressing, hash function quality, load factor management—helps you use them effectively.

Next time you reach for a dictionary or map, remember the elegant simplicity of hashing turning arbitrary keys into fast lookups.

Hash Tables: The Workhorse of Computer Science

Introduction

The Core Idea

Hash Functions

Collision Resolution

Chaining (Separate Linking)

Open Addressing

Resizing and Rehashing

Practical Implementations

Handling Unhashable Keys

Advanced Topics

When to Use Hash Tables

Performance Considerations

Hash Function Design Deep Dive

Division Method

Multiplication Method

Universal Hashing

Collision Resolution Deep Dive

Chaining Analysis

Open Addressing: Linear Probing

Open Addressing: Quadratic Probing

Open Addressing: Double Hashing

Performance Comparison

Load Factor and Resizing

Load Factor Analysis

Resizing Strategies

Python Dict Internals

Java HashMap Internals

Real-World Use Cases with Code

LRU Cache

Word Frequency Counter

Two-Sum Problem

External Resources

Conclusion

Comments

Share this article

👍 Was this article helpful?