Introduction
If there’s one data structure that every programmer uses daily without thinking about it, it’s the hash table. Whether you’re using a Python dictionary, a JavaScript object, a Java HashMap, or a C++ unordered_map—you’re using a hash table. They’re the backbone of databases, caches, and countless applications.
Understanding hash tables deeply helps you make better architectural decisions and debug performance issues. This article explores how hash tables work, different implementations, and practical considerations.
The Core Idea
A hash table stores key-value pairs, allowing fast lookup by key. The magic is converting keys (strings, numbers, objects) into array indices through a hash function.
In the ideal case: Key → Hash Function → Index → Value (O(1) lookup)
The challenge is handling collisions—when two different keys hash to the same index.
Hash Functions
A good hash function distributes keys uniformly across the array. Properties:
- Deterministic: Same key always produces same hash
- Fast: Computing the hash should be quick
- Uniform: Keys spread evenly across all buckets
Common hash functions:
- Division method: h(k) = k mod m (choose m as prime, not power of 2)
- Multiplication method: h(k) = floor(m(kA mod 1)) where A is constant
- Universal hashing: Randomize hash function to prevent worst-case behavior
For strings, algorithms like FNV or MurmurHash perform well.
Collision Resolution
When two keys collide, we need a strategy:
Chaining (Separate Linking)
Each bucket is a linked list. Collisions just add to the list.
Pros: Simple, handles arbitrary load factor, delete is easy Cons: Cache performance poor, worst-case O(n)
Load factor (α): n/m = number of elements / number of buckets. Keep α < 0.75 for good performance.
Open Addressing
All elements stored in the table itself. On collision, probe for another slot.
Linear probing: Check next slot: h(k), h(k)+1, h(k)+2… Quadratic probing: Check slots with quadratic spacing Double hashing: Use second hash function for probe sequence
Pros: Better cache performance, no pointers Cons: Clustering issues, delete is tricky
Resizing and Rehashing
When load factor exceeds threshold (typically 0.7), resize the table:
- Create new, larger table (typically 2x size)
- Rehash all elements into new table
- Use new hash function
This is expensive (O(n)) but happens infrequently. Amortized cost remains O(1).
Practical Implementations
Python dict: Uses open addressing with randomized seeds. Keys must be hashable.
Java HashMap: Uses chaining with treeify threshold (JDK 8+ converts linked list to red-black tree for long chains).
Go map: Uses open addressing with random iteration order.
Rust HashMap: Uses SipHash for DoS resistance.
Handling Unhashable Keys
Some keys can’t be hashed (lists, dicts). Solutions:
- Convert to immutable equivalent (tuple)
- Use custom hash based on content
- Use identity hash
Advanced Topics
Consistent hashing: Used in distributed systems (Cassandra, DynamoDB). Minimizes reshuffling when adding/removing servers.
Bloom filters: Use multiple hash functions for membership testing (covered in earlier article).
Perfect hashing: Two-level scheme guaranteeing O(1) worst-case for static sets.
When to Use Hash Tables
Use when: Fast lookup by key is primary operation, keys are simple types, memory is available.
Avoid when: Ordered iteration matters (use tree), keys are complex objects, memory is tight.
Performance Considerations
- Time complexity: Average O(1) lookup/insert/delete, worst O(n) with poor hash function
- Space complexity: O(n) for stored elements plus overhead
- Cache: Chaining is cache-unfriendly; open addressing is better
- Thread safety: Not thread-safe; use concurrent implementations
Hash Function Design Deep Dive
Division Method
h(k) = k mod m
Choosing m is critical. Prime numbers far from powers of 2 distribute keys best. For example, m = 10007 (a prime) typically distributes integer keys more uniformly than m = 10000.
def division_hash(key, table_size):
return key % table_size
Drawback: Certain patterns in keys (e.g., all multiples of m) produce worst-case clustering.
Multiplication Method
h(k) = floor(m (kA mod 1)) where A is a constant in (0, 1)
Knuth recommends A ≈ (√5 - 1) / 2 ≈ 0.6180339 (the golden ratio conjugate). This method is less sensitive to key patterns.
import math
GOLDEN_RATIO = (math.sqrt(5) - 1) / 2
def multiplication_hash(key, table_size):
fractional = (key * GOLDEN_RATIO) % 1
return int(table_size * fractional)
Universal Hashing
A family of hash functions H is universal if for any distinct keys x, y, the probability that h(x) = h(y) is at most 1/m.
import random
class UniversalHash:
def __init__(self, table_size):
self.p = 10**9 + 7 # Large prime
self.a = random.randint(1, self.p - 1)
self.b = random.randint(0, self.p - 1)
self.m = table_size
def hash(self, key):
return ((self.a * key + self.b) % self.p) % self.m
Universal hashing guarantees expected O(1) operations regardless of input distribution.
Collision Resolution Deep Dive
Chaining Analysis
With chaining, the load factor α = n/m represents the average chain length. Search complexity is O(1 + α). For α < 1, average chain length is less than 1, giving near-constant performance.
class HashTableChaining:
def __init__(self, capacity=16):
self.capacity = capacity
self.size = 0
self.table = [[] for _ in range(capacity)]
def _hash(self, key):
return hash(key) % self.capacity
def insert(self, key, value):
idx = self._hash(key)
for i, (k, v) in enumerate(self.table[idx]):
if k == key:
self.table[idx][i] = (key, value)
return
self.table[idx].append((key, value))
self.size += 1
def get(self, key):
idx = self._hash(key)
for k, v in self.table[idx]:
if k == key:
return v
raise KeyError(key)
def delete(self, key):
idx = self._hash(key)
for i, (k, v) in enumerate(self.table[idx]):
if k == key:
del self.table[idx][i]
self.size -= 1
return
raise KeyError(key)
Open Addressing: Linear Probing
Probe sequence: h(k), h(k)+1, h(k)+2, … mod m
class HashTableLinearProbing:
def __init__(self, capacity=16):
self.capacity = capacity
self.size = 0
self.keys = [None] * capacity
self.values = [None] * capacity
def _hash(self, key):
return hash(key) % self.capacity
def insert(self, key, value):
idx = self._hash(key)
while self.keys[idx] is not None:
if self.keys[idx] == key:
self.values[idx] = value
return
idx = (idx + 1) % self.capacity
self.keys[idx] = key
self.values[idx] = value
self.size += 1
def get(self, key):
idx = self._hash(key)
while self.keys[idx] is not None:
if self.keys[idx] == key:
return self.values[idx]
idx = (idx + 1) % self.capacity
raise KeyError(key)
Primary clustering: Consecutive occupied slots form clusters, degrading performance. Once a cluster forms, it grows faster because new insertions are more likely to land in the cluster.
Open Addressing: Quadratic Probing
Probe sequence: h(k), h(k)+1², h(k)+2², h(k)+3², … mod m
Quadratic probing avoids primary clustering but suffers from secondary clustering—keys that hash to the same initial position follow the same probe sequence.
class HashTableQuadraticProbing:
def __init__(self, capacity=16):
self.capacity = capacity
self.size = 0
self.keys = [None] * capacity
self.values = [None] * capacity
def insert(self, key, value):
idx = self._hash(key)
i = 1
while self.keys[idx] is not None:
if self.keys[idx] == key:
self.values[idx] = value
return
idx = (idx + i * i) % self.capacity
i += 1
self.keys[idx] = key
self.values[idx] = value
self.size += 1
Open Addressing: Double Hashing
Probe sequence: h₁(k), h₁(k) + h₂(k), h₁(k) + 2h₂(k), … mod m
Double hashing uses a second hash function for the step size, eliminating both primary and secondary clustering.
class HashTableDoubleHashing:
def __init__(self, capacity=16):
self.capacity = capacity
self.size = 0
self.keys = [None] * capacity
self.values = [None] * capacity
def _hash1(self, key):
return hash(key) % self.capacity
def _hash2(self, key):
return 1 + (hash(key) % (self.capacity - 1))
def insert(self, key, value):
idx = self._hash1(key)
step = self._hash2(key)
while self.keys[idx] is not None:
if self.keys[idx] == key:
self.values[idx] = value
return
idx = (idx + step) % self.capacity
self.keys[idx] = key
self.values[idx] = value
self.size += 1
Performance Comparison
| Method | Cluster Type | Cache Performance | Deletion Complexity | Worst-Case Probe Count |
|---|---|---|---|---|
| Chaining | None | Poor (pointer chasing) | Simple | α |
| Linear Probing | Primary | Excellent (sequential) | Tricky (tombstones) | α/(1-α) |
| Quadratic Probing | Secondary | Good | Tricky | 1/(1-α) |
| Double Hashing | None | Moderate | Tricky | 1/(1-α) |
Load Factor and Resizing
Load Factor Analysis
The load factor α directly impacts performance:
- α < 0.5: Very fast, but memory wasteful
- α ≈ 0.7: Good balance (Java HashMap default threshold)
- α ≈ 0.75: Python dict default threshold
- α > 0.9: Performance degrades significantly for open addressing
- α > 1.0: Only possible with chaining
Expected probes for successful search in open addressing:
- Linear probing: 0.5(1 + 1/(1-α))
- Quadratic probing: 1 - ln(1-α) - α/2
- Double hashing: -(ln(1-α))/α
Resizing Strategies
When the load factor exceeds the threshold, the table must grow:
- Double the capacity: Most common choice. Ensures amortized O(1) inserts.
- Rehash all entries: O(n) operation, but infrequent.
- Incremental resizing: Spread the rehashing cost across multiple operations (used in some DB systems).
class ResizableHashTable(HashTableChaining):
GROWTH_FACTOR = 2
LOAD_FACTOR_THRESHOLD = 0.75
def insert(self, key, value):
super().insert(key, value)
if self.size / self.capacity > self.LOAD_FACTOR_THRESHOLD:
self._resize()
def _resize(self):
old_table = self.table
self.capacity *= self.GROWTH_FACTOR
self.table = [[] for _ in range(self.capacity)]
self.size = 0
for bucket in old_table:
for key, value in bucket:
self._insert_no_resize(key, value)
def _insert_no_resize(self, key, value):
idx = self._hash(key)
self.table[idx].append((key, value))
self.size += 1
Python Dict Internals
Python’s dict uses a specialized open-addressing scheme. Key design decisions:
- Randomized hash seeds: Per-process randomization prevents DoS attacks
- Compact table: Stores indices in a sparse array, entries in a dense array
- Preserves insertion order: Since Python 3.7, dict maintains insertion order using a combined array structure
- String interning: String keys are interned for faster hashing
# CPython's hash function for strings
# Uses a modified FNV (Fowler-Noll-Vo) hash
def _py_str_hash(s):
h = 0
for c in s:
h = (h * 1000003) ^ ord(c)
return h ^ len(s)
The Python dict starts with capacity 8 and resists at α ≈ 2/3. With 50% of slots reserved for growth, memory overhead is roughly 33%.
Java HashMap Internals
Java HashMap (JDK 8+) uses chaining with a critical optimization:
- Initial capacity: 16 by default
- Load factor: 0.75 by default
- Treeify threshold: When a bucket’s chain length exceeds 8, the linked list is converted to a red-black tree
- Untreeify threshold: When a bucket’s chain length drops below 6, it reverts to a linked list
// Simplified Java HashMap put (JDK 17)
final V putVal(int hash, K key, V value, boolean onlyIfAbsent) {
Node<K,V>[] tab; Node<K,V> p; int n, i;
if ((tab = table) == null || (n = tab.length) == 0)
n = (tab = resize()).length;
if ((p = tab[i = (n - 1) & hash]) == null)
tab[i] = newNode(hash, key, value, null);
else {
Node<K,V> e; K k;
if (p.hash == hash && ((k = p.key) == key || (key != null && key.equals(k))))
e = p;
else if (p instanceof TreeNode)
e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
else {
for (int binCount = 0; ; ++binCount) {
if ((e = p.next) == null) {
p.next = newNode(hash, key, value, null);
if (binCount >= TREEIFY_THRESHOLD - 1)
treeifyBin(tab, hash);
break;
}
if (e.hash == hash && ((k = e.key) == key || (key != null && key.equals(k))))
break;
p = e;
}
}
}
}
The bitwise AND (n - 1) & hash replaces the modulo operation when capacity is a power of 2.
Real-World Use Cases with Code
LRU Cache
Combining a hash table with a doubly linked list provides O(1) get and put:
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.cache = OrderedDict()
self.capacity = capacity
def get(self, key):
if key not in self.cache:
return -1
self.cache.move_to_end(key)
return self.cache[key]
def put(self, key, value):
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = value
if len(self.cache) > self.capacity:
self.cache.popitem(last=False)
Word Frequency Counter
from collections import defaultdict
def word_frequencies(text):
freqs = defaultdict(int)
for word in text.lower().split():
freqs[word] += 1
return dict(freqs)
Two-Sum Problem
def two_sum(nums, target):
seen = {}
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return [seen[complement], i]
seen[num] = i
return []
External Resources
Conclusion
Hash tables are foundational data structures with nearly constant-time operations. Understanding their trade-offs—chaining vs open addressing, hash function quality, load factor management—helps you use them effectively.
Next time you reach for a dictionary or map, remember the elegant simplicity of hashing turning arbitrary keys into fast lookups.
Comments