Bloom Filters: Probabilistic Data Structures

Introduction

What if you could tell whether an element is “possibly in a set” or “definitely not in a set” using just 1.44 bytes per element—roughly 10 times more compact than storing the elements themselves? That’s exactly what Bloom filters allow.

Bloom filters are probabilistic data structures that answer membership queries with remarkable space efficiency. They’re used by major tech companies for everything from web caching to blockchain to spell checking.

This article explores Bloom filters, explaining how they work, why they’re so efficient, and how to use them effectively.

The Membership Testing Problem

You have a set of elements and need to answer: “Is element X in the set?”

Traditional approaches:

Hash table: O(1) lookup, but stores the actual element
Perfect hash: Requires knowing all elements upfront, complex to build

Bloom filters offer a different trade-off: extremely compact, constant-time lookups, but with a small probability of false positives (saying “yes” when the element isn’t actually present).

Crucially, they never produce false negatives— if they say an element isn’t present, it’s definitely not there.

How Bloom Filters Work

A Bloom filter consists of:

An array of m bits, initially all set to 0
k independent hash functions

Adding an element:

Compute k hash values for the element
Set the bits at those positions to 1

Checking membership:

Compute k hash values for the element
If all corresponding bits are 1, return “possibly present”
If any bit is 0, return “definitely not present”

The key insight is that multiple elements can map to the same bit positions. This sharing of bits is what makes Bloom filters space-efficient, but also causes false positives.

Understanding False Positives

The probability of a false positive depends on:

Number of elements inserted (n)
Size of the bit array (m)
Number of hash functions (k)

The formula for false positive probability is approximately:

p ≈ (1 - e^(-kn/m))^k

Interestingly, there’s an optimal value for k: k = (m/n) × ln(2)

With optimal parameters and n elements, a Bloom filter uses about 1.44 bytes per element for a 1% false positive rate.

Implementation Considerations

Choosing parameters:

More bits = lower false positive rate but more memory
More hash functions = more computation but fewer collisions

The math:

For 1% false positives: use about 9.6 bits per element
For 10% false positives: use about 4.8 bits per element
For 0.1% false positives: use about 14.4 bits per element

Practical Applications

Bloom filters are used extensively in production systems:

Web caching: Check if a URL is in cache before hitting the backend. False positives just cause an unnecessary cache miss—acceptable.

Spell checkers: Quickly check if a word might be in the dictionary. False positives are caught by the actual lookup.

Database systems: Check if a key might exist before performing an expensive disk seek. Reduces unnecessary I/O operations.

Blockchain: Bitcoin uses Bloom filters for SPV (Simplified Payment Verification) clients to query wallet addresses without downloading entire blocks.

Network routers: Check against lists of malicious URLs or IP addresses.

Duplicate detection: Check if an item might be a duplicate before doing expensive comparison.

Variations and Extensions

Scalable Bloom filters: Automatically grow to maintain false positive probability as elements are added.

Counting Bloom filters: Allow deletions by using counters instead of bits.

Compressed Bloom filters: Compress the bit array when sending over networks.

Cuckoo filters: A newer alternative that supports deletions and has better locality.

When to Use Bloom Filters

Bloom filters are perfect when:

False positives are acceptable (but false negatives are not)
Memory is constrained
You need fast membership testing
You don’t need to retrieve the actual element

They might not be suitable when:

You need exact membership (use hash tables)
You need to delete elements (use counting Bloom filters or cuckoo filters)
You need to retrieve stored elements

A Simple Implementation

import hashlib

class BloomFilter:
    def __init__(self, size=1000, hashes=7):
        self.size = size
        self.hashes = hashes
        self.bit_array = [False] * size
    
    def _get_positions(self, item):
        result = []
        for i in range(self.hashes):
            hash_input = f"{item}{i}".encode()
            hash_val = int(hashlib.md5(hash_input).hexdigest(), 16)
            result.append(hash_val % self.size)
        return result
    
    def add(self, item):
        for pos in self._get_positions(item):
            self.bit_array[pos] = True
    
    def might_contain(self, item):
        return all(self.bit_array[pos] for pos in self._get_positions(item))

False Positive Probability Derivation

After inserting n elements into an m-bit array with k hash functions, the probability that a specific bit is still 0 is:

P(bit = 0) = (1 - 1/m)^(kn) ≈ e^(-kn/m)

Therefore, the probability of a false positive (all k bits set to 1 for a non-member) is:

p = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k

This approximation is accurate for large m and n.

Optimal Number of Hash Functions

Minimizing p with respect to k gives:

k_optimal = (m/n) × ln(2)

Substituting back: p_min = (1/2)^k = (0.6185)^(m/n)

def optimal_hashes(m, n):
    """Calculate optimal number of hash functions."""
    return max(1, round((m / n) * 0.693))

def false_positive_probability(m, n, k):
    """Calculate false positive probability."""
    from math import exp
    return (1 - exp(-k * n / m)) ** k

def estimate_m(n, p):
    """Estimate bits needed for given n and target false positive rate p."""
    from math import log, log2, ceil
    return ceil(-n * log(p) / (log(2) ** 2))

Parameter Tuning Guide

Desired FPR	Bits/Element (m/n)	Optimal k	Filter Size for 1M Elements
10%	4.8	3	600 KB
5%	6.2	4	775 KB
1%	9.6	7	1.2 MB
0.1%	14.4	10	1.8 MB
0.01%	19.2	13	2.4 MB

Counting Bloom Filters

Standard Bloom filters do not support deletion. Counting Bloom filters replace each bit with a small counter (typically 4 bits).

class CountingBloomFilter:
    def __init__(self, size=1000, hashes=7, counter_bits=4):
        self.size = size
        self.hashes = hashes
        self.max_count = (1 << counter_bits) - 1
        self.counters = [0] * size

    def _get_positions(self, item):
        result = []
        for i in range(self.hashes):
            h = int(hashlib.md5(f"{item}{i}".encode()).hexdigest(), 16)
            result.append(h % self.size)
        return result

    def add(self, item):
        for pos in self._get_positions(item):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1

    def remove(self, item):
        for pos in self._get_positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def might_contain(self, item):
        return all(self.counters[pos] > 0 for pos in self._get_positions(item))

Trade-off: Counter overflow is possible with 4-bit counters. The probability of overflow for a single counter after n insertions is bounded by (e * kn / (m * 8))^8.

Cuckoo Filters

Cuckoo filters are a modern alternative offering several advantages:

Deletion support: Native, not approximated
Better cache locality: Uses smaller fingerprints
Higher load factor: Up to 95% vs 50% for optimal Bloom filters
Lower space for low FPR: More space-efficient below 3% FPR

import hashlib, struct

class CuckooFilter:
    def __init__(self, capacity, fingerprint_bits=8, max_kicks=500):
        self.capacity = capacity
        self.fingerprint_bits = fingerprint_bits
        self.max_kicks = max_kicks
        self.table = [None] * capacity
        self.size = 0

    def _fingerprint(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        return struct.unpack('B', h[:1])[0] & ((1 << self.fingerprint_bits) - 1)

    def _hash(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        return struct.unpack('I', h[:4])[0] % self.capacity

    def _alt_index(self, index, fingerprint):
        return (index ^ hash(fingerprint)) % self.capacity

    def insert(self, item):
        if self.size >= self.capacity:
            return False
        fp = self._fingerprint(item)
        i1 = self._hash(item)
        i2 = self._alt_index(i1, fp)
        if self.table[i1] is None:
            self.table[i1] = fp
        elif self.table[i2] is None:
            self.table[i2] = fp
        else:
            i = i1
            for _ in range(self.max_kicks):
                self.table[i], fp = fp, self.table[i]
                i = self._alt_index(i, fp)
                if self.table[i] is None:
                    self.table[i] = fp
                    self.size += 1
                    return True
            return False
        self.size += 1
        return True

    def might_contain(self, item):
        fp = self._fingerprint(item)
        i1 = self._hash(item)
        i2 = self._alt_index(i1, fp)
        return self.table[i1] == fp or self.table[i2] == fp

Comparison: Bloom vs Cuckoo Filters

Feature	Bloom Filter	Cuckoo Filter
Deletion	Not supported (Counting variant exists)	Native support
Space efficiency	Better at high FPR (≥3%)	Better at low FPR (<3%)
Lookup time	O(k) hash computations	O(1) hash + O(1) probe
Max load factor	Limited by m (≥50%)	Up to 95%
False negatives	None	None
Insertion time	O(k)	O(1) average, may require relocation

Redis provides Bloom filters through the RedisBloom module (redis-stack):

# Create a Bloom filter
BF.RESERVE myfilter 0.01 1000000

# Add elements
BF.ADD myfilter user:1234
BF.ADD myfilter user:5678
BF.MADD myfilter user:9012 user:3456

# Check membership
BF.EXISTS myfilter user:1234  # Returns 1
BF.EXISTS myfilter user:9999  # Returns 0

# Check multiple
BF.MEXISTS myfilter user:1234 user:9999

# Get filter info
BF.INFO myfilter

Redis Bloom filters are ideal for caching layers, spam detection, and preventing cache penetration.

Real-World Use Cases

Web Crawler URL Deduplication

class CrawlerBloomFilter:
    def __init__(self):
        self.filter = BloomFilter(size=10_000_000, hashes=7)
        self.seen_count = 0

    def should_crawl(self, url):
        if self.filter.might_contain(url):
            return False
        self.filter.add(url)
        self.seen_count += 1
        return True

    def statistic(self):
        return self.seen_count

Cache Penetration Prevention

class CacheWithBloom:
    def __init__(self, backend_cache, db):
        self.bloom = BloomFilter(size=1_000_000, hashes=7)
        self.cache = backend_cache
        self.db = db

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None
        result = self.cache.get(key)
        if result is not None:
            return result
        result = self.db.query(key)
        if result is not None:
            self.cache.set(key, result)
        return result

    def warm_up(self, keys):
        for key in keys:
            self.bloom.add(key)

CDN Edge Caching

CDNs use Bloom filters to track which objects are cached at each edge node. Before fetching from origin, the edge checks its Bloom filter. False positives cause a cache miss (acceptable), while true negatives save a round trip.

Email Spam Detection

Email providers maintain Bloom filters of known spam sender domains. When an email arrives, the sender domain is checked against the filter. A negative result guarantees the domain is not on the list. A positive result triggers further checks.

BigTable and HBase

Google BigTable and Apache HBase use Bloom filters to reduce disk lookups. Each SSTable (Sorted String Table) has an associated Bloom filter. Before performing an expensive disk seek, the system checks the filter. If the filter says the key is absent, the seek is skipped entirely.

class SSTableWithBloom:
    def __init__(self, sstable_path):
        self.sstable = sstable_path
        self.bloom = BloomFilter(size=1_000_000, hashes=7)

    def build_from_data(self, records):
        for key, value in records:
            self.bloom.add(key)
            self._write_record(key, value)

    def might_contain_key(self, key):
        return self.bloom.might_contain(key)

    def get(self, key):
        if not self.might_contain_key(key):
            return None  # Skip disk seek
        return self._disk_lookup(key)  # May still miss

Blockchain SPV Nodes

Bitcoin Simplified Payment Verification (SPV) clients use Bloom filters to request relevant transactions from full nodes without revealing their addresses:

class SPVClient:
    def __init__(self):
        self.bloom = BloomFilter(size=1024, hashes=5)

    def add_address(self, address_bytes):
        self.bloom.add(address_bytes)

    def get_filter(self):
        return self.bloom  # Send to full node

    def process_merkle_block(self, block):
        for tx in block.transactions:
            for address in self.my_addresses:
                if self.bloom.might_contain(address.encode()):
                    self._download_full_tx(tx)

Blocked Bloom Filters

Modern CPUs benefit from data locality. Blocked Bloom filters divide the bit array into cache-line-sized blocks (typically 512 bits):

import math

class BlockedBloomFilter:
    def __init__(self, entries, fpr=0.01):
        bits_per_entry = -math.log(fpr) / (math.log(2) ** 2)
        total_bits = int(entries * bits_per_entry)
        self.block_size = 512  # bits per block
        self.num_blocks = math.ceil(total_bits / self.block_size)
        self.blocks = [0] * self.num_blocks
        self.num_hashes = max(1, round(self.block_size / entries * math.log(2)))

    def _hash_to_block(self, item):
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        return h % self.num_blocks

    def _hash_within_block(self, item, k):
        h = int(hashlib.md5(f"{item}:{k}".encode()).hexdigest(), 16)
        return h % self.block_size

    def add(self, item):
        block_idx = self._hash_to_block(item)
        for k in range(self.num_hashes):
            bit = self._hash_within_block(item, k)
            self.blocks[block_idx] |= (1 << bit)

    def might_contain(self, item):
        block_idx = self._hash_to_block(item)
        for k in range(self.num_hashes):
            bit = self._hash_within_block(item, k)
            if not (self.blocks[block_idx] & (1 << bit)):
                return False
        return True

Blocked Bloom filters are 2-3x faster than standard Bloom filters because each membership check touches only one cache line.

Scalable Bloom Filters

Standard Bloom filters require knowing n (expected number of elements) upfront. Scalable Bloom filters solve this by creating a series of filters with geometrically decreasing false positive rates:

class ScalableBloomFilter:
    def __init__(self, initial_capacity=1000, fpr=0.01, scaling_factor=2):
        self.filters = []
        self.fpr = fpr
        self.scaling_factor = scaling_factor
        self._add_filter(initial_capacity, fpr)

    def _add_filter(self, capacity, fpr):
        bits = int(-capacity * math.log(fpr) / (math.log(2) ** 2))
        hashes = max(1, round(bits / capacity * math.log(2)))
        self.filters.append({
            'bits': bits,
            'hashes': hashes,
            'bit_array': [False] * bits,
            'count': 0,
            'capacity': capacity
        })

    def add(self, item):
        current = self.filters[-1]
        if current['count'] >= current['capacity']:
            new_cap = current['capacity'] * self.scaling_factor
            new_fpr = self.fpr * (1 - 1/self.scaling_factor)
            self._add_filter(new_cap, new_fpr)
            current = self.filters[-1]
        for k in range(current['hashes']):
            h = int(hashlib.md5(f"{item}:{k}".encode()).hexdigest(), 16)
            current['bit_array'][h % current['bits']] = True
        current['count'] += 1

    def might_contain(self, item):
        for filt in reversed(self.filters):
            for k in range(filt['hashes']):
                h = int(hashlib.md5(f"{item}:{k}".encode()).hexdigest(), 16)
                if not filt['bit_array'][h % filt['bits']]:
                    break
            else:
                return True
        return False

Bloom Filter Variants Comparison

Variant	Deletion	Space	Lookup Speed	Use Case
Standard	No	1.44 bytes/elem (1% FPR)	O(k) hashes	General
Counting	Yes (limited)	4x standard	O(k) hashes	Caches, streaming
Blocked	No	Same as standard	Cache-line fast	High-throughput
Scalable	No	Slightly more	O(k * m)	Unknown n
Compressed	No	50-80% less on wire	Same	Network transmission
Cuckoo	Yes	1.1 bytes/elem	O(1) probes	Low FPR, deletion

External Resources

Conclusion

Bloom filters represent a fascinating trade-off in computer science: accepting some uncertainty in exchange for dramatic space savings. They embody the idea that sometimes “good enough” is sufficient.

The next time you need membership testing and can tolerate occasional false positives, consider a Bloom filter. Their simplicity, efficiency, and wide adoption make them an essential tool in any programmer’s toolkit.

Bloom Filters: Probabilistic Data Structures

Introduction

The Membership Testing Problem

How Bloom Filters Work

Understanding False Positives

Implementation Considerations

Practical Applications

Variations and Extensions

When to Use Bloom Filters

A Simple Implementation

False Positive Probability Derivation

Optimal Number of Hash Functions

Parameter Tuning Guide

Counting Bloom Filters

Cuckoo Filters

Redis Bloom Filter Module

Real-World Use Cases

Web Crawler URL Deduplication

Cache Penetration Prevention

CDN Edge Caching

Email Spam Detection

BigTable and HBase

Blockchain SPV Nodes

Blocked Bloom Filters

Scalable Bloom Filters

Bloom Filter Variants Comparison

External Resources

Conclusion

Comments

Share this article

👍 Was this article helpful?