Introduction
Consider these problems: Are two people in the same social network connected? Are two pixels in the same region of an image? Is adding this road to the highway network going to create a cycle? These seemingly different questions share a common theme—they’re all about connectivity.
The Union-Find data structure, also known as Disjoint Set Union (DSU), provides an elegant and efficient solution for such connectivity problems. Despite its simplicity, it’s one of the most useful data structures in computer science, forming the backbone of algorithms like Kruskal’s minimum spanning tree and network connectivity analysis.
This article explores Union-Find, explaining how it works and why it’s so efficient.
The Basic Problem
We start with n elements, each in its own set. We need to support two operations:
Find: Given an element, identify which set it belongs to (usually represented by a “representative” element)
Union: Merge two sets into one
That’s it. Just these two operations, but they enable solving remarkably complex problems.
Naive Approaches and Their Problems
You could represent sets with linked lists or arrays, but performance suffers:
- Finding the set representative could take O(n)
- Merging sets might require updating all elements
- Repeated operations become prohibitively expensive
For large datasets, this is unacceptable. We need something smarter.
Union-Find: The Smart Approach
Union-Find uses two key optimizations that make operations nearly constant time (amortized):
Path compression: When we find an element’s representative, we make a direct path to the root for all visited elements. This flattens the structure over time.
Union by rank/size: When merging sets, we attach the smaller tree under the larger one. This keeps the tree balanced.
With these optimizations, operations take almost O(1) amortized time—specifically α(n), the inverse Ackermann function, which is effectively constant for all practical values of n.
Implementation Details
Data structure: We maintain an array parent[i] where i is each element. The parent of the representative element points to itself.
Find with path compression:
def find(x):
if parent[x] != x:
parent[x] = find(parent[x]) # Path compression
return parent[x]
Union by size:
def union(x, y):
root_x = find(x)
root_y = find(y)
if root_x != root_y:
if size[root_x] < size[root_y]:
root_x, root_y = root_y, root_x
parent[root_y] = root_x
size[root_x] += size[root_y]
Understanding Path Compression
Path compression is elegant. When you call find(x), you recursively find the root. While unwinding the recursion, you update each node’s parent to point directly to the root.
This means the first find might be O(log n), but subsequent finds become faster. Over time, the tree becomes nearly flat, and operations become constant time.
The mathematical proof that this gives amortized O(α(n)) complexity is one of the most elegant results in data structure analysis.
Real-World Applications
Union-Find appears in many practical applications:
Kruskal’s algorithm: For finding minimum spanning trees, we sort edges by weight and add them if they connect different components (using Union-Find to check).
Network connectivity: Analyzing whether adding a connection would create cycles, or determining connected components in a network.
Image processing: Finding connected components in binary images for segmentation.
Social networks: Determining friend groups and detecting communities.
Dependency resolution: Package managers use Union-Find to resolve dependencies and detect circular dependencies.
Solving Problems with Union-Find
Here’s how you’d determine if two elements are connected:
class UnionFind:
def __init__(self, n):
self.parent = list(range(n))
self.size = [1] * n
def find(self, x):
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
px, py = self.find(x), self.find(y)
if px == py:
return False # Already connected
if self.size[px] < self.size[py]:
px, py = py, px
self.parent[py] = px
self.size[px] += self.size[py]
return True
def connected(self, x, y):
return self.find(x) == self.find(y)
Advanced Variations
Weighted Union-Find: Track additional information like component size or weight to optimize decisions.
Union-Find with rollback: Support undoing operations, useful in certain algorithms like offline queries.
Union-Find as graph: Sometimes viewing Union-Find as building a dynamic graph helps conceptualize problems.
Why It’s So Efficient
The key insight is that we don’t need to maintain the actual set, just connectivity. We sacrifice some information (we can’t list all elements in a set efficiently) but gain massive speed.
The inverse Ackermann function α(n) grows so slowly that for n = 10^6, α(n) ≈ 4, and for n = 10^9, α(n) ≈ 5. This is effectively constant time.
Union by Rank vs Union by Size
Union by Rank
Union by rank uses the tree height (rank) to decide which root becomes the parent:
def union_by_rank(x, y):
rx, ry = find(x), find(y)
if rx == ry:
return False
if rank[rx] < rank[ry]:
rx, ry = ry, rx
parent[ry] = rx
if rank[rx] == rank[ry]:
rank[rx] += 1
return True
Union by Size
Union by size uses the number of elements:
def union_by_size(x, y):
rx, ry = find(x), find(y)
if rx == ry:
return False
if size[rx] < size[ry]:
rx, ry = ry, rx
parent[ry] = rx
size[rx] += size[ry]
return True
Comparison:
| Criterion | Union by Rank | Union by Size |
|---|---|---|
| Information stored | Tree height | Component size |
| Memory | Same | Same |
| Time complexity | α(n) | α(n) |
| Additional utility | None | Can query component size |
Path compression tends to flatten the tree, making rank an overestimate of actual height. This does not affect correctness.
Inverse Ackermann Complexity
The inverse Ackermann function α(n) is one of the slowest-growing functions in mathematics:
A(0, n) = n + 1
A(m, 0) = A(m-1, 1)
A(m, n) = A(m-1, A(m, n-1))
A(0, 1) = 2
A(1, 1) = 3
A(2, 1) = 7
A(3, 1) = 2^65536 - 3 (already astronomical)
A(4, 1) = 2^2^... (incomprehensibly large)
α(n) = the smallest m such that A(m, m) ≥ n
| n | α(n) |
|---|---|
| 10^2 | 3 |
| 10^10 | 4 |
| 2^65536 | 5 |
For any practical problem, α(n) ≤ 5. This is why Union-Find operations are called “almost O(1).”
Path Compression: Iterative vs Recursive
Recursive (shown above)
Uses call stack to flatten. Simple but risks stack overflow for deep trees.
Iterative Path Compression
def find_iterative(x):
"""Find root and compress path iteratively."""
root = x
while parent[root] != root:
root = parent[root]
while x != root:
next_node = parent[x]
parent[x] = root
x = next_node
return root
Two-Pass Path Compression
The iterative approach makes two passes: one to find the root, one to compress. Both the recursive and iterative approaches achieve the same α(n) amortized complexity.
Kruskal’s Minimum Spanning Tree
def kruskal(vertices, edges):
"""
vertices: number of vertices
edges: list of (weight, u, v)
returns: list of edges in MST
"""
uf = UnionFind(vertices)
mst = []
edges.sort() # Sort by weight
for weight, u, v in edges:
if uf.union(u, v): # Different components
mst.append((u, v, weight))
return mst
# Example usage
edges = [
(10, 0, 1), (6, 0, 2), (5, 0, 3),
(15, 1, 3), (4, 2, 3)
]
mst = kruskal(4, edges)
print(f"MST edges: {mst}") # [(2, 3, 4), (0, 2, 5), (0, 1, 10)]
Kruskal’s algorithm runs in O(E log E) due to sorting, and the Union-Find operations add O(E α(V)) overhead.
Connected Components
def connected_components(graph, n):
"""Find all connected components in an undirected graph."""
uf = UnionFind(n)
for u in range(n):
for v in graph[u]:
uf.union(u, v)
components = {}
for v in range(n):
root = uf.find(v)
if root not in components:
components[root] = []
components[root].append(v)
return list(components.values())
# Example: social network friends
graph = {
0: [1, 2],
1: [0],
2: [0],
3: [4],
4: [3],
5: []
}
comps = connected_components(graph, 6)
print(f"Components: {comps}")
# [[0, 1, 2], [3, 4], [5]]
Cycle Detection
Union-Find detects cycles in undirected graphs efficiently:
def has_cycle(edges, n):
"""Check if adding edges creates a cycle."""
uf = UnionFind(n)
for u, v in edges:
if not uf.union(u, v): # Already connected = cycle
return True
return False
edges = [(0, 1), (1, 2), (2, 0)]
print(has_cycle(edges, 3)) # True
This is used in:
- Network design: Detecting loops when adding connections
- Circuit analysis: Finding feedback loops
- Deadlock detection: Resource allocation graphs
C++ Implementation
class UnionFind {
vector<int> parent, sz;
public:
UnionFind(int n) : parent(n), sz(n, 1) {
for (int i = 0; i < n; i++)
parent[i] = i;
}
int find(int x) {
while (parent[x] != x) {
parent[x] = parent[parent[x]]; // Path halving
x = parent[x];
}
return x;
}
bool unite(int a, int b) {
int ra = find(a), rb = find(b);
if (ra == rb) return false;
if (sz[ra] < sz[rb]) swap(ra, rb);
parent[rb] = ra;
sz[ra] += sz[rb];
return true;
}
bool connected(int a, int b) {
return find(a) == find(b);
}
int component_size(int x) {
return sz[find(x)];
}
};
Path Halving
Notice the C++ find uses path halving (parent[x] = parent[parent[x]]) instead of full compression. This is a middle ground that compresses less aggressively but reduces CPU overhead per find. It still achieves α(n) amortized complexity.
Image Processing: Connected Component Labeling
Union-Find is used in binary image segmentation to label connected regions:
def connected_component_labeling(image):
"""
image: 2D binary numpy array (0 = background, 1 = foreground)
returns: labeled image where each component has unique integer label
"""
rows, cols = len(image), len(image[0])
uf = UnionFind(rows * cols)
def idx(r, c):
return r * cols + c
# First pass: union adjacent foreground pixels
for r in range(rows):
for c in range(cols):
if image[r][c] == 1:
if r > 0 and image[r-1][c] == 1:
uf.union(idx(r, c), idx(r-1, c))
if c > 0 and image[r][c-1] == 1:
uf.union(idx(r, c), idx(r, c-1))
# Label each pixel with its component root
labels = {}
labeled = [[0] * cols for _ in range(rows)]
next_label = 1
for r in range(rows):
for c in range(cols):
if image[r][c] == 1:
root = uf.find(idx(r, c))
if root not in labels:
labels[root] = next_label
next_label += 1
labeled[r][c] = labels[root]
return labeled
Applications:
- Medical imaging: Identifying tumors, organs, or cells
- Object detection: Separating distinct objects in a scene
- Document analysis: Segmenting text regions from images
- Robotics: Identifying obstacles in sensor data
Java Implementation
public class UnionFind {
private int[] parent;
private int[] rank;
private int count; // Number of components
public UnionFind(int n) {
parent = new int[n];
rank = new int[n];
count = n;
for (int i = 0; i < n; i++)
parent[i] = i;
}
public int find(int x) {
if (parent[x] != x)
parent[x] = find(parent[x]);
return parent[x];
}
public boolean union(int x, int y) {
int rx = find(x), ry = find(y);
if (rx == ry) return false;
if (rank[rx] < rank[ry]) {
parent[rx] = ry;
} else if (rank[rx] > rank[ry]) {
parent[ry] = rx;
} else {
parent[ry] = rx;
rank[rx]++;
}
count--;
return true;
}
public int getCount() { return count; }
public boolean connected(int x, int y) {
return find(x) == find(y);
}
}
LeetCode-Style Problems
Number of Islands
def num_islands(grid):
if not grid:
return 0
rows, cols = len(grid), len(grid[0])
uf = UnionFind(rows * cols)
ones = 0
for r in range(rows):
for c in range(cols):
if grid[r][c] == '1':
ones += 1
idx = r * cols + c
if r > 0 and grid[r-1][c] == '1':
uf.union(idx, (r-1) * cols + c)
if c > 0 and grid[r][c-1] == '1':
uf.union(idx, r * cols + (c-1))
# Count distinct roots among '1' cells
roots = set()
for r in range(rows):
for c in range(cols):
if grid[r][c] == '1':
roots.add(uf.find(r * cols + c))
return len(roots)
Accounts Merge
def accounts_merge(accounts):
"""Merge Gmail accounts belonging to same person."""
email_to_name = {}
email_to_id = {}
uf = UnionFind(10000)
id_counter = 0
for name, *emails in accounts:
for email in emails:
email_to_name[email] = name
if email not in email_to_id:
email_to_id[email] = id_counter
id_counter += 1
uf.union(email_to_id[emails[0]], email_to_id[email])
result = defaultdict(list)
for email in email_to_name:
root = uf.find(email_to_id[email])
result[root].append(email)
return [[email_to_name[emails[0]]] + sorted(emails)
for emails in result.values()]
External Resources
Conclusion
Union-Find demonstrates that simple ideas, properly optimized, can be extraordinarily powerful. With just two operations—find and union—we can solve connectivity problems that would otherwise be intractable.
The next time you encounter a problem about grouping, connectivity, or cycles, consider Union-Find. It’s likely the right tool, and its elegant implementation makes it a pleasure to use.
Comments