How does path compression make Union-Find nearly O(1)?

Without optimization, find(x) follows parent pointers from x to the root -- O(N) in the worst case (a degenerate chain). Path compression flattens the tree: during find(x), make every node on the path point directly to the root. Implementation: find(x) = if parent[x] != x then parent[x] = find(parent[x]); return parent[x]. After one find traversal, all nodes on the path become direct children of the root. Subsequent finds for these nodes are O(1). Combined with union by rank (always attach the shorter tree under the taller), the amortized cost per operation is O(alpha(N)), where alpha is the inverse Ackermann function. alpha(N) 2->3->...->N. Find operations traverse the full chain: O(N). Union by rank maintains a rank (approximate height) for each tree. When unioning: attach the shorter tree under the taller one. If ranks are equal, pick either and increment the new root rank. This ensures the tree height never exceeds O(log N), because doubling a tree height requires doubling the number of elements. With path compression + union by rank: operations are O(alpha(N)) amortized -- effectively constant. Alternative: union by size tracks element count per set. Attach smaller to larger. Equally efficient and has the advantage of maintaining accurate set sizes, which some problems need (e.g., find the largest connected component). Implementation: rank = [0]*N. In union: if rank[root_x] rank[root_y]: parent[root_y] = root_x. else: parent[root_y] = root_x; rank[root_x] += 1.

Coding Interview: Union-Find (Disjoint Set) Deep Dive — Path Compression, Union by Rank, Connected Components, Kruskal

⏱ 6 min read

Union-Find (Disjoint Set Union) is a data structure that tracks elements partitioned into non-overlapping sets. It supports two operations in near-constant time: find (which set does this element belong to?) and union (merge two sets). Union-Find appears in coding interviews for problems involving connectivity, grouping, and cycle detection. This guide covers the implementation, optimizations, and the most common interview applications.

Basic Implementation

Data structure: an array parent[] where parent[i] is the parent of element i. Initially, parent[i] = i (each element is its own set with itself as the root). Find(x): follow parent pointers until reaching the root (a node where parent[x] == x). The root is the representative of the set. Union(x, y): find the roots of x and y. If they are different, make one root point to the other: parent[root_x] = root_y. Now x and y are in the same set. Without optimizations: find is O(N) worst case (the tree degenerates into a chain). With path compression and union by rank: find is O(alpha(N)) amortized, where alpha is the inverse Ackermann function — effectively constant (alpha(N) <= 4 for any practical N). The two optimizations are essential for interview implementations.

Path Compression

Path compression flattens the tree during find operations. When finding the root of element x, make every node on the path point directly to the root. Implementation: find(x) = if parent[x] != x then parent[x] = find(parent[x]). Return parent[x]. This recursive call traverses to the root and on the way back sets every node parent to the root. After one find, all nodes on the path are direct children of the root. Subsequent finds for these nodes are O(1). Example: chain 1 -> 2 -> 3 -> 4 (root). find(1) traverses 1->2->3->4. With path compression, after find(1): parent = [4, 4, 4, 4]. All nodes now point directly to 4. Iterative path compression: two-pass approach. First pass: follow pointers to find the root. Second pass: walk the path again, setting each node parent to the root. Both recursive and iterative achieve the same result. Recursive is cleaner; iterative avoids stack overflow for very long chains.

Union by Rank

Union by rank prevents the tree from becoming unbalanced. Maintain a rank[] array (approximate tree height). Initially, rank[i] = 0. When unioning two sets: if rank[root_x] rank[root_y], make root_y point to root_x. If ranks are equal, choose either direction and increment the new root rank by 1. Without union by rank, a sequence of unions can create a chain of length N. With union by rank, the maximum tree height is O(log N). Combined with path compression, operations are O(alpha(N)) amortized. Alternative: union by size — instead of rank, track the number of elements in each set. Attach the smaller set to the larger. The result is equally efficient. Union by size is often simpler to implement and has the advantage of maintaining accurate set sizes (useful for some problems).

Applications: Connected Components

Number of connected components: given N nodes and a list of edges, find the number of connected components. Initialize Union-Find with N elements (N components). For each edge (u, v): union(u, v). If they were in different sets, the component count decreases by 1. Final count = N – number of successful unions. Time: O(N + E * alpha(N)). Number of islands (2D grid): treat each land cell as a node. For each land cell, union it with adjacent land cells. The number of distinct roots among land cells is the number of islands. This is an alternative to BFS/DFS flood fill. Graph valid tree: a graph is a valid tree if it has exactly N-1 edges and is connected (one component). Process edges with Union-Find. If union(u, v) finds they are already in the same set, there is a cycle — not a tree. If all edges processed without a cycle and component count is 1, it is a valid tree. Earliest moment when everyone becomes friends: given timestamps and friendship edges, process edges in timestamp order. When the component count reaches 1, everyone is connected. Return that timestamp.

Applications: Kruskal MST and Accounts Merge

Kruskal Minimum Spanning Tree: sort all edges by weight. Process edges from lightest to heaviest. For each edge (u, v, w): if find(u) != find(v), include this edge in the MST and union(u, v). If they are already connected, skip (including this edge would create a cycle). Stop when N-1 edges are included. Time: O(E log E) for sorting + O(E * alpha(N)) for Union-Find = O(E log E) overall. Accounts Merge: given a list of accounts where each account has a name and a list of emails, merge accounts that share any email (same person). Create a Union-Find over all unique emails. For each account, union all emails in that account. After processing, group emails by their root. Map each group back to the account name. This handles transitive relationships: if account 1 has emails {A, B} and account 2 has emails {B, C}, all three emails belong to the same person. Redundant Connection: given a graph that is a tree plus one extra edge, find the extra edge. Process edges with Union-Find. The first edge where find(u) == find(v) (already connected) is the redundant edge. These problems share the pattern: process relationships incrementally and use Union-Find to track connected components efficiently.

Union-Find vs BFS/DFS

Both Union-Find and BFS/DFS solve connectivity problems. When to use each: Union-Find is better when: (1) Edges arrive incrementally (online connectivity). You process edges one at a time and need connectivity answers after each. Rebuilding a graph for BFS/DFS after each edge is expensive. (2) You only need connectivity information (same component?), not paths. (3) The problem involves merging groups (accounts merge, friend circles). BFS/DFS is better when: (1) You need the actual path between nodes (Union-Find does not track paths). (2) You need shortest path (BFS). (3) You need traversal order (topological sort, level-order). (4) The graph is already built (no incremental edge processing). Both work for: number of connected components, cycle detection in undirected graphs, number of islands. For these, choose based on familiarity — both produce correct results with similar performance. Interview tip: if the problem says “are these two nodes connected?” after each edge addition, think Union-Find. If it says “find the shortest path” or “traverse all nodes,” think BFS/DFS.