Hoshen–Kopelman algorithm

Introduction

The Hoshen-Kopelman Algorithm is simple and efficient algorithm for labelling clusters on a grid, where grid is a regular network of cells, where each cell may be “occupied” or “unoccupied”. This algorithm is based on well-known union-finding algorithm. The algorithm was originally described in “Percolation and cluster distribution. I. Cluster multiple labeling technique and critical concentration algorithm”[1] by J. Hoshen and R. Kopelman.

Percolation Theory

Percolation theory is the study of the behavior and statistics of clusters on lattices. Suppose we have a large square lattice where each cell can be occupied with the probability p and empty with probability 1 – p.

Each group of neighboring occupied cells forms a cluster. Neighbors are defined as cells having a common side but not those sharing only a corner i.e. we consider 4x4 neighborhood. (top, bottom, left, right). Each occupied cell is occupied independently of the status of its neighborhood. The number of clusters, size of each cluster and their distribution are important topics in percolation theory.

Union-Find Algorithm

The Union-Find Algorithm is a simple method for computing equivalence classes. Calling the function union(x,y) specifies that items x and y are members of the same equivalence class. Because equivalence relations are transitive; all items equivalent to x are equivalent to all items equivalent to y. Thus for any item x, there is a set of items which are all equivalent to x are; this set is the equivalence class of which x is a member. A second function find(x) returns a representative member of the equivalence class to which x belongs.

Hoshen - Kopelman Algorithm for cluster finding

In this algorithm we scan through a grid looking for occupied cells and labelling them with cluster labels. The algorithm begins with scanning the grid cell by cell and check if the cell is occupied, if the cell is occupied then this cell must be labelled with a cluster label. This cluster label is decided based on the neighbors of the cell which have been previously scanned and labelled, and if the cell doesn’t have any occupied neighbors then new label is assigned to the cell.

Pseudo-Code

So in the raster scan of the grid in question, each time an occupied cell is encountered, a check is done to see whether this cell has any neighboring cells who have already been scanned. If so, first a union operation is performed, to specify that these neighboring cells are in fact members of the same equivalence class. Then a find operation is performed to find a representative member of that equivalence class with which to label the current cell. If, on the other hand, the current cell has no neighbors, it is assigned a new, previously unused, label. The entire grid is processed in this way. The grid can be raster-scanned a second time, performing only find operations at each cell, to re-label the cells with their final assignment of a representative element.

largest_label = 0;
for x in 0 to n_columns
{
for y in 0 to n_rows
{
if occupied[x,y] then
left = occupied[x-1,y];
above = occupied[x,y-1];
if (left == 0) and (above == 0) then /* Neither a label above nor to the left. */
largest_label = largest_label + 1; /* Make a new, as-yet-unused cluster label. */
label[x,y] = largest_label;
else if (left != 0) and (above == 0) then /* One neighbor, to the left. */
label[x,y] = find(left);
else if (left == 0) and (above != 0) then /* One neighbor, above. */
label[x,y] = find(above);
else /* Neighbors BOTH to the left and above. */
union(left,above); /* Link the left and above clusters. */
label[x,y] = find(left);
}
}
void union(int x, int y)
{
labels[find(x)] = find(y);
}
int find(int x)
{
int y = x;
while (labels[y] != y)
y = labels[y];
while (labels[x] != x)
{
int z = labels[x];
labels[x] = y;
x = z;
}
return y;
}

Example

Consider the following 6x6 grid.

More cluster finding algorithms

BFR&(Bradley;Fayyad;Reina) algorithm
k-means clustering algorithm
Fuzzy c-means clustering algorithm
Hierarchical clustering algorithm
Gaussian(EM) clustering algorithm
Quality threshold clustering algorithm

[1] - http://journals.aps.org/prb/abstract/10.1103/PhysRevB.14.3438