Why cluster - jonathanbrecher/sharedclustering GitHub Wiki

Clustering is a tool, one of many ways that a researcher can home in on the information they are trying to find. Clustering does not answer genealogical questions on its own. It does, however, provide some extremely focused guidance for further research.

Many people have match lists at Ancestry containing thousands, tens of thousands, or even hundreds of thousands of matches. A dedicated researcher could review thousands of matches individually, given enough time. No human can review hundreds of thousands of matches by hand. A cluster of a few dozen matches, by contrast, is something that anyone can look at easily.

Not just pretty pictures

The point of a cluster diagram is not simply to create a pretty picture. Clusters should provide accurate information that a genetic genealogist can act on.

Clusters are most important because a researcher cannot predict where the next lead will come from. The last member of a cluster might be the one person who inherited a family Bible from the 1700s. Several members of a cluster might have public trees that share ancestors with the same unusual family name. Or they might share ancestors whose family names vary in spelling, where the similarities are obvious only when you look at them next to each other. Proving the links can be harder... but there's nothing to prove until you find the links in the first place.

Data, data, data

The clustering algorithms used by Shared Clustering love data. The more data, the better the clusters. Unlike with some other tools, you do NOT want to restrict cluster generation to some narrow range of shared centimorgans. The more data you can provide to Shared Clustering, the better the clusters you'll get in return.
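
To see why a narrow centimorgan range costs information, here is a toy sketch in Python. It is not Shared Clustering's actual algorithm. The match names, the centimorgan values, and the `shared_links` helper are all hypothetical, made up purely for illustration. The sketch only counts how many shared-match links remain once matches outside a chosen range are thrown away.

```python
# Hypothetical toy data (not real matches): name -> (shared cM with the tester,
# set of other matches that appear on this match's shared-match list).
matches = {
    "A": (250.0, {"B", "C", "D"}),
    "B": (95.0, {"A", "C"}),
    "C": (32.0, {"A", "B", "D"}),
    "D": (21.0, {"A", "C"}),
    "E": (60.0, {"F"}),
    "F": (24.0, {"E"}),
}


def shared_links(data, min_cm=0.0, max_cm=float("inf")):
    """Return the shared-match links whose two endpoints both fall in the cM range.

    Illustrative helper only; it is an assumption of this sketch, not part of
    Shared Clustering.
    """
    kept = {name for name, (cm, _) in data.items() if min_cm <= cm <= max_cm}
    links = set()
    for name in kept:
        for other in data[name][1]:
            if other in kept:
                # frozenset makes the link A-B identical to the link B-A
                links.add(frozenset((name, other)))
    return links


all_links = shared_links(matches)
narrow_links = shared_links(matches, min_cm=50.0, max_cm=400.0)

print(len(all_links), "links with every match included")
print(len(narrow_links), "links after limiting to 50-400 cM")
```

In this made-up data set, limiting the range drops the lower matches C, D, and F, and with them most of the links that tie the groups together. Match E is left with no links at all, so nothing remains to place it in a cluster.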