Shared Clustering versus other clustering tools

Quality of clustering

As of this writing, the clusters generated by Shared Clustering are significantly better than those generated by most other tools. In this context, "better" means that the clusters are more useful to the genealogical researcher.

Larger clusters

The clusters generated by Shared Clustering are larger than the clusters generated by most other tools, without sacrificing quality. That is, the clusters are not arbitrarily larger. In each cluster, all members are indeed related genetically. This can be a HUGE help when doing basic research where "more is better" -- more public trees to look at, more people to contact who might have further information of their own, and so on.

More clusters

Shared Clustering is also better at finding sparse clusters than other tools are. The members of a sparse cluster might not match every other member of the cluster. Even so, they match each other more often than they match others outside the cluster, and they are still useful research tools.
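The idea of a sparse cluster can be illustrated with a small sketch. This is not the algorithm that Shared Clustering uses. It only compares how often members of a proposed cluster appear in each other's shared match lists against how often they are linked to matches outside the cluster. The function name `link_density` and the sample data are hypothetical.

```python
from itertools import product

def link_density(group_a, group_b, shared_matches):
    """Fraction of distinct (a, b) pairs where either match lists the other
    as a shared match. Illustrative helper, not part of Shared Clustering."""
    pairs = {
        frozenset((a, b))
        for a, b in product(group_a, group_b)
        if a != b
    }
    if not pairs:
        return 0.0
    linked = 0
    for pair in pairs:
        a, b = tuple(pair)
        if b in shared_matches.get(a, set()) or a in shared_matches.get(b, set()):
            linked += 1
    return linked / len(pairs)

# Hypothetical data: match ID -> IDs reported as shared matches.
shared_matches = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "X"},
    "X": {"D", "Y"},
    "Y": {"X"},
}

cluster = {"A", "B", "C", "D"}
outsiders = set(shared_matches) - cluster

inside = link_density(cluster, cluster, shared_matches)     # 4 of 6 pairs linked
outside = link_density(cluster, outsiders, shared_matches)  # 1 of 8 pairs linked
print(f"inside: {inside:.2f}, outside: {outside:.2f}")
```

In this made-up example, A and D do not match each other, yet the group is still far more tightly linked internally than it is to X and Y. That is what makes it a plausible sparse cluster.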

More matches clustered at all

With larger clusters and more clusters, Shared Clustering is able to assign more matches to at least one cluster. If you have 2,000 matches that share 20 cM or more with the test taker, you can expect 1,800 or more of them (90% or more) to be assigned to clusters.

The main limitation is unavoidable. Since Shared Clustering -- and all clustering techniques -- is based on the shared matches reported for each match, a match without any shared matches at all cannot be clustered. By default, Shared Clustering excludes matches that have fewer than two shared matches. Beyond that, nearly all of the other matches will be clustered.
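The default exclusion described above amounts to a simple filter. The following is a minimal sketch, assuming each match is represented by an ID mapped to the set of IDs in its shared match list. The function name `clusterable_matches` and the `min_shared` parameter are hypothetical, and are not taken from the Shared Clustering source code.

```python
def clusterable_matches(shared_matches, min_shared=2):
    """Return the IDs of matches that have at least `min_shared` shared matches."""
    return {
        match_id
        for match_id, shared in shared_matches.items()
        if len(shared) >= min_shared
    }

# Hypothetical data: match ID -> IDs reported as shared matches.
shared_matches = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},   # only one shared match: excluded under the default threshold
    "E": set(),   # no shared matches at all: cannot be clustered
}

kept = clusterable_matches(shared_matches)
print(sorted(kept))
```

Lowering the threshold to 1 would keep match D as well, but no threshold can rescue match E, which has nothing to cluster on.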

Clustering of under-20 cM matches

Ancestry does not normally include matches under 20 cM when displaying shared match lists on their web site. They label the under-20 cM matches as "distant", and they are often correct about that. Some of the under-20 cM matches are less distant than others, though.

Shared Clustering can add the under-20 cM matches to the clusters formed from the stronger matches. To do that, it needs a complete download of all of your matches. A complete download can take a very long time, often as long as 12-24 hours. The time is worthwhile, though, as adding the under-20 cM matches will typically triple the size of the clusters. This is another HUGE benefit when doing research!
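One plausible way to picture this step is a simple vote: each under-20 cM match joins the existing cluster that contains the largest number of its shared matches. This is only an illustrative sketch and is not necessarily how Shared Clustering does it. The function `assign_weak_matches`, the `min_overlap` parameter, the cluster names, and the sample data are all assumptions. The sketch also assumes that each stronger match belongs to exactly one cluster, which is a simplification.

```python
from collections import Counter

def assign_weak_matches(clusters, weak_shared, min_overlap=1):
    """Add each weak match to the cluster holding most of its shared matches.

    clusters:    cluster name -> set of strong (20 cM or more) match IDs
    weak_shared: weak match ID -> set of strong match IDs shared with it
    """
    member_to_cluster = {
        member: name for name, members in clusters.items() for member in members
    }
    expanded = {name: set(members) for name, members in clusters.items()}
    for weak_id, shared in weak_shared.items():
        votes = Counter(
            member_to_cluster[s] for s in shared if s in member_to_cluster
        )
        if not votes:
            continue  # no shared matches in any cluster: leave unassigned
        # Ties are broken alphabetically by cluster name for determinism.
        best = max(sorted(votes), key=votes.get)
        if votes[best] >= min_overlap:
            expanded[best].add(weak_id)
    return expanded

# Hypothetical clusters and under-20 cM matches.
clusters = {"paternal": {"A", "B", "C"}, "maternal": {"D", "E"}}
weak_shared = {
    "w1": {"A", "B"},
    "w2": {"D"},
    "w3": {"A", "D", "E"},
    "w4": set(),
}

expanded = assign_weak_matches(clusters, weak_shared)
for name in sorted(expanded):
    print(name, sorted(expanded[name]))
```

In this made-up data, three of the four weak matches land in a cluster, and w4 stays unassigned because it has no shared matches to vote with.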

Association between clusters

In many cases, Shared Clustering can highlight how clusters are associated with each other. That gives a tremendous head start when doing research, since you know that the common ancestors for several clusters are likely all from the same general area of your family tree.

Cost

Shared Clustering is free to use, without restrictions.

Data access

With Shared Clustering, you control access to your own data. You do need to enter your Ancestry DNA username and password, but that information is only transmitted between your own computer and the Ancestry servers. Your Ancestry DNA results are downloaded from the Ancestry servers so that Shared Clustering can analyze that data on your computer. Nobody else has access to your password or your data.

Open Source

Shared Clustering is provided as Open Source, with full source code. That means that anyone who cares to review the code can see exactly how the algorithm is behaving. If someone thinks that they have ideas for improvement, they can propose changes or make changes to the code themselves.

The Open Source license for Shared Clustering means that commercial companies are allowed to take this code and incorporate it into their own software that they charge for. Each person can make their own decision whether to pay for the other software or to use Shared Clustering for free.