Clustering algorithm

At a high level, Shared Clustering -- like all shared match clustering algorithms -- works by looking for similarities among the shared match lists of DNA matches.

Suppose that you have three matches, Alice, Bobby, and Charlie, and their shared match lists look like this:

  • Alice
    • Nancy
    • Oscar
    • Patty
  • Bobby
    • Nancy
    • Oscar
    • Patty
  • Charlie
    • Quentin
    • Rachel
    • Scott

In this simple example, the shared match lists for Alice and Bobby are similar since both of them contain the same three people: Nancy, Oscar, and Patty.

The shared match list for Charlie is different. Compared to the lists for Alice and Bobby, Charlie's shared match list lacks Nancy, Oscar, and Patty, but does contain Quentin, Rachel, and Scott.

A clustering algorithm would put Alice and Bobby in one cluster because they are similar, and would put Charlie in a different cluster, possibly with other people whose match lists also included Quentin, Rachel, and Scott.
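As a toy illustration only -- not the actual Shared Clustering code, which works from a numeric correlation matrix as described below -- the grouping idea can be sketched in a few lines of Python:

```python
# Toy sketch: group matches whose shared match lists overlap.
shared_matches = {
    "Alice":   {"Nancy", "Oscar", "Patty"},
    "Bobby":   {"Nancy", "Oscar", "Patty"},
    "Charlie": {"Quentin", "Rachel", "Scott"},
}

clusters = []
for name, shared in shared_matches.items():
    for cluster in clusters:
        # Join the first cluster whose members share anyone with this list.
        if any(shared & shared_matches[member] for member in cluster):
            cluster.append(name)
            break
    else:
        clusters.append([name])

print(clusters)  # [['Alice', 'Bobby'], ['Charlie']]
```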

The data being clustered

The Shared Clustering application builds clusters based on three types of information:

  • The ID of each match
  • The shared centimorgans of each match
  • For each match, the list of IDs of its shared matches

...and that's it. The clustering process uses absolutely no information that is private -- or even identifiable -- in any way.

Within the clustering process, each match is identified by a zero-based unique integer index that is independent of any identifiers provided by the testing company itself. Each match also contains a list of its shared matches, identified by the corresponding integer indexes of those shared matches.
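That internal representation can be pictured roughly like this -- a simplified sketch for illustration, not the actual data structures used by the application:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Match:
    index: int                  # zero-based index, independent of the testing company's own IDs
    shared_centimorgans: float  # total shared cM reported for this match
    shared_match_indexes: List[int] = field(default_factory=list)  # indexes of this match's shared matches

# For example, match 0 shares 250 cM and appears alongside matches 3, 7, and 12.
example = Match(index=0, shared_centimorgans=250.0, shared_match_indexes=[3, 7, 12])
```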

The centimorgan values are used in two ways. First, the matches are ordered by descending centimorgan values. This ordering provides an easy way to limit the clustering to the highest n matches, by selecting that many matches from the front of the list. It also provides some stability to the algorithm across runs, which simplifies debugging.
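Reusing the hypothetical Match sketch above, the ordering and limiting step might look roughly like this (a sketch, not the actual implementation):

```python
def strongest_matches(matches, n):
    """Order matches by descending shared centimorgans and keep only the first n."""
    ordered = sorted(matches, key=lambda m: m.shared_centimorgans, reverse=True)
    return ordered[:n]
```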

In some cases, the centimorgan values provide information about the expected clustering behavior. Matches with higher centimorgan values represent closer relatives. Those matches can be expected to have a greater number of shared matches than others, which is important to recognize in the presence of endogamy, since endogamy can also lead to very large numbers of shared matches. Ancestry also treats matches under 20 cM specially, excluding them from shared match lists. That special behavior by Ancestry can be considered when building clusters.

The output of the Shared Clustering application includes several other fields, including the match name, common ancestors, notes, and much more. None of that information contributes to the clustering process. The extra information is added to the output file at the very end, simply copying it from the original input file to the corresponding row in the final cluster diagram.

Correlation matrix

The first step in a clustering algorithm is converting the textual shared match lists into something that a computer can deal with more easily.

Shared Clustering creates its correlation matrix by taking each match in order, starting with the strongest match, and assigning each match a sequential index in the range 0 . . . n - 1, where n is the number of matches.

Each match is then assigned an n-dimensional array representing some information about each of the other matches.

One simple approach would be to populate the x-th position of the array with a value of 1 if the x-th match is included in the shared match list, and 0 otherwise. Shared Clustering does not use this approach, but the alternate algorithm is preserved in the code as a reference if anyone wants to review the results.
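That simpler 0/1 scheme could be sketched as follows, again using the hypothetical Match structure from above (for comparison only; this is not the code path Shared Clustering uses):

```python
def binary_vector(match, n):
    """1 in position x if the x-th match appears in this match's shared match list, 0 otherwise."""
    vector = [0] * n
    for x in match.shared_match_indexes:
        vector[x] = 1
    return vector
```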

Shared Clustering works by looking at how often each pair of matches appears together in the same shared match lists. The vector is populated with that frequency, on a scale from 0 to 1.

Consider two matches, Amy and Bill. If Amy and Bill never ever appear in the same shared match list, then they are correlated with a frequency of 0. Zero values are shown as blanks in the cluster output.

If every shared match list that includes Amy also includes Bill, then they are correlated with a frequency of 1. If half of the shared match lists that include Amy also include Bill, then they are correlated with a frequency of 0.5, and so on. This approach creates an asymmetric matrix, which is somewhat counter-intuitive but helps produce a final cluster diagram that is accurate, useful, and aesthetically pleasing.
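A minimal sketch of that frequency calculation, assuming the shared match lists are already expressed as sets of integer indexes (an assumption for illustration; the actual implementation differs in detail):

```python
def correlation_matrix(shared_match_lists, n):
    """Build the n-by-n frequency matrix described above.

    shared_match_lists: one set of match indexes per shared match list.
    Entry [a][b] is the fraction of lists containing match a that also
    contain match b, so the matrix is asymmetric by design."""
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        lists_with_a = [lst for lst in shared_match_lists if a in lst]
        if not lists_with_a:
            continue  # match a never appears in any shared match list
        for b in range(n):
            together = sum(1 for lst in lists_with_a if b in lst)
            matrix[a][b] = together / len(lists_with_a)
    return matrix
```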

There are many other ways to build a correlation matrix. Several alternate approaches were tried and rejected because they produced worse results. Others might be better. It is possible to build the correlation matrix in different ways without invalidating the overall approach to clustering.

Similarity

Just as there are many ways to build a correlation matrix, there are many ways to compare n-dimensional arrays. These are most often considered as distance measures, where two vectors have the greatest similarity when the distance between them is smallest. Common distance measures include the Euclidean distance, the Manhattan distance, and many others.
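For reference, those two standard measures look like this in a short Python sketch:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))
```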

Shared Clustering uses a modified version of Euclidean distance, considering only those values in the range 1 . . . 2, but downweighted by the total overlap between the vectors including all values. This builds clusters that maximize the number of members present in the same shared match lists (values in the range 1 . . . 2) while simultaneously encouraging clusters that are as large as possible. As with the matrix builders, several discarded distance metrics are preserved in the code for easy comparison if anyone is interested.
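The exact weighting is defined in the source code; one speculative reading of the description above, purely for illustration (the real calculation may differ in detail), is:

```python
def modified_distance(a, b):
    """Speculative sketch only: squared differences are counted only for
    coordinates whose paired values sum to between 1 and 2 (meaning both
    vectors contribute strongly there), and the result is downweighted by
    the total overlap across all coordinates."""
    restricted = sum((x - y) ** 2 for x, y in zip(a, b) if 1.0 <= x + y <= 2.0)
    overlap = sum(min(x, y) for x, y in zip(a, b))
    return restricted / overlap if overlap else float("inf")
```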

Clustering

Finally, just as there are many ways to build a correlation matrix and many ways to calculate similarity, so are there many ways to perform clustering. Shared Clustering uses a variation of hierarchical agglomerative clustering that only considers the two edges of each intermediate cluster.
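In heavily simplified form, the general technique can be sketched as follows; the similarity function and threshold here are placeholders, and this is not the actual implementation:

```python
def edge_based_clustering(items, similarity, threshold):
    """Greedy sketch of agglomerative clustering that compares only the
    first and last members (the 'edges') of each cluster-in-progress."""
    clusters = [[item] for item in items]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Only the edges of each intermediate cluster are compared.
                score = max(similarity(a, b)
                            for a in (clusters[i][0], clusters[i][-1])
                            for b in (clusters[j][0], clusters[j][-1]))
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, i, j)
        if best is None:
            return clusters
        _, i, j = best
        # A fuller implementation would also orient the lists so that the
        # most similar edges end up adjacent after the merge.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
```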

From a mathematical perspective, hierarchical agglomerative clustering "should" be most accurate when building clusters relative to the centroid of each cluster-in-progress. This approach did indeed create clusters that were "more accurate" than with the edge-based approach. However, the differences in accuracy were negligible in all cases reviewed. Worse, the clusters built relative to centroids were aesthetically less appealing than with the edge-based approach. The less-aesthetic clusters invited questions that were not genealogically useful.

One important aspect of clustering is that DNA match data is inherently clustered. DNA match data is ultimately derived from information about the underlying shared DNA segments, and those segments provide a natural clustering of the data. All humans have approximately 7,000 centimorgans in their entire genome, which leads directly to a maximum of approximately 350 possible clusters when considering segments of at least 20 cM (7,000 cM / 20 cM = 350). Increasing the total number of matches will create larger clusters, but will not increase the number of clusters beyond 350 or so.

Accordingly, the role of a clustering algorithm is to identify the clusters that naturally exist in the underlying data. A successful clustering algorithm must be able to assign matches to the proper high-level clusters implied by the underlying segment data, even though that segment data is generally not available directly at the time of clustering. The fine ordering of matches within a cluster is far less important, and less useful genealogically, than the overall assignment to the high-level clusters.

Clustering is a side-effect of similarity

One interesting detail should not be overlooked in the discussion above. Hierarchical agglomerative clustering identifies the matches that have the sets of shared matches that are most similar to each other. The important phrase there is "most similar". Hierarchical agglomerative clustering is based on similarity. It does not generate clusters per se. The clusters generated by hierarchical agglomerative clustering are entirely a side-effect of the similarity.

This makes sense. Members of a cluster are more similar to other members of the cluster than they are to matches that are not members of the cluster.

Identifying similarity between matches is both simple and robust from a computational perspective. Hierarchical agglomerative clustering builds the largest clusters supported by the underlying data, without needing any predefined notions of cluster size or the association of matches within a cluster.

Clustering by network graphs

There are many other types of clustering besides hierarchical agglomerative clustering. Another popular approach to clustering is based on identifying cliques within a network graph built from matches and shared matches.
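As a point of comparison only (not how Shared Clustering works), a minimal clique-based sketch using the networkx library might look like this:

```python
import networkx as nx

def clique_clusters(shared_matches):
    """shared_matches: dict mapping each match to the set of its shared matches.
    Builds a graph with an edge wherever two matches appear together, then
    reports the maximal cliques as candidate clusters."""
    graph = nx.Graph()
    for match, shared in shared_matches.items():
        for other in shared:
            graph.add_edge(match, other)
    # Finding cliques is computationally hard in general, which is why this
    # approach slows down badly on complex or endogamous data.
    return list(nx.find_cliques(graph))
```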

In many cases, network graph clustering will produce clusters that are equivalent to those produced by hierarchical agglomerative clustering. That is only proper. All clustering methods based on matches and shared matches should produce similar results, since they are working with the same underlying data.

Unlike hierarchical agglomerative clustering, however, network graph approaches depend on being able to identify cliques in the first place. Identifying cliques is a computationally hard problem, so network graph approaches tend to be slow except in the simplest of cases. Unfortunately, non-simple cases are common when analyzing DNA match data. Endogamy is extremely difficult to interpret using network graph approaches. Even cases of non-endogamous intermarriage can be too complex for network graph analysis within reasonable amounts of time.

Network graph approaches also have significant problems when working with highly connected graphs. To address this limitation, DNA match algorithms based on network graphs will normally exclude matches with extremely high numbers of shared matches. Very close matches, especially those with shared centimorgans over 1200 cM, tend to be among the matches that are excluded -- even though those matches are among the most useful from a genealogical perspective for understanding the remaining matches.

Because hierarchical agglomerative clustering generates clusters strictly by looking at the similarity between matches, it can often provide useful clusters even in the presence of moderate to heavy endogamy. It never needs matches to be excluded; on the contrary, the clusters generated by hierarchical agglomerative clustering tend to be most accurate and most useful when the closest matches are included rather than excluded.

Network graph approaches may also have a predefined notion of the minimum size of a recognized clique (= cluster) or the minimum connectivity of members within a clique. The cluster diagrams generated with this approach tend to have smaller, more fragmented clusters than those produced by hierarchical agglomerative clustering, or to omit matches that have only a weak association with the identified cliques. Either of those behaviors can obscure important, genealogically useful data that is presented clearly by hierarchical clustering.

For simple cases, clusters produced by either method tend to be similar. As the complexity of the underlying data increases, hierarchical agglomerative clustering produces results that are genealogically more useful.

Source code

All of these computational methods are exposed in the open source code for Shared Clustering, so that they can be evaluated (and possibly improved upon) by anyone who cares to work on them.