What is a cluster - jonathanbrecher/sharedclustering GitHub Wiki

Does a cluster represent a common ancestor, or does it represent a DNA segment? The answer is... it depends!

Definition of a cluster

Clusters are built from DNA matches. By definition, a cluster is a group of DNA matches, most of whom match each other.
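One way to make "most of whom match each other" concrete is to count how many pairs inside a group actually match. The sketch below is only an illustration and is not how the Shared Clustering application builds clusters. The function name `cluster_density`, the `shared_matches` dictionary, and the sample names are all hypothetical.

```python
from itertools import combinations


def cluster_density(members, shared_matches):
    """Return the fraction of member pairs who match each other.

    1.0 means everyone matches everyone; values near 1.0 mean
    "most members match each other".
    """
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        return 0.0
    matching = sum(
        1
        for a, b in pairs
        # Treat a match as mutual even if it is listed on only one side.
        if b in shared_matches.get(a, set()) or a in shared_matches.get(b, set())
    )
    return matching / len(pairs)


# Hypothetical shared-match lists: each person maps to the other matches
# they also match.
shared_matches = {
    "Ann": {"Bob", "Cat", "Dan"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Ann", "Bob", "Dan"},
    "Dan": {"Ann", "Cat"},
}

density = cluster_density({"Ann", "Bob", "Cat", "Dan"}, shared_matches)
print(f"{density:.2f}")  # 5 of the 6 possible pairs match
```

In this made-up group, only Bob and Dan fail to match each other, so the group still fits the definition of a cluster.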

That is the only definition that is always true. Everything else is interpretation, and the interpretation is the part that depends.

Clusters containing close matches

A special kind of cluster contains only people who are close matches to the test taker. For example, the Shared Clustering application creates clusters that contain only people who match the test taker at 50 cM or more.

These matches are special because setting a high cM cutoff tends to guarantee that the matches have a common ancestor with the test taker within a limited number of generations. Matches over 50 cM are very likely fourth cousins to the test taker, or closer. Matches over 90 cM are very likely third cousins to the test taker, or closer. The Shared cM Project can make reasonably accurate predictions about how a match is related to the test taker, for matches over 50 cM.
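The cutoff idea can be sketched in a few lines. This is a minimal illustration, not the Shared Clustering application's own code. The function names and sample data are hypothetical, and treating a match at exactly 50 or 90 cM as meeting the threshold is an assumption. The labels only restate the rules of thumb above; they are not Shared cM Project predictions.

```python
def close_matches(matches, cutoff_cm=50):
    """Keep only matches who share at least cutoff_cm with the test taker."""
    return {name: cm for name, cm in matches.items() if cm >= cutoff_cm}


def rough_relationship(cm):
    """Restate the rule of thumb from the text for a shared-cM value."""
    if cm >= 90:
        return "very likely third cousin or closer"
    if cm >= 50:
        return "very likely fourth cousin or closer"
    return "below the close-match cutoff"


# Hypothetical match list: name -> shared centimorgans with the test taker.
sample = {"Ann": 212.0, "Bob": 94.5, "Cat": 61.0, "Dan": 23.4}

kept = close_matches(sample)
for name, cm in kept.items():
    print(name, cm, rough_relationship(cm))
```

Dan, at 23.4 cM, is dropped before clustering, so the remaining group contains only matches with a reasonably recent common ancestor.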

Two people who are close matches to the test taker through the same ancestor will usually also be close matches to each other. That is important because close matches share more than high cMs. They also share many segments with the test taker and with each other. Since a single segment is enough to make a DNA match, close matches with a lot of segments have a lot of opportunities to match each other. It is totally reasonable to have a group of third cousins who all share some segment with each other, even if different pairs of cousins share different specific segments.

As a result, clusters containing close matches tend to represent a group of matches who are all related to the test taker through a single ancestor. In a diagram containing matches over 50 cM, each cluster might represent matches who are related to the test taker through a single great-grandparent. In a diagram containing matches over 90 cM, each cluster might represent matches who are related to the test taker through a single grandparent.

Clusters containing close matches necessarily contain only close matches. Those matches also tend to all share many segments with the test taker.

Because clusters containing close matches tend to represent single ancestors, these clusters are very useful for adoptees and other people who are trying to understand their close family.

Clusters containing distant matches

Distant matches create clusters that are very different from the ones created from close matches.

There is a good reason for that difference. Matches who share only 20 cM with the test taker will often share only a single segment with the test taker. Their closest shared ancestor might be 6-10 generations removed from the test taker, and from each other. It is very unlikely that distant matches will share many segments with each other in the same way that close matches do, so it is very difficult to build ancestor-based clusters for distant matches.

When a group of people all happen to share the exact same DNA segment, they will automatically all match each other, by virtue of that mutually shared segment. If you think that it's rare that several people will share the exact same segment, you're right! But at the same time, most people have a thousand or more matches over 20 cM, while only having several hundred possible segments over 20 cM. Even rare things happen when the numbers are big enough.
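The "big numbers" argument is a pigeonhole count, and a rough calculation shows its scale. The sketch below assumes a deliberately simplified model in which each distant match shares exactly one of a fixed set of possible segments. Real segments overlap only partially, so this is an illustration rather than a prediction. The figure of 300 is a stand-in for "several hundred", and the function name is hypothetical.

```python
import math


def min_matches_on_busiest_segment(num_matches, num_segments):
    """Pigeonhole bound: at least one segment is shared by this many matches."""
    return math.ceil(num_matches / num_segments)


# Assumed figures: 1000 matches over 20 cM, 300 possible segments over 20 cM.
average = 1000 / 300
guaranteed = min_matches_on_busiest_segment(1000, 300)

print(f"average matches per segment: {average:.1f}")
print(f"some segment is shared by at least {guaranteed} matches")
```

Under these assumptions, at least one segment must be shared by four or more matches, and those matches would all match each other through that one segment.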

Clusters containing distant matches may contain matches with a wide range of shared centimorgans. What matters is that most of the members of the cluster share only a single segment with the test taker. Sometimes two of these clusters overlap at a match who shares more than one segment with the test taker, one segment for each cluster.

Because clusters containing distant matches tend to represent single DNA segments, these clusters tend not to be very useful for adoptees and other people who are trying to understand their close family. Clusters of this sort may contain matches who are related to the test taker through a variety of different ancestors, for example relatives descended from fourth-, fifth-, sixth-, seventh-, and eighth-great-grandparents along the path of descent followed by the shared DNA segment.

Clusters containing intermediate matches

There is a physical difference between the close matches that share multiple segments and the distant ones that share only one segment. The important part of that difference is the number of segments, not the number of centimorgans. The Shared Clustering application uses a 50 cM cutoff by default, but that is only a guess that tends to work well. Some people will need to use a higher cutoff of 60 or 70 cM to get good ancestor-based clusters. Some people will be able to get good ancestor-based clusters all the way down to 40 or 35 cM. It all depends on which relatives have tested, and how many segments they share with the test taker and with each other.
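Since the number of segments matters more than the number of centimorgans, a cluster can be roughly labeled by counting segments. The sketch below is an assumption-laden illustration: the function name is hypothetical, "most members" is taken to mean more than half, and the per-member segment counts would have to come from a testing company that reports them.

```python
def cluster_kind(segment_counts):
    """Label a cluster from the number of segments each member shares
    with the test taker.

    More than half single-segment members -> likely a single DNA segment.
    Otherwise -> more likely a shared ancestor.
    """
    if not segment_counts:
        raise ValueError("cluster has no members")
    single = sum(1 for n in segment_counts if n == 1)
    if single > len(segment_counts) / 2:
        return "segment-based"
    return "ancestor-based"


# Hypothetical clusters: one number per member.
print(cluster_kind([1, 1, 1, 2, 1]))  # mostly single-segment members
print(cluster_kind([4, 3, 6, 2, 1]))  # mostly multi-segment members
```

A check like this never looks at shared centimorgans, which is why the same cM cutoff can give good ancestor-based clusters for one test taker and not for another.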