Endogamy Advanced analysis techniques - jonathanbrecher/sharedclustering GitHub Wiki

Endogamy makes a mess of clustering. Instead of nice solid clusters on a mostly-white background, a test taker with a heritage of endogamy will see a very "speckly" background, possibly across their entire cluster diagram.

Low endogamy clusters vs. high endogamy clusters

Test results with endogamy will also show extremely high numbers of matches and shared matches. A test taker without endogamy might have 50,000 matches at Ancestry.com, each of those with a few dozen shared matches. In contrast, a test taker with endogamy could easily have 200,000 matches, each with thousands of shared matches.
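To get a feel for the difference in scale, the figures above can be multiplied out. This is only a back-of-the-envelope sketch: "a few dozen" is assumed to mean 36 and "thousands" is assumed to mean 2,000, purely for illustration.

```python
# Rough scale comparison. The per-match counts are illustrative assumptions:
# "a few dozen" is taken as 36 and "thousands" as 2,000.
def shared_match_entries(matches: int, shared_per_match: int) -> int:
    """Total number of shared-match entries across a whole match list."""
    return matches * shared_per_match


without_endogamy = shared_match_entries(50_000, 36)
with_endogamy = shared_match_entries(200_000, 2_000)

print(f"Without endogamy: {without_endogamy:,} shared-match entries")
print(f"With endogamy:    {with_endogamy:,} shared-match entries")
print(f"Ratio: about {with_endogamy / without_endogamy:.0f}x")
```

Under these assumptions the match list is only four times longer, but the amount of shared-match data grows by a factor of a couple of hundred.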

Test takers with endogamy do not have "more relatives" than those without endogamy. The greater number of matches is simply the way that Ancestry interprets and reports the test results. Some matches over 100 cM may in fact be related only distantly, with their cM values greatly inflated by endogamy. Other matches under 10 cM might indeed be relatively close matches, perhaps downweighted by Ancestry's Timber algorithm. The trick is figuring out which of the reported matches are the best leads for further research and which are just noise from endogamy.

In some cases where endogamy makes normal clustering difficult, a combination of clustering and similarity seems to be a much better way to find useful leads.

This is an experimental technique. I have used it successfully to find new matches that I was then able to confirm through paper records, but it probably will not work for everyone. If you have feedback about what does and doesn't work for you, I'd love to hear it.

Summary

Use the Check endogamy button to confirm that you truly do have endogamy. If not, use normal clustering for the best results.
Download your matches from Ancestry with Shared Clustering using the "Endogamy Special" downloading option.
Generate a cluster diagram using only the matches above 50 cM.
Examine the cluster diagram and identify the most promising looking clusters.
Generate a Similarity analysis for each cluster in turn, limiting each analysis to only the test IDs in that cluster.
Focus on the highest matches in each Similarity analysis for further research.

Details

Confirm that you truly do have endogamy

Many people who think they have endogamy can generate clusters just fine. Since Clustering will always give better results than Similarity, you shouldn't use Similarity unless you are sure that clustering will not work for you.

Use the Check endogamy button

One of the quickest ways to check if clustering might work for you is to use the Check endogamy button on the Download tab. That button should give a very quick, fairly accurate answer that says whether you have enough endogamy to make a difference to the clustering analysis.

Just give it a try

Another option is simply to try it and see. Start downloading all of your matches down to 6 cM, using the "Slow and complete" download option in Shared Clustering. Let the download run for a few minutes until you get to the "Downloading shared matches" portion of the download and see when it predicts that the download will finish.

If you can get all of your matches downloaded in a few hours, you should let the full download run to completion and then generate clusters normally.

Otherwise, if Shared Clustering predicts that it will take days or weeks to download all of your matches, you have too many matches and shared matches for clustering to be successful. You should cancel the full download by closing the Shared Clustering window, and then continue with a Similarity analysis as below.

Long download time

Download your matches from Ancestry with Shared Clustering using the "Endogamy Special" downloading option

The "Endogamy Special" downloading option will download all of your matches, but it will only download the shared matches over 50 cM for each match. This should limit the download to a reasonable amount of data, while still downloading enough data to enable Similarity analysis.
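The effect of this option on the downloaded data can be pictured with a small sketch. It does not show how Shared Clustering performs the download; the data shape, the function name, and the assumption that the 50 cM threshold is inclusive and refers to the cM the shared match has with the test taker are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory shape: match ID -> list of
# (shared match ID, cM that the shared match shares with the test taker).
SharedMatches = Dict[str, List[Tuple[str, float]]]


def keep_strong_shared_matches(
    shared: SharedMatches, min_cm: float = 50.0
) -> SharedMatches:
    """Keep every match, but drop shared matches below min_cm."""
    return {
        match_id: [(other, cm) for other, cm in others if cm >= min_cm]
        for match_id, others in shared.items()
    }


# Made-up example data.
example = {
    "A": [("B", 120.0), ("C", 8.5), ("D", 55.0)],
    "C": [("A", 300.0), ("E", 12.0)],
}
filtered = keep_strong_shared_matches(example)
print(filtered)
```

Note that match "C" itself is kept even though it is a weak match; only its list of shared matches is trimmed.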

Generate a cluster diagram using only the matches above 50 cM

Generate a cluster diagram using the following settings in the Advanced options section of the Cluster tab:

Minimum cluster size: 3
Maximum gray percentage: 0
Lowest centimorgans to cluster: 50
Lowest centimorgans in shared matches: 50

Cluster over 50 cM options

This will generate a cluster diagram containing your closest matches (but only a tiny percentage of your total matches).
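As a rough illustration of what these settings select, the sketch below records the four values and applies the "Lowest centimorgans to cluster" threshold to a made-up match list. The class, field names, and function are hypothetical and are not part of Shared Clustering; treating the threshold as inclusive is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ClusterOptions:
    # Field names are illustrative; the values are the settings listed above.
    minimum_cluster_size: int = 3
    maximum_gray_percentage: float = 0.0
    lowest_cm_to_cluster: float = 50.0
    lowest_cm_in_shared_matches: float = 50.0


def matches_to_cluster(match_cm: Dict[str, float], options: ClusterOptions) -> List[str]:
    """Return the IDs of matches strong enough to appear in the diagram."""
    return sorted(
        match_id
        for match_id, cm in match_cm.items()
        if cm >= options.lowest_cm_to_cluster
    )


# Made-up example: match ID -> shared cM with the test taker.
match_cm = {"A": 310.0, "B": 75.0, "C": 49.9, "D": 12.0, "E": 6.5}
selected = matches_to_cluster(match_cm, ClusterOptions())
share = 100 * len(selected) / len(match_cm)
print(selected, f"({share:.0f}% of all matches)")
```

With a real match list of hundreds of thousands of matches, the selected share would be far smaller than in this toy example.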

Examine the cluster diagram and identify the most promising looking clusters

A cluster diagram for a test taker with endogamy will be difficult to analyze, even when limited to strong matches over 50 cM. You should be looking for one of two things:

First, you should look for any matches that you have already identified. An identified match gives you a lot of information about how other people might also fit into your tree.

Second, you should look for any clusters of at least 3-4 matches, even if you have not identified anyone in the cluster yet. Unknown clusters could open up entirely new branches of your tree -- but clearly they can be a lot more difficult to figure out.

Promising and not promising clusters

Generate a Similarity analysis for each cluster

For one cluster at a time, copy the corresponding Test IDs from Column C of the cluster diagram in Excel, and paste them into the "Test IDs to compare for similarity" field on the Similarity tab of Shared Clustering. Generate a similarity report for that cluster, then repeat for any other clusters that you have identified.
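If copying and pasting by hand becomes tedious, the same step could be scripted. The sketch below assumes the cluster diagram has been saved from Excel as a CSV file and that you already know which rows belong to the cluster; the function name, the sample data, and every column other than Column C are made up for illustration.

```python
import csv
import io
from typing import Iterable, List


def column_c_ids(rows: Iterable[List[str]], first_row: int, last_row: int) -> List[str]:
    """Collect non-empty Column C values for 1-based rows first_row..last_row."""
    ids = []
    for row_number, row in enumerate(rows, start=1):
        in_range = first_row <= row_number <= last_row
        # Column C is the third column, so its zero-based index is 2.
        if in_range and len(row) > 2 and row[2].strip():
            ids.append(row[2].strip())
    return ids


# Made-up sample standing in for a cluster diagram saved as CSV.
sample = io.StringIO(
    "Name,Note,Test ID,Shared cM\n"
    "Match 1,,ID-0001,212\n"
    "Match 2,,ID-0002,143\n"
    "Match 3,,ID-0003,98\n"
    "Match 4,,ID-0004,61\n"
)
ids = column_c_ids(csv.reader(sample), first_row=2, last_row=4)
print("\n".join(ids))
```

The printed IDs, one per line, match what you would get by copying the same cells from Column C in Excel.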

Similarity tab with Test IDs

Focus on the highest matches in each Similarity analysis for further research

Each similarity report shows the matches that are most similar to the matches in the cluster that you have identified. The most similar matches could come from anywhere in your overall match list. You may well find that some of the highest-ranked matches in the similarity report share less than 20 cM with you, or even less than 10 cM.

You should not ignore weak matches that appear high in your similarity list! The fact that they rank high in the similarity report provides extra information beyond the strength of the match itself.
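One way to see why a weak match can still rank high is that similarity depends on whom a match shares matches with, not on how much DNA they share with you. The sketch below uses a simple overlap fraction as a stand-in score. It is not Shared Clustering's actual scoring method, and all names and data are made up.

```python
from typing import Dict, List, Set, Tuple


def rank_by_similarity(
    cluster_ids: Set[str], shared: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Rank matches outside the cluster by the fraction of cluster members they share."""
    if not cluster_ids:
        raise ValueError("cluster_ids must not be empty")
    scores = []
    for match_id, shared_ids in shared.items():
        if match_id in cluster_ids:
            continue  # Only rank matches that are not already in the cluster.
        overlap = len(shared_ids & cluster_ids) / len(cluster_ids)
        scores.append((match_id, overlap))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up example: match ID -> set of that match's shared matches.
cluster = {"A", "B", "C"}
shared = {
    "weak_9cM": {"A", "B", "C"},
    "strong_80cM": {"A", "X", "Y"},
    "unrelated": {"X", "Y"},
}
ranking = rank_by_similarity(cluster, shared)
for match_id, score in ranking:
    print(f"{match_id}: {score:.2f}")
```

In this toy data the 9 cM match shares every member of the cluster and ranks first, ahead of an 80 cM match that shares only one.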

The similarity report "tails off" fairly quickly, although there is no specific cutoff. You should start at the top of the list and work your way down until the matches stop being useful to you. That might happen after the first handful of matches, or after the first few dozen matches. You won't know until you look.