Similarity tab - jonathanbrecher/sharedclustering GitHub Wiki

The Similarity tab provides an alternate way of analyzing data that you downloaded from Ancestry. This tab is intended for use only when normal clustering isn't an option, and is mainly designed for test takers with heavy endogamy.

Test takers with heavy endogamy can easily have hundreds of thousands of matches, each with hundreds or even thousands of shared matches. Almost none of those matches are 'real' in the sense of being genealogically useful, and the huge glut of endogamic matches makes it extremely difficult to interpret normal clusters.

Similarity can be used by people who have already identified some true relatives in their DNA results. Given that hint of some matches who are definitely related to each other, Shared Clustering can look for other matches that are most similar to the known matches. The most similar additional matches are often (but not always!) related to the test taker in similar ways as the known matches.

For example, if you provide a group of third cousins who are known to be related to each other, then the similarity results might highlight other third cousins who are related to them. And possibly some fourth cousins and fifth cousins as well. And some people who might not be related in any identifiable way.
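The idea of ranking matches by how similar they are to a known group can be illustrated with a small sketch. This is not Shared Clustering's actual algorithm, which is not described on this page. It assumes a simple Jaccard overlap between shared-match sets, and the function names and sample IDs are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two sets: |a & b| / |a | b|, or 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_similarity(shared: dict, known_ids: list) -> list:
    """Rank every other match by overlap with the known relatives' shared matches.

    `shared` maps a match ID to the set of IDs it shares with the test taker.
    The Jaccard metric is an assumption made for illustration only.
    """
    # Pool the shared matches of all known relatives into one reference set.
    reference = set().union(*(shared.get(k, set()) for k in known_ids))
    scores = [
        (match_id, jaccard(matches, reference))
        for match_id, matches in shared.items()
        if match_id not in known_ids
    ]
    # Most similar first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: two known relatives and three candidates.
shared = {
    "known1": {"a", "b", "c"},
    "known2": {"b", "c", "d"},
    "cand1": {"a", "b", "c", "d"},
    "cand2": {"a", "x", "y", "z"},
    "cand3": {"x", "y"},
}
ranked = rank_by_similarity(shared, ["known1", "known2"])
for match_id, score in ranked:
    print(f"{match_id}: {score:.2f}")
```

In this toy data, cand1 shares all of the known group's matches and ranks first, while cand3 shares none and ranks last. A high score only suggests a possible relationship; as noted above, some highly ranked matches may not be related in any identifiable way.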

Unfortunately, the results found by similarity are always less certain than the results found by clustering. Clustering has the advantage of showing which matches also match each other. That cross-matching reinforces the true matches when building clusters, and tends to exclude the false matches. Similarity does not have the reinforcement of cross-matching, and so similarity tends to include many more false matches than clustering does.

So why use similarity at all, when it includes false matches? There is one simple answer: Similarity also finds real matches -- and often finds many real matches -- that cannot be found in any other way. Similarity looks at all of the matches, all the way down to 6 cM if that many matches were downloaded. Ancestry provides no good way to look at matches under 20 cM, short of one at a time. Again, clustering will be more accurate... but similarity will work even when clustering will not.

For test takers with heavy endogamy, similarity works well in combination with the 'Endogamy special' downloading option.


Saved data file

The saved data file is the file that you saved previously using the Download tab. You can type the path to the file or click the Select button to choose it from disk.

This field will be saved and restored if you relaunch the application.

Similarity output file

This is the name of the file that will contain the similarity analysis. You can type the path to the file or click the Select button to choose a location to save the file on disk.

This field will be saved and restored if you relaunch the application.

Lowest centimorgans to retrieve

This value controls how many tests will be examined for similarity. Normally you would want to examine as many tests as possible, meaning that this value should be as low as possible.

Lowest centimorgans of shared matches

This value controls how many shared matches are used for determining similarity.

This value should be as low as possible. For tests with endogamy, higher values may be necessary to exclude the endogamic noise of many shared matches that have no known relation to the test taker.

Since Ancestry never returns shared matches below 20 cM, the lowest meaningful value here is 20 cM.
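The interaction of the two centimorgan settings can be sketched as follows. This is only an illustration, not the application's code: the function name, the data layout, and the assumption that a shared match is filtered by its own centimorgans to the test taker are all hypothetical.

```python
# From the text: Ancestry never returns shared matches below 20 cM.
ANCESTRY_SHARED_MATCH_FLOOR_CM = 20.0


def filter_for_similarity(cm_by_id, shared, lowest_cm, lowest_shared_cm):
    """Apply both thresholds to hypothetical downloaded data.

    cm_by_id: match ID -> centimorgans shared with the test taker.
    shared:   match ID -> set of shared-match IDs.
    """
    # Values below 20 cM have no additional effect on shared matches.
    effective_shared_cm = max(lowest_shared_cm, ANCESTRY_SHARED_MATCH_FLOOR_CM)
    kept = {}
    for match_id, cm in cm_by_id.items():
        if cm < lowest_cm:
            continue  # below "Lowest centimorgans to retrieve"
        kept[match_id] = {
            s for s in shared.get(match_id, set())
            if cm_by_id.get(s, 0.0) >= effective_shared_cm
        }
    return kept


# Hypothetical sample data.
cm_by_id = {"A": 90.0, "B": 25.0, "C": 8.0, "D": 5.0}
shared = {"A": {"B"}, "B": {"A"}, "C": {"A", "B"}, "D": {"A"}}

low_noise = filter_for_similarity(cm_by_id, shared, lowest_cm=6, lowest_shared_cm=30)
print(low_noise)
```

With these sample values, match D falls below the 6 cM cutoff and is not examined, and raising the shared-match threshold to 30 cM removes B (25 cM) from everyone's shared-match set. Passing any shared-match value under 20 behaves the same as passing 20.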

Test IDs to compare for similarity

This section accepts a list of test IDs, one per line, and finds other matches most similar to the matches with those IDs. Test IDs may be copied from the appropriate column in a cluster diagram. A test ID also appears as the second code in the address of each match's comparison page. For example, a match with the address

https://www.ancestry.com/discoveryui-matches/compare-ng/0EBB869C-2C05-4412-B0A8-7AF57494D924/with/A5F420FF-491D-4059-98AE-0A9D3A390F37

has an ID of A5F420FF-491D-4059-98AE-0A9D3A390F37
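If you have many match addresses to convert, the second code can be pulled out with a short script. The sketch below assumes every address follows the same pattern as the example above; the function name is hypothetical, and this is not part of Shared Clustering itself.

```python
import re
from typing import Optional

# A GUID such as A5F420FF-491D-4059-98AE-0A9D3A390F37.
GUID = r"[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}"

# Assumed address shape, taken from the example: .../compare-ng/<first code>/with/<second code>
COMPARE_URL = re.compile(rf"/compare-ng/({GUID})/with/({GUID})")


def match_id_from_url(url: str) -> Optional[str]:
    """Return the second code (the match's test ID), or None if the address does not fit."""
    found = COMPARE_URL.search(url)
    return found.group(2).upper() if found else None


url = (
    "https://www.ancestry.com/discoveryui-matches/compare-ng/"
    "0EBB869C-2C05-4412-B0A8-7AF57494D924/with/"
    "A5F420FF-491D-4059-98AE-0A9D3A390F37"
)
print(match_id_from_url(url))
```

The printed value can be pasted into this section, one ID per line.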

If you leave this section blank, you will get a report showing which matches are most similar to each individual match. This report can be extremely long.

Advanced options

The advanced options are provided to give more control over the similarity behavior.

Minimum cluster size

The minimum cluster size excludes matches that have very few shared matches. The default value is 3 and generally should not be changed. Increasing the value runs the risk of excluding valid matches. Decreasing the value tends to include a lot of false positives that simply distract from research.
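The effect of this setting can be pictured with a small sketch. It assumes the cutoff is applied to the number of shared matches each match has; the exact rule the application uses is not spelled out here, and the function name and data are hypothetical.

```python
def drop_sparse_matches(shared, min_cluster_size=3):
    """Keep only matches with at least `min_cluster_size` shared matches.

    `shared` maps a match ID to its set of shared-match IDs. Treating the
    cutoff as "at least this many shared matches" is an assumption.
    """
    return {
        match_id: matches
        for match_id, matches in shared.items()
        if len(matches) >= min_cluster_size
    }


# Hypothetical sample data.
shared = {
    "M1": {"a", "b", "c", "d"},
    "M2": {"a", "b", "c"},
    "M3": {"a", "b"},
    "M4": {"a"},
}
print(sorted(drop_sparse_matches(shared)))                      # default of 3
print(sorted(drop_sparse_matches(shared, min_cluster_size=2)))  # lower cutoff admits M3
```

Lowering the cutoff lets in matches such as M3 that have little supporting evidence, while raising it would start to remove matches such as M2.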

Ancestry host name

The host name used when browsing the Ancestry web site. This will be used in the links that are generated in the similarity file. This value defaults to www.ancestry.com; users outside of the United States might want to change it to www.ancestry.com.au, www.ancestry.co.uk, www.ancestry.de, or some other localized Ancestry site.
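As a rough illustration of how the host name ends up in a generated link, the sketch below rebuilds a comparison address in the same shape as the example address shown earlier on this page. The function name is hypothetical, and the links actually written to the similarity file may take a different form.

```python
def compare_url(test_guid: str, match_id: str, host: str = "www.ancestry.com") -> str:
    """Build a comparison link on the chosen Ancestry site.

    The path layout is copied from the example address on this page and is
    assumed to be the same on the localized sites.
    """
    return f"https://{host}/discoveryui-matches/compare-ng/{test_guid}/with/{match_id}"


test_guid = "0EBB869C-2C05-4412-B0A8-7AF57494D924"
match_id = "A5F420FF-491D-4059-98AE-0A9D3A390F37"

print(compare_url(test_guid, match_id))
print(compare_url(test_guid, match_id, host="www.ancestry.co.uk"))
```

Only the host portion changes between the two links; the test IDs stay the same regardless of which Ancestry site you browse.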

Open similarity file(s) when complete

When the similarity analysis is complete, the resulting file by default will be opened in whatever program you have installed to read .xlsx files, typically Microsoft Excel. You can disable the automatic opening if you plan on viewing the file in some other way, for example by using a web-based application such as Google Sheets.

Progress bar

The progress bar shows the status of the similarity operation.