Download tab - jonathanbrecher/sharedclustering GitHub Wiki

The Download tab lets you download data from Ancestry. You need to download your data before you can cluster it on your local computer.

Username and password

You must enter your Ancestry DNA username and password and then click Sign In before using the rest of this tab.

The username is saved and restored when you quit and relaunch the application. For security, your password is not saved and must be re-entered each time you sign in.

Test name

The test name lets you specify which test to download data from. The test name(s) will only be populated after you have entered your Ancestry DNA username and password and clicked Sign In.

Some people have access to multiple tests, either because they manage tests for several people or because other tests have been shared with them. If you only have access to one test, that will be the only option.

For convenience, the number of matches in the test is shown to the right of the test name, after the test name is selected.

Download completeness

There are three options for download completeness:

  • Fast but incomplete (fourth cousin matches or higher — usually 10 minutes or less)
  • Slow and complete (all matches — usually at least several hours)
  • Endogamy special, for use with heavy endogamy (all matches, but only top 200 shared matches — can take 24 hours or longer)

In general, you want to download as much data as possible. But downloading tens of thousands of matches can take a long time. Downloading just the fourth cousin matches is faster than downloading everything.

Recommendation: First, download just the fourth cousin matches quickly. Then start a new download to download all matches to a separate file with a different name. You can get a lot of information from clustering the fourth cousin matches while waiting for the rest of the data to download, then repeat the clustering when you have all of the data.

You might want to use a file name that includes the minimum centimorgans included in the file. It can also be a good idea to include the date when you downloaded the data, to keep track if you do another download in a few weeks or a few months. Make sure that you remember where you save the file on disk, so that you can find it again if you want to generate a new cluster diagram in a few days. It's much faster to reuse the same download file to create multiple diagrams with different settings than to repeat the long download process every time.
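The naming advice above can be sketched in a few lines of Python. This is only an illustration of one possible convention: the function name `download_file_name`, the `min20cM` label, and the `.txt` extension are all assumptions, not something Shared Clustering requires or produces.

```python
from datetime import date


def download_file_name(test_name: str, min_cm: int, downloaded_on: date) -> str:
    """Build a file name that records the test, the cM cutoff, and the date."""
    # Replace anything that is not a letter or digit so the name is safe on disk.
    safe_name = "".join(c if c.isalnum() else "_" for c in test_name).strip("_")
    # The ".txt" extension is an assumption; use whatever the save dialog offers.
    return f"{safe_name}_min{min_cm}cM_{downloaded_on.isoformat()}.txt"


print(download_file_name("Jane Doe", 20, date(2024, 1, 15)))
# Jane_Doe_min20cM_2024-01-15.txt
```

Putting the date in ISO order (year, month, day) means that files from repeated downloads sort chronologically in a folder listing.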

To repeat: You want as much data as you can get. Really. The clustering cannot be harmed by having more data. The clustering can be harmed by having less data or incomplete data.

Special notes for test takers with endogamy

The third downloading option, 'Endogamy special', is designed for test takers with heavy endogamy. This option should not be used in cases where it is possible to download all matches using the second option.

Endogamy is a term that means different things to different people. In Ancestry test results, endogamy results in lots of matches and LOTS of shared matches. A test taker with heavy endogamy could easily have 20,000 fourth cousin matches and 200,000 total matches. And every one of those 200,000 total matches could have hundreds or even thousands of shared matches. Since the downloading time is related to the number of shared matches to download, it could easily take weeks (or months!) to download all matches for a test taker with heavy endogamy.

The 'Endogamy special' downloading option downloads all matches, just as the second option does, but it only downloads the top 200 shared matches. That much data can still take more than a day to download, but a few days is a lot more reasonable than a few months.

Again, this option should not be used in cases where it is possible to download all matches using the second option. But it's a functional backup for tests that simply cannot be downloaded completely in a reasonable amount of time.

If you are unsure whether you should use the 'Endogamy special' downloading option, you can click the 'Check endogamy' button.

Check endogamy button

The 'Check endogamy' button is enabled after you have signed in to Ancestry. It will examine a few of the matches for the selected test and report whether the test has significant endogamy:

  • 1% shared match frequency
  • 10% shared match frequency, partial
  • 10% shared match frequency
  • 50% shared match frequency

Testing for endogamy? Really?

Well, no, not really. Endogamy is a description of events that happened hundreds of years ago. Shared Clustering can't know for sure what happened hundreds of years ago. Instead, Shared Clustering tests for the results of endogamy that most affect clustering, namely having a huge number of shared matches.

When you click the 'Check endogamy' button, Shared Clustering selects 10 matches at roughly the 20 cM match level. The 20 cM level is right at the boundary between what Ancestry calls '4th to 6th cousins' and 'Distant cousins'. In other words, these are distant matches.

For each of those matches, Shared Clustering counts the number of shared matches and divides that by the total number of matches over 20 cM. Each of those ratios is between 0 (no shared matches at all) and 1 (every single match is a shared match).
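The ratio described above is a simple division, which a short Python sketch can make concrete. The function name and the sample counts are invented for illustration, and averaging the ratios into a single figure is an assumption; the page does not say how Shared Clustering combines the 10 ratios.

```python
def shared_match_ratios(shared_match_counts, total_matches_over_20_cm):
    """Return one ratio per sampled match: shared matches / total matches over 20 cM."""
    if total_matches_over_20_cm <= 0:
        raise ValueError("total_matches_over_20_cm must be positive")
    return [count / total_matches_over_20_cm for count in shared_match_counts]


# Invented numbers: shared match counts for 10 sampled matches near 20 cM,
# for a test with 1,000 total matches over 20 cM.
sample_counts = [12, 8, 15, 9, 11, 7, 14, 10, 6, 8]
ratios = shared_match_ratios(sample_counts, 1000)

# Averaging is an assumption made here to get a single summary number.
average = sum(ratios) / len(ratios)
print(f"{average:.1%}")
# 1.0%
```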

People without endogamy have very few shared matches, on average, for each match. They typically have shared match ratios under 2%. Clustering will work well for test results with very low shared match ratios.

Ashkenazi testers seem to have tens of thousands of matches over 20 cM, where each match has thousands of shared matches. They have shared match ratios around 10%. Clustering can work for these tests, but it does not always work well. The "Endogamy special" downloading option might be needed to deal with the large number of shared matches.

Some ethnic groups, including those with Pacific Island and New Mexico heritage, have shared match ratios over 50%. Clustering will almost certainly not work for these tests, and Similarity probably won't work either.

There is another possibility. Some people might have endogamy only through one parent or one grandparent. Not surprisingly, those people have an intermediate number of shared matches. Unfortunately, it is hard to predict the results of clustering for people with partial endogamy. Any matches outside of the endogamic groups will probably cluster well. The rest might work well or might give poor results. These people probably should try normal clustering in hopes that it works well for them, but be prepared to look at similarity if clustering doesn't work well.
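As a rough summary of the guidance above, the following Python sketch maps a shared match ratio to the expected outcome. It is not how Shared Clustering itself reports results. The only cut points used are the two stated in the text (under 2% and over 50%); everything in between, including the roughly 10% and partial-endogamy cases, is treated as a single uncertain band, which is a simplifying assumption.

```python
def clustering_outlook(ratio: float) -> str:
    """Summarize what to expect from clustering for a given shared match ratio."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    if ratio < 0.02:
        # Typical of tests without endogamy.
        return "low: clustering should work well"
    if ratio > 0.50:
        # Seen in some heavily endogamous populations.
        return "very high: clustering will almost certainly not work"
    # Covers ratios around 10% as well as partial endogamy.
    return "intermediate: clustering may work; be prepared to try Similarity"


for example in (0.01, 0.10, 0.60):
    print(f"{example:.0%} -> {clustering_outlook(example)}")
```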

Advanced options

The advanced options are provided to give more control than the three default options for download completeness. If you change the values in the advanced options, all three options for download completeness will be unselected.

Adjusting the advanced options is rarely necessary. It's mainly useful for people with endogamy, who have hundreds of thousands of total matches and thousands of shared matches per match. It can take literally weeks to download that much data, so even though it's not a good idea to limit the amount of data to be downloaded, sometimes it's necessary just to get something to work with at all.

Lowest centimorgans to retrieve

This value controls how many matches to retrieve. Matches are retrieved starting from the strongest match and ending with the matches at this number of centimorgans. For convenience, the numbers of third cousin matches, fourth cousin matches, and total matches are shown above, to the right of the test name.

It is always best to download as many matches as possible, meaning that this value should be as low as possible.

Lowest centimorgans of shared matches

This value controls how many shared matches should be retrieved for each match.

Since Ancestry never returns shared matches below 20 cM, the lowest meaningful value here is 20 cM.

For tests with endogamy, each match may have thousands of shared matches, making clustering both impractical and unhelpful. Downloading only the highest shared matches -- for example, only the top 200 shared matches -- can reduce download time significantly.

Recommendation: Don't adjust this value unless you know for sure that the majority of the test matches have hundreds or thousands of shared matches.
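To illustrate the effect of limiting shared matches, here is a minimal Python sketch that applies a centimorgan cutoff and an optional "top N" cap to one match's shared matches. The `SharedMatch` type, the function name, and the sample data are hypothetical; the sketch shows the idea only and is not the application's actual code.

```python
from typing import List, NamedTuple, Optional


class SharedMatch(NamedTuple):
    name: str
    shared_cm: float


def limit_shared_matches(
    shared: List[SharedMatch],
    min_cm: float = 20.0,
    top_n: Optional[int] = None,
) -> List[SharedMatch]:
    """Keep shared matches at or above min_cm, strongest first, capped at top_n."""
    kept = [m for m in shared if m.shared_cm >= min_cm]
    kept.sort(key=lambda m: m.shared_cm, reverse=True)
    return kept if top_n is None else kept[:top_n]


# Invented data for one match.
sample = [
    SharedMatch("A", 45.0),
    SharedMatch("B", 22.5),
    SharedMatch("C", 310.0),
    SharedMatch("D", 19.0),
]
print([m.name for m in limit_shared_matches(sample, min_cm=20.0, top_n=2)])
# ['C', 'A']
```

Raising `min_cm` above the strongest shared match leaves an empty list, which corresponds to downloading no shared match data at all.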

Fast downloads of match data only

If you set the 'Lowest centimorgans of shared matches' value greater than the centimorgans of your closest match, then no data about shared matches will be downloaded at all.

On the one hand, this is a bad idea because clustering is based on shared matches, and if you download no information about shared matches then you cannot generate clusters.

On the other hand, downloading information about shared matches is by far the slowest part of downloading information from Ancestry. If you skip all of the shared match information, you can download the rest of the information about your primary matches very quickly. Even people with hundreds of thousands of matches should be able to download information about those primary matches in 5-10 minutes, if they don't download the shared match info.

Information downloaded about primary matches includes name, id, shared centimorgans and segments, common ancestors, and user-entered notes -- basically, everything included in the saved cluster diagram except the clusters themselves. You can then use the Export tab to get an easy-to-read spreadsheet with that information for all of your primary matches. You can also use the Upload Notes tab to edit your notes for each match and upload the updated notes back to the Ancestry web site.

If you want to download only data about your primary matches, set the 'Lowest centimorgans of shared matches' value higher than the shared centimorgans of your closest match.

Progress bar

The progress bar shows how much data has been downloaded.

Downloading progresses in two stages. First, all of the match names are downloaded. This is equivalent to paging through the matches on the Ancestry web site, 200 matches per page. After all of the match names have been downloaded, a second pass downloads the shared matches for each match. Unsurprisingly, the first stage is fairly quick, while most of the download time is spent downloading the shared matches.
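The first stage is small compared with the second, and a quick calculation shows why. The Python sketch below counts the pages needed at 200 matches per page, the figure given above. The function name is hypothetical, and the sketch assumes that every page except possibly the last is full.

```python
MATCHES_PER_PAGE = 200  # page size stated above


def pages_for_first_stage(total_matches: int) -> int:
    """Number of 200-match pages needed to list every match (rounded up)."""
    if total_matches < 0:
        raise ValueError("total_matches must not be negative")
    return (total_matches + MATCHES_PER_PAGE - 1) // MATCHES_PER_PAGE


print(pages_for_first_stage(50_000))
# 250
```

By contrast, the second stage has to visit every one of those 50,000 matches individually, which is why it dominates the download time.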

Downloading speed

Ancestry throttles the rate at which Shared Clustering (or any application) can download data from their site. Shared Clustering respects their limits, and for safety backs off even a bit further. That means that the download speed is mostly not limited by network issues, and will be a fairly constant value for everyone.

The downloading speed does depend on how many matches you have, and how many shared matches are shown for each match.

People without significant endogamy might typically have fewer than 50,000 matches, and each of those matches might average fewer than a dozen shared matches. It should be possible to perform a complete download of test results without endogamy in a few hours or less, at rates well over 10,000 matches per hour.

People with endogamy have a lot more data. It is normal for test results with heavy endogamy to have over 200,000 matches, where each of those matches has hundreds or even thousands of shared matches. That's a HUGE amount of data, and it could take literally months to download everything. The 'Endogamy special' downloading option is designed to get some useful data downloaded in a reasonable amount of time, even in the presence of endogamy. At a rate near 10,000 matches per hour, it could still take more than a full day to download even a limited amount of data from over 200,000 matches.
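The time estimates above follow from dividing the number of matches by the download rate. The Python sketch below does that arithmetic, using 10,000 matches per hour as a default because that is the approximate rate quoted here. The function name is hypothetical, and real rates vary with the number of shared matches per match, so treat the result as a rough guide only.

```python
def estimated_download_hours(match_count: int, matches_per_hour: float = 10_000) -> float:
    """Rough download time in hours, assuming a constant rate."""
    if matches_per_hour <= 0:
        raise ValueError("matches_per_hour must be positive")
    return match_count / matches_per_hour


print(estimated_download_hours(50_000))   # 5.0
print(estimated_download_hours(240_000))  # 24.0
```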

Continue in Cluster tab

After the download is complete, the name of the downloaded file will appear at the bottom of the tab, along with a button that will switch to the Cluster tab with the name of the downloaded file already loaded in the Cluster tab.