Cluster tab - jonathanbrecher/sharedclustering GitHub Wiki
The Cluster tab lets you create cluster diagrams using the data that you downloaded using the Download tab, that you entered by hand, or that you obtained from similar sources.
Saved data file
The saved data file is the file that you saved previously using the Download tab. You can type the path to the file or click the Select button to choose it from disk.
This field will be saved and restored if you relaunch the application.
You can reuse the same saved data file to create many cluster diagrams. In fact, it's a good idea to reuse the same saved data file, since it takes much (much!) longer to download your data than to create clusters from it. You can very easily create many cluster diagrams from the same saved data file, using different settings or filters. There's really no need to make a new download unless you get new matches. New matches arrive slowly. Most people will only need to download new data every few weeks, or even every few months.
Cluster output file
This is the name of the file that will contain the cluster analysis. You can type the path to the file or click the Select button to choose a location to save the file on disk.
This field will be saved and restored if you relaunch the application.
Cluster completeness
There are three options for cluster completeness:
- Close relatives only (50 cM and greater)
- All visible shared matches (20 cM and greater)
- Complete (6 cM and greater)
Close relatives only (50 cM and greater)
The first option isn't terribly useful. Most people have relatively few very close relatives. You probably have spent a lot of time looking at your close relatives already, and you won't learn much from clustering only your close relatives. This option is provided mainly as an easy way to start clustering, where you can see how the clusters match what you already know.
All visible shared matches (20 cM and greater)
The second option is the workhorse of cluster analysis. If you've been working with your DNA matches for a while, you may already have some idea which of your shared cousin matches seem to "go together". This level of cluster analysis should show you a lot that looks familiar to you, while also giving insights that you haven't thought of before.
Complete (6 cM and greater)
The third option is the researcher's delight, adding the matches under 20 cM that Ancestry doesn't include in the shared matches. This produces a much more complex clustering diagram, but nothing beats a complete analysis in terms of useful data.
Advanced options
The advanced options are provided to give more control over the clustering behavior.
Minimum cluster size
The minimum cluster size excludes matches that have very few shared matches. The default value is 3 and generally should not be changed. Increasing the value runs the risk of excluding valid matches. Decreasing the value tends to include a lot of false positives that simply distract from research.
The one case where it does make sense to reduce the cluster size is if you are working on a very specific genealogical question, such as trying to break through a brick wall. In a case like that, it might be worth accepting the false positives in order to find every match that might possibly yield the missing clue that you're looking for.
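As a rough illustration of the effect described above, the following sketch drops matches that have fewer shared matches than the minimum cluster size. The function name, the dictionary layout, and the exact counting rule (counting only the entries in each match's shared-match list) are assumptions made for this example, not the application's actual implementation.

```python
def apply_minimum_cluster_size(shared_matches, minimum_cluster_size=3):
    """Keep only matches with at least `minimum_cluster_size` shared matches.

    `shared_matches` is a hypothetical mapping of test ID -> set of test IDs
    that appear as shared matches for that test ID.
    """
    return {
        test_id: shared
        for test_id, shared in shared_matches.items()
        if len(shared) >= minimum_cluster_size
    }


shared_matches = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A"},  # only one shared match
}

# With the default of 3, match E is excluded.
print(sorted(apply_minimum_cluster_size(shared_matches)))
# Lowering the value to 1 keeps E, at the cost of more potential false positives.
print(sorted(apply_minimum_cluster_size(shared_matches, 1)))
```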
Maximum gray percentage
This value controls the maximum gray percentage in the generated cluster diagram. The gray percentage is measured on a scale of 0...100 where 0 is no gray at all and 100 is a solid gray background. Nobody has a solid gray background, but some people have an awful lot of gray, to the point that so much gray makes it difficult to interpret the more interesting clusters. You can reduce the total amount of gray by reducing this value. The default value is 5, limiting the gray to at most 5% of the background.
This value should be used with caution. Most people have less than 5% gray to start with, so any value between 5 and 100 would be equivalent for them. The default is set lower than 100 for the benefit of those people who otherwise would have significant gray and would get a very difficult diagram to work with.
Values below 100 introduce a form of data loss. Data loss is a bad thing... but massive amounts of gray are basically useless, so there's a tradeoff to make.
Lowest centimorgans to cluster
This value is tied to the options for cluster completeness. As discussed above, there are really only a few useful values here: 50 cM to include all close matches, 20 cM to include the shared matches visible on the Ancestry website, and 6 cM to include everything (assuming that you have downloaded the low-strength matches in the first place). You can specify other values, but you probably shouldn't. Clusters work best with as much data as possible.
This value determines the number of rows in the final diagram. There will be one row for each match that shares at least as many centimorgans as the value you enter in this field.
Lowest centimorgans in shared matches
This value controls which shared matches should be considered when generating clusters, and is also tied to the options for cluster completeness. As discussed above, there are really only two useful values here: 20 cM to include the shared matches visible on the Ancestry website, and 6 cM to include everything (assuming that you have downloaded the low-strength matches in the first place). You can specify other values, but you probably shouldn't. Clusters work best with as much data as possible.
This value determines the number of columns in the final diagram. There will be one column for each match that shares at least as many centimorgans as the value you enter in this field.
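The way the two centimorgan thresholds set the size of the diagram can be sketched as follows. The `Match` type and the `diagram_shape` function are hypothetical names used only for this illustration; the sketch counts matches and says nothing about how the application stores its data.

```python
from typing import NamedTuple


class Match(NamedTuple):
    test_id: str      # hypothetical identifier for the match
    shared_cm: float  # centimorgans shared with the test taker


def diagram_shape(matches, lowest_cm_to_cluster, lowest_cm_in_shared_matches):
    """Return (rows, columns) implied by the two thresholds."""
    rows = sum(1 for m in matches if m.shared_cm >= lowest_cm_to_cluster)
    columns = sum(1 for m in matches if m.shared_cm >= lowest_cm_in_shared_matches)
    return rows, columns


matches = [Match("A", 120.0), Match("B", 45.0), Match("C", 22.0), Match("D", 8.5)]

print(diagram_shape(matches, 50, 20))  # close relatives only
print(diagram_shape(matches, 20, 20))  # all visible shared matches
print(diagram_shape(matches, 20, 6))   # rows at 20 cM, columns down to 6 cM
```

Lowering either threshold can only add rows or columns, which is why the complete setting produces the largest diagram.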
Maximum matches per cluster file
The maximum number of matches included per cluster file. If more than this many matches are included in the complete cluster diagram, then the complete diagram will be split across multiple files. Each file will include as many columns as matches specified here, plus a small number of additional header columns at the start of each row.
This option is necessary because while Excel supports up to 16384 columns, other spreadsheet programs such as Google Sheets or Open Office support only 1024 or even as few as 256 columns.
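The splitting behavior can be pictured with a minimal sketch that breaks an ordered list of matches into consecutive chunks, one per output file. The function name is hypothetical, and the sketch ignores the additional header columns mentioned above.

```python
def split_into_files(match_ids, max_matches_per_file):
    """Split an ordered list of match IDs into chunks of at most
    `max_matches_per_file` entries, one chunk per cluster file."""
    if max_matches_per_file < 1:
        raise ValueError("max_matches_per_file must be at least 1")
    return [
        match_ids[start:start + max_matches_per_file]
        for start in range(0, len(match_ids), max_matches_per_file)
    ]


match_ids = [f"match-{n}" for n in range(2500)]
chunks = split_into_files(match_ids, 1000)
print([len(chunk) for chunk in chunks])  # sizes of the resulting files
```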
Open cluster file(s) when complete
When the clustering is complete, the resulting diagram by default will be opened in whatever program you have installed to read .xlsx files, typically Microsoft Excel. You can disable the automatic opening if you plan on viewing the diagram in some other way, for example by using a web-based application such as Google Sheets.
Filter Test IDs to
This section accepts a list of test IDs, one per line, and creates a cluster diagram that contains only the matches with those IDs. If you're interested in looking at several clusters that are spread out on your main cluster diagram, you can copy the test IDs from your original clusters and paste them here. Reclustering in this way can be somewhat easier (and quicker!) than editing the spreadsheet by hand.
Filtering the clusters to a smaller number of matches can also be helpful when investigating the under-20 cM matches, so that you are not overwhelmed by the huge number of low-strength matches in areas that you are not currently researching.
In the Shared Clustering User Group on Facebook, Brian Schuck offered a good example of when it might be useful to use this field:
Let's say you do a clustering exercise all the way to 6 cm and find 5000 matches that cluster with a minimum cluster size of 4 (which is what happens on my Dad's kit). That's a pretty big spreadsheet that is hard to visualize all at once. But I really want to research just a couple of clusters on a great great grandfather's side. If I've annotated all the close matches - I can identify which clusters might have that information. I then filter the spreadsheet to only contain those clusters and then copy the IDs, paste them in there and re-run the report. Now I have a spreadsheet with just a few hundred matches - and I can actually see what's going on with the clusters all at once. It just makes it more workable. I use this feature a lot for targeted research.
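The filtering workflow described above amounts to reading a pasted list of IDs, one per line, and keeping only the matching entries. The sketch below assumes that blank lines are ignored and that an empty list means no filtering at all; both are assumptions, as are the function names and the dictionary layout of each match.

```python
def parse_test_ids(pasted_text):
    """Turn pasted text (one test ID per line) into a set of IDs.
    Surrounding whitespace and blank lines are ignored."""
    return {line.strip() for line in pasted_text.splitlines() if line.strip()}


def filter_matches(matches, wanted_ids):
    """Keep only matches whose test ID is in `wanted_ids`.
    An empty set is treated as 'no filter' (an assumption of this sketch)."""
    if not wanted_ids:
        return list(matches)
    return [m for m in matches if m["test_id"] in wanted_ids]


pasted = """
ID-001
  ID-003

"""
matches = [{"test_id": "ID-001"}, {"test_id": "ID-002"}, {"test_id": "ID-003"}]

wanted = parse_test_ids(pasted)
print([m["test_id"] for m in filter_matches(matches, wanted)])
```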
Exclude clusters with greater than X members
This value is rarely useful. In fact, it is actively harmful for most people and should normally be left blank. As a rule of thumb, this value should remain blank when generating clusters for any test with fewer than 5,000 matches over 20 cM.
For a very few people, however, this value is the only way to get clusters to work at all. This value is designed for people who have mixed ethnicity with part of their heritage from an extremely endogamic ethnic group and part of their heritage from an ethnicity with no endogamy. The non-endogamic matches would generate nice clusters, except that the larger number of endogamic matches totally overwhelm the clustering.
In the rare cases where this value is useful, it should be set to a number larger than the largest likely non-endogamic cluster size. A value of 200 is a reasonable first guess in cases like that, although it may need to be adjusted up or down to get the best results.
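Conceptually, this setting discards any cluster that has more than X members and leaves everything else alone. The sketch below illustrates that idea with hypothetical names; it treats a blank field as `None`, and it is not meant to reflect how the application applies the limit internally.

```python
def exclude_large_clusters(clusters, max_members=None):
    """Drop clusters with more than `max_members` members.
    `None` stands for a blank field, meaning nothing is excluded."""
    if max_members is None:
        return list(clusters)
    return [cluster for cluster in clusters if len(cluster) <= max_members]


clusters = [
    ["m1", "m2", "m3"],            # small, presumably non-endogamic cluster
    [f"e{n}" for n in range(450)]  # very large, presumably endogamic cluster
]

print([len(c) for c in exclude_large_clusters(clusters)])       # field left blank
print([len(c) for c in exclude_large_clusters(clusters, 200)])  # limit of 200
```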
Anonymize output
When the Anonymize Output option is turned on, all personally identifiable information will be randomized in the cluster diagram. The resulting diagram can be safely shared without revealing any personal information about yourself or your matches. This can be useful when sharing cluster diagrams in discussion groups that insist on fully anonymized information.
Specifically, the following data is affected by anonymization:
- The last name of each match is converted to one of the top 5,000 most common last names in the United States. Each name is converted in a consistent but non-reversible manner. That means that you'll get the same names if you create anonymized cluster diagrams several times in a row, but nobody will be able to convert the anonymized names back to the actual names in your match list. First names are not anonymized because they are not identifiable by themselves. The overall formatting of each name (capitalization, spaces, etc.) is preserved as much as possible between the original and anonymized names, so that someone who does know what the original names were might be able to orient themselves even with the anonymized names.
- The Test ID is fully anonymized.
- The Link is non-clickable.
- The Shared Centimorgans and Shared Segments values are not modified. These are just numbers and are not identifiable on their own. Plus, these values can be important when interpreting the rest of the diagram.
- The Tree link is non-clickable, but the Tree Type and Tree Size values are not modified.
- Last names of common ancestors are anonymized in the same way as the last names of the matches.
- Colored-dot group names are renamed to Group1, Group2, etc.
- Notes are blanked out.
Unlike all of the other settings, this value is not saved. Anonymized output will always be turned off after a fresh launch of Shared Clustering, so that you don't get some very confusing output after forgetting that the setting was turned on.
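One way to obtain a last-name replacement that is consistent but non-reversible, as described above, is to hash each name with a secret key and use the result to pick from a fixed list of common surnames. This is only a sketch of that general technique: the keyed hash (HMAC-SHA256), the five-name stand-in list, and the function name are all assumptions, and nothing here is taken from the application's own code.

```python
import hashlib
import hmac

# Stand-in for the list of 5,000 common surnames (assumption: tiny sample).
COMMON_SURNAMES = ["smith", "johnson", "williams", "brown", "jones"]


def anonymize_surname(surname, secret_key):
    """Map a surname to a common surname, consistently for a given key.

    Many different surnames land on the same replacement, and the key is
    secret, so the original name cannot be recovered from the output.
    """
    digest = hmac.new(secret_key, surname.lower().encode("utf-8"),
                      hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(COMMON_SURNAMES)
    replacement = COMMON_SURNAMES[index]
    # Preserve the broad capitalization style of the original name.
    if surname.isupper():
        return replacement.upper()
    if surname.islower():
        return replacement
    return replacement.capitalize()


key = b"hypothetical-secret-key"
print(anonymize_surname("Example", key))
print(anonymize_surname("EXAMPLE", key))  # same replacement, upper case
```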
Ancestry host name
The host name used when browsing the Ancestry web site. This will be used in the links that are generated in the cluster diagram. This value defaults to www.ancestry.com; users outside of the United States might want to change it to www.ancestry.com.au, www.ancestry.co.uk, www.ancestry.de, or some other localized Ancestry site.
Progress bar
The progress bar shows the progress of the clustering.
Clustering progresses in two or three stages. First, the software compares every pair of matches to find which ones are most similar. This stage takes the longest amount of time. Then the clusters are formed by grouping the most similar matches together. If you are clustering the under-20 cM matches, those will be added as a third stage before the cluster report is saved at the end.
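To see why the first stage takes the longest, consider that comparing every pair of matches means n(n-1)/2 comparisons for n matches. The sketch below uses the Jaccard index of two matches' shared-match sets as a stand-in similarity measure; that choice, like the function name and data layout, is an assumption for illustration and is not necessarily the measure Shared Clustering uses.

```python
from itertools import combinations


def pairwise_similarity(shared_matches):
    """Score every pair of matches by the overlap of their shared-match sets.

    Uses the Jaccard index (size of intersection / size of union) as an
    illustrative similarity measure.
    """
    scores = {}
    for a, b in combinations(sorted(shared_matches), 2):
        union = shared_matches[a] | shared_matches[b]
        overlap = shared_matches[a] & shared_matches[b]
        scores[(a, b)] = len(overlap) / len(union) if union else 0.0
    return scores


shared_matches = {
    "A": {"C", "D"},
    "B": {"C", "D"},
    "C": {"A", "B"},
}

scores = pairwise_similarity(shared_matches)
print(len(scores))         # number of pairs compared
print(scores[("A", "B")])  # A and B have identical shared matches
```

With 5,000 matches the same loop would run roughly 12.5 million times, which is consistent with this stage dominating the total running time.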