Clustering without downloading data - jonathanbrecher/sharedclustering GitHub Wiki

As a process, clustering (with a lowercase 'c') does nothing more than put similar data together. The data that gets clustered could come from anywhere. That means that you can generate useful cluster diagrams without downloading data directly from any website, as long as you have the patience to type in whatever data you care about by hand.

To enter data by hand that Shared Clustering can cluster, follow these steps:

Instructions

Click here to download a helpful Excel template that will simplify several of the steps below

1. Using nearly any spreadsheet application, create a new spreadsheet

The spreadsheet application must be able to save .xlsx files, also known as Microsoft Excel files. Pretty much every spreadsheet application can save files in that format, including LibreOffice, OpenOffice, Google Sheets, Apple Numbers, etc.

2. Label the first two cells with the exact text "Name" and "Shared Centimorgans"

Label cells as Name and Shared Centimorgans

3. Underneath those cells enter the name and shared centimorgans for your top several matches.

You should NOT skip your strongest matches. The strongest matches can add a lot of information to a cluster diagram, and should be included.

The more names that you enter, the better your clusters will be, but it can take a long time to enter the data for many matches. Stopping at 50 cM is a good compromise for many people. Some people may choose to stop at a higher cutoff such as 90 cM if they have a lot of matches.

Enter names and shared centimorgans

4. Across the top row, enter the same names that you entered into the first column

Enter names and shared centimorgans

5. Enter the shared match data

For each column, look at the shared matches reported for that person. Enter a "1" into each cell where the person in the row is a shared match to the person in the column, otherwise leave the cell blank.

Your hand-entered data might look like this:

Raw data entry

6. Save your raw data to a .xlsx file

Technically, this raw data file is already a cluster diagram, just not a very useful one. You want to convert it to a more useful diagram with clusters that are larger and better organized.

7. On the Cluster tab of the Shared Clustering application, select the raw data file that you just created

Continue with the normal instructions for using the Cluster tab to create cluster diagrams.

Your final output might look like this:

Clustered diagram
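The layout described in steps 2 through 5 can be sketched in a few lines of Python. This is only an illustration of the grid shape, not part of Shared Clustering itself: the function name `build_raw_grid`, the sample names, and the centimorgan values are all made up for the example. The sketch marks only the shared matches you give it and leaves every other cell blank. It prints tab-separated text that could be pasted into a spreadsheet and then saved as .xlsx.

```python
def build_raw_grid(matches, shared):
    """Build the hand-entry grid as a list of rows.

    matches: list of (name, shared_cm) pairs, strongest match first.
    shared:  dict mapping a person's name to the set of names reported
             as shared matches for that person.
    """
    names = [name for name, _ in matches]
    # Top row: the two required labels, then the same names as column headers.
    grid = [["Name", "Shared Centimorgans"] + names]
    for row_name, cm in matches:
        # A cell gets "1" when the person in the row is a shared match
        # to the person in the column; otherwise it stays blank.
        cells = [
            "1" if row_name in shared.get(col_name, set()) else ""
            for col_name in names
        ]
        grid.append([row_name, str(cm)] + cells)
    return grid


# Hypothetical sample data, for illustration only.
matches = [("Alice Example", 410), ("Bob Example", 220), ("Carol Example", 95)]
shared = {
    "Alice Example": {"Bob Example"},
    "Bob Example": {"Alice Example"},
    "Carol Example": set(),
}

grid = build_raw_grid(matches, shared)
for row in grid:
    print("\t".join(row))
```

The first two columns hold the name and shared centimorgans, and every later column corresponds to one of the names from the first column, which is the same shape as the hand-entered spreadsheet.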

Excel template file

For convenience, a sample Excel template file can be downloaded here. The template file here will copy the row headers to the column headers for you automatically, if you enter the rows first. It also shrinks the columns to convenient sizes, and freezes the panes so that the top row and left columns stay visible as you scroll. There is nothing magic about this template file; you can easily get the same effect by creating your own file from scratch.

Saving a bit of typing

Several sites -- including 23andMe, MyHeritage, and FamilyTreeDNA -- allow you to download all of your matches to your computer in a separate file. That file only contains data about your matches, while Shared Clustering needs your matches and your shared matches. So you cannot generate cluster diagrams from that file directly.

What you can do is copy the list of match names and shared centimorgans from that file into the template above, and use that list as a starting point for entering the rest of the shared match data. Less typing is a good thing, even if not a complete solution.
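If the downloaded match file is a CSV, pulling out the starting list could look something like the sketch below. The column names `Match Name` and `Total cM` are assumptions made for this example; each site uses its own headers, so check the actual file and substitute the real ones. The 50 cM cutoff mirrors the compromise suggested earlier on this page.

```python
import csv
import io


def top_matches(csv_text, name_col, cm_col, min_cm=50.0):
    """Return (name, cM) pairs at or above min_cm, strongest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        cm = float(record[cm_col])
        if cm >= min_cm:
            rows.append((record[name_col], cm))
    # Strongest matches go at the top of the list.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows


# Hypothetical file contents; real downloads will have different columns.
sample = (
    "Match Name,Total cM\n"
    "Carol Example,95\n"
    "Dave Example,32.5\n"
    "Alice Example,410\n"
)

starting_list = top_matches(sample, "Match Name", "Total cM")
for name, cm in starting_list:
    print(f"{name}\t{cm}")
```

The output is only the Name and Shared Centimorgans columns. The shared match cells still have to be filled in by hand.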

Extending an existing cluster diagram

The Shared Clustering application can read its own output files. That means that if you have an existing cluster diagram, you can add new rows and columns to the bottom and right of the existing clusters, and then use the modified file as the starting point to produce an updated cluster diagram.

Clustered diagram with added matches

The opposite is also true. If you have a large cluster diagram that has more information than you care to look at, you can delete the rows and columns for the matches you are not interested in, and then use the reduced diagram as the basis for further research.
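Trimming a diagram means removing a match's row and its column together. A rough sketch of that operation on an in-memory grid follows. It assumes the simple hand-entered layout described above, with two leading columns followed by one column per match. Files written by Shared Clustering may contain additional columns, which this sketch does not account for. The function name `keep_matches` and the sample data are hypothetical.

```python
def keep_matches(grid, keep):
    """Return a copy of grid containing only the rows and columns in keep.

    grid: list of rows; the first row is the header.
    keep: set of match names to retain.
    """
    header = grid[0]
    # Always keep the two leading columns (name and shared centimorgans),
    # then only the match columns whose header is in the keep set.
    col_idx = [0, 1] + [i for i in range(2, len(header)) if header[i] in keep]
    trimmed = [[header[i] for i in col_idx]]
    for row in grid[1:]:
        if row[0] in keep:
            trimmed.append([row[i] for i in col_idx])
    return trimmed


# Hypothetical sample grid, for illustration only.
sample_grid = [
    ["Name", "Shared Centimorgans", "A", "B", "C"],
    ["A", "410", "", "1", ""],
    ["B", "220", "1", "", "1"],
    ["C", "95", "", "1", ""],
]

reduced = keep_matches(sample_grid, {"A", "B"})
for row in reduced:
    print("\t".join(row))
```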

Comparison to downloaded data

Even when working with downloaded data, the Shared Clustering application doesn't know (and doesn't care) where the data came from. The cluster diagrams generated from the method above should be identical to the diagrams created from downloaded data, as long as you enter all of the data properly.

The main drawbacks to manual data entry are that it is tedious and error-prone. It is very boring to enter lots of data by hand. Still, most people should be able to enter data for their top 100 matches or so in under an hour. Even that much data could be very useful, especially for adoptees who are mainly interested in their closest matches anyway.

Manual data entry isn't a realistic option for more than a few hundred matches. A direct download is the only practical approach when analyzing thousands or tens of thousands of matches at once.

On the plus side, the manual clustering approach is fairly resilient to data entry errors. You should usually get reasonable clusters even if you make a few mistakes when typing in the raw data. This resilience to errors is an important part of any sort of DNA match clustering.
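One optional way to spot possible typing mistakes is to look for one-sided entries: cells where A is marked as a shared match of B, but B is not marked as a shared match of A. The sketch below rests on the assumption that shared match lists are usually mutual. That is not guaranteed, since a testing site may report a shared match in one direction only, so treat the output as a list of cells to double-check rather than a list of errors. The function name and sample data are hypothetical.

```python
def find_one_sided_pairs(grid):
    """Return (row_name, column_name) pairs marked "1" without a mirror entry."""
    names = grid[0][2:]  # match names start in the third column
    marked = set()
    for row in grid[1:]:
        for col_name, cell in zip(names, row[2:]):
            if str(cell).strip() == "1":
                marked.add((row[0], col_name))
    # Keep only the pairs whose mirror-image cell is not marked.
    return sorted(pair for pair in marked if (pair[1], pair[0]) not in marked)


# Hypothetical sample grid: B's row marks C, but C's row does not mark B.
sample_grid = [
    ["Name", "Shared Centimorgans", "A", "B", "C"],
    ["A", "410", "", "1", ""],
    ["B", "220", "1", "", "1"],
    ["C", "95", "", "", ""],
]

to_check = find_one_sided_pairs(sample_grid)
print(to_check)
```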

The main benefit, of course, is that this approach works when direct download of data is not an option, including for sites that Shared Clustering cannot download from directly.

Comparison to the Leeds Method

The approach described here is very similar to the Leeds Method.

The Leeds Method can be performed by hand using pen and paper, while the approach here does need you to produce a *.xlsx file on the computer.

The Leeds Method is also somewhat faster than this manual method, since you would normally only create four columns of data, no matter how many rows you looked at. The approach here requires that you enter one column for every row, which needs a lot more typing.
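The difference in effort can be put in rough numbers. The sketch below simply counts grid cells, taking the four-column figure from the paragraph above. It is a back-of-the-envelope comparison, not a measurement of typing time, since most cells in either grid stay blank. The function name is made up for the example.

```python
def grid_cells(num_matches, leeds_columns=4):
    """Compare grid sizes for the same number of rows (matches)."""
    return {
        # Leeds Method: a fixed number of columns, however many rows.
        "leeds": num_matches * leeds_columns,
        # Manual clustering: one column for every row.
        "manual": num_matches * num_matches,
    }


sizes = grid_cells(100)
print(sizes)
```

For 100 matches that comes to 400 cells versus 10,000, and the gap widens quickly as more matches are added.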

In cases where the Leeds Method works well, it is probably best to continue using the Leeds Method. The manual approach should generate exactly the same results, while taking more effort to get there.

The main benefit to this manual clustering approach is that it can produce excellent results even in cases where the Leeds Method fails. That could include test data with recent intermarriage or more distant endogamy. The manual approach can also usually generate reasonable clusters down to about 50 cM as long as you have the patience to enter that much data, while the Leeds Method generally has problems below 90 cM.

The manual clustering approach also provides richer information in some cases. That is especially true for people who have many matches above 400 cM, which would normally be excluded by the Leeds Method. This approach can often isolate great-grandparent clusters for people who take the time to enter data below 90 cM, while it is much more difficult to get beyond grandparent-level data using the Leeds Method alone.