Clustering without downloading data - jonathanbrecher/sharedclustering GitHub Wiki
As a process, clustering (with a lowercase 'c') does nothing more than put similar data together. The data that gets clustered could come from anywhere. That means that you can generate useful cluster diagrams without downloading data directly from any website, as long as you have the patience to type in whatever data you care about by hand.
To enter data by hand that Shared Clustering can cluster, follow these steps:
Instructions
Click here to download a helpful Excel template that will simplify several of the steps below
1. Using nearly any spreadsheet application, create a new spreadsheet
The spreadsheet application must be able to save .xlsx files, also known as Microsoft Excel files. Pretty much every spreadsheet application can save files in that format, including LibreOffice, OpenOffice, Google Sheets, Apple Numbers, etc.
2. Label the first two cells with the exact text "Name" and "Shared Centimorgans"
3. Underneath those cells enter the name and shared centimorgans for your top several matches.
You should NOT skip your strongest matches. The strongest matches can add a lot of information to a cluster diagram, and should be included.
The more names that you enter, the better your clusters will be, but it can take a long time to enter the data for many matches. Stopping at 50 cM is a good compromise for many people. Some people may choose to stop at a higher cutoff such as 90 cM if they have a lot of matches.
4. Across the top row, enter the same names that you entered into the first column
5. Enter the shared match data
For each column, look at the shared matches reported for that person. Enter a "1" into each cell where the person in the row is a shared match to the person in the column, otherwise leave the cell blank.
Your hand-entered data might look like this:
6. Save your raw data to a .xlsx file
This raw data file is technically already a cluster diagram, just not a very useful one. You want to convert it to a more useful diagram with clusters that are larger and better organized.
7. On the Cluster tab of the Shared Clustering application, select the raw data file that you just created
Continue with the normal instructions for using the Cluster tab to create cluster diagrams.
Your final output might look like this:
Excel template file
For convenience, a sample Excel template file can be downloaded here. The template file here will copy the row headers to the column headers for you automatically, if you enter the rows first. It also shrinks the columns to convenient sizes, and freezes the panes so that the top row and left columns stay visible as you scroll. There is nothing magic about this template file; you can easily get the same effect by creating your own file from scratch.
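The layout described in the steps above can also be sketched in code. This is only an illustration of the grid shape, not part of the Shared Clustering application: the function name `build_raw_grid` and the sample names are hypothetical, and the sketch leaves the cell where a person meets themselves blank because the steps do not say how to fill it. Writing the resulting rows to a .xlsx file would need a spreadsheet library such as openpyxl, which is not shown here.

```python
def build_raw_grid(matches, shared):
    """Build the raw-data grid as a list of rows.

    matches: list of (name, shared_cm) tuples, strongest match first.
    shared:  dict mapping each name to the set of names listed as that
             person's shared matches (hypothetical input format).
    """
    names = [name for name, _ in matches]
    # Top row: the two exact labels, then the same names as the first column.
    grid = [["Name", "Shared Centimorgans"] + names]
    for row_name, cm in matches:
        row = [row_name, cm]
        for col_name in names:
            # 1 where the row person is a shared match of the column person,
            # otherwise a blank cell (None).
            row.append(1 if row_name in shared.get(col_name, set()) else None)
        grid.append(row)
    return grid


# Made-up example data, for illustration only.
matches = [("Alice", 850), ("Bob", 410), ("Carol", 95)]
shared = {"Alice": {"Bob"}, "Bob": {"Alice"}, "Carol": set()}
grid = build_raw_grid(matches, shared)
```

Each column is filled from one person's shared-match list, which mirrors the "for each column" instruction in step 5.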
Saving a bit of typing
Several sites -- including 23andMe, MyHeritage, and FamilyTreeDNA -- allow you to download all of your matches to your computer in a separate file. That file only contains data about your matches, while Shared Clustering needs your matches and your shared matches. So you cannot generate cluster diagrams from that file directly.
What you can do is copy the list of match names and shared centimorgans from that file into the template above, and use that list as a starting point for entering the rest of the shared match data. Less typing is a good thing, even if not a complete solution.
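If the downloaded match list is a CSV file, the copy step might be sketched as follows. The column headings `Match Name` and `Total cM` are assumptions for illustration; each site uses its own headings, so check the actual file. The function `starting_rows` is hypothetical, and the 50 cM default simply echoes the cutoff suggested earlier on this page.

```python
import csv
import io


def starting_rows(csv_text, name_col, cm_col, min_cm=50.0):
    """Return the first two columns of the raw-data sheet from a match list."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        cm = float(record[cm_col])
        # Skip matches below the chosen cutoff.
        if cm >= min_cm:
            rows.append((record[name_col], cm))
    # Strongest matches first, so they are not skipped.
    rows.sort(key=lambda r: r[1], reverse=True)
    return [["Name", "Shared Centimorgans"]] + [[n, c] for n, c in rows]


# Made-up sample content standing in for a downloaded file.
sample = "Match Name,Total cM\nAlice,850\nDave,32.5\nBob,410\n"
rows = starting_rows(sample, "Match Name", "Total cM")
```

The shared match cells still have to be entered by hand afterwards; this only saves typing the names and centimorgan values.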
Extending an existing cluster diagram
The Shared Clustering application can read its own output files. That means that if you have an existing cluster diagram, you can add new rows and columns to the bottom and right of the existing clusters, and then use the modified file as the starting point to produce an updated cluster diagram.
The opposite is also true. If you have a large cluster diagram that has more information than you care to look at, you can delete the rows and columns for the matches you are not interested in, and then use the reduced diagram as the basis for further research.
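Deleting a match means removing both that person's row and that person's column. A minimal sketch of the idea, assuming the simple raw layout from the instructions (two label columns followed by one column per match); the application's own output files may contain additional columns that this hypothetical `drop_matches` helper does not account for:

```python
def drop_matches(grid, unwanted):
    """Return a copy of the grid without the rows and columns for `unwanted` names."""
    header = grid[0]
    # Always keep the two label columns; keep a match column only if wanted.
    keep_cols = [i for i, name in enumerate(header)
                 if i < 2 or name not in unwanted]
    pruned = [[header[i] for i in keep_cols]]
    for row in grid[1:]:
        if row[0] not in unwanted:
            pruned.append([row[i] for i in keep_cols])
    return pruned


# Made-up example grid, for illustration only.
grid = [
    ["Name", "Shared Centimorgans", "Alice", "Bob", "Carol"],
    ["Alice", 850, None, 1, None],
    ["Bob", 410, 1, None, None],
    ["Carol", 95, None, None, None],
]
smaller = drop_matches(grid, {"Carol"})
```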
Comparison to downloaded data
Even when working with downloaded data, the Shared Clustering application doesn't know (and doesn't care) where the data came from. The cluster diagrams generated from the method above should be identical to the diagrams created from downloaded data, as long as you enter all of the data properly.
The main drawbacks to manual data entry are that it is tedious and error-prone. It is very boring to enter lots of data by hand. Still, most people should be able to enter data for their top 100 matches or so in under an hour. Even that much data could be very useful, especially for adoptees who are mainly interested in their closest matches anyway.
Manual data entry isn't a realistic option for more than a few hundred matches. A direct download is the only practical approach when analyzing thousands or tens of thousands of matches at once.
On the plus side, the manual clustering approach is fairly resilient to data entry errors. You should usually get reasonable clusters even if you make a few mistakes when typing in the raw data. This resilience to errors is an important part of any sort of DNA match clustering.
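Even though a few mistakes are tolerated, a quick check can catch some of them before clustering. The sketch below lists pairs that are marked in one direction only. It rests on an assumption: that shared matches are expected to be reciprocal. Some sites may legitimately report a shared match in one direction only, so treat the output as cells worth a second look rather than as definite errors. The helper name `one_way_pairs` and the sample grid are hypothetical.

```python
def one_way_pairs(grid):
    """Return (row_name, column_name) pairs marked without the reverse mark."""
    names = grid[0][2:]  # match names start after the two label columns
    marked = set()
    for row in grid[1:]:
        for col_name, cell in zip(names, row[2:]):
            # Any non-blank cell counts as a mark.
            if cell not in (None, ""):
                marked.add((row[0], col_name))
    return sorted((a, b) for a, b in marked
                  if a != b and (b, a) not in marked)


# Made-up example: Bob/Carol is marked in Bob's row but not in Carol's row.
grid = [
    ["Name", "Shared Centimorgans", "Alice", "Bob", "Carol"],
    ["Alice", 850, None, 1, None],
    ["Bob", 410, 1, None, 1],
    ["Carol", 95, None, None, None],
]
suspects = one_way_pairs(grid)
```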
The main benefit, of course, is that this approach works when direct download of data is not an option, including for sites that Shared Clustering cannot download from directly.
Comparison to the Leeds Method
The approach described here is very similar to the Leeds Method.
The Leeds Method can be performed by hand using pen and paper, while the approach here requires you to produce a .xlsx file on a computer.
The Leeds Method is also somewhat faster than this manual method, since you would normally only create four columns of data, no matter how many rows you looked at. The approach here requires that you enter one column for every row, which needs a lot more typing.
In cases where the Leeds Method works well, it is probably best to continue using the Leeds Method. The manual approach should generate exactly the same results, while taking more effort to get there.
The main benefit to this manual clustering approach is that it can produce excellent results even in cases where the Leeds Method fails. That could include test data with recent intermarriage or more distant endogamy. The manual approach can also usually generate reasonable clusters down to about 50 cM as long as you have the patience to enter that much data, while the Leeds Method generally has problems below 90 cM.
The manual clustering approach also provides richer information in some cases. That is especially true for people who have many matches above 400 cM, which would normally be excluded by the Leeds Method. This approach can often isolate great-grandparent clusters for people who take the time to enter data below 90 cM, while it is much more difficult to get beyond grandparent-level data using the Leeds Method alone.