Frequently asked questions - jonathanbrecher/sharedclustering GitHub Wiki

Is a Macintosh version available?

Shared Clustering is Open Source. That means that anyone can take the code I've written, and run with it. I'd love for someone to contribute a Mac version, while I continue to focus on improving the algorithms themselves.

I realize that doesn't help the people who only have a Mac right now. Once you download your data and generate your clusters, the cluster files themselves are normal *.xlsx files that can be viewed just fine in the Macintosh version of Microsoft Excel. The sticking point is getting those files generated in the first place. Maybe you can borrow a Windows laptop overnight from a friend? Maybe your local library has a Windows machine that you can use for a few hours?

Ken Spratlin has posted a description of how he is running Shared Clustering on his Mac using VirtualBox.

My antivirus software says that this has a virus!

Are you sure that's what it says? What message did you get, exactly? Some antivirus software says that Shared Clustering might have a virus. That's true of course. Also not terribly helpful, since any software might have a virus. If you have a message that it actually DOES have a virus, that would be a concern -- and also would be unlikely.

See more discussion in the system requirements.

Can I view the files with something other than Microsoft Excel?

The files generated by Shared Clustering are standard *.xlsx files. You can read those files with pretty much any spreadsheet on any platform. However, some programs have limitations that make it hard to work with the files. The main limitations I know about are in the display of colors and the maximum number of columns that can be displayed in a file.

Shared Clustering uses a feature called "conditional formatting" to display colors using a three-color color scale. This feature is not supported by all spreadsheets:

  • LibreOffice: Yes
  • OpenOffice: No
  • Google Sheets: No
  • Apple Numbers: No
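
If you want to confirm that a given file really does contain a color scale before blaming your spreadsheet program, the check can be sketched in a few lines of Python using only the standard library. This is an illustration, not part of Shared Clustering: the function name `color_scale_stops` is made up for this example, and it assumes the worksheets sit under `xl/worksheets/` inside the *.xlsx archive, which is the conventional location.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by worksheet parts in .xlsx files.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def color_scale_stops(xlsx):
    """Return the number of stops in each color scale found in the workbook.

    `xlsx` is a path or a binary file-like object. A value of 3 in the
    result means a three-color scale.
    """
    stops = []
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            # Assumption: worksheets live under xl/worksheets/ (conventional).
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            # Each <colorScale> holds one <cfvo> element per color stop.
            for scale in root.iter(NS + "colorScale"):
                stops.append(len(scale.findall(NS + "cfvo")))
    return stops
```

An empty result means no color scale was found in the file. A result containing 3 means the three-color scale is there, and any missing colors are down to the program displaying it.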

Excel can display files with over 16,000 columns. Most other programs have smaller limits. If needed, you can use the Maximum matches per cluster file option in the Advanced Options section of the Cluster tab to split your output file into slices that fit within these limits:

  • LibreOffice: 1024 columns
  • OpenOffice: 1024 columns
  • Google Sheets: 256 columns
  • Apple Numbers: 255 columns
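
To get a feel for the arithmetic, here is a rough Python sketch that estimates a value for Maximum matches per cluster file and the number of slices that results. It is a back-of-the-envelope helper, not something Shared Clustering provides. The function name `plan_slices` is made up, and `reserved_columns` is a guess at how many columns each file spends on things other than matches (names, cM values, and so on); count the leading columns in one of your own files and adjust. The Excel figure is its documented maximum of 16,384 columns.

```python
import math

# Column limits from the list above; Excel's maximum is 16,384 columns.
COLUMN_LIMITS = {
    "Excel": 16384,
    "LibreOffice": 1024,
    "OpenOffice": 1024,
    "Google Sheets": 256,
    "Apple Numbers": 255,
}


def plan_slices(total_matches, program, reserved_columns=10):
    """Return (matches_per_file, number_of_files) for a spreadsheet program.

    reserved_columns is an assumed allowance for the non-match columns
    at the left of each cluster file; it is not a documented value.
    """
    usable = COLUMN_LIMITS[program] - reserved_columns
    if usable <= 0:
        raise ValueError("reserved_columns leaves no room for matches")
    matches_per_file = min(total_matches, usable)
    number_of_files = math.ceil(total_matches / usable)
    return matches_per_file, number_of_files
```

For example, `plan_slices(3000, "LibreOffice")` suggests at most 1014 matches per file, spread over 3 files, under the assumed allowance of 10 reserved columns.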

Is there any significance to the order of the clusters?

Mostly not, sorry. There definitely is no high-level ordering that puts all of the maternal clusters first and all of the paternal clusters after those. Clusters from the same general area on your tree are often near each other in the diagram, but even that much isn't guaranteed.

In a general sense, there cannot be any ordering of the clusters. Genealogy would be so much simpler if DNA came with labels! There's no way for any software to know a priori what is represented by a given cluster, so there's no way for any software to order the clusters in any significant way. The software can only say that the clusters exist. As a researcher, you need to figure out what the clusters mean to you.

How can I set the maximum / minimum centimorgan limits to cluster?

It's a bad idea to limit the clustering based on some cM cutoff value. Shared Clustering loves data. You won't hurt your results by including more matches, and you will get better clusters if Shared Clustering has more data to work with.

If you absolutely can't stand seeing the high-cM matches on the final chart, you can delete them by hand. The Excel chart is editable after all. You still benefit from including them in the calculations.

Shared Clustering does already have a control for the lower cM bound, in the Advanced Options section. There are really only two useful minimum values: 20 cM and 6 cM. The 20 cM lower cutoff is useful mainly for people who want their clustering results to match the results shown on the Ancestry web site, since Ancestry doesn't show shared matches below 20 cM. You can set the value to something else, but you don't want to. Limiting your data can only hurt your results.

Something that IS useful is to limit your results to a subset of clusters. Do a full run and decide which clusters you want to focus on. Those clusters might not be next to each other in the full diagram. You can then do a second run that filters to just the matches in those clusters. That will give you a much smaller diagram that can be easier to deal with. See the "Filter Test GUIDs to" option in the Advanced Options.