Help needed - jonathanbrecher/sharedclustering GitHub Wiki

Do you have ideas for how to improve Shared Clustering? I'd love to hear from you! If you're able to download the code and implement your own ideas, that's even better! Here are some areas where I know that Shared Clustering could be improved...

Downloading match lists from other sites

Ancestry has far more match data than any other site, so I've focused on them. But other sites have data too, and it would be great to be able to download from them as well.

See AncestryTestsRetriever.cs and AncestryMatchesRetriever.cs for simple stubs that could be extended to other sources.

Other matrix builders

Building an adjacency matrix is the first step in generating clusters. There are lots of ways to populate an adjacency matrix. The default in Shared Clustering is pretty good. Can you do better?

See the MatrixBuilders directory for a couple of examples and an interface that could be extended to different implementations.
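To make the idea concrete, here is a minimal sketch of one very simple way to populate an adjacency matrix from shared-match lists. It is written in Python purely for illustration, not in the project's C#, and it is not the default builder used by Shared Clustering. The function name `build_adjacency_matrix`, the input shape (a mapping from each match to the set of matches shared with it), and the choice of a symmetric 0/1 matrix are all assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def build_adjacency_matrix(
    shared: Dict[str, Set[str]]
) -> Tuple[List[str], List[List[float]]]:
    """Return (ordered match names, symmetric 0/1 adjacency matrix).

    `shared` is a hypothetical input shape: each match name maps to the
    set of other matches that are shared with it.
    """
    # Collect every name that appears as a key or as a shared match.
    names = sorted(set(shared) | {m for others in shared.values() for m in others})
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]

    # Every match is trivially related to itself.
    for i in range(n):
        matrix[i][i] = 1.0

    for name, others in shared.items():
        for other in others:
            i, j = index[name], index[other]
            matrix[i][j] = 1.0
            matrix[j][i] = 1.0  # assumption: sharing is treated as symmetric

    return names, matrix


# Made-up example data, not real match results.
shared_matches = {"Alice": {"Bob", "Carol"}, "Bob": {"Alice"}, "Dave": {"Erin"}}
names, matrix = build_adjacency_matrix(shared_matches)
```

A different builder might weight each cell by shared centimorgans, or leave the matrix asymmetric. Choices like those are the kind of thing worth experimenting with.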

Other distance metrics

Just as there are many ways to populate an adjacency matrix, there are many ways to measure the similarity (or distance) between two rows in that matrix.

See the Distance directory for several experiments at measuring the distance between matches. Many of those produce subtly different results that aren't clearly better. Can you find one that produces better output?
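As one example of what a distance metric between two rows could look like, here is a sketch of the Jaccard distance, written in Python for illustration rather than in the project's C#. It is not necessarily one of the metrics in the Distance directory. The function name and the convention that two all-zero rows have distance 0 are assumptions made for this example.

```python
from typing import Sequence


def jaccard_distance(row_a: Sequence[float], row_b: Sequence[float]) -> float:
    """Jaccard distance between two matrix rows, treating any value > 0 as 'present'."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")

    # Count columns where both rows are non-zero, and where at least one is.
    both = sum(1 for a, b in zip(row_a, row_b) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(row_a, row_b) if a > 0 or b > 0)

    if either == 0:
        return 0.0  # assumption: two empty rows are treated as identical

    return 1.0 - both / either


# The rows share one non-zero column out of three that are non-zero in either row.
d = jaccard_distance([1, 1, 0, 0], [1, 0, 1, 0])
```

This metric ignores how strong each value is and looks only at which columns are non-zero. A weighted metric such as Euclidean or cosine distance would behave differently on the same rows.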

Identifying clusters

Humans are good at identifying clusters by sight. Computers, not so much -- especially when clusters have fuzzy edges and when several clusters overlap. The algorithm used by Shared Clustering to identify primary clusters is 'adequate' but honestly not much better than that. There is a lot of room for improvement here.

See the PrimaryClusterFinders directory for the current algorithm and an interface that could be extended to better implementations.
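For a sense of how simple a baseline can be, here is a sketch that treats clusters as connected components of the adjacency matrix. It is written in Python for illustration and is not the algorithm Shared Clustering uses. The function name and the `threshold` parameter are assumptions made for this example. It also shows the hard part of the problem: a single fuzzy link merges two clusters into one, and overlapping clusters cannot be represented at all.

```python
from typing import List, Sequence


def connected_clusters(
    matrix: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[List[int]]:
    """Group row indices into connected components.

    Two rows are linked when either cell between them is >= threshold
    (the threshold is a hypothetical tuning parameter).
    """
    n = len(matrix)
    seen = [False] * n
    clusters: List[List[int]] = []

    for start in range(n):
        if seen[start]:
            continue
        # Depth-first walk over everything reachable from `start`.
        seen[start] = True
        stack = [start]
        members = []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                linked = matrix[i][j] >= threshold or matrix[j][i] >= threshold
                if not seen[j] and linked:
                    seen[j] = True
                    stack.append(j)
        clusters.append(sorted(members))

    return clusters


# Made-up 5x5 matrix with two separate groups: rows 0-2 and rows 3-4.
example = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
clusters = connected_clusters(example)
```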

New types of output

Shared Clustering produces its cluster diagrams in *.xlsx format, which is native to Microsoft Excel and can be read by most other spreadsheet applications. It's a pretty good way to view this data. There aren't many other ways to view thousands of rows and columns at one time.

What other output format would you like to see? I'm not sure that HTML could handle the display of a matrix that is thousands of cells on a side... but I haven't tried it either. Does someone want to take a shot?

See the CorrelationWriters directory for the current *.xlsx writer and an interface that could be used as the basis for other output.
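If someone does want to take a shot at HTML, a first experiment could be as small as the sketch below. It is written in Python for illustration rather than in the project's C#, and it does not follow the interface in the CorrelationWriters directory. The function name `matrix_to_html` and the two-decimal cell format are assumptions made for this example. Whether a browser copes with a table thousands of cells on a side is exactly the open question, and this sketch makes no attempt to answer it.

```python
from html import escape
from typing import List, Sequence


def matrix_to_html(names: Sequence[str], matrix: Sequence[Sequence[float]]) -> str:
    """Render a labeled matrix as a plain HTML table (hypothetical helper)."""
    lines: List[str] = ["<table>"]

    # Header row: an empty corner cell followed by one column label per match.
    header = "".join(f"<th>{escape(name)}</th>" for name in names)
    lines.append(f"<tr><th></th>{header}</tr>")

    # One table row per match, labeled with the (escaped) match name.
    for name, row in zip(names, matrix):
        cells = "".join(f"<td>{value:.2f}</td>" for value in row)
        lines.append(f"<tr><th>{escape(name)}</th>{cells}</tr>")

    lines.append("</table>")
    return "\n".join(lines)


# Made-up data; the "&" shows that names are escaped for HTML.
page = matrix_to_html(["A & B", "C"], [[1.0, 0.5], [0.5, 1.0]])
```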

Interpretation

Clusters come in various 'shapes' that may have physical meaning. Some interpretations are included in this documentation. Can you figure out the meaning of any other cluster shapes, so that other people can learn more about their results?

Macintosh version, Linux version

Shared Clustering is currently available only for Windows. There's no reason the same couldn't be done on other platforms. The algorithms are pretty much straight math and should translate cleanly. You'd simply need to wrap a user interface around them.