Overview for recruiters and hiring managers - jonathanbrecher/sharedclustering GitHub Wiki

I enjoy solving problems. The most important question I have for any potential employer is always the same: What problems can I help solve for you?

The Shared Clustering project was born out of a real problem that I wanted to solve for myself and to benefit others. This is what I most enjoy -- taking a real-world, valuable problem, and working out a solution that nobody else has found.

To be clear, I am not looking for a position writing genealogy software. That's not the point. I'm looking for a position where I can solve useful problems. This just happens to be a good example to share.

This project says far more about my skills and interests as a software engineer than any boilerplate I could put in my resume. Since Shared Clustering is Open Source, you can see every aspect of a complete, functional application from source code to end-user documentation. Here are ideas for some other things that you can look at to learn more about me as a software engineer.

Problem statement
The clustering algorithm
- Alternate clustering algorithms
Development environment
- Why not a web application?
Other things recruiters and hiring managers could look at
Conclusion

Problem statement

Genealogy has been a popular hobby for a long time. In the past 10-15 years, affordable DNA testing has added a new dimension by providing factual data to any person about their own DNA.

Beyond just a hobby, genealogy also provides incredibly valuable information. For some people, that includes medical data including disease susceptibility. For those who were adopted, DNA can be the only lead to the rest of their heritage.

In short, a lot of people would find a lot of value if they could get to usable information from their DNA more easily.

Something as deeply personal as DNA data also comes with privacy concerns. The largest DNA analysis firm, Ancestry.com, provides only limited access to the raw DNA data itself. Ancestry preserves some measure of privacy by not reporting exactly which specific DNA segments they have analyzed. That privacy hides a lot of information. Instead, Ancestry provides a "match list": for each person being tested, a list of other people who match the tester in at least one identical segment of DNA. Ancestry also provides a second dimension, the "shared match list": for each pair of matches, a further list of additional people who share at least one segment (not necessarily the same segment) with both. Finally, Ancestry reports a measure of the total amount of DNA shared between the test taker and each match, where the closest relatives usually share the most DNA.

The shared match data can be used even in a manual fashion. For example, any matches that are shared between a test taker and a known maternal cousin are also likely to be related on the tester's maternal side. This manual approach works well within 1-2 generations and can be stretched to 3 generations with some effort when the test taker already knows something about their family tree. This manual approach is extremely difficult for those without that knowledge (such as adoptees), and difficult for anyone beyond the closest few generations.
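The manual approach can be pictured as a simple lookup. The following is a minimal sketch in Python, not code from the Shared Clustering repository; the dictionary layout, the names, and the `label_sides` helper are all hypothetical.

```python
# Hypothetical in-memory form of the shared match lists: for each match,
# the set of other matches that are shared with the test taker.
shared_matches = {
    "maternal_cousin": {"alice", "bob"},
    "paternal_cousin": {"dave"},
    "alice": {"maternal_cousin", "bob"},
}

# Relatives whose side of the family is already known (an assumption of the example).
known = {"maternal_cousin": "maternal", "paternal_cousin": "paternal"}


def label_sides(shared_matches, known):
    """Tag each match with the side(s) of every known relative it is shared with."""
    labels = {}
    for relative, side in known.items():
        for match in shared_matches.get(relative, ()):
            if match in known:
                continue  # known relatives already have a side
            labels.setdefault(match, set()).add(side)
    return labels


labels = label_sides(shared_matches, known)
for name in sorted(labels):
    print(name, sorted(labels[name]))
```

The sketch only works because two relatives are already known. Without that starting knowledge there is nothing to look up, which is where the manual approach breaks down.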

Most test takers have between 10,000 and 300,000 matches. Is there a way to analyze the match lists and shared match lists to provide real, actionable data about the test taker's genealogy?

Yes, there is.

The clustering algorithm

The lists of matches and shared matches provided by Ancestry comprise a multi-dimensional data set. The Shared Clustering application implements a variation on hierarchical agglomerative clustering to arrange the matches based on their similarity, so that the matches that are most similar to each other are arranged nearest to each other in the final cluster diagram.
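To give a feel for the general technique, here is a minimal textbook-style sketch of hierarchical agglomerative clustering in Python. It is not the variation implemented in Shared Clustering: the Jaccard similarity, the average linkage, and every name in it are assumptions chosen for illustration, and the naive loop is far too slow for real match lists.

```python
def jaccard(a, b):
    """Similarity of two shared-match sets: size of overlap over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_order(shared_matches):
    """Repeatedly merge the two most similar clusters; return matches in merge order."""
    # Include each match in its own profile so members of one group overlap fully.
    profiles = {m: others | {m} for m, others in shared_matches.items()}
    clusters = [[m] for m in sorted(profiles)]

    def linkage(c1, c2):
        # Average linkage: mean pairwise similarity between the two clusters.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(jaccard(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


example = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "dave": {"erin"},
    "erin": {"dave"},
}
order = cluster_order(example)
print(order)
```

In the resulting order, alice, bob and carol sit next to each other, as do dave and erin, which is the property the cluster diagram relies on.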

The cluster diagram provides a visual display that closely correlates with the underlying DNA segment data. That segment data is not exposed by Ancestry but was used by Ancestry to generate the lists of shared matches. Part of the diagram will typically contain the tester's maternal relatives while another part contains the tester's paternal relatives. The maternal relatives are in turn separated into clusters of people related through the maternal grandmother versus maternal grandfather, etc. In ideal cases, the relatives within clusters can be traced back 10 generations or more. Analysis of the results is usually limited more by the lack of historical records (census, marriage, etc.) than the fidelity of the clusters being generated.

Alternate clustering algorithms

There are many clustering algorithms. Hierarchical agglomerative clustering was not my first choice, but it proved the most useful of the ones I looked at, in terms of producing accurate, actionable results.

There are also many possible implementations of hierarchical agglomerative clustering, especially in how the hierarchy is defined and the agglomeration is performed. I've preserved some of my discarded experiments in the code for others to review and possibly improve on.

Several other DNA match clustering applications have also been published in the time since I started on this project. Based on their behavior -- none of them are Open Source, so I can't be positive -- all of the others perform clustering based on a clique-based network analysis. In some cases, clique-based clusters also provide reasonable information, especially for test takers who have very simple DNA and simple networks of shared matches. Many test takers have DNA that is not simple, with at least some intermarriage between related ancestors, and clique-based clustering produces poor results with those tests.

By using hierarchical agglomerative clustering instead of clique-based clustering, Shared Clustering produces the most useful results.

Being useful is important to me.

Development environment

Shared Clustering is a Windows WPF application written in C#, because that's the environment that I was using in my day job when I wrote it.

I am NOT religious about what framework or what language I use. I will happily use whatever is most appropriate for the problem being solved. I used C++ as my primary language for many years. I used Java for a while. I even used Ruby for a while. I am NOT looking for "a C# development position". I am looking for a position where I can solve useful problems, and I will learn and use any language that's appropriate for that position.

Why not a web application?

To repeat a theme: A web application was not appropriate for the problem being solved. For this particular problem, there are three big strikes against a web application.

Data privacy

One issue is a matter of data privacy. A web-based analysis of the tester's data would need access to that data, by definition. No matter how much security I employed, there would always be some question about what was happening to each person's data. A Windows application avoids that question completely. The data exists nowhere outside of Ancestry's servers and the tester's own computer. There can be no question about what I'm doing with the data when I don't have the data in the first place.

Interactivity

A second issue is a matter of performance. Ancestry has implemented severe throttling on their server. It can easily take hours or even days to retrieve gigabytes of shared match data from Ancestry. That means that a fully interactive web application is out of the question. At best, a web application could take a submitted request and return processed results a long time later, likely via email. At that point, many of the benefits of a web-based UI are lost anyway.

Visualization of large data sets

The third issue is the amount of data involved. A typical cluster diagram could easily have 5000 x 5000 data points. That much data isn't handled well in a web browser. In fact, several spreadsheet alternatives to Excel -- Google Sheets, OpenOffice, LibreOffice -- are limited to 1024 columns or fewer. The desktop version of Excel can handle that much data easily, so it's a better solution in this case than a web interface.

Counterpoint: Cross-platform support

There is at least one big drawback to a Windows desktop application -- unlike a web application, the desktop application can only be run by people who have Windows. Maybe I'll make a Mac version some day. Maybe someone will contribute a Mac version to this Open Source project before I do it myself.

Other things recruiters and hiring managers could look at

Solving a real-life problem

The Shared Clustering application was created to solve a problem, and it does solve that problem.

You don't need to take my word for it. A selection of testimonials is included in the online documentation, many with external links.

More discussion takes place in the Shared Clustering User Group. That is a private group to protect occasional personal information that might be disclosed, but you can ask to join the group and then read the feedback there.

In fact, one of the most interesting parts of the discussion in the User Group is the part that isn't there. If you look at user groups for similar services (1, 2), a lot of the discussion there focuses on people who are unable to use the other programs. They don't understand how to use the other software, they don't understand how to interpret the diagram, or they question the accuracy of the results. Those discussions mostly don't happen in the Shared Clustering User Group. Shared Clustering is understandable. It works properly. When software works well, people post praise rather than complaints.

Source code and coding style

The Shared Clustering application is Open Source. All of it is posted publicly here on GitHub for anyone to review.

SOLID principles

Hiring managers should find the code organized in a clear and understandable way. Although this is a personal project that I created for myself, it adheres to a professional coding style. It follows SOLID principles, especially the single-responsibility principle and dependency inversion. Most aspects of coding style and organization would translate to any language, not just the C# used here.

Bug fixes versus new feature development

Since the code is on GitHub, the entire change history is also available for anyone to review. Most of the commits are related to new feature development rather than bug fixing, and most of the bug fixes are simple and localized. In many cases, even the features were simple and localized. Well-written code supports feature development without demanding constant bug fixing.

Performance

Shared Clustering is intended to be an easily usable, interactive application.

As mentioned above, the initial data download is throttled by Ancestry.com. That's unfortunate. Even though I cannot increase the download speed, I have provided status messages and a progress bar that reports accurately how much time remains.
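One simple way to estimate the remaining time for a throttled download is to extrapolate from the average rate so far. The Python sketch below illustrates that idea only; the function names are hypothetical, and the estimator actually used in Shared Clustering may well be more refined.

```python
def estimate_remaining_seconds(items_done, items_total, elapsed_seconds):
    """Extrapolate the remaining time from the average time per item so far."""
    if items_done <= 0:
        return None  # no rate information yet
    remaining_items = max(items_total - items_done, 0)
    return elapsed_seconds * remaining_items / items_done


def format_duration(seconds):
    """Render a number of seconds as hours, minutes and seconds."""
    minutes, secs = divmod(int(round(seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# 250 of 1000 shared match lists retrieved after 10 minutes.
remaining = estimate_remaining_seconds(250, 1000, 600.0)
print(format_duration(remaining))  # prints "0h 30m 00s"
```

An average over the whole download is steady but slow to react. If the throttling changes over time, averaging over a recent window would track it more closely.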

After the download is complete, the rest of the analysis is performed on the user's local computer. At that point, the rest of the performance is under my control.

The generation of a DNA match cluster diagram is fundamentally a version of the travelling salesman problem. As such, finding an exact solution is NP-hard. Fortunately, knowledge of the domain allows for a number of useful shortcuts.

Each tester can reasonably be assumed to have at most about 300 clusters, whether they have 10,000 or 100,000 matches. The practical limit on the number of clusters is determined by the fixed (albeit large) size of human DNA and the techniques used by companies to analyze DNA data and produce match lists in the first place. That limit allows the problem to be subdivided into at most a few hundred subproblems rather than one large one.

Those several hundred clusters are usually associated with each other in fairly simple ways. Clusters are often fully independent, allowing for many equivalent solutions that do not need to be optimized further.

Within each cluster, the matches are often associated very closely. The arrangements of matches within a cluster are often fungible, even if not literally equivalent.

All of those domain-specific details allow for very good performance in the software. Most people can generate clusters in a matter of seconds.
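The idea of splitting one large problem into independent subproblems can be illustrated with connected components of the shared match graph. This Python sketch is an assumption about one possible decomposition, not a description of how Shared Clustering divides its work; the data layout and the `independent_groups` name are hypothetical.

```python
from collections import deque


def independent_groups(shared_matches):
    """Split matches into groups that have no shared matches across groups."""
    seen = set()
    groups = []
    for start in sorted(shared_matches):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from this match.
        group = []
        queue = deque([start])
        seen.add(start)
        while queue:
            current = queue.popleft()
            group.append(current)
            for neighbor in sorted(shared_matches.get(current, ())):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        groups.append(sorted(group))
    return groups


example = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob"},
    "dave": {"erin"},
    "erin": {"dave"},
}
print(independent_groups(example))
# prints [['alice', 'bob', 'carol'], ['dave', 'erin']]
```

Each group can then be ordered on its own, and the relative order of fully independent groups does not need to be optimized.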

The resulting cluster diagram could easily lead the genealogical researcher to weeks or months of further research. At that level, Shared Clustering has served its purpose and the difference between a fraction of a second and several seconds is mostly irrelevant.

User interface

The user interface of the Shared Clustering application is at best adequate. I'm not going to defend the interface, beyond that it serves its purpose of making the application usable.

I don't pretend that I'm a great UI designer. If you're looking for a full-time UI designer, you want someone other than me.

Error handling

The best offense is a good defense, as the saying goes. Well-written code with few errors will rarely need error handling. Unfortunately, "rarely" and "never" are different things. Production code needs error handling.

Much of the error handling in Shared Clustering is simply a matter of basic validation. Any values input by a user must be assumed invalid until confirmed otherwise. Some values can be validated on the fly, such as non-numeric characters in a numeric field. Other values can only be validated by comparison to values calculated later, with error messages presented to the user when necessary.

Similarly, any data downloaded via REST must be tested for format and content, not to mention general error handling around the Internet connection itself.
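Both kinds of validation can be sketched in a few lines of Python. This is an illustration only: the field names `name` and `shared_amount`, the JSON layout, and all function names are hypothetical and do not describe Ancestry's actual responses or the checks in the Shared Clustering source.

```python
import json
import math


def parse_positive_number(text):
    """Immediate check for a numeric field. Returns (value, error_message)."""
    try:
        value = float(text.strip())
    except ValueError:
        return None, "Please enter a number."
    if not math.isfinite(value) or value <= 0:
        return None, "Please enter a number greater than zero."
    return value, None


def check_threshold_against_data(threshold, largest_value_in_data):
    """Deferred check that is only possible once the data has been loaded."""
    if threshold > largest_value_in_data:
        return "No match is at or above the requested threshold."
    return None


def parse_match_payload(raw_text):
    """Validate the format and content of downloaded text before trusting it."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as error:
        raise ValueError(f"Response is not valid JSON: {error}") from error
    if not isinstance(payload, list):
        raise ValueError("Expected a JSON list of matches.")
    matches = []
    for index, item in enumerate(payload):
        if not isinstance(item, dict):
            raise ValueError(f"Entry {index} is not an object.")
        name = item.get("name")
        amount = item.get("shared_amount")
        if not isinstance(name, str) or not name:
            raise ValueError(f"Entry {index} has no usable name.")
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or not math.isfinite(amount) or amount < 0:
            raise ValueError(f"Entry {index} has an invalid shared amount.")
        matches.append((name, float(amount)))
    return matches
```

The first function rejects bad input as it is typed, the second can only run after the data exists, and the third treats downloaded text as untrusted until every field has been checked.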

All of that is important. All of that is insufficient.

The single most important design decision in Shared Clustering is the inclusion of top-level exception handlers that handle all uncaught exceptions and write a log file. Software will always have unexpected failures. Shared Clustering is designed for people who may not report any more than "It didn't work!" Even so, most people can find a log file on disk and send it via email. In several cases, having a log file reduced a problem from hours or days of work, down to literally minutes.
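The pattern itself is small. Here is a minimal sketch of a top-level handler in Python using `sys.excepthook` and the standard `logging` module; it illustrates the idea rather than reproducing the C# code in the repository, and the logger name, log format, and `install_crash_logger` function are assumptions.

```python
import logging
import sys


def install_crash_logger(log_path):
    """Route every uncaught exception to a log file the user can send by email."""
    logger = logging.getLogger("crash")
    logger.setLevel(logging.ERROR)
    logger.propagate = False
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

    def handle_uncaught(exc_type, exc_value, exc_traceback):
        # Write the full traceback to the log file.
        logger.error(
            "Uncaught exception",
            exc_info=(exc_type, exc_value, exc_traceback),
        )
        # Keep the default behaviour of printing the traceback to stderr.
        sys.__excepthook__(exc_type, exc_value, exc_traceback)

    sys.excepthook = handle_uncaught
    return handle_uncaught
```

Calling `install_crash_logger("crash.log")` once at startup is enough (the file name is only an example). In a .NET WPF application the comparable hooks are the `AppDomain.CurrentDomain.UnhandledException` and `Application.DispatcherUnhandledException` events.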

Good error handling reduces time spent diagnosing problems. Less time spent on problems makes more time for feature development, and that's the goal after all.

User documentation

I think I can safely say that the user documentation is better than average. Software should always be usable even without any documentation, and that is true here also. But having good documentation can be useful.

In this case, I've released the Shared Clustering application to a fairly non-technical user base. The documentation is provided in my own self-defense. For every question answered clearly in the documentation, that's one less question that I have to answer personally. This approach has worked well. I do get questions from time to time, but very few of them are questions that people could have answered for themselves. I enjoy questions that make me think. Some of the best feature enhancements are prompted by external users who ask good questions.

Conclusion

I hope this discussion has given some idea of what I'm interested in and what I can deliver. Please look at the source code and form your own opinions. Nothing I can say here is more important than what you can see for yourself.