GSOC final submission - AlishaMechtley/treematcher GitHub Wiki

Tree searching using regular-expression-like queries

-Organization: Open Bioinformatics Foundation
-Project description: http://obf.github.io/GSoC/ideas/#ete-toolkit
-Mentors: Jaime Huerta-Cepas, Renato Alves, François Serra

A short description of what work was done

A command line tool

I created a command line tool. It adds an input option (multiple patterns can be read from a file) and several output options (e.g., image rendering, Newick format, or returning only the first node that matches).

Improvements to the search engine

I added a cache option that speeds up searches of very large trees with thousands of leaves. A leaf-only parameter and a modified caching algorithm that can cache all nodes, not just leaves, greatly expand the types of attributes that can be cached.
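The caching idea can be illustrated with a minimal sketch. This is not the treematcher or ETE implementation: the `Node` class, the `build_cache` function, and its `leaves_only` argument are hypothetical names chosen for illustration. The sketch assumes the cache maps each node to the set of attribute values found beneath it, so that a search can test membership without walking the subtree again.

```python
class Node:
    """Tiny stand-in for a tree node (hypothetical, not the ETE TreeNode class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def is_leaf(self):
        return not self.children


def build_cache(root, attr="name", leaves_only=True):
    """Map every node to the set of `attr` values found at or beneath it.

    With leaves_only=True only leaf values are collected; otherwise the
    values of internal nodes (including the node itself) are included too.
    """
    cache = {}
    # Iterative post-order traversal, so very deep trees do not hit
    # Python's recursion limit.
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if not children_done:
            stack.append((node, True))
            stack.extend((child, False) for child in node.children)
        else:
            values = set()
            for child in node.children:
                values |= cache[child]  # children are always cached first
            if node.is_leaf() or not leaves_only:
                values.add(getattr(node, attr))
            cache[node] = values
    return cache


# Example tree in Newick notation: ((A,B)X,C)R;
a, b, c = Node("A"), Node("B"), Node("C")
x = Node("X", [a, b])
r = Node("R", [x, c])

leaf_cache = build_cache(r)                     # leaf names only
full_cache = build_cache(r, leaves_only=False)  # names of all nodes
```

Each node is visited once while the cache is built, after which a question such as "does the subtree under X contain leaf A?" becomes a set lookup (`"A" in leaf_cache[x]`).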

Better syntax

The original search engine required that all sibling nodes be specified and matched explicitly. By permuting the target trees rather than the pattern, I removed the need for the pattern to label every sibling node explicitly. I also created a smart_lineage function, which demonstrates how to evaluate constraints using ast: the parts of a constraint can be pulled out and evaluated differently. The example shows how a name (e.g., Homo sapiens) can be used instead of a taxid with @.lineage, so that the program appears to "know" what you are trying to do instead of throwing an error for incorrect usage. Finally, I began working on a way to match zero or more nodes in patterns (as * is used in regular expressions); this currently works for linear patterns and will be expanded to more complex patterns.
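As a rough illustration of the ast idea, the sketch below rewrites a species name into a taxid before the constraint is evaluated. It is not the actual smart_lineage code: `NAME_TO_TAXID`, `LineageNameRewriter`, `evaluate_constraint`, and `Target` are hypothetical names, the lookup table stands in for a real taxonomy database, and Python 3.8 or later is assumed (for `ast.Constant`).

```python
import ast

# Hypothetical lookup table; a real tool would query a taxonomy database.
NAME_TO_TAXID = {"Homo sapiens": 9606, "Pan troglodytes": 9598}


class LineageNameRewriter(ast.NodeTransformer):
    """Replace string names with taxids in `<name> in <x>.lineage` tests."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        is_lineage_test = (
            len(node.ops) == 1
            and isinstance(node.ops[0], ast.In)
            and isinstance(node.comparators[0], ast.Attribute)
            and node.comparators[0].attr == "lineage"
            and isinstance(node.left, ast.Constant)
            and isinstance(node.left.value, str)
        )
        if is_lineage_test and node.left.value in NAME_TO_TAXID:
            taxid = ast.Constant(value=NAME_TO_TAXID[node.left.value])
            node.left = ast.copy_location(taxid, node.left)
        return node


def evaluate_constraint(constraint, target):
    """Evaluate a constraint string in which `@` stands for the target node."""
    # "@" is not valid in a Python expression, so swap in a variable name.
    source = constraint.replace("@", "target_node")
    tree = ast.parse(source, mode="eval")
    tree = ast.fix_missing_locations(LineageNameRewriter().visit(tree))
    code = compile(tree, "<constraint>", "eval")
    # eval is used for brevity only; it is not safe for untrusted input.
    return eval(code, {"__builtins__": {}}, {"target_node": target})


class Target:
    """Hypothetical stand-in for a node annotated with a taxid lineage."""

    def __init__(self, lineage):
        self.lineage = lineage


# Abbreviated lineage, for illustration only.
human = Target(lineage=[1, 9443, 9604, 9605, 9606])

by_name = evaluate_constraint('"Homo sapiens" in @.lineage', human)
by_taxid = evaluate_constraint('9606 in @.lineage', human)
other = evaluate_constraint('"Pan troglodytes" in @.lineage', human)
```

Here `by_name` and `by_taxid` give the same answer because the string is swapped for its taxid in the syntax tree before evaluation, while names missing from the lookup table are left untouched and simply fail to match.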

Documentation and unit tests

I wrote the unit tests and documented the code and the ReadMe file. I also created a biological example.

A more detailed description of the work that was done can be found in my blog: https://github.com/AlishaMechtley/treematcher/wiki/Blog

What code got merged, what code didn't get merged

I began by working in my own GitHub repository and opened a pull request whenever I made significant progress. https://github.com/AlishaMechtley/treematcher

Eventually, I started pushing directly to the main repository. https://github.com/etetoolkit/treematcher

Almost everything was merged, except in two cases where I force-pushed and lost my own commits as well as my mentors' contributions. Once I learned how to merge code with conflicts (i.e., go through it line by line and accept what you want to keep), everything was merged properly.

What's left to do

The optional machine learning for autogenerated patterns was not implemented.

Perhaps this can be a possibility for the next GSoC.

The optional visualization tool was not implemented.

However, I do have a plan to implement one in the coming weeks.