Limitations and suggestions - tahiri-lab/KMeansPhyloTreesClustering GitHub Wiki

⚠️ Limitations and Suggestions

🔴 Limitations

1. Robinson–Foulds (RF) Distance

The classical Robinson–Foulds distance compares only the topology of phylogenetic trees: it counts the bipartitions (splits) that appear in one tree but not in the other.
It ignores branch lengths, which can carry important biological information such as the amount of evolutionary change along each lineage.

As a result, two trees with the same topology but very different branch lengths have an RF distance of 0 and are treated as identical, which can discard relevant biological signal.
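
The point above can be illustrated with a minimal sketch. It does not reproduce this project's implementation: the tree encoding (nested lists of `(child, branch_length)` pairs) and the helper names `leaf_set`, `bipartitions`, and `rf_distance` are assumptions made only for this example.

```python
# Toy tree encoding (an assumption for this sketch): a leaf is a string,
# an internal node is a list of (child, branch_length) pairs.

def leaf_set(node):
    """Return the set of leaf names below `node`."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset()
    for child, _length in node:
        leaves |= leaf_set(child)
    return leaves


def bipartitions(tree):
    """Collect the non-trivial splits of the leaf set induced by internal edges."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        for child, _length in node:  # branch lengths are read but never used
            side = leaf_set(child)
            other = all_leaves - side
            if len(side) > 1 and len(other) > 1:
                # Store the split as an unordered pair so A|B equals B|A.
                splits.add(frozenset([side, other]))
            visit(child)

    visit(tree)
    return splits


def rf_distance(tree_1, tree_2):
    """Number of splits found in exactly one of the two trees."""
    return len(bipartitions(tree_1) ^ bipartitions(tree_2))


# Same topology ((A,B),(C,D)), very different branch lengths.
tree_a = [([("A", 0.1), ("B", 0.1)], 0.5), ([("C", 0.1), ("D", 0.1)], 0.5)]
tree_b = [([("A", 9.0), ("B", 0.01)], 4.0), ([("C", 2.5), ("D", 7.0)], 0.2)]
# Different topology ((A,C),(B,D)).
tree_c = [([("A", 0.1), ("C", 0.1)], 0.5), ([("B", 0.1), ("D", 0.1)], 0.5)]

print(rf_distance(tree_a, tree_b))  # 0: branch lengths are ignored
print(rf_distance(tree_a, tree_c))  # 2: one split differs in each tree
```

A branch-length-aware measure would be needed to tell `tree_a` and `tree_b` apart.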


2. K-means Clustering

The K-means algorithm is not well suited to phylogenetic tree data.

K-means is designed for data in a Euclidean space, where objects are represented as points (vectors) in a coordinate system. Phylogenetic trees are not points in such a space, but complex hierarchical structures.

Moreover, K-means computes each cluster centroid as the arithmetic mean of the cluster's points.
In this context, the centroid is represented by a tree (supertree), which is not mathematically equivalent to a true mean in Euclidean space.

The centroid is therefore only an approximation, and the properties that justify K-means in Euclidean space (for example, that the mean minimizes the sum of squared distances within a cluster) need not hold for trees.
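
For comparison, here is a minimal sketch of what the centroid means in the Euclidean setting that K-means assumes. The function names `centroid` and `sum_squared_distances` and the sample points are illustrative only. The mean requires adding coordinates and dividing by a count, operations that have no direct counterpart for trees.

```python
def centroid(points):
    """Coordinate-wise arithmetic mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def sum_squared_distances(points, center):
    """Total squared Euclidean distance from every point to `center`."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, center))
        for point in points
    )


points = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
mean = centroid(points)

print(mean)                                 # [1.0, 1.0]
print(sum_squared_distances(points, mean))  # 8.0
# Any actual data point used as the center gives a larger total:
print([sum_squared_distances(points, p) for p in points])  # [14.0, 14.0, 14.0]
```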


3. Sensitivity to the Alpha Parameter

The parameter α (alpha) must be chosen by the user, which introduces subjectivity into the clustering process.

If α is:

  • too large → it may overemphasize certain criteria
  • too small → it may underrepresent important variations

Either case can significantly affect the quality and stability of the resulting clusters, so results should be checked across several values of α rather than reported for a single setting.
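
One way to quantify this sensitivity is to cluster the same trees with several α values and compare the resulting partitions. The sketch below computes the Rand index between two label vectors. The function name `rand_index` and the example labels are hypothetical and are not output from this project.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two items
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical cluster labels for five trees under two alpha values.
labels_alpha_low = [0, 0, 1, 1, 2]
labels_alpha_high = [0, 0, 1, 1, 1]

print(rand_index(labels_alpha_low, labels_alpha_high))  # 0.8
```

Values close to 1 across a range of α suggest a stable clustering. Lower values indicate that the result depends strongly on the chosen α.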


🚀 Suggestions for Improvement

1. Alternative Clustering Methods

To better handle phylogenetic tree structures, alternative clustering approaches could be used:

  • Hierarchical clustering
  • K-medoids (PAM)

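These methods need only a matrix of pairwise distances between trees, so they are better suited to non-Euclidean data. Hierarchical clustering can be used with linkages such as single, complete, or average, and K-medoids represents each cluster by an actual input tree (the medoid) rather than by an artificial centroid.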
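
As a minimal sketch of the K-medoids idea, the function below works directly on a precomputed distance matrix (for example, pairwise RF distances). It uses a simple alternating heuristic rather than the full PAM swap procedure. The function name `k_medoids`, the deterministic initialization, and the example matrix are assumptions made for illustration.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster items given a symmetric distance matrix `dist` (list of lists).

    Returns (medoids, labels): the indices of the representative items and,
    for each item, the index of the cluster it belongs to.
    """
    n = len(dist)
    medoids = list(range(k))  # assumption: start from the first k items
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        labels = [
            min(range(k), key=lambda c: dist[i][medoids[c]])
            for i in range(n)
        ]
        # Update step: within each cluster, pick the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # can happen when two medoids are at distance 0
                new_medoids.append(medoids[c])
                continue
            best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Hypothetical pairwise distances between five trees: two tight groups.
distances = [
    [0, 2, 8, 8, 8],
    [2, 0, 8, 8, 8],
    [8, 8, 0, 2, 2],
    [8, 8, 2, 0, 2],
    [8, 8, 2, 2, 0],
]

medoids, labels = k_medoids(distances, k=2)
print(medoids)  # [2, 0]
print(labels)   # [1, 1, 0, 0, 0]
```

Every medoid is one of the input trees, so no supertree or other synthetic centroid has to be constructed.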


2. Improved Biological Interpretation

The current system provides limited support for biological interpretation of the results.

Improving this aspect would allow users to:

  • better understand evolutionary relationships
  • extract meaningful biological insights from clusters

3. Visualization Tools

The user interface lacks visual representation of results.

Adding visualization features such as:

  • tree plotting
  • cluster visualization
  • graphical summaries

would significantly improve:

  • usability
  • interpretability
  • user experience

4. User Experience Enhancements

The current command-line interface can be difficult for users who are not comfortable working in a terminal.

Improvements could include:

  • a graphical user interface (GUI)
  • simplified parameter selection
  • clearer output formatting

🔥 Conclusion

While the project provides a structured approach to clustering phylogenetic trees,
its limitations highlight the need for more suitable algorithms and better usability tools.

Future improvements should focus on:

  • adapting methods to tree-structured data
  • enhancing biological relevance
  • improving user interaction and visualization