Taxonomy Network - serratus-bio/open-virome GitHub Wiki

NCBI Taxonomy

The NCBI Taxonomy forms a monopartite network of Taxon nodes connected by HAS_PARENT edges, derived from a human-curated ontology. The network is hierarchical, with a tree structure. The nodes and edges are extracted from the NCBI Taxonomy FTP site.
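As a rough illustration of the extraction step, the sketch below reads rows in the layout of `nodes.dmp` from the NCBI taxonomy dump (fields separated by `\t|\t`, each row ending in `\t|`) and turns them into Taxon nodes and HAS_PARENT edges. The function names and the sample rows are hypothetical, and dropping the root's self-reference is an assumption made here; the actual loader used for this graph may differ.

```python
import io


def parse_nodes_dmp(handle):
    """Yield (tax_id, parent_tax_id, rank) from a nodes.dmp-style stream."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Each row ends with "\t|"; strip it before splitting on "\t|\t".
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        # Only the first three columns are needed here; the real file has more.
        yield int(fields[0]), int(fields[1]), fields[2]


def build_has_parent_edges(rows):
    """Return ({tax_id: rank}, [(child, parent), ...])."""
    taxa = {}
    edges = []
    for tax_id, parent_id, rank in rows:
        taxa[tax_id] = rank
        # Assumption: the root lists itself as its own parent, and that
        # self-loop is skipped so the result is a proper tree.
        if tax_id != parent_id:
            edges.append((tax_id, parent_id))
    return taxa, edges


# Illustrative rows only; ranks in the current dump may differ.
SAMPLE = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10239\t|\t1\t|\tsuperkingdom\t|\n"
    "2559587\t|\t10239\t|\tclade\t|\n"
)

taxa, edges = build_has_parent_edges(parse_nodes_dmp(io.StringIO(SAMPLE)))
print(len(taxa), "Taxon nodes,", len(edges), "HAS_PARENT edges")
```

With the self-loop skipped, a tree of n taxa yields n - 1 child-to-parent edges, which makes a quick sanity check after loading.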

A bipartite network can be formed with SRA Run nodes and associated Taxon nodes via HAS_HOST_METADATA edges. These edges are mined from the "Organism" label in the SRA run metadata.

A bipartite network can be formed with SRA Run nodes and associated Taxon nodes via HAS_HOST_STAT edges. These edges are mined from the output of NCBI STAT (the SRA Taxonomy Analysis Tool).

A bipartite network can be formed with sOTU Virus nodes and associated Taxon nodes via HAS_INFERRED_TAXON edges. These edges are generated by BLASTing the RdRP palmprint against GenBank viral genomes to find the top hit, then mapping the top hit's organism to the NCBI Taxonomy.
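The top-hit mapping described above could look roughly like the following sketch. It assumes the BLAST results have already been reduced to (query, subject accession, bit score) triples, that "top hit" means the highest bit score, and that an accession-to-taxon lookup is available. All identifiers, function names, and sample values are hypothetical.

```python
def top_hits(blast_rows):
    """Keep the highest-scoring subject accession for each query."""
    best = {}
    for query, subject, bitscore in blast_rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def inferred_taxon_edges(blast_rows, accession_to_taxid):
    """Build (sOTU, 'HAS_INFERRED_TAXON', tax_id) triples from top hits."""
    edges = []
    for sotu, accession in top_hits(blast_rows).items():
        tax_id = accession_to_taxid.get(accession)
        # Top hits whose accession has no known taxon are skipped.
        if tax_id is not None:
            edges.append((sotu, "HAS_INFERRED_TAXON", tax_id))
    return sorted(edges)


# Made-up palmprint IDs, accessions, scores, and taxon IDs.
blast_rows = [
    ("u1", "ACC_A", 50.0),
    ("u1", "ACC_B", 120.5),
    ("u2", "ACC_A", 80.0),
    ("u3", "ACC_C", 60.0),
]
accession_to_taxid = {"ACC_A": 111, "ACC_B": 222}

inferred = inferred_taxon_edges(blast_rows, accession_to_taxid)
for edge in inferred:
    print(edge)
```

Each sOTU ends up with at most one HAS_INFERRED_TAXON edge under this scheme, since only the single best hit is kept.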

Summary stats

Total number of Taxon nodes: 2,501,873

Total number of HAS_PARENT relationships: 5,003,746

Total number of HAS_HOST_METADATA relationships: 7,679,352

Total number of HAS_HOST_STAT relationships: 17,597,641

Total number of HAS_INFERRED_TAXON relationships: 950,482

Communities

In the monopartite taxonomy network, all Taxon nodes form a single connected component.
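A claim like this can be checked by counting connected components while treating HAS_PARENT edges as undirected. The sketch below does so with a breadth-first search over a toy edge list; the function name and the toy data are hypothetical, and a graph database or graph library would normally do this at full scale.

```python
from collections import defaultdict, deque


def count_components(nodes, edges):
    """Count connected components, ignoring edge direction."""
    adjacency = defaultdict(set)
    for child, parent in edges:
        adjacency[child].add(parent)
        adjacency[parent].add(child)

    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        # Every unseen node starts a new component; BFS marks the rest of it.
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return components


# Toy tree: node 1 is the root.
toy_nodes = [1, 2, 3, 4, 5]
toy_edges = [(2, 1), (3, 1), (4, 2), (5, 2)]

print(count_components(toy_nodes, toy_edges))
```

A result of 1 means every taxon is reachable from every other, which is what a rooted tree guarantees.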

Visualizing the entire network with a force-directed layout reveals naturally forming communities of closely related taxonomy IDs. Hierarchical community detection algorithms can exploit this structure to reduce the number of distinct taxon labels during a feature engineering step.
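To make the label-reduction idea concrete, here is a deliberately simple stand-in that is not a community detection algorithm: because the taxonomy is already a tree, each taxon can be replaced by its ancestor at a fixed depth below the root. The function names, the `depth` parameter, and the toy tree are assumptions for illustration; a real pipeline might instead run an algorithm such as Louvain and use its community IDs as labels.

```python
def lineage(tax_id, parent_of):
    """Return the path from the root down to tax_id."""
    path = [tax_id]
    # Stop at the root: either it has no parent entry or it points to itself.
    while path[-1] in parent_of and parent_of[path[-1]] != path[-1]:
        path.append(parent_of[path[-1]])
    path.reverse()
    return path


def coarsen(tax_id, parent_of, depth):
    """Map a taxon to its ancestor `depth` steps below the root.

    Taxa shallower than `depth` keep their own ID.
    """
    path = lineage(tax_id, parent_of)
    return path[min(depth, len(path) - 1)]


# Toy child -> parent map; node 1 is the root.
parent_of = {2: 1, 3: 1, 4: 2, 5: 2, 6: 4}

labels = {t: coarsen(t, parent_of, depth=1) for t in [1, 2, 3, 4, 5, 6]}
print(labels)
print(len(set(labels.values())), "distinct labels")
```

Choosing a smaller depth gives fewer, coarser labels; a larger depth keeps more of the original resolution.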