Taxonomy Network - serratus-bio/open-virome GitHub Wiki
NCBI Taxonomy
The NCBI Taxonomy forms a monopartite network of Taxon nodes connected by HAS_PARENT edges, taken from a human-curated ontology. The network is hierarchical, with a tree structure. The nodes and edges are extracted from the NCBI Taxonomy FTP site.
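As a rough illustration of the extraction step, the sketch below parses rows in the layout of the NCBI taxdump nodes.dmp file (fields separated by a tab, pipe, tab sequence) into Taxon nodes and HAS_PARENT edges. The sample rows and the function name parse_nodes_dmp are illustrative, and skipping the root's self-reference is an assumption; the pipeline behind this wiki may treat it differently.

```python
from typing import Dict, List, Tuple

# Illustrative rows in nodes.dmp layout: tax_id | parent tax_id | rank | ...
# Real files carry more columns per row; only the first three are used here.
SAMPLE_NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10239\t|\t1\t|\tsuperkingdom\t|\n"
    "2559587\t|\t10239\t|\trealm\t|\n"
)


def parse_nodes_dmp(text: str) -> Tuple[Dict[int, str], List[Tuple[int, int]]]:
    """Return (taxon id -> rank, list of (child, parent) HAS_PARENT edges)."""
    taxa: Dict[int, str] = {}
    edges: List[Tuple[int, int]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("|")]
        tax_id, parent_id, rank = int(fields[0]), int(fields[1]), fields[2]
        taxa[tax_id] = rank
        # Assumption: the root lists itself as its own parent, and that
        # self-loop is dropped so the result is a proper tree.
        if parent_id != tax_id:
            edges.append((tax_id, parent_id))
    return taxa, edges


taxa, has_parent = parse_nodes_dmp(SAMPLE_NODES_DMP)
```

With the self-loop dropped, a tree of n taxa yields n - 1 HAS_PARENT edges, which is a quick sanity check on any parsed dump.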
A bipartite network can be formed between SRA Run nodes and their associated Taxon nodes via HAS_HOST_METADATA edges. These edges are mined from the "Organism" label in the SRA run metadata.
A bipartite network can be formed between SRA Run nodes and their associated Taxon nodes via HAS_HOST_STAT edges. These edges are mined from the output of the NCBI STAT (SRA Taxonomy Analysis Tool).
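Because both host edge types connect the same two node sets (SRA Run and Taxon), they can be compared run by run. The sketch below is one way to do that with plain edge lists; the accessions, taxon IDs, and function names are made up for illustration and are not taken from the Open Virome data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, int]  # (SRA run accession, taxon id)

# Hypothetical edges; the values are placeholders, not real records.
host_metadata_edges = [("SRR0000001", 9606), ("SRR0000002", 9606), ("SRR0000003", 10090)]
host_stat_edges = [("SRR0000001", 9606), ("SRR0000001", 10090), ("SRR0000003", 10090)]


def taxa_by_run(edges: Iterable[Edge]) -> Dict[str, Set[int]]:
    """Group the Taxon side of a bipartite edge list by its SRA Run."""
    grouped: Dict[str, Set[int]] = defaultdict(set)
    for run, taxon in edges:
        grouped[run].add(taxon)
    return grouped


def compare_host_edges(metadata_edges: Iterable[Edge], stat_edges: Iterable[Edge]):
    """Per run, split taxa into those found by both sources or by only one."""
    meta, stat = taxa_by_run(metadata_edges), taxa_by_run(stat_edges)
    report = {}
    for run in sorted(set(meta) | set(stat)):
        m, s = meta.get(run, set()), stat.get(run, set())
        report[run] = {"both": m & s, "metadata_only": m - s, "stat_only": s - m}
    return report


report = compare_host_edges(host_metadata_edges, host_stat_edges)
```

A run can carry several HAS_HOST_STAT edges but only one metadata label in this toy data, which is why the comparison works on sets rather than single values.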
A bipartite network can be formed between sOTU Virus nodes and their associated Taxon nodes via HAS_INFERRED_TAXON edges. These edges are generated by running a BLAST search of each RdRP palmprint against GenBank viral genomes, taking the top hit, and mapping that hit's organism to the NCBI Taxonomy.
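The top-hit step can be sketched as follows. This is a minimal illustration that assumes hits are ranked by bitscore and that an accession-to-taxon lookup is already available; the source does not state which score defines the "top hit", and all identifiers below are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (sOTU id, subject accession, bitscore)

# Hypothetical BLAST hits and accession lookup, for illustration only.
hits = [
    ("u1", "ACC_A", 180.0),
    ("u1", "ACC_B", 210.5),
    ("u2", "ACC_C", 95.0),
    ("u3", "ACC_D", 60.0),
]
accession_to_taxon = {"ACC_A": 111, "ACC_B": 222, "ACC_C": 333}


def infer_taxon_edges(
    hits: Iterable[Hit], accession_to_taxon: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Keep the best-scoring hit per sOTU and map it to a taxon id."""
    best: Dict[str, Tuple[str, float]] = {}
    for sotu, accession, bitscore in hits:
        # Assumption: "top hit" means highest bitscore; ties keep the first seen.
        if sotu not in best or bitscore > best[sotu][1]:
            best[sotu] = (accession, bitscore)
    edges = []
    for sotu, (accession, _score) in sorted(best.items()):
        taxon = accession_to_taxon.get(accession)
        if taxon is not None:  # skip hits whose organism cannot be mapped
            edges.append((sotu, taxon))
    return edges


inferred_edges = infer_taxon_edges(hits, accession_to_taxon)
```

Each sOTU gets at most one HAS_INFERRED_TAXON edge in this sketch, and an sOTU whose top hit has no taxonomy mapping gets none.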
Summary stats
Total number of Taxon nodes: 2,501,873
Total number of HAS_PARENT relationships: 5,003,746
Total number of HAS_HOST_METADATA relationships: 7,679,352
Total number of HAS_HOST_STAT relationships: 17,597,641
Total number of HAS_INFERRED_TAXON relationships: 950,482
Communities
In the monopartite taxonomy network, all Taxon nodes form a single connected component.
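One way to check the single-component property is a breadth-first search that treats HAS_PARENT edges as undirected. The sketch below runs on a toy tree; the node IDs and the function name count_components are illustrative, and a graph with millions of nodes would more likely be checked with a graph database or library routine.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def count_components(nodes: Iterable[int], edges: Iterable[Tuple[int, int]]) -> int:
    """Count connected components, ignoring edge direction."""
    adjacency: Dict[int, Set[int]] = defaultdict(set)
    for child, parent in edges:
        adjacency[child].add(parent)
        adjacency[parent].add(child)

    seen: Set[int] = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1  # unseen node starts a new component
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return components


# Toy tree: node 1 is the root, edges are (child, parent).
toy_nodes: List[int] = [1, 2, 3, 4, 5]
toy_edges = [(2, 1), (3, 1), (4, 2), (5, 2)]
n_components = count_components(toy_nodes, toy_edges)
```

A result of 1 means every taxon is reachable from every other; adding a node with no HAS_PARENT edge raises the count.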
Visualizing the entire network with a force-directed layout shows naturally forming communities of closely related taxonomy IDs. Hierarchical community detection algorithms can group these IDs into a smaller set of labels, which reduces label cardinality during feature engineering.
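To make the label-reduction idea concrete, the sketch below uses a simpler stand-in rather than a community detection algorithm: since the taxonomy is a tree, each taxon is collapsed to its ancestor at a fixed depth below the root, so every subtree shares one label. The toy parent map, the function names, and the choice of depth are all assumptions for illustration.

```python
from typing import Dict, Iterable


def ancestor_at_depth(taxon: int, parent: Dict[int, int], depth: int) -> int:
    """Return the ancestor of `taxon` that sits `depth` steps below the root.

    Assumes `parent` maps child -> parent and the root has no entry
    (no self-loop), so the upward walk always terminates.
    """
    lineage = [taxon]
    while lineage[-1] in parent:
        lineage.append(parent[lineage[-1]])
    lineage.reverse()  # root first
    # Taxa shallower than `depth` keep their own id as the label.
    return lineage[min(depth, len(lineage) - 1)]


def reduce_labels(taxa: Iterable[int], parent: Dict[int, int], depth: int) -> Dict[int, int]:
    """Map each taxon to a coarser label: its ancestor at the given depth."""
    return {taxon: ancestor_at_depth(taxon, parent, depth) for taxon in taxa}


# Toy tree rooted at 1: 2 and 3 are children of 1, and so on.
toy_parent = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 4}
coarse = reduce_labels([4, 5, 6, 7], toy_parent, depth=1)
```

Here four distinct taxon IDs collapse to two labels. A community detection method would instead derive the groups from network structure, which matters once the bipartite edges are included.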