Tree placement and taxonomy data - ababaian/serratus GitHub Wiki

Current best pol alignment:

pol.muscle.afa

RAxML pol tree from this alignment:

RAxML_bestTree.pol.muscle.raxml

Radial visualization of pol tree colored by genus:

I dumped some data from my Linux server at home to S3. This is a snapshot of work in progress; it is not well organized or documented. You can access s3:// URLs using the AWS CLI, or with wget/curl by replacing s3://serratus-public/ with https://serratus-public.s3.amazonaws.com/.

There are two top-level directories:

s3://serratus-public/rce/uniprot_genes
s3://serratus-public/rce/complete_cov_genomes
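The prefix substitution for wget/curl access can be sketched in a few lines of Python. This is only an illustration of the string replacement described above; the function name `s3_to_https` is made up for this example and is not part of any Serratus tooling.

```python
# Prefixes taken from the text above.
S3_PREFIX = "s3://serratus-public/"
HTTPS_PREFIX = "https://serratus-public.s3.amazonaws.com/"


def s3_to_https(url: str) -> str:
    """Rewrite an s3://serratus-public/ URL as its public HTTPS equivalent.

    Hypothetical helper: it only swaps the prefix and does not check
    that the object actually exists.
    """
    if not url.startswith(S3_PREFIX):
        raise ValueError(f"not a serratus-public S3 URL: {url!r}")
    return HTTPS_PREFIX + url[len(S3_PREFIX):]


if __name__ == "__main__":
    print(s3_to_https("s3://serratus-public/rce/uniprot_genes"))
```

The resulting URL can then be passed to wget or curl in place of the s3:// form.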

The polymerase (also called pol or RdRP for RNA dependent RNA polymerase) alignment is in this sub-directory:

uniprot_genes/pol_msas/

There is a MUSCLE alignment in aligned FASTA (.afa) and Phylip sequential (.phys) formats. I tried running ProbCons, but it was very slow and had not completed after a day or so. It might be nice to make two or three different trees and take a consensus. For now, I think the MUSCLE+RAxML tree is fine.
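For anyone who wants to sanity-check the .afa file before building another tree, here is a minimal sketch of an aligned-FASTA reader in plain Python. It assumes standard FASTA layout (a `>` header line followed by one or more sequence lines) and takes the first whitespace-delimited token of the header as the label; both are assumptions about the file, not something verified against pol.muscle.afa. The function name is made up for this example.

```python
from typing import Dict, Iterable, List


def read_aligned_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse aligned FASTA text into {label: aligned_sequence}.

    Raises ValueError if labels repeat or if the rows differ in length,
    since every row of an alignment should have the same number of columns.
    """
    parts: Dict[str, List[str]] = {}
    label = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            label = fields[0]
            if label in parts:
                raise ValueError(f"duplicate label: {label}")
            parts[label] = []
        elif label is None:
            raise ValueError("sequence data before first header")
        else:
            parts[label].append(line)

    aligned = {name: "".join(chunks) for name, chunks in parts.items()}
    if len({len(seq) for seq in aligned.values()}) > 1:
        raise ValueError("rows have different lengths; not a valid alignment")
    return aligned
```

Typical use would be `with open("pol.muscle.afa") as fh: aln = read_aligned_fasta(fh)`, after downloading the file locally.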

RAxML output is in this directory:

uniprot_genes/raxml/pol.muscle/

There is one pol gene for each full-length genome in GenBank. Information about the genomes is in cov_complete_genomes/, including GenBank records (.gb), FASTA sequences etc. The cov_complete_genomes/complete.tsv file has a handy summary of taxonomic information. Fields are:

1. GenBank accession.
2. NCBI integer taxonomy identifier of the GB record.
3. Species taxonomy identifier (inferred from the taxonomy database tree).
4. Genome length in bases.
5. Taxonomy name corresponding to field 2.
6. Full taxonomy from GB record.
7. Full taxonomy with rankname:sciname by climbing taxonomy tree from id in field 2.
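The field list above maps naturally onto a small record type. The sketch below reads complete.tsv into named tuples; it assumes the file is tab-separated, has no header row, and has exactly seven columns in the order listed, none of which has been checked against the actual file. The names `GenomeRecord` and `read_complete_tsv` are invented for this example.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class GenomeRecord(NamedTuple):
    accession: str        # field 1: GenBank accession
    taxid: int            # field 2: NCBI taxonomy id of the GB record
    species_taxid: int    # field 3: species-level taxonomy id
    length: int           # field 4: genome length in bases
    tax_name: str         # field 5: taxonomy name for field 2
    gb_taxonomy: str      # field 6: full taxonomy from the GB record
    ranked_taxonomy: str  # field 7: rankname:sciname taxonomy


def read_complete_tsv(handle: Iterable[str]) -> Iterator[GenomeRecord]:
    """Yield one GenomeRecord per non-empty row of a complete.tsv-style file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 7:
            raise ValueError(f"expected 7 fields, got {len(row)}: {row!r}")
        acc, taxid, species, length, name, gb_tax, ranked = row
        yield GenomeRecord(acc, int(taxid), int(species), int(length),
                           name, gb_tax, ranked)
```

With the records loaded, a lookup such as `{r.accession: r.species_taxid for r in records}` gives the species for each leaf of the pol tree, assuming leaf labels are GenBank accessions.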