6. MLST and phylogenetic analysis - milena-t/masters_thesis Wiki

Background

Multi-locus sequence typing (MLST) compares each genome against a schema of 7 core genes and assigns a number to every newly discovered allele of each gene. Each genome can then be typed by the combination of alleles it carries for the 7 genes. For example, the baseline profile is 1-1-1-1-1-1-1, because the first allele recorded for every gene is numbered 1. If the first genome that is aligned differs in the last two of the 7 genes, its profile is 1-1-1-1-1-2-2. Each new allele of a gene gets the next free number. The numbers do not indicate how different the alleles are: alleles 1 and 2 of a gene are simply the first two that were discovered, and they may be more or less similar to each other than, for example, alleles 1 and 189.
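The numbering can be illustrated with a small sketch. The gene names and sequences are made-up placeholders (three genes instead of seven), and AlleleRegistry is a hypothetical helper, not part of any MLST tool. It only shows that allele numbers reflect the order of discovery and not sequence similarity.

```python
from collections import defaultdict


class AlleleRegistry:
    """Assigns allele numbers per gene in order of discovery (illustrative only)."""

    def __init__(self):
        # gene name -> {sequence: allele number}
        self._alleles = defaultdict(dict)

    def allele_number(self, gene, sequence):
        known = self._alleles[gene]
        if sequence not in known:
            # A previously unseen sequence gets the next free number.
            known[sequence] = len(known) + 1
        return known[sequence]

    def profile(self, genome, gene_order):
        """Return the allele profile of a genome as a string such as '1-2-2'."""
        return "-".join(str(self.allele_number(g, genome[g])) for g in gene_order)


# Toy data: made-up sequences, not real C. jejuni alleles.
genes = ["geneA", "geneB", "geneC"]
reference = {"geneA": "ATGAAATAA", "geneB": "ATGCCCTAA", "geneC": "ATGGGGTAA"}
genome_1 = {"geneA": "ATGAAATAA", "geneB": "ATGCCATAA", "geneC": "ATGGGATAA"}

registry = AlleleRegistry()
reference_profile = registry.profile(reference, genes)
genome_1_profile = registry.profile(genome_1, genes)
print(reference_profile)  # 1-1-1
print(genome_1_profile)   # 1-2-2
```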

Since the 7 genes in the schema are specifically picked because they are highly conserved, this method can be used to perform a (low-resolution) phylogenetic analysis. It is easy to cluster the genomes when they are summarized by a sequence of 7 numbers. These clusters can be used to form a tree.
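As a rough illustration of why profiles are easy to cluster, the sketch below counts the loci at which two profiles differ and groups genomes by single linkage. The profiles, the genome names, and the max_distance threshold are assumptions chosen for the example. This is not the algorithm used by any particular tool.

```python
from itertools import combinations


def profile_distance(a, b):
    """Number of loci at which two allele profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def cluster_profiles(profiles, max_distance=1):
    """Single-linkage clustering: genomes whose profiles differ at no more
    than max_distance loci end up in the same cluster."""
    parent = {name: name for name in profiles}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if profile_distance(profiles[a], profiles[b]) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for name in profiles:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up 7-locus profiles.
profiles = {
    "genome_a": (1, 1, 1, 1, 1, 1, 1),
    "genome_b": (1, 1, 1, 1, 1, 1, 2),
    "genome_c": (4, 7, 3, 2, 9, 5, 6),
}
clusters = cluster_profiles(profiles, max_distance=1)
print(clusters)  # [['genome_a', 'genome_b'], ['genome_c']]
```

Because allele numbers carry no similarity information, the distance only counts whether two alleles are identical or not.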

The clusters can be analyzed with Roary. Since each cluster contains fewer genes than the whole genome, it is likely possible to run them through Roary and then compare the pan-genomes.

Pipeline

chewBBACA

Bo has suggested chewBBACA (github, wiki) for cgMLST. I used the conda installation.

Create the custom Schema

To ensure reproducibility, it is important to use the same training file (species-specific) for each created schema. chewBBACA uses prodigal training files, which are available for download here.

chewBBACA.py CreateSchema -i /data/campy/jejuni1000_mlst -o . --n test_schema --ptf Campylobacter_jejuni.trn --cpu 16

Detailed info on the arguments here. The only thing of note is that the fasta files in -i need to be uncompressed.
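If the genomes are stored gzipped, they have to be unpacked first. A minimal sketch of one way to do that, assuming the files match *.fna.gz; decompress_fastas and the target directory are hypothetical names, and the originals are left untouched:

```python
import gzip
import shutil
from pathlib import Path


def decompress_fastas(source_dir, target_dir, pattern="*.fna.gz"):
    """Write an uncompressed copy of every gzipped FASTA file in source_dir
    to target_dir and return the new paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(Path(source_dir).glob(pattern)):
        # "genome.fna.gz" -> "genome.fna"
        destination = target / archive.with_suffix("").name
        with gzip.open(archive, "rb") as src, open(destination, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(destination)
    return written
```

The returned directory can then be passed to -i, for example after calling decompress_fastas("/data/campy/jejuni1000_mlst", "/path/to/uncompressed").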

After that, the schema is refined by performing an allele call on the genomes used to create the schema. This takes a while, about 5 hours for 1000 C. jejuni genomes. The schema can then be evaluated, more info here.

Use the Oxford schema

I will use the Oxford schema, which is available on PubMLST. I will start with the 7-gene MLST schema only and may expand to the cgMLST schema later.

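chewBBACA has a function called PrepExternalSchema that filters out sequences that do not meet its criteria (for example, sequences must begin with a start codon, end with a stop codon, and contain no ambiguous characters). This function does not work with the schema linked above: it recognizes the genes and alleles but classifies all of them as invalid. A likely reason is that the alleles of the 7-gene MLST schema are internal gene fragments rather than complete coding sequences, so they have no start or stop codons.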
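To see which criterion an allele fails, a check along these lines can be run on the downloaded sequences. This is my own approximation of the criteria named above, not chewBBACA's implementation. The set of start codons is an assumption, and cds_problems is a hypothetical helper.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed subset of bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(sequence):
    """Return a list of reasons why a sequence fails basic coding-sequence checks."""
    seq = sequence.upper()
    problems = []
    if set(seq) - set("ACGT"):
        problems.append("ambiguous characters")
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    if seq[:3] not in START_CODONS:
        problems.append("no start codon")
    if seq[-3:] not in STOP_CODONS:
        problems.append("no stop codon")
    if len(seq) % 3 == 0:
        # A stop codon before the final codon would truncate the protein.
        inner = [seq[i:i + 3] for i in range(3, len(seq) - 3, 3)]
        if any(codon in STOP_CODONS for codon in inner):
            problems.append("in-frame internal stop codon")
    return problems


print(cds_problems("ATGAAATAA"))  # []
print(cds_problems("AAACCCGGG"))  # ['no start codon', 'no stop codon']
```

An empty list means the sequence passes all checks in this sketch.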

Use the Innuendo schema

The INNUENDO schema is already prepared for chewBBACA, so I will try it for allele calling. It can be downloaded and unpacked (tar -xvzf Cjejuni_wgMLST_2795_schema.tar.gz), and the resulting folder can be passed directly to the -g argument of the AlleleCall function. chewBBACA warns that the schema was generated with chewBBACA 2.1.0 or lower, but I have chosen to ignore this and proceed anyway.

chewBBACA requires four additional parameters that aren't explained in the documentation. They are listed below with my chosen values in ():

The default values are intended for wgMLST schema creation. There is supposed to be a schema config for the allele call, but I don't have one, so I'll try the default values next time.

Results on 20 test-genomes:

The results can be visualized in PHYLOViZ if results_alleles.tsv is first prepared as follows:

chewBBACA.py ExtractCgMLST -i /path/to/allelecall/results/results_alleles.tsv -o /path/to/OutputFolderName --t 0

This produces two relevant output files that can be imported into PHYLOViZ: cgMLST.tsv and metadata_stats.tsv.

MLST

The mlst program (GitHub) performs classical MLST with 7 genes. I used the conda installation, but one Perl module (List::MoreUtils) has to be added to the conda environment manually, which I did with cpanm List::MoreUtils (description of cpanm here).

The usage is very simple, the only important part is that the command line output should be redirected into a file:

mlst /data/campy/jejuni1000_mlst/*fna.gz > jejuni1000.mlst

--csv can be used to return the output as csv instead of the default tsv.

Since the input files can be given with a * wildcard, which the shell expands to a space-separated list, an explicit space-separated list of files works as well. For this reason, when running the entire dataset, I will use the db_tools.perl script to make a list with the absolute file paths of all the jejuni files in the database. This list can then be turned into a bash script by adding the mlst command at the top and an output redirection with > at the bottom. The option for the correct schema also needs to be added.

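This works fine for 10 test genomes, but when it is done for all >47000 C. jejuni genomes, the program terminates with the error "argument list too long". The error comes from the operating system rather than from mlst: the combined length of all file paths exceeds the limit for a single command line. I have decided to split the job into 47 individual runs with 1000 genomes each and to merge the output files afterwards.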
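The splitting and merging could be scripted roughly as follows. This is a sketch and not the script I actually ran: the function names and the output file names (mlst_batch_001.tsv and so on) are made up, and merge_outputs assumes that mlst writes its results without a header line, so the batch files can simply be concatenated.

```python
from pathlib import Path


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_mlst_commands(genome_paths, batch_size=1000, scheme="campylobacter"):
    """Return one (argv, output_file_name) pair per batch of genomes."""
    jobs = []
    for index, batch in enumerate(batches(list(genome_paths), batch_size), start=1):
        argv = ["mlst", "--scheme", scheme, *map(str, batch)]
        jobs.append((argv, f"mlst_batch_{index:03d}.tsv"))
    return jobs


def merge_outputs(batch_files, merged_file):
    """Concatenate the per-batch result files (assumes no header lines)."""
    with open(merged_file, "w") as merged:
        for name in batch_files:
            merged.write(Path(name).read_text())


# Running the jobs requires mlst on the PATH, so it is only outlined here:
#   import subprocess
#   for argv, out_name in build_mlst_commands(paths):
#       with open(out_name, "w") as out:
#           subprocess.run(argv, stdout=out, check=True)
```

If the number of genomes is not an exact multiple of the batch size, the last batch simply holds the remainder.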

Schema

There is a specific schema available for Campylobacter (to list all schemas, run mlst --longlist). It can be selected with --scheme campylobacter.

Oxford Schema

The Oxford schema is a general cgMLST (or MLST) schema that is available for download on PubMLST.

Evaluation

I will use PHYLOViZ for evaluation. It requires two input files. The first is a tsv file with a header in which each row contains the genome name, the ST, and the allele types. The second is a corresponding metadata file that is organized in the same way, except that the columns hold the metadata values.
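The allele table can probably be derived from the mlst output with a small conversion step. The sketch below assumes that each mlst output line has the form file name, scheme, ST, followed by one gene(allele) field per locus, with no header line. The column names "genome" and "ST" are my own choice, mlst_to_profile_table is a hypothetical helper, and the example rows are made up for illustration.

```python
import csv
import io
import re

# Matches fields such as "aspA(2)" or "uncA(-)".
ALLELE_FIELD = re.compile(r"^(?P<gene>[^()]+)\((?P<allele>.*)\)$")


def mlst_to_profile_table(mlst_text):
    """Convert header-less mlst output into a tab-separated table with a header."""
    header, rows = None, []
    for fields in csv.reader(io.StringIO(mlst_text), delimiter="\t"):
        if len(fields) < 4:
            continue  # skip blank lines and genomes without allele calls
        genome, _scheme, st, *allele_fields = fields
        matches = [ALLELE_FIELD.match(f) for f in allele_fields]
        if not all(matches):
            continue  # skip lines that do not follow the assumed format
        if header is None:
            # Take the locus names from the first usable line.
            header = ["genome", "ST", *(m.group("gene") for m in matches)]
        rows.append([genome, st, *(m.group("allele") for m in matches)])
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header or ["genome", "ST"])
    writer.writerows(rows)
    return out.getvalue()


# Made-up example input in the assumed format.
example = (
    "genome1.fna.gz\tcampylobacter\t21\taspA(2)\tglnA(1)\tgltA(1)\t"
    "glyA(3)\tpgm(2)\ttkt(1)\tuncA(5)\n"
    "genome2.fna.gz\tcampylobacter\t-\taspA(2)\tglnA(1)\tgltA(~1)\t"
    "glyA(3)\tpgm(2)\ttkt(1)\tuncA(-)\n"
)
table = mlst_to_profile_table(example)
print(table)
```

Uncertain or missing calls (such as ~1 or -) are passed through unchanged, so they may need to be filtered or recoded before import.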