Processing of whole plastid genomes, alignment, and partitioning for phylogenomic analysis - barrettlab/2021-Genomics-bootcamp GitHub Wiki

Annotate each complete plastome with [GEseq] (https://chlorobox.mpimp-golm.mpg.de/geseq.html). Use the closest available GenBank sequence as the reference for annotation. Be sure to annotate the IR. Choose linear, so you can make a single figure of all the annotations together.
Download the GenBank files from GEseq, and import them into Geneious.
Download all available plastomes from e.g. Asteraceae plus some e.g. non-Asteraceae outgroups from NCBI GenBank. Import them into Geneious.
Select all the plastomes (GenBank files, plus your new annotations).

NOTE: you need to remove one copy of the Inverted Repeat (IR) here, because those sequences in the IR will be duplicated! You can do this manually: if the IR is annotated, then just select and delete in Geneious. If not, then select all sequences after the start of the full ycf1 CDS and delete.

You here you can also select all, and under Edit > Batch Rename you can modify the names for each accession. Replace the name of the sequence with 'Organism' and 'Accession'. You will notice a bunch of duplicate accessions with numbers starting with 'NC_XXXXXXX'. You will need to remove these if they duplicate those that already have GenBank accession #s.

Now, click the annotation tab, and for Type, choose CDS (coding sequences). For Columns, choose: Sequence name, Name, and Sequence. Select all, and then press 'extract.' Be sure to uncheck the box asking to modify the names with extraction. Export this as a .csv file.
Open the .csv document, and make a new column in the first column, make one between the sequence name/taxon name and again between the taxon name and sequence.
For the first column, insert a ‘>’ and drag all the way down (or use ctrl+D). Between the sequence/taxon names, insert something like ‘@’ and between the taxon name/sequence insert ‘#’. You’ll use these to search and replace later.

Each sequence should be on a single line. It should look like:

>accD@Heliopsis_helianthoides#ATGCCGACAGCA etc.

Copy and paste everything into a text file and replace the tab characters (\t) with nothing.
Check to make sure Geneious didn’t truncate the longer sequences (if it did, you’ll see ‘…’ at the end of your sequences. Not sure why it does that, but sometimes it cuts everything off at the end after a certain length. You’ll need to fix that if it happens so you have the full sequences).
Set up a spreadsheet of commands, to run sequentially. You can drag the gene names down, as these will be the only thing that changes, along with your output filenames. Basically, you are going to use grep in UNIX to capture each CDS from all taxa and create a new single locus file for each CDS.

grep -e '	accD	Hex_genes_all.txt  >	by_gene	accD	.fasta		grep -e 'accD' Hex_genes_all.txt  > by_gene/accD.fasta	
grep -e '	atpA	Hex_genes_all.txt  >	by_gene	atpA	.fasta		grep -e 'atpA' Hex_genes_all.txt  > by_gene/atpA.fasta	
grep -e '	atpB	Hex_genes_all.txt  >	by_gene	atpB	.fasta		grep -e 'atpB' Hex_genes_all.txt  > by_gene/atpB.fasta	
grep -e '	atpE	Hex_genes_all.txt  >	by_gene	atpE	.fasta		grep -e 'atpE' Hex_genes_all.txt  > by_gene/atpE.fasta	
grep -e '	atpF	Hex_genes_all.txt  >	by_gene	atpF	.fasta		grep -e 'atpF' Hex_genes_all.txt  > by_gene/atpF.fasta

Use an excel formula to get the full commands, e.g.:

=A1&" "&B1&" "&C1&" "&D1&" "&">"&" "&B1&E1&" "&";"

Then drag down until you have the full command for each CDS. You can also specify output to a new folder in this set of commands. E.g. mkdir by_gene and then in each command for the redirect, specify ‘by_gene/accD.fasta’ (see above). End each grep command with a ‘;’ to keep them separate.

*** Some problems may occur with rpl2, rpl20, and rpl22. You can fix this by including the @ in your command for rpl2 (rpl2@), to make sure it only captures that gene (otherwise grep will grab all 3 genes)

Copy and paste all the commands into the terminal. It should run quickly.
Check the output fasta files to make sure grep worked correctly for each CDS, especially rpl2.
Now you should have a separate file for each CDS, and you can align them with a series of commands just like you did with grep. But first, you need to search and replace to get rid of @ and #. You can open all the files in something like Notepad++ or just use a UNIX command (e.g. sed) to replace:

‘>accD@’ with ‘>’ [Get rid of the gene names from each line] Use a wildcard/regex: replace ‘>.*@’ with ‘>’  you can search and replace in all files if they are all open. Then, replace ‘#’ with a newline ‘\n’.

Or, in UNIX terminal, use sed:

sed ‘s/>.*@//g’ < accD.fasta | sed ‘s/#/\n/g’ > accD_edited.fasta

Mkdir edited_fastas
output=edited_fastas
for file in *.fasta
do
sed ‘s/>.*@/>/g’ < ${file} | sed ‘s/#/\n/g’ > ${output} 
done

Now, open one of the modified fastas, and you should see:

>Heliopsis_helianthoides
ATGCCGACAGCAGGACATTTGGTCGCATTTACT etc.

Alignment with MACSE: Again, set this up in a spreadsheet. We will use MACSE which is a codon-aware aligner.

macse_v1.2.jar -prog alignSequences -seq  accD_edited.fasta -gc_def 11 ;
macse_v1.2.jar -prog alignSequences -seq  atpA_edited.fasta -gc_def 11 ;
macse_v1.2.jar -prog alignSequences -seq  atpB_edited.fasta -gc_def 11 ;
macse_v1.2.jar -prog alignSequences -seq  atpE_edited.fasta -gc_def 11 ;
etc.

This will take a while.

My suggestion is to run these commands 20 at a time (i.e. using 20 cores on the server), and then wait until all alignments finish. Then, run the next 20, etc. ycf1 and ycf2 will take the longest, because they are the longest CDS.

Use TriFusion or SequenceMatrix to concatenate the alignments. (Available for Windows/Mac/Linux). You want to have a concatenated NEXUS file with a PARTITIONS or SETS block at the end.

begin sets;
  CHARSET COI=1-688;
  CHARSET 16S=689-1250;
  CHARSET morph=1251-1322;

[Etc.]…

END;

However, this only partitions by gene or locus, so if you want to partition by dataset (e.g. gene) AND codon position, here is the format.

  CHARSET COIpos1=1-688\3;
  CHARSET COIpos2=2-688\3;
  CHARSET COIpos3=3-688\3;

Etc.