III. Assembling plastid genomes - barrettlab/2021-Genomics-bootcamp GitHub Wiki

Plastid genome assembly with Fast-Plast, NOVOPlasty, and GetOrganelle

Links to GitHub pages:

1. Seed and assemble: de novo plastome assembly with NOVOPlasty

NOVOPlasty is a Perl script that requires at least three inputs:

i. Your reads (either R1.fastq + R2.fastq, or a "shuffled," i.e. interleaved, R1R2.fastq file)
ii. A "seed," for example a closely related rbcL sequence from GenBank
iii. A configuration file. This is a manually edited text file that points to your reads and the seed and supplies the run parameters.
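
Because a missing or empty input is an easy mistake to make, it can help to check the files before starting a run. The following is a minimal sketch, not part of NOVOPlasty; the function name `check_inputs` is hypothetical, and the file names you pass are whatever your own config refers to.

```shell
# Hypothetical helper (not part of NOVOPlasty): report missing or empty input files.
# Returns 0 if every file exists and is non-empty, 1 otherwise.
check_inputs() {
    local f missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then            # -s: file exists and has a size > 0
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_inputs config.txt seed.fasta R1.fq.gz R2.fq.gz` prints nothing and returns 0 when all four files are in place.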

Here is an example config file (e.g. config.txt), set up for a mitochondrial genome:

Project:
-----------------------
Project name          = Mitochondrial_genome_assembly_01
Type                  = mito
Genome Range          = 12000-20500
K-mer                 = 39
Max memory            = 11
Extended log          =
Save assembled reads  =
Seed Input            = seed.fasta
Reference sequence    =
Variance detection    = 
Chloroplast sequence  =

Dataset 1:
-----------------------
Read Length           = 151
Insert size           = 412
Platform              = illumina
Single/Paired         = PE
Combined reads        = 
Forward reads         = R1.fq.gz
Reverse reads         = R2.fq.gz

Heteroplasmy:
-----------------------
MAF                   = 
HP exclude list       = 
PCR-free              = 

Optional:
-----------------------
Insert size auto      = yes
Insert Range          = 1.9
Insert Range strict   = 1.3
Use Quality Scores    = no

And here is a config file set up for a plastome (note Type = chloro and the wider Genome Range):

Project:
-----------------------
Project name          = Hexalectris_warnockii_plastome
Type                  = chloro
Genome Range          = 50000-160000
K-mer                 = 41
Max memory            = 
Extended log          =
Save assembled reads  =
Seed Input            = rps14_seed.fasta
Reference sequence    =
Variance detection    = 
Chloroplast sequence  =

Dataset 1:
-----------------------
Read Length           = 101
Insert size           = 412
Platform              = illumina
Single/Paired         = PE
Combined reads        = 
Forward reads         = Hexalectris_warnockii_R1.fq.gz
Reverse reads         = Hexalectris_warnockii_R2.fq.gz

Heteroplasmy:
-----------------------
MAF                   = 
HP exclude list       = 
PCR-free              = 

Optional:
-----------------------
Insert size auto      = yes
Insert Range          = 1.9
Insert Range strict   = 1.3
Use Quality Scores    = no
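
The Read Length field in the configs above should match your own data. If you are unsure what it is, one way to check is to tabulate the sequence lengths in the first few thousand reads. This is a sketch under assumptions: the function name `read_length_summary` is hypothetical, and it expects a gzipped, four-line-per-record FASTQ file.

```shell
# Hypothetical helper: tabulate read lengths in the first N reads of a gzipped FASTQ.
# Prints "count length" pairs, most common length first.
read_length_summary() {
    local fastq_gz=$1 n_reads=${2:-10000}
    gzip -cd "$fastq_gz" \
        | head -n "$((n_reads * 4))" \
        | awk 'NR % 4 == 2 { print length($0) }' \
        | sort -n | uniq -c | sort -k1,1nr
}
```

Running `read_length_summary Hexalectris_warnockii_R1.fq.gz` lists each observed length with its count; the most common length is usually the value to enter, though trimmed reads will show a spread of shorter lengths as well.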

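To assemble several samples in one run, list them in a batch file (e.g. batch_file.txt). Each sample takes four lines: project name, seed, forward reads, reverse reads.

Project1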
/path/to/seed_file/Seed1.fasta
/path/to/reads/reads_1a.fastq
/path/to/reads/reads_2a.fastq
Project2
/path/to/seed_file/Seed2.fasta
/path/to/reads/reads_1b.fastq
/path/to/reads/reads_2b.fastq
Project3
/path/to/seed_file/Seed3.fasta
/path/to/reads/reads_1c.fastq
/path/to/reads/reads_2c.fastq
# etc...
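
Writing the batch file by hand gets tedious with many samples. The layout above (project name, seed, forward reads, reverse reads, repeated per sample) can be generated with a short loop. This is only a sketch: the function name `make_batch_file` is hypothetical, and it assumes that all samples share one seed and that reads are named `<sample>_R1.fq.gz` and `<sample>_R2.fq.gz` in a single directory.

```shell
# Hypothetical helper: print a NOVOPlasty batch file for every *_R1.fq.gz / *_R2.fq.gz pair.
# Assumes one shared seed and the <sample>_R1.fq.gz / <sample>_R2.fq.gz naming scheme.
make_batch_file() {
    local reads_dir=$1 seed=$2 r1 r2 sample
    for r1 in "$reads_dir"/*_R1.fq.gz; do
        [ -e "$r1" ] || continue                  # no matches: the glob stays literal
        r2=${r1%_R1.fq.gz}_R2.fq.gz               # derive the mate's file name
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: no matching R2 file" >&2
            continue
        fi
        sample=$(basename "$r1" _R1.fq.gz)        # sample name doubles as project name
        printf '%s\n%s\n%s\n%s\n' "$sample" "$seed" "$r1" "$r2"
    done
}
```

For example, `make_batch_file /path/to/reads /path/to/seed_file/Seed1.fasta > batch_file.txt`. Passing absolute paths keeps the batch file usable from any working directory.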

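Then point the config file at the batch file: set Project name to batch:/path/to/batch_file.txt, and set Seed Input, Forward reads, and Reverse reads to batch.

Project: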
-----------------------
Project name          = batch:/path/to/batch_file.txt
Type                  = mito
Genome Range          = 12000-20500
K-mer                 = 39
Max memory            = 11
Extended log          =
Save assembled reads  =
Seed Input            = batch
Reference sequence    =
Variance detection    = 
Chloroplast sequence  =

Dataset 1:
-----------------------
Read Length           = 151
Insert size           = 412
Platform              = illumina
Single/Paired         = PE
Combined reads        = 
Forward reads         = batch
Reverse reads         = batch

Heteroplasmy:
-----------------------
MAF                   = 
HP exclude list       = 
PCR-free              = 

Optional:
-----------------------
Insert size auto      = yes
Insert Range          = 1.9
Insert Range strict   = 1.3
Use Quality Scores    = no

To run NOVOPlasty, make sure your batch file, config file, seed, and reads are in the same directory, or specify their absolute paths.

Then simply type the line below. Note: you may need to type 'perl' before the script name.

NOVOPlasty4.3.1.pl -c config.txt

2. Assemble, orient, and verify whole plastid genome sequences with Fast-Plast

3. Bait, map, and de novo assemble whole plastid genome sequences with GetOrganelle

Running GetOrganelle

General command format

get_organelle_from_reads.py -1 SRR5602600_1.fastq.gz -2 SRR5602600_2.fastq.gz -o SRR5602600-plastome -R 15 -F embplant_pt

A fuller run, with each option explained:

-s C_mac_mac_reference_F2095.fasta  # seed file: a plastome in FASTA format, from GenBank
-1 C_mac_mac_0161e_Ouray_CO/C_mac_mac_0161e_Ouray_CO_S22_L001_R1_001.fastq.gz  # read 1
-2 C_mac_mac_0161e_Ouray_CO/C_mac_mac_0161e_Ouray_CO_S22_L001_R2_001.fastq.gz  # read 2
-o get_organelle_output/C_mac_mac_0161e_Ouray_CO_t16_seeded  # specify a new output directory
-t 20  # run on 20 cores; speeds up read mapping and (maybe) the SPAdes assembly with multiple k-mer lengths
-R 25  # the maximum number of extension iterations
-k 21,31,45,65,85  # k-mer sizes for the SPAdes assemblies
-F embplant_pt  # use the embryophyte plastome database

FULL COMMAND

This also uses 'nohup' and '&' so that the job runs in the background and keeps running after you log out.

nohup get_organelle_from_reads.py -1 C_mac_mac_0161e_Ouray_CO/C_mac_mac_0161e_Ouray_CO_S22_L001_R1_001.fastq.gz -2 C_mac_mac_0161e_Ouray_CO/C_mac_mac_0161e_Ouray_CO_S22_L001_R2_001.fastq.gz -o get_organelle_output/C_mac_mac_0161e_Ouray_CO_t16_seeded -s C_mac_mac_reference_F2095.fasta -t 20 -R 25 -k 21,31,45,65,85 -F embplant_pt &
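
With many samples, the same command can be built in a loop. The sketch below only prints the commands so they can be reviewed before anything runs. The function name `print_getorganelle_commands` is hypothetical; it assumes one directory per sample containing a `*_R1_001.fastq.gz` / `*_R2_001.fastq.gz` pair (as in the example above), paths without spaces, and the same seed and options for every sample.

```shell
# Hypothetical helper: print one GetOrganelle command per sample directory (dry run).
# Usage: print_getorganelle_commands seed.fasta sample_dir1 sample_dir2 ...
print_getorganelle_commands() {
    local seed=$1 dir r1 r2 sample
    shift
    for dir in "$@"; do
        for r1 in "$dir"/*_R1_001.fastq.gz; do
            [ -e "$r1" ] || continue                      # no reads found in this directory
            r2=${r1%_R1_001.fastq.gz}_R2_001.fastq.gz     # matching reverse reads
            [ -e "$r2" ] || continue
            sample=$(basename "$dir")
            echo get_organelle_from_reads.py -1 "$r1" -2 "$r2" \
                -o "get_organelle_output/$sample" -s "$seed" \
                -t 20 -R 25 -k 21,31,45,65,85 -F embplant_pt
        done
    done
}
```

For example, `print_getorganelle_commands C_mac_mac_reference_F2095.fasta C_mac_mac_*/` lists the commands; once they look right, redirect them to a file and run that file with sh under nohup, as with the single-sample command.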

For a quick & dirty run (not recommended, only for exploratory purposes)

get_organelle_from_reads.py -1 C_mac_mac_0161e_Ouray_CO/C_mac_mac_0161e_Ouray_CO_S22_L001_R1_001.fastq.gz -2 C_mac_mac_0161e_Ouray_CO/C_mac_mac_0161e_Ouray_CO_S22_L001_R2_001.fastq.gz -o get_organelle_output/C_mac_mac_0161e_Ouray_CO_t16_seeded --fast -k 21,55,85 -F embplant_pt

GetOrganelle writes long, uninformative FASTA headers, e.g.

>6736872_6685368_6749650-,6684792-,6738466_6736954_6730482_6741986_6733100_6749442_6738164_6728062_6687416_6732826_6739348_6666028_6...
ATTTATAGGATTCAAATAATCAAATAAAATAAAGATAGGCGGGTAATAACCTTATTTATGACAAGTTTCAAATTGGTAAAGTATACCCCTAGGATAAAGAAGAAGAAGGGGCTGAGAAAACTCGCAAGAAAA...

Here is a simple for-loop that uses awk to replace the FASTA header in each file with the file name (minus its extension):

for FILE in *.fasta
do
  # strip the .fa/.fasta extension, then print the file name in place of each header line
  awk '/^>/ {name=FILENAME; sub(/\.fa(sta)?$/,"",name); printf(">%s\n",name); next} {print}' "$FILE" > "changed_${FILE}"
done

Output:

>CHN-Hubei-1980
ATTTATAGGATTCAAATAATCAAATAAAATAAAGATAGGCGGGTAATAACCTTATTTA...
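
Note that the loop gives every header in a file the same name, so a file holding more than one sequence (for example, an assembly left as several contigs) would end up with duplicate headers. A quick count of header lines per file shows which files are affected. This is a minimal sketch; the function name `count_fasta_headers` is hypothetical.

```shell
# Hypothetical helper: print "file<TAB>number_of_header_lines" for each FASTA file given.
count_fasta_headers() {
    local f
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}
```

For example, `count_fasta_headers *.fasta` prints one line per file; any file with a count above 1 needs a different renaming scheme.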

How can I reorient the plastome output of GetOrganelle? Use the ECuADOR.pl script.

First, make a new directory called 'reorient_test_ecuador'

mkdir reorient_test_ecuador

Then move or copy your plastomes there with the mv or cp command.

Run the script:
perl ECuADOR.pl -i reorient_test_ecuador -w 1000 -f fasta -out test --ext fasta --save_regions ALL --orient TRUE --noIRs 2