Assembly Polishing - mel-jo/probable-memory GitHub Wiki
Assembly polishing corrects errors that remain in an assembly, using the DNA short reads as evidence. It is done in two stages: first, the trimmed DNA short reads are mapped onto the genome assemblies produced by Flye and Canu; then the assemblies are polished based on the mapping results.
- BWA - BWA is a fast and memory-efficient tool for aligning short sequencing reads to a reference genome.
- samtools - samtools is a suite of tools for working with SAM/BAM files, the formats commonly used to store read alignments.
- .fasta files of the assemblies generated from Flye and Canu.
- .fastq.gz files of the processed DNA short reads.
Strain | Flye assembly file | Canu assembly file | Processed DNA short reads files |
---|---|---|---|
R7 | SRR24413072_processed_assembly.fasta | canu_SRR24413072.contigs.fasta | SRR24413071_1_paired.fastq.gz<br>SRR24413071_2_paired.fastq.gz |
HP126 | SRR24413066_processed_assembly.fasta | canu_SRR24413066.contigs.fasta | SRR24413065_1_paired.fastq.gz<br>SRR24413065_2_paired.fastq.gz |
DV3 | SRR24413081_processed_assembly.fasta | canu_SRR24413081.contigs.fasta | SRR24413080_1_paired.fastq.gz<br>SRR24413080_2_paired.fastq.gz |
First, each assembly is indexed, which builds a compressed, searchable representation of it. The index lets BWA quickly find where each short read might align during mapping.
Code
#load the necessary modules
module load bioinfo-tools
module load bwa
module load samtools
#do indexing
bwa index assembly.fasta
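Since there are six assemblies to index (Flye and Canu for each of the three strains), the command above can be wrapped in a loop. The following is only a sketch under some assumptions: the file names are taken from the input table and are expected in the current directory, the `DRY_RUN` switch and the helper names `is_indexed`, `index_assembly` and `index_all` are hypothetical, and the skip check relies on the five index files that `bwa index` writes next to the FASTA.

```shell
# Sketch: index each Flye and Canu assembly once, skipping any that
# already have a complete BWA index.
DRY_RUN="${DRY_RUN:-1}"   # assumption: 1 = only print commands, 0 = run them

# Accessions of the assemblies for R7, HP126 and DV3 (from the input table).
ACCESSIONS="SRR24413072 SRR24413066 SRR24413081"

# Succeeds only if all five BWA index files exist next to the FASTA in $1.
is_indexed() {
    local ext
    for ext in amb ann bwt pac sa; do
        [ -f "$1.$ext" ] || return 1
    done
    return 0
}

# Index one assembly, or report why nothing was done.
index_assembly() {
    if is_indexed "$1"; then
        echo "skip $1 (index found)"
    elif [ "$DRY_RUN" = "1" ]; then
        echo "bwa index $1"
    else
        bwa index "$1"
    fi
}

# Walk over both assemblies of every strain.
index_all() {
    local acc
    for acc in $ACCESSIONS; do
        index_assembly "${acc}_processed_assembly.fasta"   # Flye
        index_assembly "canu_${acc}.contigs.fasta"         # Canu
    done
}

index_all
```

With the default `DRY_RUN=1` the script only prints the `bwa index` commands it would run; setting `DRY_RUN=0` after loading the modules runs them.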
Indexing produces five files: .amb, .ann, .bwt, .pac and .sa. BWA uses all of them during alignment. Now the reads can be aligned to the assembly. BWA takes each read pair, compares it against the index, and reports the position on the assembly where the pair most likely originated. The results are saved in a .sam file.
Code
bwa mem assembly.fasta read_1.fastq.gz read_2.fastq.gz > aligned_reads.sam
The resulting .sam file is then converted to a .bam file, which is smaller and faster to process. The .bam file is also sorted and indexed, producing a .bam.bai file. This index lets downstream tools access specific regions of the .bam file quickly, rather than parsing the whole file every time.
Code
samtools view -S -b aligned_reads.sam > aligned_reads.bam
samtools sort aligned_reads.bam -o aligned_reads_sorted.bam
samtools index aligned_reads_sorted.bam
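The alignment, conversion and sorting steps can also be chained with a pipe, so that no intermediate .sam or unsorted .bam file is written to disk. This is a sketch rather than the workflow actually used here: the `map_reads` helper, the `DRY_RUN` switch and the thread count of 8 are assumptions, while `bwa mem -t`, `samtools sort -@`/`-o` and reading from `-` (standard input) are standard options of those tools.

```shell
# Sketch: align, sort and index in one pass, without an intermediate .sam file.
DRY_RUN="${DRY_RUN:-1}"   # assumption: 1 = only print commands, 0 = run them
THREADS="${THREADS:-8}"   # assumption: number of threads for bwa and samtools

# map_reads ASSEMBLY READS_1 READS_2 OUTPUT_PREFIX
# Writes OUTPUT_PREFIX_sorted.bam and OUTPUT_PREFIX_sorted.bam.bai.
map_reads() {
    local assembly r1 r2 out
    assembly="$1"
    r1="$2"
    r2="$3"
    out="${4}_sorted.bam"
    if [ "$DRY_RUN" = "1" ]; then
        echo "bwa mem -t $THREADS $assembly $r1 $r2 | samtools sort -@ $THREADS -o $out -"
        echo "samtools index $out"
    else
        # bwa writes SAM to stdout; samtools sort reads it from stdin ("-")
        # and writes a sorted BAM, which is then indexed.
        bwa mem -t "$THREADS" "$assembly" "$r1" "$r2" \
            | samtools sort -@ "$THREADS" -o "$out" - \
            && samtools index "$out"
    fi
}

# Example: the Flye assembly of strain R7 with its paired short reads.
map_reads SRR24413072_processed_assembly.fasta \
    SRR24413071_1_paired.fastq.gz SRR24413071_2_paired.fastq.gz \
    flye_aligned_reads_SRR24413072
```

The output prefix in the example is chosen so that the resulting file names match the ones listed as Pilon inputs.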
- Pilon - Pilon uses aligned short reads to fix base errors, correct indels, fill gaps and otherwise improve sequence accuracy. It checks how well the reads match the assembly, then makes corrections based on the read consensus.
- .fasta files of the assemblies generated from Flye and Canu.
- .bam and .bam.bai files generated from mapping.
Strain | Flye assembly file | Mapped and indexed files from Flye assemblies | Canu assembly file | Mapped and indexed files from Canu assemblies |
---|---|---|---|---|
R7 | SRR24413072_processed_assembly.fasta | flye_aligned_reads_SRR24413072_sorted.bam<br>flye_aligned_reads_SRR24413072_sorted.bam.bai | canu_SRR24413072.contigs.fasta | canu_aligned_reads_SRR24413072_sorted.bam<br>canu_aligned_reads_SRR24413072_sorted.bam.bai |
HP126 | SRR24413066_processed_assembly.fasta | flye_aligned_reads_SRR24413066_sorted.bam<br>flye_aligned_reads_SRR24413066_sorted.bam.bai | canu_SRR24413066.contigs.fasta | canu_aligned_reads_SRR24413066_sorted.bam<br>canu_aligned_reads_SRR24413066_sorted.bam.bai |
DV3 | SRR24413081_processed_assembly.fasta | flye_aligned_reads_SRR24413081_sorted.bam<br>flye_aligned_reads_SRR24413081_sorted.bam.bai | canu_SRR24413081.contigs.fasta | canu_aligned_reads_SRR24413081_sorted.bam<br>canu_aligned_reads_SRR24413081_sorted.bam.bai |
Pilon compares the assembly against the aligned reads in the .bam file and scans for mismatches, low-confidence regions and similar problems. If enough reads disagree with the assembly at a specific position, Pilon corrects it to the read consensus.
Code
module load bioinfo-tools
module load pilon
java -Xmx16G -jar $PILON_HOME/pilon.jar \
--genome assembly.fasta \
--frags aligned_reads_sorted.bam \
--output polished_assembly \
--threads 8
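The same Pilon call has to be repeated for each strain and each assembler. One way to do that is sketched below, using the file names from the input table. The output prefixes (`flye_<accession>_polished`, `canu_<accession>_polished`), the helper names `polish` and `polish_all`, the `DRY_RUN` switch and the fallback jar path are all assumptions made for illustration; the Pilon options are the ones used in the command above.

```shell
# Sketch: polish all six assemblies with the Pilon command shown above.
DRY_RUN="${DRY_RUN:-1}"   # assumption: 1 = only print commands, 0 = run them
# PILON_HOME is set by the pilon module; the fallback path is a placeholder.
PILON_JAR="${PILON_HOME:-/path/to/pilon}/pilon.jar"

# polish ASSEMBLY SORTED_BAM OUTPUT_PREFIX
# Pilon writes the polished sequence to OUTPUT_PREFIX.fasta.
polish() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "java -Xmx16G -jar $PILON_JAR --genome $1 --frags $2 --output $3 --threads 8"
    else
        java -Xmx16G -jar "$PILON_JAR" \
            --genome "$1" --frags "$2" --output "$3" --threads 8
    fi
}

# Run Pilon on the Flye and Canu assembly of every strain (R7, HP126, DV3).
polish_all() {
    local acc
    for acc in SRR24413072 SRR24413066 SRR24413081; do
        polish "${acc}_processed_assembly.fasta" \
            "flye_aligned_reads_${acc}_sorted.bam" "flye_${acc}_polished"
        polish "canu_${acc}.contigs.fasta" \
            "canu_aligned_reads_${acc}_sorted.bam" "canu_${acc}_polished"
    done
}

polish_all
```

Each .bam file must have its .bam.bai index in the same directory, as produced by the mapping step.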
.fasta files of the polished assemblies.