Assembly Polishing - mel-jo/probable-memory GitHub Wiki

Assembly polishing corrects errors that remain in an assembly, using DNA short reads as evidence. It is done in two stages: first, the trimmed DNA short reads are mapped onto the genome assemblies produced by Flye and Canu; then the assemblies are polished based on the mapping results.

Mapping

Tools used in this step

  • BWA - a fast and memory-efficient tool for aligning short sequencing reads to a reference genome.
  • samtools - a suite of tools for working with SAM/BAM files, the formats commonly used to store read alignments.

Input for this step

  • .fasta files of the assemblies generated from Flye and Canu.
  • .fastq.gz files of the processed DNA short reads.

| Strain | Flye assembly file | Canu assembly file | Processed DNA short reads files |
|---|---|---|---|
| R7 | SRR24413072_processed_assembly.fasta | canu_SRR24413072.contigs.fasta | SRR24413071_1_paired.fastq.gz, SRR24413071_2_paired.fastq.gz |
| HP126 | SRR24413066_processed_assembly.fasta | canu_SRR24413066.contigs.fasta | SRR24413065_1_paired.fastq.gz, SRR24413065_2_paired.fastq.gz |
| DV3 | SRR24413081_processed_assembly.fasta | canu_SRR24413081.contigs.fasta | SRR24413080_1_paired.fastq.gz, SRR24413080_2_paired.fastq.gz |

The Process

First, each assembly is indexed, which builds a compressed, searchable representation of the assembly. The index lets BWA quickly find where each short read might align in the assembly.

Code
#load the necessary modules

module load bioinfo-tools
module load bwa
module load samtools

#do indexing 

bwa index assembly.fasta

This step produces five files with the extensions .amb, .ann, .bwt, .pac and .sa. BWA uses all of them during alignment. Now the reads can be aligned to the assembly: BWA compares each read pair against the index, works out where on the assembly the pair most likely originated, and reports the best match. The results are saved in a .sam file.

Code
bwa mem assembly.fasta read_1.fastq.gz read_2.fastq.gz > aligned_reads.sam
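bwa mem looks for the index files next to the assembly, so a small guard run before the alignment can catch an incomplete or interrupted indexing step early. The following is a minimal sketch; `index_complete` is a hypothetical helper name and not part of BWA.

```shell
# Hypothetical helper: succeed only if all five BWA index files
# (.amb, .ann, .bwt, .pac, .sa) exist next to the given assembly.
index_complete() {
    local assembly="$1" ext
    for ext in amb ann bwt pac sa; do
        if [ ! -f "${assembly}.${ext}" ]; then
            echo "missing index file: ${assembly}.${ext}" >&2
            return 1
        fi
    done
    return 0
}

# Usage: align only when the index is complete.
# index_complete assembly.fasta && \
#   bwa mem assembly.fasta read_1.fastq.gz read_2.fastq.gz > aligned_reads.sam
```

The check only tests that the files exist; it does not verify that they were built from the current version of the assembly.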

The resulting .sam file is then converted to a .bam file, a compressed binary format that is smaller and faster to process. The .bam file is also sorted and indexed into a .bam.bai file. The .bam.bai index lets downstream tools jump directly to a region of the .bam file, rather than parsing through the whole file every time.

Code
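# convert SAM to BAM: -S marks the input as SAM, -b requests BAM output
samtools view -S -b aligned_reads.sam > aligned_reads.bam
# sort the alignments by position on the assembly
samtools sort aligned_reads.bam -o aligned_reads_sorted.bam
# build the index; writes aligned_reads_sorted.bam.bai
samtools index aligned_reads_sorted.bam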
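If the intermediate .sam and unsorted .bam files are not needed, the same steps can be streamed through a pipe so that only the sorted .bam is written to disk. This is a sketch of that alternative, not the workflow used on this page. `map_reads` and its arguments are hypothetical, while `-t` (bwa mem threads), `-@` (samtools threads) and `-` (read from standard input) are standard options of the two tools.

```shell
# Hypothetical wrapper: align a read pair and write a sorted, indexed BAM
# without keeping the intermediate SAM file.
# Arguments: assembly, read 1, read 2, output BAM, threads (default 4).
map_reads() {
    local assembly="$1" r1="$2" r2="$3" out_bam="$4" threads="${5:-4}"
    # bwa mem writes SAM to stdout; samtools sort reads it from stdin ("-")
    bwa mem -t "$threads" "$assembly" "$r1" "$r2" \
        | samtools sort -@ "$threads" -o "$out_bam" - || return 1
    # create <out_bam>.bai next to the sorted BAM
    samtools index "$out_bam"
}

# Usage:
# map_reads assembly.fasta read_1.fastq.gz read_2.fastq.gz aligned_reads_sorted.bam 8
```

By default the shell only reports the exit status of the last command in a pipe, so a failure inside bwa mem can go unnoticed; in bash, running `set -o pipefail` beforehand makes the function fail in that case as well.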

Polishing

Tools used in this step

  • Pilon - Pilon uses aligned short reads to fix base errors, correct indels, fill gaps and otherwise improve sequence accuracy. It works by checking how well the reads match the assembly, then makes corrections based on the read consensus.

Input for this step

  • .fasta files of the assemblies generated from Flye and Canu.
  • .bam and .bam.bai files generated from mapping.

| Strain | Flye assembly file | Mapped index files from Flye assemblies | Canu assembly file | Mapped index files from Canu assemblies |
|---|---|---|---|---|
| R7 | SRR24413072_processed_assembly.fasta | flye_aligned_reads_SRR24413072_sorted.bam, flye_aligned_reads_SRR24413072_sorted.bam.bai | canu_SRR24413072.contigs.fasta | canu_aligned_reads_SRR24413072_sorted.bam, canu_aligned_reads_SRR24413072_sorted.bam.bai |
| HP126 | SRR24413066_processed_assembly.fasta | flye_aligned_reads_SRR24413066_sorted.bam, flye_aligned_reads_SRR24413066_sorted.bam.bai | canu_SRR24413066.contigs.fasta | canu_aligned_reads_SRR24413066_sorted.bam, canu_aligned_reads_SRR24413066_sorted.bam.bai |
| DV3 | SRR24413081_processed_assembly.fasta | flye_aligned_reads_SRR24413081_sorted.bam, flye_aligned_reads_SRR24413081_sorted.bam.bai | canu_SRR24413081.contigs.fasta | canu_aligned_reads_SRR24413081_sorted.bam, canu_aligned_reads_SRR24413081_sorted.bam.bai |
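
The mapped file names listed above follow one pattern: the assembler name, `_aligned_reads_`, the accession of the assembly, and `_sorted.bam`. The sketch below builds those names and checks that both the .bam and its .bam.bai are present before polishing. `bam_for` and `check_mapped` are hypothetical helper names, and the files are assumed to sit in the current directory unless another directory is given.

```shell
# Hypothetical helper: build the sorted BAM name used in the table above,
# e.g. bam_for flye SRR24413072 -> flye_aligned_reads_SRR24413072_sorted.bam
bam_for() {
    echo "${1}_aligned_reads_${2}_sorted.bam"
}

# Hypothetical helper: succeed only if the BAM and its index both exist
# in the given directory (third argument, default: current directory).
check_mapped() {
    local bam
    bam="${3:-.}/$(bam_for "$1" "$2")"
    [ -f "$bam" ] && [ -f "${bam}.bai" ]
}

# Report the status of every assembler/accession combination from the table.
for accession in SRR24413072 SRR24413066 SRR24413081; do
    for assembler in flye canu; do
        if check_mapped "$assembler" "$accession"; then
            echo "ok       $(bam_for "$assembler" "$accession")"
        else
            echo "missing  $(bam_for "$assembler" "$accession")"
        fi
    done
done
```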

The Process

Pilon reads the assembly and the aligned reads in the .bam file, and scans for mismatches, low-confidence regions and similar problems. If a lot of the reads covering a position disagree with the assembly at that position, Pilon corrects it.

Code
module load bioinfo-tools
module load pilon

java -Xmx16G -jar $PILON_HOME/pilon.jar \
  --genome assembly.fasta \
  --frags aligned_reads_sorted.bam \
  --output polished_assembly \
  --threads 8
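
To see how many corrections a run made, Pilon can be given its `--changes` option, which writes a list of the edits to `<output>.changes`. The sketch below assumes the run above was repeated with `--changes` added and that the file holds one edit per line; `count_changes` is a hypothetical helper.

```shell
# Hypothetical helper: print how many edits Pilon recorded.
# Assumes Pilon was run with --changes, so <prefix>.changes exists.
count_changes() {
    local prefix="$1"
    if [ ! -f "${prefix}.changes" ]; then
        echo "no ${prefix}.changes file; was --changes passed to Pilon?" >&2
        return 1
    fi
    # one line per change; strip the padding some wc implementations add
    wc -l < "${prefix}.changes" | tr -d ' '
}

# Usage, after a Pilon run with --output polished_assembly --changes:
# count_changes polished_assembly
```

A count that stays high may indicate that a further polishing round is worth trying, while a count of zero means the run left the assembly unchanged.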

Output in this step

.fasta file of the polished assemblies. With the --output prefix used above, Pilon writes polished_assembly.fasta.
