2. DNA Assembly - Sara-SL/GenomeAnalysis GitHub Wiki

Method

SOAPdenovo

When I was satisfied with the quality of my data I assembled the data using the software SOAPdenovo. To do this I fist created a config file and then a batch-job since the running time was estimated to 1,5h. This was done using the following commands:

mkdir 2_genome_assembly

touch SOAPdenovo.config

nano SOAPdenovo.config

cd ../Scripts/

touch SOAPdenovo_batch_job.sh

nano SOAPdenovo_batch_job.sh

sbatch SOAPdenovo_batch_job.sh

When writing the config file I followed the SOAPdenovo manual example, example.config. Each library section starts with the tag [LIB] and includes the following items:

avg_ins
reverse_seq
asm_flags
rd_len_cutof
rank
pair_num_cutoff
map_len

Library bat_175, bat_400 and bat_800 are paired-end reads while library bat_2to4kb, bat_5to6kb and bat_8to10kb are mate-pair reads. Therefore these libraries had a bit different parameter values. The global parameter max_rd_len was set to 100. The K-mer parameter was set 49 - same as in the paper.

Evaluation of SOAPdenovo

To evaluate the results from SOAPdenovo I run Nucmer and MUMmerplot. The scaffold for my dataset had already been uploaded on UPPMAX so I copied and unzipped it, and used it as input reference file in Nucmer. As input query file I used the .conigt SOAPdenovo output file for some runs and the .scafSeq SOAPdenovo output file for other runs. I run Nucmer with the command:

module load bioinfo-tools

module load MUMmer

nucmer -p Nucmer_contig /home/sarasl/git/GenomeAnalysis/Data/scaffold/sel4_NW_015503979.fna /home/sarasl/git/GenomeAnalysis/2_genome_assembly/SOAPdenovo_output_K49.contig

After I had run Nucmer I used the output file as input file in MUMmerplot. I created a batch script containing the following command:

mummerplot -p MUMmerplot_contig --png /home/sarasl/git/GenomeAnalysis/2_genome_assembly/Evaluation/Nucmer_contig.delta

Since this graph didn't give a good result I tried to remove short alignments by running delta-filter before MUMmerplot. I tried this with both parameter -l 100 and -l 1000 but it didn't get any better. Then I tried to switch to the .scafSeq sequence, run Nucmer, delta-filter with parameter -l 1000 and then MUMmerplot but still no improvement. Last I tried to run MUMmerplot with the parameters --layout and --filter and I finally got a good result.

delta-filter -l 100 /home/sarasl/git/GenomeAnalysis/2_genome_assembly/Evaluation/MUMmerplot_scafSeq_1kb/Nucmer_scafSeq.delta > /home/sarasl/git/GenomeAnalysis/2_genome_assembly/Evaluation/MUMmerplot_scafSeq_100bp_layout_filter/delta-filter_scafSeq_100bp_layout_filter.delta

mummerplot -p MUMmerplot_scafSeq_100bp_layout_filter --png --layout --filter /home/sarasl/git/GenomeAnalysis/2_genome_assembly/Evaluation/MUMmerplot_scafSeq_100bp_layout_filter/delta-filter_scafSeq_100bp_layout_filter.delta

Remove short contigs

Assemblers usually include short contigs in their output, but since short contigs will make all downstream processes slower and doesn't provide much information I wanted to remove them before continuing my analysis. I did this by writing a python script using BioPython. I found a script on BioPythons homepage which I used as a template and modified it so that it only cleaned the data of contigs shorter than 1kb. I run the python script for both the .contig and .scafSeq file and checked the result by running Nucmer and MUMmerplot again.

Result

Figure 1: Plot from first MUMmerplot run with .contig file

MUMmer_contig

Figure 2: Plot from MUMmerplot run with .contig file, only alignments longer than 100bp

MUMmer_contig_100bp

Figure 3: Plot from MUMmerplot run with .contig file, only alignments longer than 1kb

MUMmer_contig_1kb

Figure 4: Plot from MUMmerplot run with .scafSeq file, only alignments longer than 1kb

MUMmer_scafSeq_1kb

Figure 5: Plot from MUMmerplot run with .scafSeq file and layout and filter option

MUMmer_scafSeq_100_layout_filter

Figure 6: Plot from MUMmerplot with cleaned .config file

MUMmer_contig_cleaned

Figure 7: Plot from MUMmerplot with cleaned .scafSeq file

MUMmer_scafSeq_cleaned

Discussion

At the time that the paper was written, there were only high coverage genomes for 3 bat species and low coverage genomes for 2 bat species available and the last common ancestor to the bat species that they were looking at was 43 million years ago. Because of this, the researches didin't have a references genome to compare their assembly against. Instead they looked at different measures like total scaffold length, longest scaffold, N50, NG50, %N, and Heterozygosity etc. From these different measures they concluded that their genome assemble was a reliable substrate for subsequent genomic analyses.

In this studie I have used their genome as a reference and assumed that if my assembly look similar to the assembly from the paper, then the quality should be good enough. When the MUMmerplot show a continuous linear graph with slope=1, it means that the assembly and the reference genome look the same which in turn show that the assembly is good. A good assembly means that the reads have been puzzled correctly forming a continuous sequence representing the true genome (assuming that the reference is the true genome sequence). Form the result I could see that only the MUMmerplot from the assembly with the cleaned .contig file gave a linear graph with slope=1 (figure 6). In figure 7, with the cleaned .scafSeq file, I could see that I had many scaffolds that didn't align to my reference. Therefore I decided to continue my analysis with the cleaned .contig file.