2. DNA Assembly - Sara-SL/GenomeAnalysis GitHub Wiki
Method
SOAPdenovo
When I was satisfied with the quality of my data I assembled the data using the software SOAPdenovo. To do this I fist created a config file and then a batch-job since the running time was estimated to 1,5h. This was done using the following commands:
mkdir 2_genome_assembly
touch SOAPdenovo.config
nano SOAPdenovo.config
cd ../Scripts/
touch SOAPdenovo_batch_job.sh
nano SOAPdenovo_batch_job.sh
sbatch SOAPdenovo_batch_job.sh
When writing the config file I followed the SOAPdenovo manual example, example.config. Each library section starts with the tag [LIB] and includes the following items:
- avg_ins
- reverse_seq
- asm_flags
- rd_len_cutof
- rank
- pair_num_cutoff
- map_len
Library bat_175, bat_400 and bat_800 are paired-end reads while library bat_2to4kb, bat_5to6kb and bat_8to10kb are mate-pair reads. Therefore these libraries had a bit different parameter values. The global parameter max_rd_len was set to 100. The K-mer parameter was set 49 - same as in the paper.
Evaluation of SOAPdenovo
To evaluate the results from SOAPdenovo I run Nucmer and MUMmerplot. The scaffold for my dataset had already been uploaded on UPPMAX so I copied and unzipped it, and used it as input reference file in Nucmer. As input query file I used the .conigt SOAPdenovo output file for some runs and the .scafSeq SOAPdenovo output file for other runs. I run Nucmer with the command:
module load bioinfo-tools
module load MUMmer
nucmer -p Nucmer_contig /home/sarasl/git/GenomeAnalysis/Data/scaffold/sel4_NW_015503979.fna /home/sarasl/git/GenomeAnalysis/2_genome_assembly/SOAPdenovo_output_K49.contig
After I had run Nucmer I used the output file as input file in MUMmerplot. I created a batch script containing the following command:
mummerplot -p MUMmerplot_contig --png /home/sarasl/git/GenomeAnalysis/2_genome_assembly/Evaluation/Nucmer_contig.delta
Since this graph didn't give a good result I tried to remove short alignments by running delta-filter before MUMmerplot. I tried this with both parameter -l 100 and -l 1000 but it didn't get any better. Then I tried to switch to the .scafSeq sequence, run Nucmer, delta-filter with parameter -l 1000 and then MUMmerplot but still no improvement. Last I tried to run MUMmerplot with the parameters --layout and --filter and I finally got a good result.
delta-filter -l 100 /home/sarasl/git/GenomeAnalysis/2_genome_assembly/Evaluation/MUMmerplot_scafSeq_1kb/Nucmer_scafSeq.delta > /home/sarasl/git/GenomeAnalysis/2_genome_assembly/Evaluation/MUMmerplot_scafSeq_100bp_layout_filter/delta-filter_scafSeq_100bp_layout_filter.delta
mummerplot -p MUMmerplot_scafSeq_100bp_layout_filter --png --layout --filter /home/sarasl/git/GenomeAnalysis/2_genome_assembly/Evaluation/MUMmerplot_scafSeq_100bp_layout_filter/delta-filter_scafSeq_100bp_layout_filter.delta
Remove short contigs
Assemblers usually include short contigs in their output, but since short contigs will make all downstream processes slower and doesn't provide much information I wanted to remove them before continuing my analysis. I did this by writing a python script using BioPython. I found a script on BioPythons homepage which I used as a template and modified it so that it only cleaned the data of contigs shorter than 1kb. I run the python script for both the .contig and .scafSeq file and checked the result by running Nucmer and MUMmerplot again.
Result
Figure 1: Plot from first MUMmerplot run with .contig file
Figure 2: Plot from MUMmerplot run with .contig file, only alignments longer than 100bp
Figure 3: Plot from MUMmerplot run with .contig file, only alignments longer than 1kb
Figure 4: Plot from MUMmerplot run with .scafSeq file, only alignments longer than 1kb
Figure 5: Plot from MUMmerplot run with .scafSeq file and layout and filter option
Figure 6: Plot from MUMmerplot with cleaned .config file
Figure 7: Plot from MUMmerplot with cleaned .scafSeq file
Discussion
At the time that the paper was written, there were only high coverage genomes for 3 bat species and low coverage genomes for 2 bat species available and the last common ancestor to the bat species that they were looking at was 43 million years ago. Because of this, the researches didin't have a references genome to compare their assembly against. Instead they looked at different measures like total scaffold length, longest scaffold, N50, NG50, %N, and Heterozygosity etc. From these different measures they concluded that their genome assemble was a reliable substrate for subsequent genomic analyses.
In this studie I have used their genome as a reference and assumed that if my assembly look similar to the assembly from the paper, then the quality should be good enough. When the MUMmerplot show a continuous linear graph with slope=1, it means that the assembly and the reference genome look the same which in turn show that the assembly is good. A good assembly means that the reads have been puzzled correctly forming a continuous sequence representing the true genome (assuming that the reference is the true genome sequence). Form the result I could see that only the MUMmerplot from the assembly with the cleaned .contig file gave a linear graph with slope=1 (figure 6). In figure 7, with the cleaned .scafSeq file, I could see that I had many scaffolds that didn't align to my reference. Therefore I decided to continue my analysis with the cleaned .contig file.