9. Christmas Time! - mai0/Project_BB2491 GitHub Wiki

Running Abyss on Uppmax

"There is nothing permanent except change." (Heraclitus)

Today I started writing a small script to run Abyss:

  • 1st attempt
#!/bin/bash -l

#SBATCH -A g2016025
#SBATCH -p node
#SBATCH -n 8
#SBATCH -t 1:00:00
#SBATCH -J sprucecp_assembly_abyss

module load bioinfo-tools
module load abyss/1.3.7
module load bowtie

abyss-pe k=64 name=spruce lib='reads' reads=' proj/g2016025/Group1/data/z4006c01.g.ipe.1.fq /proj/g2016025/Group1/data/z4006c01.g.ipe.2.fq' 

My output was: Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated.

I didn't understand the cause of that problem, so I continued by altering some parts of the script and experimenting!!

  • 2nd attempt
#!/bin/bash -l

#SBATCH -A g2016025
#SBATCH -p node
#SBATCH -n 8
#SBATCH -t 1:00:00
#SBATCH -J sprucecp_assembly_abyss_2

module load bioinfo-tools
module load abyss/1.3.7
module load bowtie

abyss-pe k=64 name=spruce_2 in='proj/g2016025/Group1/data/z4006c01.g.ipe.1.fq /proj/g2016025/Group1/data/z4006c01.g.ipe.2.fq'''

The job was cancelled again!

  • 3rd attempt, this time with k=31 (based on the Uppmax documentation):
#!/bin/bash -l

#SBATCH -A g2016025
#SBATCH -p node
#SBATCH -n 4
#SBATCH -t 13:00:00
#SBATCH -J sprucecp_assembly_abyss_5

module load gcc openmpi
module add bioinfo-tools abyss/1.9.0

#cd /proj/g2016025/Group1/data

# Commands
abyss-pe np=$SLURM_NPROCS name=chloro k=31 in='/proj/g2016025/Group1/data/z4006c01.g.ipe.1.fq /proj/g2016025/Group1/data/z4006c01.g.ipe.2.fq'

This time the assembler ran normally and finished without problems! I realized that my main problem was the time limit: in the first two attempts I had requested only 1 hour, which was much shorter than the job needed. The run took around 8 hours!
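To make the time issue concrete, here is a minimal sketch that compares the run time of about 8 hours with the two limits requested in the scripts above (`-t 1:00:00` and `-t 13:00:00`). The function names `to_seconds` and `fits_limit` are made up for this illustration, and the script only does the arithmetic; it does not ask Slurm anything.

```shell
#!/bin/sh
# Convert an H:MM:SS wall-time string (the form used with "#SBATCH -t") to seconds.
# Assumption: no day field is present (Slurm also accepts D-HH:MM:SS, not handled here).
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed if the observed run time ($1) fits within the requested limit ($2).
fits_limit() {
    [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}

observed="8:00:00"   # roughly how long the k=31 run took

for limit in 1:00:00 13:00:00; do
    if fits_limit "$observed" "$limit"; then
        echo "-t $limit: enough for a run of about $observed"
    else
        echo "-t $limit: too short for a run of about $observed"
    fi
done
```

With these numbers the 1 hour limit is reported as too short and the 13 hour limit as enough, which matches what I saw.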

Comments:

  • pe: the reads are paired-end

  • name: the prefix used for the names of the output files

  • k=31: the k-mer size used to build the de Bruijn graph is 31

  • in: the input files containing the reads. The read names end in /1 and /2, marking the forward and reverse reads that belong to the same fragment.

  • Output files from Abyss (with help from https://github.com/bcgsc/abyss/wiki/ABySS-File-Formats):

    • name-contigs.fa Final contigs (FASTA format)
    • name-contigs.dot Contig overlap graph (Graphviz format)
    • name-bubbles.fa Equal-length variant sequences (FASTA format)
    • name-indel.fa Different-length variant sequences (FASTA format)
    • .dist Estimates of the distance between contigs
    • .path List of merged contigs
    • .hist Fragment-size histogram of a library
    • coverage.hist K-mer coverage histogram
    • -stats Statistics (e.g. N50) for the unitigs, contigs and scaffolds
  • So the working directory now contains the following files (the Abyss output together with my scripts and the reference):
    chloro-2.dot1 chloro-4.fa chloro-6.dot chloro-8.f chloro-stats.md
    abyss_k64 chloro-2.fa chloro-4.fa.fai chloro-6.f chloro-bubbles.fa chloro-stats.tab
    Abyss_try1_sbatch.sh chloro-2.path chloro-4.path1 chloro-6.hist chloro-contigs.dot chloro-unitigs.fa
    Abyss_try2_sbatch.sh chloro-3.dist chloro-4.path2 chloro-6.path chloro-contigs.fa coverage.hist
    Abyss_try3_sbatch.sh chloro-3.dot chloro-4.path3 chloro-6.path.dot chloro-indel.fa
    chloro-1.dot chloro-3.fa chloro-5.dot chloro-7.dot chloro-scaffolds.dot reference.fasta
    chloro-1.fa chloro-3.fa.fai chloro-5.fa chloro-7.fa chloro-scaffolds.fa
    chloro-1.path chloro-3.hist chloro-5.path chloro-7.path chloro-stats
    chloro-2.dot chloro-4.dot chloro-6.dist.dot chloro-8.dot chloro-stats.csv

  • The statistics for k=31:

    n        n.500  L50  min  N80  N50   N20   E.size  max    sum      name
    2266570  859    306  500  569  736   1126  923     4189   657047   chloro-unitigs.fa
    2262606  1312   336  500  654  1065  2576  2027    15765  1357433  chloro-contigs.fa
    2262343  1279   308  500  681  1201  2820  2410    18833  1419923  chloro-scaffolds.fa

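  • Notes (also for k=64):
    • N50 and E-size should be high, as they are to some extent here, because high values mean that long scaffolds have been assembled.
    • L50: the number of scaffolds needed to reach N50, i.e. the smallest number of scaffolds that together cover half of the assembly. It should be low, meaning that a few long scaffolds were generated rather than many short ones.
    • We can observe that max (the length of the longest sequence) is high, which is good: the chloroplast genome is a single circular molecule, so ideally it should assemble into a few long sequences.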
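
To see how N50, L50 and E-size follow from the sequence lengths, here is a minimal sketch that computes them from a list of lengths, one per line. It is not how Abyss itself produces the statistics file: the function name `assembly_stats` is made up, the input is a toy list of nine invented lengths, and I am assuming the usual definitions (N50 is the length at which the cumulative sum of the sorted lengths reaches half of the total, L50 is how many sequences that takes, and E-size is the sum of squared lengths divided by the total). Judging from the min and n.500 columns, the statistics above seem to count only sequences of at least 500 bp, so real input would need the same filter to be comparable.

```shell
#!/bin/sh
# Compute N50, L50 and E-size from sequence lengths read on stdin (one per line).
assembly_stats() {
    sort -rn | awk '
        # Store the lengths (already sorted longest first) and accumulate totals.
        { len[NR] = $1; total += $1; sumsq += $1 * $1 }
        END {
            if (NR == 0) exit 1
            for (i = 1; i <= NR; i++) {
                running += len[i]
                # Stop at the first sequence where half of the total is covered.
                if (running * 2 >= total) {
                    printf "N50=%d L50=%d E-size=%.2f\n", len[i], i, sumsq / total
                    exit
                }
            }
        }'
}

# Toy example with nine made-up lengths (not real data).
printf '%s\n' 2 3 4 5 6 7 8 9 10 | assembly_stats
# prints: N50=8 L50=3 E-size=7.11
```

In the toy example the total is 54, and the three longest sequences (10 + 9 + 8 = 27) already cover half of it, so N50 is 8 and L50 is 3.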