9. Christmas Time! - mai0/Project_BB2491 GitHub Wiki
Running Abyss on Uppmax
"There is nothing permanent except change." (Heraclitus)
Today I started writing a small script to run Abyss:
- 1st attempt
#!/bin/bash -l
#SBATCH -A g2016025
#SBATCH -p node
#SBATCH -n 8
#SBATCH -t 1:00:00
#SBATCH -J sprucecp_assembly_abyss
module load bioinfo-tools
module load abyss/1.3.7
module load bowtie
abyss-pe k=64 name=spruce lib='reads' reads=' proj/g2016025/Group1/data/z4006c01.g.ipe.1.fq /proj/g2016025/Group1/data/z4006c01.g.ipe.2.fq'
My output was: Primary job terminated normally, but 1 process returned a non-zero exit code.. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated.
I didn't understand the cause of that problem, so I continued by altering some parts of the script and experimenting!!
- 2nd attempt
#!/bin/bash -l
#SBATCH -A g2016025
#SBATCH -p node
#SBATCH -n 8
#SBATCH -t 1:00:00
#SBATCH -J sprucecp_assembly_abyss_2
module load bioinfo-tools
module load abyss/1.3.7
module load bowtie
abyss-pe k=64 name=spruce_2 in='proj/g2016025/Group1/data/z4006c01.g.ipe.1.fq /proj/g2016025/Group1/data/z4006c01.g.ipe.2.fq'''
The job was cancelled again!
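One thing worth ruling out before resubmitting is a mistyped input path: in both attempts above, the first read file is written as `proj/...` without a leading slash, so it is resolved relative to the working directory. Below is a minimal sketch of a pre-flight check that could sit at the top of the batch script. The function name `check_inputs` is hypothetical, and whether the path was the actual cause of the failure is an assumption that the error message does not confirm.

```shell
#!/bin/bash
# Hypothetical helper: verify that every input file exists and is non-empty
# before starting a long assembly job.
check_inputs() {
    local missing=0
    local f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty input: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible usage inside the batch script (paths as in the 3rd attempt):
# check_inputs /proj/g2016025/Group1/data/z4006c01.g.ipe.1.fq \
#              /proj/g2016025/Group1/data/z4006c01.g.ipe.2.fq || exit 1
```

Failing early like this costs a second of run time instead of a queue slot.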
- 3rd attempt, with k=31 (based on the Uppmax documentation):
#!/bin/bash -l
#SBATCH -A g2016025
#SBATCH -p node
#SBATCH -n 4
#SBATCH -t 13:00:00
#SBATCH -J sprucecp_assembly_abyss_5
module load gcc openmpi
module add bioinfo-tools abyss/1.9.0
#cd /proj/g2016025/Group1/data
# Commands
abyss-pe np=$SLURM_NPROCS name=chloro k=31 in='/proj/g2016025/Group1/data/z4006c01.g.ipe.1.fq /proj/g2016025/Group1/data/z4006c01.g.ipe.2.fq'
The assembler ran normally and finished without errors! I realized that my main problem was the time limit, since I had requested less run time than the job needed. It took around 8 hours!
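To make the time-limit reasoning concrete, here is a small sketch that compares an expected run time with the value given to `#SBATCH -t`. It only handles the `H:MM:SS` form used in the scripts above, and the function names `hms_to_seconds` and `fits_in_limit` are hypothetical.

```shell
#!/bin/bash
# Convert an H:MM:SS string (the form used with "#SBATCH -t" above) to seconds.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that values like "08" are not read as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeed if the requested limit (1st argument) covers the expected run time (2nd).
fits_in_limit() {
    local limit expected
    limit=$(hms_to_seconds "$1")
    expected=$(hms_to_seconds "$2")
    [ "$expected" -le "$limit" ]
}

# The run took around 8 hours: the first limit is too short, the second is enough.
fits_in_limit 1:00:00 8:00:00 || echo "1:00:00 is too short for an 8 hour run"
fits_in_limit 13:00:00 8:00:00 && echo "13:00:00 is enough for an 8 hour run"
```

On the cluster itself, `sacct -j <jobid> --format=JobID,Elapsed,Timelimit,State` should show the same comparison for a finished job, assuming job accounting is enabled.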
Comments:
- pe: the reads are paired-end
- name: the prefix used for the names of the output files
- k=31: the k-mer size used to build the de Bruijn graph is 31
- in: the input files that contain the reads. The read names carry the suffix /1 or /2, which marks the forward and reverse read of the same fragment.
- Output files from Abyss (help from https://github.com/bcgsc/abyss/wiki/ABySS-File-Formats):
- name-contigs.fa: final contigs in FASTA format
- name-contigs.dot: contig overlap graph (Graphviz format)
- name-bubbles.fa: equal-length variant sequences (FASTA format)
- name-indel.fa: different-length variant sequences (FASTA format)
- .dist: estimates of the distance between contigs
- .path: lists of merged contigs
- .hist: fragment-size histogram of a library
- coverage.hist: k-mer coverage histogram
- -stats: statistics (e.g. N50) for the unitigs, contigs and scaffolds
- So the output files are:
chloro-2.dot1 chloro-4.fa chloro-6.dot chloro-8.f chloro-stats.md
abyss_k64 chloro-2.fa chloro-4.fa.fai chloro-6.f chloro-bubbles.fa chloro-stats.tab
Abyss_try1_sbatch.sh chloro-2.path chloro-4.path1 chloro-6.hist chloro-contigs.dot chloro-unitigs.fa
Abyss_try2_sbatch.sh chloro-3.dist chloro-4.path2 chloro-6.path chloro-contigs.fa coverage.hist
Abyss_try3_sbatch.sh chloro-3.dot chloro-4.path3 chloro-6.path.dot chloro-indel.fa
chloro-1.dot chloro-3.fa chloro-5.dot chloro-7.dot chloro-scaffolds.dot reference.fasta
chloro-1.fa chloro-3.fa.fai chloro-5.fa chloro-7.fa chloro-scaffolds.fa
chloro-1.path chloro-3.hist chloro-5.path chloro-7.path chloro-stats
chloro-2.dot chloro-4.dot chloro-6.dist.dot chloro-8.dot chloro-stats.csv
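Going back to the comment on `in`: the /1 and /2 suffix convention can be checked quickly before a long run. The sketch below looks only at the first record of each FASTQ file, so it is a sanity check rather than a full validation. The function names `first_read_id` and `looks_paired` are hypothetical.

```shell
#!/bin/bash
# Print the ID of the first record in a FASTQ file: the first word of the
# header line, with the leading "@" removed.
first_read_id() {
    awk 'NR == 1 { sub(/^@/, ""); print $1; exit }' "$1"
}

# Succeed if the first read IDs end in /1 and /2 and share the same fragment name.
# Only the first record of each file is inspected.
looks_paired() {
    local id1 id2
    id1=$(first_read_id "$1")
    id2=$(first_read_id "$2")
    [ "${id1%/1}" != "$id1" ] &&
        [ "${id2%/2}" != "$id2" ] &&
        [ "${id1%/1}" = "${id2%/2}" ]
}

# Possible usage with the files from the scripts above:
# looks_paired /proj/g2016025/Group1/data/z4006c01.g.ipe.1.fq \
#              /proj/g2016025/Group1/data/z4006c01.g.ipe.2.fq && echo "looks paired"
```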
The statistics for k=31:
n | n.500 | L50 | min | N80 | N50 | N20 | E.size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|
2266570 | 859 | 306 | 500 | 569 | 736 | 1126 | 923 | 4189 | 657047 | chloro-unitigs.fa |
2262606 | 1312 | 336 | 500 | 654 | 1065 | 2576 | 2027 | 15765 | 1357433 | chloro-contigs.fa |
2262343 | 1279 | 308 | 500 | 681 | 1201 | 2820 | 2410 | 18833 | 1419923 | chloro-scaffolds.fa |
- Notes (also for k=64):
- N50 and E-size should be high, and they are reasonably high here, because high values mean that long scaffolds have been assembled.
- L50: the number of scaffolds needed to reach N50. It should be low, which means that the assembly consists of few but long scaffolds.
- We can observe that max is high, which is good: the chloroplast genome is a single circular molecule, so ideally it is reconstructed as a few long sequences.
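To show what these numbers measure, here is a minimal sketch that computes N50, L50 and E-size from a list of sequence lengths. It assumes the usual definitions: N50 is the length at which the running total of the sorted lengths reaches half of the total, L50 is the number of sequences needed to get there, and E-size is the sum of squared lengths divided by the sum of lengths. The 500 bp cut-off mirrors the `min` column of the table. The function names are hypothetical, and the results are not guaranteed to match the Abyss stats file digit for digit.

```shell
#!/bin/bash
# Read sequence lengths (one per line) on stdin and print "N50 L50 E-size".
# Lengths below the threshold (default 500) are ignored.
contiguity_stats() {
    local threshold="${1:-500}"
    sort -rn | awk -v t="$threshold" '
        $1 >= t { len[++n] = $1; sum += $1; sq += $1 * $1 }
        END {
            if (n == 0) exit 1
            for (i = 1; i <= n; i++) {
                run += len[i]
                # Stop at the first sequence where the running total reaches half.
                if (run * 2 >= sum) { n50 = len[i]; l50 = i; break }
            }
            printf "%d %d %d\n", n50, l50, sq / sum
        }'
}

# Print the length of every record in a FASTA file (wrapped lines are summed).
fasta_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Possible usage on one of the output files:
# fasta_lengths chloro-scaffolds.fa | contiguity_stats 500
```

A few long sequences push N50 and E-size up and L50 down, which is the pattern the notes above describe as desirable.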