Test3_GenomeAssembly - sansuach/Sanskritiacharya GitHub Wiki
Flye
First, I moved to the base directory where the analysis will be performed.
cd /lustre/isaac/proj/UTK0318/test3/analysis/
I created a directory called sachar10 for the current analysis and moved into it.
mkdir sachar10
cd sachar10
Inside it, I created a subdirectory, flye_assembly, for files related to the genome assembly, and within that a subdirectory, flye, where I set up the actual Flye run.
mkdir flye_assembly
cd flye_assembly
mkdir flye
cd flye
I created a Conda environment named flye with Flye installed from the Bioconda channel, then activated it. Keeping Flye in its own environment lets me run it without interfering with other environments.
conda create -n flye bioconda::flye
conda activate flye
I created symbolic links to the raw sequencing data from the citrus project, enabling easy access in the current directory.
ln -s /lustre/isaac/proj/UTK0318/test3/raw_data/citrus/* .
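A symbolic link is created even when its target path is wrong, so it can be worth confirming that every link resolves before submitting a job. A minimal sketch, where find_broken_links is a hypothetical helper name of my own and not part of any toolkit:

```shell
# Print every symbolic link in a directory whose target does not exist.
# find_broken_links is a hypothetical helper, not an existing command.
find_broken_links() {
    dir="${1:-.}"
    for f in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target cannot be resolved
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Running find_broken_links . in the flye directory should print nothing if all of the linked raw data files are reachable.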
Using the nano editor, I created a new SLURM script called genome_assembly.qsh.
nano genome_assembly.qsh
Contents of genome_assembly.qsh:
#!/bin/bash
#SBATCH -J flye_assembly
#SBATCH --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mem=250G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
# Load conda environment
eval "$(conda shell.bash hook)"
conda activate flye
flye --nano-raw microcitrus_australasica_nanopore.4.fastq \
--out-dir /lustre/isaac/proj/UTK0318/test3/analysis/sachar10/flye_assembly/flye \
--genome-size 337m \
--threads 48
This SLURM script sets the job parameters (job name, CPUs, memory, partition, and time limit), activates the flye Conda environment, and runs Flye on the raw Nanopore reads with the estimated genome size, the output directory, and the thread count.
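The thread count appears twice in the script, once as --cpus-per-task and once as --threads, and the two can drift apart when one is edited. A minimal sketch of keeping them in sync, assuming SLURM exports SLURM_CPUS_PER_TASK inside the job when --cpus-per-task is set; slurm_threads is a hypothetical helper name:

```shell
# Print the number of CPUs SLURM allocated to this task.
# Falls back to 1 when the variable is unset (for example, outside a job).
# slurm_threads is a hypothetical helper, not a SLURM command.
slurm_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}
```

Inside the job script, the Flye call could then end with --threads "$(slurm_threads)" instead of a hard-coded 48.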
I submitted the SLURM job script for execution:
sbatch genome_assembly.qsh
I checked the status of my SLURM jobs using the squeue command:
squeue -u sachar10
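squeue -u lists every job I own. To follow only the job that was just submitted, the job ID can be taken from the message sbatch prints. A minimal sketch, assuming the usual "Submitted batch job <id>" message; job_id_from_sbatch is a hypothetical helper name:

```shell
# Extract the job ID from an sbatch confirmation message.
# Assumes the ID is the last whitespace-separated field of the message.
# job_id_from_sbatch is a hypothetical helper, not a SLURM command.
job_id_from_sbatch() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Example use on the cluster (commented out because it submits a job):
# jobid=$(job_id_from_sbatch "$(sbatch genome_assembly.qsh)")
# squeue -j "$jobid"
```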
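To count the number of reads in the FASTQ file, I divided its line count by four, since each record spans four lines. Counting lines that start with @ overcounts, because quality strings can also begin with @.
awk 'END {print NR/4}' microcitrus_australasica_nanopore.4.fastq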
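The read count can also be wrapped in a small helper that accepts gzip-compressed files and warns when the line count is not a multiple of four. This is a sketch only: it assumes each record occupies exactly four lines (no wrapped sequences), and count_fastq_reads is a hypothetical helper name.

```shell
# Count FASTQ records as (number of lines) / 4.
# Assumes unwrapped, four-line records; count_fastq_reads is hypothetical.
count_fastq_reads() {
    file="$1"
    case "$file" in
        *.gz) lines=$(gzip -cd "$file" | wc -l) ;;
        *)    lines=$(wc -l < "$file") ;;
    esac
    lines=$((lines + 0))   # normalise any padding in the wc output
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $file has $lines lines, not a multiple of 4" >&2
    fi
    echo $((lines / 4))
}
```

For example, count_fastq_reads microcitrus_australasica_nanopore.4.fastq prints the number of reads in the input file used above.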
Once the assembly was complete, I used stats.sh from the BBMap toolkit to generate statistics about the assembled genome and saved the output in assembly.stats.
conda create -n bbmap bioconda::bbmap
conda activate bbmap
stats.sh in=assembly.fasta > assembly.stats
I inspected the assembly stats to review details such as GC content, scaffold and contig counts and lengths, and N50/L50 values.
more assembly.stats
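As an independent cross-check of the reported contiguity, the length-based N50 can be computed directly from the FASTA file: sort the sequence lengths in descending order and report the length at which the running total first reaches half of the assembly size. A minimal sketch using awk and sort; assembly_n50 is a hypothetical helper name:

```shell
# Compute the length-based N50 of a FASTA file.
# assembly_n50 is a hypothetical helper, not part of BBMap or Flye.
assembly_n50() {
    # First awk: print one total length per sequence (handles wrapped lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    # Second awk: walk lengths from longest to shortest until the
    # cumulative sum reaches at least half of the total length.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { print lens[i]; exit }
             }
         }'
}
```

Running assembly_n50 assembly.fasta prints a single length in base pairs that can be compared with the corresponding value in assembly.stats.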
I created and activated a new environment named compleasm for CompleAsm, which I’ll use to assess the completeness of the assembly.
conda create -n compleasm bioconda::compleasm
conda activate compleasm
I moved up one directory, created a new directory for CompleAsm, and linked the Flye assembly output (assembly.fasta) to this location.
cd ../
mkdir compleasm
cd compleasm
ln -s ../../flye_assembly/flye/assembly.fasta
I created a SLURM job script (compleasm.qsh) for CompleAsm. Here’s the content of this script:
#!/bin/bash
#SBATCH -J compleasm
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err
eval "$(conda shell.bash hook)"
conda activate compleasm
compleasm download embryophyta
compleasm \
run \
-a assembly.fasta \
-o assembly.compleasm \
-t 10 \
-l embryophyta
This script sets up and runs CompleAsm with the downloaded dataset for the lineage embryophyta. I submitted the CompleAsm job.
sbatch compleasm.qsh
Finally, I navigated to the output directory and checked the summary file for the CompleAsm results, which lists the lineage used and the completeness metrics.
cd assembly.compleasm
cat summary.txt
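To pull a single metric out of the summary, for example to compare several assemblies, the percentage for one category can be extracted with awk. This is a sketch under an assumption about the file layout: it expects lines of the form "S:<percent>%, <count>", with one letter per category (S, D, F, I, M), and would need adjusting if summary.txt is laid out differently. compleasm_pct is a hypothetical helper name.

```shell
# Print the percentage reported for one category in a compleasm summary.
# $1 = category letter (S, D, F, I or M), $2 = path to summary.txt
# Assumes lines such as "S:<percent>%, <count>"; compleasm_pct is hypothetical.
compleasm_pct() {
    awk -v cat="$1" -F'[:%]' '
        { sub(/^[ \t]+/, "") }      # drop leading whitespace, re-split fields
        $1 == cat { print $2 }
    ' "$2"
}
```

For example, compleasm_pct S summary.txt prints the single-copy percentage and compleasm_pct M summary.txt prints the missing percentage.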