Test3_GenomeAssembly - sansuach/Sanskritiacharya GitHub Wiki

Flye

First, I moved to the base directory where the analysis will be performed.

cd /lustre/isaac/proj/UTK0318/test3/analysis/

I created a directory called sachar10 for the current analysis and moved into it.

mkdir sachar10
cd sachar10

I created a subdirectory, flye_assembly, for files related to the genome assembly and, inside it, a subdirectory, flye, where I set up the actual Flye run.

mkdir flye_assembly
cd flye_assembly
mkdir flye
cd flye
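The same layout can also be created in one step with `mkdir -p`, which creates any missing parent directories and does not fail if a directory already exists. A minimal sketch, assuming a hypothetical `ANALYSIS_DIR` variable that points at the base analysis directory (it falls back to the current directory here):

```shell
# ANALYSIS_DIR is a hypothetical variable for the base analysis directory;
# it defaults to the current directory if unset.
base="${ANALYSIS_DIR:-$PWD}"
target="$base/sachar10/flye_assembly/flye"

# Create the whole nested path at once, then move into it.
mkdir -p "$target"
cd "$target" && pwd
```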

I created a Conda environment named flye, installed Flye into it from the Bioconda channel, and activated it. Keeping Flye in its own environment lets me run it without interfering with other environments.

conda create -n flye bioconda::flye
conda activate flye
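Before going further, it can help to confirm that the tool is actually on the `PATH` of the active environment. A small sketch; `have_tool` is a hypothetical helper name, not part of Conda or Flye:

```shell
# have_tool is a hypothetical helper: succeeds if the named command is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool flye; then
    flye --version
else
    echo "flye not found; is the flye environment activated?" >&2
fi
```

The same check could be reused later for the other tools in this walkthrough.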

I created symbolic links to the raw sequencing data from the citrus project, enabling easy access in the current directory.

ln -s /lustre/isaac/proj/UTK0318/test3/raw_data/citrus/* .
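Symbolic links are created even when the target path is wrong, so a quick check that every link resolves can save a failed job later. A minimal sketch; `check_links` is a hypothetical helper written for illustration:

```shell
# check_links is a hypothetical helper: lists symlinks in a directory whose
# targets do not exist, and returns non-zero if any are found.
check_links() {
    dir="${1:-.}"
    rc=0
    for f in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target cannot be reached
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "broken link: $f"
            rc=1
        fi
    done
    return "$rc"
}

check_links . || echo "some links do not resolve" >&2
```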

Using the nano editor, I created a new SLURM script called genome_assembly.qsh.

nano genome_assembly.qsh

Contents of genome_assembly.qsh:

#!/bin/bash
#SBATCH -J flye_assembly
#SBATCH --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mem=250G
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err

# Load conda environment
eval "$(conda shell.bash hook)"
conda activate flye

flye --nano-raw microcitrus_australasica_nanopore.4.fastq \
     --out-dir /lustre/isaac/proj/UTK0318/test3/analysis/sachar10/flye_assembly/flye \
     --genome-size 337m \
     --threads 48

This SLURM script specifies the job parameters, loads the Flye environment, and specifies the command to run Flye with the appropriate input file, genome size, and output directory.
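One fragile spot in a script like this is that the thread count appears twice (`--cpus-per-task` and `--threads`). SLURM exports `SLURM_CPUS_PER_TASK` inside a job when `--cpus-per-task` is set, so the Flye command line could be derived from it instead. The sketch below only prints the command rather than running it; `build_flye_cmd` is a hypothetical helper, and the fallback of 1 thread outside SLURM is an assumption:

```shell
# build_flye_cmd is a hypothetical helper that prints (does not run) the
# Flye command, taking the thread count from SLURM when available.
build_flye_cmd() {
    reads="$1"
    out_dir="$2"
    # Assumption: fall back to a single thread when not inside a SLURM job.
    threads="${SLURM_CPUS_PER_TASK:-1}"
    printf 'flye --nano-raw %s --out-dir %s --genome-size 337m --threads %s\n' \
        "$reads" "$out_dir" "$threads"
}

build_flye_cmd microcitrus_australasica_nanopore.4.fastq .
```

Inside the job script, the printed line would be replaced by the actual `flye` call with `--threads "$SLURM_CPUS_PER_TASK"`.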

I submitted the SLURM job script for execution:

sbatch genome_assembly.qsh

I checked the status of my SLURM jobs using the squeue command:

squeue -u sachar10
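Once the job no longer appears in the queue, it is worth confirming that Flye actually produced an assembly before moving on. Flye writes `assembly.fasta`, `assembly_info.txt`, and `flye.log` into its output directory. A minimal sketch; `check_flye_output` is a hypothetical helper:

```shell
# check_flye_output is a hypothetical helper: verifies that the main Flye
# output files exist and are non-empty, then reports the number of contigs.
check_flye_output() {
    out_dir="${1:-.}"
    for f in assembly.fasta assembly_info.txt; do
        if [ ! -s "$out_dir/$f" ]; then
            echo "missing or empty: $out_dir/$f" >&2
            return 1
        fi
    done
    # Each FASTA record starts with a '>' header line.
    echo "contigs: $(grep -c '^>' "$out_dir/assembly.fasta")"
}

check_flye_output . || echo "assembly not finished (or failed); see flye.log" >&2
```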

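To count the number of reads in the FASTQ file, I divided the line count by four, since each FASTQ record spans four lines. (Counting lines that start with '@' can overcount, because quality strings may also begin with '@'.)

echo $(( $(wc -l < microcitrus_australasica_nanopore.4.fastq) / 4 ))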

Once the assembly was complete, I installed BBMap in its own Conda environment, used stats.sh from the BBMap toolkit to generate statistics about the assembled genome, and saved the output in assembly.stats.

conda create -n bbmap bioconda::bbmap
conda activate bbmap
stats.sh in=assembly.fasta > assembly.stats

I inspected the assembly stats to review details such as GC content, scaffold and contig counts and lengths, and N50/L50.

more assembly.stats
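As a cross-check on the reported statistics, the N50 length can be recomputed directly from the FASTA file with standard tools. The sketch below uses the common definition (the length of the contig at which the running total of sorted contig lengths first reaches half the assembly size); `n50` is a hypothetical helper, and since tools differ in how they label N50 and L50, it is the value rather than the label that should be compared:

```shell
# n50 is a hypothetical helper: prints the N50 contig length of a FASTA file.
n50() {
    # Step 1: emit one length per contig (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort lengths from longest to shortest.
    sort -rn |
    # Step 3: walk down the list until half of the total length is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             run = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { print lens[i]; exit }
             }
         }'
}

if [ -s assembly.fasta ]; then
    n50 assembly.fasta
fi
```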

I created and activated a new environment named compleasm for CompleAsm, which I’ll use to assess the completeness of the assembly.

conda create -n compleasm bioconda::compleasm
conda activate compleasm

I moved up one directory, created a new directory for CompleAsm, and linked the Flye assembly output (assembly.fasta) to this location.

cd ../
mkdir compleasm
cd compleasm
ln -s ../../flye_assembly/flye/assembly.fasta

I created a SLURM job script (compleasm.qsh) for CompleAsm. Here’s the content of this script:

#!/bin/bash
#SBATCH -J compleasm
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=[email protected]
#SBATCH --output=R-%x.%j.out
#SBATCH --error=R-%x.%j.err

eval "$(conda shell.bash hook)"
conda activate compleasm
compleasm download embryophyta
compleasm \
        run \
        -a assembly.fasta \
        -o assembly.compleasm \
        -t 10 \
        -l embryophyta

This script sets up and runs CompleAsm with the downloaded dataset for the lineage embryophyta. I submitted the CompleAsm job.

sbatch compleasm.qsh
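In a script like compleasm.qsh, the lineage name has to be spelled identically in the download and run steps, and the thread count has to match `--cpus-per-task`. Keeping each value in a single variable is one way to avoid a mismatch. The sketch below only prints the two commands; `print_compleasm_cmds` is a hypothetical helper, not part of CompleAsm:

```shell
lineage="embryophyta"   # lineage name shared by the download and run steps
threads=10              # should match --cpus-per-task in the SLURM header

# print_compleasm_cmds is a hypothetical helper that prints (does not run)
# the two commands with a consistent lineage name.
print_compleasm_cmds() {
    printf 'compleasm download %s\n' "$1"
    printf 'compleasm run -a assembly.fasta -o assembly.compleasm -t %s -l %s\n' "$2" "$1"
}

print_compleasm_cmds "$lineage" "$threads"
```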

Finally, I navigated to the output directory and checked the summary file for CompleAsm results, which includes lineage details and completeness metrics.

cd assembly.compleasm
cat summary.txt
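For a single headline number, the single-copy (S) and duplicated (D) percentages can be added together to give overall completeness. The sketch below assumes that summary.txt contains lines of the form `S:<percent>%, <count>` and `D:<percent>%, <count>`; that layout is an assumption about the file format and should be checked against the actual output. `completeness` is a hypothetical helper:

```shell
# completeness is a hypothetical helper: sums the S and D percentages.
# Assumption: lines look like "S:<percent>%, <count>" and "D:<percent>%, <count>".
completeness() {
    # Split on ':' and '%', so field 1 is the category and field 2 the percentage.
    awk -F'[:%]' '$1 == "S" || $1 == "D" { sum += $2 }
                  END { printf "%.2f\n", sum }' "$1"
}

if [ -s summary.txt ]; then
    completeness summary.txt
fi
```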