Genome Assembly - mestato/EPP622_2024 GitHub Wiki

Hifiasm

Note from Shade: 17.121 GB of RAM and ~22 CPU, so I think requesting 20 GB of RAM and 25 or 30 CPU on either the short or campus node for 3 hr

Link to reads

ln -s /lustre/isaac/proj/UTK0318/discula_genome/raw_data/* .

Install

 conda create -n hifiasm bioconda::hifiasm

Script

#!/bin/bash
#SBATCH -J hifiasm
#SBATCH --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mem=192G
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"
conda activate hifiasm

hifi_reads=dd_as111_100X.fastq.gz
hic_r1=23ddas111_1327065_S3HiC_R1.fastq.gz
hic_r2=23ddas111_1327065_S3HiC_R2.fastq.gz
outfile=dd_as111_100x_assembly

echo "HiFi Reads: $hifi_reads"
echo "HiC Read 1: $hic_r1"
echo "HiC Read 2: $hic_r2"

hifiasm \
        -o $outfile \
        -t 48 \
        --hg-size 49m  \
        --n-hap 1 \
        --h1 $hic_r1 \
        --h2 $hic_r2 \
        $hifi_reads \
        >& hifiasm_dd_as111_100x.log

This took 18 minutes for me. More info on parameters:

  • -t is number of threads to use
  • --hg_size is estimated haploid genome size used for inferring read coverage [auto]
  • --n-hap is assumption of haplotype number.
  • --h1 is the file of forward reads of HiC
  • --h2 is the file of reverse reads of HiC

See the hifiasm documentation learn more about the output files and the gfa format.

To explore the assembly, we need to convert from gfa to fasta format. A tool, gfatools, is available for this. Its very fast, so we do not need to create a submission script.

conda create -n gfatools bioconda::gfatools
conda activate gfatools
for GFA in *.p_ctg.gfa; do
        OUT=$(echo $GFA | sed 's/.gfa/.fasta/')
        gfatools gfa2fa $GFA > $OUT
done

BBMap

Install and run

conda create -n bbmap bioconda::bbmap
conda activate bbmap
stats.sh in=dd_as111_100x_assembly.hic.p_ctg.fasta > dd_as111_100x_assembly.hic.p_ctg.stats

Now check out the stats in dd_as111_100x_assembly.hic.p_ctg.stats

Compleasm

Install

conda create -n compleasm bioconda::compleasm

Link to the file

mkdir 2_compleasm
cd 2_compleasm
ln -s ../1_hifiasm/dd_as111_100x_assembly.hic.p_ctg.fasta .

Script

#!/bin/bash
#SBATCH -J compleasm
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"
conda activate compleasm
compleasm download fungi
compleasm \
                run \
                -a dd_as111_100x_assembly.hic.p_ctg.fasta \
                -o dd_as111_100x_assembly.hic.p_ctg.compleasm \
                -t 10 \
                -l fungi

This took 26 minutes for me. Check out results in dd_as111_100x_assembly.hic.p_ctg.compleasm/summary.txt

Find the mitochondrial genome

The assembly includes the mitochondrial genome and the nuclear genome. To find the mitochondrial genome, we will use minimap2 - it doesn't just align reads to assemblies, it can also align assemblies (or genomes) to each other.

Lets create the analysis directory and link to the files. We already have the mt genome, so we can use it (if you don't have it already, you could align to a mt genome from a related organism).

mkdir 3_mt
cd 3_mt
ln -s ../../../raw_data/dd_100X_oatk.mito.ctg.fasta .
ln -s ../1_hifiasm/dd_as111_100x_assembly.hic.p_ctg.fasta .

This is super fast, its ok to do on the log in node

conda activate minimap2
minimap2 -x asm5 dd_as111_100x_assembly.hic.p_ctg.fasta dd_100X_oatk.mito.ctg.fasta >assembly-vs-mito.minimap2.paf

More info on minimap2 settings and paf format

Look at contig graph

We can use Bandage - you'll have to run on your laptop.