Genome Assembly - mestato/EPP622_2024 GitHub Wiki
Hifiasm
Note from Shade: 17.121 GB of RAM and ~22 CPU, so I think requesting 20 GB of RAM and 25 or 30 CPU on either the short or campus node for 3 hr
Link to reads
ln -s /lustre/isaac/proj/UTK0318/discula_genome/raw_data/* .
Install
conda create -n hifiasm bioconda::hifiasm
Script
#!/bin/bash
#SBATCH -J hifiasm
#SBATCH --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mem=192G
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
eval "$(conda shell.bash hook)"
conda activate hifiasm
hifi_reads=dd_as111_100X.fastq.gz
hic_r1=23ddas111_1327065_S3HiC_R1.fastq.gz
hic_r2=23ddas111_1327065_S3HiC_R2.fastq.gz
outfile=dd_as111_100x_assembly
echo "HiFi Reads: $hifi_reads"
echo "HiC Read 1: $hic_r1"
echo "HiC Read 2: $hic_r2"
hifiasm \
-o $outfile \
-t 48 \
--hg-size 49m \
--n-hap 1 \
--h1 $hic_r1 \
--h2 $hic_r2 \
$hifi_reads \
>& hifiasm_dd_as111_100x.log
This took 18 minutes for me. More info on parameters:
- -t is number of threads to use
- --hg_size is estimated haploid genome size used for inferring read coverage [auto]
- --n-hap is assumption of haplotype number.
- --h1 is the file of forward reads of HiC
- --h2 is the file of reverse reads of HiC
See the hifiasm documentation learn more about the output files and the gfa format.
To explore the assembly, we need to convert from gfa to fasta format. A tool, gfatools, is available for this. Its very fast, so we do not need to create a submission script.
conda create -n gfatools bioconda::gfatools
conda activate gfatools
for GFA in *.p_ctg.gfa; do
OUT=$(echo $GFA | sed 's/.gfa/.fasta/')
gfatools gfa2fa $GFA > $OUT
done
BBMap
Install and run
conda create -n bbmap bioconda::bbmap
conda activate bbmap
stats.sh in=dd_as111_100x_assembly.hic.p_ctg.fasta > dd_as111_100x_assembly.hic.p_ctg.stats
Now check out the stats in dd_as111_100x_assembly.hic.p_ctg.stats
Compleasm
Install
conda create -n compleasm bioconda::compleasm
Link to the file
mkdir 2_compleasm
cd 2_compleasm
ln -s ../1_hifiasm/dd_as111_100x_assembly.hic.p_ctg.fasta .
Script
#!/bin/bash
#SBATCH -J compleasm
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH -A ISAAC-UTK0318
#SBATCH -p short
#SBATCH -q short
#SBATCH -t 03:00:00
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
eval "$(conda shell.bash hook)"
conda activate compleasm
compleasm download fungi
compleasm \
run \
-a dd_as111_100x_assembly.hic.p_ctg.fasta \
-o dd_as111_100x_assembly.hic.p_ctg.compleasm \
-t 10 \
-l fungi
This took 26 minutes for me. Check out results in dd_as111_100x_assembly.hic.p_ctg.compleasm/summary.txt
Find the mitochondrial genome
The assembly includes the mitochondrial genome and the nuclear genome. To find the mitochondrial genome, we will use minimap2 - it doesn't just align reads to assemblies, it can also align assemblies (or genomes) to each other.
Lets create the analysis directory and link to the files. We already have the mt genome, so we can use it (if you don't have it already, you could align to a mt genome from a related organism).
mkdir 3_mt
cd 3_mt
ln -s ../../../raw_data/dd_100X_oatk.mito.ctg.fasta .
ln -s ../1_hifiasm/dd_as111_100x_assembly.hic.p_ctg.fasta .
This is super fast, its ok to do on the log in node
conda activate minimap2
minimap2 -x asm5 dd_as111_100x_assembly.hic.p_ctg.fasta dd_100X_oatk.mito.ctg.fasta >assembly-vs-mito.minimap2.paf
More info on minimap2 settings and paf format
Look at contig graph
We can use Bandage - you'll have to run on your laptop.