Assembly - mestato/EPP622_2024 GitHub Wiki

Assembly

We use Hifiasm for genome assembly, which incorporates HiFi and Hi-C data to generate an assembly. For more information see their documentation, github, or publication. Output is a haplotype resolved assembly (i.e. you'll have two separate files for the contiguous haplotype 1 and 2).

cd /lustre/isaac/proj/UTK0318/dogwood_genome/analysis/<yourusername>
mkdir 2_hifiasm
cd 2_hifiasm
ln -s /lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/*fastq.gz .
ls -l

Now we will do our conda install, activate the environment, and test that it works

mamba create -n hifiasm bioconda::hifiasm
conda activate hifiasm
hifiasm

Now we need to build a slurm script

#!/bin/bash
#SBATCH -J hifiasm
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:30:00
#SBATCH --array=1-3

eval "$(conda shell.bash hook)"
conda activate fastqc
infile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" files.txt)
fastqc -t 3 $infile
<location of hifiasm>/hifiasm \
	-o Cornus_florida_CherokeeBrave_v.3.0.0 \
	-t 40 \
	--hg-size 1541m \
	--h1 CherBrave_sp22227_S3HiC_R1.fastq.gz \
	--h2 CherBrave_sp22227_S3HiC_R2.fastq.gz \
	CherBrave_Run1_m64310e_211208_183918.fastq.gz \
	CherBrave_Run2_m54334Ue_211217_062947.fastq.gz \
	CherBrave_Run3_m54334Ue_220211_075016.fastq.gz \
	>& hifiasm_output.txt
  • -o prefix for output files
  • -t number of threads
  • --hg-size estimated size of haploid genome
  • --h1 Hi-C forward reads
  • --h2 Hi-C reverse reads
  • input PacBio Hifi reads (CherBrave_Run1-3)
  • >& writes the standard output to hifiasm_output.txt

Check out this documentation for more information on the output files. Note it will output .gfa files, which are assembly graphs and will need to be converted to fasta files. This could be done by:

## example; don't do in class
# convert gfa to fasta - primary haplotype
awk '/^S/{print ">"$2;print $3}' \
Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.gfa \ # gfa file you want to convert
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.fasta # name you want the fasta file to be

Stats on assembly

Hifiasm Assembly

Make a new folder in your long_read folder called 2_hifiasm. pwd should output /pickett_shared/teaching/EPP622_Fall2022/analysis/<your name>/long_read

mkdir 2_hifiasm
cd 2_hifiasm

Hap 1 & 2 are located here: /pickett_shared/teaching/EPP622_Fall2022/long_read/analysis/2_hifiasm

Symbolically link hap 1 & 2 (the outputs from hifiasm)

ln -s /pickett_shared/teaching/EPP622_Fall2022/long_read/analysis/2_hifiasm/*fasta .

Load java

spack load /wgx2dcw

Run bbmap and store the output in a .txt file (only uses 1 thread, so we should be good)

/sphinx_local/software/bbmap/stats.sh -Xmx20g \
in=Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.fasta \
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.stats.txt

/sphinx_local/software/bbmap/stats.sh -Xmx20g \
in=Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg.fasta \
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg.stats.txt

Let's take a look at the top part of the files

head -15 Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.stats.txt
head -15 Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg.stats.txt

3D-DNA Assembly

As long as we have time, let's check out the scaffolded/chromosome level assembly. The chromosome level assembly was produced using Juicer and 3D-DNA. For more information about their pipeline and how to run it please see their cookbook. There's lots of code involved, so if you are ever using their pipeline and need help, feel free to reach out, but I won't put all the code here.

Let's just check out how the final 3D-DNA assembly compares to the hifiasm assembly

The 3D-DNA assembly is here /pickett_shared/teaching/EPP622_Fall2022/long_read/analysis/3_3d-dna

Let's make a new folder and symbolically link the 3D-DNA hap 1 & 2 files!

cd ../
pwd
# should output /pickett_shared/teaching/EPP622_Fall2022/analysis/<your folder name>/long_read
mkdir 3_3d-dna
cd 3_3d-dna
ln -s /pickett_shared/teaching/EPP622_Fall2022/long_read/analysis/3_3d-dna/*fasta .

Now run bbmap again

# java should already be loaded, so you don't need to load it again
/sphinx_local/software/bbmap/stats.sh -Xmx20g \
in=Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg_HiC.fasta \
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg_HiC.stats.txt

/sphinx_local/software/bbmap/stats.sh -Xmx20g \
in=Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg_HiC.fasta \
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg_HiC.stats.txt

Let's take a look at the top part of the files again

head -15 Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg_HiC.stats.txt
head -15 Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg_HiC.stats.txt
⚠️ **GitHub.com Fallback** ⚠️