Assembly - mestato/EPP622_2024 GitHub Wiki
We use Hifiasm for genome assembly, which incorporates HiFi and Hi-C data to generate an assembly. For more information see their documentation, github, or publication. Output is a haplotype resolved assembly (i.e. you'll have two separate files for the contiguous haplotype 1 and 2).
cd /lustre/isaac/proj/UTK0318/dogwood_genome/analysis/<yourusername>
mkdir 2_hifiasm
cd 2_hifiasm
ln -s /lustre/isaac/proj/UTK0318/dogwood_genome/raw_data/*fastq.gz .
ls -l
Now we will do our conda install, activate the environment, and test that it works
mamba create -n hifiasm bioconda::hifiasm
conda activate hifiasm
hifiasm
Now we need to build a slurm script
#!/bin/bash
#SBATCH -J hifiasm
#SBATCH --nodes=1
#SBATCH --cpus-per-task=40
#SBATCH -A ISAAC-UTK0318
#SBATCH -p condo-epp622
#SBATCH -q condo
#SBATCH -t 00:30:00
#SBATCH --array=1-3
eval "$(conda shell.bash hook)"
conda activate fastqc
infile=$(sed -n -e "${SLURM_ARRAY_TASK_ID} p" files.txt)
fastqc -t 3 $infile
<location of hifiasm>/hifiasm \
-o Cornus_florida_CherokeeBrave_v.3.0.0 \
-t 40 \
--hg-size 1541m \
--h1 CherBrave_sp22227_S3HiC_R1.fastq.gz \
--h2 CherBrave_sp22227_S3HiC_R2.fastq.gz \
CherBrave_Run1_m64310e_211208_183918.fastq.gz \
CherBrave_Run2_m54334Ue_211217_062947.fastq.gz \
CherBrave_Run3_m54334Ue_220211_075016.fastq.gz \
>& hifiasm_output.txt
-
-o
prefix for output files -
-t
number of threads -
--hg-size
estimated size of haploid genome -
--h1
Hi-C forward reads -
--h2
Hi-C reverse reads -
input
PacBio Hifi reads (CherBrave_Run1-3) -
>&
writes the standard output to hifiasm_output.txt
Check out this documentation for more information on the output files. Note it will output .gfa files, which are assembly graphs and will need to be converted to fasta files. This could be done by:
## example; don't do in class
# convert gfa to fasta - primary haplotype
awk '/^S/{print ">"$2;print $3}' \
Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.gfa \ # gfa file you want to convert
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.fasta # name you want the fasta file to be
Make a new folder in your long_read folder called 2_hifiasm.
pwd
should output /pickett_shared/teaching/EPP622_Fall2022/analysis/<your name>/long_read
mkdir 2_hifiasm
cd 2_hifiasm
Hap 1 & 2 are located here: /pickett_shared/teaching/EPP622_Fall2022/long_read/analysis/2_hifiasm
Symbolically link hap 1 & 2 (the outputs from hifiasm)
ln -s /pickett_shared/teaching/EPP622_Fall2022/long_read/analysis/2_hifiasm/*fasta .
Load java
spack load /wgx2dcw
Run bbmap and store the output in a .txt file (only uses 1 thread, so we should be good)
/sphinx_local/software/bbmap/stats.sh -Xmx20g \
in=Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.fasta \
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.stats.txt
/sphinx_local/software/bbmap/stats.sh -Xmx20g \
in=Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg.fasta \
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg.stats.txt
Let's take a look at the top part of the files
head -15 Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg.stats.txt
head -15 Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg.stats.txt
As long as we have time, let's check out the scaffolded/chromosome level assembly. The chromosome level assembly was produced using Juicer and 3D-DNA. For more information about their pipeline and how to run it please see their cookbook. There's lots of code involved, so if you are ever using their pipeline and need help, feel free to reach out, but I won't put all the code here.
Let's just check out how the final 3D-DNA assembly compares to the hifiasm assembly
The 3D-DNA assembly is here /pickett_shared/teaching/EPP622_Fall2022/long_read/analysis/3_3d-dna
Let's make a new folder and symbolically link the 3D-DNA hap 1 & 2 files!
cd ../
pwd
# should output /pickett_shared/teaching/EPP622_Fall2022/analysis/<your folder name>/long_read
mkdir 3_3d-dna
cd 3_3d-dna
ln -s /pickett_shared/teaching/EPP622_Fall2022/long_read/analysis/3_3d-dna/*fasta .
Now run bbmap again
# java should already be loaded, so you don't need to load it again
/sphinx_local/software/bbmap/stats.sh -Xmx20g \
in=Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg_HiC.fasta \
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg_HiC.stats.txt
/sphinx_local/software/bbmap/stats.sh -Xmx20g \
in=Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg_HiC.fasta \
> Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg_HiC.stats.txt
Let's take a look at the top part of the files again
head -15 Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap1.p_ctg_HiC.stats.txt
head -15 Cornus_florida_CherokeeBrave_v.3.0.0.hic.hap2.p_ctg_HiC.stats.txt