Discula destructiva HiFi Assembly [DEPRECATED] - ShadeNiece/DisculaDestructiva_GenomeAssembly-Annotation GitHub Wiki

Received HiFi data from University of Washington 2/23/2024
Path to raw data in ISAAC-NG: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/raw_data/dd_hifi_raw/23_DD_AS1112.SN/r84046_20240216_212323_1_B01/hifi_reads
Long story short, the Revio malfunctioned during my first HiFi run from Feb 2024, and we were only able to generate a total of 327 Mbp instead of the expected 75-90 Gbp of data. UW offered to redo QC, library prep, and sequencing for free, so they redid the HiFi run in May 2024. This new run generated 93.51 Gbp of data. This new assembly is held within the updated assembly GitHub.

Download data

I downloaded my raw HiFi data to the above directory in ISAAC-NG using Globus.

Install hifiasm

I followed the installation documentation on the hifiasm GitHub

cd /lustre/isaac/proj/UTK0032/sniece/software
git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make

Run PBTK (PacBio Toolkit): bam2fastq

Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/01_pbtk_bam2fastq
Documentation: https://github.com/PacificBiosciences/pbtk
We want to convert the bam we received from UW to a fastq that can be used as input for hifiasm

Link raw bam to current directory

ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/raw_data/dd_hifi_raw/23_DD_AS1112.SN/r84046_20240216_212323_1_B01/hifi_reads/*.unassigned* .

ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/raw_data/dd_hifi_raw/HiFiRun_Improved_May2024/Discula_ps.SN.HFSS/r84046_20240503_231430_1_C01/hifi_reads/* .

Conda install PBTK

conda create -n pbtk -c bioconda pbtk

Run PBTK Note: I only have one bam due to the sequencing debacle

nano run_pbtk_bam2fastq.qsh

#!/bin/bash
#SBATCH --job-name=run_pbtk_bam2fastq
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus
#SBATCH --qos=campus
#SBATCH --time=1:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"

conda activate pbtk

for file in *bam
do
        outfile=$(basename $file | sed 's/.bam//')
        bam2fastq -o $outfile $file
done

sbatch run_pbtk_bam2fastq.qsh

this will generate a fastq.gz file from your bam file

4. LongQC

5. Hifiasm

Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/03_hifiasm Documentation: https://github.com/chhylp123/hifiasm

Link the HiFi and Hi-C fastq files

ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/01_pbtk_bam2fastq/*.fastq.gz .
ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/raw_data/dd_hic_raw/*.fastq.gz .

Run hifiasm (no downsampling)

nano run_hifiasm.qsh

#!/bin/bash
#SBATCH --job-name=run_hifiasm
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --mem=50G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=condo-ut-genomics
#SBATCH --qos=genomics
#SBATCH --time=24:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

hifi_reads=m84046_240216_233301_s2.hifi_reads.unassigned.fastq.gz
hic_r1=23ddas111_1327065_S3HiC_R1.fastq.gz
hic_r2=23ddas111_1327065_S3HiC_R2.fastq.gz
outfile=dd_as111_assembly

echo "HiFi Reads: $hifi_reads"
echo "ONT Reads: $ont_reads"
echo "HiC Read 1: $hic_r1"
echo "HiC Read 2: $hic_r2"

/lustre/isaac/proj/UTK0032/sniece/software/hifiasm/hifiasm \
        -o $outfile \
        -t 32 \
        --hg-size 49m  \
        --h1 $hic_r1 \
        --h2 $hic_r2 \
        $hifi_reads \
        >& hifiasm_dd_as111.log

.hic.p_ctg.gfa: assembly graph of primary contigs
.hic.hap1.p_ctg.gfa: fully phased contig graph of haplotype1 where each contig is fully phased
.hic.hap2.p_ctg.gfa: fully phased contig graph of haplotype2 where each contig is fully phased

6. Convert gfa to fasta and get stats

Install bbmap

conda create -n bbmap -c bioconda bbmap

Convert gfa to fasta and run bbmap

#!/bin/bash
#SBATCH --job-name=run_gfa2fast_bbmap_stats
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=campus
#SBATCH --qos=campus
#SBATCH --time=1:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"

conda activate bbmap

# convert gfa to fasta - primary contigs
awk '/^S/{print ">"$2;print $3}' \
dd_as111_assembly.hic.p_ctg.gfa \
> dd_as111_assembly.hic.p_ctg.fa

# convert gfa to fasta - primary haplotype
awk '/^S/{print ">"$2;print $3}' \
dd_as111_assembly.hic.hap1.p_ctg.gfa \
> dd_as111_assembly.hic.hap1.p_ctg.fa

# convert gfa to fasta - alternative haplotype
awk '/^S/{print ">"$2;print $3}' \
dd_as111_assembly.hic.hap2.p_ctg.gfa \
> dd_as111_assembly.hic.hap2.p_ctg.fa


# get stats

/nfs/home/lwy647/.conda/envs/bbmap/bin/stats.sh -Xmx20g \
in=dd_as111_assembly.hic.p_ctg.fa \
> dd_as111_assembly.hic.p_ctg.stats.txt

/nfs/home/lwy647/.conda/envs/bbmap/bin/stats.sh -Xmx20g \
in=dd_as111_assembly.hic.hap1.p_ctg.fa \
> dd_as111_assembly.hic.hap1.p_ctg.stats.txt

7. BUSCO

Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/04_busco

Link fasta file from primary assembly

ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/03_hifiasm/dd_as111_assembly.hic.p_ctg.fa .

Conda install BUSCO

conda create -n busco -c bioconda busco

Run BUSCO with

nano run_busco.qsh

#!/bin/bash
#SBATCH --job-name=run_busco
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --mem=10G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=condo-ut-genomics
#SBATCH --qos=genomics
#SBATCH --time=1:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"

conda activate busco

busco \
        -i dd_as111_assembly.hic.p_ctg.fa \
        -m genome \
        -l sordariomycetes_odb10 \
        -c 40 \
        --out dd_as111_assembly.hic.p_ctg.BUSCO

---------------------------------------------------
|Results from dataset sordariomycetes_odb10        |
---------------------------------------------------
|C:92.3%[S:92.0%,D:0.3%],F:1.9%,M:5.8%,n:3817      |
|3520    Complete BUSCOs (C)         	       |
|3510    Complete and single-copy BUSCOs (S)       |
|10    Complete and duplicated BUSCOs (D)          |
|71    Fragmented BUSCOs (F)                       |
|226    Missing BUSCOs (M)                         |
|3817    Total BUSCO groups searched               |
---------------------------------------------------

8. Removing mitochondrial sequences

Directory: /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/05_remove_mitochondria

Download the most related and complete mitochondrial genome from NCBI. As of 2/24/2024, the closest one to Discula destructiva is Chrysoporthe puriensis isolate CMW54409 (GenBank: CP064901.1).
Link the primary assembly

ln -s /lustre/isaac/proj/UTK0032/sniece/DisculaDestructiva_GenomeAssembly/analysis/03_hifiasm/dd_as111_assembly.hic.p_ctg.fa .

Conda install minimap

conda create -n minimap -c bioconda minimap

Use minimap2 to align to the primary genome assembly to the mitochondrial assembly and remove mitochondrial derived sequences

nano run_minimap2mito.qsh

#!/bin/bash
#SBATCH --job-name=run_minimap2mito
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=condo-ut-genomics
#SBATCH --qos=genomics
#SBATCH --time=1:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"
conda activate minimap2

mitochondrial_genome=(Chrysoporthe_puriensis_mito.fasta)
primary_in=(dd_as111_assembly.hic.p_ctg.fa)
primary_mt_out=$(basename $primary_in | sed 's/.fa/.mt_alignments.paf/')

# Mitochondrial alignments
minimap2 \
        -t 10 \
        -x asm5 \
        $primary_in \
        $mitochondrial_genomee \
        > $primary_mt_out

sbatch run_minimap2mito.qsh

Use Meg's python script to pull out contigs with >90% coverage from the PAF alignments

Conda install python and pandas

conda -n python -c bioconda python=3.9
conda activate python
conda install pandas

nano run_find_scaffolds_by_paf_coverage.qsh

#!/bin/bash
#SBATCH --job-name=run_minimap2mito
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH -A ACF-UTK0032
#SBATCH --partition=condo-ut-genomics
#SBATCH --qos=genomics
#SBATCH --time=1:00:00
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]

eval "$(conda shell.bash hook)"
conda activate python

python3 find_scaffolds_by_paf_coverage.py dd_as111_assembly.hic.p_ctg.mt_alignments.paf > dd_as111_assembly.hic.p_ctg.mt_list.txt

Python script:

nano find_scaffolds_by_paf_coverage.py

import re, sys, getopt
import pandas as pd

##---------------------------------------------------------------------
##--- MAIN BODY -----
##---------------------------------------------------------------------
paf_file = str(sys.argv[1])
percent_cov = .90

paf_df = pd.read_csv(paf_file, sep='\t')
paf_df.columns = ['query','qlen','qstart','qend','strand','target','targetlen','targetstart','targetend','matches','alnlen','mqual','a1','a2','a3','a4','a5','a6']
paf_df_reduced = paf_df['query','qlen','alnlen'](/ShadeNiece/DisculaDestructiva_GenomeAssembly-Annotation/wiki/'query','qlen','alnlen')

#print(paf_df_reduced)

for name, group in paf_df_reduced.groupby('query'):
    qlen = group.iat[0,1]
    alnsum = group['alnlen'].sum()
    #print(name)
    #print(group)
    #print(qlen)
    #print(alnsum)
    #print('\n')
    if float(alnsum) > float(qlen * percent_cov):
        print(name+"\t"+str(qlen)+"\t"+str(alnsum))

Nothing was being pulled out, and I think it's because I have low coverage so the script isn't working how it should. I'm going to stop here with pulling the mitochondrial sequences and maybe try another reference too just to make sure. I'm moving onto Juicer for now.

9. Juicer: Hi-C contacts

I am going to be moving from ISAAC-NG to Centaur because Juicer is already installed there, and Juicer is not the easiest thing to install to begin with. Directory in Centaur:/pickett_centaur/project/lwy647/discula_genome_assembly/analysis/01_juicer

I'm going to be using the wrapper that Zane Smith created for Juicer and adapting it for use on Centaur. The script requires that your assemblies have the ".fasta" suffix and that your HiC reads have the "R1.fastq.gz" and "R2.fastq.gz" suffixes.

(don't forget to make a screen session)

nano run_juicer.sh

# This script assumes your assemblies are in fasta format and have the ".fasta" extension. If you are using the ".fa" extension, either rename your files
# or replace your file extensions in the scripts below.

##################
##### Set-Up #####
##################

##########################################################################################################################################
# Enter your paths **not including file names** in the empty parantheses below. Also, **do not** include a trailing slash.
# Example:
#path2fastas=(/pickett_flora/projects/ash_longread_genomes/hifi_assemblies/hifi_novogene_12.2023/analysis/08_remove_mitochondria)

# Enter path to fasta assemblies here.
path2fastas=(/pickett_centaur/project/lwy647/discula_genome_assembly/analysis/01_juicer)
# Enter path to raw hic reads here.
path2hic_reads=(/pickett_centaur/project/lwy647/discula_genome_assembly/raw_data/Hi-C/results/23ddas111_1327064/Hi-C)

###########################################################################################################################################

# Establish filepathing (no action required)
hap1=($path2fastas/*.p_ctg.fasta)
hic_R1=($path2hic_reads/*R1.fastq.gz)
hic_R2=($path2hic_reads/*R2.fastq.gz)

# Establish arguments for juicer proper
project_dir=($(pwd))
hap1_juicer=($(basename $hap1))
base_hap1_juicer=($(basename $hap1 | sed 's/.fasta//'))
base_hap2_juicer=($(basename $hap2 | sed 's/.fasta//'))

## Add juicer to path
export PATH=$PATH:/pickett_centaur/software/juicer-1.6/CPU/juicer.sh

# Load dependencies: BWA & Java
spack load bwa
spack load /wgx2dcw

# Create directory structure
mkdir references
mkdir restriction_sites
mkdir fastq

# Link Juicer scripts and executable
## The following have already been performed on Flora.
ln -s /pickett_centaur/software/juicer-1.6/CPU scripts
wget https://s3.amazonaws.com/hicfiles.tc4ga.com/public/juicer/juicer_tools_1.22.01.jar
mv juicer_tools_1.22.01.jar scripts/common/
ln -s juicer_tools_1.22.01.jar juicer_tools.jar

# Link files to their respective directories.
## Assmeblies go in restriction_sites directory
ln -s $hap1 references

## HiC reads go in the fastq directory.
ln -s $hic_R1 fastq
ln -s $hic_R2 fastq

# Index assemblies
if [ -e "$project_dir/references/$hap1_juicer.amb" ]
then
    	echo "Indexes found. Skipping to restriction site identification."
else
    	for fasta in references/*fasta
        do
          	bwa index $fasta
        done
fi

##################
#### Workflow ####
##################

# 01. Identify restriction sites
for fasta in references/*fasta
do
       base=$(basename $fasta | sed 's/.fasta//')
       echo "Executed command: python3 /pickett_flora/software/juicer-1.6/misc/generate_site_positions.py DpnII $base $fasta"
       python3 /pickett_centaur/software/juicer-1.6/misc/generate_site_positions.py DpnII restriction_sites/$base $fasta
done

# 02 Estimate chromosome sizes
for file in restriction_sites/*DpnII.txt
do
  	base=$(basename $file | sed 's/.txt//')
        awk 'BEGIN{OFS="\t"}{print $1, $NF}' $file > restriction_sites/$base.chrom.size
done

# 03. Juicer - Hap1
/pickett_centaur/software/juicer-1.6/CPU/juicer.sh \
    -g $base_hap1_juicer \
    -t 20 \
    -d $project_dir \
    -D $project_dir \
    -z references/$hap1_juicer \
    -p restriction_sites/$base_hap1_juicer\_DpnII.chrom.size \
    -y restriction_sites/$base_hap1_juicer\_DpnII.txt \
    >& $base_hap1_juicer.juicer.out && \
# Rename aligned directory in preparation for hap2
mv aligned hap1_aligned

# Run juicer clean-up scripts
./scripts/common/cleanup.sh

bash run_juicer.sh

10. 3D-DNA: 3D Structuring

Directory: /pickett_centaur/project/lwy647/discula_genome_assembly/analysis/02_3d-dna Documentation: https://github.com/aidenlab/3d-dna

Rename the merged_nodups.txt files from juicer to avoid overwriting them, then change back to the 3D-DNA directory. Renaming is more important if you have more than one haplotype (I don't), but I'm including this in case it's of use in the future.

cd /pickett_centaur/project/lwy647/discula_genome_assembly/analysis/01_juicer/aligned
mv merged_nodups.txt dd_as111_assembly.hic.p_ctg_merged_nodups.txt
cd /pickett_centaur/project/lwy647/discula_genome_assembly/analysis/02_3d-dna

Before running 3D-DNA, link the original assembly fasta and copy the merged_dup.txt files from juicer into the new directory.

ln -s /pickett_centaur/project/lwy647/discula_genome_assembly/analysis/01_juicer/dd_as111_assembly.hic.p_ctg.fasta .
cp /pickett_centaur/project/lwy647/discula_genome_assembly/analysis/01_juicer/aligned/*merged_nodups.txt .

Run 3D-DNA using Beant's parameters (don't forget to make a screen session)

nano run_3D-DNA.sh

#Hap1
hap1_fasta=dd_as111_assembly.hic.p_ctg.fasta
hap1_base=$(basename $hap1_fasta | sed 's/.fasta//')

# Load Dependencies - if required (not required with conda install)
## gnu parallel gcc8.4.1
spack load /acgsl7y
## Load java
spack load /wgx2dcw

# Run 3D-DNA - Hap1
bash /pickett_centaur/software/3d-dna/run-asm-pipeline.sh \
        --editor-repeat-coverage 8 \
        --editor-coarse-resolution 50000 \
        --editor-coarse-region 250000 \
        $hap1_fasta \
        ${hap1_base}_merged_nodups.txt \
        >& ${hap1_base}.3d_dna.out

run_3D-DNA.sh

Copy the final.hic and final.assembly files (from both haplotypes if applicable) to your local computer and visualize them in Juicebox.

The Jukebox desktop application can be downloaded here