Module_2_Lab_1 - heelsplitter/Grootmyers_EPP_531_Applied_Genome_Analytics GitHub Wiki

Symbolically link data to current directory

HiFi Data

cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
mkdir 01_data
cd 01_data
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/PacBioHiFi/m84109_240206_204137_s2.hifi_reads.bc2017.bam .
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/PacBioHiFi/m84109_240206_204137_s2.hifi_reads.bc2017.bam.pbi .
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/PacBioHiFi/gbru.m84109_240206_204137_s2.hifi_reads.bc2017.bam.md5 .

Hi-C Data

ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/Hi-C/results/salbidum01_1334140/Hi-C/salbidum01_1334141_S3HiC_R1.fastq.gz .
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/Hi-C/results/salbidum01_1334140/Hi-C/salbidum01_1334141_S3HiC_R2.fastq.gz .
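A symlink is created even when its target path is mistyped, so it can help to confirm that every link resolves before moving on. The following is a minimal sketch; `check_links` is a hypothetical helper name, not part of the original lab.

```shell
# Report symlinks in a directory whose targets do not exist.
# Prints one line per dangling link and returns 1 if any were found.
check_links() {
    dir="${1:-.}"
    found=0
    for f in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target does not resolve
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            echo "dangling: $f"
            found=1
        fi
    done
    return "$found"
}

# Example:
# check_links /pickett_sphinx/projects/EPP531_AGA/dgrootmy/01_data
```

No output and a zero exit status mean every link points at an existing file.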

Verify the checksum. The .md5 file is named after the BAM, so hash the BAM itself (not the .pbi index); the two hashes must match.

md5sum m84109_240206_204137_s2.hifi_reads.bc2017.bam
cat gbru.m84109_240206_204137_s2.hifi_reads.bc2017.bam.md5
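Comparing two 32-character hashes by eye is error-prone, so the comparison can be scripted. This sketch assumes that the first whitespace-separated field of the .md5 file is the hash and that it belongs to the BAM, as the file name suggests; `verify_md5` is a hypothetical helper name.

```shell
# Compare a file's MD5 hash with the value stored in a checksum file.
# Assumption: the first field on the first line of the checksum file is the hash.
verify_md5() {
    data_file="$1"
    md5_file="$2"
    expected=$(awk 'NR == 1 { print $1 }' "$md5_file")
    actual=$(md5sum "$data_file" | awk '{ print $1 }')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $data_file"
    else
        echo "MISMATCH: $data_file (expected $expected, got $actual)"
        return 1
    fi
}

# Example:
# verify_md5 m84109_240206_204137_s2.hifi_reads.bc2017.bam \
#            gbru.m84109_240206_204137_s2.hifi_reads.bc2017.bam.md5
```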

Sassafras Genome Assembly Pipeline

Step 1: QC of HiFi Data

Open ~/.bashrc in an editor and add the export line below to it:

nano ~/.bashrc
export PATH=$PATH:/pickett_shared/software/apptainer_unprivileged/bin/

Apply the change:

source ~/.bashrc
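If the export line is added by hand more than once, PATH picks up duplicate entries. As an alternative to editing the file in nano, the line can be appended only when it is missing. This is a minimal sketch; `append_once` is a hypothetical helper name.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (single quotes keep $PATH unexpanded in the file):
# append_once 'export PATH=$PATH:/pickett_shared/software/apptainer_unprivileged/bin/' ~/.bashrc
```

After sourcing ~/.bashrc, `command -v apptainer` should print the path of the binary if the directory was added correctly.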

Test LongQC

apptainer exec -B $PWD /sphinx_local/images/longqc_latest.sif* longQC.py --version

Run LongQC

screen -S LongQc   # start a named session; detach with Ctrl-a d
screen -r LongQc   # reattach later (not needed immediately after -S)
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy/01_data
source ~/.bashrc
spack load apptainer
spack load squashfuse
apptainer exec -B "$PWD,/pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/PacBioHiFi" /sphinx_local/images/longqc_latest.sif* longQC.py sampleqc -x pb-hifi -p 4 -o longqc_out/ m84109_240206_204137_s2.hifi_reads.bc2017.bam

LongQC result

Step 2: Convert BAM to FASTQ

export SPACK_ROOT=/pickett_shared/spack
PATH=$PATH:$HOME/bin:$SPACK_ROOT/bin
. $SPACK_ROOT/share/spack/setup-env.sh
spack list bedtools
spack load bedtools2
bedtools bamtofastq -i m84109_240206_204137_s2.hifi_reads.bc2017.bam -fq sassafras_bedtools_HiFI_reads.fq
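A quick sanity check after the conversion is to count the records in the FASTQ and compare that with the read count reported by the QC step. The sketch below assumes an uncompressed FASTQ with exactly four lines per record; `count_fastq_reads` is a hypothetical helper name.

```shell
# Count records in an uncompressed FASTQ file, assuming 4 lines per record.
# A non-integer result indicates a truncated or malformed file.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Example:
# count_fastq_reads sassafras_bedtools_HiFI_reads.fq
```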

HiFi FASTQ Data (shared copy; this is the file used in Steps 4 to 6)

ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/PacBioHiFi/sassafras_samtools_HiFI_reads.fq .

Step 3: QC of Hi-C Data

Load FastQC

spack load fastqc

Run FastQC on Hi-C Files

mkdir Hi-C_fastqc
fastqc *fastq.gz -o Hi-C_fastqc

FastQC R1 Result
FastQC R2 Result

Step 4: Downsample HiFi Reads

cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy/01_data
source ~/.bashrc
spack load seqtk
spack load /yfui77z
seqtk sample -s100 sassafras_samtools_HiFI_reads.fq 1058313 > sassafras_samtools_HiFI_reads_20x.fq
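The source does not say how the read count 1058313 was chosen. One common approach, sketched here as an assumption, is reads = genome size × target coverage / mean read length. With the 800 Mb genome size passed to hifiasm and a 20x target, 1058313 reads would correspond to a mean read length of roughly 15.1 kb. Both function names are hypothetical helpers.

```shell
# Mean read length of an uncompressed 4-line-per-record FASTQ.
mean_read_length() {
    awk 'NR % 4 == 2 { total += length($0); n++ }
         END { if (n) print total / n }' "$1"
}

# Number of reads needed for a target coverage (result is truncated to an integer).
# Arguments: genome size in bp, coverage (e.g. 20), mean read length in bp.
reads_for_coverage() {
    awk -v g="$1" -v c="$2" -v l="$3" \
        'BEGIN { printf "%d\n", (g * c) / l }'
}

# Example:
# len=$(mean_read_length sassafras_samtools_HiFI_reads.fq)
# reads_for_coverage 800000000 20 "$len"
```

The fixed seed (`-s100`) in the seqtk command makes the subsample reproducible, so rerunning it with the same count selects the same reads.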

Step 5: Genome Assembly with HiFi Data (No Hi-C)

This first run used the full (not downsampled) read set and was killed before completion.

screen -S NoHiC   # start a named session; detach with Ctrl-a d
screen -r NoHiC   # reattach later (not needed immediately after -S)
cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy/01_data
source ~/.bashrc
/sphinx_local/software/hifiasm/hifiasm \
-o Sassafras_V1.0_no_Hi-C \
-t 4 \
--hg-size 800m \
sassafras_samtools_HiFI_reads.fq

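Downsampled to 20x. This reuses the output prefix of the killed run; hifiasm reloads existing .bin files for a prefix, so remove any leftover Sassafras_V1.0_no_Hi-C*.bin files first: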

screen -S NoHiC_down
/sphinx_local/software/hifiasm/hifiasm \
-o Sassafras_V1.0_no_Hi-C \
-t 4 \
--hg-size 800m \
sassafras_samtools_HiFI_reads_20x.fq
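hifiasm writes its assemblies as GFA graphs rather than FASTA. Contig sequences can be extracted from the segment (S) lines with a short awk command. This is a minimal sketch; `gfa_to_fasta` is a hypothetical helper name, and the input file name in the example is an assumption based on recent hifiasm naming that may differ between versions.

```shell
# Convert GFA segment lines (S <name> <sequence> ...) to FASTA records.
gfa_to_fasta() {
    awk '/^S/ { print ">" $2; print $3 }' "$1"
}

# Example (file name assumed, check the actual output of your run):
# gfa_to_fasta Sassafras_V1.0_no_Hi-C.bp.p_ctg.gfa > Sassafras_V1.0_no_Hi-C.p_ctg.fa
```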

Step 6: Genome Assembly with HiFi + Hi-C Data

screen -S HiC_down
/sphinx_local/software/hifiasm/hifiasm \
-o Sassafras_V1.0_with_Hi-C \
-t 4 \
--hg-size 800m \
--h1 salbidum01_1334141_S3HiC_R1.fastq.gz \
--h2 salbidum01_1334141_S3HiC_R2.fastq.gz \
sassafras_samtools_HiFI_reads_20x.fq
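Because the assembly runs for a long time inside a screen session, it can be worth confirming that all three inputs exist and are non-empty before launching it. This is a minimal sketch; `require_files` is a hypothetical helper name.

```shell
# Return non-zero if any listed file is missing or empty.
require_files() {
    missing=0
    for f in "$@"; do
        # -s: file exists and has a size greater than zero (symlinks are followed)
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# require_files salbidum01_1334141_S3HiC_R1.fastq.gz \
#               salbidum01_1334141_S3HiC_R2.fastq.gz \
#               sassafras_samtools_HiFI_reads_20x.fq && echo "inputs OK"
```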

Install Conda

First, download the Miniconda installer (a bash script) from repo.anaconda.com. The commands below save it in the project directory rather than the home directory -

cd /pickett_sphinx/projects/EPP531_AGA/dgrootmy
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh

NOTE - The link above pins Miniconda3 py39_4.12.0. Check the download page for the latest installer link before running wget.

Then, run the script -

bash Miniconda3-py39_4.12.0-Linux-x86_64.sh

Conda will be installed. You will have to log out and then log back in. Then, in order to correctly use Bioconda, run these commands (you have to run these just once) -

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

You are all set.