Genome Assembly - Lavadav/EPP531_AGA GitHub Wiki

Data For Class

Cherokee Rose Subset Dataset

Step 1: Make the folders for data analysis

mkdir Raw_Data Analysis Results

Step 2: Copy the subset dataset into your Raw_Data folder (run the command from inside that folder).

cp /work/pbgg8900/instructor_data/Genome_Assembly_Data/Pacbio_Data/subset_SRR29286022.fastq .

Step 3: Soft link the dataset into your working directory (replace path_to_raw_Data with the path to your Raw_Data folder).

ln -s path_to_raw_Data/subset_SRR29286022.fastq .
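
The folder setup, copy, and soft link steps can be tried safely on a toy file first. The sketch below is only an illustration: the demo_project folder and toy_reads.fastq file are hypothetical stand-ins, not part of the class data.

```shell
# Work inside a scratch folder so no real data is touched (hypothetical name).
mkdir -p demo_project && cd demo_project

# mkdir takes names separated by spaces; -p makes the command safe to rerun.
mkdir -p Raw_Data Analysis Results

# Stand-in for the real subset file: a single toy FASTQ record.
printf '@read1\nACGT\n+\nIIII\n' > Raw_Data/toy_reads.fastq

# Link by absolute path so the link does not depend on where it sits.
ln -sf "$PWD/Raw_Data/toy_reads.fastq" Analysis/toy_reads.fastq

# The listing should show the link pointing back into Raw_Data.
ls -l Analysis
cd ..
```

Linking instead of copying keeps a single copy of the raw reads on disk while letting each analysis folder refer to them by a short name.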

Step 4: Run hifiasm with and without Hi-C data.

Without Hi-C Data

ml hifiasm/0.25.0
hifiasm -o Hifiasm_output --hg-size 50m subset_SRR29286022.fastq

With Hi-C Data

ml hifiasm/0.25.0
hifiasm -o Hifiasm_output_Hi-C --hg-size 50m --h1 subset_HiC_R1.fastq.gz --h2 subset_HiC_R2.fastq.gz subset_SRR29286022.fastq
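
Before launching the Hi-C run, it can help to confirm that the R1 and R2 files hold the same number of reads, since paired files of different length usually point to a truncated copy. The following is a minimal sketch, assuming standard four-line FASTQ records; the count_reads helper and the toy files are hypothetical and only illustrate the check.

```shell
# Count reads in a FASTQ file, gzipped or plain.
# Assumption: every record is exactly four lines.
count_reads() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | awk 'END { print NR / 4 }'
}

# Toy paired files standing in for the real Hi-C reads (two records each).
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' | gzip -c > toy_R1.fastq.gz
printf '@r1/2\nCCGT\n+\nIIII\n@r2/2\nGGAT\n+\nIIII\n' | gzip -c > toy_R2.fastq.gz

r1=$(count_reads toy_R1.fastq.gz)
r2=$(count_reads toy_R2.fastq.gz)

if [ "$r1" = "$r2" ]; then
  echo "Read counts match: $r1"
else
  echo "Read counts differ: R1=$r1 R2=$r2" >&2
fi
```

For the class data, the same function could be pointed at subset_HiC_R1.fastq.gz and subset_HiC_R2.fastq.gz.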

Step 5: Convert the .gfa file to a FASTA file.

awk '/^S/{print ">"$2;print $3}' Hifiasm_output.bp.p_ctg.gfa > Hifiasm_output.bp.p_ctg.fasta
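
To see what the awk command does, it can be run on a small hand-made GFA. In the sketch below, toy.gfa is a hypothetical file with two segment (S) lines and one link (L) line; only the S lines carry sequence, so only they end up in the FASTA.

```shell
# Toy GFA: tab-separated, two segments and one link between them.
printf 'S\tctg1\tACGTACGT\nS\tctg2\tGGCC\nL\tctg1\t+\tctg2\t+\t0M\n' > toy.gfa

# Same conversion as above: for each S line, print the segment name
# as a FASTA header, then the sequence on the next line.
awk '/^S/{print ">"$2; print $3}' toy.gfa > toy.fasta

cat toy.fasta

# Number of contigs written; this should equal the number of S lines.
grep -c '^>' toy.fasta
```

Comparing the contig count with `grep -c '^S'` on the GFA is a quick way to check that the conversion did not drop anything.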

Step 6: Assess the assembly statistics.

ml BBMap/39.19-GCC-13.3.0
stats.sh Hifiasm_output.bp.p_ctg.fasta > Hifiasm_output.bp.p_ctg.stats.txt
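
As a cross-check on the contig length statistics in the stats file, the N50 can also be computed by hand: sort contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The n50 function and toy_contigs.fasta below are hypothetical and are only a sketch of that definition, not a replacement for stats.sh.

```shell
# Print the N50 contig length of a FASTA file.
n50() {
  # First awk: emit one length per sequence (handles multi-line sequences).
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1" |
  sort -rn |
  # Second awk: walk lengths from longest to shortest until the running
  # total reaches half of the summed length.
  awk '{ lens[NR] = $1; total += $1 }
       END {
         half = total / 2
         for (i = 1; i <= NR; i++) {
           run += lens[i]
           if (run >= half) { print lens[i]; exit }
         }
       }'
}

# Toy assembly with contig lengths 8, 6, 4 and 2 (total 20, half 10).
printf '>a\nAAAAAAAA\n>b\nCCCCCC\n>c\nGGGG\n>d\nTT\n' > toy_contigs.fasta

# 8 alone is below 10; 8 + 6 = 14 reaches it, so the N50 is 6.
n50 toy_contigs.fasta
```

On the real assembly the call would be `n50 Hifiasm_output.bp.p_ctg.fasta`.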

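Homework: Repeat Steps 5 and 6 using the assembly produced with Hi-C data. Note that with Hi-C input, hifiasm names its outputs with a .hic infix (for example, Hifiasm_output_Hi-C.hic.p_ctg.gfa), so adjust the file names accordingly.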