Module 2 Lab 2: Troubleshooting Hifiasm Assembly - jacksonhturner/epp_531 GitHub Wiki

This lab describes the processes necessary for the troubleshooting and refining of the sassafras genome assembly using results from the previous lab.

Convert Hifiasm Results to FASTA format

Converting the hifiasm results to FASTA format allows us to evaluate the quality of each haplotype assembly. hifiasm writes its contigs in GFA format, so the sequence (S) lines must first be extracted and converted with the following script:

awk '/^S/{print ">"$2;print $3}' \
Sassafras_V1.0_with_Hi-C_30X.hic.hap1.p_ctg.gfa \
> Sassafras_V1.0_with_Hi-C_30X.hic.hap1.p_ctg.fasta

awk '/^S/{print ">"$2;print $3}' \
Sassafras_V1.0_with_Hi-C_30X.hic.hap2.p_ctg.gfa \
> Sassafras_V1.0_with_Hi-C_30X.hic.hap2.p_ctg.fasta
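Before moving on, it can be worth confirming that every segment in the GFA file produced a FASTA record. The sketch below is one possible sanity check, not part of the original lab: `check_gfa_fasta_counts` is a hypothetical helper name, and it assumes each contig occupies exactly one `S` line in the GFA file, as the awk conversion above does.

```shell
# Hypothetical helper (not part of hifiasm or BBMap): compare the number of
# GFA segment (S) lines with the number of FASTA headers written by awk.
check_gfa_fasta_counts() {
    gfa="$1"
    fasta="$2"
    # Count lines beginning with "S" in the GFA file.
    n_segments=$(grep -c '^S' "$gfa")
    # Count header lines beginning with ">" in the FASTA file.
    n_headers=$(grep -c '^>' "$fasta")
    echo "segments=${n_segments} headers=${n_headers}"
    # Return success only when the two counts agree.
    [ "$n_segments" -eq "$n_headers" ]
}
```

For example, `check_gfa_fasta_counts Sassafras_V1.0_with_Hi-C_30X.hic.hap1.p_ctg.gfa Sassafras_V1.0_with_Hi-C_30X.hic.hap1.p_ctg.fasta` prints both counts and returns a non-zero exit status if they differ.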

Acquiring and Interpreting Assembly Statistics

BBMap will allow us to get statistics for each haplotype assembly. The following script takes the assemblies in FASTA format as input and writes a text file populated with the metrics used to evaluate our assemblies.

/sphinx_local/software/bbmap/stats.sh -Xmx10g \
in=Sassafras_V1.0_with_Hi-C_30X.hic.hap1.p_ctg.fasta \
> Sassafras_V1.0_with_Hi-C_30X.hic.hap1.p_ctg.stats.txt

/sphinx_local/software/bbmap/stats.sh -Xmx10g \
in=Sassafras_V1.0_with_Hi-C_30X.hic.hap2.p_ctg.fasta \
> Sassafras_V1.0_with_Hi-C_30X.hic.hap2.p_ctg.stats.txt

These stats can be accessed by opening the file with a text editor such as nano. The image below displays the class spreadsheet with the statistics of the assemblies created from Module 2 Lab 1:

A complete genome assembly should consist of a small number of long contigs rather than a large number of short ones. N50 and L50 are therefore popular assembly statistics: N50 is the length of the shortest contig among the longest contigs that together cover at least half of the assembly, and L50 is the number of contigs in that set, so a high N50 and a low L50 are preferred. Haplotype assemblies should also be balanced, with similar numbers of contigs of similar lengths. The variation in scaffold number and N50/L50 values between the two assemblies suggests that these haplotype assemblies are imbalanced. Comparing these statistics with those of assemblies created with different parameters indicates that the parameters selected for this project are suboptimal.
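To make the N50 and L50 definitions concrete, the sketch below computes both values directly from a FASTA file with awk and sort. It is an illustration under the conventional definitions (N50 as a length, L50 as a contig count), not a replacement for the BBMap output. `n50_l50` is a hypothetical helper name. Some tools label these two values the other way around, so check the convention of whichever program produced your statistics before comparing numbers.

```shell
# Hypothetical helper: compute N50 and L50 for the sequences in a FASTA file.
n50_l50() {
    fasta="$1"
    # Step 1: print one length per sequence (multi-line records are summed).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$fasta" |
    # Step 2: sort the lengths from longest to shortest.
    sort -rn |
    # Step 3: walk down the sorted list until the running sum reaches
    # at least half of the total assembly length.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             running = 0
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) {
                     # lens[i] is the N50 length; i is the L50 contig count.
                     printf "N50=%d L50=%d\n", lens[i], i
                     exit
                 }
             }
         }'
}
```

Running `n50_l50` on each haplotype FASTA gives a quick way to see whether the two assemblies are balanced: similar N50 and L50 values for hap1 and hap2 are what a balanced pair would be expected to show.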

Removing Mitochondrial/Plastome DNA

Next, chloroplast and mitochondrial sequences must be removed from the assemblies. Download the camphor plant mitochondrion and sassafras plastome, then copy them to the analysis directory on the server:

scp C_camphor_plastome.fasta [email protected]:/pickett_sphinx/projects/EPP531_AGA/turner/M2Lab2/analysis
scp S_albidum_plastome.fasta [email protected]:/pickett_sphinx/projects/EPP531_AGA/turner/M2Lab2/analysis