Module 2 Lab 1: Sassafras Haplotype‐Resolved Genome Assembly - jacksonhturner/turner_epp531 GitHub Wiki

This documentation details the procedures within EPP 531 Module 2 Lab 1. This lab is executed in /pickett_sphinx/projects/EPP531_AGA/turner/ML2_Lab1.

Loading in data

Because Hi-C and Hifi data are stored on elsewhere on sphinx, I soft link it to my working directory as follows:

ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/PacBioHiFi/m84109_240206_204137_s2.hifi_reads.bc2017.bam .
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/PacBioHiFi/m84109_240206_204137_s2.hifi_reads.bc2017.bam.pbi .
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/PacBioHiFi/gbru.m84109_240206_204137_s2.hifi_reads.bc2017.bam.md5 .
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/Hi-C/results/salbidum01_1334140/Hi-C/salbidum01_1334141_S3HiC_R1.fastq.gz .
ln -s /pickett_sphinx/projects/EPP531_AGA/lyadav_EPPAGA/sassafras/raw_data/Hi-C/results/salbidum01_1334140/Hi-C/salbidum01_1334141_S3HiC_R2.fastq.gz .

QC

Before running QC, check md5sum numbers.

md5sum gbru.m84109_240206_204137_s2.hifi_reads.bc2017.bam.md5

Run LongQC with the following command.

apptainer exec -B $PWD /sphinx_local/images/longqc_latest.sif* longQC.py sampleqc -x pb-hifi -p 4 -o longqc_out/ m84109_240206_204137_s2.hifi_reads.bc2017.bam

I download the html file generated from LongQC to my personal computer. scp [email protected]:/pickett_sphinx/projects/EPP531_AGA/turner/ML2_Lab1/longqc/*.html .

Consider the following when interpreting LongQC results:

Adapter trimming is accomplished before data is made so adapters shouldn't be present.
The graph for read length is commonly represented as a normal distribution.
The majority of sequences should not be lower than 5k bp.
PacBio uses a sequence pooling method to calculate per read coverage since there it does not use Phred scores.
Variance and means should be equal under a Poisson distribution, so shouldn't be a second bell curve.
We see sequence dispersion in the per read coverage graph, this happens for three reasons:
1: Libraries are not made or loaded properly, size selection could be of poor quality.
2: When PCR was performed for PacBio, some sequences could be overrepresented.
3: The complexity of genome -- sometimes, repetitive elements of genome are present and there are lots of repeat elements.
However, two curves could be beneficial when there are multiple organisms present in a sample (e.g. metagenomics).
Depending on what is sequence, the sequence dispersion within a sample may differ.
Non-sense reads are very small reads. If several exist, then sequencing may not be successfully completed.
GC content: most eukaryotes are around 40% GC content.
Results should be sharp peak in gc content, represented through probability density.
Density can be cumulative: each bar is shown separately sums to 100%.
Probability density: bars pile up on one another to exhibit a sharp exponential peak in GC, which is desired.
If multiple peaks are exhibited, multiple organisms are likely present within a sample.
The flanking region analysis demonstrates another metric measuring adapter contamination. There shouldn't be many distances from 5' terminal that aren't 0.
Sequence complexity varies a lot depending on organism and mostly fluctuates between .2 to .4.
There are several of conserved regions that are difficult to differentiate: low sequence complexity, where sequences are highly similar.
If a sample contains a hybrid or heterozygous organism, the sequence complexity might be increased.

I convert the .bam file of hifi reads into fastq format with the following command:

samtools bam2fq m84109_240206_204137_s2.hifi_reads.bc2017.bam > sassafras_samtools_HiFI_reads.fq

I load fastqc with the following command:

spack load FastQC

To perform FastQC on the long reads, I execute the following:

mkdir Hi-C_fastqc
fastqc *fastq.gz -o Hi-C_fastqc

To view the .html files generated by FastQC, secure copy them to your personal PC with this command:

scp [email protected]:/pickett_sphinx/projects/EPP531_AGA/turner/ML2_Lab1/Hi-C_fastqc/*.html .

Consider the following when interpreting fastqc results:

Per tile sequence quality -- cells can predict sequences quality, depending on how many samples loaded.
Per sequence quality scores are another representation of base calling quality
If there's more than a 10% difference in per base sequence content, bad.
In RNA-Seq studies, sometimes AT-GC are 20% different.
Per sequence GC content can show contamination, presence of multiple organisms depending on number of peaks.
When a sequencing machine calls a sequence and isn't sure, it'll place N. Typically this doesn't happen often.
Sequence length distribution shows how long in bp sequences are, for Illumina they should be around 150
Sequence duplication levels: sometimes RNA will be chopped up in Hi-C data, don't want lots of duplications in Illumina data, we want duplications in RNA-Seq data.
Overrepresented sequences and adapter content are pretty self-explanatory.

R1 FastQC Results | R2 FastQC Results | LongQC Results

Downsampling Reads

To find number of reads for simulated coverage to use, take number of reads, divide by coverage, then multiply by desired coverage. The result is the number of reads required to simulate a provided coverage value. My assigned simulated coverage is 60x. There are 5450316 total reads and a known coverage of 103x. I find the number of reads by dividing the total number of reads by the actual coverage, and then multiplying the result by the desired coerage. To accomplish this, I will perform the following commands:

spack load /yfui77z
seqtk sample -s100 sassafras_samtools_HiFI_reads.fq 3174941 > sassafras_downsampled_reads.fqseqtk sample -s100  number_of_sequences output

Assembly

I will first use Hifi data to assemble the sassafras genome with the following command. I create a shell script and add the following inside and then run the command.

/sphinx_local/software/hifiasm/hifiasm \
-o Sassafras_V1.0_no_Hi-C \
-t 4 \
--hg-size 800m \
sassafras_samtools_HiFI_reads.fq

I will then use Hifi (PacBio) data in conjunction with Hi-C (Illumina) data to assemble the sassafras genome with the following command using the downsampled reads. This will be written and executed through a shell script per the prior command.

/sphinx_local/software/hifiasm/hifiasm \
-o Sassafras_V1.0_with_Hi-C-60X \
-t 4 \
--hg-size 800m \
--h1 salbidum01_1334141_S3HiC_R1.fastq.gz \
--h2 salbidum01_1334141_S3HiC_R2.fastq.gz \
sassafras_downsampled_reads.fq