4 Scaffolding - coopermkr/sdepressaAssembly GitHub Wiki
To scaffold our contaminant-free contigs, we need to bring the Hi-C data and our assembly .gfa file back into the mix. Because we have access to very long Nanopore reads and some high-quality PacBio data, we have a real shot at assembling chromosome-scale scaffolds. We also need a program that can help correct some of the more severe misjoins in our tetraploid unitigs. Enter HapHiC.
To install, follow the instructions on the HapHiC GitHub page; it is far simpler than what we endured with blobplot: https://github.com/zengxiaofei/HapHiC
We need to provide HapHiC with a BAM alignment of the paired-end Hi-C reads against our contigs.
# Align Hi-C reads against unitigs
bwa index 2.hifiasm/tetra.utg.fasta
bwa mem -t 30 -5SP 2.hifiasm/tetra.utg.fasta data/hic_R1.fastq data/hic_R2.fastq | ../samblaster/samblaster | samtools view - -@ 30 -S -h -b -F 3340 -o contighic.bam
# Filter alignments: keep mapping quality >= 1 (MAPQ 1) and edit distance < 3 (NM 3)
filter_bam contighic.bam 1 --nm 3 --threads 30 | samtools view - -b -@ 30 -o filtered.contighic.bam
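The `-F 3340` in the samtools step above is a bitmask of SAM flags to exclude. As a small illustration, here is a sketch of a flag decoder in plain shell; the function name `decode_sam_flag` and the short bit labels are our own, chosen for this example, and are not part of samtools or HapHiC.

```shell
# Decode a SAM FLAG integer into the bits it sets.
# decode_sam_flag is a hypothetical helper written for this page.
decode_sam_flag() {
    flag=$1
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:read1 128:read2 \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}    # numeric value before the colon
        name=${pair#*:}    # label after the colon
        # Print the bit if it is set in the flag
        if [ $(( flag & bit )) -ne 0 ]; then
            echo "$bit $name"
        fi
    done
}

decode_sam_flag 3340
# 4 unmapped
# 8 mate_unmapped
# 256 secondary
# 1024 duplicate
# 2048 supplementary
```

So the filter drops reads that are unmapped, have an unmapped mate, are secondary or supplementary alignments, or were marked as duplicates by samblaster. If samtools is installed, `samtools flags 3340` should report the same set.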
Now we can feed our contigs, Hi-C alignment, and .gfa file into HapHiC for scaffolding. We must provide the number of chromosomes we are trying to create. In our case, we want to create 18 chromosomes to represent both of the historical lineages of the tetraploid.
# Use gfa coverage depth to filter out collapsed contigs
# No need to set the restriction enzyme site (--RE): the default, GATC, is the DpnII site
haphic pipeline 2.hifiasm/tetra.utg.fasta 4.scaffolding/filtered.contighic.bam 18 --gfa 2.hifiasm/tetra.hic.p_utg.gfa
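Note that the alignment commands above write `filtered.contighic.bam` to the working directory, while the pipeline command reads it from `4.scaffolding/`, so it is worth confirming that each input is where the command expects it before launching a long run. A minimal sketch of such a check follows; `check_inputs` is a hypothetical helper of our own, not a HapHiC utility.

```shell
# Return non-zero if any listed file is missing or empty.
# check_inputs is a hypothetical helper written for this page.
check_inputs() {
    status=0
    for f in "$@"; do
        # -s is true only when the file exists and has a size greater than zero
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}

# Usage, with the paths from the haphic command above:
# check_inputs 2.hifiasm/tetra.utg.fasta \
#              4.scaffolding/filtered.contighic.bam \
#              2.hifiasm/tetra.hic.p_utg.gfa &&
#     haphic pipeline 2.hifiasm/tetra.utg.fasta 4.scaffolding/filtered.contighic.bam 18 --gfa 2.hifiasm/tetra.hic.p_utg.gfa
```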
We can run stats on our scaffolds again using the assemblathon code from the 2 hifiasm section.
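For a quick look before running the full assemblathon script, the sketch below computes sequence count, total length, and N50 from a FASTA file with awk and sort. It is a stand-in we wrote for illustration, not the assemblathon code, and the function name `fasta_n50` is our own. The scaffold path in the usage comment assumes HapHiC's default output layout, so adjust it if yours differs.

```shell
# Print sequence count, total length, and N50 for a FASTA file.
# fasta_n50 is a hypothetical helper written for this page.
fasta_n50() {
    # Step 1: emit one length per sequence (sequences may span several lines)
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort lengths from longest to shortest
    sort -rn |
    # Step 3: walk down the list until half of the total length is covered
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) {
                     printf "n=%d total=%d N50=%d\n", NR, total, lens[i]
                     exit
                 }
             }
         }'
}

# Usage (path is an assumption about where HapHiC writes its scaffolds):
# fasta_n50 04.build/scaffolds.fa
```

With 18 target chromosomes, a scaffold N50 close to the expected chromosome size is an encouraging sign, while a value near the unitig N50 suggests little was actually joined.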