4_Scaffolding - bennestor/hakea_genome GitHub Wiki
I wasn’t able to get Hi-C data for H. prostrata due to its high molecular weight. Instead, I used RagTag to scaffold the contigs against the chromosome sequences of Telopea speciosissima, a recently assembled member of the Proteaceae.
RagTag to scaffold H. prostrata contigs based on T. speciosissima
#Get sequence lengths
cat ragtag.scaffold.fasta | bioawk -c fastx '{ print $name, length($seq) }' | sort -k 2,2nr | head -n 20
ragtag.scaffold.fasta:
num_seqs | sum_len | min_len | avg_len | max_len | | GC (%) |
---|---|---|---|---|---|---|
334 | 712,397,851 | 485 | 2,132,927.7 | 80,311,853 | 65,037,154 | 38.81 |
Scaffolding statistics:
placed_sequences | placed_bp | unplaced_sequences | unplaced_bp |
---|---|---|---|
873 | 665,843,562 | 312 | 46,469,189 |
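The two tables can be cross-checked with a little arithmetic. The sketch below assumes that RagTag joined the placed contigs with its default 100 bp gaps (the `gap_bp` value is an assumption, not something recorded on this page). Under that assumption the placed and unplaced bases plus the gap padding add up exactly to the `sum_len` reported for ragtag.scaffold.fasta.

```shell
#!/usr/bin/env bash
# Values copied from the two tables above.
num_seqs=334
sum_len=712397851
placed_sequences=873
placed_bp=665843562
unplaced_sequences=312
unplaced_bp=46469189
gap_bp=100   # assumption: RagTag's default gap size

# Scaffolds = all output sequences minus the unplaced contigs.
scaffolds=$(( num_seqs - unplaced_sequences ))
# Each scaffold built from k contigs contains k-1 gaps.
gaps=$(( placed_sequences - scaffolds ))
expected=$(( placed_bp + unplaced_bp + gaps * gap_bp ))

echo "scaffolds=$scaffolds gaps=$gaps expected_sum_len=$expected reported_sum_len=$sum_len"
```

This gives 22 scaffolds and 851 gaps, and the expected total matches the reported 712,397,851 bp.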
This analysis was done using the genome_v1 sequence of Hakea prostrata
Conda installations
#minimap2 (v2.22-r1101)
#LASTZ v1.02
Rename chromosomes in Hakea and Telopea
#Rename telopea chromosomes to chr1-11
cat telopea_genome.fa | seqkit replace -p '(CM.+)' -r '{kv}' -k alias.txt -U > telopea_chr.fa
#Rename hakea chromosomes (ragtag.scaffold.fasta renamed to hakea_ragtag_v2.fa)
cat hakea_ragtag_v2.fa | seqkit replace -p '(CM.+)' -r '{kv}' -k ../alias.txt -U > hakea_chr_v2.fa
Extract pseudomolecules of Hakea and Telopea
#extract telopea chromosomes
samtools faidx telopea_chr.fa chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 > telopea_justchr.fa
Split into separate fasta files
#Split Telopea pseudomolecules
seqkit split telopea_justchr.fa -i
#rename each fasta file to chr*.fa
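The renaming step is not shown above, so here is a minimal sketch of one way to do it. It assumes the split files carry the sequence ID after the last underscore in the file name (for example `telopea_justchr.part_chr1.fa`; the exact prefix depends on the seqkit version). The function name `rename_split_files` and the scratch directory are hypothetical and only there for illustration.

```shell
#!/usr/bin/env bash
# Rename seqkit split output such as telopea_justchr.part_chr1.fa to chr1.fa.
# Assumption: the sequence ID follows the last underscore in the file name.
rename_split_files() {
    local dir=$1 f base
    for f in "$dir"/*_chr*.fa; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        base=${f##*/}                    # strip the directory part
        mv -- "$f" "$dir/${base##*_}"    # keep only the text after the last "_"
    done
}

# Demo on a scratch directory with empty placeholder files.
demo=$(mktemp -d)
touch "$demo/telopea_justchr.part_chr1.fa" "$demo/telopea_justchr.part_chr11.fa"
rename_split_files "$demo"
ls "$demo"
```

On real data the function would be pointed at the directory that `seqkit split` wrote, in place of the scratch directory.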
Run LASTZ with pairwise chromosome comparisons
#Optional multiple arguments:
../ncbi_genomes/telopea_justchr_split/chr1.fa ../1_ragtag_out/hakea_justchr_v2_split/chr1.fa > chr1.lav
…
../ncbi_genomes/telopea_justchr_split/chr11.fa ../1_ragtag_out/hakea_justchr_v2_split/chr11.fa > chr11.lav
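The eleven pairwise comparisons can also be generated with a loop. The following is only a sketch: the paths are taken from the listing above, but the `lastz` invocation and the option set in `lastz_opts` are assumptions (the exact command line used is not recorded here), and `lastz_commands` is a hypothetical helper name. It prints the commands so they can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# Print one LASTZ command per chromosome pair (chr1..chr11).
# Paths follow the listing above; the option set is an assumption.
telopea_dir=../ncbi_genomes/telopea_justchr_split
hakea_dir=../1_ragtag_out/hakea_justchr_v2_split
lastz_opts="--chain --format=lav"

lastz_commands() {
    local i
    for i in $(seq 1 11); do
        echo "lastz $telopea_dir/chr$i.fa $hakea_dir/chr$i.fa $lastz_opts > chr$i.lav"
    done
}

lastz_commands            # review the commands first
# lastz_commands | bash   # then run them, assuming lastz is on PATH
```

Changing `lastz_opts` (for example adding `--identity=90` or dropping `--chain`) gives the variants compared in the notes on the Laj plots.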
Visualise alignments using Laj plots
I downloaded Laj to my local computer. Run instructions are at https://globin.bx.psu.edu/dist/laj/
#Running laj on linux
java -Xmx10G -jar /home/ben/Desktop/IRDS/Genome/lav/laj/laj.jar
Running LASTZ with --chain produces a strange X-shaped plot. Adding --identity=90 removes it, but also removes some segments that bridge gaps in the alignments, which makes insertions appear. Without --chain the plot is very messy, and --identity=90 is needed to see any of the main alignments.