6A. Long and short read Genome Assembly - bioinfokushwaha/Livestock_Genomics GitHub Wiki

Long Read genome Assembly

Login to server

ssh -X [email protected]
Password 123456
cd /home/nanobioinfo22/NGSWorkshop_2024/your_name

mkdir Long_assembly
cd Long_assembly
cp ../../Day3/mouse_demodata.fastq ./
conda activate assembly1

Quality control:

  • The input reads are first cleaned with Porechop, which finds and removes adapters from Oxford Nanopore reads.
porechop -i mouse_demodata.fastq -o mouse_demodata_trim.fastq -t 5 --extra_end_trim 0 --extra_middle_trim_good_side 0 --extra_middle_trim_bad_side 0
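A quick sanity check after trimming is to compare the number of reads before and after. The sketch below is not part of the original workshop steps: `count_reads` is a hypothetical helper that assumes plain four-line FASTQ records. The counts need not match exactly, since Porechop may split reads that contain internal adapters.

```shell
# Count the reads in a FASTQ file (assumes exactly four lines per record).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Compare read counts before and after trimming, if both files are present.
if [ -f mouse_demodata.fastq ] && [ -f mouse_demodata_trim.fastq ]; then
    echo "raw reads:     $(count_reads mouse_demodata.fastq)"
    echo "trimmed reads: $(count_reads mouse_demodata_trim.fastq)"
fi
```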

Genome Assembly:

  • Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly.
flye --threads 2 --nano-raw mouse_demodata_trim.fastq --genome-size 2g --out-dir mouse_assembly
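Long-read assemblers depend on sufficient sequencing depth, so it can help to estimate coverage before interpreting the result. The following is a minimal sketch, assuming four-line FASTQ records; `estimate_coverage` is a hypothetical helper, and the genome size passed to it simply mirrors the `2g` value from the flye command above.

```shell
# Estimate sequencing depth: total bases in a FASTQ divided by genome size (bp).
# Assumes four-line FASTQ records, with the sequence on the 2nd line of each.
estimate_coverage() {
    awk -v g="$2" 'NR % 4 == 2 { bases += length($0) }
                   END { printf "%.2f\n", bases / g }' "$1"
}

if [ -f mouse_demodata_trim.fastq ]; then
    # 2000000000 corresponds to the --genome-size 2g value used for flye.
    estimate_coverage mouse_demodata_trim.fastq 2000000000
fi
```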
  • The assembly.fasta generated by the above step in the mouse_assembly directory is then passed to TGS-GapCloser, which closes the gaps that arise during scaffolding and so improves the overall quality. TGS-GapCloser is a gap-closing tool that uses error-prone long reads from third-generation sequencing technologies (PacBio, Oxford Nanopore, etc.) or preassembled contigs to fill N-gaps in a genome assembly.
tgsgapcloser --scaff mouse_assembly/assembly.fasta --reads mouse_demodata_trim.fastq --output tgs_gapcloser_muslong --racon /home/nanobioinfo22/.conda/envs/assembly1/bin/racon
conda deactivate
  • Several summary statistics and in-silico validations are used to describe the completeness and contiguity of a genome assembly. QUAST (Quality Assessment Tool for Genome Assemblies) is one such tool for genome assembly evaluation.
quast.py tgs_gapcloser_muslong.scaff_seqs -t 2 -o tgs_gapcloser_quast
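Among the contiguity statistics QUAST reports, N50 is the most commonly quoted: the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size. The sketch below shows how that number is derived. It is only an illustration, not a replacement for QUAST, and `n50` and `fasta_lengths` are hypothetical helper names.

```shell
# Print the N50 of sequence lengths read from standard input (one per line).
n50() {
    sort -rn | awk '{ len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running >= total / 2) { print len[i]; exit }
            }
        }'
}

# Print the length of each sequence in a FASTA file, one per line.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
         { n += length($0) }
         END { if (seen) print n }' "$1"
}

if [ -f tgs_gapcloser_muslong.scaff_seqs ]; then
    fasta_lengths tgs_gapcloser_muslong.scaff_seqs | n50
fi
```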
  • BUSCO (Benchmarking Universal Single-Copy Orthologs) is a complementary measure: it provides a quantitative assessment of genome assembly completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.
busco -f -i tgs_gapcloser_muslong.scaff_seqs -m genome -l metazoa_odb10 -o output_busco -c 124

Short Read genome Assembly

cd /home/nanobioinfo22/NGSWorkshop_2024/your_name

mkdir Short_assembly
cd Short_assembly
cp ../../Day3/Cow_* ./
conda activate assembly

Quality control:

  • The quality of the raw FASTQ reads is assessed using FastQC. The graphical reports generated by FastQC are evaluated, and sequence trimming is then performed with fastp to remove contamination such as adapter sequences.
fastp --detect_adapter_for_pe -i Cow_1.fastq.gz -I Cow_2.fastq.gz -o dm1_trim.fq -O dm2_trim.fq -q 30 --json=fastp.json --html=fastp.html -w 30
  • Before assembly, the genome size is estimated from 17-mer frequencies using Jellyfish and GenomeScope. GenomeScope reports overall statistics of the input data at the chosen k-mer size, including depth, heterozygosity and repeat content.
jellyfish count -C -m 17 -s 1000000000 -t 2 dm1_trim.fq dm2_trim.fq -o reads_17.jf
jellyfish histo -t 2 reads_17.jf > reads_17.histo
conda deactivate
/home/nanobioinfo22/shailesh_nipgr/app/genomescope/genomescope.R reads_17.histo 17 150 genomescope_dm_stats
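The idea behind k-mer based size estimation can be sketched directly from the Jellyfish histogram: divide the total number of k-mers by the depth of the main coverage peak. The helper below is a rough illustration, not GenomeScope's model. `estimate_genome_size` is a hypothetical name, and the cutoff of 5 for discarding low-depth error k-mers is an assumption that should be checked against the actual histogram.

```shell
# Naive genome size estimate from a Jellyfish histogram (columns: depth, count).
# Depths below min (assumed default 5) are treated as sequencing errors.
estimate_genome_size() {
    awk -v min="${2:-5}" '
        $1 >= min {
            kmers += $1 * $2                          # total k-mers at this depth
            if ($2 > best) { best = $2; peak = $1 }   # depth with most distinct k-mers
        }
        END { if (peak > 0) printf "%.0f\n", kmers / peak }' "$1"
}

if [ -f reads_17.histo ]; then
    estimate_genome_size reads_17.histo
fi
```

Comparing this figure with the GenomeScope report gives a quick plausibility check; large disagreements usually mean the error cutoff or the peak was chosen poorly.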

Genome Assembly:

spades.py -1 dm1_trim.fq -2 dm2_trim.fq -o spade_out --only-assembler --careful -t 2
  • SPAdes creates several files and outputs. The final assembled genome is written to scaffolds.fasta in the spade_out directory. The assembled genome is further scaffolded using RagTag. Homology-based scaffolding orders and orients the draft assembly (query) sequences into longer sequences by comparing them against the genome of a closely related species.
ragtag.py scaffold <reference.fa> spade_out/scaffolds.fasta
  • The scaffold.fasta generated by the above step is then passed to GapCloser, which closes the gaps that arise during scaffolding and so improves the overall quality. Make sure to set the paths of the input FASTQ reads in the example.config file.
nano example.config
#maximal read length
max_rd_len=100
[LIB]
#average insert size
avg_ins=300
#if sequence needs to be reversed
reverse_seq=0
#a pair of fastq files, read1 file should be followed by read2 file
q1=/home/nanobioinfo22/NGSWorkshop_2024/your_name/Short_assembly/Cow_1.fastq.gz
q2=/home/nanobioinfo22/NGSWorkshop_2024/your_name/Short_assembly/Cow_2.fastq.gz
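As an alternative to editing the file by hand in nano, the same configuration can be written from the shell so that the read paths always match the working directory. This is a sketch only: `write_gapcloser_config` is a hypothetical helper, and it reproduces the field values shown above unchanged.

```shell
# Write a GapCloser library config for the Cow read pair found in a directory.
# $1: directory holding Cow_1.fastq.gz and Cow_2.fastq.gz
# $2: output file (defaults to example.config)
write_gapcloser_config() {
    read_dir=$1
    out=${2:-example.config}
    cat > "$out" <<EOF
#maximal read length
max_rd_len=100
[LIB]
#average insert size
avg_ins=300
#if sequence needs to be reversed
reverse_seq=0
#a pair of fastq files, read1 file should be followed by read2 file
q1=${read_dir}/Cow_1.fastq.gz
q2=${read_dir}/Cow_2.fastq.gz
EOF
}
```

Running `write_gapcloser_config "$PWD"` from inside the assembly directory creates example.config with absolute paths to the two read files.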
GapCloser -a scaffold.fasta -b <example.config> -o Draft_genome.fa -t 224 -l 150
conda deactivate
  • Several summary statistics and in-silico validations are used to describe the completeness and contiguity of a genome assembly. QUAST (Quality Assessment Tool for Genome Assemblies) is one such tool for genome assembly evaluation.
quast.py Draft_genome.fa -o DG_evaluate -t 224
OR
quast.py Draft_genome.fa -o DG_evaluate -r <ref.fa> -g ref.gff -t 22
  • BUSCO (Benchmarking Universal Single-Copy Orthologs) is a complementary measure: it provides a quantitative assessment of genome assembly completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.
busco -f -i Draft_genome.fa -m genome -l mammalia_odb10 -o output_busco -c 124