Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PacBio / ONT reads as input and outputs polished contigs. Flye also has a special mode for metagenome assembly.
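A minimal sketch of a Flye run is given here. The reads file name, the thread count and the choice of --nano-raw (raw ONT reads) are assumptions for illustration; for raw PacBio reads, --pacbio-raw would be used instead.

```shell
# Hypothetical inputs; adjust to your own data.
READS="ont_reads.fastq.gz"   # assumed raw ONT reads
OUT_DIR="mouse_assembly"     # Flye writes the polished contigs to assembly.fasta in this directory
THREADS=8                    # assumed number of CPU threads

# --nano-raw is for raw ONT reads; use --pacbio-raw for raw PacBio reads.
FLYE_CMD="flye --nano-raw $READS --out-dir $OUT_DIR --threads $THREADS"
echo "$FLYE_CMD"

# Run only when flye is installed and the reads file exists.
if command -v flye >/dev/null 2>&1 && [ -f "$READS" ]; then
    $FLYE_CMD
fi
```

For metagenome data, the --meta option can be added to the same command.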
The assembly.fasta generated from the above step in the directory mouse_assembly is then used with TGS-GapCloser to close the gaps that arise during scaffolding, further improving the overall assembly quality. TGS-GapCloser is a gap-closing tool that uses error-prone long reads from third-generation sequencing technologies (PacBio, Oxford Nanopore, etc.) or preassembled contigs to fill N-gaps in a genome assembly.
To describe the completeness and contiguity of a genome assembly, several summary statistics and in-silico validations are computed. QUAST (Quality Assessment Tool for Genome Assemblies) is one such tool for genome assembly evaluation.
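A QUAST run on the long-read assembly could look like the sketch below. The output directory name and thread count are assumptions; the assembly path follows the mouse_assembly directory used above.

```shell
# Hypothetical paths; adjust to your own data.
ASSEMBLY="mouse_assembly/assembly.fasta"   # assembly to evaluate
QUAST_OUT="quast_out"                      # assumed output directory
THREADS=8                                  # assumed number of CPU threads

QUAST_CMD="quast.py $ASSEMBLY -o $QUAST_OUT -t $THREADS"
echo "$QUAST_CMD"

# Run only when QUAST is installed and the assembly exists.
if command -v quast.py >/dev/null 2>&1 && [ -f "$ASSEMBLY" ]; then
    $QUAST_CMD
fi
```

The summary statistics are written to report.txt inside the output directory.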
BUSCO (Benchmarking Universal Single-Copy Orthologs) is another measure. It provides a quantitative assessment of genome assembly completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.
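A BUSCO run in genome mode might be sketched as follows. The lineage dataset mammalia_odb10 is an assumption made because the example data here are mammalian; choose the lineage that matches your organism. The output name and thread count are also assumptions.

```shell
# Hypothetical paths; adjust to your own data.
ASSEMBLY="mouse_assembly/assembly.fasta"   # assembly to evaluate
LINEAGE="mammalia_odb10"                   # assumed lineage dataset
BUSCO_OUT="busco_out"                      # assumed output name
THREADS=8                                  # assumed number of CPU threads

BUSCO_CMD="busco -i $ASSEMBLY -l $LINEAGE -m genome -o $BUSCO_OUT -c $THREADS"
echo "$BUSCO_CMD"

# Run only when BUSCO is installed and the assembly exists.
if command -v busco >/dev/null 2>&1 && [ -f "$ASSEMBLY" ]; then
    $BUSCO_CMD
fi
```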
The quality of the raw FASTQ reads is assessed using FastQC. The graphical reports generated by FastQC are evaluated, and sequence trimming is performed to remove contamination such as adapter sequences.
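The quality check could be run as sketched below. The read file names are taken from the example data used later in this workshop and are assumed to be in the current directory; the output directory name and thread count are assumptions.

```shell
# Hypothetical paths; adjust to your own data.
READ1="Cow_1.fastq.gz"   # forward reads
READ2="Cow_2.fastq.gz"   # reverse reads
QC_OUT="fastqc_out"      # assumed output directory
THREADS=2                # one thread per input file

FASTQC_CMD="fastqc -o $QC_OUT -t $THREADS $READ1 $READ2"
echo "$FASTQC_CMD"

# Run only when FastQC is installed and both read files exist.
if command -v fastqc >/dev/null 2>&1 && [ -f "$READ1" ] && [ -f "$READ2" ]; then
    mkdir -p "$QC_OUT"   # FastQC expects the output directory to exist
    $FASTQC_CMD
fi
```

FastQC writes one HTML report per input file into the output directory.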
Before assembly, the genome size is estimated from the 17-mer frequency distribution using Jellyfish and GenomeScope. This analysis provides overall statistics of the input data at the chosen k-mer size, including sequencing depth, heterozygosity and repeat content.
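The k-mer counting step could be sketched as follows. The file names, hash size (-s 1G) and thread count are assumptions. The reads are decompressed on the fly because Jellyfish does not read gzipped files directly.

```shell
# Hypothetical paths and settings; adjust to your own data.
K=17                     # k-mer size used in this workshop
THREADS=8                # assumed number of CPU threads
READ1="Cow_1.fastq.gz"   # forward reads
READ2="Cow_2.fastq.gz"   # reverse reads

# -C counts canonical k-mers (a k-mer and its reverse complement together).
COUNT_CMD="jellyfish count -C -m $K -s 1G -t $THREADS -o reads.jf /dev/fd/0"
# The histogram gives the number of distinct k-mers at each depth.
HISTO_CMD="jellyfish histo -t $THREADS -o reads.histo reads.jf"
echo "$COUNT_CMD"
echo "$HISTO_CMD"

# Run only when Jellyfish is installed and both read files exist.
if command -v jellyfish >/dev/null 2>&1 && [ -f "$READ1" ] && [ -f "$READ2" ]; then
    gzip -dc "$READ1" "$READ2" | $COUNT_CMD
    $HISTO_CMD
fi
```

The resulting reads.histo is the k-mer histogram that GenomeScope takes as input, together with the k-mer size.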
This step creates several output files; the final assembled genome is in “scaffolds.ref.fa”. The assembled genome is further scaffolded using RagTag. Homology-based assembly scaffolding orders and orients the draft assembly (query) sequences into longer sequences by comparing them against a closely related reference genome.
ragtag.py scaffold <reference.fa> scaffold.fasta
The gaps introduced during scaffolding in the scaffold.fasta generated in the above step are then closed with GapCloser, further improving the overall assembly quality. Make sure to change the paths of the input FASTQ reads in the example.config file.
nano example.config
#maximal read length
max_rd_len=100
[LIB]
#average insert size
avg_ins=300
#if sequence needs to be reversed
reverse_seq=0
#a pair of fastq files, read1 file should be followed by read2 file
q1=/home/nanobioinfo22/NGSWorkshop_2024/your_name/Short_read_assembly/Cow_1.fastq.gz
q2=/home/nanobioinfo22/NGSWorkshop_2024/your_name/Short_read_assembly/Cow_2.fastq.gz
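With the config file edited, GapCloser could be run as sketched below. The output file name and thread count are assumptions; the -l value is chosen to match max_rd_len in the config above.

```shell
# Hypothetical paths and settings; adjust to your own data.
SCAFFOLD="scaffold.fasta"            # scaffolds from the RagTag step
CONFIG="example.config"              # library config edited above
OUTPUT="scaffold.gapclosed.fasta"    # assumed output file name
MAX_RD_LEN=100                       # matches max_rd_len in the config
THREADS=8                            # assumed number of CPU threads

GAPCLOSER_CMD="GapCloser -a $SCAFFOLD -b $CONFIG -o $OUTPUT -l $MAX_RD_LEN -t $THREADS"
echo "$GAPCLOSER_CMD"

# Run only when GapCloser is installed and both inputs exist.
if command -v GapCloser >/dev/null 2>&1 && [ -f "$SCAFFOLD" ] && [ -f "$CONFIG" ]; then
    $GAPCLOSER_CMD
fi
```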
To describe the completeness and contiguity of a genome assembly, several summary statistics and in-silico validations are computed. QUAST (Quality Assessment Tool for Genome Assemblies) is one such tool for genome assembly evaluation.
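Contiguity is most often summarised by N50, one of the statistics QUAST reports: the length of the shortest contig among the longest contigs that together cover at least half of the assembly. The sketch below computes it from a list of contig lengths; the function name n50 and the example lengths are made up for illustration.

```shell
# n50: read one contig length per line on stdin and print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }          # store lengths, longest first
        END {
            half = total / 2
            run = 0
            for (i = 1; i <= NR; i++) {
                run += len[i]                  # cumulative length so far
                if (run >= half) { print len[i]; exit }
            }
        }'
}

# Made-up contig lengths: total 290, half 145; 80 + 70 = 150 reaches half.
printf '%s\n' 80 70 50 40 30 20 | n50   # prints 70
```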
BUSCO (Benchmarking Universal Single-Copy Orthologs) is another measure. It provides a quantitative assessment of genome assembly completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs.