
Assembly Quality Assessment and Protein Prediction

Why We Need Assembly Quality Assessment

  • Assess the quality of the contigs assembled with MEGAHIT before using them in downstream analyses.
  • Do not overly trust your tools; fully understand your data.

How to Evaluate Our Contigs

BUSCO

  • Install BUSCO using conda:
conda activate envname
conda install -c conda-forge -c bioconda busco=5.3.2
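  • Run BUSCO. Set -m to exactly one mode that matches the input: geno (genome), prot (proteins), or tran (transcriptome). The example below uses genome mode:
busco -i genome.fa -c 10 -o outputdir -m geno -l refdatabase_path --offline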
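Once BUSCO has finished, the headline result is the one-line completeness score in the short summary file. The sketch below pulls that line out. The function name `busco_scores` is hypothetical, and it assumes the summary contains a line of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`, which is how BUSCO 5 reports scores as far as we know; check your own file if nothing is printed.

```shell
#!/usr/bin/env bash
# busco_scores SUMMARY_FILE
# Hypothetical helper: print the one-line completeness score from a BUSCO
# short summary file. Assumes a line that starts (after optional whitespace)
# with "C:<number>%".
busco_scores() {
  grep -E '^[[:space:]]*C:[0-9.]+%' "$1" \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'   # trim surrounding whitespace
}
```

A typical call would be `busco_scores outputdir/short_summary*.txt`, assuming the summary sits in the output directory under a name starting with `short_summary`. If the glob matches several files, only the first one is read.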

QUAST

  • Install QUAST using conda:
conda activate envname
conda install -c conda-forge -c bioconda quast
  • Basic usage:
quast.py contigs.fas
  • More sophisticated usage:
quast.py contigs_1.fa contigs_2.fa -r reference.fa -g genome.gff -1 reads1.fastq.gz -2 reads2.fastq.gz -o quast_out -t 12
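  • QUAST can run without a reference genome, but reference-based results (for example misassemblies, genome fraction, and the gene/feature counts derived from the file given with -g) will not be reported.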
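In the spirit of not trusting a tool blindly, a few of the reference-free numbers QUAST reports can be cross-checked by hand. The following is a minimal sketch, not part of QUAST: the function name `contig_stats` is hypothetical, and it counts every non-empty contig, whereas QUAST applies a minimum contig length by default, so the figures can differ for assemblies with many short contigs.

```shell
#!/usr/bin/env bash
# contig_stats FASTA
# Hypothetical helper: print contig count, total length, and N50.
# N50 here is the length of the contig at which the running sum of lengths,
# taken from longest to shortest, first reaches half of the total length.
contig_stats() {
  awk '
    /^>/ { if (len > 0) print len; len = 0; next }   # header: emit previous contig length
    { gsub(/[ \t\r]/, ""); len += length($0) }      # sequence line: accumulate residues
    END { if (len > 0) print len }                   # emit the last contig
  ' "$1" | sort -rn | awk '
    { lens[NR] = $1; total += $1 }
    END {
      half = total / 2; run = 0; n50 = 0
      for (i = 1; i <= NR; i++) {
        run += lens[i]
        if (run >= half) { n50 = lens[i]; break }
      }
      printf "contigs\t%d\ntotal_length\t%d\nN50\t%d\n", NR, total, n50
    }'
}
```

For example, `contig_stats contigs.fas` prints three tab-separated lines that can be compared with the corresponding rows of the QUAST report.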

getorf

  • Install EMBOSS to use getorf:
conda install -c bioconda emboss
  • Command example:
getorf -minsize 600 -sequence input.fna -outseq output.faa
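  • getorf extracts Open Reading Frames (ORFs) from a nucleotide sequence and translates them into protein sequences. By default (-find 0) an ORF is the region between two STOP codons. -minsize is given in nucleotides, so 600 keeps only ORFs of at least 600 nt (about 200 amino acids).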
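To check what getorf actually produced, it can help to list the length of every predicted protein. The sketch below assumes the output is an ordinary protein FASTA file; the function name `faa_lengths` is hypothetical and not part of EMBOSS.

```shell
#!/usr/bin/env bash
# faa_lengths FAA
# Hypothetical helper: print "<sequence id><TAB><length in residues>" for each
# record of a protein FASTA file. The id is the first word of the header line.
faa_lengths() {
  awk '
    /^>/ {
      if (id != "") print id "\t" len   # emit the previous record
      id = substr($1, 2); len = 0       # strip the leading ">" from the id
      next
    }
    { gsub(/[ \t\r]/, ""); len += length($0) }   # sequence line: accumulate residues
    END { if (id != "") print id "\t" len }      # emit the last record
  ' "$1"
}
```

For example, `faa_lengths output.faa | sort -k2,2n | head` shows the shortest predicted proteins, which with -minsize 600 would be expected to be around 200 residues long.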