yeast genome de novo assembly from PacBio HiFi reads using Flye - splaisan/analyses GitHub Wiki

(last edit: 2022-10-04)

Table of Contents

[Aim]

[Resources]

[Analysis & Results]

[Conclusion]

[Command help]


Aim

This tutorial reports the de-novo assembly of the Saccharomyces cerevisiae S288C genome using only Sequel-IIe HiFi reads.

Resources

''Note: The different steps of the process are largely inspired from existing web-resources and publications. We try to acknowledge them where appropriate but if you feel like a reference is missing, please let us know and it will be amended.''

The steps of this analysis are inspired from a recent benchmarking review by Zhang et al of current assemblers and finishing tools. We selected Flye for assembling the HiFi reads as it combines a good performance, good N50 contig size without too much mis-assemblies (as seen with for instance next-denovo) and ease of use.

Reference data

In order to compare the obtained assembly to the Saccharomyces cerevisiae s288c reference genome, both sequence and annotations were obtained for the subject from the EnsEMBL site.

(base URL: http://ftp.ensembl.org/pub/)

  • genome: current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
  • proteins: current_fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz
  • annotations: current_gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.107.gff3.gz
  • known variants: current_variation/vcf/saccharomyces_cerevisiae/saccharomyces_cerevisiae.vcf.gz
mkdir -p reference
baseurl="http://ftp.ensembl.org/pub/"
wget -P reference ${baseurl}/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
wget -P reference ${baseurl}/current_fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz
wget -P reference ${baseurl}/current_gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.107.gff3.gz
wget -P reference ${baseurl}/current_variation/vcf/saccharomyces_cerevisiae/saccharomyces_cerevisiae.vcf.gz

# create decompressed versions with easier names
gunzip -c reference/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz \
  > reference/Sc.R64-1-1.fa

gunzip -c reference/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz \
  > reference/Sc.R64-1-1_proteins.fa

gunzip -c reference/Saccharomyces_cerevisiae.R64-1-1.107.gff3.gz \
  > reference/Sc.R64-1-1.107.gff3
Alt using the NCBI yeast reference

The NCBI data for yeast can be preferred and is found the following location.

(base URL: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/)

  • genome: GCF_000146045.2_R64_genomic.fna.gz
  • proteins: GCF_000146045.2_R64_protein.faa.gz
  • annotations: GCF_000146045.2_R64_genomic.gff.gz
mkdir -p reference
baseurl="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/"
wget -P reference ${baseurl}/GCF_000146045.2_R64_genomic.fna.gz
wget -P reference ${baseurl}/GCF_000146045.2_R64_protein.faa.gz
wget -P reference ${baseurl}/GCF_000146045.2_R64_genomic.gff.gz

These files will be used at later stages for comparison.

[back-to-top]

Analysis & Results

The workflow is detailed in the next sections and commands are added at each step, considering that the user is working in a writable project folder, has sufficient rights to create folders and save files, and has installed the necessary programs (not detailed here). For a number of resources, conda install was done and should be reproducible on the user computer provided miniconda has been installed and configured correctly.

We provide a conda config file yeast_asm_environment.yml which should allow you to reproduce the environment used in this report. Since this env has been built sequentially, it is possible that some packages refuse to install, in which case you can create a single env for refractory tools.

[back-to-top]

Raw Data

The SRR18210286 reads used in this tutorial was obtained from the SRA project PRJNA792930.

--

The SRA data was downloaded and converted to Fastq using the NCBI SRA tools (version 2.8.0).

# sra_tools installed with conda
mkdir -p sra rawreads
sraid=SRR18210286
thr=8
prefetch -O sra ${sraid}

# convert to fastq with sra-tools version 2.x
fastq-dump -O rawreads -origfmt -I --split-files --gzip ${sraid}

# NOTE with the more recent 3.0 version of the tools
#      fasterq-dump replaced fastq-dump and is much faster indeed
fasterq-dump -e ${thr} -O rawreads -S sra/${sraid}/${sraid}.sra && \
  bgzip -@ ${thr} rawreads/${sraid}.fastq

# resulting in a file: rawreads/SRR18210286.fastq.gz

[back-to-top]

Filtering reads for a 100x genome coverage depth

A subset of 100x coverage was extracted from the big raw reads file (380x coverage) with filterlong (version 0.2.1) in order to have sufficient (yet reasonable) data for assembly. During filtering, shorter than 10kb reads were excluded and only the 95% best average quality reads were retained.

# installed with conda
# >10kb
# 100x depth for S288c (12Mb)
# remove worse 5% by quality

mkdir -p reads

filtlong \
  --min_length 10000 \
  --target_bases 1200000000 \
  --keep_percent 95 \
  rawreads/SRR18210286.fastq.gz | \
  bgzip -c > reads/SRR18210286_HiFi_gt10k_100x.fq.gz

# filterlong analyzed 227,298 reads (4,583,595,361 bp) in the input file and kept 50'892 reads (1,200,003,122 bp)

[back-to-top]

QC of the resulting reads

The reads were analyzed using FastQC (version 0.11.9)

mkdir -p reads/fastqc_results
fastqc -o reads/fastqc_results reads/SRR18210286_HiFi_gt10k_100x.fq.gz

The size distribution of the reads shows a N50 / N90 of 23582 bp / 19106 bp, corresponding to a very good quality HiFi library.

[back-to-top]

De Novo assembly of the reads using Flye

The filtered HiFi read subsets were assembled using Flye (version 2.9.1-b1780), resulting in a multifasta files containing contigs for the host genome and for potential extra-chromosomal plasmids.

The initial step is using the longest 40x reads for contig assembly and all reads are then used for later processes.

# Flye installed using conda
mkdir -p 100x_Flye_assembly

reads="reads/SRR18210286_HiFi_gt10k_100x.fq.gz"
thr=24

flye --pacbio-hifi \
  ${reads} \
  --out-dir 100x_Flye_assembly \
  --genome-size 12m \
  --threads ${thr} \
  --asm-coverage 40

Flye reported the following metrics:

  • Total length: 12881856
  • Fragments: 60
  • Fragments N50: 724534
  • Largest frg: 1472942
  • Scaffolds: 0
  • Mean coverage: 90

The Flye assembly produced several important files among which:

  • a summary of the assembly: assembly_info.txt (reproduced next)
  • the assembly in Fasta format: assembly.fasta
  • an assembly graph: assembly_graph.gfa

only contigs larger than 100kb are listed here

seq_name length cov. circ. repeat mult. alt_group graph_path
contig_30 1472942 96 N N 1 * ,30,
contig_33 1060609 98 N N 1 * ,33,
contig_100 902343 97 N N 1 * ,100,
contig_11 866978 99 N N 1 * *,11
contig_99 834891 99 N N 1 * ,-6,99,
contig_105 780267 97 N N 1 * ,105,
contig_7 724534 95 N N 1 * *,-63,7
contig_103 675125 100 N N 1 * ,103,
contig_106 624754 96 N N 1 * 106,*
contig_98 590312 97 N N 1 * ,98,
contig_102 564595 96 N N 1 * 102,*
contig_2 509594 93 N N 1 * 2,6,*
contig_91 477797 96 N N 1 * *,91,-20,-101,-20,-101,-20,-101,-20,-101
contig_42 432189 98 N N 1 * ,42,
contig_104 369953 95 N N 1 * *,-63,104
contig_8 287696 97 N N 1 * ,8,
contig_72 199833 87 N N 1 * 72,*
contig_23 133434 106 N N 1 * *,23
contig_69 108934 33 N N 1 * *,67,69
contig_68 108928 28 N N 1 * *,67,68

The Flye assembly graphs can be converted to a plot with Bandage (version 0.8.1) which illustrate the linear or circular character of the different contigs and their relationships.

# bandage installed with conda
gfa="100x_Flye_assembly/assembly_graph.gfa"

Bandage image \
  ${gfa} \
  100x_Flye_assembly/s288c_bandage_plot.png \
  --names \
  --depth \
  --lengths \
  --fontsize 6 \
  --edgelen 50 \
  --toutline 2 \
  --minnodlen 50 \
  --colour blastsolid

[back-to-top]

QC of the assembly and comparison to s288c

Quast analysis of the assembly

Quast (version 5.2.0) was used to evaluate the assembly and compare it to the reference genome

# quast installed with conda
mkdir -p assembly_QC

cp reference/Sc.R64-1-1.fa assembly_QC/s288c.fa
cp reference/Sc.R64-1-1.107.gff3 assembly_QC/s288c.gff
cp 100x_Flye_assembly/assembly.fasta assembly_QC/Flye_denovo_100x.fa

cd assembly_QC
thr=8

quast -o quast_out \
  -r s288c.fa \
  -g s288c.gff \
  -t ${thr} \
  --min-identity=85 \
  -l "s288c,Flye_denovo_100x" \
  s288c.fa \
  Flye_denovo_100x.fa

cd ../

[back-to-top]

pairwise alignment of the de-novo and reference genomes

Both genomes were aligned using Mummer4 (version 4.0.0rc1) and a bi-plot produced from the results.

# mummer4 installed with conda
# we used here a custom script mummer_biplot.sh that can be found on our wiki:
# https://github.com/Nucleomics-VIB/ngs-tools/blob/master/mummer_biplot.sh

mummer_biplot.sh \
  -o assembly_QC/denovo_vs_s288c \
  -c 100 -I 98 -L 100 -f png -T 8 \
  -x assembly_QC/s288c.fa \
  -y 100x_Flye_assembly/assembly.fasta && \
  mv assembly_vs_s288c_mummer3-log_1664891829.txt assembly_QC/denovo_vs_s288c/

The two assemblies appears as a nice diagonal, suggesting a nice sequence homology and content.

[back-to-top]

BUSCO analysis of the assembly

BUSCO (version 5.3.2) is traditionally applied to eukaryote assemblies to evaluate the fraction of essential genes present, which is a proxy to assembly completeness.

# BUSCO is installed from a docker image
# from https://busco.ezlab.org/busco_userguide.html#manage-run-parameters-in-config-files

mkdir -p assembly_QC/busco_out
thr=8

sudo docker run -u $(id -u) -v $(pwd):/host_mount -w /host_mount ezlabgva/busco:v5.3.2_cv1 busco \
  --augustus \
  --augustus_species saccharomyces_cerevisiae_S288C \
  --mode genome \
  --lineage_dataset saccharomycetes_odb10 \
  -o assembly_QC/Flye_busco_out \
  -i 100x_Flye_assembly/assembly.fasta \
  -c ${thr}

# also run on ref genome
sudo docker run -u $(id -u) -v $(pwd):/host_mount -w /host_mount ezlabgva/busco:v5.3.2_cv1 busco \
  --augustus \
  --augustus_species saccharomyces_cerevisiae_S288C \
  --mode genome \
  --lineage_dataset saccharomycetes_odb10 \
  -o assembly_QC/ref_busco_out \
  -i assembly_QC/s288c.fa \
  -c ${thr}

# copy for multi-plot
mkdir -p assembly_QC/busco_plot

cp Flye_busco_out/short_summary.specific.saccharomycetes_odb10.Flye_busco_out.txt \ 
  busco_plot/short_summary.specific.saccharomycetes_odb10.Flye_denovo.txt

cp ref_busco_out/short_summary.specific.saccharomycetes_odb10.ref_busco_out.txt \ 
  busco_plot/short_summary.specific.saccharomycetes_odb10.s288c.txt

# plot
sudo docker run -u $(id -u) -v $PWD:/busco_wd ezlabgva/busco:v5.3.2_cv1 /usr/local/bin/generate_plot.py -wd assembly_QC/busco_plot

BUSCO results for the Flye assembly genome

--------------------------------------------------
|Results from dataset saccharomycetes_odb10       |
--------------------------------------------------
|C:99.4%[S:93.4%,D:6.0%],F:0.0%,M:0.6%,n:2137     |
|2125   Complete BUSCOs (C)                       |
|1996   Complete and single-copy BUSCOs (S)       |
|129    Complete and duplicated BUSCOs (D)        |
|0      Fragmented BUSCOs (F)                     |
|12     Missing BUSCOs (M)                        |
|2137   Total BUSCO groups searched               |
--------------------------------------------------

BUSCO results for the s288c reference genome

--------------------------------------------------
|Results from dataset saccharomycetes_odb10       |
--------------------------------------------------
|C:99.5%[S:97.3%,D:2.2%],F:0.0%,M:0.5%,n:2137     |
|2126   Complete BUSCOs (C)                       |
|2080   Complete and single-copy BUSCOs (S)       |
|46     Complete and duplicated BUSCOs (D)        |
|0      Fragmented BUSCOs (F)                     |
|11     Missing BUSCOs (M)                        |
|2137   Total BUSCO groups searched               |
--------------------------------------------------

[back-to-top]

Assembly polishing with Pilon

The contigs obtained with Flye were further polished with Pilon (version 1.24) Walker et al, 2014.

Note: Although Flye already performs polishing, this extra step introduced a few sequence edits and was performed to comply with state of the art workflows and take advantage from the full sequencing data.

\footnotesize

# pilon installed with conda 
# as well as bwa 0.7.17-r1188, sambamba 0.8.2, and samtools 1.15
# install samtools with 'conda install -c conda-forge -c bioconda samtools' to get the right openssl installed

base=$PWD
smpl=SRR18210286
thr=8
thr2=4

outfolder=100x_Flye_assembly_pilon
mkdir -p ${outfolder} && cd ${outfolder}

cp ../100x_Flye_assembly/assembly.fasta flye-asm_${smpl}.fa

bwa index -p flye-asm_${smpl} flye-asm_${smpl}.fa
bwa mem -t ${thr} flye-asm_${smpl} ../reads/SRR18210286_HiFi_gt10k_100x.fq.gz \
  | samtools sort -@ ${thr2} -O bam -o align_flye.asm.${smpl}.bam && \
  samtools index -@ ${thr2} align_flye.asm.${smpl}.bam

# mark duplicates
sambamba markdup -t ${thr} align_flye.asm.${smpl}.bam align_flye.asm.${smpl}_markdup.bam

# filter high quality mappings for correction
samtools view -@ ${thr2} -b -q 30 align_flye.asm.${smpl}_markdup.bam > align_filter.bam && \
  samtools index -@ ${thr2} align_filter.bam

java -Xmx20G -jar $CONDA_PREFIX/share/pilon-1.24-0/pilon.jar \
  --genome flye-asm_${smpl}.fa \
  --frags align_filter.bam \
  --fix snps,indels \
  --output flye-polished_${smpl} \
  --vcf

cd ${base}

[back-to-top]

Assembly reference scaffolding using ragtag

Since denovo assembly contigs bear arbitrary names and often correspond to chromosome fragments (arms), it was decided to scaffold the Flye assemblies using the public Saccharomyces_cerevisiae.R64-1-1 reference as guide and the ragtag tool (version 2.1.0) Alonge et al, 2021.

The obtained scaffolded assemblies bear the official chromosome names where scaffolding succeeded and retain the Flye contig names for non-scaffolded contigs.

# ragtag installed with conda 
base=$PWD
smpl=SRR18210286

thr=8
# since busco seem to take two threads per sample
thr2=4

######################
# RagTag scaffolding
######################

outfolder=100x_Flye_assembly_ragtag
mkdir -p ${outfolder} && cd ${outfolder}

cp ../100x_Flye_assembly_pilon/flye-polished_${smpl}.fasta flye-polished_${smpl}.fa

asm=flye-polished_${smpl}.fa
reads=../reads/SRR18210286_HiFi_gt10k_100x.fq.gz
ref=../reference/Sc.R64-1-1.fa
minimappath=$(which minimap2)

# correct
echo "# run ragtag correction for ${smpl}"

ragtag.py correct \
  ${ref} \
  ${asm} \
  -t ${thr} \
  --aligner ${minimappath} \
  --inter \
  -R ${reads} \
  -T corr \
  -w \
  -u \
  -o ragtag_${smpl}

# scaffold
echo "# run ragtag scaffolding for ${smpl}"

ragtag.py scaffold \
  ${ref} \
  ragtag_${smpl}/ragtag.correct.fasta \
  -t ${thr} \
  --aligner ${minimappath} \
  -w \
  -u \
  -o ragtag_${smpl}

# copy final results back
cp ragtag_${smpl}/ragtag.scaffold.fasta ragtag_scaffolds_${smpl}.fa

# rename and replace long fasta headers in ragtag scaffold output
sed -i 's/contig_\(.*\)_pilon_\(.*_.*\)_+/\1_\2/g' ragtag_scaffolds_${smpl}.fa
sed -i 's/_RagTag//g' ragtag_scaffolds_${smpl}.fa
sed -i 's/>contig_/>/g' ragtag_scaffolds_${smpl}.fa

[back-to-top]

Compare the corrected and scaffolded assembly to the reference

The resulting 'final assembly' is compared to the reference as done with the primary assembly to detect any introduced biases (eg translocations and large misassemblies)

# mummer4, quast, installed with conda
# busco run from docker image
# mummer_biplot.sh from our git: https://github.com/Nucleomics-VIB/ngs-tools/blob/master/mummer_biplot.sh

## run in the ragtag outfolder

smpl=SRR18210286

thr=8
# since busco seem to take two threads per sample
thr2=4

#################
# mummer biplot
#################

conda deactivate
conda activate mummer4

echo "# create assembly biplot for ${smpl}"

mummer_biplot.sh \
  -x ${ref} \
  -y ragtag_scaffolds_${smpl}.fa \
  -o ragtag_scaffolds_${smpl}-vs-R64 \
  -t nucmer \
  -T ${thr} \
  -f png && mv *_mummer3-log_* ragtag_scaffolds_${smpl}-vs-R64/

conda deactivate

#########
# QUAST
#########

conda activate quast

ref=../reference/Sc.R64-1-1.fa
gff3=../reference/Sc.R64-1-1.107.gff3

quast -o quast_${smpl} \
  -r ${ref} \
  -g ${gff3} \
  -t ${thr} \
  --min-identity=85 \
  -l "s288c, Flye-primary, Flye-final" \
  ${ref} \
  ../100x_Flye_assembly/assembly.fasta \
  ragtag_scaffolds_${smpl}.fa

conda deactivate

#########
# BUSCO
#########

# from https://busco.ezlab.org/busco_userguide.html#manage-run-parameters-in-config-files
mkdir -p ../BUSCO_summaries_ragtag
mkdir -p busco_${smpl} && cd busco_${smpl}
cp ../ragtag_scaffolds_${smpl}.fa ragtag_asm.fna

sudo docker run -u $(id -u) -v $(pwd):/host_mount -w /host_mount ezlabgva/busco:v5.3.2_cv1 busco \
  --augustus \
  --augustus_species saccharomyces_cerevisiae_S288C \
  --mode genome \
  --lineage_dataset saccharomycetes_odb10 \
  -o busco_out \
  -i ragtag_asm.fna \
  -c ${thr2}

# copy for multi-plot
cp busco_out/short_summary.specific.saccharomycetes_odb10.busco_out.txt \
  ../../BUSCO_summaries_ragtag/short_summary.specific.saccharomycetes_odb10.Flye_final.txt

cp ../../assembly_QC/busco_plot/short_summary.specific.saccharomycetes_* \
  ../../BUSCO_summaries_ragtag/

# plot
cd ../../BUSCO_summaries_ragtag
sudo docker run -u $(id -u) -v $PWD:/busco_wd ezlabgva/busco:v5.3.2_cv1 /usr/local/bin/generate_plot.py -wd .

pairwise alignment of the final assembly with the reference

Quast results for the primary and final assembly compared to the reference

Busco results for the final assembly

--------------------------------------------------
|Results from dataset saccharomycetes_odb10       |
--------------------------------------------------
|C:99.5%[S:93.5%,D:6.0%],F:0.0%,M:0.5%,n:2137     |
|2127   Complete BUSCOs (C)                       |
|1998   Complete and single-copy BUSCOs (S)       |
|129    Complete and duplicated BUSCOs (D)        |
|0      Fragmented BUSCOs (F)                     |
|10     Missing BUSCOs (M)                        |
|2137   Total BUSCO groups searched               |
--------------------------------------------------

These results are added to the plot with the reference and the primary assembly

[back-to-top]

Add gene annotations to the final assembly using Funannotate

In order to easily locate and access yeast genes, the scaffolded genomes were fed to Funannotate (version 1.8.10) Li et al, 2021 which denovo locates and annotates yeast genes using a variety of embedded tools.

Note: One tool was not included in the Funannotate pipeline, namely GeneMark, which license does not allow usage for profit. All other accessible tools (incl. augustus, busco, interproscan and eggnog) were used in the annotation process. More info about Funannotate and included analyses can be found at (https://funannotate.readthedocs.io/en/latest/.

Note: since there is no RNA-seq data for this project, Funannotate annotations based on transcriptome sequencing are also absent.

Notes:

  • Funannotate can be installed as detailed on the nextgenusfs github page. We ran Funannotate from a docker image obtained with the command : docker pull nextgenusfs/funannotate:latest
  • Eggnog is also required for predictions and is installed using bioconda in a separate environment (emapper-2.1.7 / Expected eggNOG DB version: 5.0.2 / Installed eggNOG DB version: 5.0.2 / Diamond version found: diamond version 2.0.15 / MMseqs2 version found: 13.45111).
  • The most current version of Interproscan (5.57-90.0) was installed locally from the latest git release. Interproscan can also installed using conda in a separate environment (but only until version 5.55-88.0).
  • Interproscan does not include Phobius, signalP, and TMHMM because of their license models but these were added here thanks to the non-profit nature of this tutorial.

Funannotate run with the final Flye assembly

# Funannotate installed from git
# eggnog installed with conda
# interproscan installed

base=$PWD

smpl=SRR18210286
thr=8

###############
# funannotate
###############

outfolder=100x_Flye_assembly_funannotate
mkdir -p ${outfolder} && cd ${outfolder}
cp ../100x_Flye_assembly_ragtag/ragtag_scaffolds_${smpl}.fa .

# preprocess flye assembly for funannotate
sudo /opt/biotools/funannotate/funannotate-docker clean -i ragtag_scaffolds_${smpl}.fa --minlen 1000 -o flye.genome.cleaned.fa
sudo /opt/biotools/funannotate/funannotate-docker sort -i flye.genome.cleaned.fa -b scaffold -o flye.genome.cleaned.sorted.fa
sudo /opt/biotools/funannotate/funannotate-docker mask -i flye.genome.cleaned.fa --cpus ${thr} -o FlyeScaffoldedAssembly_${smpl}.fa

# run funannotate
sudo /opt/biotools/funannotate/funannotate-docker predict \
  -i FlyeScaffoldedAssembly_${smpl}.fa \
  -o fun1 \
  --species "Saccharomyces cerevisiae" \
  --strain ${smpl} \
  --organism fungus \
  --name "FUN1_" \
  --ploidy 1 \
  --busco_db "dikarya" \
  --busco_seed_species "saccharomyces_cerevisiae_S288C" \
  --cpus ${thr}

# accessory annotations, added in the same folder

## interproscan (takes some time)
# run from the local install

mkdir -p iprscan_out
interproscan.sh \
  -i fun1/predict_results/Saccharomyces_cerevisiae_${smpl}.proteins.fa \
  -f gff3,xml,tsv \
  -dp \
  -cpu ${thr} \
  -d iprscan_out

## list of databases used in the annotation
#  [AntiFam-7.0,CDD-3.18,Coils-2.2.1,Gene3D-4.3.0,Hamap-2021_04,MobiDBLite-2.0,PANTHER-15.0,
#   Pfam-35.0,Phobius-1.01,PIRSF-3.10,PIRSR-2021_05,PRINTS-42.0,ProSitePatterns-2022_01,
#   ProSiteProfiles-2022_01,SFLD-4,SignalP_EUK-4.1,SignalP_GRAM_NEGATIVE-4.1,
#   SignalP_GRAM_POSITIVE-4.1,SMART-7.1,SUPERFAMILY-1.75,TIGRFAM-15.0,TMHMM-2.0c]

## eggnog-mapper
# activate corresponding conda env

mkdir -p eggnog_out
emapper.py \
  -i fun1/predict_results/Saccharomyces_cerevisiae_${smpl}.proteins.fa \
  --itype proteins \
  --output_dir eggnog_out \
  -o eggnog \
  --temp_dir . \
  --scratch_dir . \
  --cpu ${thr} \
  --excel

# final annotation
sudo /opt/biotools/funannotate/funannotate-docker annotate \
  -i fun1 \
  --busco_db "dikarya" \
  --iprscan iprscan_out/Saccharomyces_cerevisiae_${smpl}.proteins.fa.xml \
  --eggnog eggnog_out/eggnog.emapper.annotations \
  --cpus ${thr}

# duplicate final folder locally
cp -r fun1/annotate_results .

# extract all annotated gene names
gawk 'BEGIN{FS="\t"; OFS="\t"}{if($3=="gene") {split($9,a,";"); if(a[2]==""){print a[1]}else print a[2]}}' \
  annotate_results/Saccharomyces_cerevisiae_${smpl}.gff3 | sort \
  > annotate_results/${smpl}-genes.txt

The Funannotate annotation was done on the pre-processed assembly detailed as:

num scaffolds: 36
assembly size: 12,618,341 bp
masked repeats: 527,761 bp (4.18%)

[top of current section]

final Funannotate results can be found in the annotate_results subfolder

The files include a genebank formatted file reporting all annotations and both genomic and protein sequences.

Gene2Products.must-fix.txt
Gene2Products.need-curating.txt
Gene2Products.new-names-passed.txt
Saccharomyces_cerevisiae_SRR18210286.agp
Saccharomyces_cerevisiae_SRR18210286.annotations.txt
Saccharomyces_cerevisiae_SRR18210286.cds-transcripts.fa
Saccharomyces_cerevisiae_SRR18210286.contigs.fsa
Saccharomyces_cerevisiae_SRR18210286.discrepency.report.txt
Saccharomyces_cerevisiae_SRR18210286.gbk
Saccharomyces_cerevisiae_SRR18210286.gff3
Saccharomyces_cerevisiae_SRR18210286.mrna-transcripts.fa
Saccharomyces_cerevisiae_SRR18210286.proteins.fa
Saccharomyces_cerevisiae_SRR18210286.scaffolds.fa
Saccharomyces_cerevisiae_SRR18210286.sqn
Saccharomyces_cerevisiae_SRR18210286.stats.json
Saccharomyces_cerevisiae_SRR18210286.tbl
SRR18210286-genes.txt

Funannotate run with the reference assembly

We also run Funannotate on the reference genome sequence in order to see if additional genes can be predicted there.

# Funannotate installed from git
# eggnog installed with conda

##########################################################
# also run funannotate on the ref genome for comparison

smpl=s288c
thr=8
mkdir  Sc.R64-1-1_funannotate && cd Sc.R64-1-1_funannotate
cp ../reference/Sc.R64-1-1.fa .

sudo /opt/biotools/funannotate/funannotate-docker clean -i Sc.R64-1-1.fa --minlen 1000 -o Sc.R64-1-1_cleaned.fa
sudo /opt/biotools/funannotate/funannotate-docker sort -i Sc.R64-1-1_cleaned.fa -b scaffold -o Sc.R64-1-1_cleaned_sorted.fa
sudo /opt/biotools/funannotate/funannotate-docker mask -i Sc.R64-1-1_cleaned_sorted.fa --cpus ${thr} -o ScaffoldedAssembly_Sc.R64-1-1.fa

sudo /opt/biotools/funannotate/funannotate-docker predict \
  -i ScaffoldedAssembly_Sc.R64-1-1.fa \
  -o fun2 \
  --species "Saccharomyces cerevisiae" \
  --strain ${smpl} \
  --organism fungus \
  --name "FUN2_" \
  --ploidy 1 \
  --busco_db "dikarya" \
  --busco_seed_species "saccharomyces_cerevisiae_S288C" \
  --cpus ${thr}

# accessory annotations, added in the same folder

## interproscan (takes some time)
# run from the local install

mkdir -p iprscan_out
interproscan.sh \
  -i fun2/predict_results/Saccharomyces_cerevisiae_${smpl}.proteins.fa \
  -f gff3,xml,tsv \
  -dp \
  -cpu ${thr} \
  -d iprscan_out

## list of databases used in the annotation
#  [AntiFam-7.0,CDD-3.18,Coils-2.2.1,Gene3D-4.3.0,Hamap-2021_04,MobiDBLite-2.0,PANTHER-15.0,
#   Pfam-35.0,Phobius-1.01,PIRSF-3.10,PIRSR-2021_05,PRINTS-42.0,ProSitePatterns-2022_01,
#   ProSiteProfiles-2022_01,SFLD-4,SignalP_EUK-4.1,SignalP_GRAM_NEGATIVE-4.1,
#   SignalP_GRAM_POSITIVE-4.1,SMART-7.1,SUPERFAMILY-1.75,TIGRFAM-15.0,TMHMM-2.0c]

## eggnog-mapper
# activate corresponding conda env

mkdir -p eggnog_out
emapper.py \
  -i fun2/predict_results/Saccharomyces_cerevisiae_${smpl}.proteins.fa \
  --itype proteins \
  --output_dir eggnog_out \
  -o eggnog \
  --temp_dir . \
  --scratch_dir . \
  --cpu ${thr} \
  --excel

# final annotation
sudo /opt/biotools/funannotate/funannotate-docker annotate \
  -i fun2 \
  --busco_db "dikarya" \
  --iprscan iprscan_out/Saccharomyces_cerevisiae_${smpl}.proteins.fa.xml \
  --eggnog eggnog_out/eggnog.emapper.annotations \
  --cpus ${thr}

# duplicate final folder locally
cp -r fun2/annotate_results .

# extract all annotated gene names
gawk 'BEGIN{FS="\t"; OFS="\t"}{if($3=="gene") {split($9,a,";"); if(a[2]==""){print a[1]}else print a[2]}}' \
  annotate_results/Saccharomyces_cerevisiae_${smpl}.gff3 | sort \
  > annotate_results/${smpl}-genes.txt

[top of current section]

Compare Funannotate results

mkdir -p funannotate_compare && cd funannotate_compare
cp ../100x_Flye_assembly_funannotate/annotate_results/Saccharomyces_cerevisiae_SRR18210286.gbk .
cp ../Sc.R64-1-1_funannotate/annotate_results/Saccharomyces_cerevisiae_s288c.gbk .

sudo /opt/biotools/funannotate/funannotate-docker compare \
  -i *.gbk \
  -o compare_out \
  --cpus 24
isolate species locus_tag Assembly Size Largest Scaffold Average Scaffold Num Scaffolds Scaffold N50 Percent GC Num Genes Num Proteins Num tRNA Unique Proteins Prots atleast 1 ortholog Single-copy orthologs
Saccharomyces cerevisiae None s288c 12,157,105 bp 1,531,933 bp 715,124 bp 17 924,431 bp 38.15% 5,661 5,373 288 5,348 5,222 5,042
Saccharomyces cerevisiae None Flye_denovo 12,618,341 bp 1,472,925 bp 350,509 bp 36 902,298 bp 38.20% 5,976 5,660 316 5,464 5,385 5,042

A number of pluts are available in teh compare results and not reproduced here. The comparison shows that the denovo assembly congtains more predicted genes, some are duplications due to the assembly process and other may come from contaminants in the DNA sample or from assembly errors. The assembly itself contains about 500kb extra sequences as compared to the reference that could represent the same entities. Contamination can be tested by blast'ing the unscaffolded sequences against genbank (not done here, see tutorial about bacterial assembly with Flye).

[back-to-top]

Conclusion

This concludes this overview of denovo assembly steps of a yeast genome.

The obtained genome is quasi complete with exception of centromeres and several difficult regions, leading to a split of chromosomes in chromosome arms and in different gapped contigs. Finishing such assembly would require orthogonal technologies like Hi-C or optical mapping which are quite costly but often lead to chromosome size scaffolds.

This workflow can be easily adapted to support multiple samples as obtained from a typical Sequel-IIE run, which, if successful, can accommodate 8-12 barcoded yeast genomes.

[back-to-top]

Command help

You will find below the help page for the main commands used in this report. Additional commands will allow defining some of the arguments in case you do not recycle this tutorial for a yeast but instead for another organism. As always, reading the full online documentation is highly recommended.

command help for Flye
usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
             --nano-corr | --nano-hq ) file1 [file_2 ...]
             --out-dir PATH

             [--genome-size SIZE] [--threads int] [--iterations int]
             [--meta] [--polish-target] [--min-overlap SIZE]
             [--keep-haplotypes] [--debug] [--version] [--help] 
             [--scaffold] [--resume] [--resume-from] [--stop-after] 
             [--read-error float] [--extra-params]

Assembly of long reads with repeat graphs

optional arguments:
  -h, --help            show this help message and exit
  --pacbio-raw path [path ...]
                        PacBio regular CLR reads (<20% error)
  --pacbio-corr path [path ...]
                        PacBio reads that were corrected with other methods (<3% error)
  --pacbio-hifi path [path ...]
                        PacBio HiFi reads (<1% error)
  --nano-raw path [path ...]
                        ONT regular reads, pre-Guppy5 (<20% error)
  --nano-corr path [path ...]
                        ONT reads that were corrected with other methods (<3% error)
  --nano-hq path [path ...]
                        ONT high-quality reads: Guppy5+ SUP or Q20 (<5% error)
  --subassemblies path [path ...]
                        [deprecated] high-quality contigs input
  -g size, --genome-size size
                        estimated genome size (for example, 5m or 2.6g)
  -o path, --out-dir path
                        Output directory
  -t int, --threads int
                        number of parallel threads [1]
  -i int, --iterations int
                        number of polishing iterations [1]
  -m int, --min-overlap int
                        minimum overlap between reads [auto]
  --asm-coverage int    reduced coverage for initial disjointig assembly [not set]
  --hifi-error float    [deprecated] same as --read-error
  --read-error float    adjust parameters for given read error rate (as fraction e.g. 0.03)
  --extra-params extra_params
                        extra configuration parameters list (comma-separated)
  --plasmids            unused (retained for backward compatibility)
  --meta                metagenome / uneven coverage mode
  --keep-haplotypes     do not collapse alternative haplotypes
  --no-alt-contigs      do not output contigs representing alternative haplotypes
  --scaffold            enable scaffolding using graph [disabled by default]
  --trestle             [deprecated] enable Trestle [disabled by default]
  --polish-target path  run polisher on the target sequence
  --resume              resume from the last completed stage
  --resume-from stage_name
                        resume from a custom stage
  --stop-after stage_name
                        stop after the specified stage completed
  --debug               enable debug output
  -v, --version         show program's version number and exit

Input reads can be in FASTA or FASTQ format, uncompressed
or compressed with gz. Currently, PacBio (CLR, HiFi, corrected)
and ONT reads (regular, HQ, corrected) are supported. Expected error rates are
<15% for PB CLR/regular ONT; <5% for ONT HQ, <3% for corrected, and <1% for HiFi. Note that Flye
was primarily developed to run on uncorrected reads. You may specify multiple
files with reads (separated by spaces). Mixing different read
types is not yet supported. The --meta option enables the mode
for metagenome/uneven coverage assembly.

To reduce memory consumption for large genome assemblies,
you can use a subset of the longest reads for initial disjointig
assembly by specifying --asm-coverage and --genome-size options. Typically,
40x coverage is enough to produce good disjointigs.

You can run Flye polisher as a standalone tool using
--polish-target option.
command help for BUSCO
usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]

Welcome to BUSCO 5.3.2: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide. Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO

optional arguments:
  -i SEQUENCE_FILE, --in SEQUENCE_FILE
                        Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.
  -o OUTPUT, --out OUTPUT
                        Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. The path to the output folder is set with --out_path.
  -m MODE, --mode MODE  Specify which BUSCO analysis mode to run.
                        There are three valid modes:
                        - geno or genome, for genome assemblies (DNA)
                        - tran or transcriptome, for transcriptome assemblies (DNA)
                        - prot or proteins, for annotated gene sets (protein)
  -l LINEAGE, --lineage_dataset LINEAGE
                        Specify the name of the BUSCO lineage to be used.
  --augustus            Use augustus gene predictor for eukaryote runs
  --augustus_parameters --PARAM1=VALUE1,--PARAM2=VALUE2
                        Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  --augustus_species AUGUSTUS_SPECIES
                        Specify a species for Augustus training.
  --auto-lineage        Run auto-lineage to find optimum lineage path
  --auto-lineage-euk    Run auto-placement just on eukaryote tree to find optimum lineage path
  --auto-lineage-prok   Run auto-lineage just on non-eukaryote trees to find optimum lineage path
  -c N, --cpu N         Specify the number (N=integer) of threads/cores to use.
  --config CONFIG_FILE  Provide a config file
  --datasets_version DATASETS_VERSION
                        Specify the version of BUSCO datasets, e.g. odb10
  --download [dataset ...]
                        Download dataset. Possible values are a specific dataset name, "all", "prokaryota", "eukaryota", or "virus". If used together with other command line arguments, make sure to place this last.
  --download_base_url DOWNLOAD_BASE_URL
                        Set the url to the remote BUSCO dataset location
  --download_path DOWNLOAD_PATH
                        Specify local filepath for storing BUSCO dataset downloads
  -e N, --evalue N      E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
  -f, --force           Force rewriting of existing files. Must be used when output files with the provided name already exist.
  -h, --help            Show this help message and exit
  --limit N             How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
  --list-datasets       Print the list of available BUSCO datasets
  --long                Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms
  --metaeuk_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
                        Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  --metaeuk_rerun_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
                        Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  --offline             To indicate that BUSCO cannot attempt to download files
  --out_path OUTPUT_PATH
                        Optional location for results folder, excluding results folder name. Default is current working directory.
  -q, --quiet           Disable the info logs, displays only errors
  -r, --restart         Continue a run that had already partially completed.
  --tar                 Compress some subdirectories with many files to save space
  --update-data         Download and replace with last versions all lineages datasets and files necessary to their automated selection
  -v, --version         Show this version and exit
command help for funannotate
Usage:       funannotate <command> <arguments>
version:     1.8.14

Description: Funannotate is a genome prediction, annotation, and comparison pipeline.

Commands:
  clean       Find/remove small repetitive contigs
  sort        Sort by size and rename contig headers
  mask        Repeatmask genome assembly

  train       RNA-seq mediated training of Augustus/GeneMark
  predict     Run gene prediction pipeline
  fix         Fix annotation errors (generate new GenBank file)
  update      RNA-seq/PASA mediated gene model refinement
  remote      Partial functional annotation using remote servers
  iprscan     InterProScan5 search (Docker or local)
  annotate    Assign functional annotation to gene predictions
  compare     Compare funannotated genomes

  util        Format conversion and misc utilities
  setup       Setup/Install databases
  test        Download/Run funannotate installation tests
  check       Check Python, Perl, and External dependencies [--show-versions]
  species     list pre-trained Augustus species
  database    Manage databases
  outgroups   Manage outgroups for funannotate compare

Written by Jon Palmer (2016-2019) [email protected]
command help for funannotate predict
Usage:       funannotate predict <arguments>
version:     1.8.14

Description: Script takes genome multi-fasta file and a variety of inputs to do a comprehensive whole
             genome gene prediction.  Uses AUGUSTUS, GeneMark, Snap, GlimmerHMM, BUSCO, EVidence Modeler,
             tbl2asn, tRNAScan-SE, Exonerate, minimap2.
Required:
  -i, --input              Genome multi-FASTA file (softmasked repeats)
  -o, --out                Output folder name
  -s, --species            Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"

Optional:
  -p, --parameters         Ab intio parameters JSON file to use for gene predictors
  --isolate                Isolate name, e.g. Af293
  --strain                 Strain name, e.g. FGSCA4
  --name                   Locus tag name (assigned by NCBI?). Default: FUN_
  --numbering              Specify where gene numbering starts. Default: 1
  --maker_gff              MAKER2 GFF file. Parse results directly to EVM.
  --pasa_gff               PASA generated gene models. filename:weight
  --other_gff              Annotation pass-through to EVM. filename:weight
  --rna_bam                RNA-seq mapped to genome to train Augustus/GeneMark-ET
  --stringtie              StringTie GTF result
  -w, --weights            Ab-initio predictor and EVM weight. Example: augustus:2 or pasa:10
  --augustus_species       Augustus species config. Default: uses species name
  --min_training_models    Minimum number of models to train Augustus. Default: 200
  --genemark_mode          GeneMark mode. Default: ES [ES,ET]
  --genemark_mod           GeneMark ini mod file
  --busco_seed_species     Augustus pre-trained species to start BUSCO. Default: anidulans
  --optimize_augustus      Run 'optimze_augustus.pl' to refine training (long runtime)
  --busco_db               BUSCO models. Default: dikarya. `funannotate outgroups --show_buscos`
  --organism               Fungal-specific options. Default: fungus. [fungus,other]
  --ploidy                 Ploidy of assembly. Default: 1
  -t, --tbl2asn            Assembly parameters for tbl2asn. Default: "-l paired-ends"
  -d, --database           Path to funannotate database. Default: $FUNANNOTATE_DB

  --protein_evidence       Proteins to map to genome (prot1.fa prot2.fa uniprot.fa). Default: uniprot.fa
  --protein_alignments     Pre-computed protein alignments in GFF3 format
  --p2g_pident             Exonerate percent identity. Default: 80
  --p2g_diamond_db         Premade diamond genome database for protein2genome mapping
  --p2g_prefilter          Pre-filter hits software selection. Default: diamond [tblastn]
  --transcript_evidence    mRNA/ESTs to align to genome (trans1.fa ests.fa trinity.fa). Default: none
  --transcript_alignments  Pre-computed transcript alignments in GFF3 format
  --augustus_gff           Pre-computed AUGUSTUS GFF3 results (must use --stopCodonExcludedFromCDS=False)
  --genemark_gtf           Pre-computed GeneMark GTF results
  --trnascan               Pre-computed tRNAscanSE results

  --min_intronlen          Minimum intron length. Default: 10
  --max_intronlen          Maximum intron length. Default: 3000
  --soft_mask              Softmasked length threshold for GeneMark. Default: 2000
  --min_protlen            Minimum protein length. Default: 50
  --repeats2evm            Use repeats in EVM consensus model building
  --keep_evm               Keep existing EVM results (for rerunning pipeline)
  --evm-partition-interval Min length between genes to make a partition: Default: 1500
  --no-evm-partitions      Do not split contigs into partitions
  --repeat_filter          Repetitive gene model filtering. Default: overlap blast [overlap,blast,none]
  --keep_no_stops          Keep gene models without valid stops
  --SeqCenter              Sequencing facilty for NCBI tbl file. Default: CFMR
  --SeqAccession           Sequence accession number for NCBI tbl file. Default: 12345
  --force                  Annotated unmasked genome
  --cpus                   Number of CPUs to use. Default: 2
  --no-progress            Do not print progress to stdout for long sub jobs
  --tmpdir                 Volume/location to write temporary files. Default: /tmp
  --header_length          Maximum length of FASTA headers. Default: 16

ENV Vars:  If not specified at runtime, will be loaded from your $PATH
  --EVM_HOME
  --AUGUSTUS_CONFIG_PATH
  --GENEMARK_PATH
  --BAMTOOLS_PATH
command help for funannotate annotate
Usage:       funannotate annotate <arguments>
version:     1.8.14

Description: Script functionally annotates the results from funannotate predict.  It pulls
             annotation from PFAM, InterPro, EggNog, UniProtKB, MEROPS, CAZyme, and GO ontology.

Required:
  -i, --input          Folder from funannotate predict
    or
  --genbank            Genome in GenBank format
  -o, --out            Output folder for results
    or
  --gff                Genome GFF3 annotation file
  --fasta              Genome in multi-fasta format
  -s, --species        Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
  -o, --out            Output folder for results

Optional:
  --sbt                NCBI submission template file. (Recommended)
  -a, --annotations    Custom annotations (3 column tsv file)
  -m, --mito-pass-thru Mitochondrial genome/contigs. append with :mcode
  --eggnog             Eggnog-mapper annotations file (if NOT installed)
  --antismash          antiSMASH secondary metabolism results (GBK file from output)
  --iprscan            InterProScan5 XML file
  --phobius            Phobius pre-computed results (if phobius NOT installed)
  --signalp            SignalP pre-computed results (-org euk -format short)
  --isolate            Isolate name
  --strain             Strain name
  --rename             Rename GFF gene models with locus_tag from NCBI.
  --fix                Gene/Product names fixed (TSV: GeneID    Name    Product)
  --remove             Gene/Product names to remove (TSV: Gene  Product)
  --busco_db           BUSCO models. Default: dikarya
  -t, --tbl2asn        Additional parameters for tbl2asn. Default: "-l paired-ends"
  -d, --database       Path to funannotate database. Default: $FUNANNOTATE_DB
  --force              Force over-write of output folder
  --cpus               Number of CPUs to use. Default: 2
  --tmpdir             Volume/location to write temporary files. Default: /tmp
  --p2g                protein2genome pre-computed results
  --header_length      Maximum length of FASTA headers. Default: 16
  --no-progress        Do not print progress to stdout for long sub jobs
command help for interproscan.sh
06/10/2022 17:33:47:284 Welcome to InterProScan-5.57-90.0
06/10/2022 17:33:47:285 Running InterProScan v5 in STANDALONE mode... on Linux
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M
            -Xmx2048M -jar interproscan-5.jar


Please give us your feedback by sending an email to

[email protected]

 -appl,--applications <ANALYSES>                           Optional, comma separated list of analyses.  If this option
                                                           is not set, ALL analyses will be run.
 -b,--output-file-base <OUTPUT-FILE-BASE>                  Optional, base output filename (relative or absolute path).
                                                           Note that this option, the --output-dir (-d) option and the
                                                           --outfile (-o) option are mutually exclusive.  The
                                                           appropriate file extension for the output format(s) will be
                                                           appended automatically. By default the input file path/name
                                                           will be used.
 -cpu,--cpu <CPU>                                          Optional, number of cores for inteproscan.
 -d,--output-dir <OUTPUT-DIR>                              Optional, output directory.  Note that this option, the
                                                           --outfile (-o) option and the --output-file-base (-b) option
                                                           are mutually exclusive. The output filename(s) are the same
                                                           as the input filename, with the appropriate file extension(s)
                                                           for the output format(s) appended automatically .
 -dp,--disable-precalc                                     Optional.  Disables use of the precalculated match lookup
                                                           service.  All match calculations will be run locally.
 -dra,--disable-residue-annot                              Optional, excludes sites from the XML, JSON output
 -etra,--enable-tsv-residue-annot                          Optional, includes sites in TSV output
 -exclappl,--excl-applications <EXC-ANALYSES>              Optional, comma separated list of analyses you want to
                                                           exclude.
 -f,--formats <OUTPUT-FORMATS>                             Optional, case-insensitive, comma separated list of output
                                                           formats. Supported formats are TSV, XML, JSON, and GFF3.
                                                           Default for protein sequences are TSV, XML and GFF3, or for
                                                           nucleotide sequences GFF3 and XML.
 -goterms,--goterms                                        Optional, switch on lookup of corresponding Gene Ontology
                                                           annotation (IMPLIES -iprlookup option)
 -help,--help                                              Optional, display help information
 -i,--input <INPUT-FILE-PATH>                              Optional, path to fasta file that should be loaded on Master
                                                           startup. Alternatively, in CONVERT mode, the InterProScan 5
                                                           XML file to convert.
 -incldepappl,--incl-dep-applications <INC-DEP-ANALYSES>   Optional, comma separated list of deprecated analyses that
                                                           you want included.  If this option is not set, deprecated
                                                           analyses will not run.
 -iprlookup,--iprlookup                                    Also include lookup of corresponding InterPro annotation in
                                                           the TSV and GFF3 output formats.
 -ms,--minsize <MINIMUM-SIZE>                              Optional, minimum nucleotide size of ORF to report. Will only
                                                           be considered if n is specified as a sequence type. Please be
                                                           aware of the fact that if you specify a too short value it
                                                           might be that the analysis takes a very long time!
 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>                   Optional explicit output file name (relative or absolute
                                                           path).  Note that this option, the --output-dir (-d) option
                                                           and the --output-file-base (-b) option are mutually
                                                           exclusive. If this option is given, you MUST specify a single
                                                           output format using the -f option.  The output file name will
                                                           not be modified. Note that specifying an output file name
                                                           using this option OVERWRITES ANY EXISTING FILE.
 -pa,--pathways                                            Optional, switch on lookup of corresponding Pathway
                                                           annotation (IMPLIES -iprlookup option)
 -t,--seqtype <SEQUENCE-TYPE>                              Optional, the type of the input sequences (dna/rna (n) or
                                                           protein (p)).  The default sequence type is protein.
 -T,--tempdir <TEMP-DIR>                                   Optional, specify temporary file directory (relative or
                                                           absolute path). The default location is temp/.
 -verbose,--verbose                                        Optional, display more verbose log output
 -version,--version                                        Optional, display version number
 -vl,--verbose-level <VERBOSE-LEVEL>                       Optional, display verbose log output at level specified.
 -vtsv,--output-tsv-version                                Optional, includes a TSV version file along with any TSV
                                                           output (when TSV output requested)
Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan
software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.

Available analyses:
                      TIGRFAM (15.0) : TIGRFAMs are protein families based on hidden Markov models (HMMs).
                         SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
                      Phobius (1.01) : A combined transmembrane topology and signal peptide predictor.
        SignalP_GRAM_NEGATIVE (4.1) : SignalP (gram-negative) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-negative prokaryotes.
                  SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
                      PANTHER (15.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
                       Gene3D (4.3.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
                        Hamap (2021_04) : High-quality Automated and Manual Annotation of Microbial Proteomes.
              ProSiteProfiles (2022_01) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
                        Coils (2.2.1) : Prediction of coiled coil regions in proteins.
                        SMART (7.1) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs). 
                          CDD (3.18) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
                       PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
                        PIRSR (2021_05) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
              ProSitePatterns (2022_01) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
                      AntiFam (7.0) : AntiFam is a resource of profile-HMMs designed to identify spurious protein predictions.
                  SignalP_EUK (4.1) : SignalP (eukaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes.
                         Pfam (35.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
                   MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
        SignalP_GRAM_POSITIVE (4.1) : SignalP (gram-positive) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-positive prokaryotes.
                        PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
                        TMHMM (2.0c) : Prediction of transmembrane helices in proteins.
command help for eggnog
usage: emapper.py [-h] [-v] [--list_taxa] [--cpu NUM_CPU] [--mp_start_method {fork,spawn,forkserver}] [--resume] [--override] [-i FASTA_FILE] [--itype {CDS,proteins,genome,metagenome}] [--translate]
                  [--annotate_hits_table SEED_ORTHOLOGS_FILE] [-c FILE] [--data_dir DIR] [--genepred {search,prodigal}] [--trans_table TRANS_TABLE_CODE] [--training_genome FILE]
                  [--training_file FILE] [--allow_overlaps {none,strand,diff_frame,all}] [--overlap_tol FLOAT] [-m {diamond,mmseqs,hmmer,no_search,cache}] [--pident PIDENT]
                  [--query_cover QUERY_COVER] [--subject_cover SUBJECT_COVER] [--evalue EVALUE] [--score SCORE] [--dmnd_algo {auto,0,1,ctg}] [--dmnd_db DMND_DB_FILE]
                  [--sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}] [--dmnd_iterate {yes,no}]
                  [--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}] [--dmnd_frameshift DMND_FRAMESHIFT] [--gapopen GAPOPEN] [--gapextend GAPEXTEND]
                  [--block_size BLOCK_SIZE] [--index_chunks CHUNKS] [--outfmt_short] [--dmnd_ignore_warnings] [--mmseqs_db MMSEQS_DB_FILE] [--start_sens START_SENS] [--sens_steps SENS_STEPS]
                  [--final_sens FINAL_SENS] [--mmseqs_sub_mat SUBS_MATRIX] [-d HMMER_DB_PREFIX] [--servers_list FILE] [--qtype {hmm,seq}] [--dbtype {hmmdb,seqdb}] [--usemem] [-p PORT]
                  [--end_port PORT] [--num_servers NUM_SERVERS] [--num_workers NUM_WORKERS] [--hmm_maxhits MAXHITS] [--report_no_hits] [--hmm_maxseqlen MAXSEQLEN] [--Z DB_SIZE] [--cut_ga]
                  [--clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans] [--no_annot] [--dbmem] [--seed_ortholog_evalue MIN_E-VALUE] [--seed_ortholog_score MIN_SCORE] [--tax_scope TAX_SCOPE]
                  [--tax_scope_mode TAX_SCOPE_MODE] [--target_orthologs {one2one,many2one,one2many,many2many,all}] [--target_taxa LIST_OF_TAX_IDS] [--excluded_taxa LIST_OF_TAX_IDS]
                  [--report_orthologs] [--go_evidence {experimental,non-electronic,all}] [--pfam_realign {none,realign,denovo}] [--md5] [--output FILE_PREFIX] [--output_dir DIR] [--scratch_dir DIR]
                  [--temp_dir DIR] [--no_file_comments] [--decorate_gff DECORATE_GFF] [--decorate_gff_ID_field DECORATE_GFF_ID_FIELD] [--excel]

options:
  -h, --help            show this help message and exit
  -v, --version         show version and exit. (default: False)
  --list_taxa           List taxa available for --tax_scope/--tax_scope_mode, and exit (default: False)

Execution Options:
  --cpu NUM_CPU         Number of CPUs to be used. --cpu 0 to run with all available CPUs. (default: 1)
  --mp_start_method {fork,spawn,forkserver}
                        Sets the python multiprocessing start method. Check https://docs.python.org/3/library/multiprocessing.html. Only use if the default method is not working properly in your OS.
                        (default: spawn)
  --resume              Resumes a previous emapper run, skipping results in existing output files. (default: False)
  --override            Overwrites output files if they exist. By default, execution is aborted if conflicting files are detected. (default: False)

Input Data Options:
  -i FASTA_FILE         Input FASTA file containing query sequences (proteins by default; see --itype and --translate). Required unless -m no_search. (default: None)
  --itype {CDS,proteins,genome,metagenome}
                        Type of data in the input (-i) file. (default: proteins)
  --translate           When --itype CDS, translate CDS to proteins before search. When --itype genome/metagenome and --genepred search, translate predicted CDS from blastx hits to proteins.
                        (default: False)
  --annotate_hits_table SEED_ORTHOLOGS_FILE
                        Annotate TSV formatted table with 4 fields: query, hit, evalue, score. Usually, a .seed_orthologs file from a previous emapper.py run. Requires -m no_search. (default: None)
  -c FILE, --cache FILE
                        File containing annotations and md5 hashes of queries, to be used as cache. Required if -m cache (default: None)
  --data_dir DIR        Path to eggnog-mapper databases. By default, "data/" or the path specified in the environment variable EGGNOG_DATA_DIR. (default: None)

Gene Prediction Options:
  --genepred {search,prodigal}
                        This is applied when --itype genome or --itype metagenome. search: gene prediction is inferred from Diamond/MMseqs2 blastx hits. prodigal: gene prediction is performed using
                        Prodigal. (default: search)
  --trans_table TRANS_TABLE_CODE
                        This option will be used for Prodigal, Diamond or MMseqs2, depending on the mode. For Diamond searches, this option corresponds to the --query-gencode option. For MMseqs2
                        searches, this option corresponds to the --translation-table option. For Prodigal, this option corresponds to the -g/--trans_table option. It is also used when --translate,
                        check https://biopython.org/docs/1.75/api/Bio.Seq.html#Bio.Seq.Seq.translate. Default is the corresponding programs defaults. (default: None)
  --training_genome FILE
                        A genome to run Prodigal in training mode. If this parameter is used, Prodigal will run in two steps: firstly in training mode, and secondly using the training to analize the
                        emapper input data. See Prodigal documentation about Traning mode for more info. Only used if --genepred prodigal. (default: None)
  --training_file FILE  A training file from Prodigal training mode. If this parameter is used, Prodigal will run using this training file to analyze the emapper input data. Only used if --genepred
                        prodigal. (default: None)
  --allow_overlaps {none,strand,diff_frame,all}
                        When using 'blastx'-based genepred (--genepred search --itype genome/metagenome) this option controls whether overlapping hits are reported or not, or if only those
                        overlapping hits in a different strand or frame are reported. Also, the degree of accepted overlap can be controlled with --overlap_tol. (default: none)
  --overlap_tol FLOAT   This value (0-1) is the proportion such that if (overlap size / hit length) > overlap_tol, hits are considered to overlap. e.g: if overlap_tol is 0.0, any overlap is
                        considered as such. e.g: if overlap_tol is 1.0, one of the hits must overlap entirely to consider that hits do overlap. (default: 0.0)

Search Options:
  -m {diamond,mmseqs,hmmer,no_search,cache}
                        diamond: search seed orthologs using diamond (-i is required). mmseqs: search seed orthologs using MMseqs2 (-i is required). hmmer: search seed orthologs using HMMER. (-i is
                        required). no_search: skip seed orthologs search (--annotate_hits_table is required, unless --no_annot). cache: skip seed orthologs search and annotate based on cached results
                        (-i and -c are required). (default: diamond)

Search filtering common options:
  --pident PIDENT       Report only alignments above or equal to the given percentage of identity (0-100).No effect if -m hmmer. (default: None)
  --query_cover QUERY_COVER
                        Report only alignments above or equal the given percentage of query cover (0-100).No effect if -m hmmer. (default: None)
  --subject_cover SUBJECT_COVER
                        Report only alignments above or equal the given percentage of subject cover (0-100).No effect if -m hmmer. (default: None)
  --evalue EVALUE       Report only alignments below or equal the e-value threshold. (default: 0.001)
  --score SCORE         Report only alignments above or equal the score threshold. (default: None)

Diamond Search Options:
  --dmnd_algo {auto,0,1,ctg}
                        Diamond's --algo option, which can be tuned to search small query sets. By default, it is adjusted automatically. However, the ctg option should be activated manually. If you
                        plan to search a small input set of sequences, use --dmnd_algo ctg to make it faster. (default: auto)
  --dmnd_db DMND_DB_FILE
                        Path to DIAMOND-compatible database (default: None)
  --sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}
                        Diamond's sensitivity mode. Note that emapper's default is sensitive, which is different from diamond's default, which can be activated with --sensmode default. (default:
                        sensitive)
  --dmnd_iterate {yes,no}
                        --dmnd_iterate yes --> activates the --iterate option of diamond for iterative searches, from faster, less sensitive modes, up to the sensitivity specified with --sensmode.
                        Available since diamond 2.0.11. --dmnd_iterate no --> disables the --iterate mode. (default: yes)
  --matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}
                        Scoring matrix (default: None)
  --dmnd_frameshift DMND_FRAMESHIFT
                        Diamond --frameshift/-F option. Not used by default. Recommended by diamond: 15. (default: None)
  --gapopen GAPOPEN     Gap open penalty (default: None)
  --gapextend GAPEXTEND
                        Gap extend penalty (default: None)
  --block_size BLOCK_SIZE
                        Diamond -b/--block-size option. Default is the diamond's default. (default: None)
  --index_chunks CHUNKS
                        Diamond -c/--index-chunks option. Default is the diamond's default. (default: None)
  --outfmt_short        Diamond output will include only qseqid sseqid evalue and score. This could help obtain better performance, if also no --pident, --query_cover or --subject_cover thresholds
                        are used. This option is ignored when the diamond search is run in blastx mode for gene prediction (see --genepred). (default: False)
  --dmnd_ignore_warnings
                        Diamond --ignore-warnings option. It avoids Diamond stopping due to warnings (e.g. when a protein contains only ATGC symbols. (default: False)

MMseqs2 Search Options:
  --mmseqs_db MMSEQS_DB_FILE
                        Path to MMseqs2-compatible database (default: None)
  --start_sens START_SENS
                        Starting sensitivity. (default: 3)
  --sens_steps SENS_STEPS
                        Number of sensitivity steps. (default: 3)
  --final_sens FINAL_SENS
                        Final sensititivy step. (default: 7)
  --mmseqs_sub_mat SUBS_MATRIX
                        Matrix to be used for --sub-mat MMseqs2 search option. Default=default used by MMseqs2 (default: None)

HMMER Search Options:
  -d HMMER_DB_PREFIX, --database HMMER_DB_PREFIX
                        specify the target database for sequence searches. Choose among: euk,bact,arch, or a database loaded in a server, db.hmm:host:port (see hmm_server.py) (default: None)
  --servers_list FILE   A FILE with a list of remote hmmpgmd servers. Each row in the file represents a server, in the format 'host:port'. If --servers_list is specified, host and port from -d option
                        will be ignored. (default: None)
  --qtype {hmm,seq}     Type of input data (-i). (default: seq)
  --dbtype {hmmdb,seqdb}
                        Type of data in DB (-db). (default: hmmdb)
  --usemem              Use this option to allocate the whole database (-d) in memory using hmmpgmd. If --dbtype hmm, the database must be a hmmpress-ed database. If --dbtype seqdb, the database must
                        be a HMMER-format database created with esl-reformat. Database will be unloaded after execution. Note that this only works for HMMER based searches. To load the eggnog-mapper
                        annotation DB into memory use --dbmem. (default: False)
  -p PORT, --port PORT  Port used to setup HMM server, when --usemem. Also used for --pfam_realign modes. (default: 51700)
  --end_port PORT       Last port to be used to setup HMM server, when --usemem. Also used for --pfam_realign modes. (default: 53200)
  --num_servers NUM_SERVERS
                        When using --usemem, specify the number of servers to fire up.Note that cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign
                        modes. (default: 1)
  --num_workers NUM_WORKERS
                        When using --usemem, specify the number of workers per server (--num_servers) to fire up. By default, cpus specified with --cpu will be distributed among servers and workers.
                        Also used for --pfam_realign modes. (default: 1)
  --hmm_maxhits MAXHITS
                        Max number of hits to report (0 to report all). (default: 1)
  --report_no_hits      Whether queries without hits should be included in the output table. (default: False)
  --hmm_maxseqlen MAXSEQLEN
                        Ignore query sequences larger than `maxseqlen`. (default: 5000)
  --Z DB_SIZE           Fixed database size used in phmmer/hmmscan (allows comparing e-values among databases). (default: 40000000)
  --cut_ga              Adds the --cut_ga to hmmer commands (useful for Pfam mappings, for example). See hmmer documentation. (default: False)
  --clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans
                        Removes those hits which overlap, keeping only the one with best evalue. Use the "all" and "clans" options when performing a hmmscan type search (i.e. domains are in the
                        database). Use the "hmmsearch_all" and "hmmsearch_clans" options when using a hmmsearch type search (i.e. domains are the queries from -i file). The "clans" and
                        "hmmsearch_clans" and options will only have effect for hits to/from Pfam. (default: None)

Annotation Options:
  --no_annot            Skip functional annotation, reporting only hits. (default: False)
  --dbmem               Use this option to allocate the whole eggnog.db DB in memory. Database will be unloaded after execution. (default: False)
  --seed_ortholog_evalue MIN_E-VALUE
                        Min E-value expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. (default: 0.001)
  --seed_ortholog_score MIN_SCORE
                        Min bit score expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. (default: None)
  --tax_scope TAX_SCOPE
                        Fix the taxonomic scope used for annotation, so only speciation events from a particular clade are used for functional transfer. More specifically, the --tax_scope list is
                        intersected with the seed orthologs clades, and the resulting clades are used for annotation based on --tax_scope_mode. Note that those seed orthologs without clades
                        intersecting with --tax_scope will be filtered out, and won't annotated. Possible arguments for --tax_scope are: 1) A path to a file defined by the user, which contains a list
                        of tax IDs and/or tax names. 2) The name of a pre-configured tax scope, whose source is a file stored within the 'eggnogmapper/annotation/tax_scopes/' directory By default,
                        available ones are: 'auto' ('all'), 'auto_broad' ('all_broad'), 'all_narrow', 'archaea', 'bacteria', 'bacteria_broad', 'eukaryota', 'eukaryota_broad' and 'prokaryota_broad'.3)
                        A comma-separated list of taxonomic names and/or taxonomic IDs, sorted by preference. An example of list of tax IDs would be 2759,2157,2,1 for Eukaryota, Archaea, Bacteria and
                        root, in that order of preference. 4) 'none': do not filter out annotations based on taxonomic scope. (default: auto)
  --tax_scope_mode TAX_SCOPE_MODE
                        For a seed ortholog which passed the filter imposed by --tax_scope, --tax_scope_mode controls which specific clade, to which the seed ortholog belongs, will be used for
                        annotation. Options: 1) broadest: use the broadest clade. 2) inner_broadest: use the broadest clade from the intersection with --tax_scope. 3) inner_narrowest: use the
                        narrowest clade from the intersection with --tax_scope. 4) narrowest: use the narrowest clade. 5) A taxonomic scope as in --tax_scope: use this second list to intersect with
                        seed ortholog clades and use the narrowest (as in inner_narrowest) from the intersection to annotate. (default: inner_narrowest)
  --target_orthologs {one2one,many2one,one2many,many2many,all}
                        defines what type of orthologs (in relation to the seed ortholog) should be used for functional transfer (default: all)
  --target_taxa LIST_OF_TAX_IDS
                        Only orthologs from the specified comma-separated list of taxa and all its descendants will be used for annotation transference. By default, all taxa are used. (default: None)
  --excluded_taxa LIST_OF_TAX_IDS
                        Orthologs from the specified comma-separated list of taxa and all its descendants will not be used for annotation transference. By default, no taxa is excluded. (default:
                        None)
  --report_orthologs    Output the list of orthologs found for each query to a .orthologs file (default: False)
  --go_evidence {experimental,non-electronic,all}
                        Defines what type of GO terms should be used for annotation. experimental = Use only terms inferred from experimental evidence. non-electronic = Use only non-electronically
                        curated terms (default: non-electronic)
  --pfam_realign {none,realign,denovo}
                        Realign the queries to the PFAM domains. none = no realignment is performed. PFAM annotation will be that transferred as specify in the --pfam_transfer option. realign =
                        queries will be realigned to the PFAM domains found according to the --pfam_transfer option. denovo = queries will be realigned to the whole PFAM database, ignoring the
                        --pfam_transfer option. Check hmmer options (--num_servers, --num_workers, --port, --end_port) to change how the hmmpgmd server is run. (default: none)
  --md5                 Adds the md5 hash of each query as an additional field in annotations output files. (default: False)

Output options:
  --output FILE_PREFIX, -o FILE_PREFIX
                        base name for output files (default: None)
  --output_dir DIR      Where output files should be written (default: /data/biodata/references/galGal6a/Gheyas_variants)
  --scratch_dir DIR     Write output files in a temporary scratch dir, move them to the final output dir when finished. Speed up large computations using network file systems. (default: None)
  --temp_dir DIR        Where temporary files are created. Better if this is a local disk. (default: /data/biodata/references/galGal6a/Gheyas_variants)
  --no_file_comments    No header lines nor stats are included in the output files (default: False)
  --decorate_gff DECORATE_GFF
                        Add search hits and/or annotation results to GFF file from gene prediction of a user specified one. no = no GFF decoration at all. GFF file from blastx-based gene prediction
                        will be created anyway. yes = add search hits and/or annotations to GFF file from Prodigal or blastx-based gene prediction. FILE = decorate the specified pre-existing GFF
                        FILE. e.g. --decorage_gff myfile.gff You change the field interpreted as ID of the feature with --decorate_gff_ID_field. (default: no)
  --decorate_gff_ID_field DECORATE_GFF_ID_FIELD
                        Change the field used in GFF files as ID of the feature. (default: ID)
  --excel               Output annotations also in .xlsx format. (default: False)

[back-to-top]

⚠️ **GitHub.com Fallback** ⚠️