yeast genome de novo assembly from PacBio HiFi reads using Flye - splaisan/analyses GitHub Wiki
(last edit: 2022-10-04)
[Aim]
- [Raw Data]
- [Filtering reads for a 100x genome coverage depth]
- [QC of the resulting reads]
- [De Novo assembly of the reads using Flye]
- [QC of the assembly and comparison to s288c]
- [Assembly polishing with Pilon]
- [Assembly reference scaffolding with Ragtag]
- [Compare the corrected and scaffolded assembly to the reference]
- [Add gene annotations to the final assembly using Funannotate]
This tutorial reports the de-novo assembly of the Saccharomyces cerevisiae S288C genome using only Sequel-IIe HiFi reads.
''Note: The different steps of the process are largely inspired from existing web-resources and publications. We try to acknowledge them where appropriate but if you feel like a reference is missing, please let us know and it will be amended.''
The steps of this analysis are inspired from a recent benchmarking review by Zhang et al of current assemblers and finishing tools. We selected Flye for assembling the HiFi reads as it combines a good performance, good N50 contig size without too much mis-assemblies (as seen with for instance next-denovo) and ease of use.
In order to compare the obtained assembly to the Saccharomyces cerevisiae s288c reference genome, both sequence and annotations were obtained for the subject from the EnsEMBL site.
(base URL: http://ftp.ensembl.org/pub/)
- genome: current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
- proteins: current_fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz
- annotations: current_gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.107.gff3.gz
- known variants: current_variation/vcf/saccharomyces_cerevisiae/saccharomyces_cerevisiae.vcf.gz
mkdir -p reference
baseurl="http://ftp.ensembl.org/pub/"
wget -P reference ${baseurl}/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
wget -P reference ${baseurl}/current_fasta/saccharomyces_cerevisiae/pep/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz
wget -P reference ${baseurl}/current_gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.107.gff3.gz
wget -P reference ${baseurl}/current_variation/vcf/saccharomyces_cerevisiae/saccharomyces_cerevisiae.vcf.gz
# create decompressed versions with easier names
gunzip -c reference/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz \
> reference/Sc.R64-1-1.fa
gunzip -c reference/Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz \
> reference/Sc.R64-1-1_proteins.fa
gunzip -c reference/Saccharomyces_cerevisiae.R64-1-1.107.gff3.gz \
> reference/Sc.R64-1-1.107.gff3
Alt using the NCBI yeast reference
The NCBI data for yeast can be preferred and is found the following location.
(base URL: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/)
- genome: GCF_000146045.2_R64_genomic.fna.gz
- proteins: GCF_000146045.2_R64_protein.faa.gz
- annotations: GCF_000146045.2_R64_genomic.gff.gz
mkdir -p reference
baseurl="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/"
wget -P reference ${baseurl}/GCF_000146045.2_R64_genomic.fna.gz
wget -P reference ${baseurl}/GCF_000146045.2_R64_protein.faa.gz
wget -P reference ${baseurl}/GCF_000146045.2_R64_genomic.gff.gz
These files will be used at later stages for comparison.
The workflow is detailed in the next sections and commands are added at each step, considering that the user is working in a writable project folder, has sufficient rights to create folders and save files, and has installed the necessary programs (not detailed here). For a number of resources, conda install was done and should be reproducible on the user computer provided miniconda has been installed and configured correctly.
We provide a conda config file yeast_asm_environment.yml which should allow you to reproduce the environment used in this report. Since this env has been built sequentially, it is possible that some packages refuse to install, in which case you can create a single env for refractory tools.
The SRR18210286 reads used in this tutorial was obtained from the SRA project PRJNA792930.

--
The SRA data was downloaded and converted to Fastq using the NCBI SRA tools (version 2.8.0).
# sra_tools installed with conda
mkdir -p sra rawreads
sraid=SRR18210286
thr=8
prefetch -O sra ${sraid}
# convert to fastq with sra-tools version 2.x
fastq-dump -O rawreads -origfmt -I --split-files --gzip ${sraid}
# NOTE with the more recent 3.0 version of the tools
# fasterq-dump replaced fastq-dump and is much faster indeed
fasterq-dump -e ${thr} -O rawreads -S sra/${sraid}/${sraid}.sra && \
bgzip -@ ${thr} rawreads/${sraid}.fastq
# resulting in a file: rawreads/SRR18210286.fastq.gz
A subset of 100x coverage was extracted from the big raw reads file (380x coverage) with filterlong (version 0.2.1) in order to have sufficient (yet reasonable) data for assembly. During filtering, shorter than 10kb reads were excluded and only the 95% best average quality reads were retained.
# installed with conda
# >10kb
# 100x depth for S288c (12Mb)
# remove worse 5% by quality
mkdir -p reads
filtlong \
--min_length 10000 \
--target_bases 1200000000 \
--keep_percent 95 \
rawreads/SRR18210286.fastq.gz | \
bgzip -c > reads/SRR18210286_HiFi_gt10k_100x.fq.gz
# filterlong analyzed 227,298 reads (4,583,595,361 bp) in the input file and kept 50'892 reads (1,200,003,122 bp)
The reads were analyzed using FastQC (version 0.11.9)
mkdir -p reads/fastqc_results
fastqc -o reads/fastqc_results reads/SRR18210286_HiFi_gt10k_100x.fq.gz

The size distribution of the reads shows a N50 / N90 of 23582 bp / 19106 bp, corresponding to a very good quality HiFi library.
The filtered HiFi read subsets were assembled using Flye (version 2.9.1-b1780), resulting in a multifasta files containing contigs for the host genome and for potential extra-chromosomal plasmids.
The initial step is using the longest 40x reads for contig assembly and all reads are then used for later processes.
# Flye installed using conda
mkdir -p 100x_Flye_assembly
reads="reads/SRR18210286_HiFi_gt10k_100x.fq.gz"
thr=24
flye --pacbio-hifi \
${reads} \
--out-dir 100x_Flye_assembly \
--genome-size 12m \
--threads ${thr} \
--asm-coverage 40
Flye reported the following metrics:
- Total length: 12881856
- Fragments: 60
- Fragments N50: 724534
- Largest frg: 1472942
- Scaffolds: 0
- Mean coverage: 90
The Flye assembly produced several important files among which:
- a summary of the assembly: assembly_info.txt (reproduced next)
- the assembly in Fasta format: assembly.fasta
- an assembly graph: assembly_graph.gfa
only contigs larger than 100kb are listed here
| seq_name | length | cov. | circ. | repeat | mult. | alt_group | graph_path |
|---|---|---|---|---|---|---|---|
| contig_30 | 1472942 | 96 | N | N | 1 | * | ,30, |
| contig_33 | 1060609 | 98 | N | N | 1 | * | ,33, |
| contig_100 | 902343 | 97 | N | N | 1 | * | ,100, |
| contig_11 | 866978 | 99 | N | N | 1 | * | *,11 |
| contig_99 | 834891 | 99 | N | N | 1 | * | ,-6,99, |
| contig_105 | 780267 | 97 | N | N | 1 | * | ,105, |
| contig_7 | 724534 | 95 | N | N | 1 | * | *,-63,7 |
| contig_103 | 675125 | 100 | N | N | 1 | * | ,103, |
| contig_106 | 624754 | 96 | N | N | 1 | * | 106,* |
| contig_98 | 590312 | 97 | N | N | 1 | * | ,98, |
| contig_102 | 564595 | 96 | N | N | 1 | * | 102,* |
| contig_2 | 509594 | 93 | N | N | 1 | * | 2,6,* |
| contig_91 | 477797 | 96 | N | N | 1 | * | *,91,-20,-101,-20,-101,-20,-101,-20,-101 |
| contig_42 | 432189 | 98 | N | N | 1 | * | ,42, |
| contig_104 | 369953 | 95 | N | N | 1 | * | *,-63,104 |
| contig_8 | 287696 | 97 | N | N | 1 | * | ,8, |
| contig_72 | 199833 | 87 | N | N | 1 | * | 72,* |
| contig_23 | 133434 | 106 | N | N | 1 | * | *,23 |
| contig_69 | 108934 | 33 | N | N | 1 | * | *,67,69 |
| contig_68 | 108928 | 28 | N | N | 1 | * | *,67,68 |
The Flye assembly graphs can be converted to a plot with Bandage (version 0.8.1) which illustrate the linear or circular character of the different contigs and their relationships.
# bandage installed with conda
gfa="100x_Flye_assembly/assembly_graph.gfa"
Bandage image \
${gfa} \
100x_Flye_assembly/s288c_bandage_plot.png \
--names \
--depth \
--lengths \
--fontsize 6 \
--edgelen 50 \
--toutline 2 \
--minnodlen 50 \
--colour blastsolid

Quast analysis of the assembly
Quast (version 5.2.0) was used to evaluate the assembly and compare it to the reference genome
# quast installed with conda
mkdir -p assembly_QC
cp reference/Sc.R64-1-1.fa assembly_QC/s288c.fa
cp reference/Sc.R64-1-1.107.gff3 assembly_QC/s288c.gff
cp 100x_Flye_assembly/assembly.fasta assembly_QC/Flye_denovo_100x.fa
cd assembly_QC
thr=8
quast -o quast_out \
-r s288c.fa \
-g s288c.gff \
-t ${thr} \
--min-identity=85 \
-l "s288c,Flye_denovo_100x" \
s288c.fa \
Flye_denovo_100x.fa
cd ../

pairwise alignment of the de-novo and reference genomes
Both genomes were aligned using Mummer4 (version 4.0.0rc1) and a bi-plot produced from the results.
# mummer4 installed with conda
# we used here a custom script mummer_biplot.sh that can be found on our wiki:
# https://github.com/Nucleomics-VIB/ngs-tools/blob/master/mummer_biplot.sh
mummer_biplot.sh \
-o assembly_QC/denovo_vs_s288c \
-c 100 -I 98 -L 100 -f png -T 8 \
-x assembly_QC/s288c.fa \
-y 100x_Flye_assembly/assembly.fasta && \
mv assembly_vs_s288c_mummer3-log_1664891829.txt assembly_QC/denovo_vs_s288c/

The two assemblies appears as a nice diagonal, suggesting a nice sequence homology and content.
BUSCO analysis of the assembly
BUSCO (version 5.3.2) is traditionally applied to eukaryote assemblies to evaluate the fraction of essential genes present, which is a proxy to assembly completeness.
# BUSCO is installed from a docker image
# from https://busco.ezlab.org/busco_userguide.html#manage-run-parameters-in-config-files
mkdir -p assembly_QC/busco_out
thr=8
sudo docker run -u $(id -u) -v $(pwd):/host_mount -w /host_mount ezlabgva/busco:v5.3.2_cv1 busco \
--augustus \
--augustus_species saccharomyces_cerevisiae_S288C \
--mode genome \
--lineage_dataset saccharomycetes_odb10 \
-o assembly_QC/Flye_busco_out \
-i 100x_Flye_assembly/assembly.fasta \
-c ${thr}
# also run on ref genome
sudo docker run -u $(id -u) -v $(pwd):/host_mount -w /host_mount ezlabgva/busco:v5.3.2_cv1 busco \
--augustus \
--augustus_species saccharomyces_cerevisiae_S288C \
--mode genome \
--lineage_dataset saccharomycetes_odb10 \
-o assembly_QC/ref_busco_out \
-i assembly_QC/s288c.fa \
-c ${thr}
# copy for multi-plot
mkdir -p assembly_QC/busco_plot
cp Flye_busco_out/short_summary.specific.saccharomycetes_odb10.Flye_busco_out.txt \
busco_plot/short_summary.specific.saccharomycetes_odb10.Flye_denovo.txt
cp ref_busco_out/short_summary.specific.saccharomycetes_odb10.ref_busco_out.txt \
busco_plot/short_summary.specific.saccharomycetes_odb10.s288c.txt
# plot
sudo docker run -u $(id -u) -v $PWD:/busco_wd ezlabgva/busco:v5.3.2_cv1 /usr/local/bin/generate_plot.py -wd assembly_QC/busco_plot
BUSCO results for the Flye assembly genome
--------------------------------------------------
|Results from dataset saccharomycetes_odb10 |
--------------------------------------------------
|C:99.4%[S:93.4%,D:6.0%],F:0.0%,M:0.6%,n:2137 |
|2125 Complete BUSCOs (C) |
|1996 Complete and single-copy BUSCOs (S) |
|129 Complete and duplicated BUSCOs (D) |
|0 Fragmented BUSCOs (F) |
|12 Missing BUSCOs (M) |
|2137 Total BUSCO groups searched |
--------------------------------------------------
BUSCO results for the s288c reference genome
--------------------------------------------------
|Results from dataset saccharomycetes_odb10 |
--------------------------------------------------
|C:99.5%[S:97.3%,D:2.2%],F:0.0%,M:0.5%,n:2137 |
|2126 Complete BUSCOs (C) |
|2080 Complete and single-copy BUSCOs (S) |
|46 Complete and duplicated BUSCOs (D) |
|0 Fragmented BUSCOs (F) |
|11 Missing BUSCOs (M) |
|2137 Total BUSCO groups searched |
--------------------------------------------------

The contigs obtained with Flye were further polished with Pilon (version 1.24) Walker et al, 2014.
Note: Although Flye already performs polishing, this extra step introduced a few sequence edits and was performed to comply with state of the art workflows and take advantage from the full sequencing data.
\footnotesize
# pilon installed with conda
# as well as bwa 0.7.17-r1188, sambamba 0.8.2, and samtools 1.15
# install samtools with 'conda install -c conda-forge -c bioconda samtools' to get the right openssl installed
base=$PWD
smpl=SRR18210286
thr=8
thr2=4
outfolder=100x_Flye_assembly_pilon
mkdir -p ${outfolder} && cd ${outfolder}
cp ../100x_Flye_assembly/assembly.fasta flye-asm_${smpl}.fa
bwa index -p flye-asm_${smpl} flye-asm_${smpl}.fa
bwa mem -t ${thr} flye-asm_${smpl} ../reads/SRR18210286_HiFi_gt10k_100x.fq.gz \
| samtools sort -@ ${thr2} -O bam -o align_flye.asm.${smpl}.bam && \
samtools index -@ ${thr2} align_flye.asm.${smpl}.bam
# mark duplicates
sambamba markdup -t ${thr} align_flye.asm.${smpl}.bam align_flye.asm.${smpl}_markdup.bam
# filter high quality mappings for correction
samtools view -@ ${thr2} -b -q 30 align_flye.asm.${smpl}_markdup.bam > align_filter.bam && \
samtools index -@ ${thr2} align_filter.bam
java -Xmx20G -jar $CONDA_PREFIX/share/pilon-1.24-0/pilon.jar \
--genome flye-asm_${smpl}.fa \
--frags align_filter.bam \
--fix snps,indels \
--output flye-polished_${smpl} \
--vcf
cd ${base}
Since denovo assembly contigs bear arbitrary names and often correspond to chromosome fragments (arms), it was decided to scaffold the Flye assemblies using the public Saccharomyces_cerevisiae.R64-1-1 reference as guide and the ragtag tool (version 2.1.0) Alonge et al, 2021.
The obtained scaffolded assemblies bear the official chromosome names where scaffolding succeeded and retain the Flye contig names for non-scaffolded contigs.
# ragtag installed with conda
base=$PWD
smpl=SRR18210286
thr=8
# since busco seem to take two threads per sample
thr2=4
######################
# RagTag scaffolding
######################
outfolder=100x_Flye_assembly_ragtag
mkdir -p ${outfolder} && cd ${outfolder}
cp ../100x_Flye_assembly_pilon/flye-polished_${smpl}.fasta flye-polished_${smpl}.fa
asm=flye-polished_${smpl}.fa
reads=../reads/SRR18210286_HiFi_gt10k_100x.fq.gz
ref=../reference/Sc.R64-1-1.fa
minimappath=$(which minimap2)
# correct
echo "# run ragtag correction for ${smpl}"
ragtag.py correct \
${ref} \
${asm} \
-t ${thr} \
--aligner ${minimappath} \
--inter \
-R ${reads} \
-T corr \
-w \
-u \
-o ragtag_${smpl}
# scaffold
echo "# run ragtag scaffolding for ${smpl}"
ragtag.py scaffold \
${ref} \
ragtag_${smpl}/ragtag.correct.fasta \
-t ${thr} \
--aligner ${minimappath} \
-w \
-u \
-o ragtag_${smpl}
# copy final results back
cp ragtag_${smpl}/ragtag.scaffold.fasta ragtag_scaffolds_${smpl}.fa
# rename and replace long fasta headers in ragtag scaffold output
sed -i 's/contig_\(.*\)_pilon_\(.*_.*\)_+/\1_\2/g' ragtag_scaffolds_${smpl}.fa
sed -i 's/_RagTag//g' ragtag_scaffolds_${smpl}.fa
sed -i 's/>contig_/>/g' ragtag_scaffolds_${smpl}.fa
The resulting 'final assembly' is compared to the reference as done with the primary assembly to detect any introduced biases (eg translocations and large misassemblies)
# mummer4, quast, installed with conda
# busco run from docker image
# mummer_biplot.sh from our git: https://github.com/Nucleomics-VIB/ngs-tools/blob/master/mummer_biplot.sh
## run in the ragtag outfolder
smpl=SRR18210286
thr=8
# since busco seem to take two threads per sample
thr2=4
#################
# mummer biplot
#################
conda deactivate
conda activate mummer4
echo "# create assembly biplot for ${smpl}"
mummer_biplot.sh \
-x ${ref} \
-y ragtag_scaffolds_${smpl}.fa \
-o ragtag_scaffolds_${smpl}-vs-R64 \
-t nucmer \
-T ${thr} \
-f png && mv *_mummer3-log_* ragtag_scaffolds_${smpl}-vs-R64/
conda deactivate
#########
# QUAST
#########
conda activate quast
ref=../reference/Sc.R64-1-1.fa
gff3=../reference/Sc.R64-1-1.107.gff3
quast -o quast_${smpl} \
-r ${ref} \
-g ${gff3} \
-t ${thr} \
--min-identity=85 \
-l "s288c, Flye-primary, Flye-final" \
${ref} \
../100x_Flye_assembly/assembly.fasta \
ragtag_scaffolds_${smpl}.fa
conda deactivate
#########
# BUSCO
#########
# from https://busco.ezlab.org/busco_userguide.html#manage-run-parameters-in-config-files
mkdir -p ../BUSCO_summaries_ragtag
mkdir -p busco_${smpl} && cd busco_${smpl}
cp ../ragtag_scaffolds_${smpl}.fa ragtag_asm.fna
sudo docker run -u $(id -u) -v $(pwd):/host_mount -w /host_mount ezlabgva/busco:v5.3.2_cv1 busco \
--augustus \
--augustus_species saccharomyces_cerevisiae_S288C \
--mode genome \
--lineage_dataset saccharomycetes_odb10 \
-o busco_out \
-i ragtag_asm.fna \
-c ${thr2}
# copy for multi-plot
cp busco_out/short_summary.specific.saccharomycetes_odb10.busco_out.txt \
../../BUSCO_summaries_ragtag/short_summary.specific.saccharomycetes_odb10.Flye_final.txt
cp ../../assembly_QC/busco_plot/short_summary.specific.saccharomycetes_* \
../../BUSCO_summaries_ragtag/
# plot
cd ../../BUSCO_summaries_ragtag
sudo docker run -u $(id -u) -v $PWD:/busco_wd ezlabgva/busco:v5.3.2_cv1 /usr/local/bin/generate_plot.py -wd .
pairwise alignment of the final assembly with the reference

Quast results for the primary and final assembly compared to the reference

Busco results for the final assembly
--------------------------------------------------
|Results from dataset saccharomycetes_odb10 |
--------------------------------------------------
|C:99.5%[S:93.5%,D:6.0%],F:0.0%,M:0.5%,n:2137 |
|2127 Complete BUSCOs (C) |
|1998 Complete and single-copy BUSCOs (S) |
|129 Complete and duplicated BUSCOs (D) |
|0 Fragmented BUSCOs (F) |
|10 Missing BUSCOs (M) |
|2137 Total BUSCO groups searched |
--------------------------------------------------
These results are added to the plot with the reference and the primary assembly

In order to easily locate and access yeast genes, the scaffolded genomes were fed to Funannotate (version 1.8.10) Li et al, 2021 which denovo locates and annotates yeast genes using a variety of embedded tools.
Note: One tool was not included in the Funannotate pipeline, namely GeneMark, which license does not allow usage for profit. All other accessible tools (incl. augustus, busco, interproscan and eggnog) were used in the annotation process. More info about Funannotate and included analyses can be found at (https://funannotate.readthedocs.io/en/latest/.
Note: since there is no RNA-seq data for this project, Funannotate annotations based on transcriptome sequencing are also absent.
Notes:
- Funannotate can be installed as detailed on the nextgenusfs github page. We ran Funannotate from a docker image obtained with the command :
docker pull nextgenusfs/funannotate:latest - Eggnog is also required for predictions and is installed using bioconda in a separate environment (emapper-2.1.7 / Expected eggNOG DB version: 5.0.2 / Installed eggNOG DB version: 5.0.2 / Diamond version found: diamond version 2.0.15 / MMseqs2 version found: 13.45111).
- The most current version of Interproscan (5.57-90.0) was installed locally from the latest git release. Interproscan can also installed using conda in a separate environment (but only until version 5.55-88.0).
- Interproscan does not include Phobius, signalP, and TMHMM because of their license models but these were added here thanks to the non-profit nature of this tutorial.
Funannotate run with the final Flye assembly
# Funannotate installed from git
# eggnog installed with conda
# interproscan installed
base=$PWD
smpl=SRR18210286
thr=8
###############
# funannotate
###############
outfolder=100x_Flye_assembly_funannotate
mkdir -p ${outfolder} && cd ${outfolder}
cp ../100x_Flye_assembly_ragtag/ragtag_scaffolds_${smpl}.fa .
# preprocess flye assembly for funannotate
sudo /opt/biotools/funannotate/funannotate-docker clean -i ragtag_scaffolds_${smpl}.fa --minlen 1000 -o flye.genome.cleaned.fa
sudo /opt/biotools/funannotate/funannotate-docker sort -i flye.genome.cleaned.fa -b scaffold -o flye.genome.cleaned.sorted.fa
sudo /opt/biotools/funannotate/funannotate-docker mask -i flye.genome.cleaned.fa --cpus ${thr} -o FlyeScaffoldedAssembly_${smpl}.fa
# run funannotate
sudo /opt/biotools/funannotate/funannotate-docker predict \
-i FlyeScaffoldedAssembly_${smpl}.fa \
-o fun1 \
--species "Saccharomyces cerevisiae" \
--strain ${smpl} \
--organism fungus \
--name "FUN1_" \
--ploidy 1 \
--busco_db "dikarya" \
--busco_seed_species "saccharomyces_cerevisiae_S288C" \
--cpus ${thr}
# accessory annotations, added in the same folder
## interproscan (takes some time)
# run from the local install
mkdir -p iprscan_out
interproscan.sh \
-i fun1/predict_results/Saccharomyces_cerevisiae_${smpl}.proteins.fa \
-f gff3,xml,tsv \
-dp \
-cpu ${thr} \
-d iprscan_out
## list of databases used in the annotation
# [AntiFam-7.0,CDD-3.18,Coils-2.2.1,Gene3D-4.3.0,Hamap-2021_04,MobiDBLite-2.0,PANTHER-15.0,
# Pfam-35.0,Phobius-1.01,PIRSF-3.10,PIRSR-2021_05,PRINTS-42.0,ProSitePatterns-2022_01,
# ProSiteProfiles-2022_01,SFLD-4,SignalP_EUK-4.1,SignalP_GRAM_NEGATIVE-4.1,
# SignalP_GRAM_POSITIVE-4.1,SMART-7.1,SUPERFAMILY-1.75,TIGRFAM-15.0,TMHMM-2.0c]
## eggnog-mapper
# activate corresponding conda env
mkdir -p eggnog_out
emapper.py \
-i fun1/predict_results/Saccharomyces_cerevisiae_${smpl}.proteins.fa \
--itype proteins \
--output_dir eggnog_out \
-o eggnog \
--temp_dir . \
--scratch_dir . \
--cpu ${thr} \
--excel
# final annotation
sudo /opt/biotools/funannotate/funannotate-docker annotate \
-i fun1 \
--busco_db "dikarya" \
--iprscan iprscan_out/Saccharomyces_cerevisiae_${smpl}.proteins.fa.xml \
--eggnog eggnog_out/eggnog.emapper.annotations \
--cpus ${thr}
# duplicate final folder locally
cp -r fun1/annotate_results .
# extract all annotated gene names
gawk 'BEGIN{FS="\t"; OFS="\t"}{if($3=="gene") {split($9,a,";"); if(a[2]==""){print a[1]}else print a[2]}}' \
annotate_results/Saccharomyces_cerevisiae_${smpl}.gff3 | sort \
> annotate_results/${smpl}-genes.txt
The Funannotate annotation was done on the pre-processed assembly detailed as:
num scaffolds: 36
assembly size: 12,618,341 bp
masked repeats: 527,761 bp (4.18%)
final Funannotate results can be found in the annotate_results subfolder
The files include a genebank formatted file reporting all annotations and both genomic and protein sequences.
Gene2Products.must-fix.txt
Gene2Products.need-curating.txt
Gene2Products.new-names-passed.txt
Saccharomyces_cerevisiae_SRR18210286.agp
Saccharomyces_cerevisiae_SRR18210286.annotations.txt
Saccharomyces_cerevisiae_SRR18210286.cds-transcripts.fa
Saccharomyces_cerevisiae_SRR18210286.contigs.fsa
Saccharomyces_cerevisiae_SRR18210286.discrepency.report.txt
Saccharomyces_cerevisiae_SRR18210286.gbk
Saccharomyces_cerevisiae_SRR18210286.gff3
Saccharomyces_cerevisiae_SRR18210286.mrna-transcripts.fa
Saccharomyces_cerevisiae_SRR18210286.proteins.fa
Saccharomyces_cerevisiae_SRR18210286.scaffolds.fa
Saccharomyces_cerevisiae_SRR18210286.sqn
Saccharomyces_cerevisiae_SRR18210286.stats.json
Saccharomyces_cerevisiae_SRR18210286.tbl
SRR18210286-genes.txt
Funannotate run with the reference assembly
We also run Funannotate on the reference genome sequence in order to see if additional genes can be predicted there.
# Funannotate installed from git
# eggnog installed with conda
##########################################################
# also run funannotate on the ref genome for comparison
smpl=s288c
thr=8
mkdir Sc.R64-1-1_funannotate && cd Sc.R64-1-1_funannotate
cp ../reference/Sc.R64-1-1.fa .
sudo /opt/biotools/funannotate/funannotate-docker clean -i Sc.R64-1-1.fa --minlen 1000 -o Sc.R64-1-1_cleaned.fa
sudo /opt/biotools/funannotate/funannotate-docker sort -i Sc.R64-1-1_cleaned.fa -b scaffold -o Sc.R64-1-1_cleaned_sorted.fa
sudo /opt/biotools/funannotate/funannotate-docker mask -i Sc.R64-1-1_cleaned_sorted.fa --cpus ${thr} -o ScaffoldedAssembly_Sc.R64-1-1.fa
sudo /opt/biotools/funannotate/funannotate-docker predict \
-i ScaffoldedAssembly_Sc.R64-1-1.fa \
-o fun2 \
--species "Saccharomyces cerevisiae" \
--strain ${smpl} \
--organism fungus \
--name "FUN2_" \
--ploidy 1 \
--busco_db "dikarya" \
--busco_seed_species "saccharomyces_cerevisiae_S288C" \
--cpus ${thr}
# accessory annotations, added in the same folder
## interproscan (takes some time)
# run from the local install
mkdir -p iprscan_out
interproscan.sh \
-i fun2/predict_results/Saccharomyces_cerevisiae_${smpl}.proteins.fa \
-f gff3,xml,tsv \
-dp \
-cpu ${thr} \
-d iprscan_out
## list of databases used in the annotation
# [AntiFam-7.0,CDD-3.18,Coils-2.2.1,Gene3D-4.3.0,Hamap-2021_04,MobiDBLite-2.0,PANTHER-15.0,
# Pfam-35.0,Phobius-1.01,PIRSF-3.10,PIRSR-2021_05,PRINTS-42.0,ProSitePatterns-2022_01,
# ProSiteProfiles-2022_01,SFLD-4,SignalP_EUK-4.1,SignalP_GRAM_NEGATIVE-4.1,
# SignalP_GRAM_POSITIVE-4.1,SMART-7.1,SUPERFAMILY-1.75,TIGRFAM-15.0,TMHMM-2.0c]
## eggnog-mapper
# activate corresponding conda env
mkdir -p eggnog_out
emapper.py \
-i fun2/predict_results/Saccharomyces_cerevisiae_${smpl}.proteins.fa \
--itype proteins \
--output_dir eggnog_out \
-o eggnog \
--temp_dir . \
--scratch_dir . \
--cpu ${thr} \
--excel
# final annotation
sudo /opt/biotools/funannotate/funannotate-docker annotate \
-i fun2 \
--busco_db "dikarya" \
--iprscan iprscan_out/Saccharomyces_cerevisiae_${smpl}.proteins.fa.xml \
--eggnog eggnog_out/eggnog.emapper.annotations \
--cpus ${thr}
# duplicate final folder locally
cp -r fun2/annotate_results .
# extract all annotated gene names
gawk 'BEGIN{FS="\t"; OFS="\t"}{if($3=="gene") {split($9,a,";"); if(a[2]==""){print a[1]}else print a[2]}}' \
annotate_results/Saccharomyces_cerevisiae_${smpl}.gff3 | sort \
> annotate_results/${smpl}-genes.txt
Compare Funannotate results
mkdir -p funannotate_compare && cd funannotate_compare
cp ../100x_Flye_assembly_funannotate/annotate_results/Saccharomyces_cerevisiae_SRR18210286.gbk .
cp ../Sc.R64-1-1_funannotate/annotate_results/Saccharomyces_cerevisiae_s288c.gbk .
sudo /opt/biotools/funannotate/funannotate-docker compare \
-i *.gbk \
-o compare_out \
--cpus 24
| isolate | species | locus_tag | Assembly Size | Largest Scaffold | Average Scaffold | Num Scaffolds | Scaffold N50 | Percent GC | Num Genes | Num Proteins | Num tRNA | Unique Proteins | Prots atleast 1 ortholog | Single-copy orthologs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Saccharomyces cerevisiae | None | s288c | 12,157,105 bp | 1,531,933 bp | 715,124 bp | 17 | 924,431 bp | 38.15% | 5,661 | 5,373 | 288 | 5,348 | 5,222 | 5,042 |
| Saccharomyces cerevisiae | None | Flye_denovo | 12,618,341 bp | 1,472,925 bp | 350,509 bp | 36 | 902,298 bp | 38.20% | 5,976 | 5,660 | 316 | 5,464 | 5,385 | 5,042 |
A number of pluts are available in teh compare results and not reproduced here. The comparison shows that the denovo assembly congtains more predicted genes, some are duplications due to the assembly process and other may come from contaminants in the DNA sample or from assembly errors. The assembly itself contains about 500kb extra sequences as compared to the reference that could represent the same entities. Contamination can be tested by blast'ing the unscaffolded sequences against genbank (not done here, see tutorial about bacterial assembly with Flye).
This concludes this overview of denovo assembly steps of a yeast genome.
The obtained genome is quasi complete with exception of centromeres and several difficult regions, leading to a split of chromosomes in chromosome arms and in different gapped contigs. Finishing such assembly would require orthogonal technologies like Hi-C or optical mapping which are quite costly but often lead to chromosome size scaffolds.
This workflow can be easily adapted to support multiple samples as obtained from a typical Sequel-IIE run, which, if successful, can accommodate 8-12 barcoded yeast genomes.
You will find below the help page for the main commands used in this report. Additional commands will allow defining some of the arguments in case you do not recycle this tutorial for a yeast but instead for another organism. As always, reading the full online documentation is highly recommended.
command help for Flye
usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
--nano-corr | --nano-hq ) file1 [file_2 ...]
--out-dir PATH
[--genome-size SIZE] [--threads int] [--iterations int]
[--meta] [--polish-target] [--min-overlap SIZE]
[--keep-haplotypes] [--debug] [--version] [--help]
[--scaffold] [--resume] [--resume-from] [--stop-after]
[--read-error float] [--extra-params]
Assembly of long reads with repeat graphs
optional arguments:
-h, --help show this help message and exit
--pacbio-raw path [path ...]
PacBio regular CLR reads (<20% error)
--pacbio-corr path [path ...]
PacBio reads that were corrected with other methods (<3% error)
--pacbio-hifi path [path ...]
PacBio HiFi reads (<1% error)
--nano-raw path [path ...]
ONT regular reads, pre-Guppy5 (<20% error)
--nano-corr path [path ...]
ONT reads that were corrected with other methods (<3% error)
--nano-hq path [path ...]
ONT high-quality reads: Guppy5+ SUP or Q20 (<5% error)
--subassemblies path [path ...]
[deprecated] high-quality contigs input
-g size, --genome-size size
estimated genome size (for example, 5m or 2.6g)
-o path, --out-dir path
Output directory
-t int, --threads int
number of parallel threads [1]
-i int, --iterations int
number of polishing iterations [1]
-m int, --min-overlap int
minimum overlap between reads [auto]
--asm-coverage int reduced coverage for initial disjointig assembly [not set]
--hifi-error float [deprecated] same as --read-error
--read-error float adjust parameters for given read error rate (as fraction e.g. 0.03)
--extra-params extra_params
extra configuration parameters list (comma-separated)
--plasmids unused (retained for backward compatibility)
--meta metagenome / uneven coverage mode
--keep-haplotypes do not collapse alternative haplotypes
--no-alt-contigs do not output contigs representing alternative haplotypes
--scaffold enable scaffolding using graph [disabled by default]
--trestle [deprecated] enable Trestle [disabled by default]
--polish-target path run polisher on the target sequence
--resume resume from the last completed stage
--resume-from stage_name
resume from a custom stage
--stop-after stage_name
stop after the specified stage completed
--debug enable debug output
-v, --version show program's version number and exit
Input reads can be in FASTA or FASTQ format, uncompressed
or compressed with gz. Currently, PacBio (CLR, HiFi, corrected)
and ONT reads (regular, HQ, corrected) are supported. Expected error rates are
<15% for PB CLR/regular ONT; <5% for ONT HQ, <3% for corrected, and <1% for HiFi. Note that Flye
was primarily developed to run on uncorrected reads. You may specify multiple
files with reads (separated by spaces). Mixing different read
types is not yet supported. The --meta option enables the mode
for metagenome/uneven coverage assembly.
To reduce memory consumption for large genome assemblies,
you can use a subset of the longest reads for initial disjointig
assembly by specifying --asm-coverage and --genome-size options. Typically,
40x coverage is enough to produce good disjointigs.
You can run Flye polisher as a standalone tool using
--polish-target option.
command help for BUSCO
usage: busco -i [SEQUENCE_FILE] -l [LINEAGE] -o [OUTPUT_NAME] -m [MODE] [OTHER OPTIONS]
Welcome to BUSCO 5.3.2: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide. Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO
optional arguments:
-i SEQUENCE_FILE, --in SEQUENCE_FILE
Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.
-o OUTPUT, --out OUTPUT
Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. The path to the output folder is set with --out_path.
-m MODE, --mode MODE Specify which BUSCO analysis mode to run.
There are three valid modes:
- geno or genome, for genome assemblies (DNA)
- tran or transcriptome, for transcriptome assemblies (DNA)
- prot or proteins, for annotated gene sets (protein)
-l LINEAGE, --lineage_dataset LINEAGE
Specify the name of the BUSCO lineage to be used.
--augustus Use augustus gene predictor for eukaryote runs
--augustus_parameters --PARAM1=VALUE1,--PARAM2=VALUE2
Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
--augustus_species AUGUSTUS_SPECIES
Specify a species for Augustus training.
--auto-lineage Run auto-lineage to find optimum lineage path
--auto-lineage-euk Run auto-placement just on eukaryote tree to find optimum lineage path
--auto-lineage-prok Run auto-lineage just on non-eukaryote trees to find optimum lineage path
-c N, --cpu N Specify the number (N=integer) of threads/cores to use.
--config CONFIG_FILE Provide a config file
--datasets_version DATASETS_VERSION
Specify the version of BUSCO datasets, e.g. odb10
--download [dataset ...]
Download dataset. Possible values are a specific dataset name, "all", "prokaryota", "eukaryota", or "virus". If used together with other command line arguments, make sure to place this last.
--download_base_url DOWNLOAD_BASE_URL
Set the url to the remote BUSCO dataset location
--download_path DOWNLOAD_PATH
Specify local filepath for storing BUSCO dataset downloads
-e N, --evalue N E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
-f, --force Force rewriting of existing files. Must be used when output files with the provided name already exist.
-h, --help Show this help message and exit
--limit N How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
--list-datasets Print the list of available BUSCO datasets
--long Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms
--metaeuk_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
--metaeuk_rerun_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
--offline To indicate that BUSCO cannot attempt to download files
--out_path OUTPUT_PATH
Optional location for results folder, excluding results folder name. Default is current working directory.
-q, --quiet Disable the info logs, displays only errors
-r, --restart Continue a run that had already partially completed.
--tar Compress some subdirectories with many files to save space
--update-data Download and replace with last versions all lineages datasets and files necessary to their automated selection
-v, --version Show this version and exit
command help for funannotate
Usage: funannotate <command> <arguments>
version: 1.8.14
Description: Funannotate is a genome prediction, annotation, and comparison pipeline.
Commands:
clean Find/remove small repetitive contigs
sort Sort by size and rename contig headers
mask Repeatmask genome assembly
train RNA-seq mediated training of Augustus/GeneMark
predict Run gene prediction pipeline
fix Fix annotation errors (generate new GenBank file)
update RNA-seq/PASA mediated gene model refinement
remote Partial functional annotation using remote servers
iprscan InterProScan5 search (Docker or local)
annotate Assign functional annotation to gene predictions
compare Compare funannotated genomes
util Format conversion and misc utilities
setup Setup/Install databases
test Download/Run funannotate installation tests
check Check Python, Perl, and External dependencies [--show-versions]
species list pre-trained Augustus species
database Manage databases
outgroups Manage outgroups for funannotate compare
Written by Jon Palmer (2016-2019) [email protected]
command help for funannotate predict
Usage: funannotate predict <arguments>
version: 1.8.14
Description: Script takes genome multi-fasta file and a variety of inputs to do a comprehensive whole
genome gene prediction. Uses AUGUSTUS, GeneMark, Snap, GlimmerHMM, BUSCO, EVidence Modeler,
tbl2asn, tRNAScan-SE, Exonerate, minimap2.
Required:
-i, --input Genome multi-FASTA file (softmasked repeats)
-o, --out Output folder name
-s, --species Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
Optional:
-p, --parameters Ab intio parameters JSON file to use for gene predictors
--isolate Isolate name, e.g. Af293
--strain Strain name, e.g. FGSCA4
--name Locus tag name (assigned by NCBI?). Default: FUN_
--numbering Specify where gene numbering starts. Default: 1
--maker_gff MAKER2 GFF file. Parse results directly to EVM.
--pasa_gff PASA generated gene models. filename:weight
--other_gff Annotation pass-through to EVM. filename:weight
--rna_bam RNA-seq mapped to genome to train Augustus/GeneMark-ET
--stringtie StringTie GTF result
-w, --weights Ab-initio predictor and EVM weight. Example: augustus:2 or pasa:10
--augustus_species Augustus species config. Default: uses species name
--min_training_models Minimum number of models to train Augustus. Default: 200
--genemark_mode GeneMark mode. Default: ES [ES,ET]
--genemark_mod GeneMark ini mod file
--busco_seed_species Augustus pre-trained species to start BUSCO. Default: anidulans
--optimize_augustus Run 'optimze_augustus.pl' to refine training (long runtime)
--busco_db BUSCO models. Default: dikarya. `funannotate outgroups --show_buscos`
--organism Fungal-specific options. Default: fungus. [fungus,other]
--ploidy Ploidy of assembly. Default: 1
-t, --tbl2asn Assembly parameters for tbl2asn. Default: "-l paired-ends"
-d, --database Path to funannotate database. Default: $FUNANNOTATE_DB
--protein_evidence Proteins to map to genome (prot1.fa prot2.fa uniprot.fa). Default: uniprot.fa
--protein_alignments Pre-computed protein alignments in GFF3 format
--p2g_pident Exonerate percent identity. Default: 80
--p2g_diamond_db Premade diamond genome database for protein2genome mapping
--p2g_prefilter Pre-filter hits software selection. Default: diamond [tblastn]
--transcript_evidence mRNA/ESTs to align to genome (trans1.fa ests.fa trinity.fa). Default: none
--transcript_alignments Pre-computed transcript alignments in GFF3 format
--augustus_gff Pre-computed AUGUSTUS GFF3 results (must use --stopCodonExcludedFromCDS=False)
--genemark_gtf Pre-computed GeneMark GTF results
--trnascan Pre-computed tRNAscanSE results
--min_intronlen Minimum intron length. Default: 10
--max_intronlen Maximum intron length. Default: 3000
--soft_mask Softmasked length threshold for GeneMark. Default: 2000
--min_protlen Minimum protein length. Default: 50
--repeats2evm Use repeats in EVM consensus model building
--keep_evm Keep existing EVM results (for rerunning pipeline)
--evm-partition-interval Min length between genes to make a partition: Default: 1500
--no-evm-partitions Do not split contigs into partitions
--repeat_filter Repetitive gene model filtering. Default: overlap blast [overlap,blast,none]
--keep_no_stops Keep gene models without valid stops
--SeqCenter Sequencing facilty for NCBI tbl file. Default: CFMR
--SeqAccession Sequence accession number for NCBI tbl file. Default: 12345
--force Annotated unmasked genome
--cpus Number of CPUs to use. Default: 2
--no-progress Do not print progress to stdout for long sub jobs
--tmpdir Volume/location to write temporary files. Default: /tmp
--header_length Maximum length of FASTA headers. Default: 16
ENV Vars: If not specified at runtime, will be loaded from your $PATH
--EVM_HOME
--AUGUSTUS_CONFIG_PATH
--GENEMARK_PATH
--BAMTOOLS_PATH
command help for funannotate annotate
Usage: funannotate annotate <arguments>
version: 1.8.14
Description: Script functionally annotates the results from funannotate predict. It pulls
annotation from PFAM, InterPro, EggNog, UniProtKB, MEROPS, CAZyme, and GO ontology.
Required:
-i, --input Folder from funannotate predict
or
--genbank Genome in GenBank format
-o, --out Output folder for results
or
--gff Genome GFF3 annotation file
--fasta Genome in multi-fasta format
-s, --species Species name, use quotes for binomial, e.g. "Aspergillus fumigatus"
-o, --out Output folder for results
Optional:
--sbt NCBI submission template file. (Recommended)
-a, --annotations Custom annotations (3 column tsv file)
-m, --mito-pass-thru Mitochondrial genome/contigs. append with :mcode
--eggnog Eggnog-mapper annotations file (if NOT installed)
--antismash antiSMASH secondary metabolism results (GBK file from output)
--iprscan InterProScan5 XML file
--phobius Phobius pre-computed results (if phobius NOT installed)
--signalp SignalP pre-computed results (-org euk -format short)
--isolate Isolate name
--strain Strain name
--rename Rename GFF gene models with locus_tag from NCBI.
--fix Gene/Product names fixed (TSV: GeneID Name Product)
--remove Gene/Product names to remove (TSV: Gene Product)
--busco_db BUSCO models. Default: dikarya
-t, --tbl2asn Additional parameters for tbl2asn. Default: "-l paired-ends"
-d, --database Path to funannotate database. Default: $FUNANNOTATE_DB
--force Force over-write of output folder
--cpus Number of CPUs to use. Default: 2
--tmpdir Volume/location to write temporary files. Default: /tmp
--p2g protein2genome pre-computed results
--header_length Maximum length of FASTA headers. Default: 16
--no-progress Do not print progress to stdout for long sub jobs
command help for interproscan.sh
06/10/2022 17:33:47:284 Welcome to InterProScan-5.57-90.0
06/10/2022 17:33:47:285 Running InterProScan v5 in STANDALONE mode... on Linux
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M
-Xmx2048M -jar interproscan-5.jar
Please give us your feedback by sending an email to
[email protected]
-appl,--applications <ANALYSES> Optional, comma separated list of analyses. If this option
is not set, ALL analyses will be run.
-b,--output-file-base <OUTPUT-FILE-BASE> Optional, base output filename (relative or absolute path).
Note that this option, the --output-dir (-d) option and the
--outfile (-o) option are mutually exclusive. The
appropriate file extension for the output format(s) will be
appended automatically. By default the input file path/name
will be used.
-cpu,--cpu <CPU> Optional, number of cores for inteproscan.
-d,--output-dir <OUTPUT-DIR> Optional, output directory. Note that this option, the
--outfile (-o) option and the --output-file-base (-b) option
are mutually exclusive. The output filename(s) are the same
as the input filename, with the appropriate file extension(s)
for the output format(s) appended automatically .
-dp,--disable-precalc Optional. Disables use of the precalculated match lookup
service. All match calculations will be run locally.
-dra,--disable-residue-annot Optional, excludes sites from the XML, JSON output
-etra,--enable-tsv-residue-annot Optional, includes sites in TSV output
-exclappl,--excl-applications <EXC-ANALYSES> Optional, comma separated list of analyses you want to
exclude.
-f,--formats <OUTPUT-FORMATS> Optional, case-insensitive, comma separated list of output
formats. Supported formats are TSV, XML, JSON, and GFF3.
Default for protein sequences are TSV, XML and GFF3, or for
nucleotide sequences GFF3 and XML.
-goterms,--goterms Optional, switch on lookup of corresponding Gene Ontology
annotation (IMPLIES -iprlookup option)
-help,--help Optional, display help information
-i,--input <INPUT-FILE-PATH> Optional, path to fasta file that should be loaded on Master
startup. Alternatively, in CONVERT mode, the InterProScan 5
XML file to convert.
-incldepappl,--incl-dep-applications <INC-DEP-ANALYSES> Optional, comma separated list of deprecated analyses that
you want included. If this option is not set, deprecated
analyses will not run.
-iprlookup,--iprlookup Also include lookup of corresponding InterPro annotation in
the TSV and GFF3 output formats.
-ms,--minsize <MINIMUM-SIZE> Optional, minimum nucleotide size of ORF to report. Will only
be considered if n is specified as a sequence type. Please be
aware of the fact that if you specify a too short value it
might be that the analysis takes a very long time!
-o,--outfile <EXPLICIT_OUTPUT_FILENAME> Optional explicit output file name (relative or absolute
path). Note that this option, the --output-dir (-d) option
and the --output-file-base (-b) option are mutually
exclusive. If this option is given, you MUST specify a single
output format using the -f option. The output file name will
not be modified. Note that specifying an output file name
using this option OVERWRITES ANY EXISTING FILE.
-pa,--pathways Optional, switch on lookup of corresponding Pathway
annotation (IMPLIES -iprlookup option)
-t,--seqtype <SEQUENCE-TYPE> Optional, the type of the input sequences (dna/rna (n) or
protein (p)). The default sequence type is protein.
-T,--tempdir <TEMP-DIR> Optional, specify temporary file directory (relative or
absolute path). The default location is temp/.
-verbose,--verbose Optional, display more verbose log output
-version,--version Optional, display version number
-vl,--verbose-level <VERBOSE-LEVEL> Optional, display verbose log output at level specified.
-vtsv,--output-tsv-version Optional, includes a TSV version file along with any TSV
output (when TSV output requested)
Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge, UK. (http://www.ebi.ac.uk) The InterProScan
software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).
Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the
individual member database websites for details.
Available analyses:
TIGRFAM (15.0) : TIGRFAMs are protein families based on hidden Markov models (HMMs).
SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).
Phobius (1.01) : A combined transmembrane topology and signal peptide predictor.
SignalP_GRAM_NEGATIVE (4.1) : SignalP (gram-negative) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-negative prokaryotes.
SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.
PANTHER (15.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
Gene3D (4.3.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.
Hamap (2021_04) : High-quality Automated and Manual Annotation of Microbial Proteomes.
ProSiteProfiles (2022_01) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
Coils (2.2.1) : Prediction of coiled coil regions in proteins.
SMART (7.1) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).
CDD (3.18) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.
PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.
PIRSR (2021_05) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.
ProSitePatterns (2022_01) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.
AntiFam (7.0) : AntiFam is a resource of profile-HMMs designed to identify spurious protein predictions.
SignalP_EUK (4.1) : SignalP (eukaryotes) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for eukaryotes.
Pfam (35.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.
SignalP_GRAM_POSITIVE (4.1) : SignalP (gram-positive) predicts the presence and location of signal peptide cleavage sites in amino acid sequences for gram-positive prokaryotes.
PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
TMHMM (2.0c) : Prediction of transmembrane helices in proteins.
command help for eggnog
usage: emapper.py [-h] [-v] [--list_taxa] [--cpu NUM_CPU] [--mp_start_method {fork,spawn,forkserver}] [--resume] [--override] [-i FASTA_FILE] [--itype {CDS,proteins,genome,metagenome}] [--translate]
[--annotate_hits_table SEED_ORTHOLOGS_FILE] [-c FILE] [--data_dir DIR] [--genepred {search,prodigal}] [--trans_table TRANS_TABLE_CODE] [--training_genome FILE]
[--training_file FILE] [--allow_overlaps {none,strand,diff_frame,all}] [--overlap_tol FLOAT] [-m {diamond,mmseqs,hmmer,no_search,cache}] [--pident PIDENT]
[--query_cover QUERY_COVER] [--subject_cover SUBJECT_COVER] [--evalue EVALUE] [--score SCORE] [--dmnd_algo {auto,0,1,ctg}] [--dmnd_db DMND_DB_FILE]
[--sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}] [--dmnd_iterate {yes,no}]
[--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}] [--dmnd_frameshift DMND_FRAMESHIFT] [--gapopen GAPOPEN] [--gapextend GAPEXTEND]
[--block_size BLOCK_SIZE] [--index_chunks CHUNKS] [--outfmt_short] [--dmnd_ignore_warnings] [--mmseqs_db MMSEQS_DB_FILE] [--start_sens START_SENS] [--sens_steps SENS_STEPS]
[--final_sens FINAL_SENS] [--mmseqs_sub_mat SUBS_MATRIX] [-d HMMER_DB_PREFIX] [--servers_list FILE] [--qtype {hmm,seq}] [--dbtype {hmmdb,seqdb}] [--usemem] [-p PORT]
[--end_port PORT] [--num_servers NUM_SERVERS] [--num_workers NUM_WORKERS] [--hmm_maxhits MAXHITS] [--report_no_hits] [--hmm_maxseqlen MAXSEQLEN] [--Z DB_SIZE] [--cut_ga]
[--clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans] [--no_annot] [--dbmem] [--seed_ortholog_evalue MIN_E-VALUE] [--seed_ortholog_score MIN_SCORE] [--tax_scope TAX_SCOPE]
[--tax_scope_mode TAX_SCOPE_MODE] [--target_orthologs {one2one,many2one,one2many,many2many,all}] [--target_taxa LIST_OF_TAX_IDS] [--excluded_taxa LIST_OF_TAX_IDS]
[--report_orthologs] [--go_evidence {experimental,non-electronic,all}] [--pfam_realign {none,realign,denovo}] [--md5] [--output FILE_PREFIX] [--output_dir DIR] [--scratch_dir DIR]
[--temp_dir DIR] [--no_file_comments] [--decorate_gff DECORATE_GFF] [--decorate_gff_ID_field DECORATE_GFF_ID_FIELD] [--excel]
options:
-h, --help show this help message and exit
-v, --version show version and exit. (default: False)
--list_taxa List taxa available for --tax_scope/--tax_scope_mode, and exit (default: False)
Execution Options:
--cpu NUM_CPU Number of CPUs to be used. --cpu 0 to run with all available CPUs. (default: 1)
--mp_start_method {fork,spawn,forkserver}
Sets the python multiprocessing start method. Check https://docs.python.org/3/library/multiprocessing.html. Only use if the default method is not working properly in your OS.
(default: spawn)
--resume Resumes a previous emapper run, skipping results in existing output files. (default: False)
--override Overwrites output files if they exist. By default, execution is aborted if conflicting files are detected. (default: False)
Input Data Options:
-i FASTA_FILE Input FASTA file containing query sequences (proteins by default; see --itype and --translate). Required unless -m no_search. (default: None)
--itype {CDS,proteins,genome,metagenome}
Type of data in the input (-i) file. (default: proteins)
--translate When --itype CDS, translate CDS to proteins before search. When --itype genome/metagenome and --genepred search, translate predicted CDS from blastx hits to proteins.
(default: False)
--annotate_hits_table SEED_ORTHOLOGS_FILE
Annotate TSV formatted table with 4 fields: query, hit, evalue, score. Usually, a .seed_orthologs file from a previous emapper.py run. Requires -m no_search. (default: None)
-c FILE, --cache FILE
File containing annotations and md5 hashes of queries, to be used as cache. Required if -m cache (default: None)
--data_dir DIR Path to eggnog-mapper databases. By default, "data/" or the path specified in the environment variable EGGNOG_DATA_DIR. (default: None)
Gene Prediction Options:
--genepred {search,prodigal}
This is applied when --itype genome or --itype metagenome. search: gene prediction is inferred from Diamond/MMseqs2 blastx hits. prodigal: gene prediction is performed using
Prodigal. (default: search)
--trans_table TRANS_TABLE_CODE
This option will be used for Prodigal, Diamond or MMseqs2, depending on the mode. For Diamond searches, this option corresponds to the --query-gencode option. For MMseqs2
searches, this option corresponds to the --translation-table option. For Prodigal, this option corresponds to the -g/--trans_table option. It is also used when --translate,
check https://biopython.org/docs/1.75/api/Bio.Seq.html#Bio.Seq.Seq.translate. Default is the corresponding programs defaults. (default: None)
--training_genome FILE
A genome to run Prodigal in training mode. If this parameter is used, Prodigal will run in two steps: firstly in training mode, and secondly using the training to analize the
emapper input data. See Prodigal documentation about Traning mode for more info. Only used if --genepred prodigal. (default: None)
--training_file FILE A training file from Prodigal training mode. If this parameter is used, Prodigal will run using this training file to analyze the emapper input data. Only used if --genepred
prodigal. (default: None)
--allow_overlaps {none,strand,diff_frame,all}
When using 'blastx'-based genepred (--genepred search --itype genome/metagenome) this option controls whether overlapping hits are reported or not, or if only those
overlapping hits in a different strand or frame are reported. Also, the degree of accepted overlap can be controlled with --overlap_tol. (default: none)
--overlap_tol FLOAT This value (0-1) is the proportion such that if (overlap size / hit length) > overlap_tol, hits are considered to overlap. e.g: if overlap_tol is 0.0, any overlap is
considered as such. e.g: if overlap_tol is 1.0, one of the hits must overlap entirely to consider that hits do overlap. (default: 0.0)
Search Options:
-m {diamond,mmseqs,hmmer,no_search,cache}
diamond: search seed orthologs using diamond (-i is required). mmseqs: search seed orthologs using MMseqs2 (-i is required). hmmer: search seed orthologs using HMMER. (-i is
required). no_search: skip seed orthologs search (--annotate_hits_table is required, unless --no_annot). cache: skip seed orthologs search and annotate based on cached results
(-i and -c are required). (default: diamond)
Search filtering common options:
--pident PIDENT Report only alignments above or equal to the given percentage of identity (0-100).No effect if -m hmmer. (default: None)
--query_cover QUERY_COVER
Report only alignments above or equal the given percentage of query cover (0-100).No effect if -m hmmer. (default: None)
--subject_cover SUBJECT_COVER
Report only alignments above or equal the given percentage of subject cover (0-100).No effect if -m hmmer. (default: None)
--evalue EVALUE Report only alignments below or equal the e-value threshold. (default: 0.001)
--score SCORE Report only alignments above or equal the score threshold. (default: None)
Diamond Search Options:
--dmnd_algo {auto,0,1,ctg}
Diamond's --algo option, which can be tuned to search small query sets. By default, it is adjusted automatically. However, the ctg option should be activated manually. If you
plan to search a small input set of sequences, use --dmnd_algo ctg to make it faster. (default: auto)
--dmnd_db DMND_DB_FILE
Path to DIAMOND-compatible database (default: None)
--sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}
Diamond's sensitivity mode. Note that emapper's default is sensitive, which is different from diamond's default, which can be activated with --sensmode default. (default:
sensitive)
--dmnd_iterate {yes,no}
--dmnd_iterate yes --> activates the --iterate option of diamond for iterative searches, from faster, less sensitive modes, up to the sensitivity specified with --sensmode.
Available since diamond 2.0.11. --dmnd_iterate no --> disables the --iterate mode. (default: yes)
--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}
Scoring matrix (default: None)
--dmnd_frameshift DMND_FRAMESHIFT
Diamond --frameshift/-F option. Not used by default. Recommended by diamond: 15. (default: None)
--gapopen GAPOPEN Gap open penalty (default: None)
--gapextend GAPEXTEND
Gap extend penalty (default: None)
--block_size BLOCK_SIZE
Diamond -b/--block-size option. Default is the diamond's default. (default: None)
--index_chunks CHUNKS
Diamond -c/--index-chunks option. Default is the diamond's default. (default: None)
--outfmt_short Diamond output will include only qseqid sseqid evalue and score. This could help obtain better performance, if also no --pident, --query_cover or --subject_cover thresholds
are used. This option is ignored when the diamond search is run in blastx mode for gene prediction (see --genepred). (default: False)
--dmnd_ignore_warnings
Diamond --ignore-warnings option. It avoids Diamond stopping due to warnings (e.g. when a protein contains only ATGC symbols. (default: False)
MMseqs2 Search Options:
--mmseqs_db MMSEQS_DB_FILE
Path to MMseqs2-compatible database (default: None)
--start_sens START_SENS
Starting sensitivity. (default: 3)
--sens_steps SENS_STEPS
Number of sensitivity steps. (default: 3)
--final_sens FINAL_SENS
Final sensititivy step. (default: 7)
--mmseqs_sub_mat SUBS_MATRIX
Matrix to be used for --sub-mat MMseqs2 search option. Default=default used by MMseqs2 (default: None)
HMMER Search Options:
-d HMMER_DB_PREFIX, --database HMMER_DB_PREFIX
specify the target database for sequence searches. Choose among: euk,bact,arch, or a database loaded in a server, db.hmm:host:port (see hmm_server.py) (default: None)
--servers_list FILE A FILE with a list of remote hmmpgmd servers. Each row in the file represents a server, in the format 'host:port'. If --servers_list is specified, host and port from -d option
will be ignored. (default: None)
--qtype {hmm,seq} Type of input data (-i). (default: seq)
--dbtype {hmmdb,seqdb}
Type of data in DB (-db). (default: hmmdb)
--usemem Use this option to allocate the whole database (-d) in memory using hmmpgmd. If --dbtype hmm, the database must be a hmmpress-ed database. If --dbtype seqdb, the database must
be a HMMER-format database created with esl-reformat. Database will be unloaded after execution. Note that this only works for HMMER based searches. To load the eggnog-mapper
annotation DB into memory use --dbmem. (default: False)
-p PORT, --port PORT Port used to setup HMM server, when --usemem. Also used for --pfam_realign modes. (default: 51700)
--end_port PORT Last port to be used to setup HMM server, when --usemem. Also used for --pfam_realign modes. (default: 53200)
--num_servers NUM_SERVERS
When using --usemem, specify the number of servers to fire up.Note that cpus specified with --cpu will be distributed among servers and workers. Also used for --pfam_realign
modes. (default: 1)
--num_workers NUM_WORKERS
When using --usemem, specify the number of workers per server (--num_servers) to fire up. By default, cpus specified with --cpu will be distributed among servers and workers.
Also used for --pfam_realign modes. (default: 1)
--hmm_maxhits MAXHITS
Max number of hits to report (0 to report all). (default: 1)
--report_no_hits Whether queries without hits should be included in the output table. (default: False)
--hmm_maxseqlen MAXSEQLEN
Ignore query sequences larger than `maxseqlen`. (default: 5000)
--Z DB_SIZE Fixed database size used in phmmer/hmmscan (allows comparing e-values among databases). (default: 40000000)
--cut_ga Adds the --cut_ga to hmmer commands (useful for Pfam mappings, for example). See hmmer documentation. (default: False)
--clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans
Removes those hits which overlap, keeping only the one with best evalue. Use the "all" and "clans" options when performing a hmmscan type search (i.e. domains are in the
database). Use the "hmmsearch_all" and "hmmsearch_clans" options when using a hmmsearch type search (i.e. domains are the queries from -i file). The "clans" and
"hmmsearch_clans" and options will only have effect for hits to/from Pfam. (default: None)
Annotation Options:
--no_annot Skip functional annotation, reporting only hits. (default: False)
--dbmem Use this option to allocate the whole eggnog.db DB in memory. Database will be unloaded after execution. (default: False)
--seed_ortholog_evalue MIN_E-VALUE
Min E-value expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. (default: 0.001)
--seed_ortholog_score MIN_SCORE
Min bit score expected when searching for seed eggNOG ortholog. Queries not having a significant seed orthologs will not be annotated. (default: None)
--tax_scope TAX_SCOPE
Fix the taxonomic scope used for annotation, so only speciation events from a particular clade are used for functional transfer. More specifically, the --tax_scope list is
intersected with the seed orthologs clades, and the resulting clades are used for annotation based on --tax_scope_mode. Note that those seed orthologs without clades
intersecting with --tax_scope will be filtered out, and won't annotated. Possible arguments for --tax_scope are: 1) A path to a file defined by the user, which contains a list
of tax IDs and/or tax names. 2) The name of a pre-configured tax scope, whose source is a file stored within the 'eggnogmapper/annotation/tax_scopes/' directory By default,
available ones are: 'auto' ('all'), 'auto_broad' ('all_broad'), 'all_narrow', 'archaea', 'bacteria', 'bacteria_broad', 'eukaryota', 'eukaryota_broad' and 'prokaryota_broad'.3)
A comma-separated list of taxonomic names and/or taxonomic IDs, sorted by preference. An example of list of tax IDs would be 2759,2157,2,1 for Eukaryota, Archaea, Bacteria and
root, in that order of preference. 4) 'none': do not filter out annotations based on taxonomic scope. (default: auto)
--tax_scope_mode TAX_SCOPE_MODE
For a seed ortholog which passed the filter imposed by --tax_scope, --tax_scope_mode controls which specific clade, to which the seed ortholog belongs, will be used for
annotation. Options: 1) broadest: use the broadest clade. 2) inner_broadest: use the broadest clade from the intersection with --tax_scope. 3) inner_narrowest: use the
narrowest clade from the intersection with --tax_scope. 4) narrowest: use the narrowest clade. 5) A taxonomic scope as in --tax_scope: use this second list to intersect with
seed ortholog clades and use the narrowest (as in inner_narrowest) from the intersection to annotate. (default: inner_narrowest)
--target_orthologs {one2one,many2one,one2many,many2many,all}
defines what type of orthologs (in relation to the seed ortholog) should be used for functional transfer (default: all)
--target_taxa LIST_OF_TAX_IDS
Only orthologs from the specified comma-separated list of taxa and all its descendants will be used for annotation transference. By default, all taxa are used. (default: None)
--excluded_taxa LIST_OF_TAX_IDS
Orthologs from the specified comma-separated list of taxa and all its descendants will not be used for annotation transference. By default, no taxa is excluded. (default:
None)
--report_orthologs Output the list of orthologs found for each query to a .orthologs file (default: False)
--go_evidence {experimental,non-electronic,all}
Defines what type of GO terms should be used for annotation. experimental = Use only terms inferred from experimental evidence. non-electronic = Use only non-electronically
curated terms (default: non-electronic)
--pfam_realign {none,realign,denovo}
Realign the queries to the PFAM domains. none = no realignment is performed. PFAM annotation will be that transferred as specify in the --pfam_transfer option. realign =
queries will be realigned to the PFAM domains found according to the --pfam_transfer option. denovo = queries will be realigned to the whole PFAM database, ignoring the
--pfam_transfer option. Check hmmer options (--num_servers, --num_workers, --port, --end_port) to change how the hmmpgmd server is run. (default: none)
--md5 Adds the md5 hash of each query as an additional field in annotations output files. (default: False)
Output options:
--output FILE_PREFIX, -o FILE_PREFIX
base name for output files (default: None)
--output_dir DIR Where output files should be written (default: /data/biodata/references/galGal6a/Gheyas_variants)
--scratch_dir DIR Write output files in a temporary scratch dir, move them to the final output dir when finished. Speed up large computations using network file systems. (default: None)
--temp_dir DIR Where temporary files are created. Better if this is a local disk. (default: /data/biodata/references/galGal6a/Gheyas_variants)
--no_file_comments No header lines nor stats are included in the output files (default: False)
--decorate_gff DECORATE_GFF
Add search hits and/or annotation results to GFF file from gene prediction of a user specified one. no = no GFF decoration at all. GFF file from blastx-based gene prediction
will be created anyway. yes = add search hits and/or annotations to GFF file from Prodigal or blastx-based gene prediction. FILE = decorate the specified pre-existing GFF
FILE. e.g. --decorage_gff myfile.gff You change the field interpreted as ID of the feature with --decorate_gff_ID_field. (default: no)
--decorate_gff_ID_field DECORATE_GFF_ID_FIELD
Change the field used in GFF files as ID of the feature. (default: ID)
--excel Output annotations also in .xlsx format. (default: False)