Manual - dzhao2019/Eukfinder-Test GitHub Wiki

Table of Contents

  1. Introduction
  2. System Requirements
  3. Installation
  4. Eukfinder Database Structure
  5. Standard Eukfinder Database
  6. Classification Command Line
  7. Output Formats
  8. Custom Databases
  9. Masking Low-Complexity Sequences
  10. Eukfinder Variables

Introduction

Eukfinder is designed for the classification of eukaryotic sequences in metagenomic data. It processes Illumina short reads (Eukfinder_short) and assemblies/long-read data (Eukfinder_long) through automated classification steps. Users can also apply an additional manual binning workflow to refine nuclear and mitochondrial genomes.

Features:

  • reference-independent and cultivation-free
  • separates reads or contigs into five groups: Bacterial, Archaeal, Eukaryotic, Viral, and Unknown
  • generates a FASTA file with eukaryotic and unknown contigs for binning

System Requirements

  • Disk space:

  • Memory:

  • Dependencies:

    MacOS NOTE: MacOS and other non-Linux operating systems are not explicitly supported by the developers.

  • Network connectivity:

Installation

To begin using Eukfinder, you will first need to install it, and then either download or create a database.

Anaconda or Miniconda required

Anaconda or Miniconda must be installed to run Eukfinder.

If you don’t already have Anaconda or Miniconda installed, download and install one of them before continuing.

1. Create an environment and install Eukfinder

conda create -n eukfinder -c bioconda eukfinder

2. Download or build databases

Default reference databases can be downloaded from Eukfinder Databases

  • Plast Database
  • Centrifuge Database
  • acc2tax Database
  • Human Genome for read decontamination
  • Read Adapters for Illumina sequencing

./download_db.sh

Users can also customize the reference data (see Custom Databases).

3. (!) Activate the eukfinder environment before running any command

If you have Conda 4.4 or later:

conda init
conda activate eukfinder

After this, you can run Eukfinder commands.

If you have a Conda version older than 4.4:

source activate eukfinder

Then run Eukfinder as usual.
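
A quick way to confirm that the environment is active is to check whether the eukfinder executable is on the PATH. The sketch below is only a convenience check; the variable name EUKFINDER_READY is a hypothetical helper and not part of Eukfinder.

```shell
# Check that the eukfinder executable is visible in the current shell.
# EUKFINDER_READY is a hypothetical helper variable used only in this sketch.
if command -v eukfinder >/dev/null 2>&1; then
    EUKFINDER_READY=yes
    # Print the help text of one subcommand to confirm the install works.
    eukfinder read_prep -h
else
    EUKFINDER_READY=no
    echo "eukfinder not found on PATH; activate the conda environment first"
fi
```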

Eukfinder Databases

Standard Eukfinder Database

Classification

eukfinder read_prep

Run Trimmomatic to remove low-quality reads and adapters

Run Bowtie2 to remove host reads

Run Centrifuge for the first round of classification

eukfinder read_prep [-h] --r1 R1 --r2 R2 -n THREADS -i ILLUMINA_CLIP
                           --hcrop HCROP -l LEADING_TRIM -t TRAIL_TRIM --wsize
                           WSIZE --qscore QSCORE --mlen MLEN --hg HG -o
                           OUT_NAME --cdb CDB
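
As an illustration of the usage line above, a read_prep call might look like the following sketch. The flags are taken from the usage text; every file name and numeric value is an illustrative placeholder chosen for this example, not a recommended setting, and the comments on what each input is are inferred from the option names. Run `eukfinder read_prep -h` for the authoritative descriptions.

```shell
# All values below are illustrative placeholders, not recommended settings.
R1=sample_R1.fastq            # forward reads (hypothetical file name)
R2=sample_R2.fastq            # reverse reads (hypothetical file name)
ADAPTERS=adapters.fa          # Illumina adapter sequences (hypothetical path)
HOST_GENOME=host_genome.fna   # host reference used for read removal (hypothetical path)
CENTRIFUGE_DB=centrifuge_db   # Centrifuge database (hypothetical path)
OUT=sample_prep               # output name (hypothetical)

if command -v eukfinder >/dev/null 2>&1; then
    eukfinder read_prep \
        --r1 "$R1" --r2 "$R2" \
        -n 8 \
        -i "$ADAPTERS" \
        --hcrop 10 -l 15 -t 15 \
        --wsize 40 --qscore 25 --mlen 30 \
        --hg "$HOST_GENOME" \
        --cdb "$CENTRIFUGE_DB" \
        -o "$OUT"
else
    echo "eukfinder not found on PATH; activate the conda environment first"
fi
```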

eukfinder short_seqs

eukfinder short_seqs [-h] --r1 R1 --r2 R2 --un UN -o OUT_NAME -n
                        NUMBER_OF_THREADS -z NUMBER_OF_CHUNKS -t
                        TAXONOMY_UPDATE -p PLAST_DATABASE -m PLAST_ID_MAP
                        [-p2 ANCILLARY_PLAST_DATABASE]
                        [-m2 ANCILLARY_PLAST_ID_MAP]
                        [--force-pdb FORCE_PDB] -a ACC2TAX_DATABASE --cdb
                        CDB -e E_VALUE --pid PID --cov COV --max_m MAX_M
                        --mhlen MHLEN --pclass PCLASS --uclass UCLASS
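
A short_seqs call built from the usage line above could be sketched as follows. It assumes that --r1, --r2, and --un are the paired and unpaired reads left after read_prep and that --pclass and --uclass are the matching Centrifuge results; those meanings are inferred from the option names. All file names, the value passed to -t, and all thresholds are placeholders for illustration only.

```shell
# All values below are illustrative placeholders, not recommended settings.
R1=prep_R1.fastq                      # paired reads, mate 1 (hypothetical name)
R2=prep_R2.fastq                      # paired reads, mate 2 (hypothetical name)
UN=prep_unpaired.fastq                # unpaired reads (assumed meaning of --un)
PCLASS=prep_paired_centrifuge.txt     # assumed: Centrifuge result for paired reads
UCLASS=prep_unpaired_centrifuge.txt   # assumed: Centrifuge result for unpaired reads
PLAST_DB=plast_db.fasta               # PLAST database (hypothetical path)
PLAST_MAP=plast_db.map                # PLAST ID map (hypothetical path)
ACC2TAX_DB=acc2tax_db                 # acc2tax database (hypothetical path)
CENTRIFUGE_DB=centrifuge_db           # Centrifuge database (hypothetical path)
OUT=sample_short                      # output name (hypothetical)

if command -v eukfinder >/dev/null 2>&1; then
    # "-t False" is an assumed value for TAXONOMY_UPDATE; check the help text.
    eukfinder short_seqs \
        --r1 "$R1" --r2 "$R2" --un "$UN" \
        -o "$OUT" -n 8 -z 4 -t False \
        -p "$PLAST_DB" -m "$PLAST_MAP" \
        -a "$ACC2TAX_DB" --cdb "$CENTRIFUGE_DB" \
        -e 0.01 --pid 60 --cov 30 \
        --max_m 100 --mhlen 30 \
        --pclass "$PCLASS" --uclass "$UCLASS"
else
    echo "eukfinder not found on PATH; activate the conda environment first"
fi
```

The optional -p2, -m2, and --force-pdb arguments shown in brackets in the usage line are omitted here.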

eukfinder long_seqs

eukfinder long_seqs [-h] -l LONG_SEQS -o OUT_NAME --mhlen MHLEN --cdb
                       CDB -n NUMBER_OF_THREADS -z NUMBER_OF_CHUNKS -t
                       TAXONOMY_UPDATE -p PLAST_DATABASE -m PLAST_ID_MAP
                       -a ACC2TAX_DATABASE -e E_VALUE --pid PID --cov COV
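
For assemblies or long reads, the usage line above translates into a call like the sketch below. As in the earlier examples, the flags come from the usage text, while the input file name, database paths, the value passed to -t, and the thresholds are placeholders chosen for illustration.

```shell
# All values below are illustrative placeholders, not recommended settings.
LONG_SEQS=contigs.fasta       # assembled contigs or long reads (hypothetical name)
PLAST_DB=plast_db.fasta       # PLAST database (hypothetical path)
PLAST_MAP=plast_db.map        # PLAST ID map (hypothetical path)
ACC2TAX_DB=acc2tax_db         # acc2tax database (hypothetical path)
CENTRIFUGE_DB=centrifuge_db   # Centrifuge database (hypothetical path)
OUT=sample_long               # output name (hypothetical)

if command -v eukfinder >/dev/null 2>&1; then
    # "-t False" is an assumed value for TAXONOMY_UPDATE; check the help text.
    eukfinder long_seqs \
        -l "$LONG_SEQS" -o "$OUT" \
        --mhlen 30 --cdb "$CENTRIFUGE_DB" \
        -n 8 -z 4 -t False \
        -p "$PLAST_DB" -m "$PLAST_MAP" \
        -a "$ACC2TAX_DB" \
        -e 0.01 --pid 60 --cov 30
else
    echo "eukfinder not found on PATH; activate the conda environment first"
fi
```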

Output Formats

Sample Report Output Format

Custom Databases

We realize the standard database may not suit everyone's needs. Eukfinder also allows creation of customized databases.

Masking of Low-complexity Sequences

Low-complexity sequences, e.g. "ACACACACACACACACACACACACAC", are known to occur in many different organisms and are typically less informative for use in alignments; the BLAST programs often mask these sequences by default.
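
To see what masking does to such a sequence, the sketch below writes the example repeat to a small FASTA file and runs NCBI's dustmasker on it, if that tool is installed. dustmasker is used here only as one common masking tool from the BLAST+ package; this page does not state which tool, if any, Eukfinder's database build uses, and the file names are hypothetical.

```shell
# Write a tiny demo FASTA containing the low-complexity repeat quoted above.
DEMO=demo_lowcomplexity.fasta          # hypothetical file name
MASKED=demo_masked.fasta               # hypothetical file name
printf '>demo_repeat\nACACACACACACACACACACACACAC\n' > "$DEMO"

if command -v dustmasker >/dev/null 2>&1; then
    # With -outfmt fasta, regions that dustmasker masks are written in lowercase.
    dustmasker -in "$DEMO" -infmt fasta -outfmt fasta -out "$MASKED"
    cat "$MASKED"
else
    echo "dustmasker not found on PATH; install BLAST+ to try this example"
fi
```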

Eukfinder Variables