Kodoja Manual - abaizan/kodoja GitHub Wiki
Kodoja Manual: 1.0 (Describes Kodoja version 0.0.6)
Description
Kodoja is a bioinformatics workflow that takes RNA-seq data files and uses k-mer profiling to identify virus sequences that are present. It combines two existing tools, Kraken (1) for taxonomic classification using k-mers at the nucleotide level and Kaiju (2) for sequence matching at the protein level.
Kodoja has three main components: (a) kodoja_build, for database generation for Kraken and Kaiju; (b) kodoja_search, for taxonomic classification of RNA-seq reads; and (c) kodoja_retrieve, for extraction of viral sequences by species for downstream analysis.
Kodoja can be used in two ways:
- at the command line in Linux
- as a tool in Galaxy
1.0 Using Kodoja at the command line in Linux
1.1 Download kodoja scripts using Bioconda
Kodoja is available from Bioconda (https://anaconda.org/bioconda/kodoja). Install the code and all its dependencies using:
$ conda install -c bioconda kodoja
Kodoja depends on several other tools. If you already have some of them installed, the version Kodoja requires may differ from the version you have (for example, Kodoja requires Jellyfish 1.1.12 and you might have version 2.2.6 installed). In that case Kodoja will fail with a message reporting the version mismatch. To avoid this, create a separate conda environment before you install Kodoja:
$ conda create -n kodoja
$ source activate kodoja
$ conda install -c bioconda kodoja
1.2 Download the pre-computed virus databases
Kraken and Kaiju require k-mer databases for virus sequences. Pre-computed databases are available for download from Zenodo (https://doi.org/10.5281/zenodo.1406071). Kraken requires four database files: database.idx, database.kdb, names.dmp and nodes.dmp (the .dmp files must be located in a subdirectory named taxonomy). Kaiju requires a single database file, kaiju_library.fmi. All these files (and the required directory structure) are provided as a single zipped tar file. Download this file from Zenodo and type:
tar -zxvf kodojaDB_v1.0.tar.gz
to untar and unzip the file. This creates two directories: (i) krakenDB, containing the database files (database.idx, database.kdb) with nodes.dmp and names.dmp in a subdirectory named taxonomy, and (ii) kaijuDB, containing the kaiju_library.fmi file.
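The directory layout described above can be checked before running a search. The following is a minimal sketch, not part of Kodoja: the helper name missing_database_files is hypothetical, and it assumes the archive was unpacked in the directory passed as root.

```python
from pathlib import Path

# File layout described in this section for the unpacked database archive.
EXPECTED_FILES = [
    "krakenDB/database.idx",
    "krakenDB/database.kdb",
    "krakenDB/taxonomy/nodes.dmp",
    "krakenDB/taxonomy/names.dmp",
    "kaijuDB/kaiju_library.fmi",
]


def missing_database_files(root="."):
    """Return the expected database files that are absent under root.

    Hypothetical helper: root is assumed to be the directory in which
    the tar file was unpacked.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling missing_database_files(".") in the directory where the archive was unpacked should return an empty list when all five files are in place.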
1.3 Use kodoja_search.py to search for virus sequences in the RNA-seq data files
Input files
- RNA-seq data files: RNA-seq data files (paired or single end) in fastq or fasta format
- Kraken k-mer database files: database.idx, database.kdb, nodes.dmp, names.dmp (see section 1.2)
- Kaiju k-mer database file: kaiju_library.fmi (see section 1.2)
To run the search tool with paired-end RNA-seq input files R1.fastq and R2.fastq in directory dat/, Kraken database files in krakenDB/, Kaiju database files in kaijuDB/ and an output directory outdir/, the command line would be:
kodoja_search.py -o outdir -r1 dat/R1.fastq -r2 dat/R2.fastq -d1 krakenDB/ -d2 kaijuDB/
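When the search is launched from a script, the command above can be assembled as an argument list. This is only an illustrative sketch: build_search_command is a hypothetical helper, not part of Kodoja, and it assumes that single-end data is run by leaving out the -r2 option.

```python
import shlex


def build_search_command(outdir, kraken_db, kaiju_db, read1, read2=None):
    """Assemble the kodoja_search.py argument list used in this section.

    Assumption: for single-end data the -r2 option is simply omitted.
    """
    cmd = ["kodoja_search.py", "-o", outdir, "-r1", read1]
    if read2 is not None:
        cmd += ["-r2", read2]
    cmd += ["-d1", kraken_db, "-d2", kaiju_db]
    return cmd


# Paired-end example using the paths from this section.
cmd = build_search_command(
    "outdir", "krakenDB/", "kaijuDB/", "dat/R1.fastq", "dat/R2.fastq"
)
print(shlex.join(cmd))  # show the command line without running it
```

The returned list can be passed to subprocess.run(cmd, check=True) once Kodoja is installed and on the PATH.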
Output files
- Results file: virus_table.txt, a tab delimited file of the virus sequences identified
- Results files from Trimmomatic (trimmed_read1, trimmed_read2)
- Results files from FastQC (trimmed_read1_fastqc.html, trimmed_read2_fastqc.html)
- Zipped results file from Kraken (kraken_FormattedTable.txt.gz)
- Zipped results file from Kaiju (kaiju_FormattedTable.txt.gz)
- Log file (log_file.txt)
- Additional files are deposited in the output directory, which are required for kodoja_retrieve.py
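Because virus_table.txt is tab delimited, it can be loaded with Python's standard csv module. The sketch below is a hedged example: read_virus_table is a hypothetical helper, and it assumes the first line of the file is a header row. The column names are not listed in this manual, so the code does not rely on any particular ones.

```python
import csv


def read_virus_table(path):
    """Read a tab-delimited table into a list of dicts keyed by the header.

    Assumption: the first line of the file is a header row.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, rows = read_virus_table("outdir/virus_table.txt") gives one dictionary per line of the table, and rows[0].keys() shows the column names actually present in your file.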
1.4 Use kodoja_retrieve.py to retrieve reads for a virus of interest
The virus_table.txt results file from kodoja_search.py (see section 1.3) lists the virus sequences identified in the RNA-seq files and includes the NCBI taxonomic identifier (taxid) for each virus. To extract all the RNA-seq reads attributed to a virus of interest, use kodoja_retrieve.py with the taxid. For example, to retrieve the virus reads with taxid 322019 using the results files created by kodoja_search.py in directory outdir/ and the paired-end RNA-seq fastq files in directory dat/, the command line would be:
kodoja_retrieve.py -o outdir -r1 dat/R1.fastq -r2 dat/R2.fastq -t 322019
This generates two files, virus_322019_sequences1.fastq and virus_322019_sequences2.fastq, in a subdirectory named subset_files.
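As a quick check on the retrieved reads, the output files can be located and their records counted. This is a minimal sketch with two hypothetical helpers. It assumes that subset_files is created inside the output directory and that the FASTQ files use the usual four lines per record.

```python
from pathlib import Path


def retrieved_read_files(outdir, taxid):
    """Return the paths of the two FASTQ files named in this section.

    Assumption: subset_files is a subdirectory of the output directory.
    """
    subset = Path(outdir) / "subset_files"
    return [subset / f"virus_{taxid}_sequences{i}.fastq" for i in (1, 2)]


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4
```

For the example above, retrieved_read_files("outdir", 322019) gives the two expected paths, and count_fastq_reads can be applied to each to see how many reads were attributed to the virus.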
1.5 Building your own k-mer databases for kodoja_search.py
Kodoja requires k-mer databases of virus genomes to be created for Kaiju and Kraken. We have provided pre-computed k-mer databases for download (see section 1.2), but you can create your own database files using kodoja_build.py:
kodoja_build.py -o kodojaDB/
where -o is the directory where genome files will be downloaded from RefSeq and database files created.
A database build may take several hours to complete, but only needs to be completed once (and then periodically to keep the virus databases up-to-date or to create databases with different host genomes included).
Output files
- viral_assembly_summary.txt: RefSeq assembly summary file for the virus (VRL) partition
- virushostdb.tsv: Virus host dataset from https://www.genome.jp/virushostdb/
- .fna and .faa files for all viruses listed in the viral_assembly_summary.txt file. These files are downloaded from the NCBI RefSeq database into separate directories, e.g. kodoja-db/refseq/viral/GCF_000846865.1
- Kraken database files: database.idx, database.kdb in directory krakenDB
- Kraken database taxid mapping files: nodes.dmp, names.dmp in directory krakenDB/taxonomy
- Kaiju database file: kaiju_library.fmi in directory kaijuDB
If you want to include a host genome that is in RefSeq in the k-mer databases, this can be done using the -p option. For example, to create a database with Arabidopsis thaliana (taxid 3702) as a host, use
kodoja_build.py -o kodojaDB/ -p 3702
If you want to include a host genome that is not in RefSeq (or any additional virus genomes not in RefSeq), this can be done using the -e and -x options. For example, to add the genome of the parasitic plant Cuscuta australis (taxid 267555), which is in GenBank but not RefSeq, you must first download the genome assembly FASTA format file (GCA_003260385.1_Cau_v1.0_genomic.fna.gz) from the NCBI genome database (https://www.ncbi.nlm.nih.gov/genome/?term=txid267555[orgn]) to an appropriate directory (e.g. host/). Then use
kodoja_build.py -o kodojaDB/ -e host/GCA_003260385.1_Cau_v1.0_genomic.fna.gz -x 267555
2.0 Using Kodoja in Galaxy
2.1 Install kodoja
Search the Galaxy tool shed https://toolshed.g2.bx.psu.edu/ for “kodoja”. Install the tool into your local installation of Galaxy. See https://galaxyproject.org/admin/tools/add-tool-from-toolshed-tutorial/ for generic information on Galaxy tool installation. You will need administrator rights to your local Galaxy installation to install Kodoja.
2.2 K-mer databases for Kodoja in Galaxy
You will need to download the kraken and kaiju database files from https://doi.org/10.5281/zenodo.1406071 and update your Galaxy configuration so that the databases are visible to the tool. Further information on how to do this can be found in the readme file associated with the Galaxy wrapper for Kodoja (accessed at http://toolshed.g2.bx.psu.edu/view/abaizan/kodoja).
2.3 Use kodoja_search.py within Galaxy to search for virus sequences in the RNA-seq data files
- Upload your RNA-seq fasta or fastq files: use Get Data, Upload file from computer. Large files may need to be uploaded from your local Galaxy’s FTP site, depending on your setup.
- Select the “Kodoja_database_search” tool
- Select the Kraken and Kaiju k-mer databases you want to use
- Select paired or single end options for your read files and then select Execute
- Results file: a single results file, the Kodoja species report, will be generated
How to reference Kodoja
Until our research paper (Baizan-Edge et al., Kodoja: A workflow for virus detection in plants using k-mer analysis of RNA-sequencing data) is published, please cite the GitHub repository: https://github.com/abaizan/kodoja/
References
1. Wood, D.E. and Salzberg, S.L. (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol., 15, R46.
2. Menzel, P., Ng, K.L. and Krogh, A. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun., 7, 1–9.