Running Talos
To run the workflow on a particular dataset you need to have Nextflow in your $PATH. The instructions below are for running TALOS on a High Performance Computing cluster (e.g. SAGA) with SLURM scheduling, so you also need SLURM to be able to use this workflow.
Before you can run this pipeline and analyze your data, the following actions need to be taken.
- Create a directory where you want to store the results of this pipeline, e.g. `Talos_output`.
- Copy the nextflow scripts (`00_set_up_environments.nf`, `01_run_quality_check.nf`, `02_simple_run.nf`) from the TALOS directory you installed on your computer to your output directory.
- Copy the file `nextflow.config` to your result directory. This file contains the path to your working directory; modify it as needed. On our cluster it is `$USERWORK/Talos_temp`. The folder will be created when not present.
- Copy the folder with the configuration files (folder: `configuration_files`) to your output directory.
- Modify the file `user_config_file`, which can be found in the folder copied in the previous step (a sketch of these settings is shown after this list):
  - Make sure you point to your sequences correctly (variable `params.reads`). Specify their location and how the forward/reverse reads are labelled (e.g. R1/R2).
  - Check the location of the Trimmomatic adapters (`params.adapter_dir`).
  - Check the locations of the PhiX and human genome sequences on your system. I am using a masked genome as described here. Here you can download the file to your system: Human_genome_masked
  - Check the location of the Kraken2 database that you want to use for classification. If you do not have a Kraken2 database, you have to build it yourself as described here: Manual kraken2 database.
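To make these settings concrete, below is a minimal sketch of how such entries in `user_config_file` could look in Nextflow configuration syntax. The parameter names `params.reads` and `params.adapter_dir` come from the points above; the other parameter names and all paths are illustrative assumptions, so check the file shipped with Talos for the exact names it uses.

```
// Sketch of user_config_file settings (all paths are examples, adapt to your system).
params {
    // Location of the raw reads; the glob pattern states how forward/reverse
    // reads are labelled (here R1/R2).
    reads       = "/path/to/raw_reads/*_R{1,2}.fastq.gz"

    // Directory with the Trimmomatic adapter sequences.
    adapter_dir = "/path/to/trimmomatic/adapters"

    // Hypothetical parameter names: reference sequences used for contamination
    // removal and the Kraken2 database used for classification.
    phix_genome  = "/path/to/phix/genome.fa"
    human_genome = "/path/to/human/GRCh38_masked.fa"
    kraken2_db   = "/path/to/kraken2_db"
}
```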
Setting up the conda environments
This pipeline uses conda for dependency management. If you do not have conda on your system, then install Miniconda as described here: https://docs.conda.io/en/latest/miniconda.html.
To set up the conda environments for this workflow you need to run the script `00_set_up_environments.nf`.
The script runs locally on the compute node and installs the conda environments in your working directory. These environments are fairly large, and by using the working directory we make sure that your system does not fill up with files you do not want to keep. You can run the script in the following way:
nextflow run 00_set_up_environments.nf -c configuration_files/user_config_file -resume
Let's break down this command to understand what is done.
- `nextflow run`: the main command to execute a pipeline of choice.
- `00_set_up_environments.nf`: the nextflow script with the instructions to run the pipeline.
- `-c configuration_files/user_config_file`: indicates which user configuration file to use.
- `-resume`: tells nextflow to check if the job has already been run before, so it will not redo all steps in the pipeline if that is the case. It is good practice to always add this to your nextflow commands.
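If you want to control where Nextflow caches these conda environments yourself, this can be set through Nextflow's `conda` configuration scope. The snippet below is only a sketch assuming the `$USERWORK/Talos_temp` working directory mentioned above; Talos's own configuration files may already take care of this for you.

```
// Assumption: cache the conda environments under the working directory used
// on our cluster ($USERWORK/Talos_temp); adapt the path to your own system.
conda {
    cacheDir = "$USERWORK/Talos_temp/conda"
}
```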
Running quality control of your metagenomic dataset
After setting up the environments we can now check the quality of our raw datasets. This step uses the SLURM scheduler on your cluster: it sets up SLURM jobs to run FastQC, and when all files are done the results are combined with MultiQC.
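How Nextflow submits these jobs to SLURM is defined in the configuration files that ship with Talos. For reference only, a minimal sketch of what a SLURM executor block in a Nextflow configuration typically looks like is shown below; the queue, account and resource values are assumptions, so take the real settings from `configuration_files`.

```
// Illustrative SLURM settings for Nextflow processes (all values are examples).
process {
    executor       = 'slurm'
    queue          = 'normal'
    cpus           = 4
    memory         = '8 GB'
    time           = '2h'
    clusterOptions = '--account=my_project'
}
```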
The command to run the script `01_run_quality_check.nf` is:
nextflow run 01_run_quality_check.nf -c configuration_files/user_config_file -resume
After running the script you can check the output folder you created at the start. There you should find two folders with the FastQC and MultiQC output.
Processing of a metagenomic dataset
The main script of the Talos pipeline takes the sequences of a metagenomic dataset and processes them to remove low-quality bases and sequences, removes contamination, and then uses the clean reads to calculate the average genome size and the sequencing depth (based on the observed diversity of the sequences), calculates distances between the datasets, and finally performs metagenomic classification with Kraken2.
We can run that script in the following way:
nextflow run 02_simple_run.nf -c configuration_files/user_config_file -profile local -resume
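The `-profile local` option selects a configuration profile defined in the Nextflow configuration, here making the main script run on the local node rather than through the scheduler. As an illustration only, profiles in a Nextflow configuration are typically declared like this; the actual profile definitions used by Talos are in its configuration files:

```
// Illustrative profile definitions (the real ones live in Talos's config files).
profiles {
    local {
        process.executor = 'local'    // run processes on the current machine
    }
    slurm {
        process.executor = 'slurm'    // submit processes as SLURM jobs
    }
}
```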
The output can then be further analyzed using your own tools.