3.1. Setup on a high-performance computing cluster
In order to use ProkEvo, the computational platform needs to have HTCondor, Pegasus WMS, and Miniconda installed. While these can be found on the majority of computational platforms, installation instructions are available in the HTCondor, Pegasus WMS, and Miniconda documentation, respectively.
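As a quick sanity check before cloning, you can confirm that all three prerequisites are available on your PATH (a small sketch; the exact versions printed will differ per system):
condor_version      # prints the HTCondor version
pegasus-version     # prints the Pegasus WMS version
conda --version     # prints the conda/Miniconda version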
[user@login ~]$ git clone https://github.com/npavlovikj/ProkEvo.git
[user@login ~]$ cd ProkEvo/
1. Downloading raw Illumina reads from NCBI
To download raw Illumina paired-end reads from NCBI, ProkEvo requires only a list of SRA ids stored in the file sra_ids.txt as input. As an example, this repo provides an sra_ids.txt file with a few Salmonella enterica subsp. enterica serovar Enteritidis genomes:
[user@login ProkEvo]$ cat sra_ids.txt
SRR5160663
SRR8385633
SRR9984383
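To analyze a different set of genomes, the same file can be generated with a one-liner (the accessions below are just the example ones; substitute your own):
printf "SRR5160663\nSRR8385633\nSRR9984383\n" > sra_ids.txt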
Once a list of SRA ids is created, the next step is to submit ProkEvo.
2. Using already downloaded raw reads
ProkEvo supports using raw Illumina reads already available on the local system. In order to use this feature, a tabular file rc.txt with the name of each sample and its local location should be created. There are multiple ways to create this file; the command we use is:
while read line
do
    echo "${line}_1.fastq file:///absolute_path_to_fastq_files/${line}_1.fastq site=\"local\"" >> rc.txt
    echo "${line}_2.fastq file:///absolute_path_to_fastq_files/${line}_2.fastq site=\"local\"" >> rc.txt
done < sra_ids.txt
where sra_ids.txt is the file with the SRA ids and absolute_path_to_fastq_files is the absolute path to the reads.
After this, the rc.txt file should look like:
SRR5160663_1.fastq file:///work/npavlovikj/ProkEvo/SRR5160663_1.fastq site="local"
SRR8385633_1.fastq file:///work/npavlovikj/ProkEvo/SRR8385633_1.fastq site="local"
SRR9984383_1.fastq file:///work/npavlovikj/ProkEvo/SRR9984383_1.fastq site="local"
SRR5160663_2.fastq file:///work/npavlovikj/ProkEvo/SRR5160663_2.fastq site="local"
SRR9984383_2.fastq file:///work/npavlovikj/ProkEvo/SRR9984383_2.fastq site="local"
SRR8385633_2.fastq file:///work/npavlovikj/ProkEvo/SRR8385633_2.fastq site="local"
Please note that the absolute path to the raw reads on our system is "/work/npavlovikj/ProkEvo/", and this location will be different for you.
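Before submitting, it can be worth checking that every fastq file listed in rc.txt actually exists; one small sketch (it takes the file:// URL in the second column, strips the prefix, and tests each path) is:
awk '{print $2}' rc.txt | sed 's|^file://||' | while read f
do
    [ -f "$f" ] || echo "missing: $f"
done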
Run ProkEvo!
Once the input files are specified, the next step is to submit ProkEvo using the provided submit.sh script:
[user@login ProkEvo]$ ./submit.sh
And that's it! The submit script sets the current directory as the working directory where all temporary and final outputs are stored. Running ./submit.sh prints lots of useful information on the command line, including how to check the status of the workflow and how to remove it if necessary.
Monitoring ProkEvo
Once the workflow is submitted, its status can be checked with the pegasus-status command:
[user@login ProkEvo]$ pegasus-status -l /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600
STAT IN_STATE JOB
Run 56:55 pipeline-0 ( /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600 )
Run 30:22 ┣━ex_spades_run_ID0000019
Run 24:16 ┗━ex_spades_run_ID0000007
Summary: 3 Condor jobs total (R:3)
UNRDY READY PRE IN_Q POST DONE FAIL %DONE STATE DAGNAME
17 0 0 2 0 26 0 57.8 Running *pipeline-0.dag
Summary: 1 DAG total (Running:1)
Briefly, this command shows the currently running jobs, as well as how much of the pipeline has been completed. As can be seen in the output above, at the time the command was run, two SPAdes jobs were running, and the pipeline was 57.8% done with no failed jobs. Depending on when the pegasus-status command is run, the shown output will differ.
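To keep an eye on a long run without re-typing the command, one option (assuming the standard watch utility is available on your login node) is to refresh the status periodically:
watch -n 60 pegasus-status -l /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600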
pegasus-status gives the following output when the pipeline is finished:
[user@login ProkEvo]$ pegasus-status -l /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600
(no matching jobs found in Condor Q)
UNRDY READY PRE IN_Q POST DONE FAIL %DONE STATE DAGNAME
0 0 0 0 0 32 0 100.0 Success 00/00/sub-pipeline/sub-pipeline.dag
0 0 0 0 0 45 0 100.0 Success *pipeline-0.dag
0 0 0 0 0 77 0 100.0 TOTALS (77 jobs)
Summary: 2 DAGs total (Success:2)
Once the pipeline has finished, researchers can run commands such as pegasus-analyzer to debug any failed jobs and pegasus-statistics to obtain summary statistics about the workflow, such as the number of jobs that succeeded or failed, the runtime of tasks, etc.
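For example, both commands take the run directory printed at submission time (here the directory from the run above; the -s all flag asks pegasus-statistics for every statistics level):
pegasus-analyzer /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600
pegasus-statistics -s all /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600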
Output
All the output files are stored in the directory outputs, which is in the directory ProkEvo is submitted from:
[user@login ProkEvo]$ ls outputs/
fastbaps_baps.csv sabricate_ncbi_output.csv SRR5160663_prokka_output.tar.gz SRR9984383_plasmidfinder_output.tar.gz
fastqc_summary_all.txt sabricate_plasmidfinder_output.csv SRR5160663_quast_output SRR9984383_prokka_output
fastqc_summary_final.txt sabricate_resfinder_output.csv SRR5160663_spades_output SRR9984383_prokka_output.tar.gz
mlst_output.csv sabricate_vfdb_output.csv SRR8385633_plasmidfinder_output.tar.gz SRR9984383_quast_output
roary_output sistr_all.csv SRR8385633_prokka_output SRR9984383_spades_output
roary_output.tar.gz sistr_all_merge.csv SRR8385633_prokka_output.tar.gz sub-pipeline.dax
sabricate_argannot_output.csv SRR5160663_plasmidfinder_output.tar.gz SRR8385633_quast_output
sabricate_card_output.csv SRR5160663_prokka_output SRR8385633_spades_output
[user@login ProkEvo]$
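Several per-sample results (e.g., the Prokka annotations) are delivered as tarballs; to inspect one of them, extract it in place (shown here with one sample from the example run):
tar -xzf outputs/SRR5160663_prokka_output.tar.gz -C outputs/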