3.1. Setup on a high-performance computing cluster
In order to use ProkEvo, the computational platform needs to have HTCondor, Pegasus WMS, and Miniconda installed. While these can be found on the majority of computational platforms, installation instructions are available in the HTCondor, Pegasus WMS, and Miniconda documentation, respectively.
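As a quick sanity check before cloning, you can confirm that all three prerequisites are available on your PATH (a small sketch; the exact versions printed will differ per system):
condor_version      # prints the HTCondor version
pegasus-version     # prints the Pegasus WMS version
conda --version     # prints the conda/Miniconda version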
[user@login ~]$ git clone https://github.com/npavlovikj/ProkEvo.git
[user@login ~]$ cd ProkEvo/
1. Downloading raw Illumina reads from NCBI
To download raw Illumina paired-end reads from NCBI, ProkEvo requires only a list of SRA ids stored in the file sra_ids.txt as input. As an example, this repo provides an sra_ids.txt file with a few Salmonella enterica subsp. enterica serovar Enteritidis genomes:
[user@login ProkEvo]$ cat sra_ids.txt
SRR5160663
SRR8385633
SRR9984383
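To analyze a different set of genomes, the same file can be generated with a one-liner (the accessions below are just the example ones; substitute your own):
printf "SRR5160663\nSRR8385633\nSRR9984383\n" > sra_ids.txt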
Once a list of SRA ids is created, the next step is to submit ProkEvo.
2. Using already downloaded raw reads
ProkEvo supports using raw Illumina reads already available on the local system. In order to use this feature, a tabular file rc.txt with the name of each sample and its local location should be created. There are multiple ways to create this file; the command we use is:
while read line
do
    echo "${line}_1.fastq file:///absolute_path_to_fastq_files/${line}_1.fastq site=\"local\"" >> rc.txt
    echo "${line}_2.fastq file:///absolute_path_to_fastq_files/${line}_2.fastq site=\"local\"" >> rc.txt
done < sra_ids.txt
where sra_ids.txt is the file with the SRA ids and absolute_path_to_fastq_files is the absolute path to the reads.
After this, the rc.txt file should look like:
SRR5160663_1.fastq file:///work/npavlovikj/ProkEvo/SRR5160663_1.fastq site="local"
SRR8385633_1.fastq file:///work/npavlovikj/ProkEvo/SRR8385633_1.fastq site="local"
SRR9984383_1.fastq file:///work/npavlovikj/ProkEvo/SRR9984383_1.fastq site="local"
SRR5160663_2.fastq file:///work/npavlovikj/ProkEvo/SRR5160663_2.fastq site="local"
SRR9984383_2.fastq file:///work/npavlovikj/ProkEvo/SRR9984383_2.fastq site="local"
SRR8385633_2.fastq file:///work/npavlovikj/ProkEvo/SRR8385633_2.fastq site="local"
Please note that the absolute path to the raw reads on our system is "/work/npavlovikj/ProkEvo/", and this location will be different for you.
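Before submitting, it can be worth checking that every fastq file listed in rc.txt actually exists; one small sketch (it takes the file:// URL in the second column, strips the prefix, and tests each path) is:
awk '{print $2}' rc.txt | sed 's|^file://||' | while read f
do
    [ -f "$f" ] || echo "missing: $f"
done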
Run ProkEvo!
Once the input files are specified, the next step is to submit ProkEvo using the provided submit.sh script:
[user@login ProkEvo]$ ./submit.sh
And that's it! The submit script sets the current directory as the working directory where all temporary and final outputs are stored. Running ./submit.sh prints lots of useful information on the command line, including how to check the status of the workflow and how to remove it if necessary.
Monitoring ProkEvo
Once the workflow is submitted, its status can be checked with the pegasus-status command:
[user@login ProkEvo]$ pegasus-status -l /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600
STAT IN_STATE JOB
Run 56:55 pipeline-0 ( /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600 )
Run 30:22 ┣━ex_spades_run_ID0000019
Run 24:16 ┗━ex_spades_run_ID0000007
Summary: 3 Condor jobs total (R:3)
UNRDY READY PRE IN_Q POST DONE FAIL %DONE STATE DAGNAME
17 0 0 2 0 26 0 57.8 Running *pipeline-0.dag
Summary: 1 DAG total (Running:1)
Briefly, this command shows the currently running jobs, as well as how much of the pipeline has been completed. As can be seen in the output above, at the time the command was run, two SPAdes jobs were running, and the pipeline was 57.8% done with no failed jobs. Depending on when the pegasus-status command is run, the shown output will differ.
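To keep an eye on a long run without re-typing the command, one option (assuming the standard watch utility is available on your login node) is to refresh the status periodically:
watch -n 60 pegasus-status -l /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600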
pegasus-status gives the following output when the pipeline is finished:
[user@login ProkEvo]$ pegasus-status -l /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600
(no matching jobs found in Condor Q)
UNRDY READY PRE IN_Q POST DONE FAIL %DONE STATE DAGNAME
0 0 0 0 0 32 0 100.0 Success 00/00/sub-pipeline/sub-pipeline.dag
0 0 0 0 0 45 0 100.0 Success *pipeline-0.dag
0 0 0 0 0 77 0 100.0 TOTALS (77 jobs)
Summary: 2 DAGs total (Success:2)
Once the pipeline has finished, researchers can run commands such as pegasus-analyzer to debug any failed jobs and pegasus-statistics to obtain summary statistics about the workflow, such as the number of jobs that succeeded or failed, the runtime of tasks, etc.
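For example, both commands take the run directory printed at submission time (here the directory from the run above; the -s all flag asks pegasus-statistics for every statistics level):
pegasus-analyzer /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600
pegasus-statistics -s all /work/npavlovikj/ProkEvo/npavlovikj/pegasus/pipeline/20210107T000041-0600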
Output
All the output files are stored in the directory outputs, which is in the directory ProkEvo is submitted from:
[user@login ProkEvo]$ ls outputs/
fastbaps_baps.csv sabricate_ncbi_output.csv SRR5160663_prokka_output.tar.gz SRR9984383_plasmidfinder_output.tar.gz
fastqc_summary_all.txt sabricate_plasmidfinder_output.csv SRR5160663_quast_output SRR9984383_prokka_output
fastqc_summary_final.txt sabricate_resfinder_output.csv SRR5160663_spades_output SRR9984383_prokka_output.tar.gz
mlst_output.csv sabricate_vfdb_output.csv SRR8385633_plasmidfinder_output.tar.gz SRR9984383_quast_output
roary_output sistr_all.csv SRR8385633_prokka_output SRR9984383_spades_output
roary_output.tar.gz sistr_all_merge.csv SRR8385633_prokka_output.tar.gz sub-pipeline.dax
sabricate_argannot_output.csv SRR5160663_plasmidfinder_output.tar.gz SRR8385633_quast_output
sabricate_card_output.csv SRR5160663_prokka_output SRR8385633_spades_output
[user@login ProkEvo]$
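Several per-sample results (e.g., the Prokka annotations) are delivered as tarballs; to inspect one of them, extract it in place (shown here with one sample from the example run):
tar -xzf outputs/SRR5160663_prokka_output.tar.gz -C outputs/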