Running

Options for command-line execution

There are several options available. Note that all paths provided as input to the main.sh script must be absolute paths, not relative paths. A minimal invocation using only the mandatory options is sketched right after the list below.

Mandatory options

  • -i <file.fasta> Path to the query input file in multi-FASTA format.
  • -d <database> Path to the database file.
  • -b <binary> Path to the binary file (Diamond or BLAST).
  • -T <function> The alignment task to perform: blastp and blastx are available.
  • -p <number_of_processes> Number of processes to split the computation into. The higher the number of processes, the more time is needed for pre-processing; a range from 5 to 500 is recommended. Note that the number of processes must never exceed the number of sequences.
  • -t <threads> The number of threads that each process can use.
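
As a minimal sketch, an invocation that supplies only the mandatory options could look like the following (all paths and values here are hypothetical; adapt them to your own data, database and binaries):

./main.sh -i /home/user/data/proteins.fasta -d /home/user/DB/nr -b /usr/local/bin/blastp -T blastp -p 20 -t 16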

Other options

  • -h Shows the usage of the software.
  • -f <6_BLAST_outformat> The BLAST tabular output format (format 6). For example, this is the default output format:
    -f "6 qseqid sseqid slen qstart qend length mismatch gapopen gaps sseq"
    Make sure that the required information is present in the reference database (see the example after this list).
  • -D Use the Diamond software instead of BLAST.
  • --slurm Use this option only if the computation will be run on a cluster with Slurm as the workload manager.
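
For instance, a custom output format that also reports percent identity, e-value, bit score and the subject title could be requested as shown below. This is only an illustrative choice of fields; whether a field such as stitle is available depends on how the reference database was built.

-f "6 qseqid sseqid pident length evalue bitscore stitle"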

BLAST/Diamond further options

It is of course possible to pass further options to the BLAST and Diamond software. This is done via prepared files located in the Bases directory: simply add the options to the file corresponding to the tool you are using (BLAST or Diamond).

  • blast_additional_options.txt
  • diamond_additional_options.txt

For example, in the Diamond additional options file we can insert:

--ultra-sensitive --quiet

All options must be entered on a single line.
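
Similarly, and purely as an illustration (the thresholds are arbitrary), the BLAST additional options file could contain a single line such as:

-evalue 1e-5 -max_target_seqs 5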

SLURM Execution Configuration

To run HPC-T-Annotator on an HPC cluster with SLURM as the workload manager, the user must properly configure all the configuration files that reside in the Bases folder, namely:

  • slurm_controlscript_base.txt
  • slurm_partial_script_base.txt
  • slurm_start_base.txt

Remember to properly configure these files, as failure to do so may compromise the entire execution.
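
The exact structure and placeholders of these template files are defined by the repository itself; purely as an illustration, the kind of SBATCH directives that typically have to be adapted to the target cluster look like this (partition, memory and time values are hypothetical, and --cpus-per-task should match the -t value passed to main.sh):

#!/bin/bash
# hypothetical resource requests; adapt partition, memory and time to your cluster
#SBATCH --job-name=hpc-t-annotator
#SBATCH --partition=batch
#SBATCH --cpus-per-task=48
#SBATCH --mem=64G
#SBATCH --time=24:00:00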

Please note that for execution through the SLURM workload manager, the --slurm option must be provided on the command line when running the main.sh script.
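
For instance, appending the flag to an otherwise unchanged generation command (all paths here are hypothetical):

./main.sh -i /home/user/assembly/queries.fasta -d /home/user/NR/nr.dmnd -b /home/user/bin/diamond -T blastx -p 50 -t 48 -D --slurm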

Execution pipeline example

After cloning the repository, you can proceed as follows: perform the code generation phase, upload (if necessary) the generated TAR package to the HPC machine, and then start the computation.
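
If the generation phase is performed on a machine other than the HPC cluster, the generated TAR package can be transferred with any standard tool, for example scp (the hostname and destination path below are hypothetical):

scp hpc-t-annotator.tar user@hpc-cluster:/home/user/annotation/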

Generation of code

There are two methods for generating the scripts.

Command-line generation

A command-line example using the Diamond suite:

./main.sh -i /home/user/assembly/slow_fast_degs_hs.fasta -b /home/user/bin/diamond -T blastx -t 48 -D -d /home/user/NR/nr.dmnd -p 50

In this case, we will divide the computation (and the input file) into 50 parts that will be processed simultaneously (with 48 threads each). In the end, the outputs of the 50 jobs will be combined into a single file.
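
Conceptually, each of the 50 jobs runs Diamond on its own chunk of the query file, and the tabular outputs are concatenated at the end. The following is only an illustration of that idea (chunk file names are hypothetical), not the scripts that HPC-T-Annotator actually generates:

# one of the 50 independent jobs, aligning its chunk with 48 threads
/home/user/bin/diamond blastx --query chunk_01.fasta --db /home/user/NR/nr.dmnd --threads 48 --outfmt 6 qseqid sseqid slen qstart qend length mismatch gapopen gaps sseq --out chunk_01.tsv
# after all jobs have finished, the partial results are merged
cat chunk_*.tsv > final_blast.tsv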

Another example using the BLAST suite:

./main.sh -i /home/user/project/assembly/slow_fast_degs_hs.fasta -b /home/blast/blastx -T blastx -t 48 -d /home/user/DB/nr -p 100

In this case we have split the computation into 100 jobs using the BLAST suite.

Interface generation

Execution on HPC machine

Once the generated TAR package has been uploaded to the HPC machine, extract the generated code:

tar -zxf hpc-t-annotator.tar && rm hpc-t-annotator.tar

Once this is done, you have everything you need to manage and start the computation, so all you have to do is run (if you are on Slurm):

sbatch start.sh
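
The progress of the submitted jobs can then be monitored with the usual SLURM tools, for example:

squeue -u $USER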

At the end of the computation, the output will be in the tmp directory under the name final_blast.tsv.
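
Assuming the default output format shown above, each line of final_blast.tsv contains the ten tab-separated fields qseqid, sseqid, slen, qstart, qend, length, mismatch, gapopen, gaps and sseq. A quick, purely illustrative check of the result could be:

# peek at the first alignments
head -n 5 tmp/final_blast.tsv
# count how many distinct query sequences received at least one hit
cut -f1 tmp/final_blast.tsv | sort -u | wc -l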
