Quick Introduction to GNU Parallel - LangilleLab/microbiome_helper GitHub Wiki

Author: Gavin Douglas
First created: 29 March 2017
Last updated: 4 June 2026

Introduction

GNU Parallel is a helpful tool for processing repetitive commands. This is especially useful in bioinformatics where often we want to run exactly the same command but with many different input (and output) files. GNU Parallel has extensive documentation and can give users sophisticated control. Below I demonstrate how a user would run the basic commands to run multiple jobs simultaneously.

I have also written a basic script for parsing the log of commands run with GNU parallel, to make it easier to figure out which jobs failed and which need to be re-run if needed. See here to learn about that tool.

Downloading the data

For this tutorial we will be running blastp to search a subset of Escherichia coli K-12 proteins against a subset of the Staphylococcus aureus NCTC 8325 proteome. This is just a toy example and isn't meant to demonstrate the best way to run blastp! These proteomes were randomly subsampled from UniProt just for a quick example.

Download the zipfile quick_gnu_parallel_tutorial.zip from https://doi.org/10.5281/zenodo.20544687.

Unzip the file and enter the folder:

unzip quick_gnu_parallel_tutorial.zip
cd quick_gnu_parallel_tutorial

The commands below assume you are within this decompressed directory.

Resources

Remember to cite the paper if you use this tool: GNU Parallel: The Command-Line Power Tool

This detailed forum post is really useful and shows some more sophisticated examples.

Also, be sure to get some GNU parallel merch to get in the mood for some serious multi-processing.

Unique options

You'll need to understand what the below syntax stands for to understand the tutorial commands.

The file name: {}
The file name with the extension removed: {.}
- e.g. test.fa would become test
To remove the path of a file: {/}
- e.g. /output/test.fa would become test.fa
And you can remove the path and extensions: {/.}
- e.g. /output/test.fa would become test
To indicate that everything that follows should be read in from the command line: :::
- e.g. parallel gzip ::: * means to gzip all files in the current working directory, while parallel gzip * won't work. You need to include :::.
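
To make the replacement strings above concrete, here is a small sketch of roughly equivalent plain-bash parameter expansions, using the same example path. The variable names (f, no_ext, base, base_no_ext) are made up for illustration. GNU Parallel does this substitution itself, so you never need to write these expansions in a parallel command.

```shell
# Plain-bash equivalents of the replacement strings, for comparison only.
f=/output/test.fa          # {}   -> /output/test.fa

no_ext="${f%.*}"           # {.}  -> /output/test
base="${f##*/}"            # {/}  -> test.fa
base_no_ext="${base%.*}"   # {/.} -> test

printf '%s\n' "$f" "$no_ext" "$base" "$base_no_ext"

# The same four values from GNU Parallel itself (if it is installed):
#   parallel echo '{} {.} {/} {/.}' ::: /output/test.fa
```

Note that {.} and {/.} remove only the final extension, so test.fasta.gz would become test.fasta rather than test.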

There are many other possible options for GNU Parallel as well, which you can read about here.

Preparing the data

We first need to set up the blast database and decompress the query files.

The next two commands will depend on what computer you're using to run these tests. If you're working on a cluster that uses SLURM, such as an Alliance Canada server, you can start an interactive session with five cores for one hour with (where XXXXXXXX is the account to use):

salloc --account=XXXXXXXX --nodes=1 --ntasks-per-node=5 --time=01:00:00

Similarly, you need to have blastp and GNU parallel installed. On an Alliance Canada server, blastp can be loaded with:

module load blast+/2.17.0

Otherwise, you may need to load a conda environment where blastp is available.

First, decompress the S. aureus proteome (subset to 100 random proteins as an example) and build the BLAST database:

gunzip starting_fastas/staph_ref.fasta.gz
makeblastdb -in starting_fastas/staph_ref.fasta -dbtype prot -out staph_blast_db

The E. coli K-12 proteome has been pre-split into 20 query files. Take a look at them:

ls starting_fastas/ecoli_query_seqs/

If you haven't looked at FASTA files in a while, take a look at one with less. Each protein is indicated by a separate header line (starting with >) and then the amino acid sequence on all lines until a new header line.

Rather than decompressing each query file one at a time, we can run this in parallel:

parallel -j 5 'gunzip {}' ::: starting_fastas/ecoli_query_seqs/*.gz

Retype the ls command above to confirm all files were decompressed. This is a first simple example of GNU Parallel in action -- we ran 20 gunzip commands across 5 cores simultaneously, rather than decompressing each file one at a time. Not a big time saver in this case, but shows a basic command!
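
For comparison, the serial equivalent of the parallel command above is an ordinary for loop. The sketch below shows both on throwaway files in a temporary directory (the file names are made up and are not the tutorial data). It uses GNU Parallel if it is found on your PATH and falls back to the loop otherwise, so it is safe to try anywhere.

```shell
# Throwaway gzipped files in a temp directory (hypothetical names, not the tutorial data).
tmpdir=$(mktemp -d)
for i in 1 2 3 4; do
  printf '>seq%s\nMKV\n' "$i" | gzip > "$tmpdir/query_$i.fasta.gz"
done

if command -v parallel >/dev/null 2>&1; then
  # Up to 2 gunzip jobs at a time.
  parallel -j 2 'gunzip {}' ::: "$tmpdir"/*.gz
else
  # Serial equivalent: the same commands, one after another.
  for f in "$tmpdir"/*.gz; do
    gunzip "$f"
  done
fi

# Count what is left compressed and what was decompressed.
n_left=$(find "$tmpdir" -name '*.gz' | wc -l)
n_done=$(find "$tmpdir" -name '*.fasta' | wc -l)
echo "still compressed: $n_left, decompressed: $n_done"
```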

Running blastp commands with GNU Parallel

Often in bioinformatics we want to repeat a command over a large number of files. In this example there are 20 E. coli query files we want to run blastp on with the same options. With the below parallel command we'll run blastp on five files at a time with 1 thread allocated for each file. Note that in practice it would be more efficient to set more threads and run all queries at once against the database (so that it only has to be read into memory once), so again the point here is how to run parallel, not BLAST.

mkdir blastp_outfiles

parallel --joblog blastp_cmds1.log -j 5 \
  'blastp -db staph_blast_db -query {} -out blastp_outfiles/{/.}.out \
  -evalue 0.0001 -word_size 7 -outfmt "6 std stitle" \
  -max_target_seqs 10 -num_threads 1' ::: starting_fastas/ecoli_query_seqs/*.fasta

The options being given to parallel are everything before the single quotes. The command inside the single quotes contains options for blastp only, which were chosen just for this example (one to note is -num_threads, which is the number of threads to use for each blastp command).

The options we passed to parallel are:

--joblog: The file where log information will be written for each command that is run. This is very useful for checking whether all jobs finished successfully. More on this in my other tutorial.
-j 5 (or --jobs 5): The number of commands to run at the same time.

If you plan on running GNU Parallel on a server with many CPUs you should note that it will likely be more efficient to run fewer jobs if they are spending most of their time on input/output operations (i.e. I/O bound).
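
As a rough rule of thumb for CPU-bound jobs, the number of simultaneous jobs multiplied by the threads per job should not exceed the cores you have available. The sketch below works that out for the numbers used in this tutorial. The variable names are made up for illustration, and for I/O-bound jobs you may want a smaller value than it suggests.

```shell
# Assumed numbers for illustration: 5 cores (as requested with salloc above)
# and 1 thread per blastp job (-num_threads 1).
cores=5
threads_per_job=1

# Integer division: how many jobs fit without oversubscribing the cores.
jobs=$(( cores / threads_per_job ))
if [ "$jobs" -lt 1 ]; then
  jobs=1   # always run at least one job
fi

echo "parallel -j $jobs ... -num_threads $threads_per_job"
```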

Piping a list of input files

Note that at the end of the parallel command we used the ::: syntax to indicate the input files. You can also pipe (|) input files to parallel rather than using the ::: syntax. If you're not familiar with piping in Linux then you should look up an online tutorial like this one. Piping input files can be a little easier to understand due to the simpler syntax. This parallel command runs the same jobs as the earlier example:

mkdir blastp_outfiles2

ls starting_fastas/ecoli_query_seqs/*.fasta | parallel --joblog blastp_cmds2.log -j 5 \
  'blastp -db staph_blast_db -query {} -out blastp_outfiles2/{/.}.out \
  -evalue 0.0001 -word_size 7 -outfmt "6 std stitle" \
  -max_target_seqs 10 -num_threads 1'
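
One variation that can help when debugging, sketched here on made-up empty files in a temporary directory: write the list of inputs to a file, inspect it, and then feed that file to parallel on standard input. The --dry-run option prints the commands that would be run without running them. The file names and the echo command are placeholders, not part of the tutorial data.

```shell
# Made-up, empty query files in a temp directory, for illustration only.
tmpdir=$(mktemp -d)
touch "$tmpdir/query_1.fasta" "$tmpdir/query_2.fasta" "$tmpdir/query_3.fasta"

# One path per line, which is what parallel reads from standard input.
printf '%s\n' "$tmpdir"/*.fasta > "$tmpdir/inputs.txt"
n_inputs=$(wc -l < "$tmpdir/inputs.txt")
echo "inputs listed: $n_inputs"

# Preview the commands without running them (only if parallel is installed).
if command -v parallel >/dev/null 2>&1; then
  parallel --dry-run 'echo {/.}.out' < "$tmpdir/inputs.txt"
fi
```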

Piping commands from a file

You can also pipe lines of a file to parallel. As an example I will use a simple bash loop to write the commands we ran above to a file and then input these commands line-by-line to parallel. For this example writing a bash loop is much more complicated than just running the commands using either of the methods shown above, but I'll show it anyway since it could be useful in other contexts.

Make a new output folder:

mkdir blastp_outfiles3

Bash loop to produce the commands:

for f in starting_fastas/ecoli_query_seqs/*.fasta
do
  # Strip the directory and the .fasta suffix, then add .out
  out=$(basename "$f" .fasta).out
  echo "blastp -db staph_blast_db -query $f -out blastp_outfiles3/$out -evalue 0.0001 -word_size 7 -outfmt \"6 std stitle\" -max_target_seqs 10 -num_threads 1"
done > blastp_cmds3.txt

Cat the file of commands and pipe to parallel:

cat blastp_cmds3.txt | parallel --joblog blastp_cmds3.log -j 5 '{}'

In this case since the whole command is being input we can just refer to the input itself with {}.

Take a look at the log files - the columns Exitval and Signal are important for checking whether jobs completed correctly. The next tutorial focuses on this third approach for running batches of commands with GNU parallel and then parsing the logfile.
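
As a quick manual check, the sketch below pulls out jobs with a non-zero Exitval or Signal using awk. It runs on a made-up joblog written to a temporary file, with the tab-separated column layout that GNU Parallel uses (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command). The rows and commands are invented for illustration. To use it on a real log, point awk at a file such as blastp_cmds3.log instead.

```shell
# A made-up joblog with the tab-separated column layout GNU Parallel writes.
log=$(mktemp)
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' > "$log"
printf '1\t:\t1700000000.000\t2.100\t0\t0\t0\t0\tblastp -query a.fasta\n' >> "$log"
printf '2\t:\t1700000000.100\t0.050\t0\t0\t1\t0\tblastp -query b.fasta\n' >> "$log"
printf '3\t:\t1700000000.200\t1.300\t0\t0\t0\t15\tblastp -query c.fasta\n' >> "$log"

# Skip the header, then print the command of any job with a non-zero
# Exitval (column 7) or Signal (column 8).
failed=$(awk -F '\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$log")
n_failed=$(printf '%s\n' "$failed" | grep -c .)

echo "$failed"
echo "failed jobs: $n_failed"
```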