Metagenomics (IMPACTT December 2022) Pre workshop - LangilleLab/microbiome_helper GitHub Wiki

There are several sections here for you to work through so that you have some familiarity with the command line, and with some of the tools that we'll use, ahead of the workshop.

1. Introducing the Unix shell

Basic knowledge of the Unix shell is required for your successful participation in the workshop. If you are not familiar with basic Unix commands, please complete one of the following tutorials prior to the workshop:

Tutorial 1 (good for Mac users): UNIX / Linux Tutorial for Beginners (surrey.ac.uk)
Tutorial 2 (interactive, good for Windows users): Intro to Shell Scripting (datacamp) *
Tutorial 3 (good for Mac and Windows users): Command-line Bootcamp (ucdavis.edu)

*Tutorial 2 is free but does require you to create an account

2. Get or open your Terminal

For the third part of the tutorial, you will need access to a Unix shell that you can install things on.

If you are a Mac user, then you simply need to open the Terminal program. You can either follow the instructions in 3A to install parallel on your own computer, or you can log in to our server and do this part of the tutorial there. To do this, type:

ssh [email protected]

Press Enter and you should be asked for a password. Please message Robyn Wright on the Slack group for the password.

If you are a Windows user, you will need to:

Install PuTTY
Open PuTTY. In the Host Name (or IP address) box please type: [email protected]. Click Open. It should prompt you for a password. Message Robyn Wright on the Slack group for the password.

Once you have access to one of these, you can move on to the third part of the pre-workshop tutorial.

3. Introducing GNU Parallel

Often in bioinformatics research we need to run files corresponding to different samples through the same pipeline. You might be thinking that you can easily copy and paste commands and change the filenames by hand, but this is problematic for two reasons. Firstly, if you have a large sample size this approach will simply be unfeasible (in bioinformatics we are frequently working with hundreds or thousands of samples). Secondly, sooner or later you will make a typo! Fortunately, GNU Parallel provides a straightforward way to loop over multiple files. This tool provides an enormous number of options: some of the key features will be highlighted below. We like using this handy cheatsheet with a range of frequently used commands, or you can check out all of the options here.

If you are a Mac user, you can choose whether you want to install and run this on your own computer, or whether you want to use our server to log in to as above. If you are using our server, please skip straight to part B.

A) (Optional) Install parallel

First we will need to install GNU Parallel. You can check whether Parallel is already installed on your computer with the which command:

which parallel

If this shows you a file location that looks something like this /usr/local/bin/parallel then this means that Parallel is already installed and you can skip the install instructions.

If not, download the latest version of Parallel and extract the downloaded archive:

wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
tar xjf parallel-latest.tar.bz2

Change to the directory this will have created:

cd parallel-YYYYMMDD

Note that YYYYMMDD stands for the release date in the directory name. Currently this is parallel-20221122, but this will change with newer releases. You can type in part of the name and then press the Tab key to complete it or to see a list of options (type cd parallel-2022 and then press Tab). Or you could list all files in your current directory like this: ls

Now we will build and install the software:

./configure && make
sudo make install

Note that the sudo command requires administrator rights on your system, and it will prompt you for your password. If you do not have admin access on the computer you are using, you can try running the steps without sudo, like so:

./configure && make
make install

If this doesn't work, you may need to contact the person that is the administrator of the computer that you are using.
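If make install fails without sudo because the default install location is not writable, another option is to install into your home directory. The sketch below follows the approach described in the GNU Parallel README; using $HOME as the prefix is an assumption about where you want the program to live, and the check for ./configure is only there so that nothing happens if you run it from the wrong directory.

```shell
# Run this from inside the parallel-YYYYMMDD source directory.
# --prefix="$HOME" installs into ~/bin and ~/share instead of a system
# location, so no administrator password is needed.
if [ -x ./configure ]; then
    ./configure --prefix="$HOME" && make && make install
else
    echo "No ./configure here: cd into the parallel-YYYYMMDD directory first"
fi

# Make sure the shell can find programs installed in ~/bin.
export PATH="$HOME/bin:$PATH"
```

With this approach, which parallel should report a location inside your home directory rather than /usr/local/bin/parallel. The export line only affects the current terminal session; to make it permanent you would add it to your shell's startup file.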

You can check that parallel is installed as we did to begin with:

which parallel

And hopefully this will now give you something like this: /usr/local/bin/parallel

If you have any issues, feel free to post in the workshop's Slack channel and someone can help you!

B) Parallel tutorial

First, make a new directory and change to this directory. You should name this after yourself so that it doesn't get mixed up with what other people may be doing.

mkdir user
cd user

Note that in both of these commands, you should change "user" to your name. For example, I would run:

mkdir robynwright
cd robynwright

Next we will need to download the zip folder containing our test data using the command wget:

wget -O gnu_parallel_examples.zip 'https://www.dropbox.com/s/58grzx1ir7o8d3k/gnu_parallel_examples.zip?dl=1'
unzip gnu_parallel_examples.zip

You can use wget to download files from many links online, but if you copy a Dropbox share link like this one then you will need to change the dl=0 on the end to dl=1, as has been done here. Note that the second command is just to unzip the .zip file.
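As a small illustration of that link change, the sketch below rewrites the end of a share link with sed. The share_url value is the tutorial's own link written the way Dropbox would hand it out (ending in dl=0); the variable names are made up for this example.

```shell
# Share link as copied from Dropbox (ends in dl=0).
share_url='https://www.dropbox.com/s/58grzx1ir7o8d3k/gnu_parallel_examples.zip?dl=0'

# Swap the trailing dl=0 for dl=1 so that the file itself is downloaded.
direct_url=$(printf '%s\n' "$share_url" | sed 's/dl=0$/dl=1/')
echo "$direct_url"

# The quotes stop the shell from treating the ? in the link as a wildcard:
# wget -O gnu_parallel_examples.zip "$direct_url"
```

The wget line is left as a comment so that running the sketch does not download the data a second time.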

You can explore the files that are in this folder using the ls command followed by the file path, for example:

ls gnu_parallel_examples/
ls gnu_parallel_examples/example_files/
ls gnu_parallel_examples/example_files/fastqs/

You should see that this has listed all of the files first in the gnu_parallel_examples/ and then in the gnu_parallel_examples/example_files/ folder and finally in the gnu_parallel_examples/example_files/fastqs/ folder.

We can now safely remove the zip file and move into the gnu_parallel_examples folder.

rm gnu_parallel_examples.zip
cd gnu_parallel_examples

Firstly, whenever you're using parallel it's a good idea to use the --dry-run option, which will print the commands to screen, but will not run them.

This command is a basic example of how to use parallel:

parallel --dry-run -j 2 'gzip {}' ::: example_files/fastqs/*fastq

This command will run gzip (compression) on the test files that are in example_files/fastqs/. There are 3 parts to this command:

The options passed to GNU Parallel: --dry-run -j 2. The -j option specifies how many jobs should be run at a time, which is 2 in this case. This means that both of the commands printed out will be run at the same time. This is what makes parallel so powerful as it allows you to run multiple commands on different files at the same time.

NOTE: you should avoid trying to run more jobs at once than there are available processors on your workstation. This will vary depending on the hardware you are working on, but usually selecting 4 will be a fairly safe bet.

The commands to be run: 'gzip {}'. The syntax {} stands for the input filename, which is given in the last part of the command. In parallel, the command you want to run should always be surrounded by single quotes (').
The files to loop over: ::: example_files/fastqs/*fastq. Everything after ::: is interpreted as the files to read in!
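To make the role of {} and ::: more concrete, the parts listed above can be mimicked with an ordinary shell loop that prints one command per input, much as --dry-run does. This is only a sketch: preview_commands is a made-up helper and the sample file names are invented, so nothing is actually compressed.

```shell
# Print the gzip command that would be built for each input name.
preview_commands() {
    # "$@" holds the inputs, like everything after ::: in parallel.
    for f in "$@"; do
        # $f plays the role of {} in the parallel command.
        echo "gzip $f"
    done
}

preview_commands sampleA.fastq sampleB.fastq
```

The difference is that a loop like this would handle one file at a time, whereas parallel runs up to the number of jobs given with -j at once.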

Remove --dry-run from the above command and try running it again. This should result in both fastq files located in example_files/fastqs/ being compressed in gzip format. You can recognise this compression type as the files will always end in .gz.
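Since the -j value should not exceed the number of processors, it can be worth checking how many the machine reports before running jobs for real. The sketch below is one way to do this on Linux or macOS; the cap of 4 simply echoes the rule of thumb given earlier, and the variable names are made up.

```shell
# Number of processors currently online; fall back to nproc, then to 1,
# if getconf is not available on this system.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc 2>/dev/null || echo 1)
echo "Processors available: ${ncpu}"

# Use at most 4 jobs, and never more than the machine has.
if [ "$ncpu" -lt 4 ]; then
    njobs=$ncpu
else
    njobs=4
fi
echo "Suggested value for -j: ${njobs}"
```

On a shared server other people will be using the same processors, so a smaller value than the one suggested here may be more considerate.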

If you list the files in the example_files/fastqs/ again using the ls command, you should see that now they are in gzip format.
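Beyond checking the .gz ending, gzip itself can confirm that a compressed file is intact and let you look inside it without unpacking it. The sketch below shows this on a throwaway file so that the tutorial data is left untouched; demo.fastq and its single read are invented for the example.

```shell
# Work in a temporary directory with one made-up read.
scratch=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$scratch/demo.fastq"

# gzip replaces demo.fastq with demo.fastq.gz.
gzip "$scratch/demo.fastq"
ls "$scratch"

# -t tests the integrity of the compressed file without extracting it.
gzip -t "$scratch/demo.fastq.gz" && echo "demo.fastq.gz is intact"

# -d decompresses and -c writes to the screen, leaving the .gz file in place.
gzip -dc "$scratch/demo.fastq.gz" | head -n 4
```

The same gzip -t and gzip -dc commands could be pointed at the files in example_files/fastqs/ once they have been compressed.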

Sometimes multiple files need to be given to the same command you're trying to run in parallel. As an example of how to do this, take a look at the paired-end FASTQ files in this example folder:

ls example_files/paired_fastqs

The "R1" and "R2" in the filenames specify forward and reverse reads respectively. If we wanted to concatenate the forward and reverse files together (paste them one after another into a single file for each pair) we could use the below command:

parallel --dry-run -j 2 --link 'cat {1} {2} > {1/.}_R2_cat.fastq' ::: example_files/paired_fastqs/*R1.fastq ::: example_files/paired_fastqs/*R2.fastq

There are a few additions to this command:

Two different sets of input files are given, which are separated by ::: (the first set is files matching *R1.fastq and the second set is files matching *R2.fastq).
The --link option will allow files from multiple input sets to be used in the same command in order. {1} and {2} stand for input files from the first and second sets, respectively.
Finally, the output file was specified as {1/.}_R2_cat.fastq. The / will remove the full path to the file and . removes the extension.
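The way {1/.} shortens a file name can be reproduced with standard shell tools, which may help when checking what an output file will be called. This is a sketch of the naming logic only, not of how parallel works internally; the input path is an assumed example based on the file names mentioned in this tutorial.

```shell
# Assumed example input; the real file names may differ.
f=example_files/paired_fastqs/test1_R1.fastq

base=$(basename "$f")        # path removed, like {/}       : test1_R1.fastq
stem=${base%.*}              # extension removed, like {/.} : test1_R1
out="${stem}_R2_cat.fastq"   # output name as built in the command above

echo "$out"
```

Because the path is stripped, the output file is written to the directory you run the command from rather than next to the input files.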

It's important to check that each sample's files are being linked correctly when the commands are printed to screen with the --dry-run option. If the commands look OK then try running them without --dry-run now!

You should now have two files in your directory named test[1-2]_R1_R2_cat.fastq. Let's see how many lines these files contain with the following command:

wc -l test*
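Line counts are a quick sanity check because a FASTQ file normally stores each read on four lines, so the number of reads is the line count divided by four. The sketch below shows the arithmetic on a made-up two-read file, assuming that usual four-line layout; the file and variable names are invented.

```shell
# Throwaway file containing two made-up reads, four lines each.
scratch=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$scratch/demo.fastq"

# Reading from standard input makes wc print only the number.
lines=$(wc -l < "$scratch/demo.fastq")
reads=$((lines / 4))
echo "demo.fastq: ${lines} lines, ${reads} reads"
```

For the concatenated files, the line count should equal the sum of the line counts of the R1 and R2 files that went into them.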

You can return to the higher directory with this command:

cd ..

Finally, one thing to keep in mind when running commands with parallel is that you can keep track of their progress (e.g. how many are left to run and the average runtime) with the --eta option.

4. Cytoscape

Cytoscape will be used for data visualisation during the transcriptomics part of the workshop. Prior to the workshop, you should:

Install Cytoscape. Note that you should install Cytoscape version 3.7.2 instead of the latest version.
Go through a quick tour of Cytoscape to familiarize yourself with the tool.
If possible, go through this tutorial on using Cytoscape for basic data visualization.

5. Extras

This is by no means essential for the workshop, but if you'd like to do some more background reading then we are in the process of putting together some resources for those new to the microbiome field here.