Microbiome Helper 2: Brief introduction to the command line

Authors: Robyn Wright
Modifications by: NA

Please note: We think that everything here should work, but we are still testing/developing this so use with caution :)

Introduction

This is aimed at familiarising people with moving around on the command line, opening files, zipping files, and creating conda environments. If you have never done anything like this before, it may be useful to work through this tutorial first, and to review the CBW 2025 AWS/Unix introduction.

Here, we're going to be learning the basics of moving around on the command line, creating directories, zipping and unzipping folders, making tar archives and installing programs on a server. We'll also be downloading files to the server/computer, investigating the format of fasta and fastq files, and looking at the quality of sequenced samples.

It is assumed that you're doing this on a server, but access to the Terminal is really all that is required. We'll add some more details to the "Things you need before starting" and "Computational resources required" pages at some point.

Throughout this module, there are some questions aimed to help your understanding of some of the key concepts.

1. Log into your AWS instance/the server

I will try to add some more general instruction here at some point. For now, you can follow these instructions on setting up an AWS instance if you would like.

Otherwise, this usually looks like going to your Terminal (this used to only exist on Mac/Linux, but modern Windows PCs also have one) and typing in ssh followed by your server address. For example, if I was logging into my account on the Digital Research Alliance Canada's Graham cluster, I would type:
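ssh rwright@graham.computecanada.ca

(The address above is just an illustration of the username@server.address format; substitute your own username and the address of the server you're connecting to.)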

I would then be prompted for my password, which I would type in and then press enter. Note that you won't see yourself typing like you would on most user interfaces, so you'll have to type carefully and try again if you get it wrong!

2. Creating directories and moving around on the command line

Now that we're logged in, we're going to get familiar with what's on the server. First of all, type ls and press enter. You should see all of your directories printed out. In our case, this would be:
R anaconda3 aws scripts
These will all be directories/folders/files on the server. Think of these like you would the directories on your own computer, e.g. "My Documents" and "Downloads". The ls command is just telling the server to list out the files/directories in the directory that we're currently in. If this is your first time logging into your own server, this might be empty.

Next, type in pwd and press enter. The pwd command tells you which directory you're currently in, so this should look something like: /home/ubuntu.

Now we're going to change directory. We can change to any directory that exists here. If you're on the CBW AWS instance, we'll want to change into workspace, so type in cd workspace and press enter. If you're on your own server, it might be that you need to create and change into your own folder, like this: type in mkdir robyn and press enter, and then type in cd robyn and press enter. You will always need to press enter to run a command. If you type in pwd again, you should see that you've changed directory. If you type in ls, you should see that the directory is empty if you're on the CBW AWS instance, or it might contain some files. We might want to create directories for each of the analyses that we do, so we'll make one called microbiome_helper:
mkdir microbiome_helper
Note that it doesn't really matter what we call this directory, but if you name it something different then you'll need to remember that when we run things later on so it's easier to keep it consistent.

Now let's change to the directory we just made:
cd microbiome_helper
If you use the ls command, you should see that it's empty. Now what happens if we want to go back to the robyn directory? We can use cd .., which will take us up one level. Try that and then use the ls command again. You should see the folder that you made. If you use cd on its own, then it'll take you back to where you started. Try it and then use the pwd command. Now to get back to the directory we created for this module, use: cd ~/robyn/microbiome_helper. Note that any time you log out from the server (or lose connection/time out and get logged out), you will need to change back to the directory that you're working from. Use the pwd command once more to check that you're in the right place (/home/ubuntu/robyn/microbiome_helper).
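If it helps to see those moving-around commands in one place, here's the whole sequence from the paragraph above (assuming you created a directory called robyn; swap in your own name):

cd ..    # go up one level, back to ~/robyn
ls       # you should see microbiome_helper listed
cd       # with no directory given, cd takes you back to your home directory
pwd      # should print /home/ubuntu
cd ~/robyn/microbiome_helper    # back to the directory for this module
pwd      # should print /home/ubuntu/robyn/microbiome_helper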

Remember that wherever it says robyn here, you should change this for your own name!!

3. Use wget to download files

Now we're going to download some files to use. There are a lot of different data repositories available online, but we're going to use some data from the Human Microbiome Project. This repository is quite straightforward because all we need is a link to the data and we can download it, but other repositories require their own programs and file formats for downloading which can make them quite complicated (and frustrating, if you don't know what you're doing!) to use. You can see the webpage that we're taking the files from here.

Now we'll download the files. There are a few different commands that we can use for downloading files, so you might have seen others, or may see others in the future, but for now we'll be using a command called wget. If you just type in wget on its own and press enter, you'll see that it gives an error message because we haven't given it a URL, but it also tells us that we can run wget --help to see more options. Try running that. If you scroll back up to where you ran it, you'll see a huge list of options that you could give wget depending on what you want to do. We won't be using most of these options for now, but you can usually add --help to whatever command you are trying to run to get more information about what it is expecting from you (these are called "arguments").

We're going to download three files like so:

wget https://g-227ca.190ebd.75bc.data.globus.org/ibdmdb/raw/HMP2/16S/2018-01-08/206534.fastq.gz
wget https://g-227ca.190ebd.75bc.data.globus.org/ibdmdb/raw/HMP2/16S/2018-01-08/206536.fastq.gz
wget https://g-227ca.190ebd.75bc.data.globus.org/ibdmdb/raw/HMP2/16S/2018-01-08/206538.fastq.gz

You should see some progress bars come up, but these files aren't very big so they shouldn't take very long to download. Now use ls again to see the files. You should see:

206534.fastq.gz  206536.fastq.gz  206538.fastq.gz

4. Move these files to the directories

When we downloaded these, we didn't make a directory to put them in, so let's do that now so that we can tidy them up a bit:

mkdir test_data

And then we can move them to this directory we've just made using the mv command. The mv command is expecting the name of the file that we want to move, and then the directory/path that we want to move this file to as arguments:

mv 206534.fastq.gz test_data/

If you use the ls command again now, you will see that there are only two files (along with the test_data directory). You can also use the ls command on the test_data directory: ls test_data, and you should see the file that we moved into there.

Often, we might have a lot of files to move and we don't want to have to move them one by one. If we want to move multiple files of the same type, we can do that like this:

mv *.fastq.gz test_data/

The asterisk (*) acts as a wildcard, and anything that ends in .fastq.gz will be moved using this command. If you take a look in test_data again using the ls command, you should see all three files in there now, and not in your current directory. Another useful thing that you can do with the ls command is add another argument to get some information about the files: ls -lh test_data/. When you run this, you should see who has permission to read/write to the files, the owner and group of the files, the size of the files, and when they were last modified. The -l is what is telling this to give you the long-format information, and adding the h converts the sizes into "human" readable format. Try using it without the h - you'll see that the file sizes are in bytes instead of megabytes (MB/M).
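As a rough illustration (the sizes, dates and owner below are made up, and yours will differ), the output of ls -lh test_data/ looks something like this:

-rw-rw-r-- 1 ubuntu ubuntu 2.1M Jun  1 12:00 206534.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.3M Jun  1 12:00 206536.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 2.0M Jun  1 12:00 206538.fastq.gz

Reading from left to right, each line shows the permissions, the number of links, the owner and group, the size, the last-modified date and time, and the file name.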

5. Zip and unzip these files

You might have noticed the .gz on the end of the files indicating that they're zipped (or compressed). Sometimes when we run bioinformatics programs they can uncompress the files within the program, but other times we need to uncompress (unzip) them first. These are quite small files so you might think it's unnecessary to zip/compress them, but often we have thousands of sequencing data files and they can each be hundreds of gigabytes (GB) or even terabytes (TB) in size, so it becomes quite important to keep them compressed until we need them.

We'll use the gunzip command to unzip them. Try typing in gunzip test_data/20 and then pressing the tab button. If you press it a couple of times, you should see a list of your options come up. The tab button can be really useful for completing file paths for you and saving you a lot of typing! Continue typing and choose one of the files to unzip (e.g. type in 8 and then press tab again and then enter). Now if you run ls -lh test_data again, you should see that the file you unzipped no longer has the .gz on the end, and it's much larger in size now.

We'll zip the file back up for now: gzip test_data/20 - if you press tab to complete again (and enter), you should find that this will auto-fill with the file that you unzipped, because it's the only file type that the command is able to work with.

What happens if you try to run this on a file that is already zipped? Try it: gzip test_data/206538.fastq.gz. It should tell you that it can't do anything because that file already has the .gz suffix.

Let's unzip all of the files now: gunzip test_data/*. See that we can use the asterisk (*) again as a wildcard and it will unzip every file in the test_data directory. Take a look at the directory with ls or ls -lh if you like. Remember to add the directory name!

6. Create new tar archive of files

There are several different ways of compressing files - there is gzip/gunzip that we just used, but we can also package up multiple files inside a directory together. We'll be using the tar command for this, and as you can see if you run tar --help, there are lots of options available. Let's try it out with the test_data directory: tar -czvf test_data.tar.gz test_data/. Here we gave the command several arguments: -czvf (see below), test_data.tar.gz (the file name that we want our tar archive to have) and test_data/ (the name of the directory that we want to compress). -czvf is a short way of giving several arguments to the command: c stands for "create" (creates a new archive), z for "gzip" (this tells tar to write/read through gzip), v for "verbose" (meaning that it will print out information about what it is doing), and f for "file" (the next argument is the name of the archive file to write).

If you now run ls -lh, you should see that the tar archive (test_data.tar.gz) is a smaller size than the 3 files would be together (check by running ls -lh test_data). You can also take a look at what's in the tar archive with the less command: less test_data.tar.gz. Press q (for "quit") when you're ready to exit.
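If less just shows you compressed gibberish on your system (whether it shows a file listing depends on how less is set up), another common way to list what's inside an archive without extracting it is to use tar with t for "list" instead of c for "create":

tar -tzvf test_data.tar.gz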

Usually we'll make a tar archive because we want to keep our files but save some space, so let's delete the original folder: rm -r test_data/. Hopefully by now you're getting the hang of how these commands work. The rm command is for removing files - you should always be really careful when using it because it won't ask you if you're sure like your regular computer would, and most servers don't have a "recycle bin" for the files to be sent to, so if you remove them, they're gone for good. The -r argument is for removing a directory rather than just a file.

7. Unzip tar archive

If we need to use the data that we zipped into the tar archive again, we'll need to unzip - or extract - it.

To unzip the tar archive, we can do that like so: tar -xvf test_data.tar.gz. Note that we just replaced the cz with x for "extract". You should be able to see the files back in test_data with the ls command.

8. Look at fasta and fastq files with less

Now we're going to take a look at these files. Let's look at 206538 first: less test_data/206538.fastq. You can scroll through the file, and remember to press q when you want to stop looking at the file. If you want to look at it again, press the up arrow key to run the less command again. You can always press the up arrow to go back through the commands that you've run previously. If you've typed something wrong and want to start again, press ctrl+c to get back to a blank command prompt.

You should have noticed that this file had the .fastq extension. Let's copy across the same files in fasta format. If you're on your own server, download them with wget http://kronos.pharmacology.dal.ca/public_files/CBW_2025/bmb_module1/test_data_fasta.tar.gz and then unzip this: tar -xvf test_data_fasta.tar.gz

If you're on the CBW AWS instance: cp -r ~/CourseData/MIC_data/microbiome_helper/test_data_fasta/ . Here, we're using the cp command to copy across some files that I already set up earlier. The ~ is usually a shortcut for your home directory to save us typing it each time, and the rest is directories that we've already set up. The -r argument is the same as above for rm - it means that we're taking a directory and not a file, and then the . shows that we want to copy the data into the directory that we are currently in. We could replace it with another file path if we wanted.

Take a look at the same file in fasta format: less test_data_fasta/206538.fasta

You can also download these files by going to http://##.uhn-hpc.ca/ (where ## is your personal number). Navigate to the correct directory and then right click the files > save link as > choose location > choose name. You’ll be able to open them with a text editor like TextEdit (Mac) or Notepad (Windows).

Most servers aren't set up with a handy webpage where you can browse your files, like this one is. In that case, you can use the scp (secure copy) command to copy them across to your own computer.

If you're on your own server, you can go to a new Terminal window. To download the whole folder, you can run:

scp -r username@server.ca:/full_working_directory/test_data_fasta/ .
scp -r username@server.ca:/full_working_directory/test_data/ .

Here, the -r flag is telling it to download the whole folder, the . tells it to download the folder to your current location, and you will need to change the username, server.ca and full_working_directory parts. If I was doing this on the Digital Research Alliance Canada server that I mentioned before, I'd do something like this:

scp -r rwright@graham.computecanada.ca:/home/rwright/microbiome_helper/test_data_fasta/ .
scp -r rwright@graham.computecanada.ca:/home/rwright/microbiome_helper/test_data/ .

Remember that you can get the full working directory with the pwd command, but you'll need to add on the test_data_fasta folder name.

If you're using the CBW AWS image, or have a .pem file for logging in, you can navigate to the location of your .pem file in a new Terminal window. To download the whole folders, you can run:

scp -r -i CBW.pem ubuntu@##.uhn-hpc.ca:/home/ubuntu/workspace/microbiome_helper/test_data_fasta/ .
scp -r -i CBW.pem ubuntu@##.uhn-hpc.ca:/home/ubuntu/workspace/microbiome_helper/test_data/ .

Make sure that you replace the ## with your instance number, or replace the whole address with your own server's address if you're not using the CBW AWS instance.

Now you can open the files with a text editor like TextEdit (Mac) or Notepad (Windows). You should see that the fastq file has 4 lines for each sequence, while the fasta only has two. The fasta has a first line that starts with ">" that contains the sequence name and description, and the second line contains the actual DNA sequence (DNA in this case, but this file format could also contain RNA or protein sequences). fastq files start the same, with the first line containing the sequence name and description (but starting with an "@" symbol instead), and the second containing the sequence. The third line then contains a "+" character, and the fourth contains quality information about each base of the sequence (and should contain the same number of characters as the sequence). You can read more about the quality information that the symbols encode here.
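To make the difference concrete, here's what a single (made-up) record would look like in each format - the read name, sequence and quality string below are illustrative, not taken from these files. A fastq record takes four lines:

@206538_read1 example description
ACGTGGCTAACGTTAGC
+
IIIIHHHGGFFFEEDDC

while the same record in fasta format takes only two:

>206538_read1 example description
ACGTGGCTAACGTTAGC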

To count the number of lines in a file, we can use the less command again, this time piping its output into the wc -l command, which counts lines: less test_data/206538.fastq | wc -l

You should see that this file contains 63,048 lines.

Question 1: How many sequences does this mean that there are? Remember that you can just use less test_data/206538.fastq to look at the file.

Answer

Each sequence is spread across four different lines, so there are 63,048/4 = 15,762 sequences.

Now do the same for the fasta file: less test_data/206538.fasta | wc -l.

You should see that the fasta file contains half as many lines as the fastq file. Sometimes you'll find that the sequences within a fasta file are split across multiple lines, in which case simply counting the number of lines wouldn't work, so there are also other ways to count the number of sequences in a file, and these can be adapted for other purposes, too. E.g.: grep -c ">" test_data_fasta/206538.fasta - the grep command pulls out every occurrence of a phrase (or "string", as it's usually called in programming) and the -c argument tells it to count these. What happens if you don't use the -c argument? Why do you think this happened?

In most programming languages, you have "positional" arguments and "named" arguments. Positional arguments need to be included in the proper position, or order, and that order is defined within the program. Named (or keyword) arguments are passed to the program after their name is given, so the program knows which value is which regardless of where it appears. In the case above, the -c ">" is a named argument and the file name test_data_fasta/206538.fasta is a positional argument.

Question 2: What happens if you try to do the same thing with "@" for the fastq file? Why is this? Remember to look at the fastq file with the less command for clues, or try running it without the -c argument.

Answer

The number of "@" in the fastq file is much more than the number of lines. This is because the "@" symbol is also used in the quality information. We can get round this by using part of the sample name, e.g., grep -c "@206534" test_data/206534.fastq.

9. Installing programs to the server

Now we're going to learn how to install programs to the server. A lot of the commands we have just used (like grep and less) are standard ones that will be installed on most servers, but frequently we will need to install other programs (usually called "packages"), too. The packages that we use are often not as stable as those that we download and use on our laptops (like Microsoft Word or Adobe Acrobat Reader) and so they sometimes depend on a particular version of another package. It frequently takes more time to install packages than it does to run them, and any bioinformatician will tell you how frustrating it can be. Anaconda can help to manage this, although it doesn't overcome these problems entirely!

Note that Anaconda is already installed on the CBW AWS servers so you don't need to run this, but if you want to install it for yourself, you can do so like this:

curl -O https://repo.anaconda.com/archive/Anaconda3-2024.10-1-Linux-x86_64.sh
bash Anaconda3-2024.10-1-Linux-x86_64.sh
#hold down enter key until you get to the end of the agreement, or press q
#type yes
#confirm location by pressing enter
#yes
#now close and reopen the window - you'll need to log back in the same way as you did before!

10. Conda environments

Anaconda, or conda, allows us to have separate "environments" for installing packages into. This means that if one package requires version 3.1 of another package, but another requires version 2.9, they won't interfere with each other. Often when we're installing new packages or starting a new project, we'll make a new environment. This also helps us to keep track of which versions of a package we've used for a specific project. The environment is essentially a directory that contains a collection of packages that you've installed, so that other packages know where to access the package that they need to use. We're going to make a new environment to install some packages into:

conda create -n microbiome_helper

You'll see that here we're using the conda command first, and then giving it the create and -n microbiome_helper arguments. We could call this environment anything we like, but it's best to make this descriptive of what it is so that when we collaborate with others or share our code, it'll be obvious what this environment is for.

You'll need to press y at some point, to confirm that you want to install new packages.

Now we can "activate" this environment like this: conda activate microbiome_helper. Any time you are logged out and you log back in, you'll need to reactivate the environment if you want to be working from it. If you want to see the other environments that are installed and available, you can run conda info --envs. You'll see that we've already installed a lot of environments that we'll be using over the next few days. While it would be great to be able to get you to install all of this for yourself, it can take quite a long time to install some packages so we've set most of them up already. If you want to see how we did that, you can see most of that here.

11. Install fastqc and multiqc

Now we'll install the packages that we want to use. Usually if there's a package that you're interested in, for example we'll be using one called "fastqc", you can just google "conda install fastqc" and you should see an anaconda.org page as one of the top hits, telling you how to install it. Sometimes you'll also see bioconda documentation, or a "package recipe" and this might give more details if you're struggling to install it. We'll install fastqc like this:

conda install bioconda::fastqc

You'll need to confirm that you want to install things with y at some point. If you forgot to activate the environment (see above), then you'll get an error that you don't have permissions to do this!

You can test to see whether it got installed by typing which fastqc - this should show you the location that it's installed in, and it should look something like this: /home/ubuntu/CourseData/MIC_data/.conda/envs/microbiome_helper/bin/fastqc.
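You can also ask fastqc to print its version, which is another quick way to confirm it installed correctly:

fastqc --version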

Now we'll install the second package that we need:

conda install bioconda::multiqc

Confirm this again with y

As you might have guessed from the "qc" in both of these names, we'll be using them for Quality Control of the sequence data.

12. Perform quality control on fastq files

First we'll be running fastqc, and to do that, we'll first make a directory for the output to go into: mkdir fastqc_out

Now we'll run fastqc:

fastqc -t 4 test_data/*.fastq -o fastqc_out

Here the arguments that we're giving fastqc are:

  • -t 4: the number of threads to use. Sometimes "threads" will be shown as --threads, --cpus, --processors, --nproc, or similar. Basically, developers of packages can call things whatever they like, but you can use the help documentation to see what options are available. We're using 4 here because that's the maximum that we have available. See below (htop) for how we find out about how many we have available.
  • test_data/*.fastq: the fastq files that we want to check the quality of.
  • -o fastqc_out: the folder to save the output to.

13. htop - looking at the number of processes we have available or running

Try running htop. This is an interactive viewer that shows you the processes that are running on your computer/server. There are a lot of different bits of information that this is showing us - you can see all of that here, but the key things for us are:

  • The CPUs (labelled 0, 1, 2, 3 at the top left) - this shows the percentage of the CPU being used for each core, and the number of cores shown here is the number of different processes/threads that we have available to us. In our case, this is 4.
  • Memory - this is the amount of memory, or RAM, that we have available to us. You'll see that it is ~16GB - this is similar to many laptops now, but many servers that you'll use or have access to for bioinformatics analysis will have much more than a standard computer. For example, one of the Langille lab servers has ~1.5 TB RAM. The larger your dataset, or the deeper your sequencing depth, the more RAM you are likely to need.
  • The processes (at the bottom) - you can see everything that is running under a PID (Process ID). This is useful when you're using a shared server to see who is running what, particularly for when you're wanting to run something that will use a lot of memory or will take a long time and you want to check that it won't bother anyone else.

When you're done looking at this, press F10 (on a Mac this is fn+F10) to exit from this screen.
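If you just want the numbers without the interactive display, two other standard Linux commands report the same information (the exact output format varies between systems):

nproc      # prints the number of CPU cores available
free -h    # prints total and used memory in human-readable units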

14. Back to the quality control

Now take a look at one of the .html files in microbiome_helper/fastqc_out/ (note that you'll need to download it as you did above, and if you still have the webpage open, you will need to refresh it).

Next we'll run multiqc. The name suggests it might be performing QC on multiple files, but it's actually for combining the output from multiple files together into a single report, so we can run it like this:

multiqc fastqc_out --filename multiqc.html

So we've given as arguments:

  • fastqc_out: the folder that contains the fastqc output.
  • --filename multiqc.html: the file name to save the output as.

Now look at multiqc.html.

There are some questions here to help you look at the files and interpret these:

Question 3: What is the GC% of the samples?

Answer

51%

Question 4: What % of the samples are duplicate reads? Is this what you expected?

Answer

In the "General Statistics" section, we can see that ~97% of the reads are duplicated. Looking in the "Sequence Counts" section and hovering over each sample will show us how many of the reads are unique. This makes sense, because the reads are from PCR-amplified samples so we are expecting most to occur more than once.

Question 5: Now look at the Sequence Counts section. Which sample has the most reads?

Answer

Question 6: How many unique and duplicate reads are in sample 206536?

Answer

752 and 29,883.

Question 7: Look at the Sequence Quality Histograms. Do these seem good to you? Why or why not? Does this seem normal?

Answer

The sequence quality here is all really high! These are all good sequences, but this isn't normal. This is because the samples that are available for download from the HMP website have already been quality filtered.

Question 8: Look at the top overrepresented sequence. If you want to see what it is, paste it into the "Enter accession number(s), gi(s), or FASTA sequence(s)" box here and click on the blue "BLAST" button at the bottom of the page.

Answer

All of the top hits are Bacteroides (finegoldii, caccae, stercoris, etc). These samples are from HMP2 IBD gut samples, so this seems normal!

16. Install QIIME2

For the final part of this page, we're going to install the QIIME2 program.

First, we'll deactivate the environment that we've been using:

conda deactivate

If you've found this easy so far and want more of a challenge, try doing this for yourself by following the conda instructions on the webpage here. Make sure that you're using the latest version, 2025.4.

Hint: You will likely want to install the base distribution and the CBW AWS servers are Linux.

Otherwise, follow the instructions here.

Download the files that we'll need:

wget https://raw.githubusercontent.com/qiime2/distributions/refs/heads/dev/2025.4/amplicon/released/qiime2-amplicon-ubuntu-latest-conda.yml

Create an environment using this file:

conda env create -n qiime2-amplicon-2025.4 --file qiime2-amplicon-ubuntu-latest-conda.yml

Note that where previously, we created an environment and then installed packages, this time we're creating an environment using a file that tells conda exactly how to create that environment and which packages should be installed within it.
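If you're curious, a conda environment file is just a plain-text (YAML) file listing the environment's channels and packages. A minimal, made-up example (not the actual QIIME2 file, which is much longer) looks something like this:

name: example-env
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - fastqc
  - multiqc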

Once we've made it, we can activate this new environment:

conda activate qiime2-amplicon-2025.4

And now we can remove the file that we used to install it:

rm qiime2-amplicon-ubuntu-latest-conda.yml
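To check that the installation worked, you can ask QIIME2 to print some information about itself (this can take a little while the first time it's run):

qiime info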