Getting started on a Linux server - LangilleLab/microbiome_helper GitHub Wiki

Authors: Robyn Wright
Initial date: February 2026

The information on this page has been written for the experiential learning students that we have at Dalhousie University, but may be useful for other new lab members or people using other servers in other institutions.

Initial steps

When you first get started, Morgan (or someone else in the lab) will give you details for you to log in with. Follow the instructions on this page for logging into the server for the first time and a Unix tutorial to get the hang of some basics.

If you forget your login details or aren't able to reset your password, please let one of us know and we can reset the password for you.

Once you've sorted this, come back to this page to run through this tutorial. Some of it may be a recap of what you learned in the other Unix tutorial.

Introduction

To begin with, we're going to be learning the basics of moving around on the command line, creating directories, zipping and unzipping folders, making tar archives and installing programs on a server. We'll also be downloading files to the server, investigating the format of fasta and fastq files, and looking at the quality of sequenced samples.

1. Log in to the server

First off, go to Terminal and open up a new window. Type in ssh your-name@kronos.pharmacology.dal.ca and press enter. You should be prompted for your password. Type it in and press enter - note that you won't be able to see what is being typed.

To log out from Kronos (the server), type in exit and press enter. Now you'll be back to typing things into your laptop. Press the up arrow on your keyboard to get back to the last thing that you typed in, press enter and log back into the server as you did previously.

2. Creating directories and moving around on the command line

Now that we're logged in, we're going to get familiar with what's on the server. First of all, type ls and press enter. It should look something like this (although note that you'll likely have less in there than I do!):

What you see in purple on the above screenshot are directories/folders, and the names in red (just fungi_test.tar.gz) are files. Think of these like you would the directories on your own computer, e.g. "My Documents" and "Downloads". The ls command is just telling the server to list out the files/directories in the directory that we're currently in.

Next, type in pwd and press enter. The pwd command tells you which directory you're currently in, so we can see that we're in /home/your-name.

Next, we're going to make a directory that we'll call testing. To do that, type in mkdir testing and then press enter - note that you will always need to press enter after typing in a command to get it to run, but it will usually be assumed that you know that. If you now type in ls (and press enter), you should see your new directory. Now we're going to change into it, so type in cd testing. If you type in pwd again, you should see that you've changed directory. If you type in ls again, you should see that it is empty as we just created it.

Now what happens if we want to go back to our home directory? We can use cd .., which will take us up one level. Try that and then use the ls command again. You should see the folder that you made. If you use cd on its own, then it'll take you back to where you started. Try it and then use the pwd command. Now to get back to the directory we created for this module, use: cd testing. Note that any time you log out from the server (or lose connection/time out and get logged out), you will need to change back to the directory that you're working from. Use the pwd command once more to check that you're in the right place (/home/your-name/testing).
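As a quick recap, the navigation commands above can be strung together like this (practice_dir is just a throwaway name for illustration):

```shell
mkdir practice_dir   # make a new directory
cd practice_dir      # change into it
pwd                  # print the full path of where we are now
cd ..                # go back up one level
ls                   # practice_dir should appear in the listing
```

If you ever get lost, pwd will always tell you where you are, and cd on its own will always take you home.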

3. Use wget to download files

Now we're going to download some files to use. There are a lot of different data repositories available online, but we're going to use some data from the Human Microbiome Project. We used to be able to just download this from the online repository, but unfortunately it doesn't seem to be available anymore :(

We have a copy of a few of these files on our server, but this is where they are from.

Now we'll download the files. There are a few different commands that we can use for downloading files, so you might have seen others, or may see others in the future, but for now we'll be using a command called wget. If you just type in wget on its own then you should see some information about it. You'll see that it gives an error message, because we haven't also given it a URL, but it also tells us that we can run wget --help to see more options. Try running that. If you scroll back up to where you ran it, you'll see a huge list of options that you could give wget depending on what you want to do. We won't be using most of these options for now, but you can usually add --help to whatever command you are trying to run to get some more information about what information it is expecting from you (these are called "arguments").

We're going to download three files like so:

wget http://kronos.pharmacology.dal.ca/public_files/MH2/unix_tutorial/test_data/206534.fastq
wget http://kronos.pharmacology.dal.ca/public_files/MH2/unix_tutorial/test_data/206536.fastq
wget http://kronos.pharmacology.dal.ca/public_files/MH2/unix_tutorial/test_data/206538.fastq

You should see some progress bars come up, but these files aren't very big so they shouldn't take very long to download. Now use ls again to see the files. You should see:

206534.fastq  206536.fastq  206538.fastq

4. Move these files into a directory

When we downloaded these, we didn't make a directory to put them in, so let's do that now so that we can tidy them up a bit:

mkdir test_data

And then we can move them to this directory we've just made using the mv command. The mv command is expecting the name of the file that we want to move, and then the directory/path that we want to move this file to as arguments:

mv 206534.fastq test_data/

If you use the ls command again now, you will see that there are only two files (along with the test_data directory). You can also use the ls command on the test_data directory: ls test_data, and you should see the file that we moved into there.

Often, we might have a lot of files to move and we don't want to have to move them one by one. If we want to move multiple files of the same type, we can do that like this:

mv *.fastq test_data/

The asterisk (*) acts as a wildcard, and anything that ends in .fastq will be moved using this command. If you take a look in test_data again using the ls command, you should see all three files in there now, and not in your current directory. Another useful thing that you can do with the ls command is add another argument to get some information about the files: ls -lh test_data/. When you run this, you should see who has permission to read/write to the files, the owner and group of the files, the size of the files, and when they were last modified. The -l is what is telling this to give you the information, and adding the h converts this into "human" readable format. Try using it without the h - you'll see that the file sizes are in bytes instead of megabytes (MB/M).
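To see how the wildcard behaves without touching your real data, here's a small self-contained sketch with made-up file names (everything lives inside a wildcard_demo directory so it won't interfere with anything else):

```shell
mkdir -p wildcard_demo/source wildcard_demo/dest
touch wildcard_demo/source/a.fastq wildcard_demo/source/b.fastq wildcard_demo/source/notes.txt
mv wildcard_demo/source/*.fastq wildcard_demo/dest/   # only the files ending in .fastq are moved
ls wildcard_demo/dest/      # a.fastq  b.fastq
ls wildcard_demo/source/    # notes.txt is left behind
ls -lh wildcard_demo/dest/  # long, human-readable listing with permissions, sizes and dates
```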

5. Zip and unzip these files

You will often notice a .gz on the end of files, which indicates that they're zipped (or compressed). We often store files like this to save space, and some bioinformatics programs will be able to uncompress files within the program, but other times we need to uncompress (unzip) them first, so it's useful to be comfortable doing this.

These are quite small files so you might think it's unnecessary to zip/compress them, but often we have thousands of sequencing data files and they can each be hundreds of gigabytes (GB) or even terabytes (TB) in size, so it becomes quite important to keep them compressed until we need them.

We'll use the gzip command to zip them up. Try typing in gzip test_data/20 and then pressing the tab button. If you press it a couple of times, you should see a list of your options come up. The tab button can be really useful for completing file paths for you and saving you a lot of typing (or a lot of chances to make mistakes by mis-typing file names)! Continue typing and choose one of the files to zip (e.g. type in 8 and then press tab again). Now if you run ls -lh test_data again, you should see that the file you zipped now has the .gz on the end, and it's much smaller in size now.

We'll unzip the file again for now: gunzip test_data/20 - if you press tab to complete again, you should find that this will auto-fill with the file that you just zipped, because it's the only .gz file there, and that's the only file type that the command is able to work with.

What happens if you try to run this on a file that is already unzipped? Try gunzip test_data/206538.fastq - it should tell you that it's unable to run because the file has an unknown suffix (i.e. it doesn't have a .gz on the end).

Let's zip all of the files now: gzip test_data/* - see that we can use the asterisk (*) again as a wildcard and it will zip every file in the test_data directory. Take a look at the directory with ls or ls -lh if you like.

And do the same again by unzipping them:

gunzip test_data/*
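Here's the whole zip/unzip round trip as one self-contained sketch (the file and directory names are made up, so this won't interfere with your real data):

```shell
mkdir -p gzip_demo
printf 'ACGTACGT\n' > gzip_demo/example.txt
gzip gzip_demo/example.txt        # replaces the file with gzip_demo/example.txt.gz
ls gzip_demo/                     # only the .gz file is there now
gunzip gzip_demo/example.txt.gz   # back to the original, uncompressed file
ls gzip_demo/
```

Note that gzip works in place: the original file is replaced by the compressed one, rather than a copy being made.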

6. Create new tar archive of files

There are several different ways of compressing files - there is gzip/gunzip that we just used, but we can also package up multiple files inside a directory together. We'll be using the tar command for this, and as you can see if you run tar --help, there are lots of options available for this. Let's try it out with the test_data directory: tar -czvf test_data.tar.gz test_data/. Here we gave the command several arguments: -czvf (see below), test_data.tar.gz (the file name that we want our tar archive to have) and test_data/ (the name of the directory that we want to compress). -czvf is a short way of giving several arguments to the command:

  • c for "create" (creates a new archive)
  • z for "gZip" (this tells tar to write/read through gzip)
  • v stands for "verbose" (meaning that it will print out information about what it is doing)
  • f for "file" or "archive".

If you now run ls -lh, you should see that the tar archive (test_data.tar.gz) is a smaller size than the 3 files would be together (check by running ls -lh test_data). You can also take a look at what's in the tar archive with the less command: less test_data.tar.gz. Press q (for "quit") when you're ready to exit.
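One extra trick worth knowing: tar can also list an archive's contents without extracting anything, by swapping the c (create) for t (list). A minimal sketch with a throwaway directory:

```shell
mkdir -p tar_demo
echo "hello" > tar_demo/file1.txt
tar -czf tar_demo.tar.gz tar_demo/   # create the archive (no v this time, so it runs quietly)
tar -tzf tar_demo.tar.gz             # t lists the contents without extracting them
```

This is handy for checking what's inside an archive someone has sent you before you unpack it.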

Usually we'll make a tar archive because we want to keep our files but save some space, so let's delete the original folder: rm -r test_data/. Hopefully by now you're getting the hang of how these commands work. The rm command is for removing files - you should always be really careful when using it because it won't ask you if you're sure like your regular computer would, and most servers don't have a "recycle bin" for the files to be sent to, so if you remove them, they're gone for good. The -r argument (r for "recursive") is for removing a directory and everything inside it, rather than just a file.

7. Unzip tar archive

If we need to use the data that we zipped into the tar archive again, we'll need to unzip - or extract - it.

To unzip the tar archive, we can do that like so: tar -xvf test_data.tar.gz. Note that we just replaced the cz with x for "extract". You should be able to see the files back in test_data with the ls command.

8. Look at fasta and fastq files with less

Now we're going to take a look at these files. Let's look at 206538 first: less test_data/206538.fastq. You can scroll through the file, and remember to press q when you want to stop looking at the file. If you want to look at it again, press the up arrow key to run the less command again. You can always press the up arrow to go back through the commands that you've run previously. If you've typed something wrong and want to start again, press ctrl+c to get back to a blank command prompt.
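If you just want a quick peek at a file rather than scrolling through it interactively, the head and tail commands print the first or last lines (the file here is made up for the example):

```shell
printf 'line1\nline2\nline3\nline4\nline5\n' > peek_demo.txt
head -n 2 peek_demo.txt   # prints the first two lines
tail -n 2 peek_demo.txt   # prints the last two lines
```

These are especially useful for huge sequencing files, where you usually only need to see a few records to check the format.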

You should have noticed that this file had the .fastq extension. Now we're going to download the same files in the fasta format:

wget http://kronos.pharmacology.dal.ca/public_files/MH2/unix_tutorial/test_data_fasta.tar.gz

As we can see that this is a tar archive by the .tar.gz extension/suffix, we'll go ahead and extract it:

tar -xvf test_data_fasta.tar.gz

Take a look at the same file in fasta format: less test_data_fasta/206538.fasta. You can also download these files by opening a new Terminal window. Navigate to a folder on your laptop where you can download these files to - if you don't want to or don't know how to type in the whole folder path, you can go to Finder and right-click/two-finger click on the folder, hold the option key and then click Copy "folder-name" as Pathname. In the new Terminal window, you can then type in cd and paste in the copied pathname. Then, to download the whole folder, you can run:

scp -r your-name@kronos.pharmacology.dal.ca:/home/your-name/testing/test_data_fasta/ .
scp -r your-name@kronos.pharmacology.dal.ca:/home/your-name/testing/test_data/ .

Make sure that you replace the your-name with your name on the server in both places! If you named any of your folders differently than I did above, you'll need to change these here, too.

Now you can open the files with a text editor like TextEdit. You should see that the fastq file has 4 lines for each sequence, while the fasta only has two. The fasta has a first line that starts with ">" that contains the sequence name and description, and the second line contains the actual DNA sequence (DNA in this case, but this file format could also contain RNA or protein/amino acid sequences). fastq files start the same, with the first line containing the sequence name and description (but starting with an "@" symbol instead), and the second containing the sequence. The third line then contains a "+" character, and the fourth contains quality information about each base of the sequence (and should contain the same number of characters as the sequence). You can read more about the quality information that the symbols encode here.
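To make the difference concrete, here's a sketch that writes one hypothetical record in each format (the sequence name, sequence and quality string are all invented) and counts the lines:

```shell
# One record in fasta format: 2 lines (a header starting with ">", then the sequence)
cat > format_demo.fasta <<'EOF'
>seq1 hypothetical example read
ACGTACGTACGT
EOF
# The same record in fastq format: 4 lines (header starting with "@", sequence, "+", quality)
cat > format_demo.fastq <<'EOF'
@seq1 hypothetical example read
ACGTACGTACGT
+
IIIIIIIIIIII
EOF
wc -l format_demo.fasta format_demo.fastq   # 2 lines vs 4 lines per record
```

Notice that the quality string is exactly as long as the sequence - one character of quality information per base.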

Back in the window that's logged into Kronos, to count the number of lines in a file, we can use the less command again, this time piping its output into another command, wc (the -l argument tells wc to count lines): less test_data_fasta/206538.fasta | wc -l and less test_data/206538.fastq | wc -l

You should see that the fastq file contains twice as many lines as the fasta file. There are also other ways to count the number of sequences in a file, and these can be adapted for other purposes, too. E.g.: grep -c ">" test_data_fasta/206538.fasta - the grep command finds every line containing a phrase (or "string", as it's usually called in programming) and the -c argument tells it to count those lines. What happens if you don't use the -c argument? Why do you think this happened?
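You can see the difference that -c makes with a tiny made-up fasta file:

```shell
cat > grep_demo.fasta <<'EOF'
>seqA
ACGT
>seqB
GGCC
EOF
grep -c ">" grep_demo.fasta   # with -c: prints just the count of matching lines (2)
grep ">" grep_demo.fasta      # without -c: prints the matching lines themselves
```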

In most programming languages, you have "positional" arguments and "named" arguments. Positional arguments need to be included in the proper position, or order; the order of positional arguments is defined within the program. Named or keyword arguments are passed to the program after their name is given (or, for short flags like -c, just by including the flag itself). In the case above, -c is a named argument, while the pattern ">" and the file name test_data_fasta/206538.fasta are positional arguments.

Question 1: What happens if you try to do the same thing with "@" for the fastq file? Why is this?

These questions are just here to make sure that you're thinking about what you're doing and not just running through on auto-pilot! There are some answers at the bottom of the page.

9. Installing programs to the server

Now we're going to learn how to install programs to the server. A lot of the commands we have just used (like grep and less) are standard ones that will be installed on most servers, but frequently we will need to install other programs (usually called "packages"), too. The packages that we use are often not as stable as those that we download and use on our laptops (like Microsoft Word or Adobe Acrobat Reader) and so they sometimes depend on a particular version of another package. It frequently takes more time to install packages than it does to run them, and any bioinformatician will tell you how frustrating it can be. Anaconda can help to manage this, although it doesn't overcome these problems entirely!

You can install Anaconda like this (following the instructions here):

curl -O https://repo.anaconda.com/archive/Anaconda3-2025.12-2-Linux-x86_64.sh
bash Anaconda3-2025.12-2-Linux-x86_64.sh
#hold down enter key until you get to the end of the agreement, or press q
#type yes
#confirm location by pressing enter
#yes
#now close and reopen the window, or exit and log back into Kronos - you'll need to log back in the same way as you did before!

10. Conda environments

Anaconda, or conda, allows us to have separate "environments" for installing packages/programs into. This means that if one package requires version 3.1 of a second package, but a third program requires version 2.9 of the second package, they won't interfere with each other. Often when we're installing new packages or starting a new project, we'll make a new environment. This also helps us to keep track of which versions of a package we've used for a specific project. The environment is essentially a directory that contains a collection of packages that you've installed, so that other packages know where to access the package that they need to use. We're going to make a new environment to install some packages into:

conda create -n quality_control

You'll see that here we're using the conda command first, and then giving it the create and -n quality_control arguments. We could call this environment anything we like, but it's best to make this descriptive of what it is so that when we collaborate with others or share our code, it'll be obvious what this environment is for.

You'll need to press y at some point, to confirm that you want to install new packages.

Now we can "activate" this environment like this: conda activate quality_control. Any time you are logged out and you log back in, you'll need to reactivate the environment if you want to be working from it. If you want to see the other environments that are installed and available, you can run conda info --envs. You'll only see this one and the base environment, but you can also work from environments that other users have made, and that's what we'll be doing. These packages can take quite a long time to install.

11. Install fastqc and multiqc

Now we'll install the packages that we want to use. Usually if there's a package that you're interested in, for example we'll be using one called "fastqc", you can just google "conda install fastqc" and you should see an anaconda.org page as one of the top hits, telling you how to install it. Sometimes you'll also see bioconda documentation, or a "package recipe" and this might give more details if you're struggling to install it. We'll install fastqc like this:

conda install bioconda::fastqc

You'll need to confirm that you want to install things with y at some point. If you forgot to activate the environment (see above), then you'll get an error that you don't have permissions to do this!

You can test to see whether it got installed by typing which fastqc - this should show you the location that it's installed in.

Now we'll install the second package that we need:

conda install bioconda::multiqc

Confirm this again with y

Note that if this doesn't work, or if it's taking a long time, you can try instead running:

pip install multiqc

If you need to stop the command that you're running, you can press ctrl+c.

As you might have guessed from the "qc" in both of these names, we'll be using them for Quality Control of the sequence data.

12. Perform quality control on fastq files

First we'll be running fastqc, and to do that, we'll first make a directory for the output to go into: mkdir fastqc_out

Now we'll run fastqc:

fastqc -t 4 test_data/*.fastq -o fastqc_out

Here the arguments that we're giving fastqc are:

  • -t 4: the number of threads to use. Sometimes "threads" will be shown as --threads, --cpus, --processors, --nproc, or similar. Basically, developers of packages can call things whatever they like, but you can use the help documentation to see what options are available. See below (htop) for how we find out about how many we have available.
  • test_data/*.fastq: the fastq files that we want to check the quality of.
  • -o fastqc_out: the folder to save the output to.

13. htop - looking at the number of processes we have available or running

Try running htop. This is an interactive viewer that shows you the processes that are running on your computer/server. This should look something like this:

There are a lot of different bits of information that this is showing us - you can see all of that here, but the key things for us are:

  • The CPUs (labelled 0-48 at the top of the screenshot) - this shows the percentage of the CPU being used for each core, and the number of cores shown here is the number of different processes/threads that we have available to us.
  • Memory - this is the amount of memory, or RAM, that we have available to us. You'll see that it is 252GB - this is much more than most laptops - most servers that you'll use or have access to for bioinformatics analysis will have much more than a standard computer. Our other lab server has even more capacity - ~1.5 TB RAM. The larger your dataset, or the deeper your sequencing depth, the more RAM you are likely to need.

If you're reading this outside of a tutorial/workshop that someone from the Langille lab is running, then we're trying to put together some information on the RAM/resources required for various different projects. Check the main Microbiome Helper 2 page for more information.

  • The processes (at the bottom) - you can see everything that is running under a PID (Process ID). This is useful when you're using a shared server to see who is running what, particularly for when you're wanting to run something that will use a lot of memory or will take a long time and you want to check that it won't bother anyone else.

When you're done looking at this, press F10 (on a Mac this may be fn+F10) to exit from this screen.
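If you just need the numbers rather than the interactive view, a couple of standard commands report the same information non-interactively (these should work on most Linux servers):

```shell
nproc                        # the number of CPU threads available to you
grep MemTotal /proc/meminfo  # the total RAM, as the kernel reports it
```

This is handy for deciding what to pass to arguments like fastqc's -t without opening htop at all.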

14. Back to the quality control

Now take a look at one of the .html files in fastqc_out/ (note that you'll need to download it as you did above using the scp command).

It can be really useful when you're trying to modify commands that you've previously run to have saved them somewhere that you can edit them. If you try using something like Microsoft Word or TextEdit then you'll find that this can be really irritating - it auto-corrects things and capitalises words that you don't want to capitalise. It will also do things like correcting a double hyphen (--) to a single long dash. I personally like to use RStudio and R notebooks for keeping track of the code that I run, and use them like I would a lab notebook. You can see some more information on them here, but essentially these allow you to have multiple "chunks" of code, which will specify which coding language they are written in (e.g., bash for Linux command line, R, or Python), and you can then make notes around these. If you are in a tutorial, we'll likely be providing you with these for subsequent analyses, and if not, we'll hopefully be providing some outlines/templates at some point in the other pages that you'll be able to use.

For now, back to our analysis...

Next we'll run multiqc. The name suggests it might be performing QC on multiple files, but it's actually for combining together the output from multiple files, so we can run it like this:

multiqc fastqc_out --filename multiqc.html

So we've given as arguments:

  • fastqc_out: the folder that contains the fastqc output.
  • --filename multiqc.html: the file name to save the output as.

Now look at multiqc.html (copy it across to your laptop and open it).

There are some questions here to help you look at the files and interpret these:

  • Question 2: What is the GC% of the samples?
  • Question 3: What % of the samples are duplicate reads? Is this what you expected?
  • Question 4: Now look at the Sequence Counts section. Which sample has the most reads?
  • Question 5: How many unique and duplicate reads are in sample 206536?
  • Question 6: Look at the Sequence Quality Histograms. Do these seem good to you? Why or why not? Does this seem normal?
  • Question 7: Look at the top overrepresented sequence. If you want to see what it is, paste it into the "Enter accession number(s), gi(s), or FASTA sequence(s)" box here and click on the blue "BLAST" button at the bottom of the page.

15. Final notes - using tmux

The final thing that I want to mention in this overview of using a Linux server for analyses is a program called tmux.

Lots of the steps in bioinformatics analyses will take minutes, hours, or even days or weeks to run. It's not realistic to stay logged into a server for this long (connected to the internet with our terminal window open!), but luckily there are several tools that are pre-installed on most Linux systems that we can use to make sure that our program carries on running even if we get disconnected from the server.

One of the most frequently used ones (and the one that I use) is called tmux. To activate it, just type in tmux and press enter. It should take a second to start up, and then load up with a similar looking command prompt to previously, but with a coloured bar at the bottom of the screen.

To get out of this window again (to detach), press ctrl and b at the same time, and then press d. You should see your original command prompt and something like:

[detached (from session 0)]

We can actually use tmux to have multiple sessions, so to see a list of the active sessions, use:

tmux ls

We can rename the tmux session that we just created with this:

tmux rename-session -t 0 analysis

Note that we know it was session 0 because it said that we detached from session 0 when we exited it.

If we want to re-enter this window, we use: tmux attach-session -t analysis

We can also use the simpler tmux a to simply reattach to the last session that we were attached to.

There are various other useful commands that you could use with tmux, and you can see lots of those here. The only other thing you need to start with is that you won't be able to just scroll up and down with your mouse anymore - instead, you'll need to hold down ctrl+b, and then press the left square bracket [. Once you're done scrolling, press q to get back to the command prompt.

Answers

Question 1: What happens if you try to do the same thing with "@" for the fastq file? Why is this?
The count of lines containing "@" in the fastq file is much higher than the number of sequences. This is because the "@" symbol is also used in the quality information, so it can appear on the quality lines too. We can get round this by using part of the sample name, e.g., grep -c "@206534" test_data/206534.fastq.

Question 2: What is the GC% of the samples?
51%

Question 3: What % of the samples are duplicate reads? Is this what you expected?
In the "General Statistics" section, we can see that ~97% of the reads are duplicated. Looking in the "Sequence Counts" section and hovering over each sample will show us how many of the reads are unique. This makes sense, because the reads are from PCR-amplified samples so we are expecting most to occur more than once.

Question 4: Now look at the Sequence Counts section. Which sample has the most reads?
206534.

Question 5: How many unique and duplicate reads are in sample 206536?
752 and 29,883.

Question 6: Look at the Sequence Quality Histograms. Do these seem good to you? Why or why not? Does this seem normal?
The sequence quality here is all really high! These are all good sequences, but this isn't normal. This is because the samples that are available for download from the HMP website have already been quality filtered.

Question 7: Look at the top overrepresented sequence. If you want to see what it is, paste it into the "Enter accession number(s), gi(s), or FASTA sequence(s)" box here and click on the blue "BLAST" button at the bottom of the page.
All of the top hits are Bacteroides (finegoldii, caccae, stercoris, etc). These samples are from HMP2 IBD gut samples, so this seems normal!