Section 3: Behind the Scenes - Green-Biome-Institute/AWS GitHub Wiki

Back to Section 2: The Basics

Go back to tutorial overview

Now that we understand what the shell is (the program that interacts with the computer), and how we can interact with the shell by entering commands into the command line interface, there is a bit of information we need to go over. We will do our best to not dive too deep into the weeds on these topics, but instead only show you what you need to know and give you the awareness of further topics if you wish to learn more. So let’s just get through this bit of information and we will end module 1 with something more fun!

Learning Points for Section 3 - Behind the Scenes

From this section you should take away that:

  • Most of the files you will interact with are either:
    • Directories, which are like folders
    • Text and data files, which hold information
    • Executables, which tell the CLI to do something (like a genome assembly program!)
  • You can find out more about most commands by putting the manual command man in front of the command name. For example, if you want to know more about the list command ls, just type man ls!
  • There are three types of permissions for files: read, write, and execute. As a regular user, you will not have access to read, write, or execute all the files.
  • If you need to do something that the CLI is not allowing you to do, like install software, you can gain administrator permissions (also called privileges, or “root user” or “superuser” privileges) by using the sudo command.
  • Environment variables are containers that exist behind the scenes and hold information regarding the shell (like PWD holds the current working directory and PATH holds the paths to all of the directories that your computer looks through to find the commands you want it to run!)

Different types of files

For our purposes, it is important to know that there are, in general, several different types of files you will use.

  1. We have already discussed “directories”. Directories are organizational files, like folders, used for creating hierarchical clusters of other files. We can use these to organize our data and programs.
  2. We have text and data files. These can come in a variety of different formats, such as .csv, .xls, .txt, .fastq, .fasta, etc. They contain information with instructions, manuals, sequencing data, and other data.
  3. We have “executables”. These are similar to text and data files, except that they hold instructions for the computer to “do something”. This could be counting to 10, as we saw in example 1. It could also be a program that performs a genome assembly, counts k-mers, or does some other bioinformatics task. Executables sometimes run other executables, read information from text and data files, or generate files of their own (such as a log recording whether the program succeeded or, in the case of a genome assembler, a data file with the resulting genome assembly!)
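To make the three kinds of files a little more concrete, here is a minimal sketch that asks the shell which kind a given path is. The function name kind_of is made up for this example, the [ -d ... ], [ -f ... ], and [ -x ... ] tests are not covered in this section, and the example paths assume a typical Ubuntu instance like the one used in this tutorial. The lines are shown without the $ prompt so they can be pasted as they are.

```shell
# Report which of the three kinds of file described above a path is.
# kind_of is a hypothetical helper name used only for this example.
kind_of() {
    if [ -d "$1" ]; then
        # -d is true when the path is a directory
        echo "directory"
    elif [ -f "$1" ] && [ -x "$1" ]; then
        # -f is true for a regular file, -x when we are allowed to execute it
        echo "executable"
    elif [ -f "$1" ]; then
        echo "text or data file"
    else
        echo "something else"
    fi
}

kind_of /usr/bin          # a directory that holds many programs
kind_of /etc/passwd       # a plain text file listing user accounts
kind_of /usr/bin/whoami   # the program behind the whoami command (assumed location)
```

Directories are checked first because a directory can also carry the "execute" permission, which would otherwise make it look like a program.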

Manuals and help pages

Next, one of the most important commands you need to know while using the CLI on Linux is man, which is short for “manual”. Nearly every command you will use has been documented thoroughly and has a manual page dedicated to providing you more information about it. You can navigate these manual pages using the up and down arrows or by scrolling up and down with your trackpad. To exit out of the manual page, just type q for “quit” and the CLI will go back to the command prompt. Let’s try this with ls:

$ man ls

Command options

This will take you to the manual page for ls. As you can see from scrolling down, it is quite long. It provides a description of the command and then lists what are called “options,” “flags,” or “switches” (they all mean the same thing), which are our next topic. Options follow a command and modify how it works. Let’s look at two options from the manual of ls: -a and -l.

  • ls -a or ls --all (these do the same thing): do not ignore entries starting with .
  • ls -l: use a long listing format

Let’s see what these mean.

Press q to exit the manual. If you pressed another button and it doesn't exit when you press q, don’t worry, just press q a second time and it should exit out.

First, let’s use ls alone:

$ ls

You can see that it lists the contents of your directory as it did before. Now let’s use -a:

$ ls -a

All of a sudden, there are a ton more files being listed! This is because files whose names begin with a dot are normally hidden from view. By using the -a option, you tell the CLI to show you all of the files within the directory, even if they are hidden! Next let’s use the other option, -l:

$ ls -l

We are once again looking at only the unhidden files because we didn’t use the -a option. However, this time the files are listed 1 per line with a bunch of information to the left. This includes the permissions of the file, its size, the date it was last modified, etc. You don’t need to memorize this stuff; I just want you to see the amount of information and the variety of things any one command (like ls) can do! One last note on options: what do you think happens when you put them together? Let’s try both -a and -l together by doing ls -al:

$ ls -al

Now, all of the hidden files are once again shown, but they are also listed 1 per line with all the extra information provided by -l.
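If you want to see the effect of these options on a small, predictable directory, the sketch below builds a throwaway one with a single visible file and a single hidden file. The file names are made up for this example, and mktemp, touch, and rm are commands this section has not covered (they create a temporary directory, create empty files, and remove files).

```shell
# Make a temporary directory and put one visible and one hidden file in it.
demo=$(mktemp -d)
touch "$demo/notes.txt" "$demo/.hidden_settings"

# Capture what each form of ls prints for that directory.
plain=$(ls "$demo")                    # visible files only
with_all=$(ls -a "$demo")              # also shows names starting with a dot
long_all=$(ls -al "$demo")             # hidden files, one per line, with details
long_all_separate=$(ls -a -l "$demo")  # the same two options written separately

echo "ls:";     echo "$plain"
echo "ls -a:";  echo "$with_all"
echo "ls -al:"; echo "$long_all"

# Clean up the temporary directory.
rm -r "$demo"
```

Writing -al is simply shorthand for giving -a and -l one after the other, which is why the last two captured listings come out the same.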

The major takeaways here are:

  1. If you want more information about a command, look at its manual by using man [command]
  2. Most commands have options to use them in different ways! This will be important when we start using the genome assemblers and other more complex programs.

Read, Write, and Execute Permissions

Next, there is one thing you will inevitably run into while using the CLI: Permissions. Linux operating systems, like the one you are currently using, are capable of being used by multiple users at the same time. This is great in the context of having multiple people log into and use our EC2 instances or to use them from different computers.

However, once you have multiple people working on the same system, you run the risk of overwriting or damaging work done by other people. For this reason, regular users (remember, that is signified by the $ sign on the command prompt!) do not have access to some files and executables. Another way of saying this is that they do not have “permission.”

When we used the command ls -l earlier, the leftmost column of information for each file looked something like this: -rw-r--r-- or drwxr-xr-x.

These 10 characters represent 2 things:

  1. Whether the file is a regular file or a directory: the first character is usually either a “-” for a regular file or a “d” for a directory
  2. The permissions that users have to access that file or directory: characters 2-4 represent the permissions for the file owner, characters 5-7 are for the group owner of the file, and characters 8-10 are for everybody else.

r stands for “read”

w stands for “write”

x stands for “execute”

You don’t need to memorize this information, but it is important to be aware of it.
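As a small illustration of how those 10 characters break apart, the sketch below splits the two example strings from above into their pieces. The function name describe_perms is made up for this example, and cut -c (which picks characters out by position) is a command not covered in this section.

```shell
# Split a 10-character permission string, as printed by ls -l, into its parts.
# describe_perms is a hypothetical helper name used only for this example.
describe_perms() {
    perms=$1
    type_char=$(printf '%s' "$perms" | cut -c1)   # character 1: file type
    owner=$(printf '%s' "$perms" | cut -c2-4)     # characters 2-4: file owner
    group=$(printf '%s' "$perms" | cut -c5-7)     # characters 5-7: group owner
    others=$(printf '%s' "$perms" | cut -c8-10)   # characters 8-10: everybody else

    case $type_char in
        d) kind="directory" ;;
        -) kind="regular file" ;;
        *) kind="other file type" ;;
    esac

    echo "$kind: owner=$owner group=$group others=$others"
}

describe_perms "-rw-r--r--"   # prints: regular file: owner=rw- group=r-- others=r--
describe_perms "drwxr-xr-x"   # prints: directory: owner=rwx group=r-x others=r-x
```

Reading the first result: the owner can read and write the file, while the group and everybody else can only read it.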

Root user (superuser, administrator privileges, and the command sudo)

What is important, however, is the concept of the "root user" or “superuser”. We mentioned before that the symbol $ stands for "regular user". Some commands cannot be executed by the regular user!

There is another type of user which is known as an “administrative user”, “super user”, or “root user”. This level of user has the highest level of permissions on the computer, meaning it can read, write (or overwrite!), and execute any file on the computer.

Note on the dangers of the root user!

It’s important to make a note on this before going further. You should only use root user access when you need it. This is to prevent you from accidentally deleting or overwriting important files that you or the computer needs! The main reason you will use administrator privileges (permissions) is to install things that the regular user can’t install.

Let’s show this with a couple examples. First, let’s go back to the first command we used in this training module, whoami. If you enter this now:

$ whoami

It will return ubuntu like it did before.

Now we’ll try it with the command sudo in front, which stands for “superuser do”:

$ sudo whoami

Now, instead of reporting back the username ubuntu, which is the regular user you normally use, it says root! As mentioned previously, the “root” user is a high level user that has administrative privileges. So in order to do things on a Linux system with higher level permissions, you use the command “sudo” beforehand.

Let’s look at another example. As you’ve probably experienced, it’s common for your personal computer to need “updates”. This is also true for the software installed on Linux systems. To try updating your EC2 instance, use the following command:

$ apt update

You can see that this gives an error. Parts of this error read “Permission denied” and “are you root?” When you are working on the CLI, you will most likely see these errors occasionally. These are hints that you are likely trying to do something (like read a text file, execute a program, or create a new file in a private directory) that requires higher level permissions than the regular user has.

In order to overcome this lack of permissions, we will again use the sudo command:

$ sudo apt update
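If you are ever unsure whether you are currently the regular user or the root user, the sketch below is one way to check without changing anything on the system. It relies on the fact that the root user always has the numeric user ID 0, which the command id -u prints (id is not covered in this section, and the variable name needs_sudo is made up for this example).

```shell
# id -u prints the numeric ID of the current user; the root user is always 0.
if [ "$(id -u)" -eq 0 ]; then
    needs_sudo="no"
    echo "You are the root user, so commands already run with full permissions."
else
    needs_sudo="yes"
    echo "You are a regular user; system-wide changes such as 'apt update' need sudo in front."
fi
```

The check only prints a message. It never runs sudo or apt itself, so it is safe to try.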

Environment Variables

The last topic to mention in this section is called an “Environment Variable”. Environment variables are pieces of information that are created when you log into the command line interface or move around the directories within it. This information is used by the CLI to find the programs that it needs to run (like the list command ls or the whoami command we just used!). In order to not get overwhelmed, we will only look at two of these in order to get a basic understanding of the idea.

First, let’s list these environment variables using the following command:

$ env

You will see a whole bunch of information pop up. On the left of every new line is a word in all capital letters. These are the environment variables that have been created and stored in your CLI session.
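The full output of env can be a lot to take in. As a rough sketch, the lines below trim it down, first to just the variable names and then to just the two variables this section looks at. The commands cut, sort, and grep, and the | symbol that passes the output of one command to the next, are not covered in this section, and variable_names is a made-up name.

```shell
# Keep only the part before the first "=" on each line: the variable names.
variable_names=$(env | cut -d= -f1 | sort)
echo "$variable_names"

# Show only the lines for PWD and PATH, together with their values.
env | grep -E '^(PWD|PATH)='
```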

Let’s start by looking for the variable named PWD. It should be around the 6th variable down from the top, although its exact position can vary (you can scroll up and down with your mouse or trackpad). Remember how we used the lowercase command pwd to display our current working directory? The environment variable PWD is a virtual container that stores the current working directory. This is all a variable is: a storage location for information. This is what differentiates it from the lowercase pwd: the lowercase pwd is a command that tells the computer to do something, whereas the uppercase PWD simply stores information. Let’s see this in action. First, we’ll use the command pwd, like we’ve done before:

$ pwd

As it did the first time we used it, it tells us where our current working directory is. Now let’s try to use the environment variable, PWD, the same way:

$ PWD

Hmm, it doesn’t work. Why is that?

It’s because, like we just talked about, the environment variable PWD is not a command; it is a piece of information. This is like the difference between a text or data file and an executable. Lowercase pwd does something and uppercase PWD stores something. Let’s use one of the other first commands we learned to find out what PWD is storing:

$ echo $PWD

Now, when we enter this, it gives us the same information that the lowercase command pwd gave us. The reason we put a $ before PWD is that it tells the computer that the word that comes after the dollar sign is a variable, and not just text. (This $ is typed as part of the command; it is not the $ of the command prompt.) For example, enter:

$ echo PWD

It just returns the word PWD, just like it returned the word “hello” when we used the command echo hello. So in order to tell the computer you are using a variable, you use $ before the variable name with no space between.
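The same behaviour can be seen side by side in the short sketch below. It saves what the command pwd prints and what the variable PWD holds so the two can be compared, and then repeats the $ experiment with a variable of our own. The names from_command, from_variable, and greeting are made up for this example.

```shell
from_command=$(pwd)    # run the command pwd and keep what it prints
from_variable=$PWD     # read the value stored in the variable PWD

echo "command pwd printed: $from_command"
echo "variable PWD holds:  $from_variable"

# A variable we create ourselves behaves the same way.
greeting="hello"
echo greeting     # no $, so this prints the literal word: greeting
echo $greeting    # with $, this prints the stored value: hello
```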

The PATH is this way

With this understood, we can move on to one of the most important environment variables to be aware of: PATH. The PATH environment variable stores the locations of the directories where your command line interface looks for programs to use. All of the software that we will be using on the CLI will be stored in one of the directories found on our PATH. So let’s look at it with the following command:

$ echo $PATH

This is a long piece of text, don’t worry if it looks confusing. If you look at it closely, it is a series of directories and their subdirectories, each separated by a colon.
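If the long line is hard to read, one option is to print each directory on its own line. This sketch uses tr to turn every colon into a line break and wc -l to count the resulting lines. Neither command is covered in this section, and path_entries is a made-up variable name.

```shell
# Replace every ":" with a line break so each PATH directory gets its own line.
echo "$PATH" | tr ':' '\n'

# Count how many directories are listed on the PATH.
path_entries=$(echo "$PATH" | tr ':' '\n' | wc -l)
echo "Number of directories on PATH: $path_entries"
```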

Do you remember how our home directory was /home/ubuntu? This is an example of a path to your home directory. It is a series of directories that lead to a specific location. The path /home/ubuntu has 2 directories in it.

If we look here at the text that the CLI gave us, we can see a bunch of different “paths”, all leading to different locations. Each one of these locations is where programs and files important for the computer are stored. You can think of these locations like the “Applications” folder on your computer, which has your applications like Microsoft Word or Excel stored in it.

For example, we’ve been using the command whoami. Let’s find out where it is stored!

Within this block of text we just got by using $ echo $PATH, we can see that one of the paths is /usr/bin. Let’s change to that directory by using the cd command and then listing (ls) the contents of it:

$ cd /usr/bin

Now, we can see from the command prompt that we have moved to a different directory. Next use:

/usr/bin$ ls

You’ll see that there is a ton of stuff in this directory! This is where lots of the important programs are stored on your Linux system. They are organized alphabetically, so let's scroll up to the “w”s. You should find whoami near the end of the “w” entries (on the instance used for this tutorial it is the 8th to last). So when you enter the command whoami, your computer searches for that command in all the directories stored in your PATH. When it finds a program that matches what you entered, it executes it! This is important to know because if you try to use a piece of software and it is not installed in one of these directories, the CLI will not be able to find it!
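The search described above can be imitated by hand. The following is only a rough sketch of the idea, not how the shell is actually implemented: it walks through the directories on PATH in order and stops at the first one that contains an executable file with the requested name. The function name find_in_path and the program name no-such-program-xyz are made up for this example.

```shell
# Look for a command by trying each PATH directory in order.
# find_in_path is a hypothetical helper name used only for this example.
find_in_path() {
    target=$1
    old_ifs=$IFS
    IFS=:                      # split PATH at every colon
    for dir in $PATH; do
        if [ -f "$dir/$target" ] && [ -x "$dir/$target" ]; then
            IFS=$old_ifs
            echo "$dir/$target"   # first match wins
            return 0
        fi
    done
    IFS=$old_ifs
    return 1                   # not found in any PATH directory
}

find_in_path whoami || echo "whoami was not found in any PATH directory"
find_in_path no-such-program-xyz || echo "no-such-program-xyz was not found in any PATH directory"
```

In everyday use you do not need a function like this. The shell's own command -v whoami answers the same question, and the sketch is only meant to show what "searching the PATH" involves.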

Okay. I know that was a lot of information, so let’s just do a brief recap:

Review Questions:

What are directories?

  • They are like folders, and they store other files or more directories.

What is a subdirectory vs a parent directory?

  • A subdirectory is inside of a parent directory.

What are text and data files used for?

  • Text and data files hold information! They can have a variety of different file types, like Excel files (.xls), comma-separated values (.csv), or the text-based format for storing both a biological sequence and its corresponding quality scores (.fastq), etc.

What is an executable?

  • An executable is a file whose contents hold instructions for the computer to follow. It can be thought of as a "program".

How can you find out more information about a given command?

  • Use the manual! To read the manual, use the command man [command-name], example for ls: man ls. You can also sometimes use the option --help after the command, ex: [command] --help

What is a command option?

  • A command option is a feature for customizing a command. Options let a command be run in different ways (like using ls to list in different ways), and they are sometimes used to tell a program which data files to read.

What command can you use to gain administrator privileges on the CLI session?

  • The command sudo is used for gaining administrator privileges, but is dangerous to use because you can potentially overwrite important files!

What is a variable?

  • A variable is a storage container for information.

What is the path of a directory?

  • A directory's path is the series of directories that lead the computer to that directory. For example, in dir1/dir2/dir3/dir4/file.txt, the path to the file file.txt is dir1/dir2/dir3/dir4/.

What command is useful for when we are a bit lost and don't know where we are on the computer?

  • The command pwd is useful for finding out what our current working directory is.

Move on to Section 4: Making, Moving, Copying, Editing, and Removing Files!

Go back to tutorial overview