Section 8: The Basics of Bash Scripting - Green-Biome-Institute/AWS GitHub Wiki

Back to Section 7: Software Dependencies & More Complex Software Manuals

Go back to tutorial overview

Learning Points for Section 8: Basics of Bash Scripting

Things you should take away from this section:

  1. An executable is also known as a script or program and is a set of instructions for the computer to follow.
  2. Scripts / programs are helpful for running an analysis that must be run multiple times, each time with a different input, or for interacting with a bunch of different files in a way that can be pre-set.
  3. A variable is a container of information that can change what it is storing.
  4. A for loop tells the computer to continue following the instructions within that loop until a certain condition is met.
  5. To run a script, you must first give it execute permission using the chmod command.

At this point, you’re ready to go into the world of the command line on your own and do many of the things that you need to do for your own experiments and analysis. There is just one more realm of the command line that it is important to be aware of.

Introduction

At the beginning of this tutorial we used two example programs. One of them counted from 1 to 10. The other showed a progress bar from 0% to 100% “completion” for 5 fake genome assemblies. But what if those genome assemblies were real? How much time could that save you? How could that improve the documentation and consistency of your analysis? It turns out, quite a bit!

Before we jump into this, I want to make it clear that you do not need to learn how to do this. We are not going to go deep into this. However, if you are at all interested in furthering your understanding and ability to do data analysis, it is good to be exposed to these topics!

Bash

Now, let’s quickly refresh: what is bash?

Bash is the program that is running in your command line window. It takes the commands you type, looks for the instructions to run those commands (in the directories listed in your PATH), and then executes them. This is why this section is called “Bash Scripting” and not “Linux Scripting” or “Command Line Scripting”. Linux is the operating system of your computer (like macOS or Windows), and the Command Line Interface is what allows you to interact with Bash. But Bash is what actually receives your commands and does things with them.
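To see this lookup for yourself, the short sketch below asks Bash where it finds the cat command. It assumes cat is installed, as it is on a typical Linux system; the variable name cat_location is made up for this example, and the exact paths printed will differ from machine to machine.

```shell
#!/bin/bash

# PATH is a colon-separated list of directories that Bash searches for commands.
echo "$PATH"

# Ask Bash where it finds the instructions for the cat command.
# cat_location is a made-up variable name used only in this example.
cat_location=$(command -v cat)
echo "cat is found at: $cat_location"

# Some commands, such as echo, are built into Bash itself rather than
# stored as separate files in a PATH directory.
type echo
```

If the path printed for cat sits inside one of the directories listed in PATH, that is the lookup described above at work.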

Bash Scripts

One of the most powerful features of Bash is that it doesn't have to take one command at a time as you type it. It can also follow what is known as a script. A script is a file that can simply be thought of as a series of instructions. Let’s look at a really simple one. Navigate into the ex6-dir directory and read out the file multi-echo.sh using cat. This is the output:

#!/bin/bash

echo 1
echo 2
echo 3

Just by looking at this, we should know exactly what it does. It tells Bash to use the “echo” command three times in a row, first saying the number “1”, then “2”, and then “3”. To confirm, try it out:

$ ./multi-echo.sh

The output is, as we thought:

1
2
3

So what happened here? We executed a script called multi-echo.sh. That script had 4 lines of code (we’ll ignore the first line for a moment). Bash went through each line, starting from the top, and followed each instruction that it was given. That’s all. Now I know what you’re thinking: this is kind of useless, right? You would have to write each command out anyway in order to write this script. Not so fast.

What if we wanted Bash to echo every number from 1 to 100,000? Instead of writing 100,000 echo commands, we can use certain commands in these scripts to do the work for us. Try executing the script count100k.sh:

$ ./count100k.sh

Pretty cool, eh? I think so.

Don’t waste your own time… make your computer do it!

Let’s look at the text for this program. I have numbered each line so we can identify what’s going on:

$ cat count100k.sh

outputs:

1. #!/bin/bash
2. 
3. for (( i=1; i<100001; i=i+1)); do
4.     echo $i >> mydata.txt
5. done

Starting with line (1): this is called a “shebang”, and it tells the system (your computer) which program should interpret the rest of the file. Basically, it is telling your computer what language to use when executing the instructions that follow (in this case, it is telling it to run them with the Bash program found at /bin/bash). You should use this at the top of any bash script.
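As a minimal sketch of how the shebang and execute permission fit together, the example below writes a tiny script, makes it executable with chmod +x, and runs it. The file name hello.sh, the message, and the use of a temporary directory are all assumptions made for illustration.

```shell
#!/bin/bash

# Work in a temporary directory so nothing in your own folders is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write a two-line script to a file called hello.sh (a made-up name).
cat > hello.sh << 'EOF'
#!/bin/bash
echo "Hello from a script"
EOF

# A newly created file is not executable yet, so add execute permission.
chmod +x hello.sh

# Run the script and keep whatever it printed in the variable result.
result=$(./hello.sh)
echo "$result"
```

If the chmod line is left out, running ./hello.sh typically fails with a “Permission denied” message, which is why execute permission has to be added first.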

For Loops

Next, on line (3) we have what is called a for loop. A for loop tells bash that you want it to follow the instructions within that loop for as many times as you tell it to do so. In order to tell it how many times to do the instructions within it, we use the variable i. We tell the for loop that:

  • i should start at 1 (i=1),
  • the for loop should keep going until i is no longer less than 100,001 (i<100001),
  • and that i should increase by 1 every time it goes through the loop (i=i+1).

The “inside” of the for loop is the set of instructions between do and done. In this example, we have only one command: echo $i >> mydata.txt. If we remember back to Section 3, the $ symbol means that the text directly after it is the name of a variable (in this case the variable i, which increases by 1 on each pass through the loop). The >> mydata.txt part tells Bash to append the output of echo to the end of the file mydata.txt instead of printing it to the screen.

Each time the computer goes through this for loop, it first checks whether i is still less than 100,001. If it is, it runs the echo command and then increases i by 1. If it is not, the loop ends. The result? A script with a for loop that counts from 1 to 100,000 and writes out each number as it counts up!
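As a smaller sketch of the same loop structure, the example below counts only to 5. The temporary file created by mktemp stands in for mydata.txt and is an assumption made for illustration; the >> operator appends each number to that file rather than printing it to the screen.

```shell
#!/bin/bash

# Use a temporary file so the example leaves your own data alone.
# outfile is a made-up variable name standing in for mydata.txt.
outfile=$(mktemp)

# Same structure as the loop in count100k.sh, but stopping after 5.
for (( i=1; i<6; i=i+1 )); do
    echo "$i" >> "$outfile"   # >> appends each number as a new line in the file
done

# Show what the loop wrote: the numbers 1 to 5, one per line.
cat "$outfile"
```

Changing 6 to 100001 and the file name to mydata.txt would give a loop equivalent to the one in count100k.sh.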

What does this mean for you?

I know, I know, y’all are capable of counting. But hopefully you can see where we are headed. What if, instead of echo, the command in the for loop was a program that did genome assembly? Well, then it would run 100,000 genome assemblies! That many is a bit far-fetched, but here is where it gets cool: a script can be whatever you want it to be, as long as it follows Bash's syntax rules and uses commands that Bash can find the instructions for. You can be as creative as you want! The people who wrote the larger software packages that you use as a scientist started exactly where you are right now: with a basic understanding of the command line interface and curiosity!
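To make that idea a little more concrete, here is a hedged sketch of a loop that repeats one step for several inputs. The sample names and the variable names are hypothetical, and the echo line is only a placeholder for wherever a real assembly program would be called. It uses a second form of the for loop, for name in list, which this section has not covered.

```shell
#!/bin/bash

# Count how many samples have been processed (made-up variable name).
finished=0

# plantA, plantB and plantC are hypothetical sample names; in a real
# analysis these might be the names of your input files.
for sample in plantA plantB plantC; do
    # Placeholder step: a real script would run an assembly program here.
    echo "Pretending to assemble $sample"
    finished=$((finished + 1))
done

echo "Finished $finished samples"
```

Because the list of samples lives in the script, the same steps run the same way for every input, which is what makes a script useful as documentation of an analysis.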

Once again, this introduction to bash scripting is just the tip of the iceberg. If you’d like to learn more, there is endless information on the internet via courses, YouTube videos, tutorials, etc. If you only want to know the basics, that’s fine too!

Review Questions

What is Bash?

  • Bash is the program that receives the commands you enter in the Command Line Interface window, looks for the instructions to run those commands on the computer, and then executes those instructions.

What is a bash script?

  • A bash script is a collection of instructions that Bash follows one after another. This is done to automate tasks instead of having to manually enter them yourself.

What is a variable?

  • A variable is a container used to temporarily store information. It can change when it is assigned another value (numbers or text).

What is a for loop?

  • A for loop is a set of instructions for the computer to follow until a condition is met. When the computer enters a for loop, it is given conditions (like "until the variable i is equal to 100, with i starting at 0, and i increasing by 1 every loop") that tell it when (if ever) to leave the for loop. For as long as it is inside the for loop, it follows the instructions inside of it.

If you try to run a script and it tells you that you don't have the correct permissions, what command can you use to change the permissions of the file so that you can execute it?

  • chmod +x [filename]

Is scripting / programming something that you are capable of learning?

  • Of course!! All it takes is practice.
