E1 Basic shell programming - NU-CPGME/sl_workshop_2024 GitHub Wiki

July / August, 2024

Developed by:
Egon A. Ozer, MD PhD ([email protected])
Ramon Lorenzo Redondo, PhD ([email protected])


The shell is the command processor that has been running all of the commands we've typed into the Terminal so far. It also functions as a programming language, and learning a few tricks can greatly improve your productivity and make repetitive tasks easier. There are a few different shell programs, but the most common is Bash (Bourne Again SHell). The default shell on Mac has been changed to Zsh. There are some prominent differences, but the basic functions we will be using most are the same.

Section 1 - Loops

The "for" loop is a basic but powerful construct in Bash. For loops allow you to perform commands multiple times or on groups of files or variables.

For loops start with a for statement followed by a variable name. In this case I am using i as the variable name. The next part of the statement defines the list of values to loop over: the keyword in followed by the list items, separated by spaces.

The next line should start with do which signals the start of the command or set of commands you want to loop over.

Each subsequent line should contain a command or commands. If you want to use the variable in the commands, then add a dollar sign $ in front of the variable name to recall it. Since we used i to define the variable name in the first line, we use $i to recall the variable in the command. Finally, end the loop with done.

Example:

for i in 1 2 3 A B C  
do  
echo $i  
done

The echo function prints something to the screen.

Output:

1
2
3
A
B
C

A more complicated example (leading spaces added for readability only):

for file in Ecoli.fasta Kleb.fasta 
do
    echo "Working on $file"
    quast $file > $file.quast.txt
    head -n 5 $file.quast.txt >> all_results.txt
    mv $file finished/
done

You can also use the wildcard symbol * to auto-generate a variable list instead of typing everything out.

For example, if you have several files in a directory named "genomes" ending with the .fasta suffix and you want to run the same command on all of them, you could start your loop with:

for i in genomes/*.fasta
do
[etc.]
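As a self-contained sketch of what that might look like (the directory and file names below are hypothetical, created only so the wildcard has something to match):

```shell
# Set up a demo directory with two empty FASTA files
# (names are invented just for this demonstration)
mkdir -p genomes
touch genomes/Ecoli.fasta genomes/Kleb.fasta

# The wildcard expands to every matching file, in alphabetical order
for i in genomes/*.fasta
do
    echo "Found $i"
done
# Output:
# Found genomes/Ecoli.fasta
# Found genomes/Kleb.fasta
```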

Section 2 - Getting loop variables from a file

Sometimes all the variables you want to use will be in a file. One example could be a file with a list of sample IDs. You can use the while and read functions to use these variables in loops.

For example, you may have a file named "variables.txt" that looks like this:

One
Two
Three

To use these variables in your loop, use the while and read functions and use the standard input (or STDIN) redirect < to provide the variables in the file to the loop:

while read var
do
    echo $var
done < variables.txt
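A self-contained version of the same loop, which first creates variables.txt with printf so you can run everything in one go:

```shell
# Create the variables file (three lines)
printf 'One\nTwo\nThree\n' > variables.txt

# Each pass through the loop reads one line into $var
while read var
do
    echo $var
done < variables.txt
# Output:
# One
# Two
# Three
```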

The read function also allows you to read multiple variables per line in the input. Just separate them by white space (space or tab) in the variable file.

variables_multi.txt:

apples Kevin
pears Suzy
figs John

Multiple variable loop example:

while read food name
do
    echo "$name likes to eat $food"
done < variables_multi.txt

The output will be:

Kevin likes to eat apples
Suzy likes to eat pears
John likes to eat figs

Section 3 - Basic variable manipulation using Perl one-liners

A frequent activity when managing data on the command line is manipulation of file names or other data. Most often, I find I'll want to use the prefix of input files in a loop to name the output files. To do this, I need a method for string manipulation. There are several great tools available for this, such as sed and awk, but we'll discuss one simple method: using regular expressions in a Perl one-line program.

Regular expressions are patterns to search for in text. The command used most often in this context is the substitute command. The command defines a pattern to match and a pattern to substitute the match with.

Here is an example of the structure of a substitute command: s/high/low/

The general format of the substitute command consists of four parts:

  • s: the substitute command
  • /../../: delimiters. Traditionally the / character is used as a delimiter, but any character can be used.
  • high: the regular expression pattern to be matched
  • low: the replacement string

If we apply the substitute command above to an example string,

I've got friends in high places

the output will be:

I've got friends in low places

Regular expressions don't have to match text exactly; they can include wildcard and repetition characters to make the matching more flexible.

Commonly used wildcards include:

  • .: match any single character
  • \s: match a whitespace character
  • \S: match a non-whitespace character
  • \d: match a digit character
  • \t: match a tab character

Commonly used repetitions include:

  • *: match 0 or more times
  • +: match 1 or more times
  • {n}: match exactly n times

A longer list of wildcards and repetitions can be found in the Perl regular expression documentation (perldoc perlre).

For example, in the filename genome_187.fasta, applying the substitute command s/\d+/one/ matches a pattern of one or more (+) consecutive digit characters (\d) and replaces them with the word one, giving the result genome_one.fasta.

Now, to actually apply the substitution command, you need to use Perl, a programming language built into nearly all Linux and Unix-like (including macOS) systems. To do that, we pipe the string we want to change into a Perl one-liner, which looks like this:

echo "genome_187.fasta" | perl -pe "s/\d+/one/"

This construction can be very useful in bash loops, for example, when we want to extract file prefixes and assign them to a new variable.

For example:

for input in file1.fastq file2.fastq file3.fastq
do
    prefix=`echo $input | perl -pe "s/\.fastq//"`
    echo "$prefix.txt"
done

Here we're looping through a list of three file names (file1.fastq, file2.fastq, and file3.fastq), assigning each in turn to the variable "input". Inside the loop we define a new variable named "prefix" that holds the result of a substitution command in which the pattern "\.fastq" is replaced with nothing, in effect removing it from the string (the backslash makes the . match a literal period rather than any character). We use the backtick characters (`) to surround a set of commands we want the shell to execute, with the output captured into the new variable "prefix". We then print the prefix variable (recalled with the $ character) with our new suffix. Hopefully you can see how this might be useful for organizing the results and outputs of looped pipelines.

If you want to play around and test the behavior of a regular expression to see if it will match the pattern in your data that you expect, there are several regular expression testers on the internet. This is the one I use most often because it does a good job of highlighting and explaining the matches: Regular Expressions 101

Section 4 - Shell script files

If there are sets of commands or loops that you use repeatedly, you probably don't want to type them over and over again. You can solve this by creating shell script files that contain all the commands you would usually type out one at a time. The commands in the file can then be run using the sh command.

For example, you could use a text editor such as nano to create a simple text file called "my_first_script.sh" that contains the following commands:

echo "Working hard" > test_file.txt
cp test_file.txt new_file.txt
echo "Hardly working" >> new_file.txt
cat new_file.txt

Then to run the commands in that file, just type sh my_first_script.sh and see what happens.
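If you'd rather not open an editor, the script file can also be written directly from the shell using a "here document" (a sketch is below); running it should print both lines of new_file.txt:

```shell
# Write the script file without opening an editor
cat > my_first_script.sh <<'EOF'
echo "Working hard" > test_file.txt
cp test_file.txt new_file.txt
echo "Hardly working" >> new_file.txt
cat new_file.txt
EOF

sh my_first_script.sh
# Output:
# Working hard
# Hardly working
```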

You can make your script files even more flexible by allowing variables to be passed in on the command line when you run the script. For example, we can create a new file called "my_second_script.sh" that contains these commands:

echo "$1" > test_file.txt
cp test_file.txt new_file.txt
echo "$2" >> new_file.txt
cat new_file.txt

The $1 and $2 variables refer to the first and second arguments on the command line, respectively. So now if you run the following command:

sh my_second_script.sh open shut

You'll see how the arguments on the command line get integrated into the output.
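As a quick check, creating the script from the shell (again with a here document) and running it with the arguments open and shut gives:

```shell
cat > my_second_script.sh <<'EOF'
echo "$1" > test_file.txt
cp test_file.txt new_file.txt
echo "$2" >> new_file.txt
cat new_file.txt
EOF

# $1 becomes "open" and $2 becomes "shut"
sh my_second_script.sh open shut
# Output:
# open
# shut
```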

Creating shell scripts like this is one way to start creating, maintaining, and sharing bioinformatic or other analytical pipelines.



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
