03. Variables - davidaray/Genomes-and-Genome-Evolution GitHub Wiki

Variables store pieces of information, that can be temporary or permanent given that the contents may change, hence its name. They can contain a character, a string of characters, numbers, or all of them (alphanumeric). In bash scripting, it is common practice to use variables because of their practicality and applicability as you will see in this exercise. The utility of working with variables is more noticeable when you are working with complex scripts or loops.

Environmental vs local variables

Every operating system has a predefined set of information on how it should work, what specific software or module it should use to perform certain actions. The set of predefined variables that control everything, are the environmental variables. The HPCC terminal is no exception to this, and you can see what these are by typing printenv.

Local variables, on the other hand, are defined by the user and will be erased every time the system is reset or shuts down, and can be changed at anytime. You will make intensive use of local variables in this course.

To make it easy to understand, let's start with one of the most common examples in any programming language. Type greeting="Hello World!", followed by echo $greeting

Hello World!

Notice that the content of the variable should be written in quotes, although in some cases it's not strictly necessary. You have created a variable named "greeting", which contains two words and a space between them, or -technically speaking- a string of characters. Bash will identify variables because of the dollar sign "$" preceding its name.

You can also add as many characters as you want. For example, introduction="My name is David Ray and I am a professor at TTU", and to display its content type echo $introduction

My name is David Ray and I am a professor at TTU

You can type it all together with echo $greeting $introduction

Hello World! My name is David Ray and I am a professor at TTU

If you want to add more text to an existent variable, or rewrite it, you must type the exact same variable name and change the text inside of the quotes introduction="My name is David Ray, I am a professor at TTU, and will be your GGE teacher in fall 2024!", and again, you can run echo $greeting $introduction to check the updated variable content.

Hello World! My name is David Ray, I am a professor at TTU, and will be your GGE teacher in fall 2024!

Naming conventions

As you noticed, you can write anything as the content of a variable, however, there are some rules to follow when naming them.

It isn't impossible, for example, to write numbers as names, or to use a number as the first character of your variable name. Try this 1variable="This will not work"

-bash: 1variable=This will not work: command not found

Of course, bash wasn't able to store it and was actually waiting for a command or function instead.

Now try variable1="This will work" followed by echo $variable1

$ variable1="This will work"
$ echo $variable1
This will not work

Also, you are not able to use spaces if you decide to name your variable with two or more words, try space variable="some text"

bash: space: command not found

The result will be different if you name your variable "space_variable". Try it yourself! Moreover, it is highly recommended and good practice, to name a variable based on its content. You don't want to have a variable named "greeting" when it contains a path to a folder, or a DNA/aminoacid sequence.

One last thing. A habit I've gotten into (and this is not common among people who write code but I find it very useful) is to use all capital letters to assign my variables. This makes it easier to identify the variables in my code, especially if there are many, many lines.

In other words, I prefer MY_VARIABLE="variable text" to my_variable="variable text". You don't have to do this but I recommend it.

Using variables

Using variables allows you to replace long strings of characters for a smaller and easy to use element. Specially when dealing with long paths to directories or files you may use constantly and that may be nested in a directory, inside a directory, inside a directory, like:

/this/is/a/path/to/a/very/intricate/project/structure/with/lots/of/nested/directories/that/eventually/get/to/this/file.txt

Instead of typing this over and over again, you can simply assign the variable with

MY_PATH="/this/is/a/path/to/a/very/intricate/project/structure/with/lots/of/nested/directories/that/eventually/get/to/this/file.txt"

And if you need to use that path to access a file, you simply type

$MY_PATH/file.txt

rather than /this/is/a/path/to/a/very/intricate/project/structure/with/lots/of/nested/directories/that/eventually/get/to/this/file.txt

Files and paths

Before proceeding, make sure you get a refresher on working directories in Exercise 01 - You are here, if you need it.

Let's make sure you are in your home directory with pwd. You should see:

/home/[eraider]

Now, copy a directory I set up for this exercise using cp -r /home/daray/arcane .

You now have a copy of that path and anything contained in the directories in your home directory, /home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise

Notice the very long path. It'd be a real pain to type that repeatedly.

However, I put a file you've already used in a previous exercise into the exercise directory at the end of that long path and you may want to look at it or use it multiple times during some process.

So, you could enter cat /home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise/counter.sh

#!/bin/bash
#SBATCH --job-name=counter
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1


for i in `seq 1 1000`;
do
    echo $i
    sleep 1
done

Or, you could do the following to avoid having to type over and over again whenevr you want to use that file.

DIR=/home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise
cat $DIR/counter.sh

#!/bin/bash
#SBATCH --job-name=counter
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
#SBATCH --partition=nocona
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1


for i in `seq 1 1000`;
do
    echo $i
    sleep 1
done

You could simplify the code above by adding that path in COUNTER_DIR="/home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise/". You can now get the same result as before with cat $COUNTER_DIR/counter.sh

You could extract information from that script and save it in another file, or into another variable.

grep -w partition $COUNTER_DIR/counter.sh # Check the partition that you used to run the counter script

#SBATCH --partition=nocona

Notice that it returns the part of the header where the partition of HPCC is assigned.

If you wanted to change the partition to be used for that particular script without using nano, you could do something like

sed "s/nocona/quanah/g" $DIR/counter.sh > $DIR/counter_quanah.sh

View the new file and corroborate that the change was done.

Compare the command line you just used to one not making use of variables

sed "s/nocona/quanah/g" /home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise/counter.sh > /home/[eraider]/arcane/set/of/nested/directories/just/to/make/a/point/for/this/exercise/counter_quanah.sh

They do exactly the same thing but which one is easier to understand and to enter on the command line?

NOTE: You may encounter this error:

sed: can't read /lustre/scratch/[eraider]/gge2024/bin/counter.sh: No such file or directory

99.9% of the times this error occurs because you didn't type the directory correctly, so the computer cannot locate that file. You can always check the variable and change its content to fix this.

Command substitution

You can also capture the output of a command and store it in a variable. To make that possible, there's a small modification to be done in the syntax you have been using so far.

VARIABLE=$(command)

For example, if you wanted to store today's date and time, the date command can do that.

CURRENT_DATE=$(date) #save date and time in a variable echo "Today's date and time is: $CURRENT_DATE" #Print it

We can do things a bit more complicated. Let's use the opening_lines.txt file you used in Exercise 01, and do some filtering with the commands you have already learnt. First, make sure you change your directory to where this file is stored.

CHARACTER_NUMBER=$(grep was opening_lines.txt | wc -c)

Breaking it down, you used grep to find the word "was" in the file opening_lines.txt, and you used a pipe | , to count the number of characters present in the file that had the word "was" in the matching lines with wc -c. The result of this output was stored in the variable "$character_number".

NOTE: If the last couple of commad lines have confused you, please review Exercise 01 - Matching lines in files with "grep", and Combining UNIX commands with pipes before you move on with this exercise.

Putting it all together

How is any of this useful? In the context of programming, bioinformatics and genomics, you can simplify your code and change the content of a variable and work with several paths at once. I will illustrate how you could take advantage of variables in a very basic and over simplified pipeline to filter and identify protein sequences transcribed by the BRCA2 gene, which are known to increase the risks of breast and ovarian cancer. Read more about it here.

This gene is located on chromosome 13q12, typically the coding region's length is around 11,000bp long which means it encodes a protein of around 3,000 aminoacids.

I have created some directories and files to have you practice with variables. Make sure you are located in:

/lustre/scratch/[eraider]/gge2024/

Now copy the files cp -r /lustre/scratch/daray/gge2024/variables_exercise ., and cd into the "variables_exercise" directory.

You should see two directories, when you type ls.

annotations sequences

Let's create two variables for each one of those:

ANNOTATION_DIR="/lustre/scratch/[eraider]/gge2024/variables_exercise/annotations"
SEQ_DIR="/lustre/scratch/[eraider]/gge2024/variables_exercise/sequences"

Let's see what's inside the file located in "ANNOTATION_DIR" with cat $ANNOTATION_DIR/brca_description.txt

BRCA2   homo_sapiens    3418    P51587  P51587_BRCA2_HUMAN
BRCA2   mus_musculus    3329    P97929  P97929_BRCA2_MOUSE
BRCA2   ursus_americanus        3458    A0A452RG22      A0A452RG22_A0A452RG22_URSAM
BRCA2   panthera_leo    3356    A0A8C8XLK1      A0A8C8XLK1_A0A8C8XLK1_PANLE
PRDX1   myotis_lucifugus        202     Q6B4U9  Q6B4U9_PRDX1_MYOLU
BRCA2   myotis_lucifugus        315     G1NUN6  G1NUN6_G1NUN6_MYOLU
PALB2   myotis_lucifugus        1118    G1Q4H6  G1Q4H6_G1Q4H6_MYOLU

This file contains information about:

What gene it is
What organism it is from
Its length
An accession number that can be looked up in this protein database called UNIPROT
The name with which the sequence is identified in the fasta file

Now let's have a look at the BRCA protein sequences cat $SEQ_DIR/brca_sequences.fasta.

>P51587_BRCA2_HUMAN
MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPNLFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTVKTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVVCGSLFHTPKFVKGRQTPKHISESLGAEVDPDMSWSSSLATPPTLSSTVLIVRNEEASETVFPHDTTANVKSYFSNHDESLKKNDRFIASVTDSENTNQREAASHGFGKTSGNSFKVNSCKDHIGKSMPNVLEDEVYETVVDTSEEDSFSLCFSKCRTKNLQKVRTSKTRKKIFHEANADECEKSKNQVKEKYSFVSEVEPNDTDPLDSNVANQKPFESGSDKISKEVVPSLACEWSQLTLSGLNGAQMEKIPLLHISSCDQNISEKDLLDTENKRKKDFLTSENSLPRISSLPKSEKPLNEETVVNKRDEEQHLESHTDCILAVKQAISGTSPVASSFQGIKKSIFRIRESPKETFNASFSGHMTDPNFKKETEASESGLEIHTVCSQKEDSLCPNLIDNGSWPATTTQNSVALKNAGLISTLKKKTNKFIYAIHDETSYKGKKIPKDQKSELINCSAQFEANAFEAPLTFANADSGLLHSSVKRSCSQNDSEEPTLSLTSSFGTILRKCSRNETCSNNTVISQDLDYKEAKCNKEKLQLFITPEADSLSCLQEGQCENDPKSKKVSDIKEEVLAAACHPVQHSKVEYSDTDFQSQKSLLYDHENASTLILTPT
.... and a lot more

First, note that the lines starting with a > are the ones that contain the name of that sequence, and just below it, you will see letters that represent aminoacids making up the protein sequence. If you look closely at sequence "Q6B4U9_PRDX1_MYOLU" you will notice that, it has an asterisk * in between the aminoacids. This means that there is a stop codon there, that could have either resulted from an error in the genome assembly, genome annotation, or it cooresponds to a pseudogene. This protein might cause issues in some analyses, and besides, this is also a differente gene called PRDX1. So, let's get rid of it.

For that you can use grep to locate where that * is, and store it along the protein sequence name in a variable.

BAD_PROTEIN=$(grep -B 1 "*" $SEQ_DIR/brca_sequences.fasta)

The -B 1 argument tells grep that once it finds a match, it also includes whatever is one line above it, which in this case, is the name of the protein. Try echo "$BAD_PROTEIN"

Now you can eliminate the protein that contains the stop codon: grep -v "$BAD_PROTEIN" $SEQ_DIR/brca_sequences.fasta > filtered_brca_sequences.fasta

Now open the new filtered_brca_sequences.fasta file, and you will notice that the sequence is gone.

Something you should know

What you have just done is an extremely basic approximation of the pre-filtering steps you would typically do in a comparative genomics project. In this case, you made sure that you kept only BCRA2 genes do not contain stop codons.

To make this more didactic, go to this link. You'll see the alignment of the genomic regions in human and mouse where this gene is located (locus) in each organism. Actually, the proteins you used in this exercise were transcribed by the very genes you are seeing because I used this database to tailor this class. With this tool you can explore the structural genetic differences for both organisms.

Notice that the aligned region is 34.43kb long, and that the lengths of the genes are approximately the same with some differences in the mouse. The orange/brownish boxes represent exons, and they seem to be conserved in both organisms, with some differences in the exons located close to the 15 Kb of the alignment. Keep in mind that these lengths are relative and refer to the alignment alone.

Bear in mind two things this process can be more easily achieved by making use of a combination of for loops and a tool called seqkit, replacing the content of the variables in each iteration.

#------------------------------------------------------------------------------#

FOR YOU TO DO

#------------------------------------------------------------------------------#

You just saved the "filtered_brca_sequences.fasta" file in the "variables_exercise" directory, but it is best to keep things organized. Create a directory called results and make sure that you save the fasta file there when you create it. Of course, make use variables to achieve this, and upload the code that you used to achieve this task to Assignment 3 - Variables.