Course: VT25 Unix applied to genomic data (SC00036)

You will be creating the following scripts:

hello_world.sh (containing questions 1-8)
fasta_info.sh (containing question 9)
fastq_info.sh (containing question 10)

Throughout this exercise we highly encourage you to practice your coding by adding comments to your commands. You will also practice queueing jobs to the cluster.

NOTE: The first script will contain many short commands. If you don't want to run through all of them every time you run the script - comment them out (using # at the beginning of the line) and focus on one task at a time.

Create a new directory for this exercise named Bash_scripting

Click for code

mkdir Bash_scripting
cd Bash_scripting

Your first bash script

Start by open nedit

Click for code

gedit &

Tell the system that this is a shell script by adding on top of your file:

#!/bin/sh

Save the file as hello_world.sh

Its also good practice to add some information like date, script description and usage to your script. Today this script hopefully make sense to you, but in a year you probably have no idea what it does and how you should run it, so add the following:

#!/bin/sh
 
# Author:
# Date:
# Description: Bash course - scripting practice
# Usage: ./hello_world.sh

Q1. Add the message hello world to your script so that it will be printed to the screen when running ./hello_world.sh

Remember that to execute the script you must have permissions to do that. So check the file permissions and change accordingly.

Click for code

ls -l 
chmod +x hello_world.sh

Run the script:

./hello_world.sh

Variables and arithmetics

Add two variables, the variable name with your name and the variable color with your favorite color to the script hello_world.sh.

Q2. Add a command to your script that outputs My name is <your_name> and my favorite color is <color> using your variables.

Create another variable named cap_seq and assign this sequence ACCCGGTGTAGCGATGACAGAGAC to it.

Q3. Add a command that converts your cap_seq to small letters and assign that sequence to a variable called small_seq. Print both sequences to screen.

Q4. Count the GC content of cap_seq (the number of G+C divided by the total number of A+C+G+T).

Don't forget to remove the newline \n in the end of the line.

Print the GC content to screen.

Hints:

you can create two variables ACGT with the total count and GCwith the GC count.

Then use the basic calculator bc to count the frequency. eg.

echo "scale=3; 4/12" | bc

Read files

Now lets practice to read in files in the script.

From the shell, create a dummy file in the same directory as your script, called dummy.txt with the text dummy - hello world. Check that it has been created correctly using less, more or cat.

Click for code

touch dummy.txt
echo "dummy - hello world" >> dummy.txt
cat dummy.txt

Add a variable in your script called dummy and assign the file name to it. Now check that you can read the file from the script by adding the following command and run the script:

cat $dummy

Q5. Now add a command that removes the text dummy - from the dummy.txt file so it only include the text hello world.

Loops

If you have several files you want to do the same operations on, it's very efficient to use loops. With loops you only write one command that you run or "loop" several files through.

Copy the five sequence*.fastq files from /home/courses/Unix/scripting to your Bash_scripting directory
Take a look at your files.

Now let´s practice to use loops.

Q6. Add a loop to your script that give you the identifier of every file. It should look like this:

A
B
C

Q7. Add a loop to your script that prints the file name and identifiers of every file. It should look like this:

dummy_1.fa
A
dummy_2.fa
B
dummy_3.fa
C

Create a new directory vcf_files.

Copy all vcf files from /home/courses/Unix/scripting to your vcf_files directory.

To remove the extension of a file, we can use basename:

basename S1_1000.vcf .vcf

Q8. Add a loop that will take several vcf files as input and outputs new vcf files named <basename>.filtered.vcf containing all variants that pass the filter (PASS in FILTER column).

_Hint: use the example above or feel free to google how to get the basename of a filename using bash

Run scripts with arguments

Sometimes you have a script that you use several times but with different inputs. For example; you have been writing a script that converts your fasta sequence from capital letters to small letters. You would like to use this script for any fasta file without open and changing the actual script. Then you use scripts with arguments.

NOTE: Remember that the first argument is assigned to $1. If you have more arguments they will be $2, $3 and so on.

Q9. Make a new script called fasta_info.sh that takes three fasta files as input arguments and output one (named fasta_info.csv) file including three comma separated columns; 1) file name, 2) number of sequences in this file, and 3) the average sequence length in the file. A header should be added to explain the columns.

It should output something like this:

$ ./fasta_info.sh "*.fa"
$ cat fasta_info.csv
file,number_of_seq,Avg_seq_length
arabidopsis_thaliana.fa,3,1949
homo_sapiens.fa,2,5227
mus_musculus.fa,3,957

Start by open up a new file in eg. gedit, save it as fasta_info.sh. Add:

#!/bin/sh
 
# Author:
# Date:
# Description:
# Usage:

Copy the directory /home/courses/Unix/scripting to your Bash_scripting directory. It contains three fasta files. Take a look at them.

Then start step by step to get the information you need. Don't try to write the whole script at once! And don't forget to save your changes now and then.

Q10. Create a new script named fastq_info.sh that will take a fastq file as input and outputs a new file (named fastq_info.txt) with the name of the third argument. The file should include 1) Date when the script was executed, 2) name of the person who run it, 3) file name and 4) number of sequences in the file.

Copy the fastq file from /home/courses/Unix/scripting/Fly_short.fastq to your directory.

It should run like this:

./fastq_info.sh <fastq> <user_name> <output_name>

and you should get the following output in your txt file:

$ cat fastq_info.txt
Date:  Fri Apr 16 13:23:05 CEST 2021
executed by:  Vanja
Filename:  Fly_short.fastq
Number of sequences:  25

Q11. Submit your script to the queuing system to process all fastq files at /home/courses/NGS/fastq

Home: Unix applied to genomic data

Developed by Vanja Börjesson, 2021. Modified by Marcela Dávila and Alvar Almstedt, 2024.

Unix IV: Basic scripting - BDC-training/VT25 GitHub Wiki

Course: VT25 Unix applied to genomic data (SC00036)

Your first bash script

Variables and arithmetics

Read files

Loops

Run scripts with arguments

Home: Unix applied to genomic data

⚠️ GitHub.com Fallback ⚠️

Unix IV: Basic scripting - BDC-training/VT25 GitHub Wiki

Course: VT25 Unix applied to genomic data (SC00036)

Your first bash script

Variables and arithmetics

Read files

Loops

Run scripts with arguments

Home: Unix applied to genomic data

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️