Unix IV: Basic scripting - BDC-training/VT25 GitHub Wiki
You will be creating the following scripts:
- hello_world.sh (containing questions 1-8)
- fasta_info.sh (containing question 9)
- fastq_info.sh (containing question 10)
Throughout this exercise we highly encourage you to practice your coding by adding comments to your commands. You will also practice queueing jobs to the cluster.
NOTE: The first script will contain many short commands. If you don't want to run through all of them every time you run the script - comment them out (using #
at the beginning of the line) and focus on one task at a time.
Create a new directory for this exercise named Bash_scripting
Click for code
mkdir Bash_scripting
cd Bash_scripting
Start by open nedit
Click for code
gedit &
Tell the system that this is a shell script by adding on top of your file:
#!/bin/sh
Save the file as hello_world.sh
Its also good practice to add some information like date, script description and usage to your script. Today this script hopefully make sense to you, but in a year you probably have no idea what it does and how you should run it, so add the following:
#!/bin/sh
# Author:
# Date:
# Description: Bash course - scripting practice
# Usage: ./hello_world.sh
Q1. Add the message
hello world
to your script so that it will be printed to the screen when running./hello_world.sh
Remember that to execute the script you must have permissions to do that. So check the file permissions and change accordingly.
Click for code
ls -l
chmod +x hello_world.sh
Run the script:
./hello_world.sh
Add two variables, the variable name
with your name and the variable color
with your favorite color to the script hello_world.sh
.
Q2. Add a command to your script that outputs
My name is <your_name> and my favorite color is <color>
using your variables.
Create another variable named cap_seq
and assign this sequence ACCCGGTGTAGCGATGACAGAGAC
to it.
Q3. Add a command that converts your
cap_seq
to small letters and assign that sequence to a variable calledsmall_seq
. Print both sequences to screen.
Q4. Count the GC content of
cap_seq
(the number of G+C divided by the total number of A+C+G+T).
- Don't forget to remove the newline
\n
in the end of the line.- Print the GC content to screen.
- Hints:
- you can create two variables
ACGT
with the total count andGC
with the GC count.- Then use the basic calculator
bc
to count the frequency. eg.
echo "scale=3; 4/12" | bc
Now lets practice to read in files in the script.
From the shell, create a dummy file in the same directory as your script, called dummy.txt
with the text dummy - hello world
. Check that it has been created correctly using less
, more
or cat
.
Click for code
touch dummy.txt
echo "dummy - hello world" >> dummy.txt
cat dummy.txt
Add a variable in your script called dummy
and assign the file name to it.
Now check that you can read the file from the script by adding the following command and run the script:
cat $dummy
Q5. Now add a command that removes the text
dummy -
from thedummy.txt
file so it only include the texthello world
.
If you have several files you want to do the same operations on, it's very efficient to use loops. With loops you only write one command that you run or "loop" several files through.
- Copy the five
sequence*.fastq
files from/home/courses/Unix/scripting
to yourBash_scripting
directory - Take a look at your files.
Now let´s practice to use loops.
Q6. Add a loop to your script that give you the identifier of every file. It should look like this:
A
B
C
Q7. Add a loop to your script that prints the file name and identifiers of every file. It should look like this:
dummy_1.fa
A
dummy_2.fa
B
dummy_3.fa
C
Create a new directory vcf_files
.
Copy all vcf files from /home/courses/Unix/scripting
to your vcf_files
directory.
To remove the extension of a file, we can use basename
:
basename S1_1000.vcf .vcf
Q8. Add a loop that will take several vcf files as input and outputs new vcf files named
<basename>.filtered.vcf
containing all variants that pass the filter (PASS in FILTER column).
- _Hint: use the example above or feel free to google how to get the basename of a filename using bash
Sometimes you have a script that you use several times but with different inputs. For example; you have been writing a script that converts your fasta sequence from capital letters to small letters. You would like to use this script for any fasta file without open and changing the actual script. Then you use scripts with arguments.
NOTE: Remember that the first argument is assigned to $1
. If you have more arguments they will be $2
, $3
and so on.
Q9. Make a new script called
fasta_info.sh
that takes three fasta files as input arguments and output one (namedfasta_info.csv
) file including three comma separated columns; 1) file name, 2) number of sequences in this file, and 3) the average sequence length in the file. A header should be added to explain the columns.
It should output something like this:
$ ./fasta_info.sh "*.fa"
$ cat fasta_info.csv
file,number_of_seq,Avg_seq_length
arabidopsis_thaliana.fa,3,1949
homo_sapiens.fa,2,5227
mus_musculus.fa,3,957
Start by open up a new file in eg. gedit, save it as fasta_info.sh
. Add:
#!/bin/sh
# Author:
# Date:
# Description:
# Usage:
Copy the directory /home/courses/Unix/scripting
to your Bash_scripting
directory. It contains three fasta
files. Take a look at them.
Then start step by step to get the information you need. Don't try to write the whole script at once! And don't forget to save your changes now and then.
Q10. Create a new script named
fastq_info.sh
that will take afastq
file as input and outputs a new file (namedfastq_info.txt
) with the name of the third argument. The file should include 1) Date when the script was executed, 2) name of the person who run it, 3) file name and 4) number of sequences in the file.
Copy the fastq
file from /home/courses/Unix/scripting/Fly_short.fastq
to your directory.
It should run like this:
./fastq_info.sh <fastq> <user_name> <output_name>
and you should get the following output in your txt file:
$ cat fastq_info.txt
Date: Fri Apr 16 13:23:05 CEST 2021
executed by: Vanja
Filename: Fly_short.fastq
Number of sequences: 25
Q11. Submit your script to the queuing system to process all fastq files at
/home/courses/NGS/fastq
Developed by Vanja Börjesson, 2021. Modified by Marcela Dávila and Alvar Almstedt, 2024.