Lab: Short read variation II Read mapping and Loops - mestato/EPP622_2024 GitHub Wiki
BWA (Burrows-Wheeler Alignment)
In today's lab, we will be mapping our reads to the Solenopsis invicta reference genome using bwa. Before we get started, let's prep our directories and retrieve the reference genome index files.
First, change into your personal directory and make a new directory titled 3_bwa
. As a reminder, the path to your personal working directory should be
/pickett_sphinx/teaching/EPP622_2024/analyses/<your_username>
Then, create a new directory called 3_bwa
and change into it.
mkdir 3_bwa
cd 3_bwa
The next step of this process is to download the reference genome and index it. In the interest of time, we already downloaded and indexed the genome for you. The necessary files can be found here:
/pickett_sphinx/teaching/EPP622_2024/raw_data/genome
However, if you need to index a reference genome for your project, the command is:
#This code is just for your reference, please DO NOT run this!
bwa index <your_organism_ref_genome.fasta>
Let's symbolically link the S. invicta genome and index files to our working directories for easier access. First, create a new directory within your 3_bwa
directory (called solenopsis_genome_index
).
mkdir solenopsis_genome_index
cd solenopsis_genome_index
ln -s ../../../../raw_data/genome/solenopsis_invicta_genome.fa.gz* .
Next, change back into your 3_bwa
directory (cd ../
) and symbolically link your trimmed fastq file.
cd ../
ln -s ../2_skewer/<your_assigned_read_set>-trimmed.fastq .
Now, we are almost ready! Load bwa
and samtools
using spack
. The output of your bwa
run is a SAM file and we will need to convert it to a BAM file. Using the fancy |
notation we learned in the Software Carpentry lessons, we can combine these commands to save time.
spack load bwa
spack load [email protected]
bwa mem -t 3 \
solenopsis_genome_index/solenopsis_invicta_genome.fa.gz \
<your_assigned_read_set>-trimmed.fastq \
| samtools sort -@ 3 -m 4G \
-o <your_assigned_read_set>_sorted.bam
-t
is used to specify the number of threads to run the job (bwa)samtools sort
is used to convert and sort the output from a SAM file to a BAM file with a header-@
and-m
designates the number of threads to run the job and memory per thread (samtools)
NOTE - It takes about 5 minutes to run this command. It may take longer for you depending upon the number of reads in your assigned fastq file.
Once the command stops running, you can extract stats from your sorted.bam file using samtools flagstat
and output to a separate file.
samtools flagstat SRR6922447_sorted.bam > SRR6922447_sorted.bam.stats
NOTE - use your own bam file
Open up your sorted.stats
file using cat
to view various statistics. You may notice the % of reads mapped, and this value may be very high (>99% reads mapped). However, the percentage is wrong because bwa
only counts reads that map to the reference as input reads.
What % of reads mapped from your assigned fastq file? Is it different from the one you got using samtools flagstat?
To rectify this, we can use the bc
calculator tool to calculate the total number of reads from our assigned fastq file, and use this as the denominator to calculate the correct % for mapped reads.
spack load bc%[email protected]
echo $(cat <your_assigned_read_set>-trimmed.fastq | wc -l)/4 | bc
So, cat
is concatenating the fastq file and then we are piping the whole file to wc -l
which gives us the number of lines in the file. And the number of lines in my trimmed file are 7485964
. However, this is just a string, bash doesn't have the capability to recognize that it's a mathematical operation. So, the string 7485964/4
is piped to bc
which converts it into a mathematical equation and actually divides the number 7485964
by 4
% reads mapped = (total # of reads mapped from flagstat)/(total number of reads from bc command) * 100
bwa
(or any program) on multiple files.
DIY - Using loops to run Highly recommended to try and run on your own if you are planning to run bioinformatic programs on many files in your own research
In this lab, you are running each program on a single file. In reality, you will have dozens of files, and typing out the command for each file is tedious and cumbersome. Instead, we can use a for loop to run a program through all our files.
To exemplify this, I created a directory for_loop_exercise
with 2 sub-directories - single_end
and paired_end
. We will see how to run for loop on both type of data.
(base) [bkapoor@sphinx bkapoor]$ tree for_loop_exercise/
for_loop_exercise/
├── paired_end
│ ├── SRR6922148_1.fastq
│ ├── SRR6922148_2.fastq
│ ├── SRR6922291_1.fastq
│ ├── SRR6922291_2.fastq
│ ├── SRR6922311_1.fastq
│ └── SRR6922311_2.fastq
└── single_end
├── SRR6922148.fastq
├── SRR6922291.fastq
└── SRR6922311.fastq
2 directories, 9 files
Change into your personal working directory and copy this new directory into it. Make sure to use the recursive flag (-r) to copy the directory and all its contents.
cd /pickett_sphinx/teaching/EPP622_2024/analyses/<your_utk_username>
cp -r /pickett_sphinx/teaching/EPP622_2024/analyses/bkapoor/for_loop_exercise .
First, let's run for loop on single end files. Change into the single_end
folder and start writing the script using nano
. Name it whatever you want, but make sure it's relevant to your task and it ends in .sh
. Here is what I named mine:
cd single_end/
nano bwa_se.sh
The first thing we need to do is extract the base name from each of the files. We can do this using sed
; also known as "stream editor", sed
is incredibly powerful, as it can be used to replace, search, insert, or delete items in a file.
NOTE: although these scripts use bwa
, you can adapt these scripts to any program you want to run.
for file in *.fastq
do
basename=$(echo "$file" | sed 's/.fastq//')
echo $file
echo $basename
done
The sed 's/.fastq//'
part of the command substitutes (the "s") the ".fastq" part of the filename with nothing (notice the emptiness between the slashes). Execute the script by typing bash bwa.sh
. It will print the original name of the file (echo $file
), followed by the new-and-improved basename ($echo basename
).
SRR6922148_1.fastq
SRR6922148_1
SRR6922291_1.fastq
SRR6922291_1
SRR6922311_1.fastq
SRR6922311_1
Now, let's modify and combine this new code with our old bwa
code. Also, you can add your software packages here instead of loading before running the script. In comparison to our original bwa
run in the lab session, note that the only thing that changes is the first part of the input and output files will be now be ${basename} instead of the file's name hardcoded in.
spack load bwa
spack load [email protected]
for file in *.fastq
do
basename=$(echo "$file" | sed 's/.fastq//')
echo $file
echo $basename
bwa mem -t 6 \
/pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz \
${basename}.fastq \
| samtools sort \
-@ 3 -m 4G \
-o ${basename}_sorted.bam
done
Next, let's use for loop to map paired end data to the genome. Change into the paired_end
directory and write the following script.
cd ../paired_end
nano bwa_pe.sh
Let's see an example:
for file in *_1.fastq
do
basename=$(echo "$file" | sed 's/_1.fastq//')
echo $basename #you will get SRR### instead of SRR###_1 like in the previous example
bwa mem -t 6 \
/pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz ${basename}_1.fastq ${basename}_2.fastq | samtools sort -@ 3 -m 4G -o ${basename}_sorted.bam
done