NGS II: Mapping - bcfgothenburg/HT24 GitHub Wiki
Course: HT24 Analysis of next generation sequencing data (SC00204)
The purpose of this exercise is to introduce you to some common aligners and understand that you should be familiar with the data to be analyzed in order to pick the right tool.
The Data
For these exercises, we will be using data from samples AS_2 (exome) from this paper and Ctrl1 (RNAseq) that target chr21 from ENA (ERR1523942)
The server
Connect to the server using MobaXterm (PC users) or your local Terminal (MACS users):
ssh -Y your_account@remote_server
Load the following modules, using module load
:
bwa-mem2
samtools/1.17
bowtie2/2.5.1
tophat/2.1.1
star/2.7.10
picard/3.0.0
multiqc/1.14
Aligners comparison
Remember that depending on the type of data, there will be a better suited mapper. For instance, for DNA sequencing, bowtie2 and bwa are two of the most used short read aligners. While bowtie2 is faster, bwa is more sensitive.
-
Copy
chr21_dna_R1.fastq.gz
,chr21_dna_R2.fastq.gz
,chr21_rna_R1.fastq.gz
,chr21_rna_R2.fastq.gz
to yourFastq
directoy, from/home/courses/NGS/Mappers
. -
Create a directory called
Mappers
and save all the results of this practical there.
We could use the human genome as our reference, but since we are only working with chr21, let's just use this chromosome. This way we can practice generating the files we need for the different tools.
- In your
/home
you created a directory calleddb
(if you don't have it, create it). - Copy the nucleotide sequence
chr21.fasta
and its annotationgenes.chr21.gtf
, from/home/courses/NGS/Mappers/db
.
Note: Add the time
command at the beginning of your comands, to track the time each process takes.
Let's start by looking at the DNA data.
BWA
BWA is a software package for mapping low-divergent sequences against a large reference genome. BWA-MEM
, one of the three algorithms, is generally recommended for its higher speed and accuracy.
The first step is to create the index for our reference. This has to be done only once for each new reference genome and sometimes it is just a matter of downloading it from iGenomes, for example.
- Go to your
db
directory - Run
bwa-mem2 index
to create the index of our reference:chr21.fasta
BWA does not work with zipped files (unfortunately).
-
Go to your
Fastq
directory -
Unzip your
dna
fastq files, usinggunzip
-
Go to your
Mappers
directory -
Use
bwa-mem2 mem
to do the mapping, using the flags you consider important
You will get a SAM file as an output.
- Convert it into BAM, sort it and index it, using
samtools
samtools
- Run the commands
view
,sort
andindex
- Generate some statistics using
flagstat
andidxstats
, saving the output files with extension.flagstat
and.idxstats
, respectively
Bowtie2
Bowtie is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It keeps its memory footprint small and supports gapped, local, and paired-end alignment modes.
-
Go to your
db
directory -
Create the index for our reference by using
bowtie2-build
-
Go to your
Mapped
directory -
Run
bowtie2
to map the reads -
Use
samtools
again tosort
andindex
your alignment -
Generate
flagstat
andidxstats
statistics, saving the output files with extension.flagstat
and.idxstats
, respectively
Q. What are your conclusions when comparing these 2 algorithms?
Now let's map the RNA data.
Tophat
Tophat is a fast splice junction mapper for RNA-Seq reads, that uses Bowtie as a read aligner, then analyzes the mapping results to identify splice junctions between exons.
Here we don't need to create the genome index, since it uses Bowtie2, and we have already created it.
- Run
tophat2
using the GTF file with known transcripts:genes.chr21.gtf
- Have a look at the newly created files.
Before calculating any statistics, there is an extra step when using picard
, and that is to add some metadata to the alignment (AddOrReplaceReadGroups
). Nowadays more tools are requiring this annotation to assure correct merging, processing and splitting of the data. So it is a good practice to have your data correctly annotated.
- Run
java -jar /apps/bio/software/picard/3.0.0/picard.jar AddOrReplaceReadGroups -h
, to display the list of arguments required - Add
-SO
and--CREATE_INDEX
to your command line
Now that you have an annotated BAM file that has been sorted and indexed, use CollectAlignmentSummaryMetrics
and BamIndexStats
to collect some metrics.
STAR
STAR is a tool that uses a novel strategy for spliced alignments that detects novel splice junctions, thus it makes it more accurate. It is faster than other algorithms but it requires more RAM.
- Go to your
db
directory - Use this command to create the index files
STAR --runMode genomeGenerate --genomeDir . --genomeFastaFiles chr21.fasta --sjdbGTFfile genes.chr21.gtf --sjdbOverhang 124
- Go to your
Mappers
directory - Run the mapping using the GTF file with annotations. Don't forget to unzip your
fastq
files.
Several files will be created, the alignment metrics are stored in Log.final.out
.
- Collect some metrics using
picard tools
.
Remember that you can use multiqc
if you want to compile all the results.
Q. What are your conclusions when comparing these 2 algorithms?
Q. Are the metrics from the samtools and picard similar?
Q. What are some differences between DNA and RNA mappers?
Analysis of next generation sequencing data (SC00024)
Home:Developed by Marcela Dávila, 2017. Updated by Marcela Dávila, 2023