NGS II: Mapping - bcfgothenburg/HT24 GitHub Wiki

Course: HT24 Analysis of next generation sequencing data (SC00204)

The purpose of this exercise is to introduce you to some common aligners and understand that you should be familiar with the data to be analyzed in order to pick the right tool.

The Data

For these exercises, we will be using data from samples AS_2 (exome) from this paper and Ctrl1 (RNAseq) that target chr21 from ENA (ERR1523942)

The server

Connect to the server using MobaXterm (PC users) or your local Terminal (MACS users):

ssh -Y your_account@remote_server

Load the following modules, using module load:

bwa-mem2
samtools/1.17
bowtie2/2.5.1
tophat/2.1.1
star/2.7.10
picard/3.0.0
multiqc/1.14

Aligners comparison

Remember that depending on the type of data, there will be a better suited mapper. For instance, for DNA sequencing, bowtie2 and bwa are two of the most used short read aligners. While bowtie2 is faster, bwa is more sensitive.

Copy chr21_dna_R1.fastq.gz, chr21_dna_R2.fastq.gz, chr21_rna_R1.fastq.gz, chr21_rna_R2.fastq.gz to your Fastq directoy, from /home/courses/NGS/Mappers.
Create a directory called Mappers and save all the results of this practical there.

We could use the human genome as our reference, but since we are only working with chr21, let's just use this chromosome. This way we can practice generating the files we need for the different tools.

In your /home you created a directory called db (if you don't have it, create it).
Copy the nucleotide sequence chr21.fasta and its annotation genes.chr21.gtf, from /home/courses/NGS/Mappers/db.

Note: Add the time command at the beginning of your comands, to track the time each process takes.

Let's start by looking at the DNA data.

BWA

BWA is a software package for mapping low-divergent sequences against a large reference genome. BWA-MEM, one of the three algorithms, is generally recommended for its higher speed and accuracy.

The first step is to create the index for our reference. This has to be done only once for each new reference genome and sometimes it is just a matter of downloading it from iGenomes, for example.

Go to your db directory
Run bwa-mem2 index to create the index of our reference:chr21.fasta

BWA does not work with zipped files (unfortunately).

Go to your Fastq directory
Unzip your dna fastq files, using gunzip
Go to your Mappers directory
Use bwa-mem2 mem to do the mapping, using the flags you consider important

You will get a SAM file as an output.

Convert it into BAM, sort it and index it, using samtools

samtools

Run the commands view, sort and index
Generate some statistics using flagstat and idxstats, saving the output files with extension .flagstat and .idxstats, respectively

Bowtie2

Bowtie is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It keeps its memory footprint small and supports gapped, local, and paired-end alignment modes.

Go to your db directory
Create the index for our reference by using bowtie2-build
Go to your Mapped directory
Run bowtie2 to map the reads
Use samtools again to sort and index your alignment
Generate flagstat and idxstats statistics, saving the output files with extension .flagstat and .idxstats, respectively

Q. What are your conclusions when comparing these 2 algorithms?

Now let's map the RNA data.

Tophat

Tophat is a fast splice junction mapper for RNA-Seq reads, that uses Bowtie as a read aligner, then analyzes the mapping results to identify splice junctions between exons.

Here we don't need to create the genome index, since it uses Bowtie2, and we have already created it.

Run tophat2 using the GTF file with known transcripts: genes.chr21.gtf
Have a look at the newly created files.

Before calculating any statistics, there is an extra step when using picard, and that is to add some metadata to the alignment (AddOrReplaceReadGroups). Nowadays more tools are requiring this annotation to assure correct merging, processing and splitting of the data. So it is a good practice to have your data correctly annotated.

Run java -jar /apps/bio/software/picard/3.0.0/picard.jar AddOrReplaceReadGroups -h, to display the list of arguments required
Add -SO and --CREATE_INDEX to your command line

Now that you have an annotated BAM file that has been sorted and indexed, use CollectAlignmentSummaryMetrics and BamIndexStats to collect some metrics.

STAR

STAR is a tool that uses a novel strategy for spliced alignments that detects novel splice junctions, thus it makes it more accurate. It is faster than other algorithms but it requires more RAM.

Go to your db directory
Use this command to create the index files

STAR --runMode genomeGenerate --genomeDir . --genomeFastaFiles chr21.fasta --sjdbGTFfile genes.chr21.gtf --sjdbOverhang 124

Go to your Mappers directory
Run the mapping using the GTF file with annotations. Don't forget to unzip your fastq files.

Several files will be created, the alignment metrics are stored in Log.final.out.

Collect some metrics using picard tools.

Remember that you can use multiqc if you want to compile all the results.

Q. What are your conclusions when comparing these 2 algorithms?

Q. Are the metrics from the samtools and picard similar?

Q. What are some differences between DNA and RNA mappers?

Home: Analysis of next generation sequencing data (SC00024)

Developed by Marcela Dávila, 2017. Updated by Marcela Dávila, 2023