Mapping with HISAT2 - Linlab-slu/TSSr GitHub Wiki

mapping is an important part of the transcript start site (TSS) analysis. Only with correct mapping results can correct TSS locations be obtained.

The splice-aware aligners commonly used today are mainly hisat2 and STAR.

Please refer: Mapping reads to genome with STAR

Mapping with hisat2

(Based on HISAT2 version 2.2.1)

hisat2 is an easy-to-use, extremely fast and accurate aligner. There are two main steps to using hisat2: building an index and mapping

Build hisat2 index

1. If you only have the reference genome in .fasta file, you can build index easily:

hisat2-build -p 8 <ref.fasta> <ht2_index_base>

# for example:
hisat2-build -p 8 S288C_reference_sequence_R64-3-1_20210421.fsa S288C

Here are some explain of the options/parameters

  • -p: number of threads
  • <ref.fasta>: ref sequences
  • <prefix of index>: write ht2 data to files with this dir/basename

2. If you have a genome structure annotation in .gtf format, you can build the index with it:

# extract exon information
hisat2_extract_exons.py <gtf file> > exon.txt
hisat2_extract_exons.py saccharomyces_cerevisiae_R64-3-1_20210421.gtf > exon.txt

# extract splice sites information
hisat2_extract_splice_sites.py <gtf file> > splice.txt
hisat2_extract_splice_sites.py saccharomyces_cerevisiae_R64-3-1_20210421.gtf > splice.txt

# Build index with exon and splice site information
hisat2-build -p 8 --ss splice.txt --exon exon.txt <ref.fasta> <ht2_index_base>
hisat2-build -p 8 --ss splice.txt --exon exon.txt S288C_reference_sequence_R64-3-1_20210421.fsa S288C

What is the difference and what should I choose?

If you only have genome structure annotation in .gff format, you can convert it to .gtf format with AGAT

# if you haven't install AGAT, you can install it with conda
conda install -c bioconda agat

agat_convert_sp_gff2gtf.pl --gff <input gff> -o <output gtf file>

# for example:
agat_convert_sp_gff2gtf.pl --gff saccharomyces_cerevisiae_R64-3-1_20210421.gff -o saccharomyces_cerevisiae_R64-3-1_20210421.gtf

run hisat2 mapping

# Single-end data with no-softclip

hisat2 -x <ht2_index_base> -U <fastq.gz file> -p 8 \
--summary-file <log file> --new-summary --omit-sec-seq \
--no-softclip | samtools sort -@ 4 -o <output>.bam -
  • -p: number of threads.
  • --summary-file & --new-summary: Set up the log file and output it as machine-friendly.
  • --omit-sec-seq: put '*' in SEQ and QUAL fields for secondary alignments. Which can reduce the size of the final file. Note: in current version of TSSr (v0.99.1), this option will return an error like:
Error in .new_IRanges_from_start_width(start, width) :
  'start' or 'width' cannot contain NAs
Calls: getTSS ... IRanges -> .new_IRanges -> .new_IRanges_from_start_width
In addition: There were 26 warnings (use warnings() to see them)
Execution halted

So, you can remove this option when you run hisat2 mapping, or manually remove flag value larger than 256 by samtools view -F 256.

  • --no-softclip: the --no-softclip option disables soft clipping. Softclip is a process where the aligner trims parts of the read that don't match well to the reference genome. By using this option, you force the aligner to include the entire read in the alignment, even if some sections don't align perfectly. This can result in more mismatches at the read ends but keeps the full read information intact.

  • |: The pipe operator | in Unix/Linux is used to connect the output of one command directly as the input to another. It allows you to chain commands together, enabling efficient data processing without creating intermediate files. By default, hisat2 outputs a .sam file, but this file type is disk-occupied. When we pass hisat2 output directly to the samtools sort via the pipe operator, we can directly output a .bam format, which is much smaller than the .sam format.

  • samtools sort: generate a sorted .bam` file.

  • -@: Number of threads to use

  • -o: Output file in .bam format.

some options/parameters maybe help:

  • --max-intronlen: maximum intron length (500000). The default maximum intron length of hisat2 is very large and in some cases will cause a small portion of the beginning sequence to align to a particular location. Since we rely on the exact starting position of the alignment to infer the location of the TSS, Therefore, proper limit of maximum intron length can solve this problem to some extent
  • -t: print wall-clock time taken by search phases.
  • --sensitive and --very-sensitive:

more options and parameters can be found here: http://daehwankimlab.github.io/hisat2/manual/

⚠️ **GitHub.com Fallback** ⚠️