Mapping with STAR - Linlab-slu/TSSr GitHub Wiki

mapping is an important part of the transcript start site (TSS) analysis. Only with correct mapping results can correct TSS locations be obtained.

The splice-aware aligners commonly used today are mainly hisat2 and STAR.

If you want try hisat2: Mapping reads to genome with hisat2

Mapping with STAR

(Based on STAR version 2.7.11b) You can find STAR manual here: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

Best practice

Build STAR index

STAR --runMode genomeGenerate \
--runThreadN 8 \
--sjdbGTFfile <.gtf file> \
--genomeDir <genome index output folder> \
--genomeFastaFiles <genome file>

# For example:
STAR --runMode genomeGenerate \
--runThreadN 8 \
--sjdbGTFfile Schizosaccharomyces_pombe.ASM294v2.31_updated_chr_names.gtf \
--genomeDir pombe_starIndex \
--genomeFastaFiles Schizosaccharomyces_pombe_all_chromosomes_updated_chr_names.fa

Mapping with STAR

STAR --runMode alignReads --runThreadN 24 \
--outSAMtype BAM SortedByCoordinate
--outSAMorder Paired --outFileNamePrefix <output prefix> \
--genomeDir <STAR index folder> \
--readFilesIn <CAGE data> \
--readFilesCommand zcat \
--alignIntronMax <max intron length> \
--alignEndsType Extend5pOfRead1

Note:
If you ran STAR with the default alignment parameters (without using --alignEndsType Extend5pOfRead1), please ensure that you set softclippingAllowed = TRUE when running TSSr.


explanation

  • --runMode: Mainly genomeGenerate and alignReads. genomeGenerate is for generate genome index file. alignReads is for mapping reads to genome.

  • --runThreadN: option defines the number of threads to be used for genome generation, it has to be set to the number of available cores on the server node. You can use lscpu to get info of how many threads are available on the server.

  • --genomeSAindexNbases: For small genomes, the parameter --genomeSAindexNbases must to be scaled down, with a typical value of min(14, log2(GenomeLength)/2 - 1). For example, for 1 megaBase genome, this is equal to 9, for 100 kiloBase genome, this is equal to 7.

  • --sjdbGTFfile: path to the GTF file with annotations

  • --genomeDir: specifies path to the directory (henceforth called "genome directory" where the genome indices are stored. This directory has to be created (with mkdir) before STAR run and needs to have writing permissions. It is recommended to remove all files from the genome directory before running the genome generation step. This directory path will have to be supplied at the mapping step to identify the reference genome.

  • --genomeFastaFiles: path(s) to the fasta file with the genome sequences. These files should be plain text FASTA files, they cannot be zipped.

  • --outSAMtype: default: SAM.

    1. 1st word:
    • BAM: output BAM without sorting
    • SAM: output SAM without sorting
    • None: no SAM/BAM output
    1. 2nd, 3rd:
    • Unsorted: standard unsorted
    • SortedByCoordinate: sorted by coordinate. This option will allocate extra memory for sorting which can be specified by --limitBAMsortRAM.
  • --outSAMorder: default: Paired. type of sorting for the SAM output. The optional options are:

    1. Paired: one mate after the other for all paired alignments.
    2. PairedKeepInputOrder: one mate after the other for all paired alignments, the order is kept the same as in the input FASTQ files.
  • --outFileNamePrefix: output files name prefix (including full or relative path). Can only be defined on the command line.

  • --genomeDir: path to the directory where genome files are stored (for --runMode alignReads) or will be generated (for --runMode generateGenome).

  • --readFilesIn: name(s) (with path) of the files containing the sequences to be mapped (e.g. RNA-seq FASTQ files). If using Illumina paired-end reads, the read1 and read2 files have to be supplied. STAR can process both FASTA and FASTQ files. Multi-line (i.e. sequence split in multiple lines) FASTA (but not FASTQ) files are supported. If the read files are compressed, use the --readFilesCommand UncompressionCommand option, where UncompressionCommand is the un-compression command that takes the file name as input parameter, and sends the uncompressed output to stdout. For example, for gzipped files (*.gz) use --readFilesCommand zcat OR --readFilesCommand gunzip -c. For bzip2-compressed files, use --readFilesCommand bunzip2 -c.

  • --readFilesCommand: Fastq files are typically compressed in .gz format. Therefore, you should use this parameter to specify gzip for decompression

  • --alignIntronMax: default value is 1000000, which is far from the real-world situation. Sometimes, when viewing a BAM file in IGV, you may notice incorrect clipping, where the 5' end is mapped to an unexpectedly distant position. This issue could be due to the absence of a maximum intron length limit. Adjusting this parameter can help mitigate the problem to some extent.

  • --alignEndsType: by default, is Local, you can choose from these four options:

    1. Local: standard local alignment with soft-clipping allowed.
    2. EndToEnd: force end-to-end read alignment, do not soft-clip
    3. Extend5pOfRead1: fully extend only the 5p of the read1, all other ends: local alignment
    4. Extend5pOfRead12: fully extend only the 5p of the both read1 and read2, all other ends: local alignment

    As you may know, precise 5' end mapping is critical to TSS analysis, so Extend5pOfRead1 option should permit us to get accurate TSS locations.

Some options/parameters may help:

  • --limitBAMsortRAM: by default the value is 0. But sometimes you will meet this kind of error:
EXITING because of fatal ERROR: not enough memory for BAM sorting:
SOLUTION: re-run STAR with at least --limitBAMsortRAM xxxxx(a large value, e.g. 1317991407)

So a good idea is set this value to a large value when you run STAR mapping.

  • --outFilterMatchNmin: default: 0 int: alignment will be output only if the number of matched bases is higher than or equal to this value.
  • --outFilterMatchNminOverLread: default: 0.66 real: sam as --outFilterMatchNmin, but normalized to the read length (sum of mates’ lengths for paired-end reads).
  • --genomeLoad: default: NoSharedMemory mode of shared memory usage for the genome files. Only used with --runMode alignReads. LoadAndKeep load genome into shared and keep it in memory after run LoadAndRemove load genome into shared but remove it after run LoadAndExit load genome into shared memory and exit, keeping the genome in memory for future runs Remove: do not map anything, just remove loaded genome from memory NoSharedMemory: do not use shared memory, each job will have its own private copy of the genome

2-pass Mapping

In the 2-pass mode, STAR performs the alignment in two steps:

First pass: STAR aligns the RNA-seq reads to the genome and generates alignment files and splice junctions.

Second pass: The genome indices are re-generated using the splice junctions obtained from the first pass. Then, STAR re-maps all the reads. The 2-pass mode is highly recommended if your goal is to robustly and accurately identify novel splice junctions for differential splicing analysis and variant discovery.

--twopassMode string: 2-pass mapping mode.

  • None: 1-pass mapping (default)
  • Basic: basic 2-pass mapping, with all 1st pass junctions inserted into the genome indices on the fly

--twopass1readsN int: number of reads to process for the 1st step. Use very large number (or default -1) to map all reads in the first step. (default: -1)

Tried but deprecated options

Although STAR can generate a bedGraph or wiggle file as output, we find it unsuitable for CAGE data. Because STAR only assign TSS with the first base location as TSS, while in most cases there are G-cap or multiple G there, which will generate incorrect TSS.

  • --outWigType: default: None. type of signal output, e.g. "bedGraph" OR "bedGraph read1 5p". Requires sorted BAM: --outSAMtype BAM SortedByCoordinate
    • 1st word:
      • None: no signal output
      • bedGraph: bedGraph format
      • wiggle: wiggle format
    • 2nd word:
      • read1_5p: signal from only 5’ of the 1st read, useful for CAGE/RAMPAGE etc
      • read2: signal from only 2nd read
  • --outWigStrand: string: strandedness of wiggle/bedGraph output
    • Stranded: separate strands, str1 and str2
    • Unstranded: collapsed strands
  • --outWigNorm: default: RPM. string: type of normalization for the signal
    • RPM: reads per million of mapped reads
    • None: no normalization, "raw" counts
⚠️ **GitHub.com Fallback** ⚠️