Lab 03: Mapping with STAR - ryandkuster/EPP_575_RNA_25 GitHub Wiki

Lab 3: Read Mapping

Trimmed Reads are Mapped onto the Reference Genome

SOFTWARE

STAR - Spliced Transcripts Alignment Publication

STAR Manual 2.7.0a

Mapping Reads with STAR

Aligning reads with STAR is a two-step process:

  • Create a Genome Index: Star creates an indexed genome using reference genome (Fasta Files) and annotations (GFF file). The indexes created are used in the second step of this process. For every new genome and annotation combination, new indexes needs to be creeated.
  • Aligning Reads: This step involves using genome indexes files in combination with RNA-seq sequences (FASTA or FASTQ). STAR will map reads to the genome and generate outputs with alignment file (SAM/BAM), mapping summary statistics, splice junctions, unmapped reads, signal tracks etc.

Running STAR

1. Make folder to store index files

cd $RNA
mkdir 03_mapping_star
cd 03_mapping_star
ln -s /lustre/isaac24/proj/UTK0386/data/raw/*fastq.gz .

mkdir Star_index
cd Star_index
ln -s /lustre/isaac24/proj/UTK0386/data/reference/* .
cd ..

2. Generation of Index Files: Input files (Reference Genome, FASTA) and (Annotation, GFF)

Load STAR module:

module load star/2.7.6a

(~3 minutes to run)

STAR \
--runMode genomeGenerate \
--genomeDir ./Star_index \
--genomeFastaFiles ./Star_index/GCA_000001735.2_TAIR10.1_genomic.fna \
--runThreadN 8 \
--genomeSAindexNbases 12 \
--sjdbGTFfile ./Star_index/genomic_modified.gff \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 149

--runMode: genomeGenerate mode directs STAR to run genome indices generation job.

--genomeDir: /path/to/store/genome_indices

--genomeFastaFiles: /path/to/FASTA_file. Multiple reference sequences (chromosomes) are allowed for each fasta file.

--runThreadN: number of threads (depends on available cores on the server node.)

--genomeSAindexNbases: "length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter –genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1)." - STAR manual

--sjdbGTFfile: /path/to/GTF_file (path to annotation file) STAR can be run without annotations but is not recommended.

In general, for --sjdbGTFfile files STAR only processes lines which have --sjdbGTFfeatureExon (=exon by default) in the 3rd field (column). The exons are assigned to the transcripts using parent-child relationship defined by the --sjdbGTFtagExonParentTranscript (=transcript id by default) GTF/GFF attribute.

--sjdbGTFtagExonParentTranscript: For GFF3 formatted annotations you need to use "Parent".

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.

3. Mapping Reads to Reference Genome

(~15 minutes to run)

STAR \
 --genomeDir ./Star_index \
 --readFilesIn ./Col_0h_rep1_1.fastq.gz ./Col_0h_rep1_2.fastq.gz \
 --runThreadN 8 \
 --readFilesCommand zcat \
 --outFileNamePrefix ./Col_0h_rep1. \
 --outSAMtype BAM SortedByCoordinate

--genomeDir: /path/to/genome_indices_directory

--readFilesIn: /path/to/FASTQ_file (RNA-seq data). If the read files are compressed, use the --readFilesCommand UncompressionCommand option, where UncompressionCommand is the un-compression command that takes the file name as input parameter, and sends the uncompressed output to stdout. For example, for gzipped files (*.gz) use --readFilesCommand zcat OR --readFilesCommand gunzip -c. For bzip2-compressed files, use --readFilesCommand bunzip2 -c.

***Pair-end data will have two input sequence files and Single-end data will have one input file. (Both paired-end data files need to be analyzed together).

--outFileNamePrefix: prefix for all output files. You will need to add a period at the end to separate your sample name from STAR's output file extensions.

--outSAMtype: output filetype (SAM default)

--runThreadN: number of threads / cores

Output Formats

The BAM alignment file is binary in nature and is not readable by humans. Convert the BAM file to SAM file (human readable form) by using samtools.

You don't have to run this, sam files are cumbersome (e.g., 19G vs 2.9G!) and redundant if you have the bams, but bams are in binary so you can't see their contents in human readable form. The /lustre/isaac24/proj/UTK0386/completed/03_mapping_star directory has this example done!

#module load samtools/1.16.1-gcc
#samtools view -h Col_0h_rep1.Aligned.sortedByCoord.out.bam > Col_0h_rep1.Aligned.sortedByCoord.out.sam

Running STAR in a loop

Useful when multiple RNA-seq data needs to be run.

for R1 in *_1.fastq.gz; do
  R2=$( sed 's/_1/_2/' <(echo $R1) )
  BASE=$( basename $R1 | sed 's/_1.fastq.gz//g')

  STAR \
    --genomeDir ./Star_index \
    --readFilesIn $R1 $R2 \
    --runThreadN 8 \
    --readFilesCommand zcat \
    --outFileNamePrefix ${BASE}. \
    --outSAMtype BAM SortedByCoordinate
done

Scaling up to sbatch scripting (optional)

So far, we've been relying on an interactive version of the server using salloc. This is a great way to interact with smaller datasets and get used to working with HPC environments. Ideally, larger jobs will be run using what is known as an sbatch script. For example, if you create a file called 03_run_star_map_loop.sbatch with the following content:

#!/bin/bash
#SBATCH -J starloop
#SBATCH --account isaac-UTK0386
#SBATCH --partition=campus
#SBATCH --qos=campus
#SBATCH --time 0-05:00:00
#SBATCH --nodes=1
#SBATCH --mem=10G
#SBATCH --cpus-per-task=20

module load star/2.7.6a

for R1 in *_1.fastq.gz; do
  R2=$( sed 's/_1/_2/' <(echo $R1) )
  BASE=$( basename $R1 | sed 's/_1.fastq.gz//g')

  STAR \
    --genomeDir ./Star_index \
    --readFilesIn $R1 $R2 \
    --runThreadN 16 \
    --readFilesCommand zcat \
    --outFileNamePrefix ${BASE}. \
    --outSAMtype BAM SortedByCoordinate \
    --outTmpDir ./tmp_${BASE}/
done

⭐ Thanks Marcin for suggesting the --outTmpDir workaround!

To request for this script to run, you invoke the script as follows:

sbatch 03_run_star_map_loop.sbatch

...and you can check the progress of the submission using:

squeue --me

To find out more about sbatch on ISAAC-NG, see the OIT help pages on running jobs, particularly the SBATCH Flags and Submitting Jobs with Slurm topics.

In the next lab, the BAM output can be viewed graphically using IGV (Integrative Genomics Viewer)