Preparing to use STAR - ccsstudentmentors/tutorials GitHub Wiki

USING STAR [In progress]

Karam Alawa, Louis Cai, Matt Field

We learned everything we know from Matt Field, so if there are any errors, please send complaints to [email protected].

Disclaimer: we are not responsible if your computer overheats in this process.

STAR MANUAL

http://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARreleases/2.4.0k/doc/STARmanual.pdf

Introduction

STAR (Spliced Transcripts Alignment to a Reference) is one of the fastest RNA-seq aligners available. Before using STAR, make sure your RNA-Seq data (fastq.gz files) have been trimmed and quality checked. In this tutorial, we will demonstrate how to align paired-end data to the hg19 genome. We will also demonstrate how to prepare your data for use in differential expression analysis (edgeR, DESeq2).

Trimming is done with Trimmomatic. Once you've extracted the package, you can trim paired-end data with a command like the example below.

-Djava.io.tmpdir= sets a temporary directory for Trimmomatic to use
-jar specifies the location of the Trimmomatic jar file

will complete explanation later

module load java
unset _JAVA_OPTIONS
java -Djava.io.tmpdir=/scratch/projects/hlab/kalawa_tmp/./name_tmp -jar /nethome/kalawa/software/Trimmomatic-0.33/trimmomatic-0.33.jar \
PE -phred33 -trimlog /nethome/kalawa/files/TrimmomaticCrop/./name_log.txt ./name_R1_001.fastq.gz \
./name_R2_001.fastq.gz /nethome/kalawa/files/TrimmomaticAdapter/./name_forward_paired.fq.gz \
/nethome/kalawa/files/TrimmomaticAdapter/./name_forward_unpaired.fq.gz \
/nethome/kalawa/files/TrimmomaticAdapter/./name_reverse_paired.fq.gz \
/nethome/kalawa/files/TrimmomaticAdapter/./name_reverse_unpaired.fq.gz \
ILLUMINACLIP:/nethome/mfield/software/Trimmomatic-0.32/adapters/TruSeq2-PE.fa:2:30:10 LEADING:20 TRAILING:20  MINLEN:30
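If you have several samples, the command above can be wrapped in a small loop. The following is only a sketch: it assumes your files follow the name_R1_001.fastq.gz / name_R2_001.fastq.gz pattern used above, and the variables TRIMMOMATIC_JAR, ADAPTERS and OUT_DIR, the function trim_cmd and the default paths are hypothetical names chosen for illustration. It prints each command instead of running it, so you can inspect the file names first.

```shell
#!/bin/bash
# Hypothetical locations (assumptions); adjust to your own install and folders.
TRIMMOMATIC_JAR="${TRIMMOMATIC_JAR:-$HOME/software/Trimmomatic-0.33/trimmomatic-0.33.jar}"
ADAPTERS="${ADAPTERS:-$HOME/software/Trimmomatic-0.33/adapters/TruSeq2-PE.fa}"
OUT_DIR="${OUT_DIR:-$HOME/files/TrimmomaticAdapter}"

# Print (not run) the paired-end Trimmomatic command for one sample prefix.
trim_cmd() {
    local sample="$1"
    echo java -jar "$TRIMMOMATIC_JAR" PE -phred33 \
        "${sample}_R1_001.fastq.gz" "${sample}_R2_001.fastq.gz" \
        "$OUT_DIR/${sample}_forward_paired.fq.gz" \
        "$OUT_DIR/${sample}_forward_unpaired.fq.gz" \
        "$OUT_DIR/${sample}_reverse_paired.fq.gz" \
        "$OUT_DIR/${sample}_reverse_unpaired.fq.gz" \
        "ILLUMINACLIP:${ADAPTERS}:2:30:10" LEADING:20 TRAILING:20 MINLEN:30
}

# Loop over every forward-read file in the current directory.
for r1 in ./*_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue    # no matching files: nothing to do
    sample="$(basename "$r1" _R1_001.fastq.gz)"
    trim_cmd "$sample"
done
```

Once the printed commands look right, you could pipe the output to bash or replace the echo with the real call.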

Documentation, Link to original STAR paper

Installation

You have to install STAR before you use it.

Visit the STAR GitHub repository (https://github.com/alexdobin/STAR).
Navigate to bin, then to Linux_x86_64_static.

static executables are the easiest to use

Download the appropriate STAR executable: select STAR or STARlong, right-click View Raw, and select Copy link address. Then, in your terminal, navigate to the folder where you want STAR to live and download it.

wget --no-check-certificate -O STAR "https://github.com/alexdobin/STAR/blob/master/bin/Linux_x86_64_static/STAR?raw=true"
chmod u+x STAR

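STARlong is more efficient for reads longer than 200 bases but is incompatible with Cufflinks. The --no-check-certificate option tells wget to skip certificate validation so that the download from GitHub goes through, and chmod u+x makes the STAR file executable for your user.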
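A common pitfall when downloading from a GitHub page link is ending up with an HTML page instead of the binary. Here is a minimal sketch of a sanity check, assuming the downloaded file is called STAR and sits in the current directory; the function name is_elf is made up for this example. Linux executables are ELF files, whose first four bytes are 0x7f followed by the letters E, L, F.

```shell
#!/bin/bash
# Return 0 if bytes 2-4 of the file are "ELF" (as in a Linux executable), 1 otherwise.
is_elf() {
    [ "$(head -c 4 "$1" | tail -c 3)" = "ELF" ]
}

# Only mark the download as executable if it looks like a real binary.
if [ -e STAR ]; then
    if is_elf STAR; then
        chmod u+x STAR
    else
        echo "STAR does not look like a Linux binary; re-check the download URL" >&2
    fi
fi
```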

That's it!

Generating Genome Indices

Before we run STAR, we have to generate an index of the genome we are going to align to. You need the reference genome sequences (FASTA files) and an annotation file (.GTF). [MATT WRITE HERE!]

Basic Options for STAR: 
 --runThreadN NumberOfThreads
 --runMode genomeGenerate
 --genomeDir /path/to/genomeDir
 --genomeFastaFiles /path/to/genome/fasta1 \
/path/to/genome/fasta2 ...
 --sjdbGTFfile /path/to/annotations.gtf
 --sjdbOverhang ReadLength-1

--runThreadN sets the number of cores you will be using on Pegasus. --sjdbOverhang specifies the length of the sequence around each annotated junction; ideally this is your read length minus 1. In most generic cases, however, a value of 100 works fine.
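If you are not sure of your read length, you can estimate it from the data. The sketch below assumes a gzipped FASTQ file with four lines per read; the function names max_read_length and sjdb_overhang are invented for this example, and by default only the first 10000 reads are sampled.

```shell
#!/bin/bash
# Print the longest read length among the first N reads of a gzipped FASTQ.
max_read_length() {
    local fastq_gz="$1" n_reads="${2:-10000}"
    gzip -dc "$fastq_gz" | head -n $((n_reads * 4)) |
        awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
             END { print max + 0 }'
}

# Print a value for --sjdbOverhang: longest read length minus 1.
sjdb_overhang() {
    echo $(( $(max_read_length "$1" "${2:-10000}") - 1 ))
}

# Example call (file name is a placeholder):
# sjdb_overhang ./name_R1_001.fastq.gz
```

Running this on the untrimmed reads is probably the safer choice, since trimming shortens some reads.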

NOTE: Make sure that you select the queue "bigmem" for generating the genome index and for running STAR. STAR requires "big memory" (bigmem), and the queue is aptly named to explain what it is used for. So don't run this on another queue; it will fail. And we will laugh at you...

The script below shows how to generate the genome indices and serves as an example of how to set the queue properly. I stored the newly generated index in a folder called "genomeindices" and I keep my original FASTA/annotation files in "Genomefiles".

#!/bin/bash
#BSUB -J deplex
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q bigmem
#BSUB -W 48:00
#BSUB -n 16
#BSUB -R "span[ptile=8]"
#BSUB -B
#BSUB -u [email protected]
#BSUB -N
#BSUB -P hlab

/nethome/louiscai/Github/STAR/STAR --runThreadN 16 --runMode genomeGenerate \
--genomeDir /nethome/louiscai/Github/genomeindices --genomeFastaFiles \
/nethome/louiscai/Github/Genomefiles/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
--sjdbGTFfile /nethome/louiscai/Github/Genomefiles/gencode.v19.annotation.gtf \
--sjdbOverhang 100

STAR requires more RAM than the general queue provides. That's why we use bigmem on Pegasus.
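Before submitting the job, it may be worth checking that the chromosome names in your GTF match those in your FASTA, since STAR can only use annotated junctions on sequences it finds in the genome. Ensembl FASTA files typically name chromosomes "1", "2", ..., while GENCODE annotations use "chr1", "chr2", ..., so mixing the two sources (as the example paths above do) can be a problem. The sketch below is one way to check; the function name gtf_names_missing_from_fasta is made up for this example.

```shell
#!/bin/bash
# Print each sequence name used in the GTF (column 1) that has no matching
# ">name" header in the FASTA. No output means every GTF name was found.
gtf_names_missing_from_fasta() {
    local fasta="$1" gtf="$2"
    awk '
        # First file (FASTA): remember the header names, ignore sequence lines.
        FNR == NR { if (/^>/) fasta[substr($1, 2)] = 1; next }
        # Second file (GTF): skip comments and blank lines.
        /^#/ || NF == 0 { next }
        # Report each unknown name once, in order of first appearance.
        !($1 in fasta) && !($1 in reported) { reported[$1] = 1; print $1 }
    ' "$fasta" "$gtf"
}

# Example call (paths are placeholders):
# gtf_names_missing_from_fasta Genomefiles/genome.fa Genomefiles/annotation.gtf
```

If this prints names such as chr1, the simplest fix is usually to download the FASTA and GTF from the same source and release.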