Running STAR - SIWLab/Lab_Info GitHub Wiki

STAR is a very useful splice-aware alignment tool for RNA sequence data. It takes in sequencing reads in fastq format, aligns them to a reference genome in a 2-step process, and outputs a .sam or .bam file for downstream uses (e.g. expression analyses, variant calling, etc). The guide below is not comprehensive, do read through their main github page and relevant parts of the documentation before running!

Installation

STAR is fairly easy to install in your own directory. See the STAR github.

Already installed in:

  • /ohta1/apps/STAR-master/bin/Linux_x86_64_static - version STAR_2.5.2b (not the latest)
  • /ohta1/bianca.sacchi/bin/STAR-2.7.10a/bin/Linux_x86_64_static/STAR - a more recent build

Pre-processing steps

If your sequence data is fresh off the press, double check whether adapter trimming and quality control has been performed. FastQC is a fast and easy way to check your fastqs - more details on preprocessing here.

Step 1: creating a genome index

This step only needs to be performed once per reference genome. It's a good idea to create a descriptive name for this directory to avoid mixing up different genome versions.

STAR --runThreadN 4 --runMode genomeGenerate \
--genomeDir /path/to/genome/directory \
--genomeFastaFiles /path/to/reference/fasta \

STAR only needs a fasta file for this step, however if you have a genome annotation in gtf format you can include it by adding --sjdbGTFfile /path/to/gtf/file. This is optional but can improve how splice junctions are handled. In current versions of STAR the gtf can be added later on in the mapping step instead.

Step 2: mapping reads to the reference genome

This step takes fastq files as input and outputs a .bam file. The example below each sample has a _R1.fastq.gz file and a _R2.fastq.gz file,the reads in R1 are pairs of the reads in R2. Double check whether you have paired- or single-end data before starting. The default output is .sam but that takes up a lot of space and there's not much reason to do that. If you will be doing anything downstream that requires a sorted bam you can do this now using --outSamtype BAM SortedByCoordinate. Generally variant callers (e.g. bcftools, gatk, freebayes) require a sorted bam file. Counting reads for differential expression analyses using featureCounts or HTseq is faster with an unsorted .bam. Sorting can also be done later using samtools or Picard. More on this later.

STAR --genomeDir ${genome} \
--readFilesIn /ohta1/felix.beaudry/rawSequence/RNAseq/josh/pop/$i\R1_clean.fastq.gz \
/ohta1/felix.beaudry/rawSequence/RNAseq/josh/pop/$i\R2_clean.fastq.gz \
--readFilesCommand zcat \
--twopassMode Basic \
--twopass1readsN -1 \
--outFileNamePrefix ../data/star_hap1/${i}_hap1 \
--outSAMtype BAM SortedByCoordinate \
--runThreadN 16

Step 3: getting to know your bam files

It is often very useful to know what your bam looks like and what each piece of information means. Since they are binary files they cannot simply be viewed using less or cat. A very easy way to do this is with samtools

samtools view -H yourfile.bam displays the .bam header. The header contains useful information such as the commands used to create the file, and whether it is sorted or not. The following command will quickly reveal whether the bam is sorted:

$ samtools view -H AnalysisReady_10.TM4_.bam |grep "@HD"
@HD	VN:1.6	SO:coordinate

The output indicates that the file is coordinate sorted. An unsorted .bam will display nothing here or unsorted.

To view the sequence information only, use samtools view -h yourfile.bam. To avoid clogging your screen with the lengthy output, pipe to less e.g. samtools view -h yourfile.bam | less.

If your bamfiles have long names this is not a bad time to rename them using mv or cp, e.g.

$ mv NS.1797.004.NEBNext_dual_i7_326---NEBNext_dual_i5_374.ScF2f5_Aligned.sortedByCoord.out.bam ScF2f5_Aligned.sortedByCoord.out.bam 

mv can also be used to move files to a different directory. Be careful when using this command because it does not back up the original file. Also be careful to never overwrite an existing file name. A safer way is to copy using cp. The version with the old name can be deleted later on.

$ cp NS.1797.004.NEBNext_dual_i7_326---NEBNext_dual_i5_374.ScF2f5_Aligned.sortedByCoord.out.bam ScF2f5_Aligned.sortedByCoord.out.bam 

If you have many files to rename with the same prefix it is possible to do this using a loop.

Step 4: preparing .bams for variant calling with freebayes

Post-processing depends on next steps.

bam cleanup for freebayes variant calling

Sorting

If the bam is not sorted this can be done using samtools, e.g. samtools sort file.bam -o file_sorted.bam Many files can be sorted in parallel, and a number of jobs and threads can be specified:

# one thread per .bam file in the current directory
ls *.bam | parallel 'samtools sort {} > {.}_sorted.bam'
# runs 8 jobs at a time
ls *.bam | parallel -j 8 'samtools sort {} > {.}_sorted.bam'

Adding readgroups

Many variant callers require each sample to have a unique read group name in the .bam header. Picard tools can be used (see gatk section below), however this can be slow and bamaddrg is quick and easy. ls *.bam | parallel 'bamaddrg -b {} > {.}_rg.bam'

indexing bams

bams also need to be indexed for variant calling. The following command will create a .bai file for every .bam. ls *.bam | parallel 'samtools index {}'

Alternate Step: preparing .bams for variant calling with GATK

bam cleanup for GATK variant calling using Picard tools

These steps will be similar to the gatk dna variant calling steps with a few differences. GATK best practices provide a rough guideline, but the MarkDuplicates step can be ignored

### picard tools 
### gatk version 4
### add read groups

java -Djava.io.tmpdir=~/tmp -jar picard.jar AddOrReplaceReadGroups \
       INPUT=${prefix}/${sample}.sorted.bam \
       OUTPUT=${prefix}/${sample}.sorted.rg.bam \
       RGID=${sample} \
       RGLB=lib1 \
       RGPL=Illumina \
       RGPU=unit1 \
       RGSM=${sample} \
       >> ~/RG.out 2>> ~/RG.err

### Index # this creates a .bai index file
java -jar picard.jar BuildBamIndex \
      I=input.bam
### SplitNCigar reads - very important for splitting reads that span splice sites when variant calling using gatk
gatk SplitNCigarReads \
    -R Homo_sapiens_assembly38.fasta \
    -I input.bam \
    -O output.bam