Running STAR - ccsstudentmentors/tutorials GitHub Wiki

Next run STAR with the following options:

Running your first job.

Input Files.

Single end, Paired-End (already trimmed, quality checked)

Navigate to the folder with your files.
Make a list of your samples using the code

ls *gz > samples.txt

Using FileZilla, transfer samples.txt to your local machine and open it with a text editor (such as TextWrangler). It may look like this:

MM0135_GTGAAA_L007_forward_paired.fq.gz
MM0135_GTGAAA_L007_forward_unpaired.fq.gz
MM0135_GTGAAA_L007_reverse_paired.fq.gz
MM0135_GTGAAA_L007_reverse_unpaired.fq.gz
MM0137_CTTGTA_L007_forward_paired.fq.gz
MM0137_CTTGTA_L007_forward_unpaired.fq.gz
MM0137_CTTGTA_L007_reverse_paired.fq.gz
MM0137_CTTGTA_L007_reverse_unpaired.fq.gz
MM0141_CGATGT_L008_forward_paired.fq.gz
MM0141_CGATGT_L008_forward_unpaired.fq.gz
MM0141_CGATGT_L008_reverse_paired.fq.gz
MM0141_CGATGT_L008_reverse_unpaired.fq.gz
MM0144_ATGTCA_L007_forward_paired.fq.gz
MM0144_ATGTCA_L007_forward_unpaired.fq.gz
MM0144_ATGTCA_L007_reverse_paired.fq.gz
MM0144_ATGTCA_L007_reverse_unpaired.fq.gz
MM0171_AGTCAA_L008_forward_paired.fq.gz
MM0171_AGTCAA_L008_forward_unpaired.fq.gz
MM0171_AGTCAA_L008_reverse_paired.fq.gz
MM0171_AGTCAA_L008_reverse_unpaired.fq.gz
MM0173_TGACCA_L008_forward_paired.fq.gz
MM0173_TGACCA_L008_forward_unpaired.fq.gz
MM0173_TGACCA_L008_reverse_paired.fq.gz
MM0173_TGACCA_L008_reverse_unpaired.fq.gz
MM0175_CAGATC_L008_forward_paired.fq.gz
MM0175_CAGATC_L008_forward_unpaired.fq.gz
MM0175_CAGATC_L008_reverse_paired.fq.gz
MM0175_CAGATC_L008_reverse_unpaired.fq.gz
MM0179_CCGTCC_L008_forward_paired.fq.gz
MM0179_CCGTCC_L008_forward_unpaired.fq.gz
MM0179_CCGTCC_L008_reverse_paired.fq.gz
MM0179_CCGTCC_L008_reverse_unpaired.fq.gz
MM091_ACAGTG_L007_forward_paired.fq.gz
MM091_ACAGTG_L007_forward_unpaired.fq.gz
MM091_ACAGTG_L007_reverse_paired.fq.gz
MM091_ACAGTG_L007_reverse_unpaired.fq.gz
MM094_GCCAAT_L007_forward_paired.fq.gz
MM094_GCCAAT_L007_forward_unpaired.fq.gz
MM094_GCCAAT_L007_reverse_paired.fq.gz
MM094_GCCAAT_L007_reverse_unpaired.fq.gz
samples.txt

Now clean up the file list to generate a list of sample names only. You can do this by finding (Command-F) and replacing "_forward_paired.fq.gz", "_reverse_paired.fq.gz", and the other suffixes with nothing (leave the replacement field empty). Also remove the duplicates, either manually or with TextWrangler's "Text -> Process Duplicates" function. Your final sample list should look like this:

MM0135_GTGAAA_L007
MM0137_CTTGTA_L007
MM0141_CGATGT_L008
MM0144_ATGTCA_L007
MM0171_AGTCAA_L008
MM0173_TGACCA_L008
MM0175_CAGATC_L008
MM0179_CCGTCC_L008
MM091_ACAGTG_L007
MM094_GCCAAT_L007
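
The same clean-up can also be done on the cluster, without a text editor. The following is a minimal sketch, assuming every read file name ends in _forward or _reverse followed by _paired.fq.gz or _unpaired.fq.gz, as in the listing above. The function name make_sample_list is hypothetical.

```shell
#!/bin/bash
# Turn a list of read files into a list of unique sample names.
# Assumption: file names follow <sample>_(forward|reverse)_(paired|unpaired).fq.gz
make_sample_list() {
    local file_list="$1"
    # -n with the trailing "p" prints only lines where the suffix was removed,
    # so stray entries that are not read files are dropped.
    sed -E -n 's/_(forward|reverse)_(paired|unpaired)\.fq\.gz$//p' "$file_list" \
        | sort -u    # each sample appears four times in the input; keep it once
}
```

For example, `make_sample_list samples.txt > samples2.txt` writes the cleaned list to samples2.txt, the file name read by the loop later on this page.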

### Standard Options

A general strategy is to always test your options on one set of files first. This will ensure everything is installed correctly. Then you can write a bash script to automate this for all of your samples.

Here's an example of STAR being run on one sample.

#!/bin/bash
#BSUB -J deplex
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -q bigmem
#BSUB -W 48:00
#BSUB -n 16
#BSUB -R "span[ptile=8]"
#BSUB -B
#BSUB -u [email protected]
#BSUB -N
#BSUB -P hlab



/nethome/louiscai/Github/STAR/STAR --runThreadN 16 --genomeDir /nethome/louiscai/Github/genomeindices \
--readFilesIn /scratch/projects/hlab/louis/polyA_rnaseq_UM/MM100T.0708_2_1_forward_paired.fq.gz \
/scratch/projects/hlab/louis/polyA_rnaseq_UM/MM100T.0708_2_1_reverse_paired.fq.gz \
--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --chimSegmentMin 12 \
--quantMode TranscriptomeSAM GeneCounts --alignIntronMax 200000 --alignMatesGapMax 200000 \
--alignSJDBoverhangMin 10 --chimJunctionOverhangMin 12 --twopassMode Basic --twopass1readsN -1 \
--outSAMstrandField intronMotif --outFileNamePrefix ./MM100T.0708_

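--runThreadN sets the number of threads; it should match the number of cores requested with #BSUB -n.
--genomeDir specifies the location of the genome index.
--readFilesIn specifies the two input files (read 1 and read 2 of the pair).
--readFilesCommand zcat tells STAR to decompress the gzipped (.gz) input files.
--outSAMtype lets you choose between SAM and BAM output; BAM SortedByCoordinate writes a coordinate-sorted BAM file.
--quantMode TranscriptomeSAM GeneCounts additionally writes alignments in transcript coordinates and read counts per gene.
--alignIntronMax, --alignMatesGapMax, and --alignSJDBoverhangMin are standard ENCODE options.
--chimSegmentMin and --chimJunctionOverhangMin are fusion (chimeric read) options.
--outSAMstrandField intronMotif allows for compatibility with Cufflinks.
--outFileNamePrefix sets the prefix used to name the output files.
--twopassMode Basic and --twopass1readsN -1 allow each sample to be independently mapped twice, using all reads in the first pass.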

To make your script work for all samples, you can write a bash script that writes a unique script for each of your samples.

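Make sure your samples file (samples2.txt in this example) is formatted correctly, with one sample name per line.
Add the following to the top of your script.

for samp in `cat samples2.txt`
    do

Add "echo" in front of each line.
Add single quotes (') around each command.
Add >> starcomplete/${samp}.sh to the end of each line, so that the line is appended to that sample's script.
Replace the sample name, such as MM100T.0708, with '${samp}'. The quotes close and reopen the quoted text so that the shell fills in the sample name.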

Your script should now look like the following.

mkdir -p starcomplete    # folder for the generated scripts and the STAR output
for samp in `cat samples2.txt`
do
    echo ${samp}
    # '>' starts a fresh script for this sample; the remaining lines append with '>>'
    echo '#!/bin/bash' > starcomplete/${samp}.sh
    echo '#BSUB -J '${samp}'' >> starcomplete/${samp}.sh
    echo '#BSUB -e run.err' >> starcomplete/${samp}.sh
    echo '#BSUB -o run.out' >> starcomplete/${samp}.sh
    echo '#BSUB -W 48:00' >> starcomplete/${samp}.sh           # <- that is your wall time, you probably do not need to specify that
    echo '#BSUB -n 4' >> starcomplete/${samp}.sh
    echo '#BSUB -R "span[ptile=4]"' >> starcomplete/${samp}.sh
    echo '#BSUB -u [email protected]' >> starcomplete/${samp}.sh
    echo '#BSUB -q bigmem' >> starcomplete/${samp}.sh

    # --runThreadN is set to 4 to match the 4 cores requested with "#BSUB -n 4"
    echo '/nethome/louiscai/Github/STAR/STAR --runThreadN 4 --genomeDir /nethome/louiscai/Github/genomeindices \
--readFilesIn /scratch/projects/hlab/louis/polyA_rnaseq_UM/'${samp}'_2_1_forward_paired.fq.gz \
/scratch/projects/hlab/louis/polyA_rnaseq_UM/'${samp}'_2_1_reverse_paired.fq.gz \
--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --chimSegmentMin 12 \
--quantMode TranscriptomeSAM GeneCounts --alignIntronMax 200000 --alignMatesGapMax 200000 \
--alignSJDBoverhangMin 10 --chimJunctionOverhangMin 12 --twopassMode Basic --twopass1readsN -1 \
--outSAMstrandField intronMotif --outFileNamePrefix ./starcomplete/'${samp}'_' >> starcomplete/${samp}.sh

    # submit the script that was just written
    bsub < /scratch/projects/hlab/louis/polyA_rnaseq_UM/starcomplete/${samp}.sh
done
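
The echo-per-line approach can also be written with a here-document, which keeps the job script readable as one block. This is only a sketch under stated assumptions: write_job, STAR_BIN, GENOME_DIR and READ_DIR are hypothetical names, their defaults are simply the paths from the example above, and the #BSUB header is trimmed to a few of the lines shown there.

```shell
#!/bin/bash
# Hypothetical settings; the defaults are the paths used in the example above.
STAR_BIN="${STAR_BIN:-/nethome/louiscai/Github/STAR/STAR}"
GENOME_DIR="${GENOME_DIR:-/nethome/louiscai/Github/genomeindices}"
READ_DIR="${READ_DIR:-/scratch/projects/hlab/louis/polyA_rnaseq_UM}"

# Usage: write_job <sample name> <output folder>
# Writes <output folder>/<sample name>.sh, replacing any earlier copy.
write_job() {
    local samp="$1" out_dir="$2"
    mkdir -p "$out_dir"
    # EOF is unquoted, so ${...} is filled in now, while the file is written.
    # "\\" becomes a single "\" (line continuation) in the generated script.
    cat > "${out_dir}/${samp}.sh" <<EOF
#!/bin/bash
#BSUB -J ${samp}
#BSUB -e run.err
#BSUB -o run.out
#BSUB -n 4
#BSUB -R "span[ptile=4]"
#BSUB -q bigmem

${STAR_BIN} --runThreadN 4 --genomeDir ${GENOME_DIR} \\
--readFilesIn ${READ_DIR}/${samp}_2_1_forward_paired.fq.gz \\
${READ_DIR}/${samp}_2_1_reverse_paired.fq.gz \\
--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --chimSegmentMin 12 \\
--quantMode TranscriptomeSAM GeneCounts --alignIntronMax 200000 --alignMatesGapMax 200000 \\
--alignSJDBoverhangMin 10 --chimJunctionOverhangMin 12 --twopassMode Basic --twopass1readsN -1 \\
--outSAMstrandField intronMotif --outFileNamePrefix ${out_dir}/${samp}_
EOF
}

# Possible use inside the loop (submission left commented out):
#   write_job "${samp}" starcomplete
#   bsub < "starcomplete/${samp}.sh"
```

Because the file is written with a single `>` redirect, running the generator twice does not append a second copy of the commands to an existing script.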

Here are examples of output files.

MM10T.0708_Aligned.sortedByCoord.out.bam
MM10T.0708_Aligned.toTranscriptome.out.bam
MM10T.0708_Chimeric.out.junction
MM10T.0708_Chimeric.out.sam
MM10T.0708_Log.final.out
MM10T.0708_Log.out
MM10T.0708_Log.progress.out
MM10T.0708_ReadsPerGene.out.tab
MM10T.0708_SJ.out.tab
MM10T.0708__STARgenome
MM10T.0708__STARpass1
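
Of these files, Log.final.out holds the mapping summary for the run, so it is a quick way to check each sample before moving on. The sketch below assumes the log contains a line labelled "Uniquely mapped reads %" with the label and value separated by "|", which is how STAR normally writes it; mapping_summary is a hypothetical helper name.

```shell
#!/bin/bash
# Print "<log file><TAB><uniquely mapped %>" for each Log.final.out given.
# Assumption: lines in the log look like "<label> |<TAB><value>".
mapping_summary() {
    local log
    for log in "$@"; do
        awk -F'|' -v name="$log" '
            /Uniquely mapped reads %/ {
                gsub(/[ \t]/, "", $2)      # strip padding around the value
                print name "\t" $2
            }' "$log"
    done
}
```

For example, `mapping_summary starcomplete/*_Log.final.out` lists the rate for every sample in the output folder.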

There are two BAM files. The sortedByCoord.out.bam file can be used in downstream applications such as Cufflinks and HTSeq.

The Aligned.toTranscriptome.out.bam file (alignments translated to transcript coordinates) and the ReadsPerGene.out.tab file are products of using --quantMode. You can use the ReadsPerGene.out.tab file directly for edgeR and DESeq2. Note that ReadsPerGene.out.tab has three count columns after the gene ID: column 2 holds the counts for unstranded data, and columns 3 and 4 hold the counts for the two possible strand orientations. If your data is stranded, use column 3 or column 4: if column 3 is the sense data, then column 4 is the antisense data (and vice versa).
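
To hand the counts to edgeR or DESeq2, you usually need just the gene ID and one count column. The following sketch assumes the ReadsPerGene.out.tab layout described in the STAR manual: four summary rows at the top (N_unmapped, N_multimapping, N_noFeature, N_ambiguous), then one row per gene with the gene ID in column 1 and counts in columns 2 to 4. The function name extract_counts is hypothetical.

```shell
#!/bin/bash
# Usage: extract_counts <ReadsPerGene.out.tab> [column]
# column: 2 = unstranded (default), 3 or 4 = one of the two strand orientations
extract_counts() {
    local tab="$1" col="${2:-2}"
    # NR > 4 skips the four N_* summary rows at the top of the file
    awk -F'\t' -v c="$col" 'NR > 4 { print $1 "\t" $c }' "$tab"
}
```

For example, `extract_counts MM10T.0708_ReadsPerGene.out.tab 4 > MM10T.0708_counts.txt` keeps the gene ID and column 4. Which of columns 3 and 4 is the sense column depends on the library preparation.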

Additional Options

Quant Mode and preparing for EdgeR and DeSeq2
Star-Fusion