Lab: Read mapping - statonlab/EPP_575_RNASeq_Workshop GitHub Wiki

Lab 3: Read Mapping

Trim Reads are Mapped onto the Reference Genome The adaptors are clipped and the low quality reads are filtered out before mapping them to genomic regions to identify aligned genes.

SOFTWARE

STAR - Spliced Transcripts Alignment to a Reference STAR2.5.3 manual

Mapping Reads with STAR

Aligning reads with STAR is a two-step process:

  • Create a Genome Index: Star creates an indexed genome using reference genome (Fasta Files) and annotations (GFF file). The indexes created are used in the second step of this process. For every new genome and annotation combination, new indexes needs to be creeated.
  • Aligning Reads: This step involves using genome indexes files in combination with RNA-seq sequences (FASTA or FASTQ). STAR will map reads to the genome and generate outputs with alignment file (SAM/BAM), mapping summary statistics, splice junctions, unmapped reads, signal tracks etc.

Running STAR

1. Make folder to store index files

spack load star
spack load /yk7bd6o
mkdir 3_Read_mapping
cd 3_Read_mapping

mkdir 2a_trimming
cd 2a_trimming
ln -s /pickett_shared/teaching/EPP575_Jan2022/final_outputs/2a_trimming/*.fastq ./
cd ..

mkdir Star_index
cd Star_index
ln -s /pickett_shared/teaching/EPP575_Jan2022/reference_genome/* ./
cd ..

2. Generation of Index Files: Input files (Reference Genome, FASTA) and (Annotation, GFF)

STAR \
--runMode genomeGenerate \
--genomeDir ./Star_index \
--genomeFastaFiles ./reference_genome_RNAseq/Athaliana_447_TAIR10.fa \
--runThreadN 1 \
--sjdbGTFfile ./reference_genome_RNAseq/Athaliana_447_Araport11.gene_exons.gff3 \
--sjdbGTFtagExonParentTranscript Parent \
--sjdbOverhang 149

--runMode: genomeGenerate mode directs STAR to run genome indices generation job.

--genomeDir: /path/to/store/genome_indices

--genomeFastaFiles: /path/to/FASTA_file. Multiple reference sequences (chromosomes) are allowed for each fasta file.

--runThreadN: number of threads (depends on available cores on the server node.)

--sjdbGTFfile: /path/to/GTF_file (path to annotation file) STAR can be run without annotations but is not recommended.

--sjdbGTFtagExonParentTranscript: For GFF3 formatted annotations you need to use --sjdbGTFtagExonParentTranscript Parent. In general, for --sjdbGTFfile files STAR only processes lines which have --sjdbGTFfeatureExon (=exon by default) in the 3rd field (column). The exons are assigned to the transcripts using parent-child relationship defined by the --sjdbGTFtagExonParentTranscript (=transcript id by default) GTF/GFF attribute.

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.

3. Mapping Reads to Reference Genome

STAR \
 --genomeDir ./Star_index \
 --readFilesIn ./2a_trimming/SRR17062759-trimmed-pair1.fastq ./2a_trimming/SRR17062759-trimmed-pair2.fastq \
 --runThreadN 1 \
 --outFileNamePrefix ./SRR17062759. \
 --limitBAMsortRAM 13842584230 \
 --outSAMtype BAM SortedByCoordinate

--genomeDir: /path/to/genome_indices_directory

--readFilesIn: /path/to/FASTQ_file (RNA-seq trimmed data). If the read files are compressed, use the --readFilesCommand UncompressionCommand option, where UncompressionCommand is the un-compression command that takes the file name as input parameter, and sends the uncompressed output to stdout. For example, for gzipped files (*.gz) use --readFilesCommand zcat OR --readFilesCommand gunzip -c. For bzip2-compressed files, use --readFilesCommand bunzip2 -c.

***Pair-end data will have two input sequence files and Single-end data will have one input file. (Both pari-end data file needs to be analyzed together).

--outFileNamePrefix: prefix for all output files. You will need to add a period at the end to separate your sample name from STAR's output file extensions.

--outSAMtype: output filetype (SAM default)

--runThreadN: number of threads / cores

Output Formats

The BAM alignment file is binary in nature and is not readable by humans. Convert the BAM file to SAM file (human readable form) by using SAMTOOLS:

samtools view -h SRR17062759.Aligned.sortedByCoord.out.bam > SRR17062759.Aligned.sortedByCoord.out.sam

Running STAR in a loop

Useful when multiple RNA-seq data needs to be run.

for R1 in /pickett_centaur/project/EPP575_testrun/analysis/2a_trimming/*pair1.fastq
do
        R2=`sed 's/pair1/pair2/' <(echo $R1)`
        BASE=$( basename $R1 | sed 's/-trimmed-pair1.fastq*//g')
        echo "R1 $R1"
        echo "R2 $R2"
        echo "BASE $BASE"

        /pickett_centaur/software/STAR/STAR \
                --genomeDir /pickett_centaur/project/EPP575_testrun/reference_genome \
                --readFilesIn $R1 $R2 \
                --runThreadN 2 \
                --outSAMtype BAM SortedByCoordinate \
                --limitBAMsortRAM 13905678955 \
                --outFileNamePrefix $BASE. &
done

The BAM output can be viewed graphically using IGV (Integrative Genomics Viewer)