1.4 SpliceScape: Read Mapping with STAR - labbces/SpliceScape GitHub Wiki

In this stage, the cleaned reads are aligned to the reference genome. The MAPPING_STAR process uses the STAR aligner to determine the genomic origin of each read. Because STAR is splice-aware, it can align reads that span exon-exon boundaries (splice junctions), which is essential for detecting splicing events.

After STAR completes the alignment, the process uses samtools to sort and index the resulting alignment file, preparing it for the final splicing analysis step.

Process 5: `MAPPING_STAR`

Inputs and Outputs:

Type	Description
Input	A tuple containing: 1. Cleaned forward FASTQ file (.trimmed.R1.fastq.gz). 2. Cleaned reverse FASTQ file (.trimmed.R2.fastq.gz). 3. The SRA accession string. The process also takes the path to the STAR genome index directory as a separate input.
Output	A tuple containing: 1. The path to the output directory for the sample. 2. The path to the coordinate-sorted BAM file (*.sortedByCoord.out.bam). 3. The path to the BAM index file (.bam.bai). 4. The SRA accession string.
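
The tuple layouts in the table can be pictured as two small record types. This is only an illustrative sketch in Python, not the pipeline's own code: the class names, field names, and the `expected_output` helper are hypothetical, and the per-sample directory and accession-based file names are assumptions. Only the element order and the file suffixes come from the table.

```python
from pathlib import Path
from typing import NamedTuple


class MappingInput(NamedTuple):
    """Per-sample input tuple, in the order listed in the table."""
    reads_r1: Path   # cleaned forward reads (.trimmed.R1.fastq.gz)
    reads_r2: Path   # cleaned reverse reads (.trimmed.R2.fastq.gz)
    accession: str   # SRA accession string


class MappingOutput(NamedTuple):
    """Per-sample output tuple, in the order listed in the table."""
    sample_dir: Path  # output directory for the sample
    bam: Path         # coordinate-sorted BAM (*.sortedByCoord.out.bam)
    bai: Path         # BAM index (.bam.bai)
    accession: str    # SRA accession string


def expected_output(out_root: Path, accession: str) -> MappingOutput:
    """Derive the output tuple for one sample.

    Assumption: each sample gets its own directory under out_root and its
    files are prefixed with the accession. The real pipeline may differ.
    """
    sample_dir = out_root / accession
    bam = sample_dir / f"{accession}.sortedByCoord.out.bam"
    bai = Path(str(bam) + ".bai")
    return MappingOutput(sample_dir, bam, bai, accession)
```

For a placeholder accession such as `SRR_EXAMPLE`, `expected_output(Path("results"), "SRR_EXAMPLE")` would point at `results/SRR_EXAMPLE/SRR_EXAMPLE.sortedByCoord.out.bam` and the matching `.bam.bai` file.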

Key STAR Parameters

The pipeline runs STAR with the flags below, chosen to improve performance and accuracy for alternative splicing analysis.

STAR Flag	Function
--twopassMode Basic	Activates STAR's two-pass mapping mode, which improves the detection of novel (unannotated) splice junctions. In the first pass, STAR identifies splice junctions from the data. In the second pass, it re-aligns the reads using those junctions as additional annotation, which increases sensitivity for reads that span them.
--outSAMstrandField intronMotif	Instructs STAR to add the XS strand attribute to spliced alignments (reads that span introns). The strand is derived from the intron motif at the splice site, and downstream analysis tools such as SGSeq require this information.
--readFilesCommand zcat	An efficiency parameter that tells STAR to decompress the input .gz files "on-the-fly" as it reads them. This avoids the need to write large, uncompressed temporary files to disk, saving time and space.
--outSAMtype BAM Unsorted	Configures STAR to write alignments directly in the binary BAM format instead of the text-based SAM format. The output is left unsorted so that coordinate sorting can be handled by multi-threaded samtools sort in the next step.
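
Put together, the flags in the table could be assembled into a single STAR call roughly as follows. This is a minimal sketch, not the command used by the MAPPING_STAR process itself: the function names, the thread count, and the example paths are hypothetical, and `--runThreadN`, `--genomeDir`, `--readFilesIn`, and `--outFileNamePrefix` are standard STAR options added here only to make the command complete.

```python
import shlex


def build_star_command(reads_r1, reads_r2, genome_dir, out_prefix, threads=8):
    """Return the STAR argument list for one paired-end sample."""
    return [
        "STAR",
        "--runThreadN", str(threads),             # assumed thread count
        "--genomeDir", str(genome_dir),           # STAR genome index directory
        "--readFilesIn", str(reads_r1), str(reads_r2),
        "--readFilesCommand", "zcat",             # decompress .gz input on the fly
        "--twopassMode", "Basic",                 # two-pass junction discovery
        "--outSAMstrandField", "intronMotif",     # XS attribute on spliced reads
        "--outSAMtype", "BAM", "Unsorted",        # unsorted BAM; sorted later
        "--outFileNamePrefix", str(out_prefix),
    ]


def star_bam_path(out_prefix):
    """With BAM Unsorted, STAR writes <prefix>Aligned.out.bam."""
    return f"{out_prefix}Aligned.out.bam"


if __name__ == "__main__":
    # Placeholder paths for illustration only.
    cmd = build_star_command(
        "sample.trimmed.R1.fastq.gz",
        "sample.trimmed.R2.fastq.gz",
        "star_index",
        "out/sample.",
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value separate, so it can be passed to `subprocess.run(cmd, check=True)` without shell quoting issues.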

Post-Alignment Processing

Once STAR generates the alignments, two essential samtools commands are executed:

samtools sort: The initial BAM file from STAR is unsorted. This command sorts the alignments by genomic coordinate. A coordinate-sorted BAM is required before the file can be indexed and is a standard requirement for most downstream analysis tools.
samtools index: This command creates a companion index file (.bai) for the sorted BAM file. This index allows programs to quickly access data from any specific region of the genome without having to read the entire file.
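
The two samtools steps above can be sketched as a pair of commands run in order. The function names, file names, and thread count below are hypothetical. The sketch assumes the standard `samtools sort -@ <threads> -o <out> <in>` and `samtools index <bam>` forms, where `samtools index` writes the index next to the BAM as `<bam>.bai`.

```python
import subprocess


def build_samtools_commands(unsorted_bam, sorted_bam, threads=4):
    """Return the sort and index commands for one sample, in run order."""
    sort_cmd = [
        "samtools", "sort",
        "-@", str(threads),        # additional sorting threads (assumed value)
        "-o", str(sorted_bam),     # coordinate-sorted output BAM
        str(unsorted_bam),         # unsorted BAM written by STAR
    ]
    index_cmd = ["samtools", "index", str(sorted_bam)]  # creates <sorted_bam>.bai
    return sort_cmd, index_cmd


def sort_and_index(unsorted_bam, sorted_bam, threads=4):
    """Run both steps, stopping at the first failure."""
    for cmd in build_samtools_commands(unsorted_bam, sorted_bam, threads):
        subprocess.run(cmd, check=True)
```

Sorting has to finish before indexing starts, because the index records positions within the coordinate-sorted file. `check=True` makes a failed sort stop the sequence before `samtools index` runs.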

Key Process Features