1.4 SpliceScape: Read Mapping with STAR - labbces/SpliceScape GitHub Wiki
This is a critical stage where the cleaned reads are aligned to the reference genome. The MAPPING_STAR process uses the STAR aligner to determine the genomic origin of each read. Its ability to accurately identify reads that span across exon-exon boundaries (splice junctions) is essential for detecting splicing events.
After STAR completes the alignment, the process uses samtools to sort and index the resulting alignment file, preparing it for the final splicing analysis step.
Process 5: MAPPING_STAR
- Inputs and Outputs:
| Type | Description |
|---|---|
| Input | A tuple containing: 1. Cleaned forward FASTQ file (.trimmed.R1.fastq.gz). 2. Cleaned reverse FASTQ file (.trimmed.R2.fastq.gz). 3. The SRA accession string. The process also takes the path to the STAR genome index directory. |
| Output | A tuple containing: 1. The path to the output directory for the sample. 2. The path to the coordinate-sorted BAM file (*.sortedByCoord.out.bam). 3. The path to the BAM index file (.bam.bai). 4. The SRA accession string. |
Key STAR Parameters
The pipeline runs STAR with the following flags, chosen to improve speed and splice-junction sensitivity for alternative splicing analysis.
| STAR Flag | Function |
|---|---|
| --twopassMode Basic | This activates STAR's two-pass mapping mode, which is highly effective for discovering novel (unannotated) splice junctions. In the first pass, STAR identifies splice junctions from the data; in the second pass, it re-aligns the reads using those junctions, which improves sensitivity for reads that span novel junctions. |
| --outSAMstrandField intronMotif | This flag instructs STAR to add an XS tag to reads that span introns. This tag indicates the strand based on the intron motif (e.g., canonical or non-canonical splice sites), which is required information for downstream analysis tools like SGSeq. |
| --readFilesCommand zcat | An efficiency parameter that tells STAR to decompress the input .gz files "on-the-fly" as it reads them. This avoids the need to write large, uncompressed temporary files to disk, saving time and space. |
| --outSAMtype BAM Unsorted | Configures STAR to output alignments directly in the binary BAM format instead of the text-based SAM format. The output is unsorted to let the faster, multi-threaded samtools sort handle this task. |
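As a rough illustration of how the flags in the table fit together, the sketch below assembles a STAR command line and prints it. The four flags from the table come from this page; everything else (the accession `SRR0000000`, the index path `star_index`, the thread count, and the use of `--runThreadN`, `--genomeDir`, `--readFilesIn`, and `--outFileNamePrefix`) is an assumption made for the example and may not match the actual MAPPING_STAR script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholders; the real process receives these from its input tuple.
accession="SRR0000000"
reads_r1="${accession}.trimmed.R1.fastq.gz"
reads_r2="${accession}.trimmed.R2.fastq.gz"
genome_dir="star_index"   # assumed path to the STAR genome index directory
threads=8                 # assumed thread count

# Build the command as an array so every argument stays correctly quoted.
star_cmd=(
  STAR
  --runThreadN "$threads"
  --genomeDir "$genome_dir"
  --readFilesIn "$reads_r1" "$reads_r2"
  --readFilesCommand zcat            # decompress .gz input on the fly
  --twopassMode Basic                # two-pass mapping for novel junctions
  --outSAMstrandField intronMotif    # add the XS strand tag to spliced reads
  --outSAMtype BAM Unsorted          # leave sorting to samtools
  --outFileNamePrefix "${accession}/${accession}_"
)

# Print the command for inspection.
echo "${star_cmd[*]}"

# Run it only when explicitly requested and STAR is available on PATH.
if [[ "${RUN_STAR:-0}" == "1" ]] && command -v STAR >/dev/null 2>&1; then
  mkdir -p "$accession"
  "${star_cmd[@]}"
fi
```

With this prefix, STAR would write the unsorted alignments to `SRR0000000/SRR0000000_Aligned.out.bam`. Setting `RUN_STAR=1` executes the command instead of only printing it.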
Post-Alignment Processing
Once STAR generates the alignments, two essential samtools commands are executed:
- samtools sort: The initial BAM file from STAR is unsorted. This command sorts the alignments based on their genomic coordinates. A coordinate-sorted BAM is a standard requirement for most downstream analysis tools.
- samtools index: This command creates a companion index file (.bai) for the sorted BAM file. This index allows programs to quickly access data from any specific region of the genome without having to read the entire file.
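The two post-alignment steps could look roughly like the sketch below. The file names, the thread count, and the `-@` and `-o` options are assumptions for illustration; the pipeline's actual invocation may differ. The commands are printed first and executed only if samtools is installed and the unsorted BAM is present.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; the pipeline's actual naming may differ.
prefix="SRR0000000/SRR0000000_"
unsorted_bam="${prefix}Aligned.out.bam"
sorted_bam="${prefix}Aligned.sortedByCoord.out.bam"
threads=8   # assumed number of additional sorting threads

# Coordinate sort, then index the sorted file.
sort_cmd=(samtools sort -@ "$threads" -o "$sorted_bam" "$unsorted_bam")
index_cmd=(samtools index "$sorted_bam")

echo "${sort_cmd[*]}"
echo "${index_cmd[*]}"

# Execute only when samtools is installed and the unsorted BAM exists.
if command -v samtools >/dev/null 2>&1 && [[ -f "$unsorted_bam" ]]; then
  "${sort_cmd[@]}"
  "${index_cmd[@]}"   # by default writes ${sorted_bam}.bai next to the BAM
fi
```

Indexing has to come after sorting, because `samtools index` expects a coordinate-sorted BAM.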
Key Process Features
- Error Handling: The process is configured with `errorStrategy 'ignore'`. This means that if a single sample fails during the mapping stage, the pipeline will log the error but will not stop. It will continue processing the remaining samples, which is a robust approach for large-scale analyses.
- Disk Space Management: Similar to the previous step, this process uses the `truncate` command to empty the cleaned FASTQ files (*.trimmed.fastq.gz) after they have been successfully mapped. This conserves a significant amount of disk space while ensuring Nextflow's `-resume` functionality remains intact.
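The disk-space step can be sketched as follows, using stand-in files in a scratch directory so it is safe to run anywhere. The file names are hypothetical, and the guard that checks for a non-empty sorted BAM before truncating is an assumption added for the example, not something this page confirms about the pipeline. It relies on `truncate` from GNU coreutils.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch directory with stand-in files (hypothetical names).
workdir="$(mktemp -d)"
fastq_r1="${workdir}/SRR0000000.trimmed.R1.fastq.gz"
fastq_r2="${workdir}/SRR0000000.trimmed.R2.fastq.gz"
sorted_bam="${workdir}/SRR0000000_Aligned.sortedByCoord.out.bam"

printf 'placeholder read data\n' > "$fastq_r1"
printf 'placeholder read data\n' > "$fastq_r2"
printf 'placeholder alignment data\n' > "$sorted_bam"

# Assumed guard: only empty the FASTQ files once a non-empty sorted BAM exists.
if [[ -s "$sorted_bam" ]]; then
  # truncate -s 0 sets the size to zero but keeps the file itself in place.
  truncate -s 0 "$fastq_r1" "$fastq_r2"
fi

# Report the result: the files still exist, but occupy no space.
for f in "$fastq_r1" "$fastq_r2"; do
  echo "$(basename "$f"): $(( $(wc -c < "$f") )) bytes"
done
```

Because the files are emptied rather than deleted, their paths remain valid for anything that later checks that they exist.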