1.3 SpliceScape: Genome Index Generation with STAR - labbces/SpliceScape GitHub Wiki

Before any sequencing reads can be aligned to a reference genome, a specialized index of that genome must be created. This step is a prerequisite for the mapping stage. The GENOME_GENERATE_STAR process handles this crucial, one-time setup for each new genome.

This process runs the STAR (Spliced Transcripts Alignment to a Reference) aligner in genomeGenerate mode to build the index files. Once an index for a specific genome has been created successfully, Nextflow's caching mechanism can reuse it in subsequent pipeline runs when resume is enabled (for example, with the -resume flag), saving significant time.


Process 4: GENOME_GENERATE_STAR

  • Inputs and Outputs
| Type | Description |
| --- | --- |
| Input | 1. Path to the reference genome FASTA file (.fa). 2. Path to the genome annotation file (.gff3). 3. The number of threads to use, specified in the config file. |
| Output | A directory containing all the files that constitute the STAR genome index. This directory is then passed to the mapping process. |
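As a rough illustration of what "a directory containing all the files" means in practice, the sketch below checks a directory for a few of the core files STAR writes when it builds an index (Genome, SA, SAindex, genomeParameters.txt, chrName.txt). The function name `star_index_looks_complete` is hypothetical and not part of SpliceScape, and the file list is a subset rather than an exhaustive inventory.

```shell
# Sketch: check that a directory looks like a finished STAR index.
# The file list is a core subset of STAR's output, not a complete inventory.
star_index_looks_complete() {
    index_dir="$1"
    for f in Genome SA SAindex genomeParameters.txt chrName.txt; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "${index_dir}/${f}" ]; then
            echo "missing or empty: ${index_dir}/${f}" >&2
            return 1
        fi
    done
    return 0
}
```

It could be called on the published index directory, for example `star_index_looks_complete <output>/genomeGenerate`, before starting a long mapping run by hand.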

Key STAR Parameters

The script passes the following STAR flags to build an index suited to splice-aware RNA-seq mapping:

| STAR Flag | Function |
| --- | --- |
| --runMode genomeGenerate | Instructs STAR to build a new genome index instead of performing read alignment. |
| --sjdbGTFfile | Critical for splicing analysis. Supplies STAR with the annotation file (here, a GFF3 file) of known gene models. STAR extracts the known splice junctions from it and adds them to the index, which improves accuracy and sensitivity when mapping reads across exon-exon boundaries in the next stage. STAR expects GTF by default; according to the STAR manual, GFF3 input also requires --sjdbGTFtagExonParentTranscript Parent so that exons are linked to their parent transcripts. |
| --runThreadN | Specifies the number of threads STAR runs, allowing faster, multi-threaded index generation. |
| --genomeDir | The output directory where the generated index files will be stored. |
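The flags above can be combined into a single command, sketched below as a dry run that prints the command instead of executing it. This is an assumption-laden illustration, not the pipeline's actual script: the function name `star_index_command`, the file names, and the thread count are hypothetical placeholders. `--genomeFastaFiles` is the STAR flag that supplies the reference FASTA (not listed in the table), and `--sjdbGTFtagExonParentTranscript Parent` is what the STAR manual recommends for GFF3 annotations; whether the SpliceScape script sets it is not stated here.

```shell
# Sketch: assemble the STAR index-building command from its four inputs.
# Arguments: <genome.fa> <annotation.gff3> <index_dir> <threads>
# All values passed below are hypothetical placeholders.
star_index_command() {
    genome_fasta="$1"
    annotation_gff3="$2"
    index_dir="$3"
    threads="$4"
    printf 'STAR --runMode genomeGenerate --genomeDir %s --genomeFastaFiles %s --sjdbGTFfile %s --sjdbGTFtagExonParentTranscript Parent --runThreadN %s\n' \
        "$index_dir" "$genome_fasta" "$annotation_gff3" "$threads"
}

# Dry run: print the command so it can be reviewed before running it.
star_index_command genome.fa annotation.gff3 star_index 8
```

Printing first makes it easy to confirm the paths and thread count. Inside the pipeline, Nextflow stages the inputs and fills in these values itself.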

Key Process Features

Critical Error Handling

This process is essential for the entire pipeline. Therefore, it is configured with a strict error strategy:

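  • errorStrategy 'finish': If the genome index generation fails for any reason, Nextflow starts an orderly shutdown: it waits for tasks that have already been submitted to complete, launches no further work, and then ends the run with an error. This prevents wasted computational resources, as no downstream mapping steps can proceed without a valid index.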

Reusability and Caching

  • The generated index is saved to the directory specified by publishDir (e.g., <output>/genomeGenerate). With cache 'lenient', Nextflow builds the cache key from each input file's path and size and ignores its timestamp. If you re-run the pipeline with -resume and the same genome, this step is skipped and the previously generated index is reused, making subsequent analyses much faster.
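To make the 'lenient' setting more concrete: the Nextflow documentation describes it as keying the cache on an input file's path and size, without the timestamp. The sketch below mimics that idea with a hypothetical helper, `lenient_fingerprint`. It only illustrates the concept and is not Nextflow's actual hashing code.

```shell
# Sketch: a "lenient"-style fingerprint built from a file's path and size only.
# Illustration of the idea, not Nextflow's real cache-key computation.
lenient_fingerprint() {
    file="$1"
    # Arithmetic expansion strips the padding some wc implementations add.
    size=$(( $(wc -c < "$file") ))
    printf '%s:%s\n' "$file" "$size"
}
```

Under this scheme, touching the genome FASTA (which changes only its modification time) leaves the fingerprint unchanged, while replacing it with a file of a different size changes it. This is why lenient caching tolerates inconsistent timestamps on shared file systems.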