1.3 SpliceScape: Genome Index Generation with STAR
Before any sequencing reads can be aligned to a reference genome, a specialized index of that genome must be created. This step is a prerequisite for the mapping stage. The GENOME_GENERATE_STAR process handles this crucial, one-time setup for each new genome.
This process uses the STAR (Spliced Transcripts Alignment to a Reference) aligner in genomeGenerate mode to build the index files. Once an index for a specific genome has been built, Nextflow's caching mechanism can reuse it in subsequent pipeline runs (when the run is resumed with the -resume option), saving significant time.
Inputs and Outputs
| Type | Description |
|---|---|
| Input | 1. Path to the reference genome FASTA file (.fa). 2. Path to the genome annotation file (.gff3). 3. The number of threads to use, specified in the config file. |
| Output | A directory containing all the files that constitute the STAR genome index. This directory is then passed to the mapping process. |
The script passes the following STAR flags to build a splice-aware genome index for RNA-seq analysis:
| STAR Flag | Function |
|---|---|
| --runMode genomeGenerate | Instructs STAR to build a new genome index instead of performing read alignment. |
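| --sjdbGTFfile | This is a critical parameter for splicing analysis. It points STAR to the annotation file (here, a GFF3 file) containing known gene models. STAR uses this information to build a database of known splice junctions, which improves its accuracy and sensitivity when mapping reads across exon-exon boundaries in the next stage. STAR expects GTF format for this option by default; for GFF3 input, the STAR manual also calls for --sjdbGTFtagExonParentTranscript Parent so that exons are linked to their parent transcripts. |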
| --runThreadN | Specifies the number of CPU cores to use, allowing for faster, multi-threaded index generation. |
| --genomeDir | The output directory where the generated index files will be stored. |
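As a rough illustration of how the flags above fit together, the following Python sketch assembles a STAR genomeGenerate call as an argument list. The function name, file names, and thread count are hypothetical, and the actual SpliceScape script may differ. Two flags that are not listed in the table are included as assumptions: --genomeFastaFiles, which STAR uses to receive the reference FASTA, and --sjdbGTFtagExonParentTranscript Parent, which the STAR manual recommends for GFF3 annotations.

```python
import shlex
from typing import List


def build_star_index_command(
    genome_fasta: str,
    annotation_gff3: str,
    threads: int,
    index_dir: str,
) -> List[str]:
    """Assemble a STAR genomeGenerate call as an argument list.

    Hypothetical helper for illustration; it is not part of SpliceScape.
    """
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return [
        "STAR",
        "--runMode", "genomeGenerate",        # build an index, do not align reads
        "--genomeDir", index_dir,             # where the index files are written
        "--genomeFastaFiles", genome_fasta,   # assumption: reference FASTA input
        "--sjdbGTFfile", annotation_gff3,     # known splice junctions
        # Assumption: GFF3 links exons to transcripts through the Parent attribute.
        "--sjdbGTFtagExonParentTranscript", "Parent",
        "--runThreadN", str(threads),         # number of CPU threads
    ]


if __name__ == "__main__":
    # Hypothetical file names; print the command instead of running STAR.
    cmd = build_star_index_command("genome.fa", "genes.gff3", 8, "star_index")
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces; the list could be handed to subprocess.run on a machine where STAR is installed.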
Critical Error Handling
This process is essential for the entire pipeline. Therefore, it is configured with a strict error strategy:
- errorStrategy 'finish': If genome index generation fails for any reason, Nextflow initiates an orderly shutdown of the pipeline, letting tasks that were already submitted finish but launching no new ones. This prevents wasted computational resources, as no downstream mapping step can proceed without a valid index.
Reusability and Caching
- The generated index is saved to the directory specified by publishDir (e.g., <output>/genomeGenerate). With cache 'lenient', Nextflow builds cache keys from each input file's path and size and ignores its timestamp, so if you re-run the pipeline with -resume and the same genome, this step is skipped and the previously generated index is reused, making subsequent analyses much faster.
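To illustrate why a lenient cache mode suits a large, rarely changing input such as a reference genome, the sketch below compares two toy cache keys for a file. It assumes, following the Nextflow documentation, that the default mode considers a file's path, size, and last-modified timestamp, while lenient mode considers only path and size. The cache_key function is hypothetical and is a conceptual model only, not Nextflow's actual hashing code.

```python
import hashlib
import os
import tempfile


def cache_key(path: str, mode: str = "standard") -> str:
    """Return a toy cache key for one input file.

    'standard' hashes path, size, and modification time;
    'lenient' hashes only path and size.
    """
    info = os.stat(path)
    parts = [os.path.abspath(path), str(info.st_size)]
    if mode == "standard":
        parts.append(str(info.st_mtime_ns))
    elif mode != "lenient":
        raise ValueError(f"unknown cache mode: {mode}")
    return hashlib.sha256("\0".join(parts).encode()).hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        fasta = os.path.join(workdir, "genome.fa")  # hypothetical file name
        with open(fasta, "w") as handle:
            handle.write(">chr1\nACGT\n")

        before = {m: cache_key(fasta, m) for m in ("standard", "lenient")}

        # Simulate a timestamp change without touching the file content,
        # as can happen when a file is copied or a shared filesystem resyncs.
        info = os.stat(fasta)
        os.utime(fasta, ns=(info.st_atime_ns, info.st_mtime_ns + 60 * 10**9))

        after = {m: cache_key(fasta, m) for m in ("standard", "lenient")}
        for m in ("standard", "lenient"):
            status = "unchanged" if before[m] == after[m] else "changed"
            print(f"{m}: key {status}")
```

In this model, a timestamp-only change invalidates the standard key but not the lenient one, which is the situation in which an expensive index build would otherwise be repeated unnecessarily.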