0.3 Input Files: Preparing Genome and Transcriptome Data
SpliceScape is designed to work with the standard file formats and structures provided by the Phytozome database. For the pipeline to locate the genome and annotation files correctly, you must follow a specific directory structure.
The three main files required for each species are:
- A reference genome in FASTA format (.fa).
- A genome annotation file in GFF3 format (.gff3).
- A plain text file (.txt) listing the target SRA accessions, with one per line.
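Before launching a run, it can help to sanity-check the accession list. The sketch below is not part of SpliceScape: `read_accessions` is a hypothetical helper, and the pattern assumes run-level accessions of the form SRR, ERR, or DRR followed by digits. Adjust the pattern if your list uses other accession types.

```python
import re
from pathlib import Path

# Assumption: entries are run accessions (SRR/ERR/DRR + digits).
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def read_accessions(list_path):
    """Return (valid, invalid) entries from a one-accession-per-line text file."""
    valid, invalid = [], []
    for line in Path(list_path).read_text().splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if RUN_ACCESSION.fullmatch(entry):
            valid.append(entry)
        else:
            invalid.append(entry)
    return valid, invalid
```

Calling `read_accessions("accessions.txt")` (file name is illustrative) returns two lists; anything in the second list is worth inspecting by hand before the pipeline tries to download it.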
A critical requirement is that both the genome FASTA file and the GFF3 annotation file must be uncompressed before you run the pipeline. SpliceScape does not handle gzipped files (e.g., .fa.gz) for the genome and annotation inputs.
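A file can still be gzipped even after its .gz extension has been renamed away, so checking the content is more reliable than checking the name. The following is a minimal sketch with hypothetical helper names (`is_gzipped`, `find_compressed_inputs`). It relies only on the fact that gzip streams start with the bytes 0x1f 0x8b.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def find_compressed_inputs(paths):
    """Return the subset of `paths` that still needs to be decompressed."""
    return [str(p) for p in paths if is_gzipped(p)]
```

Passing the FASTA and GFF3 paths to `find_compressed_inputs` should give an empty list when both files are ready for the pipeline.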
Recommended Directory Structure
We recommend creating a main directory for your project's input data and then creating a subdirectory for each species downloaded from Phytozome. Inside each species directory, you should have two subdirectories: assembly and annotation.
/path/to/your/project/
└── data/
    └── Phytozome/
        └── Athaliana_447_Araport11/    <-- Main species directory
            ├── assembly/
            │   └── Athaliana_447_TAIR10.fa    <-- Place the uncompressed FASTA file here
            └── annotation/
                └── Athaliana_447_Araport11.gene_exons.gff3    <-- Place the uncompressed GFF3 file here
Step-by-Step Guide:
- Download from Phytozome: Go to the Phytozome portal, find your species of interest, and download the genome sequence (FASTA) and the gene annotation (GFF3) files. It is often best to use the file ending in gene_exons.gff3 for the annotation.
- Create Directories: Create a folder structure as shown in the diagram above.
- Decompress Files: If your downloaded files are compressed (e.g., ...TAIR10.fa.gz), uncompress them.
- Place Files: Move the final .fa file into the assembly folder and the .gff3 file into the annotation folder.
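The directory, decompression, and placement steps can be scripted. The sketch below uses only the Python standard library; `place_input` and `prepare_species` are hypothetical helper names, not SpliceScape commands. It copies files rather than moving them, so the original downloads stay untouched.

```python
import gzip
import shutil
from pathlib import Path


def place_input(source, target_dir):
    """Copy `source` into `target_dir`, decompressing it if it ends in .gz."""
    source = Path(source)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    if source.suffix == ".gz":
        # Path.stem drops only the trailing ".gz", e.g. x.fa.gz -> x.fa
        target = target_dir / source.stem
        with gzip.open(source, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        target = target_dir / source.name
        shutil.copyfile(source, target)
    return target


def prepare_species(species_dir, fasta, gff3):
    """Build assembly/ and annotation/ under `species_dir` and fill them."""
    species_dir = Path(species_dir)
    return (
        place_input(fasta, species_dir / "assembly"),
        place_input(gff3, species_dir / "annotation"),
    )
```

For the example species, `prepare_species` would be called with the species directory from the diagram and the two downloaded files; it returns the final paths of the uncompressed FASTA and GFF3 files.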
Connecting the Structure to the config File
This directory structure directly maps to the parameters in your splicescape_paired.config. Using the example above, your configuration would look like this:
| Parameter | Value |
|---|---|
| genome | "/path/to/your/project/data/Phytozome/Athaliana_447_Araport11" |
| genome_path | "/path/to/your/project/data/Phytozome/Athaliana_447_Araport11/assembly" |
| genomeFASTA | "/path/to/your/project/data/Phytozome/Athaliana_447_Araport11/assembly/Athaliana_447_TAIR10.fa" |
| genomeGFF | "/path/to/your/project/data/Phytozome/Athaliana_447_Araport11/annotation/Athaliana_447_Araport11.gene_exons.gff3" |
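Because all four values follow from the species directory and the two file names, they can be derived rather than typed by hand. The sketch below is illustrative only: `config_values` is a hypothetical helper, and it prints plain name/value pairs rather than the actual syntax of splicescape_paired.config, which is not shown on this page.

```python
from pathlib import PurePosixPath


def config_values(species_dir, fasta_name, gff3_name):
    """Derive the four path parameters from one species directory."""
    species = PurePosixPath(species_dir)
    assembly = species / "assembly"
    return {
        "genome": str(species),
        "genome_path": str(assembly),
        "genomeFASTA": str(assembly / fasta_name),
        "genomeGFF": str(species / "annotation" / gff3_name),
    }


values = config_values(
    "/path/to/your/project/data/Phytozome/Athaliana_447_Araport11",
    "Athaliana_447_TAIR10.fa",
    "Athaliana_447_Araport11.gene_exons.gff3",
)
for name, value in values.items():
    print(f'{name}\t"{value}"')
```

Copy the printed values into the matching parameters of your config file, keeping whatever assignment syntax the file already uses.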