
4.2 Data Organization & Download

Before running any analyses, it pays to have a consistent directory layout and a reliable way to fetch raw FASTQ files from public repositories (SRA/ENA).

📁 Recommended Directory Structure

project_root/
├── metadata/
│   └── samplesheet.tsv        # sample-to-FASTQ mapping & experiment design
│
├── raw_data/                  # downloaded FASTQs
│   ├── SampleA_R1.fastq.gz
│   ├── SampleA_R2.fastq.gz
│   ├── SampleB_R1.fastq.gz
│   └── SampleB_R2.fastq.gz
│
├── qc/                        # FastQC / MultiQC outputs
│   ├── SampleA_fastqc.html
│   └── SampleB_fastqc.html
│
├── trimmed/                   # adapter-/quality-trimmed reads
│   ├── SampleA_trimmed_R1.fastq.gz
│   └── SampleA_trimmed_R2.fastq.gz
│
├── align/                     # alignments (BAM files)
│   ├── SampleA.sorted.bam
│   └── SampleA.sorted.bam.bai
│
├── counts/                    # gene-level count matrices
│   └── counts.tsv
│
└── results/                   # downstream tables & figures
    ├── de_analysis/
    └── enrichment/
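
The layout above can be created in one step. The following is a minimal sketch; the function name `init_project` is hypothetical, and the directory names are taken directly from the tree.

```bash
# Create the directory skeleton from the tree above under a project root.
# init_project is a hypothetical helper name, not part of any tool.
init_project() {
  root="${1:-project_root}"
  for d in metadata raw_data qc trimmed align counts \
           results/de_analysis results/enrichment; do
    mkdir -p "$root/$d"   # -p makes this safe to re-run
  done
}
```

Calling `init_project my_rnaseq_project` would create the empty folders; files such as `samplesheet.tsv` still have to be added by hand.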

Tip: keep your samplesheet.tsv under metadata/ and refer to it in all pipeline steps to automatically assign file paths and group labels.
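
As an illustration of the tip above, a pipeline step could derive file paths and group labels from the samplesheet instead of listing them by hand. This sketch assumes a tab-separated sheet with a header row and three columns named `sample_id`, `run_accession`, and `condition`; that column layout and the function name `list_samples` are assumptions, not something the tutorial prescribes.

```bash
# Print one tab-separated line per sample: ID, expected R1 path, R2 path, group.
# Assumed (hypothetical) columns: sample_id, run_accession, condition.
# Note: empty cells are not supported, because consecutive tabs collapse.
list_samples() {
  sheet="${1:-metadata/samplesheet.tsv}"
  tail -n +2 "$sheet" |   # skip the header row
  while IFS="$(printf '\t')" read -r sample_id run condition || [ -n "$sample_id" ]; do
    [ -n "$sample_id" ] || continue   # skip blank lines
    printf '%s\t%s\t%s\t%s\n' "$sample_id" \
      "raw_data/${sample_id}_R1.fastq.gz" \
      "raw_data/${sample_id}_R2.fastq.gz" \
      "$condition"
  done
}
```

Later steps can then loop over `list_samples metadata/samplesheet.tsv` rather than repeating sample names in every script.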

⬇️ Fetching from NCBI SRA

  1. Install SRA Toolkit
conda install -c bioconda sra-tools

  2. Download & split paired-end reads
# example accession SRR1234567
prefetch SRR1234567

# convert to FASTQ, split paired reads:
fasterq-dump SRR1234567 \
  --split-files \
  --outdir raw_data \
  --threads 4

# optionally gzip the outputs
pigz -p 4 raw_data/SRR1234567_*.fastq

  3. Rename for clarity (match SampleID in your metadata)
mv raw_data/SRR1234567_1.fastq.gz raw_data/SampleA_R1.fastq.gz
mv raw_data/SRR1234567_2.fastq.gz raw_data/SampleA_R2.fastq.gz
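
With more than a couple of runs, the renaming step can be driven by the samplesheet rather than typed per file. The sketch below assumes the same hypothetical sheet layout (tab-separated, header row, `sample_id` in column 1 and `run_accession` in column 2); `rename_runs` is an illustrative name.

```bash
# Rename <run>_1/_2.fastq.gz to <sample_id>_R1/_R2.fastq.gz for each sheet row.
# Assumed (hypothetical) columns: sample_id, run_accession, then anything else.
rename_runs() {
  sheet="$1"
  dir="${2:-raw_data}"
  tail -n +2 "$sheet" |   # skip the header row
  while IFS="$(printf '\t')" read -r sample_id run _rest || [ -n "$sample_id" ]; do
    [ -n "$sample_id" ] && [ -n "$run" ] || continue
    for mate in 1 2; do
      src="$dir/${run}_${mate}.fastq.gz"
      dst="$dir/${sample_id}_R${mate}.fastq.gz"
      if [ -e "$src" ] && [ ! -e "$dst" ]; then
        mv -- "$src" "$dst"
      else
        # never overwrite an existing target; report and move on
        echo "skip: $src (missing, or $dst already exists)" >&2
      fi
    done
  done
}
```

Running `rename_runs metadata/samplesheet.tsv raw_data` a second time is harmless, since already-renamed files are reported and skipped.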

⬇️ Fetching from ENA (European Nucleotide Archive)

  1. Use wget or curl
# Example with wget
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_2.fastq.gz

# Example with curl
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_2.fastq.gz

  2. Recursive download of all FASTQ files for a run
# grab both mates (_1 and _2) in one go
wget -r -nd -P raw_data \
  -A "*_1.fastq.gz","*_2.fastq.gz" \
  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/
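
The `SRR123/007/` part of the path is derived from the accession itself: the first six characters, then a zero-padded subdirectory built from the digits after the ninth character. The sketch below encodes that rule as it is described in ENA's file-layout documentation; treat it as an assumption to check against the ENA browser for your own runs. The helper name `ena_fastq_dir` is hypothetical.

```bash
# Print the ENA FTP directory that should hold the FASTQ files for a run.
# Layout rule (assumed from ENA's documentation):
#   9-char accession  -> <first6>/<acc>
#   10-char accession -> <first6>/00<last digit>/<acc>
#   11-char accession -> <first6>/0<last two digits>/<acc>
#   12-char accession -> <first6>/<last three digits>/<acc>
ena_fastq_dir() {
  acc="$1"
  prefix="$(printf '%s' "$acc" | cut -c1-6)"
  base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq/$prefix"
  case "${#acc}" in
    9)  sub="" ;;
    10) sub="00$(printf '%s' "$acc" | cut -c10)" ;;
    11) sub="0$(printf '%s' "$acc" | cut -c10-11)" ;;
    12) sub="$(printf '%s' "$acc" | cut -c10-12)" ;;
    *)  echo "unexpected accession length: $acc" >&2; return 1 ;;
  esac
  if [ -n "$sub" ]; then
    printf '%s/%s/%s\n' "$base" "$sub" "$acc"
  else
    printf '%s/%s\n' "$base" "$acc"
  fi
}
```

For example, `wget -P raw_data "$(ena_fastq_dir SRR1234567)/SRR1234567_1.fastq.gz"` would fetch the first mate without hand-building the URL.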

📝 Notes & Best Practices

  • Always verify checksums (if provided) to ensure data integrity.
  • Keep raw data read-only; perform trimming, alignment, etc. in separate folders.
  • Use your samplesheet.tsv to map SampleID ↔ SRR run; avoid hard-coding file names.
  • If you expect large downloads, consider running fetch steps on a server or via Aspera (ascp) for speed.
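
The first two points can be combined into a single post-download step. This is a minimal sketch, assuming GNU coreutils `md5sum` is available and that the published MD5 values have been saved in `md5sum` format (hash, two spaces, file name) to a file inside the data folder; the file name `md5sums.txt` and the function name `verify_and_lock` are hypothetical.

```bash
# Verify downloaded FASTQs against a checksum list, then make them read-only.
# Assumes GNU md5sum; md5sums.txt is a hypothetical file kept inside the
# data directory, with paths relative to that directory.
verify_and_lock() {
  dir="${1:-raw_data}"
  sums="${2:-md5sums.txt}"
  # -c checks every listed file; --quiet prints only the failures
  ( cd "$dir" && md5sum -c --quiet "$sums" ) || return 1
  # drop write permission only once everything has verified
  chmod a-w "$dir"/*.fastq.gz
}
```

Because the `chmod` only runs after a clean check, a corrupted download stays writable and can simply be fetched again.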

With raw FASTQs neatly organized and paired to your metadata, you're ready for Quality Control and the rest of the RNA-Seq pipeline.