Data Organization and Download - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

4.2 Data Organization & Download

Before running any analyses, it pays to have a consistent directory layout and a reliable way to fetch raw FASTQ files from public repositories (SRA/ENA).

📁 Recommended Directory Structure

project_root/
├── metadata/
│   └── samplesheet.tsv        # sample-to-FASTQ mapping & experiment design
│
├── raw_data/                  # downloaded FASTQs
│   ├── SampleA_R1.fastq.gz
│   ├── SampleA_R2.fastq.gz
│   ├── SampleB_R1.fastq.gz
│   └── SampleB_R2.fastq.gz
│
├── qc/                        # FastQC / MultiQC outputs
│   ├── SampleA_fastqc.html
│   └── SampleB_fastqc.html
│
├── trimmed/                   # adapter-/quality-trimmed reads
│   ├── SampleA_trimmed_R1.fastq.gz
│   └── SampleA_trimmed_R2.fastq.gz
│
├── align/                     # alignments (BAM files)
│   ├── SampleA.sorted.bam
│   └── SampleA.sorted.bam.bai
│
├── counts/                    # gene-level count matrices
│   └── counts.tsv
│
└── results/                   # downstream tables & figures
    ├── de_analysis/
    └── enrichment/
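
The layout above can be created in one step. The following is a minimal sketch, not part of the original tutorial; `init_project` is a hypothetical helper name, and the directory names are taken directly from the tree.

```bash
# Sketch: create the recommended directory skeleton under a project root.
# init_project is a hypothetical helper name used only for illustration.
init_project() {
  root="${1:-project_root}"   # defaults to ./project_root
  for d in metadata raw_data qc trimmed align counts \
           results/de_analysis results/enrichment; do
    # -p creates parent directories and is a no-op if they already exist
    mkdir -p "$root/$d" || return 1
  done
}

# Usage:
# init_project project_root
```

Because `mkdir -p` leaves existing directories alone, the helper can be rerun on a project that is already partly populated.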

Tip: keep your samplesheet.tsv under metadata/ and refer to it in all pipeline steps to automatically assign file paths and group labels.
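
As an illustration of that tip, the sketch below derives FASTQ paths and group labels from the samplesheet instead of hard-coding them. The column layout is an assumption (the tutorial does not define one): a header line followed by tab-separated `sample_id`, `run_accession`, and `condition`. The helper name `sample_paths` is also hypothetical.

```bash
# Sketch: print "sample <TAB> group <TAB> R1 path <TAB> R2 path" for each row.
# Assumed (hypothetical) columns: sample_id, run_accession, condition,
# tab-separated, with one header line that is skipped.
sample_paths() {
  sheet="${1:-metadata/samplesheet.tsv}"
  raw_dir="${2:-raw_data}"
  awk -F '\t' -v dir="$raw_dir" 'NR > 1 && $1 != "" {
    printf "%s\t%s\t%s/%s_R1.fastq.gz\t%s/%s_R2.fastq.gz\n", $1, $3, dir, $1, dir, $1
  }' "$sheet"
}

# Usage:
# sample_paths metadata/samplesheet.tsv raw_data
```

Downstream steps can read this output line by line, so adding a sample only requires a new samplesheet row.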

⬇️ Fetching from NCBI SRA

Install SRA Toolkit

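# pigz is used below to compress the FASTQs; conda-forge is listed first, as Bioconda recommends
conda install -c conda-forge -c bioconda sra-tools pigz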

Download & split paired-end reads

# example accession SRR1234567
prefetch SRR1234567

# convert to FASTQ, split paired reads:
fasterq-dump SRR1234567 \
  --split-files \
  --outdir raw_data \
  --threads 4

# gzip the outputs (the rename step below expects .fastq.gz files)
pigz -p 4 raw_data/SRR1234567_*.fastq
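
For more than one run, the same three commands can be wrapped in a function. This is a sketch under stated assumptions: `fetch_run` is a hypothetical helper name, the runs are paired-end, and `prefetch`, `fasterq-dump`, and `pigz` are on the PATH. It uses only the flags already shown above.

```bash
# Sketch: download one paired-end run and leave gzipped FASTQs in the output directory.
# fetch_run is a hypothetical helper, not part of SRA Toolkit.
fetch_run() {
  run="$1"                  # run accession, e.g. SRR1234567
  outdir="${2:-raw_data}"
  threads="${3:-4}"
  mkdir -p "$outdir" || return 1
  prefetch "$run" || return 1
  fasterq-dump "$run" --split-files --outdir "$outdir" --threads "$threads" || return 1
  # compress the <run>_1.fastq and <run>_2.fastq files written by fasterq-dump
  pigz -p "$threads" "$outdir/${run}"_*.fastq
}

# Usage (needs network access; list your own accessions after "in"):
# for run in SRR1234567; do
#   fetch_run "$run" raw_data 4 || echo "failed: $run" >&2
# done
```

Stopping at the first failed step avoids compressing or renaming a partially written FASTQ.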

Rename for clarity (match SampleID in your metadata)

mv raw_data/SRR1234567_1.fastq.gz raw_data/SampleA_R1.fastq.gz
mv raw_data/SRR1234567_2.fastq.gz raw_data/SampleA_R2.fastq.gz
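
With many samples, the renaming can be driven by the samplesheet rather than typed by hand. The sketch below assumes a hypothetical layout that the tutorial does not specify: a header line, then tab-separated rows whose first column is the sample ID and whose second column is the SRR run accession. `rename_runs` is an illustrative name.

```bash
# Sketch: rename <run>_1/_2.fastq.gz to <sample>_R1/_R2.fastq.gz.
# Assumed (hypothetical) samplesheet columns: sample_id <TAB> run_accession <TAB> ...
rename_runs() {
  sheet="${1:-metadata/samplesheet.tsv}"
  raw_dir="${2:-raw_data}"
  tab=$(printf '\t')
  {
    read -r _header   # skip the header line
    while IFS="$tab" read -r sample_id run_accession _rest || [ -n "$sample_id" ]; do
      # skip blank or incomplete rows
      [ -n "$sample_id" ] && [ -n "$run_accession" ] || continue
      for n in 1 2; do
        src="$raw_dir/${run_accession}_${n}.fastq.gz"
        dst="$raw_dir/${sample_id}_R${n}.fastq.gz"
        if [ -f "$src" ]; then
          mv -n "$src" "$dst"   # -n: never overwrite an existing file
        fi
      done
    done
  } < "$sheet"
}

# Usage:
# rename_runs metadata/samplesheet.tsv raw_data
```

Runs whose files are absent are skipped silently, so the function can be rerun after further downloads finish.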

⬇️ Fetching from ENA (European Nucleotide Archive)

Use wget or curl

# Example with wget (-P saves the files into raw_data/)
wget -P raw_data ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz
wget -P raw_data ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_2.fastq.gz

# Example with curl (-o sets the output path)
curl -o raw_data/SRR1234567_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz
curl -o raw_data/SRR1234567_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_2.fastq.gz

Recursive download of all FASTQ files for a run

# grab both mate files (_1 and _2) of the run in one go
wget -r -nd -P raw_data \
  -A "*_1.fastq.gz","*_2.fastq.gz" \
  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/
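
The directory part of these URLs can be derived from the run accession. The sketch below encodes the layout as it is commonly described for ENA: the first six characters of the accession, then (for accessions with more than six digits) a three-character subdirectory made of the digits from the seventh onward, zero-padded on the left, then the accession itself. Treat the rule as an assumption and confirm the real paths in ENA's file report for your runs. `ena_fastq_dir` is a hypothetical helper name.

```bash
# Sketch: build the ENA FASTQ directory URL for a run accession.
# The path rule is an assumption about ENA's layout; verify it for your runs.
ena_fastq_dir() {
  acc="$1"
  prefix=$(printf '%s' "$acc" | cut -c1-6)    # e.g. SRR123
  case "${#acc}" in
    9)  sub="" ;;                                           # 6 digits: no extra level
    10) sub="00$(printf '%s' "$acc" | cut -c10)/" ;;      # 7 digits: 00 + last digit
    11) sub="0$(printf '%s' "$acc" | cut -c10-11)/" ;;    # 8 digits: 0 + last two
    12) sub="$(printf '%s' "$acc" | cut -c10-12)/" ;;     # 9 digits: last three
    *)  echo "unsupported accession: $acc" >&2; return 1 ;;
  esac
  printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s%s/\n' "$prefix" "$sub" "$acc"
}

# Usage:
# wget -r -nd -P raw_data -A "*.fastq.gz" "$(ena_fastq_dir SRR1234567)"
```

For `SRR1234567` the function prints `ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/`, which matches the directory used in the recursive example above.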

📝 Notes & Best Practices

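Always verify checksums after downloading to ensure data integrity; ENA, for example, lists an MD5 for each FASTQ in its file report (fastq_md5 field), which you can compare with md5sum.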
Keep raw data read-only; perform trimming, alignment, etc. in separate folders.
Use your samplesheet.tsv to map SampleID ↔ SRR run; avoid hard-coding file names.
If you expect large downloads, consider running fetch steps on a server or via Aspera (ascp) for speed.
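
The first two points can be combined into one step. The following is a sketch, assuming you have saved the published checksums in `md5sum` format (`<md5>  <filename>`, one per line) in a manifest file inside the raw data directory. The helper name `verify_and_lock` and the manifest name `md5sums.txt` are hypothetical.

```bash
# Sketch: verify FASTQs against an md5sum-style manifest, then make them read-only.
# verify_and_lock and md5sums.txt are hypothetical names used for illustration.
verify_and_lock() {
  dir="${1:-raw_data}"          # directory holding the FASTQs
  manifest="${2:-md5sums.txt}"  # manifest path, relative to $dir
  # md5sum -c exits non-zero if any file is missing or its checksum differs
  ( cd "$dir" && md5sum -c "$manifest" ) || return 1
  # remove write permission only after every checksum has matched
  chmod a-w "$dir"/*.fastq.gz
}

# Usage:
# verify_and_lock raw_data md5sums.txt
```

Locking only after a successful check means a corrupted download can still be replaced without first changing permissions back.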

With raw FASTQs neatly organized and paired to your metadata, you’re ready for Quality Control and the rest of the RNA-Seq pipeline.