Data Organization and Download - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
4.2 Data Organization & Download
Before running any analyses, it pays to have a consistent directory layout and a reliable way to fetch raw FASTQ files from public repositories (SRA/ENA).
Recommended Directory Structure
project_root/
├── metadata/
│   └── samplesheet.tsv              # sample-to-FASTQ mapping & experiment design
│
├── raw_data/                        # downloaded FASTQs
│   ├── SampleA_R1.fastq.gz
│   ├── SampleA_R2.fastq.gz
│   ├── SampleB_R1.fastq.gz
│   └── SampleB_R2.fastq.gz
│
├── qc/                              # FastQC / MultiQC outputs
│   ├── SampleA_fastqc.html
│   └── SampleB_fastqc.html
│
├── trimmed/                         # adapter-/quality-trimmed reads
│   ├── SampleA_trimmed_R1.fastq.gz
│   └── SampleA_trimmed_R2.fastq.gz
│
├── align/                           # alignments (BAM files)
│   ├── SampleA.sorted.bam
│   └── SampleA.sorted.bam.bai
│
├── counts/                          # gene-level count matrices
│   └── counts.tsv
│
└── results/                         # downstream tables & figures
    ├── de_analysis/
    └── enrichment/
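The layout above can be created in one step. The following is a minimal sketch, assuming the directory names shown in the tree; the helper name `make_project_layout` is hypothetical and not part of any tool.

```bash
# Hypothetical helper: create the empty skeleton under a chosen root
# (defaults to "project_root" when no argument is given).
make_project_layout() {
    local root="${1:-project_root}"
    # mkdir -p creates parents as needed and is a no-op for existing directories
    mkdir -p \
        "$root/metadata" \
        "$root/raw_data" \
        "$root/qc" \
        "$root/trimmed" \
        "$root/align" \
        "$root/counts" \
        "$root/results/de_analysis" \
        "$root/results/enrichment"
}

make_project_layout "project_root"
```

Because `mkdir -p` is idempotent, re-running the helper on an existing project should leave its contents untouched.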
Tip: keep your `samplesheet.tsv` under `metadata/` and refer to it in all pipeline steps to automatically assign file paths and group labels.
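As an illustration of driving file paths from the samplesheet, here is a small sketch. The column layout (`sample_id`, `run_accession`, `condition`, tab-separated, with one header row) is an assumption, as is the function name `list_sample_paths`; adapt both to your own sheet.

```bash
# Hypothetical columns: sample_id <TAB> run_accession <TAB> condition
# Prints: sample_id, condition, expected R1 path, expected R2 path (tab-separated).
list_sample_paths() {
    local sheet="$1" raw_dir="${2:-raw_data}"
    local sample_id run_acc condition
    # tail -n +2 skips the header row; the "|| [ -n ... ]" clause keeps a
    # final line that lacks a trailing newline
    tail -n +2 "$sheet" |
    while IFS=$'\t' read -r sample_id run_acc condition || [ -n "$sample_id" ]; do
        [ -n "$sample_id" ] || continue   # skip blank lines
        printf '%s\t%s\t%s\t%s\n' \
            "$sample_id" "$condition" \
            "$raw_dir/${sample_id}_R1.fastq.gz" \
            "$raw_dir/${sample_id}_R2.fastq.gz"
    done
}

# Usage: list_sample_paths metadata/samplesheet.tsv
```

Downstream steps can read this output instead of spelling out file names, so adding a sample only requires a new row in the sheet.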
Fetching from NCBI SRA
- Install SRA Toolkit
conda install -c bioconda sra-tools
- Download & split paired-end reads
# example accession SRR1234567
prefetch SRR1234567
# convert to FASTQ, split paired reads:
fasterq-dump SRR1234567 \
--split-files \
--outdir raw_data \
--threads 4
# optionally gzip the outputs
pigz -p 4 raw_data/SRR1234567_*.fastq
- Rename for clarity (match SampleID in your metadata)
mv raw_data/SRR1234567_1.fastq.gz raw_data/SampleA_R1.fastq.gz
mv raw_data/SRR1234567_2.fastq.gz raw_data/SampleA_R2.fastq.gz
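With more than a handful of runs, the download and rename steps are easier to manage as functions. This is a sketch under stated assumptions: `rename_run` and `fetch_run` are hypothetical names, the samplesheet columns (`sample_id`, `run_accession`) are assumed, and `fetch_run` simply chains the same commands shown above.

```bash
# Rename <run>_1/_2.fastq.gz to <sample>_R1/_R2.fastq.gz inside a directory.
rename_run() {
    local run="$1" sample="$2" dir="${3:-raw_data}"
    local mate src dst
    for mate in 1 2; do
        src="$dir/${run}_${mate}.fastq.gz"
        dst="$dir/${sample}_R${mate}.fastq.gz"
        if [ ! -e "$src" ]; then
            echo "missing: $src" >&2
            return 1
        fi
        mv -n "$src" "$dst"    # -n: never overwrite an existing file
    done
}

# Download, convert, compress, and rename one paired-end run.
fetch_run() {
    local run="$1" sample="$2" dir="${3:-raw_data}"
    prefetch "$run" &&
    fasterq-dump "$run" --split-files --outdir "$dir" --threads 4 &&
    pigz -p 4 "$dir/${run}"_*.fastq &&
    rename_run "$run" "$sample" "$dir"
}

# Usage: loop over the samplesheet (assumed columns: sample_id, run_accession, ...)
# tail -n +2 metadata/samplesheet.tsv | while IFS=$'\t' read -r sample run _; do
#     fetch_run "$run" "$sample" < /dev/null   # keep tools from consuming the loop's stdin
# done
```

Failing early on a missing mate file is a deliberate choice here: a single-end run or an interrupted download then surfaces as an error rather than as a half-renamed pair.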
Fetching from ENA (European Nucleotide Archive)
- Use `wget` or `curl`
# ENA places 10-character run accessions under a zero-padded subdirectory
# named after the last digit (here 007 for SRR1234567)
# Example with wget
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_2.fastq.gz
# Example with curl
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/SRR1234567_2.fastq.gz
- Recursive download of all FASTQ
# grab all .fastq.gz files in one go
wget -r -nd -P raw_data \
-A "*_1.fastq.gz","*_2.fastq.gz" \
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/007/SRR1234567/
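Typing ENA directory paths by hand is error-prone, so it can help to derive them from the run accession. The sketch below assumes ENA's usual FTP layout: the first six characters of the accession, then (for accessions longer than nine characters) a subdirectory built from the remaining digits, zero-padded to three places. The function name `ena_fastq_dir` is hypothetical; check the result against the ENA browser for your own runs.

```bash
# Hypothetical helper: print the ENA FTP directory for a run accession.
ena_fastq_dir() {
    local acc="$1"
    local base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
    local prefix="${acc:0:6}"   # first six characters, e.g. SRR123
    local extra="${acc:9}"      # anything beyond the ninth character
    if [ -z "$extra" ]; then
        # 9-character accession: no extra subdirectory
        printf '%s/%s/%s\n' "$base" "$prefix" "$acc"
    else
        # zero-pad the trailing digits to three places (7 -> 007, 78 -> 078);
        # 10# forces base 10 so values like "08" are not parsed as octal
        printf '%s/%s/%03d/%s\n' "$base" "$prefix" "$((10#$extra))" "$acc"
    fi
}

# Usage:
# wget -P raw_data "$(ena_fastq_dir SRR1234567)/SRR1234567_1.fastq.gz"
```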
Notes & Best Practices
- Always verify checksums (if provided) to ensure data integrity.
- Keep raw data read-only; perform trimming, alignment, etc. in separate folders.
- Use your `samplesheet.tsv` to map SampleID → SRR run; avoid hard-coding file names.
- If you expect large downloads, consider running fetch steps on a server or via Aspera (`ascp`) for speed.
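The first two practices can be scripted. This is a minimal sketch, assuming GNU `md5sum` is available (on macOS the equivalent tool differs) and that you have a manifest in `md5sum` format, one `<md5>  <filename>` line per file. The manifest name `md5sums.txt` and both function names are hypothetical.

```bash
# Check every file listed in an md5sum-style manifest stored in the data directory.
verify_checksums() {
    local dir="$1" manifest="${2:-md5sums.txt}"
    # run inside the directory so relative filenames in the manifest resolve;
    # md5sum -c exits non-zero if any file fails the check
    ( cd "$dir" && md5sum -c "$manifest" )
}

# Remove write permission from all downloaded FASTQs.
lock_raw_data() {
    local dir="${1:-raw_data}"
    find "$dir" -type f -name '*.fastq.gz' -exec chmod a-w {} +
}

# Usage:
# verify_checksums raw_data md5sums.txt && lock_raw_data raw_data
```

Locking only after a successful check means a corrupted download can still be replaced without first restoring permissions.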
With raw FASTQs neatly organized and paired to your metadata, you're ready for Quality Control and the rest of the RNA-Seq pipeline.