Obtain Data - mlbendall/telescope_tutorial GitHub Wiki

2-i. Identifying and obtaining RNA-seq data from GEO and SRA

Although generating new data is important for answering specific research questions, there is a wealth of existing, publicly accessible data that can be leveraged to gain new insights and generate hypotheses.

Identify datasets of interest using GEO

The Gene Expression Omnibus (GEO) is a public functional genomics data repository that stores expression data, including microarray and RNA-seq data. An overview of the database is available on the GEO website. GEO is a great resource because samples are grouped into "DataSets": curated collections of biologically and statistically comparable GEO Samples.

Use SRA database to fetch sequence data and metadata

The Sequence Read Archive (SRA) makes biological sequence data available to the research community to enhance reproducibility and allow for new discoveries by comparing data sets. This database is where actual sequence data is stored and can be downloaded by researchers. The key to finding sequence data for a given experiment is the SRA accession number.
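SRA accessions encode the record type in their prefix: studies (projects) begin with SRP, samples with SRS, experiments with SRX, and runs with SRR, while records submitted through ENA or DDBJ use E or D as the first letter. As a minimal sketch, the helper below (the name `sra_accession_type` is hypothetical, not part of any SRA tool) classifies an accession by prefix alone. It does not query the SRA or check that the accession exists.

```shell
# Hypothetical helper: classify an SRA-style accession by its prefix.
# Only the prefix pattern is checked; nothing is looked up online.
sra_accession_type() {
    case "$1" in
        [SED]RP[0-9]*) echo "study" ;;
        [SED]RS[0-9]*) echo "sample" ;;
        [SED]RX[0-9]*) echo "experiment" ;;
        [SED]RR[0-9]*) echo "run" ;;
        *)             echo "unknown" ;;
    esac
}

# The project accession used later in this section:
sra_accession_type SRP050377   # prints "study"
```

Knowing the record type helps when searching, since a single study accession usually maps to many run accessions, and the runs are what hold the actual sequence data.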

Other databases

Raw genomic sequence data contains information that could potentially be used to re-identify the original subjects in a study. Access to genomic data from clinical samples is thus controlled to protect individual privacy. Often, these projects will still appear in GEO and/or SRA, but the raw sequence data will only be accessible by applying for permission through dbGaP (the database of Genotypes and Phenotypes).


Practical Exercise 2

  • Part 1: Search GEO for RNA-seq datasets related to autism using human iPSCs. Identify the SRA project accession number corresponding to this dataset.

  • Part 2: Use the cbi_sra_metadata2 tool to download information about this dataset.


Solution for Practical Exercise 2

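module load cbiC1
# Make sure the output directory for the redirects below exists
mkdir -p metadata
# Replace <your_email> with your own email address
cbi_sra_metadata2 --email <your_email> fetch SRP050377
cbi_sra_metadata2 urls SRP050377 > metadata/sample_urls.txt
cbi_sra_metadata2 samples SRP050377 > metadata/sample_matrix.txt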

tail -n+2 metadata/sample_matrix.txt | cut -f1 > samples.txt
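The last command assumes that sample_matrix.txt is tab-delimited, has a single header row, and lists the sample identifier in the first column. That is an assumption about the cbi_sra_metadata2 output, which is not shown here. On a small made-up table (the column names and values are illustrative only), the same pipeline behaves as follows:

```shell
# Build a made-up, tab-delimited sample matrix with one header row.
demo_matrix=$(mktemp)
printf 'sample\tlayout\n'   >  "$demo_matrix"
printf 'sampleA\tPAIRED\n'  >> "$demo_matrix"
printf 'sampleB\tPAIRED\n'  >> "$demo_matrix"

# tail -n +2 skips the header row; cut -f1 keeps only the first column.
demo_samples=$(tail -n +2 "$demo_matrix" | cut -f1)
echo "$demo_samples"

rm -f "$demo_matrix"
```

This prints one identifier per line (sampleA, then sampleB), which is the format the loops in the next steps expect from samples.txt.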

Extracting FASTQ files from SRA

Download all the SRA files

wget -i metadata/sample_urls.txt

Make directory for each sample and move .sra files

while read -r samp; do mkdir -p "$samp" && mv "$samp.sra" "$samp/"; done < samples.txt
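Before extracting reads, it can be worth confirming that every sample listed in samples.txt has a non-empty .sra file in its own directory. The following is a minimal sketch; the function name `missing_sra` is hypothetical, and it assumes the `<sample>/<sample>.sra` layout produced by the move step.

```shell
# Hypothetical helper: print each sample whose .sra file is missing or empty.
# Expects one sample name per line in the file given as the first argument.
missing_sra() {
    while read -r _samp; do
        [ -n "$_samp" ] || continue                  # skip blank lines
        [ -s "$_samp/$_samp.sra" ] || echo "$_samp"  # -s: exists and is non-empty
    done < "$1"
}

# Run the check only if a sample list is present in the current directory.
if [ -f samples.txt ]; then
    missing_sra samples.txt
fi
```

No output means every file is in place. Any sample name that is printed points to a download that failed or was saved under a different file name.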

Extract FASTQ from SRA using fastq-dump

The -N and -X arguments give the first and last spot IDs to extract (1,000,001 through 1,500,000), which limits the output to 500K read pairs per sample.

while read -r samp; do
    fastq-dump -N 1000001 -X 1500000 -F -B -Q 33 --split-files -R --defline-qual '+' --gzip -O "$samp" "$samp/$samp.sra"
done < samples.txt
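As a quick sanity check, the number of spots requested by -N and -X can be computed directly and compared with the number of records in each output file. This is a sketch only: the helper name `count_fastq_reads` is hypothetical, and it assumes standard four-line FASTQ records.

```shell
# Spots requested by -N 1000001 -X 1500000 (both ends are inclusive).
expected_spots=$(( 1500000 - 1000001 + 1 ))
echo "$expected_spots"    # prints 500000

# Hypothetical helper: count records in a gzipped FASTQ file,
# assuming exactly four lines per record.
count_fastq_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Example call (the file name is a placeholder):
#   count_fastq_reads "<sample>/<sample>_1.fastq.gz"
```

For paired-end runs, --split-files typically writes one file per mate with the suffixes _1 and _2, so both files should report the same count. A count below 500000 may simply mean that the run contains fewer than 1.5 million spots.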