Obtain Data - mlbendall/telescope_tutorial GitHub Wiki
2-i. Identifying and obtaining RNA-seq data from GEO and SRA
Although generating new data is important for answering specific research questions, there is a wealth of existing, publicly accessible data that can be leveraged for novel insights and generating hypotheses.
Identify datasets of interest using GEO
The Gene Expression Omnibus (GEO) is a public functional genomics data repository that stores expression data, including microarray and RNA-seq data. An overview of the database is available on the GEO website. GEO is especially useful because samples are grouped into "DataSets": curated collections of biologically and statistically comparable GEO Samples.
Use SRA database to fetch sequence data and metadata
The Sequence Read Archive (SRA) makes biological sequence data available to the research community to enhance reproducibility and allow for new discoveries by comparing data sets. This database is where actual sequence data is stored and can be downloaded by researchers. The key to finding sequence data for a given experiment is the SRA accession number.
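SRA accessions come in several types, and the prefix indicates which level of the archive an accession refers to: SRP for a study (project), SRS for a sample, SRX for an experiment, and SRR for a run. The partner archives ENA and DDBJ use the same scheme with an E or D as the first letter. As a minimal sketch, the helper below (sra_accession_type is a hypothetical name, not part of any SRA tool) classifies an accession by its prefix:

```shell
# Classify an SRA-style accession by its three-letter prefix.
# sra_accession_type is a hypothetical helper written for illustration only.
sra_accession_type() {
    case "$1" in
        [SED]RP[0-9]*) echo "study" ;;       # project / study level
        [SED]RS[0-9]*) echo "sample" ;;      # biological sample
        [SED]RX[0-9]*) echo "experiment" ;;  # library and sequencing setup
        [SED]RR[0-9]*) echo "run" ;;         # the actual sequence data
        *)             echo "unknown" ;;
    esac
}

# The project accession used later in this section
sra_accession_type SRP050377
```

A GEO series page typically links to a study-level (SRP) accession, while the downloadable sequence files correspond to individual runs (SRR).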
Other databases
Raw genomic sequence data contains information that could potentially be used to re-identify the original subjects in a study. Access to genomic data from clinical samples is therefore controlled to protect individual privacy. Often, these projects will still appear in GEO and/or SRA, but the raw sequence data will only be accessible by applying for permission through dbGaP.
Practical Exercise 2
- Part 1: Search GEO for RNA-seq datasets related to autism using human iPSCs. Identify the SRA project accession number corresponding to this dataset.
- Part 2: Use the cbi_sra_metadata tool to download information about this dataset.
Solution for Practical Exercise 2
module load cbiC1
cbi_sra_metadata2 --email <your_email_address> fetch SRP050377
cbi_sra_metadata2 urls SRP050377 > metadata/sample_urls.txt
cbi_sra_metadata2 samples SRP050377 > metadata/sample_matrix.txt
tail -n+2 metadata/sample_matrix.txt | cut -f1 > samples.txt
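The last command above implies that the sample matrix is tab-delimited, has one header row, and carries the sample identifier in its first column. The sketch below reproduces that extraction on a small mock file and adds two sanity checks. The column names and run IDs are made up for illustration; the real output of cbi_sra_metadata2 may look different.

```shell
# Build a small mock sample matrix in a temporary directory.
# Column names and run IDs are hypothetical placeholders.
tmpdir=$(mktemp -d)
printf 'run\tcondition\n'        >  "$tmpdir/sample_matrix.txt"
printf 'SRR0000001\tcontrol\n'   >> "$tmpdir/sample_matrix.txt"
printf 'SRR0000002\tcase\n'      >> "$tmpdir/sample_matrix.txt"

# Same extraction as in the solution: skip the header, keep column 1.
tail -n +2 "$tmpdir/sample_matrix.txt" | cut -f1 > "$tmpdir/samples.txt"

# Sanity checks: how many samples, and are any IDs duplicated?
n_samples=$(wc -l < "$tmpdir/samples.txt" | tr -d ' ')
n_dups=$(sort "$tmpdir/samples.txt" | uniq -d | wc -l | tr -d ' ')
echo "samples: $n_samples, duplicated IDs: $n_dups"
```

Running the same two checks on the real samples.txt is a quick way to confirm the list matches the number of samples reported for the project.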
Extracting FASTQ files from SRA
Download all the SRA files
wget -i metadata/sample_urls.txt
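Downloads of many large files can fail partway, so it may be worth confirming that every sample has a file before moving on. The sketch below assumes each download is saved as <sample>.sra, which is what the next step expects. check_sra_downloads is a hypothetical helper, and the demonstration uses placeholder files rather than real SRA data.

```shell
# Print every sample ID that has no non-empty <sample>.sra file.
# check_sra_downloads is a hypothetical helper written for illustration.
# Usage: check_sra_downloads SAMPLES_FILE [DIRECTORY]
check_sra_downloads() {
    samples_file=$1
    dir=${2:-.}
    while read -r samp; do
        [ -n "$samp" ] || continue                 # skip blank lines
        [ -s "$dir/$samp.sra" ] || echo "$samp"    # missing or empty file
    done < "$samples_file"
}

# Demonstration with placeholder files: one present, one missing.
demo=$(mktemp -d)
printf 'SRR0000001\nSRR0000002\n' > "$demo/samples.txt"
printf 'placeholder' > "$demo/SRR0000001.sra"
check_sra_downloads "$demo/samples.txt" "$demo"
```

In the tutorial directory the equivalent call would be check_sra_downloads samples.txt, and empty output means every listed sample has a file. This only checks for presence, not integrity of the download.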
Make directory for each sample and move .sra files
while read -r samp; do mkdir -p "$samp" && mv "$samp.sra" "$samp"; done < samples.txt
Extract FASTQ from SRA using fastq-dump
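The arguments -N and -X set the minimum and maximum spot IDs to extract. Here they select spots 1,000,001 through 1,500,000, which limits the output to 500K read pairs.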
while read -r samp; do
    fastq-dump -N 1000001 -X 1500000 -F -B -Q 33 --split-files -R --defline-qual '+' --gzip -O "$samp" "$samp/$samp.sra"
done < samples.txt
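The spot range is inclusive, so the expected number of read pairs follows directly from the two spot IDs. The sketch below works out that number and shows one way to count the records in a gzipped FASTQ file, assuming the standard four lines per record. count_fastq_reads is a hypothetical helper, and the demonstration uses a two-record mock file rather than real output.

```shell
# Expected number of spots for -N 1000001 -X 1500000 (inclusive range).
min_spot=1000001
max_spot=1500000
expected_pairs=$((max_spot - min_spot + 1))
echo "expected read pairs: $expected_pairs"

# Count records in a gzipped FASTQ file, assuming 4 lines per record.
# count_fastq_reads is a hypothetical helper written for illustration.
count_fastq_reads() {
    lines=$(gzip -dc "$1" | wc -l | tr -d ' ')
    echo $((lines / 4))
}

# Demonstration on a mock file containing two records.
fq="$(mktemp -d)/demo_1.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$fq"
count_fastq_reads "$fq"
```

With --split-files and --gzip, each sample directory would be expected to contain files such as <sample>_1.fastq.gz and <sample>_2.fastq.gz for paired-end data. Both counts should be equal, and no larger than the expected number of pairs. A run with fewer than 1,500,000 spots will yield fewer reads.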