How to get publicly available RNAseq data with the SRA toolkit - Persilian/WGCNA GitHub Wiki
In the age of high-throughput sequencing, most published studies involving genome assemblies and transcriptomes make their sequencing data publicly available. This guide shows how to access such read data.
Select a publication
As an example, assume that you are interested in gene expression of Arabidopsis thaliana under heat and cold treatments. You will come across publications that used RNAseq to measure gene expression under these conditions and that were published with the requirement to deposit the RNAseq read data in a database of the National Center for Biotechnology Information, NCBI (https://www.ncbi.nlm.nih.gov).
Search for read data
Scientific publications usually contain a section called “accession numbers”, which lists the accession numbers of the sequencing projects released with the publication. As an example, we use the “Stress dataset” from Klepikova et al. 2016 (https://doi.org/10.1111/tpj.13312), which has the project ID “PRJNA324514”. You can simply search for this ID on Google and open the first result from NCBI. This takes you to the NCBI project page of the sequencing project. Here you can find detailed information about how the read data were generated and access the descriptions of the individual sequencing files.
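Instead of searching, the project page can usually be opened directly from its ID. The sketch below assumes the common NCBI URL pattern https://www.ncbi.nlm.nih.gov/bioproject/ followed by the ID; the function name bioproject_url is hypothetical, and the ID check is only a loose sanity test, not an official validation.

```shell
# Print the NCBI BioProject page URL for a project ID such as PRJNA324514.
# bioproject_url is a hypothetical helper name used for illustration only.
bioproject_url() {
    id=$1
    case $id in
        # loose check: "PRJ", two capital letters, then at least one digit
        PRJ[A-Z][A-Z][0-9]*) ;;
        *) echo "not a BioProject ID: $id" >&2; return 1 ;;
    esac
    printf 'https://www.ncbi.nlm.nih.gov/bioproject/%s\n' "$id"
}
```

Calling bioproject_url PRJNA324514 prints the address, which you can paste into a browser.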
Access individual read data files
On the project page you'll find a section “project data”, where you can access the sequencing datasets and other datasets, such as descriptions of the tissue samples used. Under the sequencing datasets, you will find a detailed description of each sequencing library, as well as links to the sequencing library files. To get the download link of a sequencing library of your choice, click its link in the “Runs” section. This leads you, for example, to the description of the 6 hour heat treated Arabidopsis leaf library (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3724785). Go to the “data access” tab, then copy the download link of the library and paste it into a text document. For run SRR3724739, for example, the link is https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR3724739/SRR3724739.1
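If you keep one download link per line in the text document, a small helper can list which run each link belongs to. This is a minimal sketch: the function name list_runs is hypothetical, and it assumes that the last path component of each link is the run accession followed by a version suffix, as in the example link above.

```shell
# Print "<run accession><TAB><link>" for every link in a text file
# that holds one download link per line.
# list_runs is a hypothetical helper name used for illustration only.
list_runs() {
    while IFS= read -r link; do
        [ -n "$link" ] || continue          # skip blank lines
        file=${link##*/}                    # last path component, e.g. SRR3724739.1
        printf '%s\t%s\n' "${file%%.*}" "$link"   # drop the suffix after the first "."
    done < "$1"
}
```

For a file named links.txt (an assumed name), list_runs links.txt gives a quick overview of the collected runs.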
Downloading read data with the SRA-toolkit
Once you have collected the download links of all the sequencing libraries you need, you can download the files and convert them with the SRA-toolkit. Install the SRA-toolkit on your Linux computer, for example in a miniconda3 environment.
conda create -n SRAtoolkit
conda activate SRAtoolkit
conda install -c conda-forge -c bioconda sra-tools
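Before continuing, it can help to confirm that the commands used in the following steps are available in the active environment. The helper below is a sketch; the name require_tools is hypothetical, and it only checks that each command is on your PATH, not that it works correctly.

```shell
# Report which of the given commands are missing from PATH.
# require_tools is a hypothetical helper name used for illustration only.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"   # 0 if every command was found, 1 otherwise
}
```

For this guide, require_tools wget fastq-dump covers the two commands that are needed.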
Download the sequencing libraries using the download-links you acquired before.
wget https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-1/SRR3724739/SRR3724739.1
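With several libraries, the same wget call can be repeated for every link in your text document. The following is a sketch under assumptions: the function name download_links and the DOWNLOADER variable are hypothetical, and the text document is assumed to hold one link per line. The wget options -c (resume a partial download) and -P (target directory) are standard wget options.

```shell
# Download every link listed in a text file (one URL per line).
# download_links and DOWNLOADER are hypothetical names; DOWNLOADER
# defaults to wget and can be replaced for a dry run.
download_links() {
    links_file=$1
    out_dir=${2:-.}                 # target directory, default: current directory
    mkdir -p "$out_dir"
    while IFS= read -r link; do
        [ -n "$link" ] || continue  # skip blank lines
        # -c resumes a partial download, -P sets the target directory
        "${DOWNLOADER:-wget}" -c -P "$out_dir" "$link" </dev/null || return 1
    done < "$links_file"
}
```

For example, download_links links.txt ./sra would place all files in a directory named sra; both names are assumptions. Setting DOWNLOADER=echo beforehand prints the arguments instead of downloading anything.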
Within your SRAtoolkit miniconda environment, use the SRA-toolkit command “fastq-dump” to extract .fastq files from your SRA files. The option -O sets the output directory and --gzip compresses the output files.
fastq-dump -O ./cold --gzip SRR3724739.1
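To convert several downloaded files in one go, the command above can be wrapped in a loop. This is a sketch, not a definitive workflow: the function name convert_all and the FASTQ_DUMP variable are hypothetical, the downloaded files are assumed to sit in one directory with names starting with SRR, and only the fastq-dump options already shown above are used.

```shell
# Convert every downloaded SRA file in a directory to gzipped FASTQ.
# convert_all and FASTQ_DUMP are hypothetical names; FASTQ_DUMP
# defaults to fastq-dump and can be replaced for a dry run.
convert_all() {
    sra_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for sra in "$sra_dir"/SRR*; do
        [ -f "$sra" ] || continue   # skip if the pattern matched no file
        # same options as in the single-file command: -O and --gzip
        "${FASTQ_DUMP:-fastq-dump}" -O "$out_dir" --gzip "$sra" || return 1
    done
}
```

A call such as convert_all ./sra ./cold would write all .fastq.gz files into ./cold, with ./sra being an assumed directory name.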
Rename the .fastq files according to your needs and process them further. Treat these files as raw reads, so run FastQC on them first. Options for improving raw reads are described here (https://informatics.fas.harvard.edu/best-practices-for-de-novo-transcriptome-assembly-with-trinity.html).
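Renaming is easier to reproduce when the old and new names are kept in a small table. The sketch below assumes a text file with two whitespace-separated columns per line (current file name, new file name); the function name rename_from_table and all file names in the usage note are hypothetical. Check the actual names that fastq-dump produced in your output directory before filling in the table.

```shell
# Rename files in a directory according to a two-column table:
#   <current file name> <new file name>
# rename_from_table is a hypothetical helper name used for illustration only.
rename_from_table() {
    dir=$1
    table=$2
    while read -r old new; do
        [ -n "$old" ] && [ -n "$new" ] || continue   # skip blank or incomplete lines
        if [ ! -e "$dir/$old" ]; then
            echo "missing, skipped: $dir/$old" >&2
            continue
        fi
        if [ -e "$dir/$new" ]; then
            echo "target exists, skipped: $dir/$new" >&2
            continue
        fi
        mv "$dir/$old" "$dir/$new"
    done < "$table"
}
```

With a table named names.txt, rename_from_table ./cold names.txt renames the listed files inside ./cold and leaves everything else untouched. Files that are missing or whose target name already exists are reported and skipped rather than overwritten.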