Download from SRA - BenLangmead/jhu-compute GitHub Wiki
Downloading from SRA is a common task for us. It's also not completely straightforward.
While many of the sequence run accessions may be found on the European mirror of SRA (ENA), not all will be there. See https://github.com/BenLangmead/jhu-compute/wiki/Download-FASTQs-from-ENA
First, it's helpful to know there's some fairly up-to-date info from the SRA folks themselves: https://github.com/ncbi/sra-tools/wiki
Second, you will need to install a version (preferably the latest) of SRA-tools to download from SRA.
If you're on a well maintained HPC system (e.g. TACC) there's probably a form of the module
system running.
module spider sra
may result in a module you can module load <sratool_module_name>
Otherwise conda's probably the easiest:
conda install -c bioconda sra-tools
As of 2019-07-29 the most consistently stable approach to downloading from SRA (public data) is to use the following strategy:
- prefetch HTTPS to download cSRA formatted files
- parallel-fastq-dump to convert from cSRA to actual FASTQ
generic prefetch
command line:
prefetch --max-size 200G -L info -t http -O <outputdir> <RUN_ACCESSION>
working example:
prefetch --max-size 200G -L info -t http -O ./ ERR204946
If you do want to use Aspera for speed (and you know it'll work on your particular accession), just swap the http
in the -t
option with fasp
.
By leveraging the -N
and -X
(range of spots) options of fastq-dump
, parallel-fastq-dump
can run multiple fastq-dump
s in parallel on the same downloaded cSRA file.
https://github.com/rvalieris/parallel-fastq-dump
We have found parallel-fastq-dump
to consistently work with certain fastq-dump
options better than fasterq-dump
.
generic parallel-fastq-dump
command line:
parallel-fastq-dump --sra-id /path/to/downloaded_cSRA_files --threads <num_fastq_dump_processes> --tmpdir /path/to/temporary_working_dir -L info --split-3 --skip-technical --outdir ./ --gzip
NOTE: the argument to --tmpdir
needs to exist before running the command.
working example:
parallel-fastq-dump --sra-id ./ERR204946.sra --threads 4 --tmpdir ./tmp -L info --split-3 --skip-technical --outdir ./ --gzip
If for some reason parallel-fastq-dump
isn't available, you can fall back to using good old fastq-dump
it'll just be slower (and the options may be a little different).
The above will work in either a one-off way or for a pipeline.
However, additional consideration should be taken in the case of a pipeline processing many run accessions in a batch.
- To save temporary working space, configure SRA-tools to NOT "Enable Local File Caching (2)"
- If running downloads from within in a container (Docker/Singularity) you may need to modify the SRA-tools configuration paths for downloading files to work within the container context