Project 3 downloading_contigs - ncbi/workshop-asm-ngs-2022 GitHub Wiki
Project 3 Downloading contigs
A slow step in Project 3 is downloading the contig sequences (Step 2). Below are some alternative ways to speed it up by parallelizing the process.
These all assume that you already have the 293aa_kpc_contigs.out file created in Step 1.
parallel
Download the contig sequences in parallel on one VM using GNU parallel. GNU parallel runs multiple processes based on what is passed to it on STDIN. Here we split the list of URLs we need to download into multiple files and run 12 jobs in parallel (a few rounds of test downloads on a subset showed me that 12 was a reasonably good value).
First we isolate just the URLs from the bq output:
fgrep 'gs://' 293aa_kpc_contigs.out | awk '{print $4}' > kpc_contig_urls
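To see what this pipeline keeps, here is a small sketch run against a made-up sample. The two-column table layout, the contig names, and the bucket paths are assumptions for illustration only; if the real 293aa_kpc_contigs.out has a different layout, the awk field number would need adjusting.

```shell
# Hypothetical sample mimicking a pretty-printed bq table (layout assumed).
cat > sample_contigs.out <<'EOF'
+----------+-----------------------------------------+
| contig   | url                                     |
+----------+-----------------------------------------+
| contig_1 | gs://example-bucket/contigs/contig_1.fa |
| contig_2 | gs://example-bucket/contigs/contig_2.fa |
+----------+-----------------------------------------+
EOF

# grep -F matches the literal string gs:// (equivalent to fgrep), which
# drops the header and border lines.
# awk splits on whitespace, so a data row reads: "|" name "|" URL "|",
# making the URL field number 4.
grep -F 'gs://' sample_contigs.out | awk '{print $4}' > sample_urls

cat sample_urls
```

Running head on the real kpc_contig_urls file is a quick way to confirm that every line is a bare gs:// URL before moving on.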
Then we use split to divide the file into 12 approximately equal parts, one per parallel job.
split -nl/12 kpc_contig_urls
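As a quick check of what split produces, the sketch below runs the same option on a throwaway list of 100 lines, a stand-in for kpc_contig_urls created in a scratch directory. It assumes GNU split (coreutils). The l/12 form splits on line boundaries, so no URL is cut in half, and the pieces get the default names xaa through xal.

```shell
# Work in a scratch directory so the x?? files don't mix with real data.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for kpc_contig_urls: 100 dummy lines (hypothetical data).
seq 1 100 > demo_urls

# -n l/12: make 12 chunks without breaking any line across two files.
split -n l/12 demo_urls

# List the pieces: xaa through xal.
ls x??

# Concatenating the pieces in order reproduces the input exactly.
cat x?? | cmp - demo_urls && echo "pieces match the original"
```

The pieces are balanced by size in bytes rather than by line count, so the number of URLs per file can differ slightly.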
Then we use parallel to run 12 jobs at once, using the new files created by split as input. We're redirecting the error output of gsutil (2>) to /dev/null because it's not particularly useful here (and there is a lot of it).
time ls x?? | parallel -j 12 "cat {} | gsutil -m cp -I contigs 2> /dev/null"
real 2m16.951s
user 15m5.962s
sys 2m10.548s
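Because the gsutil error output is discarded, a failed download would go unnoticed. One possible check, sketched below on the assumption that each URL ends in a distinct file name, is to compare the expected names against what actually landed in contigs. The mock URLs and empty files are placeholders so the snippet runs without network access; in practice you would skip the mock setup and run the last three commands next to the real kpc_contig_urls and contigs.

```shell
# Scratch directory with mock inputs (hypothetical names throughout).
workdir=$(mktemp -d)
cd "$workdir"

# Three expected URLs, but only two files "downloaded".
printf '%s\n' \
  gs://example-bucket/a.fa \
  gs://example-bucket/b.fa \
  gs://example-bucket/c.fa > kpc_contig_urls
mkdir -p contigs
touch contigs/a.fa contigs/b.fa

# Expected file names: the last path component of each URL.
sed 's|.*/||' kpc_contig_urls | sort > expected

# What is actually present in the download directory.
ls contigs | sort > downloaded

# comm -23 keeps lines found only in the first file:
# names that were expected but are missing from contigs.
comm -23 expected downloaded > missing
cat missing
```

An empty missing file means every URL has a matching file in contigs; any names listed can be fed back through gsutil cp.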