Project 3 downloading_contigs - ncbi/workshop-asm-ngs-2022 GitHub Wiki

Project 3 Downloading contigs

A slow step in Project 3 is downloading the contig sequences (Step 2). Below are some alternative ways to speed it up by parallelizing the process.

These all assume that you already have the 293aa_kpc_contigs.out file created in Step 1.

Download the contig sequences in parallel on one VM using GNU parallel

GNU parallel runs multiple processes at once, building each command from the lines it reads on STDIN. Here we split the list of URLs to download into multiple files and run 12 jobs in parallel. A few rounds of test downloads on a subset of the URLs showed me that 12 was a reasonably good value.

First we isolate just the URLs from the bq output:

fgrep 'gs://' 293aa_kpc_contigs.out | awk '{print $4}' > kpc_contig_urls
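To see what this filter keeps and drops, it can help to run it on a tiny made-up file first. The sketch below is illustrative only: the accessions, bucket name, and table layout are hypothetical, and the only thing carried over from the command above is the assumption that the URL is the fourth whitespace-separated field on each data row.

```shell
# Hypothetical stand-in for the bq output: a pretty-printed table in which
# the gs:// URL is the 4th whitespace-separated field of each data row.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.out" <<'EOF'
+------------+--------------------------------------+
| contig_acc | contig_url                           |
+------------+--------------------------------------+
| EXAMPLE_01 | gs://example-bucket/EXAMPLE_01.fa.gz |
| EXAMPLE_02 | gs://example-bucket/EXAMPLE_02.fa.gz |
+------------+--------------------------------------+
EOF

# Same filter as above (grep -F is the current spelling of fgrep).
grep -F 'gs://' "$tmpdir/sample.out" | awk '{print $4}' > "$tmpdir/sample_urls"

# Count the lines that look like URLs and the lines that do not.
url_count=$(grep -c '^gs://' "$tmpdir/sample_urls" || true)
bad_count=$(grep -vc '^gs://' "$tmpdir/sample_urls" || true)
echo "urls=$url_count other=$bad_count"   # prints: urls=2 other=0
```

Running the same two counts on the real kpc_contig_urls file is a cheap way to confirm that every line is a gs:// URL before starting the downloads.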

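Then we use split to divide the file into 12 approximately equal parts, one per parallel job. The l/12 chunk mode (GNU split) divides the file into 12 pieces without breaking any line, so every URL stays intact.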

split -n l/12 kpc_contig_urls
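The effect of this split mode can be checked on a throwaway list. This is a minimal sketch under stated assumptions: the 100 URLs are made up, GNU split is assumed, and an explicit output prefix is added so the chunks land in a scratch directory rather than the current one.

```shell
# Hypothetical URL list: 100 made-up gs:// paths, one per line.
tmpdir=$(mktemp -d)
seq 1 100 | sed 's|^|gs://example-bucket/contig_|' > "$tmpdir/urls"

# Same split mode as above (GNU split); the last argument sets the output
# prefix so the chunks are written to the scratch directory as xaa, xab, ...
split -n l/12 "$tmpdir/urls" "$tmpdir/x"

# Count the chunks and the total number of lines across them.
chunk_count=$(ls "$tmpdir"/x?? | wc -l)
line_count=$(cat "$tmpdir"/x?? | wc -l)
echo "chunks=$chunk_count lines=$line_count"   # prints: chunks=12 lines=100

# The chunks, concatenated in name order, should reproduce the input exactly.
cat "$tmpdir"/x?? | cmp -s - "$tmpdir/urls" && echo "chunks match the input"
```

Because the split is by size rather than by line count, the chunks hold roughly, not exactly, the same number of URLs.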

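Then we use parallel to run 12 jobs at once, using the files created by split (xaa, xab, ...) as input. Each job pipes one file's URLs to gsutil cp -I, which reads the list of objects to copy from STDIN. We're discarding gsutil's stderr output because it's not particularly useful here (and there is a lot of it). Note that this also hides any error messages.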

time ls x?? | parallel -j 12 "cat {} | gsutil -m cp -I contigs 2> /dev/null"
real	2m16.951s
user	15m5.962s
sys	2m10.548s
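Since the command above sends gsutil's messages to /dev/null, a failed download would go unnoticed. One way to check the result is to compare the file names expected from the URL list with what actually arrived. The sketch below runs on made-up scratch data. It assumes that each object is saved under the last path component of its URL, which you should confirm against your own contigs directory. All names in it are hypothetical.

```shell
# Hypothetical scratch data standing in for kpc_contig_urls and contigs/.
tmpdir=$(mktemp -d)
mkdir "$tmpdir/contigs"
cat > "$tmpdir/urls" <<'EOF'
gs://example-bucket/a/EXAMPLE_01.fa.gz
gs://example-bucket/a/EXAMPLE_02.fa.gz
gs://example-bucket/b/EXAMPLE_03.fa.gz
EOF
# Pretend that two of the three downloads succeeded.
touch "$tmpdir/contigs/EXAMPLE_01.fa.gz" "$tmpdir/contigs/EXAMPLE_03.fa.gz"

# File names we expect: the last path component of each URL.
sed 's|.*/||' "$tmpdir/urls" | sort > "$tmpdir/expected"
# File names actually present in the download directory.
ls "$tmpdir/contigs" | sort > "$tmpdir/downloaded"

# Names that were expected but are not present.
comm -23 "$tmpdir/expected" "$tmpdir/downloaded" > "$tmpdir/missing_names"
missing_count=$(wc -l < "$tmpdir/missing_names")
echo "missing=$missing_count"
cat "$tmpdir/missing_names"
```

On the real data, the same comparison would use kpc_contig_urls and the contigs directory in place of the scratch paths. An empty list of missing names suggests every URL was fetched.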

You can now return to Project 3, Step 3.