Project_B - ncbi/workshop-asm-ngs-2022 GitHub Wiki

Project B: Find SARS-CoV-2 data with low reference coverage

Part 1: Retrieving SRA data

There is more than one way to retrieve SRA data, and depending on needs you may want to use one of the methods below.

Command 1: Letting the SRAToolkit do all the work

By default, if you provide the toolkit an accession, it will find the "closest" copy of the record to you and retrieve it.

    fasterq-dump --split-3 --fasta --skip-technical --seq-defline '>$ac.$si_$ri' -f SRR12379805

Command 2: Using SRAToolkit's SDL function to locate the data and then retrieve it yourself

First we will find all the places the data exists.

    wget -q "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?acc=SRR15515762&accept-alternate-locations=yes" -O - | jq -r '.result[].files[].locations[] | [.service, .region, .link] | @tsv' | column -t

Command 3: Then we download a file and extract the fastq from one of those locations.

    wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR15515762/SRR15515762 -O SRR15515762
    fasterq-dump --split-3 --fasta --seq-defline '>$ac.$si_$ri' -f ./SRR15515762

Part 2: Retrieving a reference genome sequence

Command 4: To retrieve the reference genome as a fasta, we can use edirect.

efetch -id NC_045512 -db nuccore -format fasta >  NC_045512.2.fa

Part 3: Writing bash scripts to calculate and plot reference genome coverage

Part 3a: A bash script to calculate reference coverage, using minimap2 and samtools.

Command 5: First create a file for the coverage calculation script

    vim illumina.sh

Command 6: Then copy and paste the following code into the file

Press i to begin editing the file, then, after pasting the code, esc, then ctrl+:, then x to save and exit vim.

    #!/bin/bash -eu
    acc=$1
    ref=$(realpath $2)
    shift 2
    fastas=$@
    minimap2 -a -x sr $ref $fastas 2>minimap2.log | samtools view -Sb - | samtools sort - > $acc.bam
    samtools depth $acc.bam > $acc.depth

Then enter chmod u+x illumina.sh

Part 3b: Create a Coverage Plot script.

Command 7: Create a file for the coverage plot script.

    vim cov_plot.sh

Command 8: Then copy and paste the following code into the file

Press i to begin editing the file, then, after pasting the code, esc, then ctrl+:, then x to save and exit vim.

    #!/bin/bash -eu
    depth=$1
    width=79
    height=24
    reverse=noreverse
    terminal=dumb
    plotwith=dots
    tmp=$(mktemp)
    cat $depth | cut -f 2,3 > $tmp
    ext=$(mktemp)
    awk 'NR > 1 { if ($2>10 && n>10 && ($2+$2<n || n+n<$2)) { print line; ; } }  { line=$0; n=$2 }' $tmp > $ext
    VAR=$(cat <<- EOM
        set term $terminal $width $height;     
        set xlabel 'Position';
        set ylabel 'Coverage (bases)';
        set xrange [] $reverse;
        set style line 1 lc rgb '#8B1A0E' pt 0 ps 1 lt 1 lw 2;
        set style line 2 lc rgb '#5E9C36' pt 0 ps 1 lt 1 lw 2;
        set tics nomirror;
        plot '$tmp' with lp ls 1 title '$depth';
    EOM
    )
    gnuplot -p -e "$VAR"; # 2>/dev/null
    rm $tmp $ext

Then enter chmod u+x cov_plot.sh

Part 4: Analyzing your SRA records

Command 9: Now analyze the two SRA records

Does the coverage look good for both, should we exclude some records from downstream analysis due to low coverage?

./illumina.sh SRR12379805 NC_045512.2.fa SRR12379805_*.fasta
./cov_plot.sh SRR12379805.depth
⚠️ **GitHub.com Fallback** ⚠️