Project_C - ncbi/workshop-asm-ngs-2022 GitHub Wiki

Project C: Find variant calls that are common between paired Illumina and ONT SARS-CoV-2 records

Step 1: Different approaches to filtering by mutation

Before constructing the exact query we need, it is worth taking a minute to look at a couple approaches to using the variant call results to identify runs of potential interest.

Query 1: Filtering by genomic position in nucleotide space

First we look at how to filter to runs with mutations within some range of genomic coordinates on the reference sequence, NC_045512

    select * 
    from `bigquery-public-data.nih_sequence_read.annotated_variations`
    where POS > 1000 and POS < 2000

Query 2: Filtering by mutation types

We may also want to filter based on the type of mutation detected.

    select * 
    from `bigquery-public-data.nih_sequence_read.annotated_variations`
    where EFFECT like "NON_SYNONYMOUS%"

Query 3: Filtering by associated protein

Alternatively, we may want to find runs with mutations within a given protein's coding region.

    select * 
    from `bigquery-public-data.nih_sequence_read.annotated_variations`
    where protein_name = "S"

Query 4: Filtering by read support

Finally, we may want to filter to only runs with mutations which have some particular amount of read support.

    select * 
    from `bigquery-public-data.nih_sequence_read.annotated_variations`
    where DP >= 100 and G_AD_2 >49 AND G_AD_2/DP >=0.15

Step 2: Combine what's been learned in Projects A, B and C thus far

Next we will combined what we learned in Project A to identify a subset of records which are paired Illumina-ONT SARS-CoV-2 records with what we learned in Project B about how to identify records with low reference genome coverage. We will then combine that with variant call results to see how variant calls compare across the two platforms.

Query 5: Putting it all together

Note the use of multiple CTEs and the use of table joins. The first couple CTEs should look familiar as they are based on queries from Project A. The large list of accessions are accessions identified as either belonging to Biosamples where ARTICv3 was not indicated on runs for each platform or where runs from both platforms did not have at least 90% of the reference genome covered with at least a depth of 10.

with pairs as (  
    select *   
    from nih-sra-datastore.sra.metadata   
    where lower(organism) = "severe acute respiratory syndrome coronavirus 2"  
        and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "illumina")  
        and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "oxford_nanopore") 
        and acc not in ("ERR4458469","ERR6304269","ERR5668724","ERR4082209","ERR4439563","ERR5495454","ERR5500755","ERR4458490","ERR4458472","ERR5506735","ERR4458481","ERR4458442","ERR4458479","ERR5505783","ERR5091877","ERR5517438","ERR5633327","ERR5515901","ERR4365002","ERR5631741","ERR6297313","ERR5093802","ERR5499768","ERR4440121","ERR5631253","ERR4458405","ERR5507040","ERR5630405","ERR4082214","ERR6297717","ERR4082226","ERR4458434","ERR4458419","ERR5630017","ERR4366130","ERR5094303","ERR4365125","ERR4082307","ERR4458459","ERR4082195","ERR4439673","ERR5512480","ERR4082222","ERR4458398","ERR4082198","ERR4082194","ERR4439594","ERR5629600","ERR4082223","ERR4082215","ERR5514452","ERR4082221","ERR5631815","ERR4439752","ERR4439585","ERR4458482","ERR4439489","ERR4458410","ERR4439843","ERR5705739","ERR5498563","ERR4082216","ERR4082224","ERR5169382","ERR4082208","ERR5514445","ERR4082201","ERR4458456","ERR4458427","ERR5505512","ERR4458467","ERR4459683","ERR5476834","ERR4082206","ERR4082193","ERR4458443","ERR5631417","ERR4458450","ERR4458401","ERR5705120","ERR5509766","ERR4364963","ERR4458404","ERR5633724","ERR5628809","ERR4439281","ERR4082308","ERR6301383","ERR4085434","ERR5501735","ERR5516042","ERR5666611","ERR5629633","ERR5634532","ERR5516312","ERR5511667","ERR5068990","ERR5089731","ERR7313352","ERR4458460","ERR4458462","ERR5706286","ERR4664779","ERR4364960","ERR4458432","ERR4458464","ERR5508755","ERR4082219","ERR4458465","ERR4085396","ERR5497491","ERR5507192","ERR4439734","ERR5515580","ERR5669007","ERR4439642","ERR4082203","ERR4440037","ERR5500842","ERR5627824","ERR5633487","ERR4458446","ERR5497385","ERR5517491","ERR4082217","ERR5706212","ERR4363838","ERR5632429","ERR4082196","ERR5631763","ERR4082225","ERR4365241","ERR4439934","ERR4458457","ERR5092934","ERR5516765","ERR5747156","ERR7310600","ERR4082197","ERR5495122","ERR4458458","ERR5501241","ERR4082204","ERR4458415","ERR4082309","ERR5504174","ERR4365335","ERR5631732","ERR4458390","ERR4082213","ERR4364592","ERR5509159","ERR5506471","ERR4458483","ERR4458431","ERR5499921","ERR5517159","ERR5497159","ERR4458403","ERR7315481","ERR4082218","ERR5094116","ERR4440187","ERR4082205","ERR4082076","ERR4082212","ERR5512392","ERR5629480","ERR4082220","ERR4084890","ERR4082210","ERR4458466","ERR5631160","ERR5507462","ERR5514246","ERR5505820","ERR5510726","ERR5495833","ERR4082211") 
    ), 
runs as ( 
    select *  
    from pairs,  
    unnest(attributes) a  
    where lower(a.k) in ("artic_primer_version_exp", "primers_sam", "amplicon__pcr_primer_protocol_run", "primer_design_sam", "artic_primers_sam", 
    "artic_protocol_version_exp", "artic_barcode_identifiers_sam", "amplification_method_sam", "sequencing_protocol_name_run", 
    "amplicon_pcr_primer_scheme_run") and lower(a.v) like "%3"  
), 
vars as ( 
    select run as acc, concat(REF, POS, ALT) as nt_var, concat(protein_name, ':', variation) as aa_var   
    from `bigquery-public-data.nih_sequence_read.annotated_variations`
    where DP >= 100 and G_AD_2 >49 AND G_AD_2/DP >=0.15 and run in (select distinct acc from runs) 
), 
paired_vars as ( 
    select biosample, nt_var, aa_var, string_agg(distinct platform, " & " order by platform) as platforms 
    from runs 
    join vars using (acc) 
    group by biosample, nt_var, aa_var 
) 
select count(distinct biosample) as samples, paired_vars.platforms 
from paired_vars 
group by paired_vars.platforms

How might you go about generating the accession list yourself?

Step 3: Describe our resulting dataset

Now that we have our results, we might want to see if we can make some sense of it before doing deeper investigations.

Query 6: Finding average read support per group

Here we will slightly modify the prior query to look at overall depth, alternate allele depth, and alternate allele frequency for variants found in each group to see if that can help explain the results.

with pairs as (  
    select *   
    from nih-sra-datastore.sra.metadata   
    where lower(organism) = "severe acute respiratory syndrome coronavirus 2"  
        and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "illumina")  
        and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "oxford_nanopore") 
        and acc not in ("ERR4458469","ERR6304269","ERR5668724","ERR4082209","ERR4439563","ERR5495454","ERR5500755","ERR4458490","ERR4458472","ERR5506735","ERR4458481","ERR4458442","ERR4458479","ERR5505783","ERR5091877","ERR5517438","ERR5633327","ERR5515901","ERR4365002","ERR5631741","ERR6297313","ERR5093802","ERR5499768","ERR4440121","ERR5631253","ERR4458405","ERR5507040","ERR5630405","ERR4082214","ERR6297717","ERR4082226","ERR4458434","ERR4458419","ERR5630017","ERR4366130","ERR5094303","ERR4365125","ERR4082307","ERR4458459","ERR4082195","ERR4439673","ERR5512480","ERR4082222","ERR4458398","ERR4082198","ERR4082194","ERR4439594","ERR5629600","ERR4082223","ERR4082215","ERR5514452","ERR4082221","ERR5631815","ERR4439752","ERR4439585","ERR4458482","ERR4439489","ERR4458410","ERR4439843","ERR5705739","ERR5498563","ERR4082216","ERR4082224","ERR5169382","ERR4082208","ERR5514445","ERR4082201","ERR4458456","ERR4458427","ERR5505512","ERR4458467","ERR4459683","ERR5476834","ERR4082206","ERR4082193","ERR4458443","ERR5631417","ERR4458450","ERR4458401","ERR5705120","ERR5509766","ERR4364963","ERR4458404","ERR5633724","ERR5628809","ERR4439281","ERR4082308","ERR6301383","ERR4085434","ERR5501735","ERR5516042","ERR5666611","ERR5629633","ERR5634532","ERR5516312","ERR5511667","ERR5068990","ERR5089731","ERR7313352","ERR4458460","ERR4458462","ERR5706286","ERR4664779","ERR4364960","ERR4458432","ERR4458464","ERR5508755","ERR4082219","ERR4458465","ERR4085396","ERR5497491","ERR5507192","ERR4439734","ERR5515580","ERR5669007","ERR4439642","ERR4082203","ERR4440037","ERR5500842","ERR5627824","ERR5633487","ERR4458446","ERR5497385","ERR5517491","ERR4082217","ERR5706212","ERR4363838","ERR5632429","ERR4082196","ERR5631763","ERR4082225","ERR4365241","ERR4439934","ERR4458457","ERR5092934","ERR5516765","ERR5747156","ERR7310600","ERR4082197","ERR5495122","ERR4458458","ERR5501241","ERR4082204","ERR4458415","ERR4082309","ERR5504174","ERR4365335","ERR5631732","ERR4458390","ERR4082213","ERR4364592","ERR5509159","ERR5506471","ERR4458483","ERR4458431","ERR5499921","ERR5517159","ERR5497159","ERR4458403","ERR7315481","ERR4082218","ERR5094116","ERR4440187","ERR4082205","ERR4082076","ERR4082212","ERR5512392","ERR5629480","ERR4082220","ERR4084890","ERR4082210","ERR4458466","ERR5631160","ERR5507462","ERR5514246","ERR5505820","ERR5510726","ERR5495833","ERR4082211") 
    ), 
runs as ( 
    select *  
    from pairs,  
    unnest(attributes) a  
    where lower(a.k) in ("artic_primer_version_exp", "primers_sam", "amplicon__pcr_primer_protocol_run", "primer_design_sam", "artic_primers_sam", 
    "artic_protocol_version_exp", "artic_barcode_identifiers_sam", "amplification_method_sam", "sequencing_protocol_name_run", 
    "amplicon_pcr_primer_scheme_run") and lower(a.v) like "%3"  
), 
vars as ( 
    select run as acc, concat(REF, POS, ALT) as nt_var, concat(protein_name, ':', variation) as aa_var, DP, G_AD_2
    from `bigquery-public-data.nih_sequence_read.annotated_variations`
    where DP >= 100 and G_AD_2 >49 AND G_AD_2/DP >=0.15 and run in (select distinct acc from runs) 
), 
paired_vars as ( 
    select biosample, nt_var, aa_var, string_agg(distinct platform, " & " order by platform) as platforms 
    from runs 
    join vars using (acc) 
    group by biosample, nt_var, aa_var 
) 
select count(distinct acc) as runs, avg(DP) as avg_dp, min(DP) as min_dp, max(DP) as max_dp, avg(G_AD_2) as avg_altdp, min(G_AD_2) as min_altdp, 
max(DP) as max_altdp, avg(G_AD_2/DP) as avg_af, min(G_AD_2/DP) as min_af, max(G_AD_2/DP) as max_af, paired_vars.platforms 
from paired_vars 
join ( 
    select * 
    from vars 
    join runs using (acc) 
) using(biosample) 
group by paired_vars.platforms 
order by runs desc

What else might you look at to make sense of the results? Can you do any of those investigations with data NCBI provide in BigQuery?