Project_A - ncbi/workshop-asm-ngs-2022 GitHub Wiki

Project A: Find SARS2 data with paired Illumina and ONT samples, generated using ARTICv3

Step 1: Figuring out how to implement each filter individually

It is often useful to work out each filter individually initially to ensure they work as expected, and none return an empty result set

Query 1: Filter by Organism

Organism contains the taxonomic name the record was submitted under and is cap-sensitive. Using lower() is helpful to avoid issues with capitlaization.

    select *
    from nih-sra-datastore.sra.metadata
    where lower(organism) = "severe acute respiratory syndrome coronavirus 2"

Query 2: Filter by Platform

    select *
    from nih-sra-datastore.sra.metadata
    where lower(platform) = "illumina"

Query 3: Find additional fields that may have primer strategy information

Some information may be entered in a variety of fields, so its useful to search for fields which might contain the information you are looking for. Note the use of unnest() to facilitate search against an array typed field. The use of % allows searching within strings for substrings, using it only at the start or end of a query string would mean that the result must end or start with your query. The opening with statement is known as a common table expression, or CTE, and is useful when constructing more complex queries.

    with runs as (
        select * 
        from nih-sra-datastore.sra.metadata 
        where lower(organism) = "severe acute respiratory syndrome coronavirus 2" 
    )
    select distinct a.k as value, "k" as field
    from runs,
    unnest(attributes) a
    where lower(a.k) like "%artic%" or lower(a.k) like "%primer%" or lower(a.v) like "%artic%" or lower(a.v) like "%primer%"

Which fields look useful to you? What do you think would happen without the CTE if we just queried directly against nih-sra-datastore.sra.metadata?

Step 2: Putting it all together

Now that we know all our filters work, let try putting them all together.

Note the two subqueries within the CTE; we are asking for records where the associated biosample ID occurs in each (and) of two subqueries (one looking for Illumina and the other for ONT records)

Query 4:

with runs as (
    select * 
    from nih-sra-datastore.sra.metadata 
    where lower(organism) = "severe acute respiratory syndrome coronavirus 2"
        and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "illumina")
        and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "oxford_nanopore")
)
select *
from runs,
unnest(attributes) a
where lower(a.k) in ("artic_primer_version_exp", "primers_sam", "amplicon__pcr_primer_protocol_run", "primer_design_sam", "artic_primers_sam", 
"artic_protocol_version_exp", "artic_barcode_identifiers_sam", "amplification_method_sam", "sequencing_protocol_name_run", 
"amplicon_pcr_primer_scheme_run") and lower(a.v) like "%3"

Step 3: Grouping by collection date

Its often useful to look at some descriptive statistics of the data you've identified before conducting downstream analyses. Here will consider how to modify the above query to look at the distribution of collection dates among the records previously identified using group by. group by is an aggregate function, meaning it operates over the associated columns. When used, each value in the select statement must either have an aggregate function applied to it, or else be grouped by.

Query 5:

with runs as (
    select * 
    from nih-sra-datastore.sra.metadata 
    where lower(organism) = "severe acute respiratory syndrome coronavirus 2"
        and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "illumina")
        and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "oxford_nanopore")
)
select count(distinct biosample) as samples, collection_date_sam
from runs,
unnest(attributes) a
where lower(a.k) in ("artic_primer_version_exp", "primers_sam", "amplicon__pcr_primer_protocol_run", "primer_design_sam", "artic_primers_sam", 
"artic_protocol_version_exp", "artic_barcode_identifiers_sam", "amplification_method_sam", "sequencing_protocol_name_run", 
"amplicon_pcr_primer_scheme_run") and lower(a.v) like "%3"
group by collection_date_sam
order by collection_date_sam desc
⚠️ **GitHub.com Fallback** ⚠️