Project_A - ncbi/workshop-asm-ngs-2022 GitHub Wiki
It is often useful to work out each filter individually initially to ensure they work as expected, and none return an empty result set
Organism contains the taxonomic name the record was submitted under and is cap-sensitive. Using lower()
is helpful to avoid issues with capitlaization.
select *
from nih-sra-datastore.sra.metadata
where lower(organism) = "severe acute respiratory syndrome coronavirus 2"
select *
from nih-sra-datastore.sra.metadata
where lower(platform) = "illumina"
Some information may be entered in a variety of fields, so its useful to search for fields which might contain the information you are looking for. Note the use of unnest()
to facilitate search against an array
typed field. The use of %
allows searching within strings for substrings, using it only at the start or end of a query string would mean that the result must end or start with your query. The opening with
statement is known as a common table expression, or CTE, and is useful when constructing more complex queries.
with runs as (
select *
from nih-sra-datastore.sra.metadata
where lower(organism) = "severe acute respiratory syndrome coronavirus 2"
)
select distinct a.k as value, "k" as field
from runs,
unnest(attributes) a
where lower(a.k) like "%artic%" or lower(a.k) like "%primer%" or lower(a.v) like "%artic%" or lower(a.v) like "%primer%"
Which fields look useful to you? What do you think would happen without the CTE if we just queried directly against nih-sra-datastore.sra.metadata
?
Now that we know all our filters work, let try putting them all together.
Note the two subqueries within the CTE; we are asking for records where the associated biosample ID occurs in each (and
) of two subqueries (one looking for Illumina and the other for ONT records)
with runs as (
select *
from nih-sra-datastore.sra.metadata
where lower(organism) = "severe acute respiratory syndrome coronavirus 2"
and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "illumina")
and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "oxford_nanopore")
)
select *
from runs,
unnest(attributes) a
where lower(a.k) in ("artic_primer_version_exp", "primers_sam", "amplicon__pcr_primer_protocol_run", "primer_design_sam", "artic_primers_sam",
"artic_protocol_version_exp", "artic_barcode_identifiers_sam", "amplification_method_sam", "sequencing_protocol_name_run",
"amplicon_pcr_primer_scheme_run") and lower(a.v) like "%3"
Its often useful to look at some descriptive statistics of the data you've identified before conducting downstream analyses. Here will consider how to modify the above query to look at the distribution of collection dates among the records previously identified using group by
. group by
is an aggregate function, meaning it operates over the associated columns. When used, each value in the select statement must either have an aggregate function applied to it, or else be grouped by.
with runs as (
select *
from nih-sra-datastore.sra.metadata
where lower(organism) = "severe acute respiratory syndrome coronavirus 2"
and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "illumina")
and biosample in (select distinct biosample from nih-sra-datastore.sra.metadata where lower(platform) = "oxford_nanopore")
)
select count(distinct biosample) as samples, collection_date_sam
from runs,
unnest(attributes) a
where lower(a.k) in ("artic_primer_version_exp", "primers_sam", "amplicon__pcr_primer_protocol_run", "primer_design_sam", "artic_primers_sam",
"artic_protocol_version_exp", "artic_barcode_identifiers_sam", "amplification_method_sam", "sequencing_protocol_name_run",
"amplicon_pcr_primer_scheme_run") and lower(a.v) like "%3"
group by collection_date_sam
order by collection_date_sam desc