Query Builder - serratus-bio/open-virome GitHub Wiki

Filters

In Open Virome, you can query into a Virome using metadata labels associated with sequencing runs. We call these labels filters. A filter can draw on metadata attached directly to the SRA run or to its parent BioSample or BioProject.

We can think of these filters as axes along which we slice into the viral ecological niche that we want to explore.

For the community as a whole we can conceive an abstract niche "structure." Many of the attributes of niches can be treated as gradients - of organism size, vertical height, soil depth, diurnal time, seasonal time, proportions of different foods, intensities of different chemical defenses, etc. These gradients may be treated as axes defining a multidimensional, abstract "space", the niche hyperspace (Hutchinson 1957, Whittaker 1965, 1969, 1972). The niche hyperspace is a means of conceiving the way species relate to one another in the community as an interacting system.

Whittaker, R. H. “Evolution and Measurement of Species Diversity.” Taxon, vol. 21, no. 2/3, 1972, pp. 213–51. JSTOR, https://doi.org/10.2307/1218190.

Identifiers

Under the hood, the Query Builder constructs a minimal join between the denormalized `ov_identifiers` materialized view and the normalized tables associated with each applied filter (see the table dependency chart below).

This initial query returns a set of identifiers (`run_id`, `biosample`, `bioproject`), which is stored in the global application state. Each Module can then use these identifiers to query its associated tables for counts and render plots.
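
The join-only-what-is-filtered idea above can be sketched as follows. This is a minimal illustration, not the application's actual code: the `FILTER_TABLES` mapping, the `build_identifier_query` helper, and the filtered column names (`tissue`, `name`) are assumptions made for the example, and SQLite stands in for the real database. Only the table names and join keys come from the dependency table on this page.

```python
import sqlite3

# Hypothetical mapping: filter name -> (table, join key, filtered column).
# Table names and keys follow the dependency table; column names are assumed.
FILTER_TABLES = {
    "tissue": ("biosample_tissue", "biosample", "tissue"),
    "stat_organism": ("sra_stat", "run_id", "name"),
}


def build_identifier_query(filters):
    """Return (sql, params), joining ov_identifiers only to tables the filters need."""
    joins, conditions, params = [], [], []
    for name, values in filters.items():
        if not values:
            continue  # an empty filter adds no join and no condition
        table, key, column = FILTER_TABLES[name]
        joins.append(f"JOIN {table} ON {table}.{key} = ov_identifiers.{key}")
        placeholders = ", ".join("?" for _ in values)
        conditions.append(f"{table}.{column} IN ({placeholders})")
        params.extend(values)
    sql = (
        "SELECT DISTINCT ov_identifiers.run_id, ov_identifiers.biosample, "
        "ov_identifiers.bioproject FROM ov_identifiers " + " ".join(joins)
    )
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


# Tiny in-memory stand-in for the real tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ov_identifiers (run_id TEXT, biosample TEXT, bioproject TEXT);
    CREATE TABLE biosample_tissue (biosample TEXT, tissue TEXT);
    CREATE TABLE sra_stat (run_id TEXT, name TEXT);
    INSERT INTO ov_identifiers VALUES
        ('SRR1', 'SAMN1', 'PRJNA1'), ('SRR2', 'SAMN2', 'PRJNA1');
    INSERT INTO biosample_tissue VALUES ('SAMN1', 'lung'), ('SAMN2', 'gut');
""")

sql, params = build_identifier_query({"tissue": ["lung"]})
identifiers = conn.execute(sql, params).fetchall()
```

With only a tissue filter applied, the generated SQL joins `biosample_tissue` and never touches `sra_stat`, which is the sense in which the join is minimal.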

Advantages:
- Decouple Query functionality from Module plotting.
- Enable alternative querying methods (e.g. by sequence, by Neo4j Cypher, by k-mer index search) as long as they return a set of identifiers.
- Module queries for counts don't need to run the complex joins associated with applied filters. Instead, they can add a simple condition of the form `{indexed column} IN ({identifiers})`, which is also much faster.
- Simplifies the mental model when working with tables that use different identifiers as their primary key. For example, the tissue table has a primary key of `biosample`, whereas the STAT table uses `run_id`.
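
As a rough sketch of the last two points, a Module can pick whichever identifier its table is keyed on and restrict its count query with a single `IN` condition. The `count_by` helper, the `TABLE_KEYS` mapping, and the `tissue` column name are hypothetical; the table names and keys are taken from the dependency table on this page, and SQLite is used only to keep the example runnable.

```python
import sqlite3

# Identifier each module table is keyed on (per the dependency table).
TABLE_KEYS = {"biosample_tissue": "biosample", "sra_stat": "run_id"}


def count_by(conn, table, column, identifiers):
    """Count rows per value of `column`, restricted to the stored identifiers."""
    key = TABLE_KEYS[table]
    # Deduplicate: several runs can share one biosample.
    ids = sorted({row[key] for row in identifiers})
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"WHERE {key} IN ({placeholders}) "
        f"GROUP BY {column} ORDER BY {column}"
    )
    return conn.execute(sql, ids).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE biosample_tissue (biosample TEXT PRIMARY KEY, tissue TEXT);
    INSERT INTO biosample_tissue VALUES
        ('SAMN1', 'lung'), ('SAMN2', 'gut'), ('SAMN3', 'lung');
""")

# Identifiers as they might be held in application state (assumed shape).
identifiers = [
    {"run_id": "SRR1", "biosample": "SAMN1", "bioproject": "PRJNA1"},
    {"run_id": "SRR2", "biosample": "SAMN2", "bioproject": "PRJNA1"},
]
tissue_counts = count_by(conn, "biosample_tissue", "tissue", identifiers)
```

The count query contains no joins at all; the filters have already been resolved into identifiers, so the module only needs to know its own table and key.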

Disadvantages:
- Storing a large number of identifiers in app state can slow the app down. This can be mitigated by storing only ranges of identifiers.
- In the future, conflicts may arise if we apply one filter to a biosample and a conflicting filter to a run within that biosample. This is not currently an issue, and it would also affect other approaches to querying.
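
One possible reading of "storing only ranges of identifiers" is to collapse consecutive accessions into `(prefix, start, end)` tuples. The sketch below illustrates that interpretation only; `compress_ids` and `contains` are hypothetical helpers, and the code assumes accessions are a letter prefix followed by digits, with leading zeros carrying no meaning.

```python
import re

ACCESSION = re.compile(r"([A-Za-z]+)(\d+)")


def compress_ids(ids):
    """Collapse accessions such as SRR100..SRR102 into (prefix, start, end) ranges."""
    parsed = []
    for acc in set(ids):
        match = ACCESSION.fullmatch(acc)
        if match is None:
            raise ValueError(f"unexpected accession format: {acc}")
        # Assumption: zero padding is not significant, so the number is kept as int.
        parsed.append((match.group(1), int(match.group(2))))
    parsed.sort()

    ranges = []
    for prefix, number in parsed:
        if ranges and ranges[-1][0] == prefix and ranges[-1][2] == number - 1:
            # Extend the current range by one.
            ranges[-1] = (prefix, ranges[-1][1], number)
        else:
            ranges.append((prefix, number, number))
    return ranges


def contains(ranges, acc):
    """Check whether an accession falls inside any stored range."""
    match = ACCESSION.fullmatch(acc)
    if match is None:
        return False
    prefix, number = match.group(1), int(match.group(2))
    return any(p == prefix and lo <= number <= hi for p, lo, hi in ranges)


ranges = compress_ids(["SRR101", "SRR100", "SRR102", "SRR200", "ERR5"])
```

How much this saves depends on how contiguous the selected accessions are: a filter that selects whole projects tends to produce long runs, while a scattered selection compresses poorly.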

Table dependencies

Documentation on available tables in database.

| Module | Table/View | Identifiers |
| --- | --- | --- |
| Query Builder | `ov_identifiers` | `run_id`, `biosample`, `bioproject` |
| SRA Run | `sra` | `run_id`, `biosample`, `bioproject` |
| Virome | `palm_virome` | `run_id`, `biosample`, `bioproject` |
| Ecology (geo) | `biosample_geographical_location` | `biosample` |
| Ecology (biome) | `bgl_gm4326_gp4326` | `biosample` |
| Host (Tissue) | `biosample_tissue` | `biosample` |
| Host (STAT) | `sra_stat` | `run_id` |