PQL API

PQL - Probability Model Query Language

This is about the interface to a probabilistic model and what queries such an interface should support.

Disclaimer: this documentation may be a bit out-of-date

PQL Grammar

Commands/expressions supported by PQL

pql = predict-expr | model-expr | drop-expr ;

'Model' expressions derive a submodel and store it under a name

model-expr = model-clause, as-clause, from-clause, [where-clause];

model-clause = "MODEL" , dimension, {dimension};

as-clause = "AS" , name;

'Predict' expressions calculate certain aggregations of submodels of a model

predict-expr = predict-clause, from-clause , [where-clause], [split-by-clause];

predict-clause = "PREDICT" , (dimension | aggr-dimension) , {dimension | aggr-dimension};

split-by-clause = "SPLIT BY" , split-dimension, {split-dimension};

aggr-dimension = aggr , "(" , dimension , {dimension} , ")";

aggr = "density" | "avg" | "max"; /* an aggregation function */

'Drop' expressions drop, i.e. remove, a model from a model base

drop-expr = "DROP" , model;

Rules needed for several of the above commands/expressions

from-clause = "FROM" , ( model | "(" , model-expr , ")" );

where-clause = "WHERE" , filter-expr, {filter-expr};

split-dimension = splitFct , "(" , dimension , {splitFctArg} , ")";

dimension = "name"; /* name of a random variable of the model */

model = "name"; /* name of a model */

splitFct = /* method to split, e.g. equisplit, ... */

splitFctArg = /* argument to a split function */
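
For illustration, a 'PREDICT' expression written out in the textual form of the grammar above could look as follows; this is only a sketch, since the concrete syntax of a filter-expr is not defined here and the WHERE line is therefore an assumption:

PREDICT sex, avg(income)
FROM census
WHERE age > 35
SPLIT BY equiDist(sex, 10)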

Examples

In the following, the example queries are given as JSON objects.

Note that "dimension" from the grammar is represented here by the "name" property.

Idea: I could add a "class" property to each {}, e.g.

{"aggregation": "average", "name": "income", "class": "aggregation"}

instead of

{"aggregation": "average", "name": "income"}

to ease parsing of the lexemes.
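
Applied to the full 'PREDICT' example below, each object would then carry such a "class" property; the class names "filter" and "split" used here are assumptions:

{  "PREDICT": [
      "sex",
      {"aggregation": "average", "name": "income", "class": "aggregation"}
   ],
   "FROM": "census",
   "WHERE": [
      {"name": "age", "operator": ">", "value": 35, "class": "filter" }
   ],
   "SPLIT BY": [
      {"name": "sex", "split": "equiDist", "args": 10, "class": "split"}
   ]
}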

'PREDICT' query: predicts the average income of people over 35, split by sex:

{  "PREDICT": [
      "sex", 
      {"aggregation": "average", "name": "income"}
   ],
   "FROM": "census",
   "WHERE": [
      {"name": "age", "operator": ">", "value": 35 }
   ],
   "SPLIT BY": [
      {"name": "sex", "split": "equiDist", "args": 10}
   ]
}

Note that a randVar without aggregation is valid in the 'PREDICT' clause if there is a 'SPLIT BY' clause for that randVar. If that randVar is not present in the 'PREDICT' clause, the grouping still happens, but the values of the grouped-by random variable are not returned. This is in accordance with the SQL way.
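
For example, dropping "sex" from the 'PREDICT' clause of the query above still groups by sex, but only the averaged income per group is returned:

{  "PREDICT": [
      {"aggregation": "average", "name": "income"}
   ],
   "FROM": "census",
   "WHERE": [
      {"name": "age", "operator": ">", "value": 35 }
   ],
   "SPLIT BY": [
      {"name": "sex", "split": "equiDist", "args": 10}
   ]
}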

'MODEL' query: derives a submodel over the random variables sex and income, conditioned on age == 20, and stores it under the name "my-sub-model":

{  "MODEL": [
    "sex", 
    "income"
  ],
  "AS:" "my-sub-model",
  "FROM": "census",
  "WHERE": [
    {"name": "age", "operator": "EQUALS", "value": 20 }
  ]
}
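
The grammar also allows a model expression directly inside the 'FROM' clause of another query. In the JSON form this could be expressed by nesting the model query, as in the following sketch (the nesting convention itself is an assumption, not a documented part of the API):

{  "PREDICT": [
      {"aggregation": "average", "name": "income"}
   ],
   "FROM": {
      "MODEL": ["sex", "income"],
      "AS": "my-sub-model",
      "FROM": "census",
      "WHERE": [
         {"name": "age", "operator": "EQUALS", "value": 20 }
      ]
   }
}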

'DROP' query: drops a model named "not-needed-model" from the model base

{  "DROP": "not-needed-model" }

'SHOW HEADER' query: returns a description of a model named "car_crashes" in terms of its random variables

{  "SHOW": "HEADER",
   "FROM": "car_crashes"
}

'SHOW MODELS' query: returns a list of all names of models in the model base

{  "SHOW": "MODELS"  }

The following are all model queries; however, they each have a particular semantics:

'CLONE MODEL' query: creates an identical copy of a model under a different name

{ "FROM": "car_crashes",
  "MODEL": "*",
  "AS": "car_crashes_clone"
}

'MARGINALIZE MODEL' query: marginalizes some random variables out of a model, keeping only those listed in the MODEL clause (and modifies that model in place instead of creating a new one)

{ "FROM": "car_crashes_clone",
  "MODEL": ["speeding", "alcohol", "not_distracted"],
  "AS": "car_crashes_clone"
}

Queries against a Model

This is old stuff and may not be reliable:

  • marginalization
  • conditioning
    • condition on single value
    • condition on range, i.e.:
      • set of values for discrete variables
      • range / set of ranges for continuous variables
  • aggregations
    • note 1: the domain of the aggregation function need not be one-dimensional. Multidimensional densities, too, have e.g. a single maximum, since their image is nevertheless scalar.
    • note 2: I added "arg" before each aggregation function, as we are really interested in the value (set of values/ranges/...) of the variable that gives us the maximum, average, ...
    • arg-average:
    • arg-maximum: gives a scalar
    • arg-k-largest local maxima: gives k scalars
    • arg-quantiles: set of ranges (it may be multiple ranges)
  • binning, i.e. discretizing
    • is essentially "iterated averaging on ranges": for all bins do '(average of (p conditioned on current bin))'; see the sketch after this list
  • sampling ?
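
In terms of the query format above, such a binning could be expressed by splitting a continuous variable into ranges and predicting an average per range; the following is a sketch reusing the census example:

{  "PREDICT": [
      "age",
      {"aggregation": "average", "name": "income"}
   ],
   "FROM": "census",
   "SPLIT BY": [
      {"name": "age", "split": "equiDist", "args": 10}
   ]
}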