PQL API - lumen-org/LumenReact GitHub Wiki
PQL - Probability Model Query Language
This page describes the interface to a probabilistic model and the queries such an interface should support.
Disclaimer: this documentation may be somewhat out of date.
PQL Grammar
Commands/expressions supported by PQL
pql = predict-expr | model-expr | drop-expr ;
'Model' Expressions derive a submodel and store it under a name
model-expr = model-clause, as-clause, from-clause, [where-clause];
model-clause = "MODEL" , dimension, {dimension};
as-clause = "AS" , name;
'Predict' expressions calculate certain aggregations of submodels of a model
predict-expr = predict-clause, from-clause , [where-clause], [split-by-clause];
predict-clause = "PREDICT" , (dimension | aggr-dimension) , {dimension | aggr-dimension};
split-by-clause = "SPLIT BY" , split-dimension, {split-dimension};
aggr-dimension = aggr , "(" , dimension , {dimension} , ")";
aggr = "density" | "avg" | "max"; /* an aggregation function */
'Drop' expressions remove a model from the model base
drop-expr = "DROP" , model;
Rules needed for several of the above commands/expressions
from-clause = "FROM" , model | ( "(" , model-expr , ")" );
where-clause = "WHERE" , filter-expr, {filter-expr};
split-dimension = splitFct , "(" , dimension , {splitFctArg} , ")";
dimension = "name"; /* name of a random variable of the model */
model = "name"; /* name of a model */
splitFct = /* method to split, e.g. equisplit, ... */
splitFctArg = /* argument to a split function */
Examples
In the following, the examples are wrapped in a JSON object.
Note that "dimension" is replaced by "name" here.
Idea: I could add a "class" property to each {}, e.g.
{"aggregation": "average", "name": "income", "class": "aggregation"}
instead of
{"aggregation": "average", "name": "income"}
to ease parsing of the lexemes.
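A minimal Python sketch of what the proposed "class" property would buy a parser. The function name `kind_of` and the kind labels "dimension", "filter" and "split" are hypothetical; only "aggregation" appears in the idea above. Without the tag, the kind of each object has to be inferred from which keys are present (the key names are taken from the examples on this page).

```python
def kind_of(item):
    """Return the kind of one query element (hypothetical helper)."""
    if isinstance(item, str):
        return "dimension"  # a bare name, e.g. "sex"
    if "class" in item:
        return item["class"]  # explicit tag: no guessing needed
    # Fallback without a "class" tag: infer the kind from the keys present.
    if "aggregation" in item:
        return "aggregation"
    if "operator" in item:
        return "filter"
    if "split" in item:
        return "split"
    raise ValueError(f"cannot determine kind of {item!r}")
```

With the tag, the parser needs only the single lookup in the second branch; the key-sniffing fallback is what the tag would make unnecessary.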
'PREDICT' query: predicts the average income of people older than 35, split by sex:
{ "PREDICT": [
"sex",
{"aggregation": "average", "name": "income"}
],
"FROM": "census",
"WHERE": [
{"name": "age", "operator": ">", "value": 35 }
],
"SPLIT BY": [
{"name": "sex", "split": "equiDist", "args": 10}
]
}
Note that a ''randVar'' without aggregation is valid in the ''PREDICT''-clause if there is a ''GROUP BY''-clause for that ''randVar''. If that ''randVar'' is not present in the ''PREDICT''-clause, the grouping would still happen, but the values of the random variable grouped by would not be returned. This is in accordance with the SQL way.
'MODEL' query: derive submodel over random variables sex and income, conditioned on age == 20, and store it under the name "my-sub-model"
{ "MODEL": [
"sex",
"income"
],
"AS": "my-sub-model",
"FROM": "census",
"WHERE": [
{"name": "age", "operator": "EQUALS", "value": 20 }
]
}
'DROP' query: drops a model named "not-needed-model" from the model base
{ "DROP": "not-needed-model" }
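The three expression types of the grammar can be told apart by the keys of the JSON object. The following Python sketch checks a query against the required and optional clauses of `predict-expr`, `model-expr` and `drop-expr`. The function name `classify` and the table `CLAUSES` are hypothetical and not part of any existing PQL implementation; the check looks only at which clauses are present, not at their contents.

```python
# Required and optional clauses per expression type, read off the grammar.
CLAUSES = {
    "PREDICT": ({"PREDICT", "FROM"}, {"WHERE", "SPLIT BY"}),
    "MODEL": ({"MODEL", "AS", "FROM"}, {"WHERE"}),
    "DROP": ({"DROP"}, set()),
}


def classify(query):
    """Return "PREDICT", "MODEL" or "DROP" for a query dict, or raise ValueError."""
    keys = set(query)
    kinds = [kind for kind in CLAUSES if kind in keys]
    if len(kinds) != 1:
        raise ValueError(f"expected exactly one of {sorted(CLAUSES)} in {sorted(keys)}")
    kind = kinds[0]
    required, optional = CLAUSES[kind]
    missing = required - keys
    unknown = keys - required - optional
    if missing:
        raise ValueError(f"{kind} query lacks clause(s): {sorted(missing)}")
    if unknown:
        raise ValueError(f"{kind} query has unexpected clause(s): {sorted(unknown)}")
    return kind
```

Because the grammar has no rule for them, the 'SHOW' queries listed further down are rejected by this sketch.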
'SHOW HEADER' query: returns a description of a model named "car_crashes" in terms of its random variables
{ "SHOW": "HEADER",
"FROM": "car_crashes"
}
'SHOW MODELS' query: returns a list of all names of models in the model base
{ "SHOW": "MODELS" }
The following are all model queries, but each has particular semantics:
'CLONE MODEL' query: creates an identical copy of a model under a different name
{ "FROM": "car_crashes",
"MODEL": "*",
"AS": "car_crashes_clone"
}
'MARGINALIZE MODEL' query: marginalizes some random variables out of a model (and modifies that model instead of creating a new one)
{ "FROM": "car_crashes_clone",
"MODEL": ["speeding", "alcohol", "not_distracted"],
"AS": "car_crashes_clone"
}
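Both special cases are ordinary model queries, so they can be produced by small helper functions. The Python sketch below uses the hypothetical names `clone_query` and `marginalize_query`. It assumes that the "MODEL" list names the random variables that remain in the model, in line with the 'MODEL' query above, which derives a submodel over the listed variables.

```python
import json


def clone_query(model, new_name):
    """Model query that copies `model` unchanged under `new_name`."""
    return {"FROM": model, "MODEL": "*", "AS": new_name}


def marginalize_query(model, keep):
    """Model query that keeps only the variables in `keep`.

    Storing the result under the same name overwrites the original model.
    """
    return {"FROM": model, "MODEL": list(keep), "AS": model}


if __name__ == "__main__":
    print(json.dumps(clone_query("car_crashes", "car_crashes_clone")))
    print(json.dumps(marginalize_query(
        "car_crashes_clone", ["speeding", "alcohol", "not_distracted"])))
```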
Queries against a Model
This is old material and may not be reliable:
- marginalization
- conditioning
  - condition on single value
  - condition on range, i.e.:
    - set of values for discrete variables
    - range / set of ranges for continuous variables
- aggregations
  - note 1: the domain of the aggregation function need not be one-dimensional. Multidimensional densities also have, e.g., a single maximum, since their image is still scalar.
  - note 2: I added "arg" before each aggregation function because we are really interested in the value (or set of values/ranges/...) of the variable that yields the maximum, average, ...
  - arg-average:
  - arg-maximum: gives a scalar
  - arg-k-largest local maxima: gives k scalars
  - arg-quantiles: set of ranges (it may be multiple ranges)
- binning, i.e. discretizing
  - is essentially "iterated averaging on ranges": for all_bins do '(average of (p conditioned on current bin))'
- sampling ?
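The "iterated averaging on ranges" reading of binning can be sketched in Python for a toy case. The sketch assumes a one-dimensional discrete model given as a dict from value to probability; the function name `bin_averages` and the half-open bins [lo, hi) are choices made for this illustration, not part of PQL.

```python
def bin_averages(pmf, edges):
    """Average of p conditioned on each bin [lo, hi).

    pmf:   dict mapping a numeric value to its probability
    edges: sorted list of bin edges
    Returns one (lo, hi, mean) tuple per bin; mean is None for an empty bin.
    """
    result = []
    for lo, hi in zip(edges, edges[1:]):
        # Condition on the current bin: keep only the mass that falls inside it.
        in_bin = {x: p for x, p in pmf.items() if lo <= x < hi}
        mass = sum(in_bin.values())
        if mass > 0:
            # Average of the conditioned distribution (renormalized by the bin mass).
            mean = sum(x * p for x, p in in_bin.items()) / mass
        else:
            mean = None
        result.append((lo, hi, mean))
    return result
```

For a continuous model, the sums would become integrals over each bin, but the loop over bins stays the same.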