Miner Query Language - minersoft/miner GitHub Wiki
Mining query language realizes pipeline processing concept.
SOURCE-COMMAND | COMMAND1 | COMMAND2 .... | DESTINATION-COMMAND
Source command identifies data source, destination command identifies data target and chain commands define data mining instructions.
Each miner command starts with command name that is denoted by upper case keyword (e.g. READ, WRITE, SELECT)
To get help information about commands just type
HELP <command-name>
or press F1 in command context
If command accepts expression, like SELECT, then any valid python expression is accepted with following exceptions/additions:
- inline if should be inserted in C notation -
<condition>? <on-true>: <on-false>
- Regular expression matching can be used:
"string" =~ "regular-expression"
or for non match"string" !~ "regular-expression"
- Miner allows to use advanced constants like 10K, 2M, 2.5G, 3T, 10KB, 20MB, 1.5GB, 1000TB which are converted to integers or floats automatically.
- Time constants: 1h = 3600, 1m = 60 Python variable can be created using SET command:
>>> SET c = 1TB
>>> SET l = [1KB, 1MB, 1GB, 1TB]
>>> SET d = {"hour": 1h, "min": 1m }
Then you can use PRINT command to print the expression:
PRINT c, l, d
and EVAL command to evaluate expression without returning its result:
EVAL l.reverse()
To delete variables
DEL c
DEL l[1]
DEL d["hour"]
To import external python module:
IMPORT mymodule
IMPORT mymodule.mysubmodule as mm
READ command is used to read data from files.
Data is read record by record, each record consisting of set of named fields.
Type of data in file is determined implicitly from file extension or explicitly using following format:
READ<data-type> data.dat
READ command can specify single file to read, multiple files with glob pattern(*, ?, []), sub-tree of files or compressed sources:
READ<csv> *.dat | ...
READ json-tree-root/.../*.json | ...
READ file??.csv.gz | ...
Following syntax dumps tar archive as single csv stream
READ<csv> file.tar.gz | ...
Following syntax extracts json files from tar archive
READ<json> file.tar.gz:*.json | ...
DB FETCH command is used to fetch data from databases (see Working With Database)
DB connection FETCH 'SELECT * FROM table' | ...
SET queryString = "SELECT * FROM table WHERE i=%s or j=%s"
DB connection FETCH queryString WITH param1, param2 | ...
Connection is an opened connection to a database Second format allows to specify external parameters to the query
ITERATE command allows to process data stored in memory, e.g. python lists or dictionaries For example if myList is list of 2-tuples you can do following:
ITERATE a,b in myList | ...
ITERATE i in xrange(1000) | ...
If d is python dictionary or dictionary-like object you can do
ITERATE key,value in d.iteritems() | ...
WRITE command is used to write data to files and/or to generate reports
... | WRITE out.csv
... | WRITE<csv> out.dat
... | WRITE out.xslx
STDOUT command is used to dump pretty-formatted output to the screen. It supports formatting parameters and allows redirect to file
... | STDOUT
... | STDOUT i='x' j='i' _='s'
... | STDOUT > file.txt
Second option specifies that output format of i should be hexadecimal, output format of j should be integer and all rest of fields should be represented as strings
DB PUSH is used to insert/update data into the database
ITERATE i in xrange(100) | SELECT i, i**2 as j | DB connection PUSH 'insert into table(i,j) values(%s,%s)' WITH i, j
STORE command used to store results of miner in memory.
There are 3 forms of usage of this command.
- If you have single row of output (usually after FOR SELECT ... command) and want to store its result in some variables:
>>> READ file.csv | FOR SELECT concat(a), sum(b+c) as d | STORE a as a_list, d
-- Destination a_list, d
-- Mining file.csv ...
Processed 3 coals into 1 diamonds for 0.001 seconds, 0 coals/sec.
>>> PRINT a_list, d
['A', ' ', ','] 10.5
- To store list of results:
>>> READ file.csv | STORE a in a_list
-- Destination a_list
-- Mining file.csv ...
Processed 3 coals into 3 diamonds for 0.001 seconds, 0 coals/sec.
>>> PRINT a_list
['A', ' ', ',']
- To store data in dictionary
>>> READ file.csv | STORE a:(b+c) in a_dict
-- Destination a_dict
-- Mining file.csv ...
Processed 3 coals into 3 diamonds for 0.001 seconds, 0 coals/sec.
>>> PRINT a_dict
{'A': 3, ' ': 7.5, ',': 0}
If the dictionary exists it is filled with new items and not recreated by STORE command. This for example may lead to errors if the object type is not dictionary at all.
SELECT command is a general method to apply transformation on each record (SELECT can't be used for aggregations):
ITERATE i in xrange(100) | SELECT i, i**2 as square_of_i, mm.my_func(i) as mymodule_myfunc_i | STDOUT
SELECT effectively replaces old fields with new fields defined inside select. You would probably
need to set names for new expression. This can be done using as modifier (no quotes allowed around new name)
If you want to keep all old fields, still adding new ones you can use '*' e.g. SELECT *, i**2 as square_of_i
It is possible to generate automatic counter inside select using following format:
# pre-increment (starts from 1)
SELECT ++@id, *
# post increment (starting from zero)
SELECT @id++, *
# general counter
SELECT @id+=i, *
Similar to select, but extracts only distinct values
ITERATE i in xrange(20) | DISTINCT (i/10)*10 as i | STDOUT
The query above will return 2 values: 0, and 10
IF is the basic filtering command:
ITERATE i in xrange(20) | IF i%2 == 0 | STDOUT
LIMIT number records to process
READ a.csv | LIMIT 10 | STDOUT
Or limit until expression evaluates to True
ITERATE i in xrange(1000) | LIMIT IF i<10 | STDOUT
Propagates only last items
ITERATE i in xrange(1000) | TAIL 10 | STDOUT
SORTBY <expression> [ASC|DESC]
Sorts records by general python expression.
Supports numbers, string or any ordered data type.
Default sort order is ascending.
ITERATE i in xrange(20) | SORTBY (i-10)**2 | STDOUT
Sort by group of expressions is allowed, e.g. SORT i,j,k DESC
Complexity of sort is O(n) in size and O(n*log(n)) in time where n is the size of input.
TOP/BOTTOM <number> <expression>
Returns at most of top/bottom records ordered by expression, e.g.:
ITERATE i in xrange(20) | BOTTOM 4 (i-10)**2 | STDOUT
Complexity of operation is O(m) in size and O(n*log(m)) in time where n is the size of input and m is the size of subset.
Aggregation commands are used to calculate different values for set of records.
There are few types of commands each starting with FOR
keyword.
The simplest form is:
ITERATE i in xrange(20) | FOR SELECT sum(i) as sum_i, avg(i) as avg_i | STDOUT
The query above generates single output record with 2 entries one for sum and one for average. sum() and avg() operators are called aggregators. There is wide selection of available aggregators and you can add your if they follow some api see below.
Second form of aggregation is GROUP BY in SQL notation. In miner query language you do it as following:
... | FOR DISTINCT x, y SELECT sum(z) |...
Here the Grouping is identified by list of expressions x, y. After SELECT keyword aggregation expressions are specified. Command above will generate one record for each distinct set of x,y.
Third format is used to create histograms:
ITERATE i in xrange(50) | FOR EACH 10 OF i SELECT sum(i) as sum_i | STDOUT
This will split i range in bins of 10 and calculate aggregation expressions for each bin.
The full format of FOR EACH
command is:
FOR EACH <bin-size> [IN <from>, <to>] OF <expr> [as <name] SELECT <agg> exp1 [as name1], ...
Fourth Aggregation format is FOR IN
, which allows to specify explicitly set of values and order for grouping:
... | FOR month IN [`jan`, `feb`] SELECT sum(days) | ...
Available aggregators are:
append(value) - appends values to a list
avg(exp) - average value of expression
avgIf(cond,exp) - calculates average value of expression for which condition is True
concat(list) - Concatenates lists to a single one
count(exp) - counts all occurrences when expression evaluates to True
first(exp) - returns first seen value of expression
fraction(exp) - returns fraction of all occurrences when expression evaluates to True
fractionSum(cond, exp) - returns fraction of sum of expressions for which cond is True over total sum
last(exp) - returns last seen value of expression
max(exp) - returns maximum of all elements
min(exp) - returns minimum of all elements
number(exp) - counts number of distinct values of expression
number(cond, exp) - counts of distinct values of expression for which condition evaluates to true
rate(period)(value) - gets the rates of the value over defined period, e.g. rate(1h)(hits)
rateIf(period)(cond,value)- gets the rates of the value over defined period filtered by the condition
ratio(nominator,denominator) - returns sum(nominator)/sum(denominator)
segments(start, size) - merges segments into a continuous ones, returns aggregate.Segments object
stats(value) - calculates full set of statistical data (min, max, avg, ...) for variable
statsIf(cond, exp) - calculates full set of statistical data for variable if condition is true
sum(exp) - sum expressions evaluated for each entry
sumIf(cond, exp) - calculates sum of expressions for which condition is True
superset(value) - Merge values to the set
valueAtMax(test, value) - returns value at maximum of test
valueAtMin(test, value) - returns value at minimum of test