Mining Query language

Mining query language realizes pipeline processing concept.

  SOURCE-COMMAND | COMMAND1 | COMMAND2 .... | DESTINATION-COMMAND

Source command identifies data source, destination command identifies data target and chain commands define data mining instructions.

Each miner command starts with command name that is denoted by upper case keyword (e.g. READ, WRITE, SELECT)

To get help information about commands just type

HELP <command-name>

or press F1 in command context

Miner and Python

If command accepts expression, like SELECT, then any valid python expression is accepted with following exceptions/additions:

inline if should be inserted in C notation - <condition>? <on-true>: <on-false>
Regular expression matching can be used: "string" =~ "regular-expression" or for non match "string" !~ "regular-expression"
Miner allows to use advanced constants like 10K, 2M, 2.5G, 3T, 10KB, 20MB, 1.5GB, 1000TB which are converted to integers or floats automatically.
Time constants: 1h = 3600, 1m = 60 Python variable can be created using SET command:

>>> SET c = 1TB
>>> SET l = [1KB, 1MB, 1GB, 1TB]
>>> SET d = {"hour": 1h, "min": 1m }

Then you can use PRINT command to print the expression:

PRINT c, l, d

and EVAL command to evaluate expression without returning its result:

EVAL l.reverse()

To delete variables

DEL c
DEL l[1]
DEL d["hour"]

To import external python module:

IMPORT mymodule
IMPORT mymodule.mysubmodule as mm

Source commands

READ

READ command is used to read data from files. Data is read record by record, each record consisting of set of named fields. Type of data in file is determined implicitly from file extension or explicitly using following format: READ<data-type> data.dat

READ command can specify single file to read, multiple files with glob pattern(*, ?, []), sub-tree of files or compressed sources:

READ<csv> *.dat | ...
READ json-tree-root/.../*.json | ...
READ file??.csv.gz | ...

Following syntax dumps tar archive as single csv stream

READ<csv> file.tar.gz | ...

Following syntax extracts json files from tar archive

READ<json> file.tar.gz:*.json | ...

DB FETCH

DB FETCH command is used to fetch data from databases (see Working With Database)

DB connection FETCH 'SELECT * FROM table' | ...
SET queryString = "SELECT * FROM table WHERE i=%s or j=%s"
DB connection FETCH queryString WITH param1, param2 | ...

Connection is an opened connection to a database Second format allows to specify external parameters to the query

ITERATE

ITERATE command allows to process data stored in memory, e.g. python lists or dictionaries For example if myList is list of 2-tuples you can do following:

ITERATE a,b in myList | ...
ITERATE i in xrange(1000) | ...

If d is python dictionary or dictionary-like object you can do

ITERATE key,value in d.iteritems() | ...

Destination commands

WRITE

WRITE command is used to write data to files and/or to generate reports

... | WRITE out.csv
... | WRITE<csv> out.dat
... | WRITE out.xslx

STDOUT

STDOUT command is used to dump pretty-formatted output to the screen. It supports formatting parameters and allows redirect to file

... | STDOUT
... | STDOUT i='x' j='i' _='s'
... | STDOUT > file.txt

Second option specifies that output format of i should be hexadecimal, output format of j should be integer and all rest of fields should be represented as strings

DB PUSH

DB PUSH is used to insert/update data into the database

ITERATE i in xrange(100) |  SELECT i, i**2 as j | DB connection PUSH 'insert into table(i,j) values(%s,%s)' WITH i, j

STORE

STORE command used to store results of miner in memory.

There are 3 forms of usage of this command.

If you have single row of output (usually after FOR SELECT ... command) and want to store its result in some variables:

>>> READ file.csv | FOR SELECT concat(a), sum(b+c) as d | STORE a as a_list, d
-- Destination a_list, d
-- Mining file.csv ...
Processed 3 coals into 1 diamonds for 0.001 seconds, 0 coals/sec.
>>> PRINT a_list, d
['A', ' ', ','] 10.5

To store list of results:

>>> READ file.csv | STORE a in a_list
-- Destination a_list
-- Mining file.csv ...
Processed 3 coals into 3 diamonds for 0.001 seconds, 0 coals/sec.
>>> PRINT a_list
['A', ' ', ',']

To store data in dictionary

>>> READ file.csv | STORE a:(b+c) in a_dict
-- Destination a_dict
-- Mining file.csv ...
Processed 3 coals into 3 diamonds for 0.001 seconds, 0 coals/sec.
>>> PRINT a_dict
{'A': 3, ' ': 7.5, ',': 0}

If the dictionary exists it is filled with new items and not recreated by STORE command. This for example may lead to errors if the object type is not dictionary at all.

Mining Commands

SELECT

SELECT command is a general method to apply transformation on each record (SELECT can't be used for aggregations):

ITERATE i in xrange(100) | SELECT i, i**2 as square_of_i, mm.my_func(i) as mymodule_myfunc_i | STDOUT

SELECT effectively replaces old fields with new fields defined inside select. You would probably need to set names for new expression. This can be done using as modifier (no quotes allowed around new name) If you want to keep all old fields, still adding new ones you can use '*' e.g. SELECT *, i**2 as square_of_i It is possible to generate automatic counter inside select using following format:

# pre-increment (starts from 1)
SELECT ++@id, *
# post increment (starting from zero)
SELECT @id++, *
# general counter
SELECT @id+=i, *

DISTINCT

Similar to select, but extracts only distinct values

ITERATE i in xrange(20) | DISTINCT (i/10)*10 as i | STDOUT

The query above will return 2 values: 0, and 10

IF

IF is the basic filtering command:

ITERATE i in xrange(20) | IF i%2 == 0 | STDOUT

LIMIT

LIMIT number records to process

READ a.csv | LIMIT 10 | STDOUT

Or limit until expression evaluates to True

ITERATE i in xrange(1000) | LIMIT IF i<10 | STDOUT

TAIL

Propagates only last items

ITERATE i in xrange(1000) | TAIL 10 | STDOUT

SORTBY

SORTBY <expression> [ASC|DESC] Sorts records by general python expression. Supports numbers, string or any ordered data type. Default sort order is ascending.

ITERATE i in xrange(20) | SORTBY (i-10)**2 | STDOUT

Sort by group of expressions is allowed, e.g. SORT i,j,k DESC

Complexity of sort is O(n) in size and O(n*log(n)) in time where n is the size of input.

TOP/BOTTOM

TOP/BOTTOM <number> <expression>

Returns at most of top/bottom records ordered by expression, e.g.:

ITERATE i in xrange(20) | BOTTOM 4 (i-10)**2 | STDOUT

Complexity of operation is O(m) in size and O(n*log(m)) in time where n is the size of input and m is the size of subset.

Aggregation

Aggregation commands are used to calculate different values for set of records. There are few types of commands each starting with FOR keyword. The simplest form is:

ITERATE i in xrange(20) | FOR SELECT sum(i) as sum_i, avg(i) as avg_i | STDOUT

The query above generates single output record with 2 entries one for sum and one for average. sum() and avg() operators are called aggregators. There is wide selection of available aggregators and you can add your if they follow some api see below.

Second form of aggregation is GROUP BY in SQL notation. In miner query language you do it as following:

... | FOR DISTINCT x, y SELECT sum(z) |...

Here the Grouping is identified by list of expressions x, y. After SELECT keyword aggregation expressions are specified. Command above will generate one record for each distinct set of x,y.

Third format is used to create histograms:

ITERATE i in xrange(50) | FOR EACH 10 OF i SELECT sum(i) as sum_i | STDOUT

This will split i range in bins of 10 and calculate aggregation expressions for each bin. The full format of FOR EACH command is:

FOR EACH <bin-size> [IN <from>, <to>] OF <expr> [as <name] SELECT <agg> exp1 [as name1], ...

Fourth Aggregation format is FOR IN, which allows to specify explicitly set of values and order for grouping:

... | FOR month IN [`jan`, `feb`] SELECT sum(days) | ...

Aggregators

Available aggregators are:

append(value)             - appends values to a list
avg(exp)                  - average value of expression
avgIf(cond,exp)           - calculates average value of expression for which condition is True
concat(list)              - Concatenates lists to a single one
count(exp)                - counts all occurrences when expression evaluates to True
first(exp)                - returns first seen value of expression
fraction(exp)             - returns fraction of all occurrences when expression evaluates to True
fractionSum(cond, exp)    - returns fraction of sum of expressions for which cond is True over total sum
last(exp)                 - returns last seen value of expression
max(exp)                  - returns maximum of all elements
min(exp)                  - returns minimum of all elements
number(exp)               - counts number of distinct values of expression
number(cond, exp)         - counts of distinct values of expression for which condition evaluates to true
rate(period)(value)       - gets the rates of the value over defined period, e.g. rate(1h)(hits)
rateIf(period)(cond,value)- gets the rates of the value over defined period filtered by the condition
ratio(nominator,denominator) - returns sum(nominator)/sum(denominator)
segments(start, size)     - merges segments into a continuous ones, returns aggregate.Segments object
stats(value)              - calculates full set of statistical data (min, max, avg, ...) for variable
statsIf(cond, exp)        - calculates full set of statistical data for variable if condition is true
sum(exp)                  - sum expressions evaluated for each entry
sumIf(cond, exp)          - calculates sum of expressions for which condition is True
superset(value)           - Merge values to the set
valueAtMax(test, value)   - returns value at maximum of test
valueAtMin(test, value)   - returns value at minimum of test

Miner Query Language - minersoft/miner GitHub Wiki

Mining Query language

Miner and Python

Source commands

READ

DB FETCH

ITERATE

Destination commands

WRITE

STDOUT

DB PUSH

STORE

Mining Commands

SELECT

DISTINCT

IF

LIMIT

TAIL

SORTBY

TOP/BOTTOM

Aggregation

Aggregators

⚠️ GitHub.com Fallback ⚠️

Miner Query Language - minersoft/miner GitHub Wiki

Mining Query language

Miner and Python

Source commands

READ

DB FETCH

ITERATE

Destination commands

WRITE

STDOUT

DB PUSH

STORE

Mining Commands

SELECT

DISTINCT

IF

LIMIT

TAIL

SORTBY

TOP/BOTTOM

Aggregation

Aggregators

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️