Vocabulary (mappings)
ML-Schema — Machine Learning Schema — provides a model for expressing data mining and machine learning algorithms, datasets, and experiments. This section introduces the core of the ML-Schema model, namely the classes (types) used to represent the majority of cases. (Note: this draft vocabulary has become obsolete and this description will be replaced by a new version shortly.)
Task
Property | Value |
---|---|
Description | A formal description of a process that needs to be completed (e.g. based on inputs and outputs). A Task is any piece of work that needs to be addressed in the data mining process. |
Example Classes | Classification, Regression, Clustering, Feature Selection, Missing value imputation,... |
Example Individuals | Classification on Dataset Iris |
OpenML | TaskType |
DMOP | DM-Task |
OntoDM | "Data Mining Task" |
Exposé | Objective |
MEX | The closest concept is mexcore:ExperimentConfiguration |
OpenML
OpenML differentiates between a TaskType (e.g. classification) and Task instances. The TaskType defines which types of inputs are given (e.g. a dataset, train-test splits, optimization measures) and which outputs are expected (e.g. a model, predictions,...). A Task contains a specific dataset, splits, etc. It can be seen as an individual (instance) of the TaskType class.
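As an illustration only, here is a minimal Turtle sketch of this split; the ex: names (ex:TaskType, ex:taskType, ex:hasInput, ex:expectsOutput) are placeholders invented for this example, not OpenML's or ML-Schema's actual terms.

```turtle
@prefix ex:   <http://example.org/openml#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# A TaskType fixes which kinds of inputs and outputs a task has.
ex:SupervisedClassification a ex:TaskType ;
    rdfs:label "Supervised Classification" .

# A concrete Task fills those slots in for one dataset.
ex:task_59 a ex:Task ;
    ex:taskType      ex:SupervisedClassification ;
    ex:hasInput      ex:dataset_iris , ex:crossValidation_10fold , ex:predictive_accuracy ;
    ex:expectsOutput ex:predictions , ex:model .
```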
DMOP
DM-Task: A task in general is any piece of work that is undertaken or attempted [SUMO]. A DM-Task is any task that needs to be addressed in the data mining process. DMOP's DM-Task hierarchy models all the major task classes. The top level of DMOP's task hierarchy includes: DM-Task, CoreDM-Task, DataProcessingTask, HypothesisApplicationTask, HypothesisEvaluationTask, HypothesisProcessingTask, InductionTask, ModelingTask, DescriptiveModelingTask, PredictiveModelingTask, PatternDiscoveryTask.
OntoDM
OntoDM defines a data mining task as an objective specification that specifies the objective a data mining algorithm needs to achieve when executed on a dataset to produce a generalization as output. It is represented as a subclass of the IAO: objective specification class, where an objective specification is a directive information entity that describes an intended process endpoint. The data mining task is directly dependent on the datatypes of the data examples on which the task is defined, and these are included directly in the task representation. This allows us to represent tasks defined on arbitrarily complex datatypes. The definitions of data mining algorithm and generalization strongly depend on the task definition.
OntoDM contains a taxonomy of data mining tasks. At the first level, we differentiate between four major task classes: predictive modelling task, pattern discovery task, clustering task, and probability distribution estimation task. The predictive modelling task is worked out in more detail. Since a predictive modelling task is defined on a pair of datatypes (one describing the descriptive side of a data example, the other describing its target/output side), we differentiate between primitive output prediction tasks (which include, among others, traditional ML tasks such as classification and regression) and structured output prediction tasks (which include, among others, multi-label classification, multi-target prediction, and hierarchical multi-label classification).
MEX
MEX has a higher level of abstraction: it is designed for representing ML executions and related metadata, not DM tasks. There are specific classes for representing specific ML standards. At a more concise level, though, this information could be obtained from the combination of Learning Problem + Learning Method + Algorithm Class.
Learning Problem
- e.g.: Association, Classification, Clustering, Metaheuristic, Regression, Summarization, ...
Learning Method
- e.g.: Supervised Learning, Unsupervised Learning, Semi-supervised Learning, Reinforcement Learning, ...
Algorithm Class
- ANN, ILP, Bagging, Bayes Theory, Boosting, Clustering, Decision Trees, Genetic Algorithms, Logical Representations, Regression Functions, Rules, Support Vector Networks, ...
EDIT: As an :ExperimentConfiguration may have many :Execution instances and an :Experiment may have many :ExperimentConfiguration instances, these could be aligned to an mls:Task.
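A hedged Turtle sketch of that alignment follows; the linking predicates (ex:hasConfiguration, ex:hasExecution) and the use of skos:closeMatch are assumptions made for illustration, not terms defined by MEX or ML-Schema, and the namespace URIs are indicative only.

```turtle
@prefix mexcore: <http://mex.aksw.org/mex-core#> .
@prefix mls:     <http://www.w3.org/ns/mls#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:      <http://example.org/align#> .

# One experiment groups several configurations; each configuration groups
# several executions. ex:hasConfiguration / ex:hasExecution are placeholder
# predicates, not necessarily the actual MEX terms.
ex:exp1  a mexcore:Experiment ;
    ex:hasConfiguration ex:conf1 , ex:conf2 .
ex:conf1 a mexcore:ExperimentConfiguration ;
    ex:hasExecution ex:run1 , ex:run2 .

# The soft alignment suggested in the note above.
mexcore:ExperimentConfiguration skos:closeMatch mls:Task .
```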
Algorithm?
Property | Value |
---|---|
Description | The algorithm itself, regardless of its software implementation. |
OpenML | None |
DMOP | DM-Algorithm |
OntoDM | "Data Mining Algorithm" |
Exposé | Algorithm Specification |
MEX | mexalgo:Algorithm |
OpenML
OpenML doesn't abstract over algorithms; it just has 'implementations'. We tried this, but it is too hard to maintain: algorithms can be weird hybrids, and can behave differently based on a parameter setting (e.g. Bagged Trees and Random Forests, or gradient boosting and other types of boosting). You also need to look into the code to see what an algorithm is really doing, which is not always possible. Instead, to organize implementations, OpenML has 'tags', so that anybody can tag algorithms with certain keywords. Hence, hybrid algorithms can have multiple tags.
DMOP
DM-Algorithm: An algorithm in general is a well-defined sequence of steps that specifies how to solve a problem or perform a task. It typically accepts an input and produces an output. A DM algorithm is an algorithm that has been designed to perform any of the DM tasks, such as feature selection, missing value imputation, or modeling (induction). The higher-level classes of the DM-Algorithm hierarchy correspond to DM-Task types. Immediately below are broad algorithm families, or what data miners more commonly call paradigms or approaches. The Algorithm hierarchy bottoms out in individual algorithms such as CART, Lasso or ReliefF. A particular case of a DM algorithm is a modeling (or learning) algorithm: a well-defined procedure that takes data as input and produces output in the form of models or patterns.
OntoDM
In OntoDM, we differentiate between three aspects of algorithms: the algorithm as a specification, the algorithm as an implementation, and the process of executing an algorithm. A data mining algorithm (as a specification) is represented as a subclass of IAO: algorithm. In this sense, a data mining algorithm is defined as an algorithm that solves a data mining task and outputs a generalization as a result; it is usually published/described in some document (journal/conference/workshop publication or a technical report).
data mining algorithm _has-part_ data mining task
data mining algorithm _has-part_ generalization specification
data mining algorithm _has-part_ IAO:document
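Purely as an illustration, the same statements can be rendered as Turtle triples; the ex: IRIs are placeholders, not OntoDM's actual identifiers.

```turtle
@prefix ex: <http://example.org/ontodm#> .

# The has-part statements above, rendered as triples
# (ex: IRIs are placeholders, not OntoDM's actual identifiers).
ex:dataMiningAlgorithm
    ex:hasPart ex:dataMiningTask ,
               ex:generalizationSpecification ,
               ex:document .          # stands in for the IAO 'document' class
```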
In OntoDM, we give a higher-level taxonomy of algorithms. At the first level, we differentiate between single generalization algorithms (algorithms that produce a single generalization as a result) and ensemble algorithms (algorithms that produce an ensemble of generalizations as a result). At the second level, the taxonomy follows the taxonomy of tasks. This modular and generic approach allows easy extensions to characterize each algorithm class with its own distinctive set of characteristics, which can be represented as qualities.
MEX
MEX: Sharing the problem stated by OpenML, we label high-level ML algorithm families in the Algorithm class instead of specific algorithm characterisations. When more precise information is needed, related classes can be instantiated, such as Learning Problem + Learning Method + Algorithm Class + Implementation. The #tag component looks like an interesting solution and will also be implemented in the next release.
Implementation
Property | Value |
---|---|
Description | An executable implementation of a machine learning algorithm, script, or workflow. It is versioned, and sometimes belongs to a library (e.g. WEKA) |
Example Classes | LearnerImplementation, DataProcessingImplementation, EvaluationProcedureImplementation |
Example Individuals | SVMlib, weka.J48, rapidminer.RandomForest, weka.evaluation.CrossValidation, weka.attributeSelection.GainRatioAttributeEval |
OpenML | Flow / implementation |
DMOP | DM-Operator / DM-Workflow |
OntoDM | "Data mining algorithm implementation" |
Exposé | AlgorithmImplementation |
MEX | mexalgo:Implementation |
OpenML
OpenML doesn't distinguish 'operators' and 'workflows', because the line is very blurry. Some algorithms have complex internal workflows. Also, many environments (R, Matlab,...) don't have the concept of an operator; they just have function calls, which are part of scripts. Hence, in OpenML, everything is called a Flow, and everything is composite. A flow (algorithm) can have subcomponents, and they in turn can have subcomponents.
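A minimal sketch of how such nesting could look, assuming placeholder ex: names (ex:Flow, ex:hasComponent) rather than OpenML's actual schema:

```turtle
@prefix ex: <http://example.org/openml#> .

# A composite flow: a bagging flow with a tree learner as subcomponent,
# which in turn has its own subcomponent (all ex: names are placeholders).
ex:flow_weka_Bagging a ex:Flow ;
    ex:hasComponent ex:flow_weka_J48 .
ex:flow_weka_J48 a ex:Flow ;
    ex:hasComponent ex:flow_weka_J48_pruning .
```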
DMOP
DMOP: DM-Operator: a programmed, executable implementation of a DM-Algorithm.
OntoDM
In OntoDM, we represent a data mining algorithm implementation as a subclass of OBI: plan; it is a concretization of a data mining algorithm. Data mining algorithm implementations have as qualities parameters, which are described by a parameter specification. A parameter is a quality of an algorithm implementation; it refers to the data provided as input to the algorithm implementation that influences the flow of the execution of the algorithm. The implementation is realized by a data mining operator, which has information about the specific parameter setting used in the execution process.
data mining algorithm implementation _is-concretization-of_ data mining algorithm
data mining algorithm implementation _has-quality_ parameter
parameter specification _is-about_ parameter
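For illustration, the same implementation-level statements as Turtle triples with placeholder ex: IRIs (not OntoDM's actual identifiers):

```turtle
@prefix ex: <http://example.org/ontodm#> .

# The implementation-level statements above as triples (placeholder ex: IRIs).
ex:dataMiningAlgorithmImplementation
    ex:isConcretizationOf ex:dataMiningAlgorithm ;
    ex:hasQuality         ex:parameter .
ex:parameterSpecification
    ex:isAbout ex:parameter .
```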
MEX
MEX: Implementation in MEX is meant to represent the Software Implementation and has no link to the algorithm itself. Examples are Weka, SPSS, Octave, DL-Learner.
HyperParameter
Property | Value |
---|---|
Description | A prior parameter of an implementation (e.g. C, the complexity parameter, in weka.SMO) |
Example Classes | HyperParameter |
Example Individuals | weka.SMO_C, weka.J48_M, rapidminer.RandomForest_number_of_trees |
OpenML | Parameter |
DMOP | OperatorParameter |
OntoDM | "parameter" |
Exposé | ParameterImplementation |
MEX | mexalgo:AlgorithmParameter (mexalgo:HyperParameter under proposal) |
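As a sketch only, one way such a hyperparameter and a concrete setting for it could be written down in Turtle; all ex: names are placeholders, not terms of the (draft) ML-Schema vocabulary.

```turtle
@prefix ex:  <http://example.org/mls-sketch#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# weka.SMO exposes the complexity constant C as a hyperparameter; a run
# would then record a concrete setting for it. All ex: names are
# placeholders, not terms of the draft ML-Schema vocabulary.
ex:weka_SMO a ex:Implementation ;
    ex:hasHyperParameter ex:weka_SMO_C .
ex:weka_SMO_C a ex:HyperParameter .

ex:setting_1 a ex:HyperParameterSetting ;
    ex:specifies ex:weka_SMO_C ;
    ex:hasValue  "1.0"^^xsd:double .
```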
Run
Property | Value |
---|---|
Description | An execution of an implementation on a machine (computer). It is limited in time (it has a start and an end point) and can succeed or fail. |
Example Classes | SimpleProcess |
Example Individuals | Process running SVMlib on Iris on Machine m on timestamp t |
OpenML | Run |
DMOP | DM-Process (i.e., execution) |
OntoDM | "Data mining algorithm execution" |
Exposé | AlgorithmApplication |
MEX | This information is stored in mexcore:Execution (singly mexcore:SingleExecution, collectively mexcore:OverallExecution) |
DMOP
DMOP: DM-Operation: a process in which a DM-Operator is executed. Synonym: DM-OperatorExecution.
OntoDM
In OntoDM, we represent a data mining algorithm execution as a subclass of SWO: information processing, which is an OBI: planned process. Planned processes realize a plan, which is a concretization of a plan specification. A data mining algorithm execution realizes (executes) a data mining operator, has as input a dataset, has as output a generalization, has as agent a computer, and achieves as a planned objective a data mining task.
Data mining operator is a role of a data mining algorithm implementation that is realized (executed) by a data mining algorithm execution process. The data mining operator has information about the specific parameter setting of the algorithm, in the context of the realization of the operator in the process of execution. The parameter setting is an information entity which is a quality specification of a parameter.
data mining algorithm execution _has-agent_ OBI:computer
data mining algorithm execution _achieves-planned-objective_ data mining task
data mining algorithm execution _has-specified-input_ DM-dataset
data mining algorithm execution _has-specified-output_ generalization
data mining algorithm execution _realizes_ data mining operator
data mining operator _role-of_ data mining algorithm implementation
data mining operator _has-information_ parameter setting
parameter setting _is-quality-specification-of_ parameter
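The same execution-level statements, rendered as Turtle purely for illustration (placeholder ex: IRIs, not OntoDM's actual identifiers):

```turtle
@prefix ex: <http://example.org/ontodm#> .

# The execution-level statements above as triples (placeholder ex: IRIs).
ex:dmAlgorithmExecution
    ex:hasAgent                 ex:computer ;      # stands in for OBI 'computer'
    ex:achievesPlannedObjective ex:dataMiningTask ;
    ex:hasSpecifiedInput        ex:dmDataset ;
    ex:hasSpecifiedOutput       ex:generalization ;
    ex:realizes                 ex:dataMiningOperator .
ex:dataMiningOperator
    ex:roleOf         ex:dmAlgorithmImplementation ;
    ex:hasInformation ex:parameterSetting .
ex:parameterSetting
    ex:isQualitySpecificationOf ex:parameter .
```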
Data
Property | Value |
---|---|
Description | ... This needs to be defined better. E.g. can it be a file, or a simple string? |
Example Classes | Dataset, Train-test splits, Predictions |
Example Individuals | Iris |
OpenML | Data |
DMOP | DM-Data |
OntoDM | dataset specification, DM-dataset |
Exposé | Information Content Entity (from BFO) |
MEX | mexcore:Dataset (as metadata) |
DMOP
DMOP: DM-Data: In SUMO, Data is defined as 'an item of factual information derived from measurement or research' [http://sigma.ontologyportal.org:4010/sigma/WordNet.jsp?word=data&POS=1]. In IAO, Data is an alternative term for 'data item' =def 'an information content entity that is intended to be a truthful statement about something (modulo, e.g., measurement precision or other systematic errors) and is constructed/acquired by a method which reliably tends to produce (approximately) truthful statements.' [http://purl.obolibrary.org/obo/IAO_0000027]. In the context of DMOP, DM-Data is the generic term that encompasses different levels of granularity: data can be a whole dataset (one main table and possibly other tables), or only a table, or only a feature (a column of a table), or only an instance (a row of a table), or even a single feature-value pair.
OntoDM
OntoDM imports the IAO class dataset (defined as 'a data item that is an aggregate of other data items of the same type that have something in common') and extends it by further specifying that a DM dataset has as parts data examples. OntoDM-core also defines the class dataset specification to enable the characterization of different dataset classes; it specifies the type of the dataset based on the type of data it contains. In OntoDM, we model the data characteristics with a data specification entity that describes the datatype of the underlying data examples. For this purpose, we import the mechanism for representing arbitrarily complex datatypes from the OntoDT ontology. Using data specifications and the taxonomy of datatypes from the OntoDT ontology, OntoDM-core provides a taxonomy of datasets.
DM-dataset _has-part_ data-example
dataset specification _is-about_ dataset
dataset specification _has-part_ data specification
data specification _is-about_ OntoDT:datatype
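An illustrative instantiation of these statements for the Iris dataset, using placeholder ex:/ontodt: IRIs (the OntoDT datatype shown is invented for the example):

```turtle
@prefix ex:     <http://example.org/ontodm#> .
@prefix ontodt: <http://example.org/ontodt#> .

# The dataset statements above, instantiated for Iris (placeholder IRIs).
ex:irisDataset a ex:DMDataset ;
    ex:hasPart ex:irisExample_1 , ex:irisExample_2 .
ex:irisDatasetSpecification
    ex:isAbout ex:irisDataset ;
    ex:hasPart ex:irisDataSpecification .
ex:irisDataSpecification
    ex:isAbout ontodt:tupleOfPrimitiveDatatypes .
```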
MEX
In MEX, it is possible to represent even each instance (mexcore:Example) and each feature (mexcore:Feature) of the dataset.
ModelRepresentation
Property | Value |
---|---|
Description | A representation (serialization) of a model that can be stored as a file. It needs to be loaded by a program in order to predict. |
Example Classes | WEKA.Classifier,... |
Example Individuals | A description of a decision tree built on Iris as a .model file |
OpenML | Model (subset of Data) |
DMOP | |
OntoDM | "generalization representation" |
Exposé | None |
MEX | None |
DMOP
DMOP: By Hypothesis we actually meant roughly ML models. We introduced the concept of a 'hypothesis' to differentiate ML models from pattern sets. DM-PatternSet: A pattern set, as opposed to a model which by definition has global coverage, is a set of local hypotheses, i.e. each applies to a limited region of the sample space.
OntoDM
We take generalization to denote the outcome of a data mining task. In OntoDM, we consider and model three different aspects of generalizations: the specification of a generalization, a generalization as a realizable entity, and the process of executing a generalization.
In OntoDM, the generalization specification class is a subclass of the OBI class data representational model. It specifies the type of the generalization and includes as part the data specification for the data used to produce the generalization, and the generalization language, for the language in which the generalization is expressed. Examples of generalization language formalisms for the case of a predictive model include the languages of: trees, rules, Bayesian networks, graphical models, neural networks, etc.
generalization specification _has_part_ data specification
generalization specification _has_part_ generalization language
Generalizations have a dual nature. They can be treated as data structures and as such represented, stored and manipulated. On the other hand, they act as functions and are executed, taking as input data examples and giving as output the result of applying the function to a data example. In OntoDM, we define a generalization as a sub-class of the BFO class realizable entity. It is an output from a data mining algorithm execution.
generalization _is-specified-output-of_ data mining algorithm execution
generalization _is-concretization-of_ generalization specification
The dual nature of generalizations in OntoDM is represented with two classes that belong to two different description layers: generalization representation, which is a sub-class of information content entity and belongs to the specification layer, and generalization execution, which is a subclass of planned process and belongs to the application layer.
generalization representation _is-about_ generalization
generalization execution _realizes_ generalization
generalization execution _has-specified-input_ DM-dataset
generalization execution _has-specified-output_ DM-dataset
A generalization representation is a sub-class of the IAO class information content entity. It represents a formalized description of the generalization, for instance in the form of a formula or text. For example, the output of a decision tree algorithm execution in any data mining software usually includes a textual representation of the generated decision tree.
A generalization execution is a sub-class of the OBI class planned process that has as input a dataset and has as output another dataset. The output dataset is a result of applying the generalization to the examples from the input dataset.
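Putting the three aspects together for a decision tree learned on Iris, purely as a sketch with placeholder ex: IRIs (not OntoDM's actual identifiers):

```turtle
@prefix ex: <http://example.org/ontodm#> .

# The three aspects of a generalization for a decision tree learned on Iris
# (placeholder ex: IRIs).
ex:irisTreeSpecification a ex:GeneralizationSpecification ;
    ex:hasPart ex:irisDataSpecification , ex:decisionTreeLanguage .
ex:irisTree a ex:Generalization ;
    ex:isConcretizationOf  ex:irisTreeSpecification ;
    ex:isSpecifiedOutputOf ex:j48ExecutionOnIris .
ex:irisTreeRepresentation a ex:GeneralizationRepresentation ;
    ex:isAbout ex:irisTree .
ex:irisTreeExecution a ex:GeneralizationExecution ;
    ex:realizes           ex:irisTree ;
    ex:hasSpecifiedInput  ex:unseenIrisExamples ;
    ex:hasSpecifiedOutput ex:irisPredictions .
```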
Model
Property | Value |
---|---|
Description | A generalization of a set of training data: a process able to predict values for unseen instances. |
Example Classes | Decision tree, Rule set, Clusterings, Pattern set, Bayesian Network, Neural net, Probability Distribution,... |
Example Individuals | Decision tree built on Iris |
OpenML | None |
DMOP | DM-Hypothesis (with main subclasses: DM-Model, DM-PatternSet) |
OntoDM | generalization |
Exposé | None |
MEX | None |
DMOP
DMOP: By Hypothesis we actually meant roughly ML models. We introduced the concept of a 'hypothesis' to differentiate ML models from pattern sets. DM-PatternSet: A pattern set, as opposed to a model which by definition has global coverage, is a set of local hypotheses, i.e. each applies to a limited region of the sample space.
OntoDM
see under Model representation.
Study
Property | Value |
---|---|
Description | A collection of runs that belong together, used to carry out some kind of analysis on their results. This analysis can be general or very specific (e.g. a hypothesis test). It can be linked to files and data that belong to it. |
Example Classes | BenchmarkStudy |
Example Individuals | Specific collections of runs |
OpenML | Study |
DMOP | DM-Experiment (i.e., something that resembles a bundle in PROV, e.g. prov:Bundle) |
OntoDM | |
Exposé | None |
MEX | mexcore:Experiment |
Note: Do we need a separate concept for a more general study, a collection of experiments meant for understanding, not always to test a hypothesis...
MEX: mexcore:Experiment is like an OpenML Study, but not limited to the sole description.
EvaluationMeasure
Property | Value |
---|---|
Description | A measure to evaluate the performance of a model. |
Example Classes | ClassificationMeasure, RegressionMeasure, ClusteringMeasure, RuntimeMeasure... |
Example Individuals | Predictive_accuracy, root_mean_squared_error, inter_cluster_variance, cputime_training_milliseconds |
OpenML | EvaluationMeasure |
DMOP | HypothesisEvaluationMeasure |
OntoDM | |
Exposé | Function |
MEX | mexperf:PerformanceMeasure |
DMOP: there is a concept 'Measure' in DMOP, but it seems broader than that; e.g. it has subclasses ComputationalComplexityMeasure, HypothesisEvaluationMeasure, and ModelComplexityMeasure.
EliminatedConcepts
Concepts that we probably don't need
Workflow?
Property | Value |
---|---|
Description | ... |
OpenML | Flow |
DMOP | DM-Workflow (i.e., specification) |
OntoDM | 'data mining workflow' |
Exposé | None |
MEX | N/A |
See remark above about OpenML and flows
OntoDM
In OntoDM, we represent data mining workflows under three aspects: data mining scenario (as a specification of the workflow), data mining workflow (as an implementation), and data mining workflow execution (as a process). In OntoDM-core, a data mining scenario is an extension of the OBI class protocol. It includes as parts other information entities such as: title of the scenario, scenario description, author of the scenario, and document. From the protocol class it also inherits as parts an objective specification and an action specification. A data mining workflow is a concretization of a data mining scenario, and extends the plan entity (defined by OBI). Finally, a data mining workflow is realized (executed) through a data mining workflow execution process.
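A sketch of this scenario / workflow / execution triad in Turtle, again with placeholder ex: IRIs rather than OntoDM's actual identifiers:

```turtle
@prefix ex: <http://example.org/ontodm#> .

# Scenario (specification), workflow (implementation), execution (process),
# all with placeholder ex: IRIs.
ex:crossValidationScenario a ex:DataMiningScenario ;
    ex:hasPart ex:scenarioTitle , ex:scenarioDescription ,
               ex:objectiveSpecification , ex:actionSpecification .
ex:crossValidationWorkflow a ex:DataMiningWorkflow ;
    ex:isConcretizationOf ex:crossValidationScenario .
ex:workflowExecution_42 a ex:DataMiningWorkflowExecution ;
    ex:realizes ex:crossValidationWorkflow .
```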