Vocabulary (mappings)
ML-Schema — Machine Learning Schema — provides a model for expressing data mining and machine learning algorithms, datasets, and experiments. This section introduces the core of the ML-Schema model, namely the classes (types) used to represent the majority of cases. (Note: this draft vocabulary has become obsolete and this description will be replaced by a new version shortly.)
Task
Property | Value |
---|---|
Description | A formal description of a process that needs to be completed (e.g. based on inputs and outputs). A Task is any piece of work that needs to be addressed in the data mining process. |
Example Classes | Classification, Regression, Clustering, Feature Selection, Missing value imputation,... |
Example Individuals | Classification on Dataset Iris |
OpenML | TaskType |
DMOP | DM-Task |
OntoDM | "Data Mining Task" |
Exposé | Objective |
MEX | The closest concept is mexcore:ExperimentConfiguration |
OpenML
OpenML differentiates between a TaskType (e.g. classification) and Task instances. The TaskType defines which types of inputs are given (e.g. a dataset, train-test splits, optimization measures) and which outputs are expected (e.g. a model, predictions,...). A Task contains a specific dataset, splits, etc. It can be seen as an individual (instance) of the TaskType class.
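As an illustration only, here is a minimal Turtle sketch of this split; the ex: names (ex:TaskType, ex:taskType, ex:hasInput, ex:expectsOutput) are placeholders invented for this example, not OpenML's or ML-Schema's actual terms.

```turtle
@prefix ex:   <http://example.org/openml#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# A TaskType fixes which kinds of inputs and outputs a task has.
ex:SupervisedClassification a ex:TaskType ;
    rdfs:label "Supervised Classification" .

# A concrete Task fills those slots in for one dataset.
ex:task_59 a ex:Task ;
    ex:taskType      ex:SupervisedClassification ;
    ex:hasInput      ex:dataset_iris , ex:crossValidation_10fold , ex:predictive_accuracy ;
    ex:expectsOutput ex:predictions , ex:model .
```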
DMOP
DM-Task: A task in general is any piece of work that is undertaken or attempted [SUMO]. A DM-Task is any task that needs to be addressed in the data mining process. DMOP's DM-Task hierarchy models all the major task classes. The top level of DMOP's task hierarchy includes: DM-Task, CoreDM-Task, DataProcessingTask, HypothesisApplicationTask, HypothesisEvaluationTask, HypothesisProcessingTask, InductionTask, ModelingTask, DescriptiveModelingTask, PredictiveModelingTask, PatternDiscoveryTask.
OntoDM
OntoDM defines a data mining task as an objective specification that specifies the objective a data mining algorithm needs to achieve when executed on a dataset to produce a generalization as output. It is represented as a subclass of the IAO: objective specification class, where an objective specification is a directive information entity that describes an intended process endpoint. The data mining task is directly dependent on the datatypes of the data examples on which the task is defined, and these are included directly in the task representation. This allows us to represent tasks defined on arbitrarily complex datatypes. The definitions of data mining algorithm and generalization strongly depend on the task definition.
OntoDM contains a taxonomy of data mining tasks. At the first level, we differentiate between four major task classes: predictive modelling task, pattern discovery task, clustering task, and probability distribution estimation task. The predictive modelling task is worked out in more detail. Since a predictive modelling task is defined on a pair of datatypes (one describing the descriptive side of a data example, the other describing its target/output side), we differentiate between primitive output prediction tasks (which include, among others, traditional ML tasks such as classification and regression) and structured output prediction tasks (which include, among others, multi-label classification, multi-target prediction, and hierarchical multi-label classification).
MEX
MEX has a higher level of abstraction: it is designed for representing ML executions and related metadata, not DM tasks. There are specific classes for representing specific ML standards. At a more concise level, though, this information could be obtained from the combination of Learning Problem + Learning Method + Algorithm Class.
Learning Problem
- e.g.: Association, Classification, Clustering, Metaheuristic, Regression, Summarization, ...
Learning Method
- e.g.: Supervised Learning, Unsupervised Learning, Semi-supervised Learning, Reinforcement Learning, ...
Algorithm Class
- ANN, ILP, Bagging, Bayes Theory, Boosting, Clustering, Decision Trees, Genetic Algorithms, Logical Representations, Regression Functions, Rules, Support Vector Networks, ...
EDIT: As an :ExperimentConfiguration may have many :Execution instances and an :Experiment may have many :ExperimentConfiguration instances, these could be aligned to an mls:Task.
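A hedged Turtle sketch of that alignment follows; the linking predicates (ex:hasConfiguration, ex:hasExecution) and the use of skos:closeMatch are assumptions made for illustration, not terms defined by MEX or ML-Schema, and the namespace URIs are indicative only.

```turtle
@prefix mexcore: <http://mex.aksw.org/mex-core#> .
@prefix mls:     <http://www.w3.org/ns/mls#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:      <http://example.org/align#> .

# One experiment groups several configurations; each configuration groups
# several executions. ex:hasConfiguration / ex:hasExecution are placeholder
# predicates, not necessarily the actual MEX terms.
ex:exp1  a mexcore:Experiment ;
    ex:hasConfiguration ex:conf1 , ex:conf2 .
ex:conf1 a mexcore:ExperimentConfiguration ;
    ex:hasExecution ex:run1 , ex:run2 .

# The soft alignment suggested in the note above.
mexcore:ExperimentConfiguration skos:closeMatch mls:Task .
```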
Algorithm?
Property | Value |
---|---|
Description | The algorithm itself, regardless of its software implementation. |
OpenML | None |
DMOP | DM-Algorithm |
OntoDM | "Data Mining Algorithm" |
Exposé | Algorithm Specification |
MEX | mexalgo:Algorithm |
OpenML
OpenML doesn't abstract over algorithms; it just has 'implementations'. We tried this, but it is too hard to maintain: algorithms can be weird hybrids, and can behave differently based on a parameter setting (e.g. Bagged Trees and Random Forests, or gradient boosting and other types of boosting). You also need to look into the code to see what an algorithm is really doing, which is not always possible. Instead, to organize implementations, OpenML has 'tags', so that anybody can tag algorithms with certain keywords. Hence, hybrid algorithms can have multiple tags.
DMOP
DM-Algorithm: An algorithm in general is a well-defined sequence of steps that specifies how to solve a problem or perform a task. It typically accepts an input and produces an output. A DM algorithm is an algorithm that has been designed to perform any of the DM tasks, such as feature selection, missing value imputation, or modeling (induction). The higher-level classes of the DM-Algorithm hierarchy correspond to DM-Task types. Immediately below are broad algorithm families, or what data miners more commonly call paradigms or approaches. The Algorithm hierarchy bottoms out in individual algorithms such as CART, Lasso or ReliefF. A particular case of a DM algorithm is a modeling (or learning) algorithm: a well-defined procedure that takes data as input and produces output in the form of models or patterns.
OntoDM
In OntoDM, we differentiate between three aspects of algorithms: the algorithm as a specification, the algorithm as an implementation, and the process of executing an algorithm. A data mining algorithm (as a specification) is represented as a subclass of IAO: algorithm. In this sense, a data mining algorithm is defined as an algorithm that solves a data mining task and outputs a generalization as a result; it is usually published/described in some document (journal/conference/workshop publication or a technical report).
data mining algorithm _has-part_ data mining task
data mining algorithm _has-part_ generalization specification
data mining algorithm _has-part_ IAO:document
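Purely as an illustration, the same statements can be rendered as Turtle triples; the ex: IRIs are placeholders, not OntoDM's actual identifiers.

```turtle
@prefix ex: <http://example.org/ontodm#> .

# The has-part statements above, rendered as triples
# (ex: IRIs are placeholders, not OntoDM's actual identifiers).
ex:dataMiningAlgorithm
    ex:hasPart ex:dataMiningTask ,
               ex:generalizationSpecification ,
               ex:document .          # stands in for the IAO 'document' class
```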
In OntoDM, we give a higher-level taxonomy of algorithms. At the first level, we differentiate between single generalization algorithms (algorithms that produce a single generalization as a result) and ensemble algorithms (algorithms that produce an ensemble of generalizations as a result). At the second level, the taxonomy follows the taxonomy of tasks. This modular and generic approach allows easy extensions to characterize each algorithm class with its own distinctive set of characteristics, which can be represented as qualities.
MEX
MEX: Sharing the problem stated by OpenML, we label high-level ML algorithm families in the Algorithm class instead of specific algorithm characterisations. When more precise information is needed, related classes can be instantiated, such as Learning Problem + Learning Method + Algorithm Class + Implementation. The #tag component looks like an interesting solution and will also be implemented in the next release.
Implementation
Property | Value |
---|---|
Description | An executable implementation of a machine learning algorithm, script, or workflow. It is versioned, and sometimes belongs to a library (e.g. WEKA) |
Example Classes | LearnerImplementation, DataProcessingImplementation, EvaluationProcedureImplementation |
Example Individuals | SVMlib, weka.J48, rapidminer.RandomForest, weka.evaluation.CrossValidation, weka.attributeSelection.GainRatioAttributeEval |
OpenML | Flow / implementation |
DMOP | DM-Operator / DM-Workflow |
OntoDM | "Data mining algorithm implementation" |
Exposé | AlgorithmImplementation |
MEX | mexalgo:Implementation |
OpenML
OpenML doesn't distinguish 'operators' and 'workflows', because the line is very blurry. Some algorithms have complex internal workflows. Also, many environments (R, Matlab,...) don't have the concept of an operator; they just have function calls, which are part of scripts. Hence, in OpenML, everything is called a Flow, and everything is composite. A flow (algorithm) can have subcomponents, and they in turn can have subcomponents.
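A minimal sketch of how such nesting could look, assuming placeholder ex: names (ex:Flow, ex:hasComponent) rather than OpenML's actual schema:

```turtle
@prefix ex: <http://example.org/openml#> .

# A composite flow: a bagging flow with a tree learner as subcomponent,
# which in turn has its own subcomponent (all ex: names are placeholders).
ex:flow_weka_Bagging a ex:Flow ;
    ex:hasComponent ex:flow_weka_J48 .
ex:flow_weka_J48 a ex:Flow ;
    ex:hasComponent ex:flow_weka_J48_pruning .
```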
DMOP
DMOP: DM-Operator: a programmed, executable implementation of a DM-Algorithm.
OntoDM
In OntoDM, we represent a data mining algorithm implementation as a subclass of OBI: plan; it is a concretization of a data mining algorithm. Data mining algorithm implementations have as qualities parameters, which are described by a parameter specification. A parameter is a quality of an algorithm implementation; it refers to the data provided as input to the algorithm implementation that influences the flow of the execution of the algorithm. The implementation is realized by a data mining operator, which has information about the specific parameter setting used in the execution process.
data mining algorithm implementation _is-concretization-of_ data mining algorithm
data mining algorithm implementation _has-quality_ parameter
parameter specification _is-about_ parameter
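For illustration, the same implementation-level statements as Turtle triples with placeholder ex: IRIs (not OntoDM's actual identifiers):

```turtle
@prefix ex: <http://example.org/ontodm#> .

# The implementation-level statements above as triples (placeholder ex: IRIs).
ex:dataMiningAlgorithmImplementation
    ex:isConcretizationOf ex:dataMiningAlgorithm ;
    ex:hasQuality         ex:parameter .
ex:parameterSpecification
    ex:isAbout ex:parameter .
```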
MEX
MEX: Implementation in MEX is meant to represent the Software Implementation and has no link to the algorithm itself. Examples are Weka, SPSS, Octave, DL-Learner.
HyperParameter
Property | Value |
---|---|
Description | A prior parameter of an implementation (e.g. C, the complexity parameter, in weka.SMO) |
Example Classes | HyperParameter |
Example Individuals | weka.SMO_C, weka.J48_M, rapidminer.RandomForest_number_of_trees |
OpenML | Parameter |
DMOP | OperatorParameter |
OntoDM | "parameter" |
Exposé | ParameterImplementation |
MEX | mexalgo:AlgorithmParameter (mexalgo:HyperParameter under proposal) |
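As a sketch only, one way such a hyperparameter and a concrete setting for it could be written down in Turtle; all ex: names are placeholders, not terms of the (draft) ML-Schema vocabulary.

```turtle
@prefix ex:  <http://example.org/mls-sketch#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# weka.SMO exposes the complexity constant C as a hyperparameter; a run
# would then record a concrete setting for it. All ex: names are
# placeholders, not terms of the draft ML-Schema vocabulary.
ex:weka_SMO a ex:Implementation ;
    ex:hasHyperParameter ex:weka_SMO_C .
ex:weka_SMO_C a ex:HyperParameter .

ex:setting_1 a ex:HyperParameterSetting ;
    ex:specifies ex:weka_SMO_C ;
    ex:hasValue  "1.0"^^xsd:double .
```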
Run
Property | Value |
---|---|
Description | An execution of an implementation on a machine (computer). It is limited in time (it has a start and an end point) and can succeed or fail. |
Example Classes | SimpleProcess |
Example Individuals | Process running SVMlib on Iris on Machine m on timestamp t |
OpenML | Run |
DMOP | DM-Process (i.e., execution) |
OntoDM | "Data mining algorithm execution" |
Exposé | AlgorithmApplication |
MEX | This information is stored in mexcore:Execution (singly mexcore:SingleExecution, collectively mexcore:OverallExecution) |
DMOP
DMOP: DM-Operation: a process in which a DM-Operator is executed. Synonym: DM-OperatorExecution.
OntoDM
In OntoDM, we represent a data mining algorithm execution as a subclass of SWO: information processing, which is an OBI: planned process. Planned processes realize a plan, which is a concretization of a plan specification. A data mining algorithm execution realizes (executes) a data mining operator, has as input a dataset, has as output a generalization, has as agent a computer, and achieves as a planned objective a data mining task.
Data mining operator is a role of a data mining algorithm implementation that is realized (executed) by a data mining algorithm execution process. The data mining operator has information about the specific parameter setting of the algorithm, in the context of the realization of the operator in the process of execution. The parameter setting is an information entity which is a quality specification of a parameter.
data mining algorithm execution _has-agent_ OBI:computer
data mining algorithm execution _achieves-planned-objective_ data mining task
data mining algorithm execution _has-specified-input_ DM-dataset
data mining algorithm execution _has-specified-output_ generalization
data mining algorithm execution _realizes_ data mining operator
data mining operator _role-of_ data mining algorithm implementation
data mining operator _has-information_ parameter setting
parameter setting _is-quality-specification-of_ parameter
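The same execution-level statements, rendered as Turtle purely for illustration (placeholder ex: IRIs, not OntoDM's actual identifiers):

```turtle
@prefix ex: <http://example.org/ontodm#> .

# The execution-level statements above as triples (placeholder ex: IRIs).
ex:dmAlgorithmExecution
    ex:hasAgent                 ex:computer ;      # stands in for OBI 'computer'
    ex:achievesPlannedObjective ex:dataMiningTask ;
    ex:hasSpecifiedInput        ex:dmDataset ;
    ex:hasSpecifiedOutput       ex:generalization ;
    ex:realizes                 ex:dataMiningOperator .
ex:dataMiningOperator
    ex:roleOf         ex:dmAlgorithmImplementation ;
    ex:hasInformation ex:parameterSetting .
ex:parameterSetting
    ex:isQualitySpecificationOf ex:parameter .
```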
Data
Property | Value |
---|---|
Description | ... This needs to be defined better. E.g. can it be a file, or a simple string? |
Example Classes | Dataset, Train-test splits, Predictions |
Example Individuals | Iris |
OpenML | Data |
DMOP | DM-Data |
OntoDM | dataset specification, DM-dataset |
Exposé | Information Content Entity (from BFO) |
MEX | mexcore:Dataset (as metadata) |
DMOP
DMOP: DM-Data: In SUMO, Data is defined as 'an item of factual information derived from measurement or research' [http://sigma.ontologyportal.org:4010/sigma/WordNet.jsp?word=data&POS=1]. In IAO, Data is an alternative term for 'data item' =def 'an information content entity that is intended to be a truthful statement about something (modulo, e.g., measurement precision or other systematic errors) and is constructed/acquired by a method which reliably tends to produce (approximately) truthful statements.' [http://purl.obolibrary.org/obo/IAO_0000027]. In the context of DMOP, DM-Data is the generic term that encompasses different levels of granularity: data can be a whole dataset (one main table and possibly other tables), or only a table, or only a feature (a column of a table), or only an instance (a row of a table), or even a single feature-value pair.
OntoDM
OntoDM imports the IAO class dataset (defined as 'a data item that is an aggregate of other data items of the same type that have something in common') and extends it by further specifying that a DM dataset has as parts data examples. OntoDM-core also defines the class dataset specification to enable the characterization of different dataset classes; it specifies the type of the dataset based on the type of data it contains. In OntoDM, we model the data characteristics with a data specification entity that describes the datatype of the underlying data examples. For this purpose, we import the mechanism for representing arbitrarily complex datatypes from the OntoDT ontology. Using data specifications and the taxonomy of datatypes from the OntoDT ontology, OntoDM-core provides a taxonomy of datasets.
DM-dataset _has-part_ data-example
dataset specification _is-about_ dataset
dataset specification _has-part_ data specification
data specification _is-about_ OntoDT:datatype
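An illustrative instantiation of these statements for the Iris dataset, using placeholder ex:/ontodt: IRIs (the OntoDT datatype shown is invented for the example):

```turtle
@prefix ex:     <http://example.org/ontodm#> .
@prefix ontodt: <http://example.org/ontodt#> .

# The dataset statements above, instantiated for Iris (placeholder IRIs).
ex:irisDataset a ex:DMDataset ;
    ex:hasPart ex:irisExample_1 , ex:irisExample_2 .
ex:irisDatasetSpecification
    ex:isAbout ex:irisDataset ;
    ex:hasPart ex:irisDataSpecification .
ex:irisDataSpecification
    ex:isAbout ontodt:tupleOfPrimitiveDatatypes .
```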
MEX
In MEX, it is possible to represent even each instance (mexcore:Example) and each feature (mexcore:Feature) of the dataset.
ModelRepresentation
Property | Value |
---|---|
Description | A representation (serialization) of a model that can be stored as a file. It needs to be loaded by a program in order to predict. |
Example Classes | WEKA.Classifier,... |
Example Individuals | A description of a decision tree built on Iris as a .model file |
OpenML | Model (subset of Data) |
DMOP | |
OntoDM | "generalization representation" |
Exposé | None |
MEX | None |
DMOP
DMOP: By Hypothesis we actually meant roughly ML models. We introduced the concept of a 'hypothesis' to differentiate ML models from pattern sets. DM-PatternSet: A pattern set, as opposed to a model which by definition has global coverage, is a set of local hypotheses, i.e. each applies to a limited region of the sample space.
OntoDM
We take generalization to denote the outcome of a data mining task. In OntoDM, we consider and model three different aspects of generalizations: the specification of a generalization, a generalization as a realizable entity, and the process of executing a generalization.
In OntoDM, the generalization specification class is a subclass of the OBI class data representational model. It specifies the type of the generalization and includes as part the data specification for the data used to produce the generalization, and the generalization language, for the language in which the generalization is expressed. Examples of generalization language formalisms for the case of a predictive model include the languages of: trees, rules, Bayesian networks, graphical models, neural networks, etc.
generalization specification _has_part_ data specification
generalization specification _has_part_ generalization language
Generalizations have a dual nature. They can be treated as data structures and as such represented, stored and manipulated. On the other hand, they act as functions and are executed, taking as input data examples and giving as output the result of applying the function to a data example. In OntoDM, we define a generalization as a sub-class of the BFO class realizable entity. It is an output from a data mining algorithm execution.
generalization _is-specified-output-of_ data mining algorithm execution
generalization _is-concretization-of_ generalization specification
The dual nature of generalizations in OntoDM is represented with two classes that belong to two different description layers: generalization representation, which is a sub-class of information content entity and belongs to the specification layer, and generalization execution, which is a subclass of planned process and belongs to the application layer.
generalization representation _is-about_ generalization
generalization execution _realizes_ generalization
generalization execution _has-specified-input_ DM-dataset
generalization execution _has-specified-output_ DM-dataset
A generalization representation is a sub-class of the IAO class information content entity. It represents a formalized description of the generalization, for instance in the form of a formula or text. For example, the output of a decision tree algorithm execution in any data mining software usually includes a textual representation of the generated decision tree.
A generalization execution is a sub-class of the OBI class planned process that has as input a dataset and has as output another dataset. The output dataset is a result of applying the generalization to the examples from the input dataset.
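Putting the three aspects together for a decision tree learned on Iris, purely as a sketch with placeholder ex: IRIs (not OntoDM's actual identifiers):

```turtle
@prefix ex: <http://example.org/ontodm#> .

# The three aspects of a generalization for a decision tree learned on Iris
# (placeholder ex: IRIs).
ex:irisTreeSpecification a ex:GeneralizationSpecification ;
    ex:hasPart ex:irisDataSpecification , ex:decisionTreeLanguage .
ex:irisTree a ex:Generalization ;
    ex:isConcretizationOf  ex:irisTreeSpecification ;
    ex:isSpecifiedOutputOf ex:j48ExecutionOnIris .
ex:irisTreeRepresentation a ex:GeneralizationRepresentation ;
    ex:isAbout ex:irisTree .
ex:irisTreeExecution a ex:GeneralizationExecution ;
    ex:realizes           ex:irisTree ;
    ex:hasSpecifiedInput  ex:unseenIrisExamples ;
    ex:hasSpecifiedOutput ex:irisPredictions .
```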
Model
Property | Value |
---|---|
Description | A generalization of a set of training data: a process able to predict values for unseen instances. |
Example Classes | Decision tree, Rule set, Clusterings, Pattern set, Bayesian Network, Neural net, Probability Distribution,... |
Example Individuals | Decision tree built on Iris |
OpenML | None |
DMOP | DM-Hypothesis (with main subclasses: DM-Model, DM-PatternSet) |
OntoDM | generalization |
Exposé | None |
MEX | None |
DMOP
DMOP: By Hypothesis we actually meant roughly ML models. We introduced the concept of a 'hypothesis' to differentiate ML models from pattern sets. DM-PatternSet: A pattern set, as opposed to a model which by definition has global coverage, is a set of local hypotheses, i.e. each applies to a limited region of the sample space.
OntoDM
see under Model representation.
Study
Property | Value |
---|---|
Description | A collection of runs that belong together, used to carry out some kind of analysis on their results. This analysis can be general or very specific (e.g. a hypothesis test). It can be linked to files and data that belong to it. |
Example Classes | BenchmarkStudy |
Example Individuals | Specific collections of runs |
OpenML | Study |
DMOP | DM-Experiment (i.e., something that resembles a bundle in PROV, e.g. prov:Bundle) |
OntoDM | |
Exposé | None |
MEX | mexcore:Experiment |
Note: Do we need a separate concept for a more general study, a collection of experiments meant for understanding, not always to test a hypothesis...
MEX: mexcore:Experiment is like an OpenML Study, but not limited to the sole description.
EvaluationMeasure
Property | Value |
---|---|
Description | A measure to evaluate the performance of a model. |
Example Classes | ClassificationMeasure, RegressionMeasure, ClusteringMeasure, RuntimeMeasure... |
Example Individuals | Predictive_accuracy, root_mean_squared_error, inter_cluster_variance, cputime_training_milliseconds |
OpenML | EvaluationMeasure |
DMOP | HypothesisEvaluationMeasure |
OntoDM | |
Exposé | Function |
MEX | mexperf:PerformanceMeasure |
DMOP: there is a concept 'Measure' in DMOP, but it seems broader than that; e.g. it has subclasses ComputationalComplexityMeasure, HypothesisEvaluationMeasure, and ModelComplexityMeasure.
EliminatedConcepts
Concepts that we probably don't need
Workflow?
Property | Value |
---|---|
Description | ... |
OpenML | Flow |
DMOP | DM-Workflow (i.e., specification) |
OntoDM | 'data mining workflow' |
Exposé | None |
MEX | N/A |
See remark above about OpenML and flows
OntoDM
In OntoDM, we represent data mining workflows under three aspects: data mining scenario (as a specification of the workflow), data mining workflow (as an implementation), and data mining workflow execution (as a process). In OntoDM-core, a data mining scenario is an extension of the OBI class protocol. It includes as parts other information entities such as: title of the scenario, scenario description, author of the scenario, and document. From the protocol class it also inherits as parts an objective specification and an action specification. A data mining workflow is a concretization of a data mining scenario, and extends the plan entity (defined by OBI). Finally, a data mining workflow is realized (executed) through a data mining workflow execution process.
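A sketch of this scenario / workflow / execution triad in Turtle, again with placeholder ex: IRIs rather than OntoDM's actual identifiers:

```turtle
@prefix ex: <http://example.org/ontodm#> .

# Scenario (specification), workflow (implementation), execution (process),
# all with placeholder ex: IRIs.
ex:crossValidationScenario a ex:DataMiningScenario ;
    ex:hasPart ex:scenarioTitle , ex:scenarioDescription ,
               ex:objectiveSpecification , ex:actionSpecification .
ex:crossValidationWorkflow a ex:DataMiningWorkflow ;
    ex:isConcretizationOf ex:crossValidationScenario .
ex:workflowExecution_42 a ex:DataMiningWorkflowExecution ;
    ex:realizes ex:crossValidationWorkflow .
```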