PipeOp Specifications - mlr-org/mlr3pipelines GitHub Wiki

General rules:

Inherit from PipeOp for general pipeops, from PipeOpTaskPreproc for preprocessing pipeops that have one task input and one task output, and from PipeOpTaskPreprocSimple for the subset of these that perform exactly the same operation during training and prediction.
Override the train_internal() and predict_internal() functions when inheriting from PipeOp. Override train_task()/train_dt() and predict_task()/predict_dt(), and possibly select_cols() (for the ..._dt() variants), when inheriting from PipeOpTaskPreproc. Override get_state()/get_state_dt() and transform()/transform_dt(), and possibly select_cols() (for the ..._dt() variants), when inheriting from PipeOpTaskPreprocSimple.
Set the $input and $output train and predict columns to the acceptable types for these operations. Do not check input values for types that are already specified in the $input and $output tables. Ok:
```
train_internal = function(inputs) {
  if (inputs[[1]]$nrow < 1) stop("Input too small")  # value check, not a type check
  # ... training logic ...
}
```
Bad (because the input type "Task" is already checked by the train() function):
```
train_internal = function(inputs) {
  assert_task(inputs[[1]])  # redundant: the type is already declared in $input
  # ... training logic ...
}
```
Inputs to train_internal() / predict_internal() are always passed by reference, so any R6 object must be cloned before it is modified. This is not the case for train_task(), train_dt(), ... in PipeOpTaskPreproc[Simple]: PipeOpTaskPreproc[Simple] takes care of cloning, so Tasks/data.tables can be modified in-place.
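The by-reference behaviour can be illustrated with a minimal sketch. It assumes only the R6 package; `ToyTask`, `train_internal_bad` and `train_internal_ok` are hypothetical names made up for this illustration and are not part of mlr3 or mlr3pipelines.

```r
library(R6)

# Hypothetical stand-in for an R6 input such as a Task; not an mlr3 class.
ToyTask = R6Class("ToyTask",
  public = list(
    data = NULL,
    initialize = function(data) {
      self$data = data
    }
  )
)

# Bad: modifies the caller's object, because R6 objects are passed by reference.
train_internal_bad = function(inputs) {
  task = inputs[[1]]
  task$data = task$data * 2
  list(task)
}

# Ok: clone first, then modify the copy.
train_internal_ok = function(inputs) {
  task = inputs[[1]]$clone(deep = TRUE)
  task$data = task$data * 2
  list(task)
}

original = ToyTask$new(c(1, 2, 3))

out = train_internal_ok(list(original))
original$data   # unchanged: 1 2 3
out[[1]]$data   # modified copy: 2 4 6

train_internal_bad(list(original))
original$data   # the caller's object was changed: 2 4 6
```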
PipeOpTaskPreproc[Simple] $state must always be a named list. The machinery in PipeOpTaskPreproc[Simple] adds a few slots: $affected_cols, $intasklayout, $outtasklayout, $dt_columns (only if train_task/predict_task/get_state/transform are not overridden). Therefore, these names are "reserved" and should not be set by the class inheriting from PipeOpTaskPreproc[Simple]. Even though a PipeOp $state can be anything, it is recommended to keep it a named list as well.
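One way to guard against accidentally using a reserved name is a small check in the inheriting class's tests. The sketch below uses base R only; `check_custom_state` is a hypothetical helper and not a function provided by mlr3pipelines.

```r
# Slot names listed above as added by PipeOpTaskPreproc[Simple].
reserved_slots = c("affected_cols", "intasklayout", "outtasklayout", "dt_columns")

# Hypothetical helper: stops if `state` is not a named list
# or if it uses one of the reserved names.
check_custom_state = function(state) {
  stopifnot(is.list(state))
  nms = names(state)
  if (length(state) > 0 && (is.null(nms) || any(is.na(nms) | nms == ""))) {
    stop("state must be a named list")
  }
  clash = intersect(nms, reserved_slots)
  if (length(clash) > 0) {
    stop("state uses reserved names: ", paste(clash, collapse = ", "))
  }
  invisible(state)
}

# Passes silently.
check_custom_state(list(center = 0, scale = 1))

# Fails; the error message names the clashing slot.
msg = tryCatch(
  check_custom_state(list(dt_columns = "x")),
  error = function(e) conditionMessage(e)
)
```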
Every change made by the $train() method must be reflected in the $state variable, i.e.
```
po2 = po1$clone(deep = TRUE)
po1$train(input)
po2$state = po1$state
po1 = po1$clone(deep = TRUE)
```
must leave po1 and po2 identical. (The last clone call is needed to mirror any effects of the po2 = po1$clone() call.)
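The check above can be turned into a small test, sketched here under the assumption that comparing all non-function public fields is a good enough notion of "identical" for the class at hand. `ToyCenter`, `ToyLeaky`, `same_fields` and `state_captures_training` are hypothetical names built on plain R6; they are not mlr3pipelines classes or functions.

```r
library(R6)

# Hypothetical toy class: everything learned during training goes into $state.
ToyCenter = R6Class("ToyCenter",
  public = list(
    state = NULL,
    train = function(x) {
      self$state = list(center = mean(x))
      x - self$state$center
    }
  )
)

# Violates the rule: training also changes a field outside $state.
ToyLeaky = R6Class("ToyLeaky",
  inherit = ToyCenter,
  public = list(
    n_trained = 0,
    train = function(x) {
      self$n_trained = self$n_trained + 1
      super$train(x)
    }
  )
)

# Helper: compare all non-function public fields of two R6 objects.
same_fields = function(a, b) {
  fields = function(obj) Filter(Negate(is.function), mget(ls(obj), envir = obj))
  identical(fields(a), fields(b))
}

# Runs the sequence from the text for a given class generator.
state_captures_training = function(generator, input) {
  po1 = generator$new()
  po2 = po1$clone(deep = TRUE)
  po1$train(input)
  po2$state = po1$state
  po1 = po1$clone(deep = TRUE)
  same_fields(po1, po2)
}

state_captures_training(ToyCenter, c(1, 2, 3))  # TRUE
state_captures_training(ToyLeaky, c(1, 2, 3))   # FALSE
```

In the second case the comparison fails because `n_trained` changed during training but is not part of `$state`.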

$predict() must be idempotent, i.e.
```
po2 = po1$clone(deep = TRUE)
po1$predict(input1)
po1$predict(input2)
po2$predict(input3)
po1 = po1$clone(deep = TRUE)
```
must leave po1 and po2 identical. (The last clone call is there for the same reason as above.)
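A minimal sketch of this check follows, again with a hypothetical toy class on plain R6 rather than a real mlr3pipelines PipeOp. It assumes the object is trained before the sequence starts, and it compares `$state` and a fresh prediction as a stand-in for full object identity.

```r
library(R6)

# Hypothetical toy class: $predict() only reads $state and never writes to self.
ToyCenter = R6Class("ToyCenter",
  public = list(
    state = NULL,
    train = function(x) {
      self$state = list(center = mean(x))
      x - self$state$center
    },
    predict = function(x) {
      x - self$state$center
    }
  )
)

po1 = ToyCenter$new()
po1$train(c(1, 2, 3))  # assumption: train once before running the check

po2 = po1$clone(deep = TRUE)
po1$predict(c(4, 5))
po1$predict(c(6, 7))
po2$predict(c(8, 9))
po1 = po1$clone(deep = TRUE)

# Neither the number of predict calls nor their inputs left a trace.
state_unchanged = identical(po1$state, po2$state)
same_prediction = identical(po1$predict(c(10, 11)), po2$predict(c(10, 11)))
```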