API Design
Low-Level Operations
These operations are performed directly on the pandas DataFrame attribute by methods of the Dataset object.
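As a rough sketch of this split (the method name `drop_columns` is an assumption, not the actual pytrousse API), a low-level operation could look like this:

```python
import pandas as pd

class Dataset:
    def __init__(self, df: pd.DataFrame):
        self.df = df  # the wrapped pandas DataFrame attribute

    def drop_columns(self, columns):
        # Low-level operation: acts directly on the DataFrame attribute
        self.df = self.df.drop(columns=columns)
```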
High-Level Operations / Transformations
FeatureOperation is an abstract base class with attributes such as `original_column`, `derived_column`, .... Concrete Operation classes are all the operations that can be performed on a Dataset object. Their `__init__` takes the name of the column the operation is applied to and, optionally, the name of the column where the result is stored (otherwise the result is applied in place), plus any other arguments the operation needs. Instances of these Operations are callable objects: they take a Dataset object as input and return a Dataset object (see the FillNA example in the feature_operation module section below).
There can be an Apply class, which takes a callable object and the axis to apply it along (see the sketch after the feature_operation module list below).
Similarly to histolab’s filters, operations can be chained together by means of a special Compose object, which takes a list of all the operations to perform and then applies them one after another.
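A minimal sketch of what Compose could look like, assuming each operation is a callable that maps a Dataset to a Dataset (the names below stand in for the real classes):

```python
from typing import Callable, List

class Compose:
    """Chains operations: each one receives the Dataset returned by the previous."""

    def __init__(self, operations: List[Callable]):
        self.operations = operations

    def __call__(self, dataset):
        for operation in self.operations:
            dataset = operation(dataset)
        return dataset
```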
Operations not yet implemented (they require direct access to the df attribute):
- pd.to_timedelta()
- pd.to_datetime() (see the sketch after this list)
- pd.DataFrame().astype()
- pd.merge()
- pd.DataFrame().groupby
- pd.DataFrame().apply
- copy
- to_csv()
- to_file() [shelve]
- drop
- pd.to_numeric() ???
- pd.DataFrame().set_index() ???
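As an example of why these need the df attribute, wrapping pd.to_datetime as an operation might look like this (a hypothetical sketch; the class shape is an assumption):

```python
import pandas as pd

class ToDatetime:
    """Hypothetical operation wrapping pd.to_datetime."""

    def __init__(self, column, result_column=None):
        self.column = column
        self.result_column = result_column or column

    def __call__(self, dataset):
        # Reaches directly into the underlying DataFrame, which is why this
        # family of operations is not implemented as FeatureOperations yet.
        dataset.df[self.result_column] = pd.to_datetime(dataset.df[self.column])
        return dataset
```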
Dataset module
Module functions:
- read_csv(path, sep, metadata_cols, feature_cols) [1a]
  - The nan_percentage_threshold attribute used in many_nan_columns → becomes a parameter.
- read_dataset(path, metadata_cols, feature_cols) [1c]: reads path and reconstructs the Dataset together with its Operations history. Pay attention to metadata_cols. (See the usage sketch after this list.)
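A hypothetical usage sketch of these functions together with save_dataset (described below); paths and column names are made up:

```python
dataset = read_csv(
    "data/records.csv",
    sep=",",
    metadata_cols=["patient_id"],
    feature_cols=["age", "weight"],
)
dataset.save_dataset("data/preprocessed_records")  # CSV + operations history

# Later: rebuild the Dataset together with its Operations history
restored = read_dataset(
    "data/preprocessed_records",
    metadata_cols=["patient_id"],
    feature_cols=["age", "weight"],
)
```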
Dataset methods:
- save_dataset(path) [1b]: saves the preprocessed CSV together with its operations (and the parameters used) in a human-readable format.
- nan_columns(tolerance) [2c]: tolerance is an optional float (0 to 1, default=1) representing the ratio “nan samples”/“total samples” required for the column to be considered a “nan column”.
- add_operation(feat_op: FeatureOperation) [3a]: acts on the private _operations_history to add feat_op to it.
- find_operations(feat_op: FeatureOperation) [3b]: can return zero, one or more operations in a list/OperationsList. It requires a method (like `is_similar`, not `__eq__`) to verify whether two FeatureOperation instances are similar: each subclass of FeatureOperation should implement this method on its own. (A sketch follows the Example below.)
Dataset Properties:
- metadata_cols [2a]
- feature_cols [2a]
- numerical_cols [2b]
- categorical_cols [2b]
- boolean_cols [2b]
- string_cols [2b]
- mixed_cols [2b]
- constant_columns [2d] (see the sketch after this list)
- operations_history [3a]
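As an illustration, the constant_columns property might be implemented roughly like this (the uniqueness criterion is an assumption, not the actual API):

```python
@property
def constant_columns(self):
    # Columns holding a single repeated value across all samples
    return {col for col in self.df.columns
            if self.df[col].nunique(dropna=False) <= 1}
```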
Example

    def add_operation(self, feat_op: FeatureOperation):
        self._operations_history += feat_op
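A companion sketch for find_operations, assuming the history is iterable and similarity is delegated to each operation's is_similar method:

```python
def find_operations(self, feat_op: FeatureOperation) -> list:
    # Zero, one or more recorded operations may match
    return [op for op in self._operations_history if op.is_similar(feat_op)]
```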
feature_operation module
- OrdinalEncoder [3c iii]
- OneHotEncoder [3c ii]
- BinSplitting [3c i]
- FillNA
- ReplaceSubStrings (single chars or substrings) [3d i]
- ReplaceStrings (whole values) [3d i]
- ReplaceOutOfScale (cases like “>80”) [3d ii]
- Apply (to apply a function, as in pandas; see the sketch after this list)
- AnonymizeDataset [3e]: takes the list of private columns (which will be removed from the Dataset and used to compute the unique ID), the path where the private info file is stored, and private_cols_to_keep (columns to keep in both the private and the public df). Returns the anonymized Dataset and saves the private info dataset.
- ...
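A minimal sketch of the Apply operation mentioned above, in the spirit of pandas.DataFrame.apply (the access to the underlying df attribute is an assumption):

```python
class Apply:
    """Hypothetical sketch: applies a callable along a DataFrame axis."""

    def __init__(self, func, axis=0):
        self.func = func
        self.axis = axis

    def __call__(self, dataset):
        # Delegates to pandas.DataFrame.apply on the underlying df attribute
        dataset.df = dataset.df.apply(self.func, axis=self.axis)
        dataset.add_operation(self)
        return dataset
```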
Example
    from abc import abstractmethod
    from typing import Protocol

    class FeatureOperation(Protocol):
        column: str
        result_column: str

        @abstractmethod
        def is_similar(self, other):
            raise NotImplementedError

    class FillNA(FeatureOperation):
        def __init__(self, column, result_column=None, fill_value="--"):
            self.column = column
            self.result_column = result_column
            self.fill_value = fill_value

        def __call__(self, dataset):
            # fillna here stands for the Dataset helper that fills NaNs in a column
            filled_dataset = dataset.fillna(self.column, self.fill_value)
            filled_dataset.add_operation(self)
            return filled_dataset

        def is_similar(self, other):
            # Same operation type acting on the same column counts as similar
            return isinstance(other, FillNA) and other.column == self.column

    class OperationsList:
        def __init__(self):
            self.operations = []

        def __contains__(self, feat_op: FeatureOperation):
            # Membership via similarity rather than strict equality
            return any(op.is_similar(feat_op) for op in self.operations)

        def __add__(self, feat_op: FeatureOperation):
            new_list = OperationsList()
            new_list.operations = self.operations + [feat_op]
            return new_list

        def __iadd__(self, feat_op: FeatureOperation):
            self.operations.append(feat_op)
            return self
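With the stubs above filled in, the operations history could be used like this (a hypothetical snippet):

```python
history = OperationsList()
history += FillNA(column="age", fill_value=0)

# Membership is checked through is_similar, not strict equality
print(FillNA(column="age") in history)  # True
```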
Scripts for use cases

    transformations = Compose([FillNA(.....), Operation(...)])
    dataset = read_csv(.....)
    encoded_dataset = OneHotEncoder(...)(dataset)
    preprocessed_dataset = transformations(encoded_dataset)