API Design
Low-Level Operations
These operations are performed directly on the pandas DataFrame attribute by methods of the Dataset object.
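As a rough sketch of this split (the method name `drop_columns` is an assumption, not the actual pytrousse API), a low-level operation could look like this:

```python
import pandas as pd

class Dataset:
    def __init__(self, df: pd.DataFrame):
        self.df = df  # the wrapped pandas DataFrame attribute

    def drop_columns(self, columns):
        # Low-level operation: acts directly on the DataFrame attribute
        self.df = self.df.drop(columns=columns)
```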
High-Level Operations / Transformations
FeatureOperation is an abstract base class with attributes such as `original_column`, `derived_column`, .... Concrete Operation classes are all the operations that can be performed on a Dataset object. Their `__init__` takes the name of the column the operation is applied to and, optionally, the name of the column where the result is stored (otherwise the result is applied in place), plus any other arguments the operation needs. Instances of these Operations are callable objects: they take a Dataset object as input and return a Dataset object (see the FillNA example in the feature_operation module section below).
There can be an Apply class, which takes a callable object and the axis to apply it along (see the sketch after the feature_operation module list below).
Similarly to histolab’s filters, operations can be chained together by means of a special Compose object, which takes a list of all the operations to perform and then applies them one after another.
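A minimal sketch of what Compose could look like, assuming each operation is a callable that maps a Dataset to a Dataset (the names below stand in for the real classes):

```python
from typing import Callable, List

class Compose:
    """Chains operations: each one receives the Dataset returned by the previous."""

    def __init__(self, operations: List[Callable]):
        self.operations = operations

    def __call__(self, dataset):
        for operation in self.operations:
            dataset = operation(dataset)
        return dataset
```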
Operations not yet implemented (they require direct access to the df attribute):
- pd.to_timedelta()
- pd.to_datetime() (see the sketch after this list)
- pd.DataFrame().astype()
- pd.merge()
- pd.DataFrame().groupby
- pd.DataFrame().apply
- copy
- to_csv()
- to_file() [shelve]
- drop
- pd.to_numeric() ???
- pd.DataFrame().set_index() ???
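As an example of why these need the df attribute, wrapping pd.to_datetime as an operation might look like this (a hypothetical sketch; the class shape is an assumption):

```python
import pandas as pd

class ToDatetime:
    """Hypothetical operation wrapping pd.to_datetime."""

    def __init__(self, column, result_column=None):
        self.column = column
        self.result_column = result_column or column

    def __call__(self, dataset):
        # Reaches directly into the underlying DataFrame, which is why this
        # family of operations is not implemented as FeatureOperations yet.
        dataset.df[self.result_column] = pd.to_datetime(dataset.df[self.column])
        return dataset
```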
Dataset module
Module functions:
- read_csv(path, sep, metadata_cols, feature_cols) [1a]
  - The nan_percentage_threshold attribute used in many_nan_columns → becomes a parameter.
- read_dataset(path, metadata_cols, feature_cols) [1c]: reads path and reconstructs the Dataset together with its Operations history. Pay attention to metadata_cols. (See the usage sketch after this list.)
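A hypothetical usage sketch of these functions together with save_dataset (described below); paths and column names are made up:

```python
dataset = read_csv(
    "data/records.csv",
    sep=",",
    metadata_cols=["patient_id"],
    feature_cols=["age", "weight"],
)
dataset.save_dataset("data/preprocessed_records")  # CSV + operations history

# Later: rebuild the Dataset together with its Operations history
restored = read_dataset(
    "data/preprocessed_records",
    metadata_cols=["patient_id"],
    feature_cols=["age", "weight"],
)
```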
Dataset methods:
- save_dataset(path) [1b]: saves the preprocessed CSV together with its operations (and the parameters used) in a human-readable format.
- nan_columns(tolerance) [2c]: tolerance is an optional float (0 to 1, default=1) representing the ratio “nan samples”/“total samples” required for the column to be considered a “nan column”.
- add_operation(feat_op: FeatureOperation) [3a]: acts on the private _operations_history to add feat_op to it.
- find_operations(feat_op: FeatureOperation) [3b]: can return zero, one or more operations in a list/OperationsList. It requires a method (like `is_similar`, not `__eq__`) to verify whether two FeatureOperation instances are similar: each subclass of FeatureOperation should implement this method on its own. (A sketch follows the Example below.)
Dataset Properties:
- metadata_cols [2a]
- feature_cols [2a]
- numerical_cols [2b]
- categorical_cols [2b]
- boolean_cols [2b]
- string_cols [2b]
- mixed_cols [2b]
- constant_columns [2d] (see the sketch after this list)
- operations_history [3a]
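As an illustration, the constant_columns property might be implemented roughly like this (the uniqueness criterion is an assumption, not the actual API):

```python
@property
def constant_columns(self):
    # Columns holding a single repeated value across all samples
    return {col for col in self.df.columns
            if self.df[col].nunique(dropna=False) <= 1}
```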
Example

    def add_operation(self, feat_op: FeatureOperation):
        self._operations_history += feat_op
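A companion sketch for find_operations, assuming the history is iterable and similarity is delegated to each operation's is_similar method:

```python
def find_operations(self, feat_op: FeatureOperation) -> list:
    # Zero, one or more recorded operations may match
    return [op for op in self._operations_history if op.is_similar(feat_op)]
```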
feature_operation module
- OrdinalEncoder [3c iii]
- OneHotEncoder [3c ii]
- BinSplitting [3c i]
- FillNA
- ReplaceSubStrings (single chars or substrings) [3d i]
- ReplaceStrings (whole values) [3d i]
- ReplaceOutOfScale (cases like “>80”) [3d ii]
- Apply (to apply a function, as in pandas; see the sketch after this list)
- AnonymizeDataset [3e]: takes the list of private columns (which will be removed from the Dataset and used to compute the unique ID), the path where the private info file is stored, and private_cols_to_keep (columns to keep in both the private and the public df). Returns the anonymized Dataset and saves the private info dataset.
- ...
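A minimal sketch of the Apply operation mentioned above, in the spirit of pandas.DataFrame.apply (the access to the underlying df attribute is an assumption):

```python
class Apply:
    """Hypothetical sketch: applies a callable along a DataFrame axis."""

    def __init__(self, func, axis=0):
        self.func = func
        self.axis = axis

    def __call__(self, dataset):
        # Delegates to pandas.DataFrame.apply on the underlying df attribute
        dataset.df = dataset.df.apply(self.func, axis=self.axis)
        dataset.add_operation(self)
        return dataset
```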
Example
    from abc import abstractmethod
    from typing import Protocol

    class FeatureOperation(Protocol):
        column: str
        result_column: str

        @abstractmethod
        def is_similar(self, other):
            raise NotImplementedError

    class FillNA(FeatureOperation):
        def __init__(self, column, result_column=None, fill_value="--"):
            self.column = column
            self.result_column = result_column
            self.fill_value = fill_value

        def __call__(self, dataset):
            # fillna here stands for the Dataset helper that fills NaNs in a column
            filled_dataset = dataset.fillna(self.column, self.fill_value)
            filled_dataset.add_operation(self)
            return filled_dataset

        def is_similar(self, other):
            # Same operation type acting on the same column counts as similar
            return isinstance(other, FillNA) and other.column == self.column

    class OperationsList:
        def __init__(self):
            self.operations = []

        def __contains__(self, feat_op: FeatureOperation):
            # Membership via similarity rather than strict equality
            return any(op.is_similar(feat_op) for op in self.operations)

        def __add__(self, feat_op: FeatureOperation):
            new_list = OperationsList()
            new_list.operations = self.operations + [feat_op]
            return new_list

        def __iadd__(self, feat_op: FeatureOperation):
            self.operations.append(feat_op)
            return self
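With the stubs above filled in, the operations history could be used like this (a hypothetical snippet):

```python
history = OperationsList()
history += FillNA(column="age", fill_value=0)

# Membership is checked through is_similar, not strict equality
print(FillNA(column="age") in history)  # True
```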
Scripts for use cases

    transformations = Compose([FillNA(.....), Operation(...)])
    dataset = read_csv(.....)
    encoded_dataset = OneHotEncoder(...)(dataset)
    preprocessed_dataset = transformations(encoded_dataset)