PyTrousse functional requirements - HK3-Lab-Team/pytrousse GitHub Wiki
PyTrousse functional requirements
1. Data loading/saving
The system must be able to load raw data from a CSV file or a pandas DataFrame.
The system must be able to save to disk the data with the associated preprocessing operations that have been performed on that data.
The system must be able to read the preprocessed data from disk along with the previously performed preprocessing operations. (NOTE: Check how sklearn handles this part. If we save the information as JSON, we would also have to write code to validate the schema, which we should try to avoid.)
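As a rough illustration of the three requirements above, the sketch below loads a CSV, then saves and restores the data together with its operation history in a single pickle file, which sidesteps the JSON schema concern raised in the note. The bundle layout (a dict with `data` and `operations` keys) and the helper names `save_bundle` / `load_bundle` are assumptions made for this example, not PyTrousse's actual API.

```python
import io
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Load raw data from CSV (an in-memory buffer stands in for a real file).
raw_csv = io.StringIO("age,sex\n30,F\n41,M\n")
data = pd.read_csv(raw_csv)

# Hypothetical bundle layout: the data plus the operations applied to it.
bundle = {
    "data": data,
    "operations": [
        {"type": "encoding", "source": ["sex"], "derived": ["sex_enc"]},
    ],
}


def save_bundle(bundle: dict, path: Path) -> None:
    # One pickle file keeps data and history together, so there is no
    # separate schema to validate when reading it back.
    with open(path, "wb") as f:
        pickle.dump(bundle, f)


def load_bundle(path: Path) -> dict:
    # Only unpickle files that come from a trusted source.
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "bundle.pkl"
    save_bundle(bundle, target)
    restored = load_bundle(target)
```

Pickle is convenient here but ties the file to compatible pandas versions, so it is a trade-off rather than a recommendation.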
2. Columns handling
The system must be able to distinguish between metadata columns and feature columns.
The system must be able to categorize columns depending on their type, i.e. numerical columns, categorical columns, boolean columns, string columns, mixed columns.
The system must be able to identify columns in which the share of missing values exceeds a threshold (by default, more than 99.9% of the total number of samples).
The system must be able to identify constant columns, i.e. columns that contain the same value in every row.
The system must be able to identify columns with the same name.
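The column checks listed above could be sketched with plain pandas as follows. The function names, the `metadata_columns` set, and the rule used to tell "string" from "mixed" columns (looking at the Python types of the non-null values) are assumptions for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": ["p1", "p2", "p3"],
    "age": [34, 51, 47],
    "smoker": [True, False, True],
    "note": ["ok", "ok", "ok"],
    "lab": ["12", 3.5, "NV"],
    "empty": [None, None, None],
})

# Metadata columns are declared by the user; the rest are features.
metadata_columns = {"patient_id"}
feature_columns = [c for c in df.columns if c not in metadata_columns]


def column_kind(series: pd.Series) -> str:
    """Assumed categorization rule, not the library's own."""
    if isinstance(series.dtype, pd.CategoricalDtype):
        return "categorical"
    if pd.api.types.is_bool_dtype(series):  # check before numeric: bool counts as numeric
        return "boolean"
    if pd.api.types.is_numeric_dtype(series):
        return "numerical"
    value_types = {type(v) for v in series.dropna()}
    return "string" if value_types <= {str} else "mixed"


def mostly_missing(df: pd.DataFrame, threshold: float = 0.999) -> list:
    nan_ratio = df.isna().mean()
    return nan_ratio[nan_ratio > threshold].index.tolist()


def constant_columns(df: pd.DataFrame) -> list:
    n_unique = df.nunique(dropna=False)  # NaN counts as a value here
    return n_unique[n_unique == 1].index.tolist()


def duplicated_names(df: pd.DataFrame) -> list:
    return df.columns[df.columns.duplicated()].unique().tolist()


many_nan = mostly_missing(df)
constant = constant_columns(df)
kinds = {c: column_kind(df[c]) for c in df.columns if c not in many_nan}

# Duplicate names are shown on a separate frame to keep the example simple.
dup = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])
same_name = duplicated_names(dup)
```

Note that a fully empty column also counts as constant under this rule, so a real implementation would have to decide which check takes precedence.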
3. Preprocessing
The system must be able to store the preprocessing operations performed on the data, with a reference to the name of the column(s) the operation has been performed on, the name of the resulting column(s), and the type of operation performed. If encoding is performed, the system must also store the type of encoding used and a map from the encoded values to the original values.
The system must be able to retrieve preprocessing operations that match user-defined requirements, i.e. performed on a specific column, which type of operation has been performed, which column has been derived from it, which type of encoding has been used (if applicable).
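One possible shape for the stored operations and their retrieval is sketched below. `Operation`, `OperationHistory`, and all field names are hypothetical and only mirror the attributes named in the two requirements above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Operation:
    kind: str                                # e.g. "binning", "encoding"
    source_columns: Tuple[str, ...]          # columns the operation read
    derived_columns: Tuple[str, ...]         # columns the operation created
    encoder: Optional[str] = None            # only set for encodings
    encoded_to_original: Dict[int, str] = field(default_factory=dict)


class OperationHistory:
    def __init__(self) -> None:
        self._operations: List[Operation] = []

    def add(self, operation: Operation) -> None:
        self._operations.append(operation)

    def find(self, kind=None, source=None, derived=None, encoder=None):
        """Return the operations matching every criterion that is not None."""
        matches = []
        for op in self._operations:
            if kind is not None and op.kind != kind:
                continue
            if source is not None and source not in op.source_columns:
                continue
            if derived is not None and derived not in op.derived_columns:
                continue
            if encoder is not None and op.encoder != encoder:
                continue
            matches.append(op)
        return matches


history = OperationHistory()
history.add(Operation("encoding", ("sex",), ("sex_enc",), "ordinal", {0: "F", 1: "M"}))
history.add(Operation("binning", ("age",), ("age_bin",)))

on_sex = history.find(source="sex")
binnings = history.find(kind="binning")
```

Keeping each criterion optional lets a single `find` call cover all the query types listed in the requirement.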
Encoding
The system must be able to perform data binning on a numerical column, based on the binning threshold values provided by the user, and to create a new column with the corresponding bin indexes. The threshold values can be defined for sample sub-groups, where the grouping is performed based on the value of another column.
The system must be able to perform the One Hot encoding of a categorical column and to create a new column with the encoded values.
The system must be able to perform the Ordinal encoding of a categorical column and to create a new column with the encoded values.
The system must be able to perform the Ordinal encoding of multiple categorical columns by indexing the combinations of their unique values and to create a new column with the corresponding indexes.
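The four encoding requirements above map fairly directly onto existing NumPy and pandas calls. The sketch below uses `np.digitize`, `pd.get_dummies`, `pd.factorize`, and `GroupBy.ngroup`; the example data, the `thresholds` layout (one list of bin edges per group), and the helper name `bin_by_group` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "hb": [11.0, 12.5, 13.0, 15.0],
    "city": ["Rome", "Milan", "Rome", "Rome"],
})

# Hypothetical user input: bin edges for "hb", defined per value of "sex".
thresholds = {"F": [12.0], "M": [13.5]}


def bin_by_group(df, column, by, thresholds):
    # Rows whose group has no thresholds keep the placeholder -1.
    bins = pd.Series(-1, index=df.index, dtype="int64")
    for group, edges in thresholds.items():
        mask = df[by] == group
        bins[mask] = np.digitize(df.loc[mask, column].to_numpy(), edges)
    return bins


df["hb_bin"] = bin_by_group(df, "hb", "sex", thresholds)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Ordinal encoding, keeping the map from codes back to original values.
codes, categories = pd.factorize(df["city"], sort=True)
df["city_enc"] = codes
code_to_city = dict(enumerate(categories))

# Ordinal encoding of several columns: index each combination of values.
df["sex_city_enc"] = df.groupby(["sex", "city"], sort=True).ngroup()
```

Strictly speaking, one-hot encoding yields one new column per category rather than a single column, which is what `pd.get_dummies` returns here.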
Correction
The system must be able to replace strings (e.g. "---", "ASSENTI", "NV") or substrings (e.g. "°", ",") in mixed type columns with strings provided by the user.
The system must be able to replace string values in mixed type columns that represent out-of-scale numerical values (e.g. ">800") with an appropriate numerical value (e.g. 880).
After correction, the system must be able to recompute the type of the columns, as defined in requirement 2b (categorization of columns by type).
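A minimal sketch of the correction step, assuming the user supplies three mappings: whole strings to replace, substrings to replace, and out-of-scale strings with their numerical substitutes. The function name `correct` and the choice to turn "---" and "NV" into empty strings are illustrative assumptions.

```python
import pandas as pd

raw = pd.Series(["12,5", "---", ">800", "37°", "NV", 4])

# Hypothetical user-provided mappings.
string_replacements = {"---": "", "NV": ""}
substring_replacements = {",": ".", "°": ""}
out_of_scale = {">800": 880}


def correct(series, strings, substrings, out_of_scale):
    def fix(value):
        if not isinstance(value, str):
            return value  # numbers pass through unchanged
        if value in out_of_scale:
            return out_of_scale[value]
        if value in strings:
            return strings[value]
        for old, new in substrings.items():
            value = value.replace(old, new)
        return value

    return series.map(fix)


corrected = correct(raw, string_replacements, substring_replacements, out_of_scale)

# Recompute the column type after correction: anything that still does
# not parse as a number becomes NaN.
numeric = pd.to_numeric(corrected, errors="coerce")
is_numerical_now = pd.api.types.is_numeric_dtype(numeric)
```

After the conversion the column is no longer mixed, so the type categorization can be run again on it.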
Anonymize
The user must define the columns which contain private information.
The system must be able to associate a unique ID to the different combinations of information contained in the private columns.
The system must be able to replace the private columns with a column containing the corresponding unique ID.
The system must store the association between private information and the corresponding unique ID in a separate file.
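The anonymization steps above can be sketched as follows. The function name `anonymize`, the `anon_id` column name, the use of sequential integers as IDs, and the CSV file name are all assumptions; the sketch also assumes the private columns contain no missing values, since `groupby` drops NaN keys by default.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "birth": ["1990-01-01", "1985-05-05", "1990-01-01"],
    "glucose": [90, 110, 95],
})
private_columns = ["name", "birth"]  # declared by the user


def anonymize(df, private_columns, id_column="anon_id"):
    # One ID per distinct combination of private values, in order of appearance.
    ids = df.groupby(private_columns, sort=False).ngroup()
    mapping = (
        df[private_columns]
        .assign(**{id_column: ids})
        .drop_duplicates()
        .reset_index(drop=True)
    )
    public = df.drop(columns=private_columns)
    public.insert(0, id_column, ids)
    return public, mapping


public, mapping = anonymize(df, private_columns)

# The private-to-ID association goes to a separate file.
with tempfile.TemporaryDirectory() as tmp:
    mapping_path = Path(tmp) / "private_mapping.csv"
    mapping.to_csv(mapping_path, index=False)
    mapping_written = mapping_path.exists()
```

Sequential integers keep the example deterministic; a real implementation might prefer random identifiers so that IDs reveal nothing about row order.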