eda - not-so-fat/conjurer GitHub Wiki

Motivation of eda module

Load data

`eda.read_csv(buffer_or_filepath, **kwargs)`

To use the conjurer module, you need to load data as a pandas.DataFrame. However, loading a CSV file with pandas.read_csv has several issues:

An integer column that contains nulls is read as float64 (it could be read as Int64 instead, which keeps the information that the column is integer).
Data types of timestamp columns are not inferred.
A single column can end up with multiple dtypes (e.g. the example in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.errors.DtypeWarning.html).
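The first two issues can be reproduced with plain pandas. The snippet below is a minimal sketch that uses only pandas (not conjurer); the CSV content and column names are made up for illustration, and the manual workaround shown is not necessarily what `eda.read_csv` does internally.

```python
import io

import pandas as pd

# Made-up CSV: "count" has a missing value, "created_at" holds timestamps.
csv_text = "id,count,created_at\n1,10,2020-01-01 00:00:00\n2,,2020-01-02 12:30:00\n"

# Default behavior: the missing value turns "count" into float64,
# and "created_at" is not parsed as a datetime.
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.dtypes)

# Manual workaround with plain pandas: nullable Int64 and explicit date parsing.
fixed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"count": "Int64"},
    parse_dates=["created_at"],
)
print(fixed.dtypes)
```

The workaround requires knowing the column names in advance, which is the manual step a helper like `eda.read_csv` is meant to spare you.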

`eda.DfDictLoader(df_list)`

To run feature generation on a test data set, each input data frame must have the same dtypes as the corresponding training data frame. This class helps you load all the files with the same dtypes.

df_dict_training = {
    "target": eda.read_csv("training.csv"),
    "a1": eda.read_csv("a1_training.csv"),
    "a2": eda.read_csv("a2_training.csv")
}
...
# Instantiate DfDictLoader with df_dict used in training
loader = eda.DfDictLoader(df_dict_training)
# Prepare dict of filepath for test data set (keys should be the same)
file_path_dict = {
    "target": "test.csv",
    "a1": "a1_test.csv",
    "a2": "a2_test.csv"
}
# Load dict of pandas.DataFrame with the same dtypes as training
df_dict_validation = loader.load(file_path_dict)
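To illustrate why reusing the training dtypes matters, here is a minimal sketch in plain pandas. The helper `load_with_training_dtypes` is hypothetical and is not part of conjurer; it only shows one possible way to force a new file into the dtypes of an existing data frame, and the CSV contents are invented.

```python
import io

import pandas as pd


def load_with_training_dtypes(training_df, buffer_or_filepath):
    """Hypothetical helper: read a CSV using the dtypes of training_df."""
    # Datetime columns must go through parse_dates, not the dtype argument.
    datetime_cols = [
        col for col in training_df.columns
        if pd.api.types.is_datetime64_any_dtype(training_df[col])
    ]
    other_dtypes = {
        col: dtype for col, dtype in training_df.dtypes.items()
        if col not in datetime_cols
    }
    return pd.read_csv(
        buffer_or_filepath,
        dtype=other_dtypes,
        parse_dates=datetime_cols or False,
    )


# Training data has a null in "count", so it was loaded as nullable Int64.
training = pd.read_csv(io.StringIO("id,count\n1,10\n2,\n"), dtype={"count": "Int64"})

# Test data happens to have no nulls, so a naive load infers int64 instead.
test_csv = "id,count\n3,7\n4,8\n"
naive = pd.read_csv(io.StringIO(test_csv))
aligned = load_with_training_dtypes(training, io.StringIO(test_csv))

print(naive.dtypes)
print(aligned.dtypes)
```

Without the alignment step, the same column would be int64 in one data set and Int64 in the other, which is the kind of mismatch the loader is meant to prevent.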

Check data

`eda.get_columns_in_dfs(df_list, name_list)`

To use the conjurer module, you need to confirm which keys can be joined. Such keys often share the same name. This function returns a pandas.DataFrame that summarizes table names and column names.

columns_df = eda.get_columns_in_dfs(df_dict.values(), df_dict.keys())
# check duplicated column names
columns_df["column_name"].value_counts()
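The idea can be sketched with plain pandas as follows. `list_columns` is a hypothetical stand-in, not the conjurer implementation, and the `table_name` column label and the sample data frames are assumptions made for illustration (only `column_name` appears in the example above).

```python
import pandas as pd


def list_columns(df_list, name_list):
    """Hypothetical stand-in: one row per (table, column) pair."""
    rows = [
        {"table_name": name, "column_name": col}
        for df, name in zip(df_list, name_list)
        for col in df.columns
    ]
    return pd.DataFrame(rows)


# Invented sample tables that share a "user_id" column.
df_dict = {
    "target": pd.DataFrame({"user_id": [1], "label": [0]}),
    "a1": pd.DataFrame({"user_id": [1], "amount": [9.5]}),
}

columns_df = list_columns(df_dict.values(), df_dict.keys())

# Column names that appear in more than one table are join-key candidates.
counts = columns_df["column_name"].value_counts()
shared = counts[counts > 1].index.tolist()
print(shared)
```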

`eda.check_stats(df, skip_histogram=True)`

This method summarizes each column's basic statistics and histogram.
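As a rough idea of what a per-column summary can look like, here is a minimal pandas-only sketch. `summarize_columns` is a hypothetical helper; the statistics `eda.check_stats` actually reports may differ, and no histogram is produced here.

```python
import pandas as pd


def summarize_columns(df):
    """Hypothetical helper: dtype, null count and distinct count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isna().sum(),
        "unique_count": df.nunique(),  # nulls are not counted as a value
    })


# Invented sample data.
df = pd.DataFrame({"age": [20, 35, None], "city": ["Tokyo", "Osaka", "Tokyo"]})
summary = summarize_columns(df)
print(summary)
```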

`eda.get_fk_coverage(fk_df, k_df, fk_columns, k_columns, do_print=True)`

This method checks how many values of the key columns match between the two data frames.
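One way to compute such a coverage figure with plain pandas is a left merge with an indicator column. The function `fk_coverage` below is a hypothetical sketch, not the conjurer implementation; what `eda.get_fk_coverage` returns or prints is not specified here, and the sample data is invented.

```python
import pandas as pd


def fk_coverage(fk_df, k_df, fk_columns, k_columns):
    """Hypothetical sketch: fraction of fk_df rows whose key exists in k_df."""
    # Deduplicate the keys so the left merge keeps one row per fk_df row.
    keys = k_df[k_columns].drop_duplicates()
    merged = fk_df[fk_columns].merge(
        keys,
        how="left",
        left_on=fk_columns,
        right_on=k_columns,
        indicator=True,
    )
    return float((merged["_merge"] == "both").mean())


# Invented sample data: user 5 has no matching row in the key table.
fk_df = pd.DataFrame({"user_id": [1, 2, 2, 5]})
k_df = pd.DataFrame({"id": [1, 2, 3]})

coverage = fk_coverage(fk_df, k_df, ["user_id"], ["id"])
print(coverage)
```

A coverage well below 1.0 suggests that the two columns are not a clean key relationship, or that one of the tables is missing rows.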

Visualize data

There are several eda.plot_XXX methods that visualize data with plot.ly (they exist just to reduce the number of lines you need to write).