User Guide - not-so-fat/conjurer GitHub Wiki

Objective

We aim to help you perform data analytics smoothly on data prepared as pandas.DataFrame (https://pandas.pydata.org/). In this document, we explain the challenges of implementing data pipelines with existing libraries, and how we want to solve them.

Quick Start

Installation

Please use pip to install:

pip install conjurer

or you can use our notebook Docker image (which installs conjurer by default): https://github.com/not-so-fat/notebook_docker.

Usage

In the sample code below, we assume you have imported our library as follows:

from conjurer import (
    eda,
    ml
)

Load Data

To load a CSV file as a pandas.DataFrame, the most popular method is probably pandas.read_csv. However, the loaded data frame may have different dtypes than you expect, especially for integer / timestamp columns. eda.read_csv has the same interface as pandas.read_csv, but additionally tries to infer whether each column is an integer / timestamp.

df = eda.read_csv("sample.csv")
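To see the dtype pitfall this is meant to address, here is a minimal plain-pandas example (column names are made up for illustration): a numeric column containing a missing value is upcast to float, and a timestamp column stays a plain object column unless you ask for parsing.

```python
import io

import pandas as pd

# One missing user_id and unparsed timestamps: pandas.read_csv upcasts the
# integer column to float64 and leaves the timestamp column as object.
csv_text = (
    "user_id,purchased_at\n"
    "1,2021-01-01 10:00:00\n"
    ",2021-01-02 11:30:00\n"
    "3,2021-01-03 09:15:00\n"
)
df = pd.read_csv(io.StringIO(csv_text))

print(df["user_id"].dtype)       # float64, not an integer dtype
print(df["purchased_at"].dtype)  # object, not datetime64[ns]
```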

Get Basic Statistics

We often want to get basic statistics for a prepared data set first. You could use pandas.DataFrame.describe; eda.check_stats adds some statistics and plots histograms for each column.

stats_df = eda.check_stats(df)
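The exact columns of check_stats' output are not documented here, but the kind of summary it adds on top of describe() can be approximated with plain pandas, for example null counts, unique counts, and dtypes per column:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, None, 4.0],
    "y": ["a", "b", "b", "c"],
})

# describe() alone omits facts such as dtypes and null counts,
# so assemble a small per-column summary frame ourselves.
stats_df = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_null": df.isna().sum(),
    "n_unique": df.nunique(),
})
print(stats_df)
```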

Visualize Data

You can plot several types of charts generated by altair (https://altair-viz.github.io/). These APIs return an altair.Chart object, so you can modify the chart configuration by modifying that object.

  • Histogram eda.plot_histogram(series, num_bins=50)
  • Scatter plot / Heatmap eda.plot_scatter(df, column_x, column_y)
    • If the number of rows is larger than 5000 (altair's limit), a heatmap is plotted instead
  • Heatmap eda.plot_heatmap(df, column_x, column_y)
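The reason a heatmap can sidestep altair's 5000-row limit is that it aggregates rows into cells before plotting. A minimal sketch of that aggregation with plain pandas (bin counts are chosen arbitrarily here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "column_x": rng.normal(size=10_000),
    "column_y": rng.normal(size=10_000),
})

# Bin both axes, then count rows per cell; a heatmap renders one mark per
# cell instead of one mark per row, which stays well under altair's limit.
binned = df.assign(
    x_bin=pd.cut(df["column_x"], bins=20),
    y_bin=pd.cut(df["column_y"], bins=20),
)
counts = (
    binned.groupby(["x_bin", "y_bin"], observed=True)
    .size()
    .reset_index(name="count")
)
print(len(df), "rows reduced to", len(counts), "cells")
```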

Feature Engineering (Under Development)

T.B.A.

Machine Learning

Get a default RandomizedSearchCV / GridSearchCV to tune machine learning models

For lightgbm, xgboost, linear_model, and random_forest, we provide a CV object with a default hyperparameter search space.

cv = ml.get_default_cv("lightgbm", "cl")
model = cv.fit_cv_pandas(df, target_column="y", feature_columns=["x{}".format(i) for i in range(100)], n_fold=3)
model.predict(df)

Or tune machine learning models your own way by using pandas-interfaced RandomizedSearchCV / GridSearchCV

You can also use RandomizedSearchCV / GridSearchCV with a pandas interface; with these you can customize hyperparameter tuning.

from sklearn import linear_model

cv = ml.RandomizedSearchCV(linear_model.LogisticRegression(), ...)
model = cv.fit_cv_pandas(df, target_column="y", feature_columns=["x{}".format(i) for i in range(100)], n_fold=3)
model.predict(df)
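For comparison, here is what the equivalent flow looks like with plain scikit-learn (synthetic data, an assumed small search space for C): you work with numpy arrays and handle column selection yourself, which is what the pandas interface above lets you avoid.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Plain scikit-learn: arrays in, arrays out; no target_column/feature_columns.
cv = RandomizedSearchCV(
    linear_model.LogisticRegression(),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=4,
    cv=3,
    random_state=0,
)
cv.fit(X, y)
pred = cv.predict(X)
print(cv.best_params_, pred.shape)
```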

Analyze hyper-parameters tuning results

You can check how each hyperparameter affected performance on the validation data set during tuning.

analyzer = ml.CVAnalyzer(cv)  # ml.CVAnalyzer(model.estimator) also works
analyzer.plot_by_param_all()
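If you prefer to inspect the tuning results as a table rather than plots, scikit-learn search objects expose a cv_results_ dict that converts directly to a DataFrame. The values below are hand-made in the shape of cv_results_ (not from a real run), just to show the pattern:

```python
import pandas as pd

# Illustrative dict mimicking scikit-learn's cv_results_ structure.
cv_results = {
    "param_C": [0.01, 0.1, 1.0, 10.0],
    "mean_test_score": [0.71, 0.78, 0.83, 0.81],
    "std_test_score": [0.03, 0.02, 0.02, 0.04],
}
results_df = (
    pd.DataFrame(cv_results)
    .sort_values("mean_test_score", ascending=False)
)
print(results_df)
```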