User Guide - not-so-fat/conjurer GitHub Wiki
Objective
We aim to help you perform data analytics smoothly on data prepared as pandas.DataFrame (https://pandas.pydata.org/). In this document, we explain the challenges existing libraries pose when implementing data pipelines, and how we want to solve them.
Quick Start
Installation
Please use pip to install:
pip install conjurer
or you can use our notebook Docker image, which installs conjurer by default: https://github.com/not-so-fat/notebook_docker.
Usage
In the sample code below, we assume you have imported our library as follows:
from conjurer import (
eda,
ml
)
Load Data
To load a CSV file as a pandas.DataFrame, the most popular method is pandas.read_csv. However, the loaded data frame may have different dtypes than you expect, especially for integer / timestamp columns. eda.read_csv has the same interface as pandas.read_csv, but additionally tries to infer whether each column is an integer or timestamp.
df = eda.read_csv("sample.csv")
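The kind of inference eda.read_csv adds can be pictured with a small stdlib-only sketch. Note that infer_column_type below is a hypothetical helper for illustration, not a conjurer API: a column whose string values all parse as integers (or as timestamps) gets that type instead of plain text.

```python
from datetime import datetime

def infer_column_type(values):
    """Guess whether a column of CSV strings holds integers, timestamps, or plain text.

    Illustrative sketch only; eda.read_csv performs this kind of inference internally.
    """
    try:
        [int(v) for v in values]  # every value parses as an integer
        return "integer"
    except ValueError:
        pass
    try:
        # every value parses as an ISO-like timestamp
        [datetime.strptime(v, "%Y-%m-%d %H:%M:%S") for v in values]
        return "timestamp"
    except ValueError:
        return "string"

print(infer_column_type(["1", "2", "30"]))            # integer
print(infer_column_type(["2021-01-01 00:00:00"]))     # timestamp
print(infer_column_type(["abc", "def"]))              # string
```

A real implementation would also handle nulls and floats; this only shows the idea of type inference from string values.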
Get Basic Statistics
We often want to get basic statistics on a prepared data set first. You could use pandas.DataFrame.describe; eda.check_stats adds some statistics and plots a histogram for each column.
stats_df = eda.check_stats(df)
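Conceptually, such a summary extends describe-style statistics with counts of nulls and distinct values per column. A stdlib-only sketch of that idea (column_stats is a hypothetical helper, not part of conjurer):

```python
import statistics

def column_stats(values):
    """Summary statistics for one column, including null and distinct counts."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "num_null": len(values) - len(non_null),
        "num_distinct": len(set(non_null)),
        "mean": statistics.mean(non_null),
        "min": min(non_null),
        "max": max(non_null),
    }

print(column_stats([1, 2, 2, None, 5]))
```

check_stats itself works on a whole DataFrame and also renders histograms; this only shows the per-column aggregation behind such a report.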
Visualize Data
You can plot several types of charts, generated by altair (https://altair-viz.github.io/). These APIs return an altair.Chart object, so you can adjust the chart configuration by modifying it.
- Histogram
eda.plot_histogram(series, num_bins=50)
- Scatter plot / Heatmap
eda.plot_scatter(df, column_x, column_y)
- If the number of rows is larger than 5000 (altair's limit), a heatmap is plotted instead
- Heatmap
eda.plot_heatmap(df, column_x, column_y)
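When a scatter plot would exceed altair's 5000-row limit, a heatmap instead aggregates points into a 2D grid of counts. The binning step behind such a heatmap can be sketched with the stdlib (bin_2d is illustrative, not a conjurer function):

```python
from collections import Counter

def bin_2d(xs, ys, num_bins, x_range, y_range):
    """Count points falling into each cell of a num_bins x num_bins grid."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    counts = Counter()
    for x, y in zip(xs, ys):
        # clamp so the maximum value lands in the last bin
        i = min(int((x - x_lo) / (x_hi - x_lo) * num_bins), num_bins - 1)
        j = min(int((y - y_lo) / (y_hi - y_lo) * num_bins), num_bins - 1)
        counts[(i, j)] += 1
    return counts

counts = bin_2d([0.1, 0.9, 0.95], [0.1, 0.9, 0.92],
                num_bins=2, x_range=(0, 1), y_range=(0, 1))
print(counts)  # cell (0, 0) holds 1 point, cell (1, 1) holds 2
```

The cell counts are then rendered as color intensity, so arbitrarily many rows reduce to at most num_bins² marks.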
Feature Engineering (Under Development)
T.B.A.
Machine Learning
Get a default RandomizedSearchCV / GridSearchCV to tune machine learning models
For lightgbm, xgboost, linear_model, and random_forest, we provide CV objects with default hyperparameter search spaces.
cv = ml.get_default_cv("lightgbm", "cl")
model = cv.fit_cv_pandas(df, target_column="y", feature_columns=["x{}".format(i) for i in range(100)], n_fold=3)
model.predict(df)
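Under the hood, a RandomizedSearchCV-style tuner samples hyperparameter configurations from a search space and keeps the one with the best validation score. A minimal stdlib sketch of that loop, where score_fn stands in for cross-validated model evaluation (names here are illustrative, not conjurer or sklearn APIs):

```python
import random

def random_search(param_space, score_fn, n_iter=20, seed=0):
    """Sample n_iter configurations and return the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # draw one value for each hyperparameter
        params = {name: rng.choice(choices) for name, choices in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective: pretend the best configuration is lr=0.1 with 100 trees.
space = {"learning_rate": [0.01, 0.1, 0.3], "num_trees": [50, 100, 200]}
score = lambda p: -abs(p["learning_rate"] - 0.1) - abs(p["num_trees"] - 100) / 100
best, _ = random_search(space, score, n_iter=50)
print(best)
```

ml.get_default_cv wraps this pattern with sensible search spaces per model family, and fit_cv_pandas handles the cross-validation splits on the DataFrame for you.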
Or tune machine learning models your own way using pandas-interfaced RandomizedSearchCV / GridSearchCV
You can also use RandomizedSearchCV / GridSearchCV with a pandas interface; these let you customize hyperparameter tuning.
from sklearn import linear_model
cv = ml.RandomizedSearchCV(linear_model.LogisticRegression(), ...)
model = cv.fit_cv_pandas(df, target_column="y", feature_columns=["x{}".format(i) for i in range(100)], n_fold=3)
model.predict(df)
Analyze hyperparameter tuning results
You can check how each hyperparameter affected scores on the validation data set during tuning.
analyzer = ml.CVAnalyzer(cv) # ml.CVAnalyzer(model.estimator) also works
analyzer.plot_by_param_all()
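The per-parameter view such an analyzer plots is essentially validation score grouped by each hyperparameter value. That aggregation can be sketched with the stdlib; cv_results below is toy data shaped like sklearn's cv_results_ attribute, and mean_score_by_param is a hypothetical helper, not a conjurer API:

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_param(cv_results, param_name):
    """Average the validation score over all trials sharing a parameter value."""
    grouped = defaultdict(list)
    for params, score in zip(cv_results["params"], cv_results["mean_test_score"]):
        grouped[params[param_name]].append(score)
    return {value: mean(scores) for value, scores in grouped.items()}

# Toy results in the shape of sklearn's cv_results_
cv_results = {
    "params": [{"C": 0.1}, {"C": 0.1}, {"C": 1.0}],
    "mean_test_score": [0.70, 0.80, 0.90],
}
print(mean_score_by_param(cv_results, "C"))
```

Plotting these grouped scores per parameter, as plot_by_param_all does for every parameter at once, shows which hyperparameters actually moved the validation score.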