From dataset exploration to Analytics Services pipeline

The approach is derived from the Kaggle challenge "Weather Conditions in World War Two: Is there a relationship between the daily minimum and maximum temperature? Can you predict the maximum temperature given the minimum temperature?". It is based on a dataset of weather conditions during WWII found here; a copy has been uploaded to this git repository.

The challenge and a solution have been covered extensively in Nagesh Singh Chauhan's Towards Data Science article "A beginner’s guide to Linear Regression in Python with Scikit-Learn". I'm merely retracing the steps of this article to show the path from exploring the characteristics/features of a dataset to extracting features periodically in an Analytics Service job. I'm also omitting Nagesh's theory of linear regression here, diving directly into the code.

For the full notebook see here; for a simple way to run Jupyter locally you could use the Jupyter container found here.

As usual we start by importing the Python modules we need later.

(screenshot: notebook imports)
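
As a sketch, the imports in the notebook probably look like this - the exact module list is an assumption based on the steps that follow:

# assumed notebook imports: pandas/numpy for data handling, matplotlib for plots, scikit-learn for the regressors
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, SGDRegressor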

This is equivalent to the initial lines of an Analytics Service pipeline function - the major differences are the AS-specific metadata and logging modules and the absence of plotting modules. Pipeline functions run in a container with prerequisites like sklearn or sqlalchemy already installed.

import json
import logging
import numpy as np
from sqlalchemy import Column, Integer, String, Float, DateTime, Boolean, func
from iotfunctions import bif
from iotfunctions.metadata import EntityType
from iotfunctions.db import Database
from iotfunctions.enginelog import EngineLogging
from iotfunctions import estimator
import datetime as dt

The next step is loading the data into a pandas dataframe. Note that we have to tell the CSV reader how to interpret the data types in certain columns: we turn them into objects and leave them alone.

(screenshot: loading the CSV into a dataframe)
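
As a sketch - the file name and the columns forced to object dtype are assumptions, not copied from the notebook:

# read the weather data; columns with mixed content are kept as plain objects
# (file name and the dtype column names are assumptions)
df = pd.read_csv('Summary of Weather.csv',
                 dtype={'WindGustSpd': object, 'PoorWeather': object, 'TSHDSBRSGF': object})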

An AS job step already receives its data in a dataframe, unless it is a data source itself.

Now we start exploring the data in our notebook.

(screenshot: dataframe dimensions)

It's a matrix with 31 columns and 119,040 rows. Now we look at the content

(screenshot: summary statistics)
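
Behind these two screenshots are most likely the standard pandas calls:

# dimensions of the dataframe: (119040, 31)
print(df.shape)
# summary statistics of the numeric columns
print(df.describe())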

and plot the data

(screenshot: plot of the data)
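
A sketch of the plotting cell, assuming matplotlib and the MinTemp/MaxTemp columns used below:

# scatter plot of minimum vs. maximum temperature
df.plot(x='MinTemp', y='MaxTemp', style='o')
plt.title('MinTemp vs MaxTemp')
plt.xlabel('MinTemp')
plt.ylabel('MaxTemp')
plt.show()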

The Analytics Service dashboard also allows you to inspect the data, but it does not focus on explorative data science approaches.

(screenshot: AS dashboard instance)

Now we try to apply linear regression with MaxTemp as the dependent variable and MinTemp as the independent variable.

Let's 'massage' the data first

(screenshot: data preparation)

and turn the two relevant columns into two arrays X and y.
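
As a sketch, assuming nothing more than selecting the two columns:

# reshape into the 2D arrays scikit-learn expects
X = df['MinTemp'].values.reshape(-1, 1)
y = df['MaxTemp'].values.reshape(-1, 1)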

Next we split these two arrays into a training and a test set

(screenshot: train/test split)
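
The split itself is the usual scikit-learn one-liner; the 80/20 ratio is an assumption, but it is consistent with the training set size of 95,232 rows mentioned further below:

# hold out 20% of the data as test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)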

and train our linear regression

(screenshot: training the linear regression)
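
Training the closed-form linear regression boils down to:

# fit ordinary least squares: learns one coefficient and the intercept
regressor = LinearRegression()
regressor.fit(X_train, y_train)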

The AS equivalent would be a simple call to an estimator - leaving aside the fact that the SimpleRegressor estimator employs more complex approaches like gradient boosting regression and stochastic gradient descent and even compares the results before storing the model with the best predictions.

estimator.BaseRegressor(
                        features = ['X_train'],
                        targets = ['y_train'],
                        predictions = ['y1_predicted']),

with predictions naming the initially empty column in which the regressor stores its predictions after training. Estimators take care of dropping NaNs in features, targets and predictions before turning these pandas dataframe columns into numpy arrays for the sklearn regression module. The training result, in this case slope/coefficient and constant/intercept, is stored in Cloud Object Store; if no model is stored there yet, the SimpleRegressor interprets features and targets as training data. You can drop the existing model to trigger a retraining. As stated before, SimpleRegressors run two methods and compare their results using the r2 metric, i.e.

(screenshot: comparison of the training results)

In our simple linear regression approach, training results in an intercept (constant) and a coefficient (gradient).

(screenshot: intercept and coefficient)
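
In notebook terms this corresponds to:

# constant term and slope of the fitted line
print(regressor.intercept_)
print(regressor.coef_)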

Now we predict on the test set

(screenshots: prediction and results)
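
A sketch of the prediction cell, including a side-by-side comparison of measured and predicted values:

y_pred = regressor.predict(X_test)
# compare actual and predicted maximum temperatures
comparison = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
print(comparison.head())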

This is equivalent to the AS call

estimator.SimpleRegressor(
                        features = ['X_test'],
                        targets = ['y_test'],
                        predictions = ['y1_predicted']),

with y1_predicted now established.
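
For context, here is a rough sketch of how such an estimator could be attached to an entity type, loosely following the iotfunctions sample scripts - the credentials file, entity name and column names are made up for illustration, so consult the iotfunctions samples for the exact wiring:

# load AS credentials (file name is an assumption)
with open('credentials_as.json', encoding='utf-8') as f:
    credentials = json.loads(f.read())

db = Database(credentials=credentials)

# hypothetical entity type with the two temperature metrics and the regressor attached
entity = EntityType('ww2_weather', db,
                    Column('mintemp', Float()),
                    Column('maxtemp', Float()),
                    estimator.SimpleRegressor(features=['mintemp'],
                                              targets=['maxtemp'],
                                              predictions=['maxtemp_predicted']),
                    **{'_timestamp': 'evt_timestamp',
                       '_db_schema': credentials.get('db_schema', None)})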

Finally we plot our predictions

(screenshot: plot of the predictions)

and how the predicted linear function fits the scatter plot of the measured max temperature values.

(screenshot: regression line over the scatter plot)
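
Both plots can be reproduced along these lines - the choice of a bar chart for the first plot is an assumption:

# bar chart of the first few actual vs. predicted values
comparison.head(25).plot(kind='bar', figsize=(16, 10))
plt.show()

# regression line on top of the test data scatter plot
plt.scatter(X_test, y_test, color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()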

Use stochastic gradient descent

See Wikipedia for a good explanation of it; here we dive right into the middle.

Using this approach for our univariate problem is a bit of overkill since we have a closed form to find the optimal solution and do not need an iterative approximation. Nevertheless, let's go through the exercise:

We use the same training data and run 1000 SGD iterations

ToDo: Explain the SGD parameters.

(screenshot: SGD regressor training)
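
A sketch of this cell; everything beyond max_iter is an assumed setting (see the ToDo above):

# iterative alternative to the closed-form fit above
sgdregressor = SGDRegressor(max_iter=1000, tol=1e-3, eta0=0.01)
# SGDRegressor expects a one-dimensional target array
sgdregressor.fit(X_train, y_train.reshape(-1,))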

Now we predict the max temperature from the min temperature with the SGD regressor and show the results

(screenshot: SGD prediction results)

(screenshot: SGD fit over the scatter plot)
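
Again as a sketch:

y_pred_sgd = sgdregressor.predict(X_test)
# overlay the SGD fit on the test data
plt.scatter(X_test, y_test, color='gray')
plt.plot(X_test, y_pred_sgd, color='blue', linewidth=2)
plt.show()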

The Analytics Service BaseRegressor compares stochastic gradient descent with a gradient boosting regressor, each run with different learning rates, loss functions etc., using the r2 metric (as above). It employs randomized search cross validation to find the best parameter set.

For more on cross validation to tune a model, i.e. find suitable parameters, see here.

This is the model tuning workflow built into the Analytics Services estimators. It is called cross validation because the training data is again split in multiple ways to validate the training results.
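
To make the idea concrete, here is a minimal sketch of a randomized search over SGD parameters in plain scikit-learn terms - the parameter grid and the number of candidates are assumptions, not the actual settings of the AS estimators:

from sklearn.model_selection import RandomizedSearchCV

# hypothetical search space over learning rate schedule, initial learning rate and regularization
param_distributions = {
    'learning_rate': ['constant', 'optimal', 'invscaling'],
    'eta0': [0.001, 0.01, 0.1],
    'alpha': [1e-5, 1e-4, 1e-3],
}
search = RandomizedSearchCV(SGDRegressor(max_iter=1000),
                            param_distributions,
                            n_iter=10,
                            scoring='r2',   # candidates are compared on the r2 metric, as above
                            cv=5)           # 5-fold split of the training data
search.fit(X_train, y_train.reshape(-1,))
print(search.best_params_)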

According to Sonya Sawtelle's article, a "good" value for the number of iterations is 1,000,000 / n, with n being the size of the training array.

In our case n is 95,232, so we might have gotten away with 11 iterations instead of 1000. I did just that:

# max_iter must be an integer, hence the explicit cast
sgdregressor.set_params(max_iter = int(np.ceil(10**6 / len(y_train))))
sgdregressor.fit(X_train, y_train.reshape(-1,))   # SGDRegressor expects a 1D target array
y_pred = sgdregressor.predict(X_test)
# plot the new fit over the test data
plt.scatter(X_test, y_test,  color='gray')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.show()

Confirming this, the plots all look alike, so 11 iterations were indeed good enough.

Furthermore, Sonya also observed that plain linear regression is often good enough, in which case there is not much benefit of SGD over linear regression, even in the multivariate case.

See here for a continuation.