WW II weather data and Gaussian processes
This time I'm trying to predict the maximum temperature from the minimum temperature and precipitation. The notebook can be found here.
Note that Gaussian processes require a lot of computational power for a large set of observation points - their complexity is O(n³) in the number of observations n.
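To make the cubic scaling concrete: fitting a GP boils down to factorizing the n x n kernel matrix over all training points. A minimal illustration (my own sketch with a naive RBF kernel and Cholesky solve, not code from the notebook):

```python
import time
import numpy as np

def naive_gp_fit(X, y, length_scale=1.0, noise=1e-2):
    # build the n x n RBF kernel matrix ...
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * sqdist / length_scale**2) + noise * np.eye(len(X))
    # ... and factorize it: this Cholesky decomposition is the O(n^3) step
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)  # weights for the posterior mean
    return L, alpha

for n in (500, 1000, 2000):  # doubling n makes the fit roughly 8x slower
    X = np.random.rand(n, 2)
    y = np.random.rand(n)
    t0 = time.time()
    naive_gp_fit(X, y)
    print(n, round(time.time() - t0, 3), 's')
```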
Some background reading:

- Gaussian Processes for little data starts by explaining OLS (ordinary least squares) regression across a parameter space (the parameters of a polynomial). The author then moves to his main subject, the non-parametric GP, and explains the mathematical background. He also shows how to implement the covariance matrix based on the radial basis function (RBF) kernel.
- Gaussian Processes are not so fancy is a more lightweight introduction to the subject.
- Gaussian Process and Big Data deals with scaling Gaussian Processes to big data with lots of observation points.
- Visual Exploration of Gaussian Processes is the most aesthetically appealing article.
- A tutorial with lots of examples of how to plot data.
- My favorite for the mathematical background. A copy of it can be found in this GitHub repo here.
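For reference, the core equations these articles derive - standard GP regression with an RBF kernel, where $K = k(X, X)$ is the training covariance matrix, $K_* = k(X, X_*)$ and $K_{**} = k(X_*, X_*)$ cover the test points, and $\sigma_n^2$ is the noise variance:

```math
\begin{aligned}
k(x, x') &= \sigma_f^2 \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right) && \text{(RBF kernel)} \\
\mu_* &= K_*^\top \left(K + \sigma_n^2 I\right)^{-1} y && \text{(posterior mean)} \\
\Sigma_* &= K_{**} - K_*^\top \left(K + \sigma_n^2 I\right)^{-1} K_* && \text{(posterior covariance)}
\end{aligned}
```

Inverting (in practice: factorizing) the n x n matrix $K + \sigma_n^2 I$ is exactly the O(n³) step mentioned above.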
As usual we kick off the notebook by importing the libraries we need.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import gaussian_process
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ExpSineSquared, ConstantKernel as C
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import ShuffleSplit
import sklearn.decomposition
import sklearn.metrics

%matplotlib inline
```
Let's read the weather data
```python
dataset = pd.read_csv('./Weather.csv', dtype={"Snowfall": object, "PoorWeather": object, "SNF": object, "TSHDSBRSGF": object})
```
and fix the precipitation data we need later:
```python
dataset['Precip'] = pd.to_numeric(dataset['Precip'], errors='coerce')
dataset = dataset.fillna(0)
dataset.shape
```
The last statement should return `(119040, 31)`.
We now have the following data, which we have to 'massage' a bit
```python
# prepare the data: pick feature and target columns
df = dataset[['MinTemp','Precip','MaxTemp']]

# just take the first 1000 values for training - the full dataset takes way too much memory
X = df.values[:,0:-1]  # features MinTemp & Precip
y = df.values[:,2]     # target MaxTemp
X_train = df.head(1000).values[:,0:-1]
y_train = df.head(1000).values[:,2]
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
before we dive right into Gaussian Processes.
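As an aside: StandardScaler is imported above but not used below. Since a single-length-scale RBF kernel is sensitive to features living on very different scales, one could optionally standardize first. A minimal sketch (the scaled variable names are mine, not the notebook's):

```python
# optional: put MinTemp and Precip on comparable scales before fitting
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_scaled = scaler.transform(X)  # apply the same transform to the full feature set
```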
```python
np.random.seed(1)

# Predict max temp from min temp and precipitation
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9)
gp.fit(X_train, y_train)
print(gp.score(X_train, y_train))

y_pred, sigma = gp.predict(X, return_std=True)

# Back to a dataframe
dfnew = pd.DataFrame({'MinTemp': X[:,0], 'Precip': X[:,1], 'MaxTemp': y[:], 'MaxPred': y_pred})

# and predict for the training data only
y_predT, sigma = gp.predict(X_train, return_std=True)
```
The score printed above comes out at 0.5901525736283517 - a rather mediocre R².
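Training-set R² can also flatter the model. Since sklearn.metrics is already imported, one way to sanity-check is to score a slice the model has not seen - a sketch (the 1000:2000 hold-out slice is my assumption, not part of the notebook):

```python
# evaluate on rows outside the first 1000 used for training
X_holdout = df.iloc[1000:2000].values[:, 0:-1]
y_holdout = df.iloc[1000:2000].values[:, 2]
print(sklearn.metrics.r2_score(y_holdout, gp.predict(X_holdout)))
```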
Plotting observations and predictions for the training data with
```python
# Gaussian doesn't like values below 0 !!!
plt.figure(figsize=[10,7])
plt.plot(X_train[:,0], y_train, 'r.', markersize=10, label='MinTemp')
plt.plot(X_train[:,1], y_train, 'g.', markersize=10, label='Precipitation')
plt.plot(X_train[:,0], y_predT, 'b-', label='Prediction')
plt.plot(X_train[:,1], y_predT, 'y-', markersize=10, label='Prediction2')  # prediction vs. precipitation
#plt.fill(np.concatenate([X, X[::-1]]),
#         np.concatenate([y_pred - 1.9600 * sigma,
#                         (y_pred + 1.9600 * sigma)[::-1]]),
#         alpha=.5, fc='b', ec='None', label='95% confidence interval')
plt.xlabel('$MinTemp$')
plt.ylabel('$MaxTemp$')
plt.ylim(-10, 35)
plt.xlim(-10, 35)
plt.legend(loc='upper right')
```
yields
Let's see how well we predict non-training data with
```python
df1 = dfnew.head(30) + 40  # lift by 40 for strictly positive numbers
df1.plot(kind='bar', figsize=(16,10))
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.show()
```
to generate the following plot.
The last little plot shows we're doing only so-so.
Linear regression is better ;-)
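LinearRegression is already imported at the top, so the baseline comparison could look like this - a sketch of my own, not code from the notebook:

```python
# fit an ordinary linear regression on the same training slice
lr = LinearRegression().fit(X_train, y_train)
print('linear regression R2:', lr.score(X_train, y_train))
print('gaussian process  R2:', gp.score(X_train, y_train))
```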
3-d plotting of MinTemp, Precipitation and MaxTemp exhibits a rough linear relationship between the two temperature values (as expected).
```python
from matplotlib import cm
from mpl_toolkits.mplot3d import axes3d

fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(111, projection='3d')
ax.plot_trisurf(df['MinTemp'], df['Precip'], df['MaxTemp'], cmap=cm.coolwarm, linewidth=0.2)
plt.show()
```
After cleansing the data by removing rows where MinTemp > MaxTemp with `df = df1[~(df1['MinTemp'] > df1['MaxTemp'])]`, and taking the first 3000 elements as training data, the results are slightly better.
BTW, fitting (training) took quite a long time, i.e. ~30 minutes - the O(n³) complexity shows. As a result, training - especially training with multiple parameters, for example with different kernels to find the best one via cross-validation - is not the kind of function you want to run synchronously in your pipeline.
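To illustrate the kind of offline search meant here, a sketch of comparing kernels via cross-validation on a small subsample (the candidate kernels and the 500-row subsample are my assumptions, not the notebook's):

```python
from sklearn.model_selection import cross_val_score

# compare candidate kernels on a subsample - cross-validating on the full data would be prohibitive
X_small, y_small = X_train[:500], y_train[:500]
candidates = {
    'RBF':      C(1.0) * RBF(10) + WhiteKernel(),
    'periodic': C(1.0) * ExpSineSquared() + WhiteKernel(),
}
for name, k in candidates.items():
    gp_cv = GaussianProcessRegressor(kernel=k, n_restarts_optimizer=2)
    scores = cross_val_score(gp_cv, X_small, y_small,
                             cv=ShuffleSplit(n_splits=3, test_size=0.2))
    print(name, scores.mean())
```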
The mmfunctions module also contains a `GaussianProcess` subclass of `BaseEstimatorFunction`. Run `./register.py test` to execute a rather simple run over a one-dimensional feature space, generating `column5` with its predictions - it's test case number 4. It's just there to show the plain "mechanics".