Chapter 1 - leetschau/Python-Machine-Learning-Cookbook GitHub Wiki

Python Machine Learning Cookbook by Prateek Joshi.

Chapter 1

Main topics in this chapter

Preprocessing data using different techniques
Label encoding
Building a linear regressor
Computing regression accuracy
Achieving model persistence
Building a ridge regressor
Building a polynomial regressor
Estimating housing prices
Computing the relative importance of features
Estimating bicycle demand distribution

Preprocessing data using different techniques

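The articles "Why does data need Standardization and Normalization?" and "When does data need centering and standardization?" explain what Standardization is for, but they treat it and Normalization as the same thing. Standardization operates on features (a feature is one column of data) and transforms each feature to mean 0 and variance 1. Normalization operates on samples (a sample is one row of data) and rescales each sample vector so that its norm is 1 (see 4.3.3. Normalization in section 4.3. Preprocessing data of the scikit-learn documentation). The norm can be L1 or L2, selected with the norm parameter of preprocessing.normalize. The L1 norm is the sum of the absolute values of the vector's elements; the L2 norm is the square root of the sum of their squares. np.linalg.norm computes the norm of a vector: ord=1 gives the L1 norm and ord=2 gives the L2 norm.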
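The column-versus-row distinction can be checked directly. The following is a minimal sketch; the 3x3 array holds made-up illustrative values, and preprocessing.scale is used here as one way to standardize.

```python
import numpy as np
from sklearn import preprocessing

# Toy data (illustrative values): 3 samples (rows) x 3 features (columns)
data = np.array([[3.0, -1.5, 2.0],
                 [0.0, 4.0, -0.3],
                 [1.0, 3.3, -1.9]])

# Standardization works column by column:
# every feature ends up with mean 0 and variance 1
standardized = preprocessing.scale(data)
print(standardized.mean(axis=0))
print(standardized.std(axis=0))

# Normalization works row by row:
# every sample vector ends up with norm 1
normalized_l1 = preprocessing.normalize(data, norm='l1')
normalized_l2 = preprocessing.normalize(data, norm='l2')
l1_norms = np.linalg.norm(normalized_l1, ord=1, axis=1)
l2_norms = np.linalg.norm(normalized_l2, ord=2, axis=1)
print(l1_norms)
print(l2_norms)
```

The column means and standard deviations should come out as approximately 0 and 1, and both sets of row norms as approximately 1.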

preprocessing.MinMaxScaler linearly rescales each feature (column) so that its values lie between the lower and upper bounds given by feature_range.

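preprocessing.Binarizer converts the elements of a matrix to 0 or 1 by comparing each one with threshold: elements greater than threshold become 1, all others become 0.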
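A short sketch of both transformers on the same kind of toy array. The data values and the threshold of 1.4 are illustrative choices, not taken from the notes above.

```python
import numpy as np
from sklearn import preprocessing

# Toy data (illustrative values): 3 samples x 3 features
data = np.array([[3.0, -1.5, 2.0],
                 [0.0, 4.0, -0.3],
                 [1.0, 3.3, -1.9]])

# Rescale every column linearly into the range [0, 1]
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)
print(scaled)

# Elements strictly greater than the threshold become 1, the rest 0
binarizer = preprocessing.Binarizer(threshold=1.4)
binarized = binarizer.fit_transform(data)
print(binarized)
```

After scaling, the smallest value in each column maps to 0 and the largest to 1. In the binarized result only 3.0, 2.0, 4.0 and 3.3 exceed the threshold and become 1.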

For the principle behind One Hot Encoding, see the article "Data cleaning and feature selection: feature scaling and One Hot encoding". However, the example does not make clear how text category data (shown below) gets encoded as [[0,0,3],[1,1,0],[0,2,1],[1,0,2]]:

Gender: ["male", "female"]
Region: ["Europe", "US", "Asia"]
Browser: ["Firefox", "Chrome", "Safari", "Internet Explorer"]
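One plausible reading, sketched below, is that each category value is replaced by its position in its list, so that ["male", "Europe", "Internet Explorer"] becomes [0, 0, 3]. This mapping is an assumption made for illustration. The helper to_codes and the four text samples are hypothetical, chosen so that the integer codes reproduce the matrix above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Category lists as given in the notes; the integer code of a value is
# ASSUMED to be its index in the corresponding list
categories = [
    ["male", "female"],
    ["Europe", "US", "Asia"],
    ["Firefox", "Chrome", "Safari", "Internet Explorer"],
]

def to_codes(sample):
    """Hypothetical helper: map one text sample to its integer codes."""
    return [values.index(value) for values, value in zip(categories, sample)]

# Hypothetical text samples chosen to reproduce the integer matrix
samples = [
    ["male", "Europe", "Internet Explorer"],
    ["female", "US", "Firefox"],
    ["male", "Asia", "Chrome"],
    ["female", "Europe", "Safari"],
]
codes = np.array([to_codes(sample) for sample in samples])
print(codes)

# One-hot encode: 2 + 3 + 4 = 9 output columns, one per category value
encoder = OneHotEncoder()
encoder.fit(codes)
one_hot = encoder.transform([[0, 1, 3]]).toarray()
print(one_hot)
```

Under this assumption the sample [0, 1, 3] (male, US, Internet Explorer) has exactly one 1 in each of the three column groups.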

Computing regression accuracy

Two main metrics are used to evaluate a linear regression:

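sklearn.metrics.mean_squared_error(): returns 0 when the two inputs (true values and predictions) are identical; the closer to 0, the better the fit.
sklearn.metrics.explained_variance_score(): returns 1 when the two inputs are identical and can be negative; the closer to 1, the better the fit.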
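The boundary cases of both metrics can be sketched with a few hand-picked arrays (the numbers are illustrative):

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# A reasonable fit: small error, explained variance close to 1
mse = mean_squared_error(y_true, y_pred)
evs = explained_variance_score(y_true, y_pred)
print(mse, evs)

# Identical inputs: error is 0 and explained variance is 1
mse_perfect = mean_squared_error(y_true, y_true)
evs_perfect = explained_variance_score(y_true, y_true)
print(mse_perfect, evs_perfect)

# Predictions that run against the data: explained variance goes negative
evs_bad = explained_variance_score([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(evs_bad)
```

Here the squared errors are 0.25, 0.25, 0 and 1, so the mean squared error is 0.375. The reversed predictions in the last case have a residual variance four times the variance of the true values, which gives a score of 1 - 4 = -3.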

Polynomial fitting in scikit-learn

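scikit-learn's overall approach to polynomial fitting is to turn the polynomial equation into a linear one and then solve it with linear fitting. For example, to fit a degree-$k$ polynomial in $n$ variables, $y = f(x_0, x_1, ..., x_{n-1})$, first fix the degree $k$: polynomial = sklearn.preprocessing.PolynomialFeatures(degree=k). Next, $n$ is determined by X_train in X_train_transformed = polynomial.fit_transform(X_train): if X_train is a matrix (numpy.ndarray), $n$ is its number of columns; if it is a flat list holding a single sample, $n$ is its length (recent scikit-learn versions expect such a sample to be wrapped as a 2-D input, for example [sample]). The shape of the polynomial can then be inspected through polynomial. For example, below are the list of terms and the list of variable exponents for the degree-2 polynomial in 3 variables $y = a_0 + a_1 x_0 + a_2 x_1 + a_3 x_2 + a_4 x_0^2 + a_5 x_0 x_1 + a_6 x_0 x_2 + a_7 x_1^2 + a_8 x_1 x_2 + a_9 x_2^2$: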

>>> polynomial.get_feature_names()
['1', 'x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2', 'x2^2']
>>> polynomial.powers_
array([[0, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [2, 0, 0],
       [1, 1, 0],
       [1, 0, 1],
       [0, 2, 0],
       [0, 1, 1],
       [0, 0, 2]])

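The values of these ten terms for every sample (the quantities multiplied by the coefficients $a_0, a_1, ..., a_9$) are stored in X_train_transformed; the coefficients themselves are found later by the linear fit. This matrix has the same number of rows as the input matrix X_train, and its number of columns is determined by $n$ and $k$ (it equals $\binom{n+k}{k}$, here $\binom{5}{2} = 10$):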

In [151]: X_train.shape
Out[151]: (400, 3)
In [152]: X_train_transformed
Out[152]:
array([[ 1.00000000e+00, 3.90000000e-01, 2.78000000e+00, ...,
7.72840000e+00, 1.97658000e+01, 5.05521000e+01],
[ 1.00000000e+00, 1.65000000e+00, 6.70000000e+00, ...,
4.48900000e+01, 1.62140000e+01, 5.85640000e+00],
[ 1.00000000e+00, 5.67000000e+00, 6.38000000e+00, ...,
4.07044000e+01, 2.41802000e+01, 1.43641000e+01],
...,
[ 1.00000000e+00, 2.16000000e+00, 1.13000000e+00, ...,
1.27690000e+00, 8.36200000e-01, 5.47600000e-01],
[ 1.00000000e+00, 7.04000000e+00, 3.19000000e+00, ...,
1.01761000e+01, 3.70040000e+00, 1.34560000e+00],
[ 1.00000000e+00, 1.65000000e+00, 6.20000000e-01, ...,
3.84400000e-01, 1.05400000e-01, 2.89000000e-02]])

In [153]: X_train_transformed.shape
Out[153]: (400, 10)

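Treating everything in each term of the expansion except the coefficient $a_i$ as an independent variable turns polynomial fitting into a linear fitting problem. The two steps fit -> predict of a linear regressor then produce the fitted result. The complete code is in regression_multivar.py.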
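The whole pipeline can be sketched end to end as follows. This is not the contents of regression_multivar.py: the synthetic data, the target function and its coefficients are assumptions made up for illustration, chosen noise-free so that the fit can be checked against the known polynomial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic training data (assumption): 400 samples, 3 variables
rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(400, 3))

def target(X):
    """Hypothetical degree-2 polynomial used to generate the labels."""
    x0, x1, x2 = X[:, 0], X[:, 1], X[:, 2]
    return 1.0 + 2.0 * x0 - 3.0 * x1 * x2 + 0.5 * x2 ** 2

y_train = target(X_train)

# Step 1: expand the 3 variables into the 10 terms of a degree-2 polynomial
polynomial = PolynomialFeatures(degree=2)
X_train_transformed = polynomial.fit_transform(X_train)
print(X_train_transformed.shape)

# Step 2: an ordinary linear fit on the expanded terms
poly_linear_model = LinearRegression()
poly_linear_model.fit(X_train_transformed, y_train)

# Step 3: expand a new sample the same way, then predict
datapoint = np.array([[0.39, 2.78, 7.11]])
prediction = poly_linear_model.predict(polynomial.transform(datapoint))
print(prediction, target(datapoint))
```

Because the labels contain no noise, the prediction should agree with the generating polynomial up to floating-point error. With real data the two would differ.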

Estimating housing prices

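The sklearn.utils.shuffle(*arrays, **options) function rearranges one or more input arrays (*arrays) in random order, applying the same permutation to each of them, so for a fixed input the output may differ from call to call: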

>>> import numpy as np
>>> from sklearn.utils import shuffle
>>> X = np.array([[1, 2], [3, 4], [5, 6]])
>>> y = np.array([11, 12, 13])
>>> shuffle(X, y)
[array([[3, 4], [5, 6], [1, 2]]), array([12, 13, 11])]
>>> shuffle(X, y)
[array([[5, 6], [3, 4], [1, 2]]), array([13, 12, 11])]

The specific shuffling rules can be set in the **options parameters
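As one example of such an option, random_state fixes the permutation so that repeated calls give the same result. A minimal sketch, with the seed value 7 chosen arbitrarily:

```python
import numpy as np
from sklearn.utils import shuffle

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([11, 12, 13])

# With a fixed random_state the permutation is reproducible
X_a, y_a = shuffle(X, y, random_state=7)
X_b, y_b = shuffle(X, y, random_state=7)
print(X_a, y_a)
print(X_b, y_b)

# Rows of X and entries of y stay paired: the same permutation
# is applied to every array passed in
for row, label in zip(X_a, y_a):
    print(row, label)
```

Both calls return the same ordering, and each row of X keeps its original label (for instance [3, 4] always travels with 12).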