[TOC]
# Machine Learning A-Z
Simple linear regression models the relationship between two continuous (quantitative) variables: y = b0 + b1*x1. In R:
regressor = lm(formula = Salary ~ YearsExperience,
               data = training_set)
summary(regressor)
The significance of the coefficients indicates how strongly X is associated with Y.
- X: independent variable
- Y: dependent variable
- Coefficients (least squares coefficients, LSC): estimated from the observed data
- Residual standard error: the average amount that the response deviates from the true regression line.
  - An absolute measure of the lack of fit of the model.
- R²: the proportion of variance explained (PVE) by the regression; ranges over [0, 1].
  - Close to 0: either the model is wrong, or the inherent error σ² is high.
Assumptions of linear regression
- Linearity
- Homoscedasticity (homogeneity of variance: whether the variance is the same across samples)
- Multivariate normality
- Independence of errors
- Lack of multicollinearity
For example, NYC and CA are encoded as 0/1 dummy variables. You cannot include both dummy variables in a multiple linear regression because of the dummy variable trap: always omit one dummy variable (use n - 1 dummies for n categories), as in the sketch below.
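A minimal sketch of the n - 1 encoding with pandas (the 'State' column and its values are made-up example data):

```python
import pandas as pd

# toy data: one categorical column and one numeric column (made-up values)
df = pd.DataFrame({'State': ['NY', 'CA', 'NY', 'FL'],
                   'Profit': [192261.8, 191792.1, 182901.9, 166187.9]})

# drop_first=True omits one dummy column, avoiding the dummy variable trap
X = pd.get_dummies(df, columns=['State'], drop_first=True)
print(X.columns.tolist())   # ['Profit', 'State_FL', 'State_NY']; 'CA' is the omitted baseline
```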
Why select variables?
- Garbage in, garbage out.
- You have to be able to explain the model, so keep only the right variables.
Five methods of building a model:
- All-in: include all variables.
- Backward elimination (stepwise regression), the fastest of the five (see the sketch below):
  1. Select a significance level, e.g. 0.05.
  2. Fit the model with all possible predictors.
  3. Remove the predictor with the highest p-value if that p-value is > 0.05.
  4. Re-fit the model without that variable, then return to step 3.
  5. Stop when all remaining p-values are below 0.05.
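A rough sketch of this loop with statsmodels (not the course's exact code; X is assumed to be a NumPy matrix of predictors that already includes a column of ones for the intercept, y the target vector):

```python
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    """Repeatedly drop the predictor with the highest p-value above sl."""
    cols = list(range(X.shape[1]))
    while True:
        model = sm.OLS(y, X[:, cols]).fit()
        worst = int(np.argmax(model.pvalues))
        if model.pvalues[worst] > sl:
            del cols[worst]            # remove the least significant predictor
        else:
            return model, cols         # all remaining p-values <= sl
```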
- Forward selection (stepwise regression):
  1. Select a significance level, e.g. 0.05.
  2. Fit all simple linear regression models and keep the one whose variable has the smallest p-value.
  3. Add one more variable at a time (two-variable, three-variable, ... models), each time keeping the new variable with the smallest p-value.
  4. When the best new variable's p-value is > 0.05, stop: the previous model is the best model.
- Bidirectional elimination (stepwise regression):
  1. Select significance levels to enter and to stay, e.g. 0.05.
  2. Forward selection: add a new variable.
  3. Backward elimination: remove variables that are no longer significant.
  4. Alternate forward and backward steps.
  5. Stop when no new variables can enter and no old variables can exit.
- All possible models (score comparison):
  1. Select a goodness-of-fit criterion (e.g. the Akaike information criterion).
  2. Construct all possible models: 2^n - 1 of them (10 columns => 1023 models).
  3. Select the one with the best criterion value.

Note: y = b0 + b1*x1 + b2*x1^2 is still linear regression; "linear" refers to the coefficients (they enter to the first power), not to the powers of the variables.
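For example, a quick scikit-learn sketch of fitting that quadratic model with an ordinary linear regressor (toy data, made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(1, 11).reshape(-1, 1)             # toy predictor x1
y = 2 + 3 * X.ravel() + 0.5 * X.ravel() ** 2    # toy quadratic response

# expand x1 into [x1, x1^2]; the model stays linear in b0, b1, b2
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)        # roughly 2 and [3, 0.5]
```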
CART (Classification And Regression Trees)
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences.
qxs: after the splits partition the data into subsets, each subset gets a distinct mean value, which becomes the prediction for that subset.
*[qxs]: Xinshuai Qi
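A minimal decision tree regression sketch in scikit-learn (toy, made-up data; each leaf predicts the mean of its subset):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6]])     # toy feature
y = np.array([1.1, 1.0, 5.2, 5.0, 9.1, 9.0])     # toy target

# min_samples_leaf=2 keeps at least two points per subset (leaf)
regressor = DecisionTreeRegressor(min_samples_leaf=2, random_state=0).fit(X, y)
print(regressor.predict([[2.5], [5.5]]))         # leaf means: ~1.05 and ~9.05
```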
Random Forest: ensemble learning, similar to bootstrapping.
- Pick K points at random from the whole dataset.
- Build a decision tree on that subset.
- Build another decision tree, and another...
- With N trees, take the average of the N predictions (see the sketch below).
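A matching Random Forest regression sketch in scikit-learn (same toy data as the decision tree sketch; n_estimators is the number N of trees):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.1, 1.0, 5.2, 5.0, 9.1, 9.0])

# 10 trees, each grown on a bootstrap sample; predictions are averaged
regressor = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
print(regressor.predict([[2.5]]))    # average of the 10 trees' predictions
```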
R-squared:
- SS_res = Σ(y_i - ŷ_i)²
- SS_tot = Σ(y_i - ȳ)²
- R² = 1 - SS_res / SS_tot
Goodness of fit:
- How good is your regression line compared with the average line?
- The closer R² is to 1, the better your model.
Adjusted R-squared:
- Problem with R²: adding more variables always increases R².
- Adjusted R² penalizes you for adding additional variables: Adj. R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the sample size and p is the number of predictors.
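A small helper sketch computing both quantities by hand (function and variable names are placeholders, not from the course):

```python
import numpy as np

def r_squared(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)           # sum of squared residuals
    ss_tot = np.sum((y - np.mean(y)) ** 2)       # total sum of squares
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_pred, p):
    n = len(y)                                   # n observations, p predictors
    r2 = r_squared(y, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra predictors
```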
K-Nearest Neighbors (K-NN): estimate the probability that a new point belongs to class 1 or 0.
- Choose the number of neighbors K.
- Find the K nearest neighbors of the new point using Euclidean distance.
- Count how many of those nearest neighbors fall in each group.
- Assign the new point to the group with the most nearest neighbors.
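A minimal scikit-learn sketch of these steps (assuming X_train, y_train, X_test already exist from a train/test split; k = 5 is an example choice):

```python
from sklearn.neighbors import KNeighborsClassifier

# metric='minkowski' with p=2 is the Euclidean distance
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)       # X_train, y_train assumed to exist
y_pred = classifier.predict(X_test)    # each point gets the majority class of its 5 neighbors
```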
SVM (Support Vector Machine): find the best decision boundary by finding the support vectors; the line in the middle of the margin is called the maximum margin hyperplane (maximum margin classifier). Merits:
- It learns from the extreme, boundary cases: the apples that look most like oranges, and the oranges that look most like apples.
Kernel SVM:
- When the data is not linearly separable,
- map it to a higher dimension (e.g. 3D) where it becomes separable;
- this mapping can be highly compute-intensive,
- so a kernel (e.g. the Gaussian RBF kernel) is used instead.
Types of Kernel Function
- Gaussian RBF Kernel
- Sigmoid Kernel
- Polynomial Kernel
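A minimal kernel SVM sketch in scikit-learn (assuming X_train, y_train, X_test from a prior split; the RBF kernel is one of the options listed above):

```python
from sklearn.svm import SVC

# Gaussian RBF kernel: implicitly maps the data to a higher-dimensional space
classifier = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
classifier.fit(X_train, y_train)       # X_train, y_train assumed to exist
y_pred = classifier.predict(X_test)
```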
Why "Naive"?
-
Bayes requires the variables are independent, but in many cases, that is not true. Thus the assumptions are "naive".
-
P (Drives | X ) = P (X |drives) * P(drives) / P(x)
-
poster probability: P (Drives | X ) ; PP 样本X中,事件发生 ("1") 的概率
-
likelihood: P (X |drives); sample X are people who drive
-
prior probability: 在总样本中,事件发生 ("1") 的概率
-
P(X) 样本占总体的比例 可以忽略,当你compare 0 和 1的概率.
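A minimal Naive Bayes sketch in scikit-learn (the Gaussian variant, assuming continuous features and an existing train/test split):

```python
from sklearn.naive_bayes import GaussianNB

# applies Bayes' theorem with the "naive" independence assumption per feature
classifier = GaussianNB()
classifier.fit(X_train, y_train)       # X_train, y_train assumed to exist
y_pred = classifier.predict(X_test)
```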
Hierarchical clustering: results are very similar to (or the same as) K-means clustering.
Two types of hierarchical clustering:
- Agglomerative (bottom-up):
  - Each point starts as its own cluster, N clusters in total.
  - Merge the two closest clusters into one: N - 1 clusters.
  - Repeat until you reach only one cluster.
- Divisive: the reverse of agglomerative.
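A minimal agglomerative clustering sketch in scikit-learn (5 clusters and Ward linkage are arbitrary example choices; X is your feature matrix):

```python
from sklearn.cluster import AgglomerativeClustering

# bottom-up: start with one cluster per point and merge the closest clusters
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
y_hc = hc.fit_predict(X)               # X assumed to exist; returns a cluster label per point
```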
Reinforcement learning (also called online learning): used to solve interacting problems where the data observed up to time t is considered to decide which action to take at time t + 1.
Deep learning: a class of algorithms that attempt high-level abstraction of data by using multiple processing layers with complex structures or built from multiple non-linear transformations.
- use a cascade of multiple layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input.
- not a new thing, but need a LOT of data
- Geoffrey Hinton: the godfather of deep learning. [Check his videos on YouTube]
- Mimic human brain
- Artificial Neural Network:
- Input layer:
  - the columns (features) of the data.
- Hidden layer:
  - weights the inputs, like regression / least squares;
  - each neuron weights the input layer differently;
  - the combination of the weighted decisions of all these neurons provides a powerful output layer.
- Output layer
Activation functions (applied to the weighted sum of the inputs):
- Threshold (yes/no step function)
- Sigmoid (commonly used in the output layer)
  - similar to the logistic regression trend line
- Rectifier / ReLU (commonly used in the hidden layers)
  - piecewise-linear shape: __/
- Hyperbolic tangent (tanh)
  - similar to sigmoid / logistic, but ranges over (-1, 1)
- Once you get the output value, compare it with the actual value, then feed back to adjust the weight of each neuron (via the cost function).
- Cost function: the gap between the prediction and the actual value.
- To reduce the cost, adjust the weights of each neuron.
- Then feed the rows in again, and again, and again.
- There are many cost functions (see any list of cost functions).
Gradient descent: minimize the cost function in a very efficient way (like a ball in a bowl that finally comes to rest at the bottom).
If the cost surface has multiple local minima, plain (batch) gradient descent may miss the global minimum => use stochastic gradient descent.
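A toy illustration of the "ball in a bowl" idea, assuming a one-dimensional cost (w - 3)^2 (made-up numbers, not the course's code):

```python
# gradient descent on cost(w) = (w - 3)^2; the minimum is at w = 3
learning_rate = 0.1
w = 10.0                         # arbitrary starting weight
for step in range(100):
    gradient = 2 * (w - 3)       # derivative of the cost with respect to w
    w -= learning_rate * gradient
print(w)                         # close to 3; stochastic GD would update per observation instead
```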
**Steps of training a deep neural network:**
- Randomly initialise the weights to small numbers close to 0.
- => Forward propagation: calculate the errors.
- <= Back propagation: adjust all the weights at the same time.
- Repeat the two propagation steps after each observation (reinforcement learning) or after a batch of observations (batch learning).
- When the whole training set has passed through the network, that is one epoch. Redo more epochs.
Example problem (churn modelling): many customers left the bank in the past 6 months. Given a sample of 10,000 customers and whether or not each of them left, predict who will leave next, and why.
GPUs are better than CPUs for ANNs and deep learning; they are good at parallel computation.
- Theano
- TensorFlow
- Keras (build deep learning models on top of Theano or TensorFlow in a few lines)
from keras.models import Sequential
from keras.layers import Dense

classifier = Sequential()
# input layer + first hidden layer: 11 input features, 6 hidden units
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))
# output layer: 1 unit with sigmoid for a binary outcome
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)
- Rule of thumb: use (number of input nodes + number of output nodes) / 2 as the number of units in a hidden layer.
- The second hidden layer does not need to be given input_dim.
- If the dependent variable (y) has more than two categories, use softmax instead of sigmoid as the output activation (see the sketch below).
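For example (a sketch, assuming the dependent variable had three categories), the output layer and loss of the model above would become:

```python
# 3 categories => 3 output units with softmax and categorical cross-entropy
classifier.add(Dense(units = 3, kernel_initializer = 'uniform', activation = 'softmax'))
classifier.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
```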
CNN (Convolutional Neural Networks); applications include:
- image recognition
- self-driving cars
Yann LeCun, a student of Geoffrey Hinton, is the godfather of CNNs.
Steps (finding features in the image):
- Use a feature detector / filter (3x3, 5x5, or 7x7) that slides over the image and measures how well each patch matches, creating a feature map. The feature map is smaller than the image.
- Create many feature maps; together they form a convolutional layer.
- Apply pooling (downsampling) to each feature map of the convolutional layer.
- Repeat: add another convolutional layer and pooling layer.
- Flatten the resulting matrices into a vector, which becomes the input layer of an ANN.
- Add the ANN: input layer => fully connected layer => output layer.
- The rectifier (ReLU) is applied to increase non-linearity in the image (play with this link). A minimal Keras sketch follows below.
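A minimal Keras sketch of these steps (the 64x64 RGB input size and 32 filters are arbitrary example choices, not the course's exact values):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

classifier = Sequential()
# convolution (32 feature detectors of 3x3) + ReLU, then pooling
classifier.add(Conv2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
# second convolutional + pooling layer
classifier.add(Conv2D(32, (3, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
# flattening, then the fully connected ANN on top
classifier.add(Flatten())
classifier.add(Dense(units=128, activation='relu'))
classifier.add(Dense(units=1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```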
Why reduce to two independent variables? To visualize the results better.
There are two types of dimensionality reduction techniques:
1. Feature Selection: Backward Elimination, Forward Selection, Bidirectional Elimination, Score Comparison and more. We covered these techniques in Part 2 - Regression.
2. Feature Extraction:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Kernel PCA
- Quadratic Discriminant Analysis (QDA)
LDA (wiki): a linear combination of features that characterizes or separates two or more classes of objects or events. Udemy: from the n independent variables, LDA extracts p (< n) new independent variables that best separate the classes of the dependent variable. LDA is a supervised model (it uses the class labels), unlike PCA.
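A minimal scikit-learn sketch of both extraction techniques (assuming X_train, X_test, y_train exist and y_train has at least three classes, since LDA can extract at most n_classes - 1 components):

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# PCA: unsupervised, keeps the directions of maximum variance
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# LDA: supervised, uses the class labels to maximise class separation
lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train)
X_test_lda = lda.transform(X_test)
```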
Evaluate model performance and improve it.
Model selection techniques include:
k-Fold Cross Validation: split the training data into K folds and iterate over them (similar in spirit to bootstrapping), then report the mean and standard deviation of the model's accuracy.
Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
cv is the number of folds (K).
in R
library(caret)
library(e1071)  # for svm()
folds = createFolds(training_set$Purchased, k = 10)
cv = lapply(folds, function(x) {
training_fold = training_set[-x, ]
test_fold = training_set[x, ]
classifier = svm(formula = Purchased ~ .,
data = training_fold,
type = 'C-classification',
kernel = 'radial')
y_pred = predict(classifier, newdata = test_fold[-3])
cm = table(test_fold[, 3], y_pred)
accuracy = (cm[1,1] + cm[2,2]) / (cm[1,1] + cm[2,2] + cm[1,2] + cm[2,1])
return(accuracy)
})
accuracy = mean(as.numeric(cv))
Grid Search: improve model performance by finding the optimal values of the hyperparameters, and answer questions such as which model is best, linear or non-linear.
Identify the parameters you want to tune in your model fitting, such as 'C' and 'kernel', then use k-fold cross validation to estimate the accuracy of each combination.
# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
{'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator = classifier,
param_grid = parameters,
scoring = 'accuracy',
cv = 10,
n_jobs = -1)
# n_jobs = -1: use all CPU cores, useful when working on a large dataset
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
The course ends with a bonus section dedicated to one of the most powerful and increasingly popular machine learning models: XGBoost.
High performance on large datasets.
# Fitting XGBoost to the Training set
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)
- robot walking
- spoken and written word
- translation
- book categories, reviews
Python libraries:
- NumPy: data wrangling
- SciPy: data wrangling
- Pandas: data wrangling
- Matplotlib: visualization
- Seaborn: visualization
- Bokeh: visualization
- Plotly: visualization
- SciKit-Learn: machine learning
- Keras: machine learning
- TensorFlow: machine learning
- Scrapy: data scraping
- NLTK: NLP (natural language processing)
- Gensim: NLP
- Statsmodels: statistics
More:
- PyTorch: a deep learning framework; Tensors; Deep Neural Networks
R packages for machine learning (with download counts):
- e1071 Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier, etc. (142479 downloads)
- rpart Recursive Partitioning and Regression Trees. (135390)
- igraph A collection of network analysis tools. (122930)
- nnet Feed-forward Neural Networks and Multinomial Log-Linear Models. (108298)
- randomForest Breiman and Cutler's random forests for classification and regression. (105375)
- caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. (87151)
- kernlab Kernel-based Machine Learning Lab. (62064)
- glmnet Lasso and elastic-net regularized generalized linear models. (56948)
- ROCR Visualizing the performance of scoring classifiers. (51323)
- gbm Generalized Boosted Regression Models. (44760)
- party A Laboratory for Recursive Partitioning. (43290)
- arules Mining Association Rules and Frequent Itemsets. (39654)
- tree Classification and regression trees. (27882)
- klaR Classification and visualization. (27828)
- RWeka R/Weka interface. (26973)
- ipred Improved Predictors. (22358)
- lars Least Angle Regression, Lasso and Forward Stagewise. (19691)
- earth Multivariate Adaptive Regression Spline Models. (15901)
- CORElearn Classification, regression, feature evaluation and ordinal evaluation. (13856)
- mboost Model-Based Boosting. (13078)
buckler lab machine learning paper
TensorFlow https://www.tensorflow.org/
- TensorFlow
- TensorFlow official documentation (Chinese translation)
- TensorFlow resources in Chinese
- https://github.com/xinshuaiqi/TensorFlow-Course
- Keras documentation (Chinese)
- TensorFlow's Eager API
- TensorBoard basics
- TensorBoard advanced
Underfitting: there is still room for improvement on the test data. Possible reasons:
- the model is not powerful enough
- it is over-regularized
- it simply has not been trained long enough
How to avoid overfitting?
- more training data
- If that is not possible, use regularization: put constraints on the quantity and type of information your model can store, so it focuses on the most prominent patterns.
- weight regularization
- dropout
Always keep this in mind: deep learning models tend to be good at fitting to the training data, but the real challenge is generalization, not fitting.
L1 regularization, where the cost added is proportional to the absolute value of the weights coefficients (i.e. to what is called the "L1 norm" of the weights).
L2 regularization, where the cost added is proportional to the square of the value of the weights coefficients (i.e. to what is called the "L2 norm" of the weights). L2 regularization is also called weight decay in the context of neural networks. Don't let the different name confuse you: weight decay is mathematically the exact same as L2 regularization.
In Keras this is passed to a layer, e.g.:
keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001), activation=tf.nn.relu)
Dropout: randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training, e.g.:
model = keras.Sequential([
    keras.layers.Dense(16, activation=tf.nn.relu, input_shape=(NUM_WORDS,)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
Computational Network Toolkit (CNTK) is Microsoft's open-source deep learning toolkit.
- Official getting-started tutorial: https://github.com/Microsoft/CNTK/wiki/Tutorial (these notes mainly follow the examples there)
- Official paper (the CNTK Book, about 150 pages): https://research.microsoft.com/pubs/226641/CNTKBook-20160217..pdf (I use it as a dictionary and search it when I run into problems)
- [Keras docs in Chinese](https://keras-cn.readthedocs.io/en/latest/)
Further reading (in Chinese): "From TensorFlow to Theano: a side-by-side comparison of seven deep learning frameworks"; "Comparing ten deep learning frameworks: TensorFlow is the most popular but not necessarily the best"; [A ranking of 23 deep learning libraries](https://www.jiqizhixin.com/articles/2017-10-24-6).