Scikit-Learn

Scikit-learn is a general-purpose machine learning library that provides a wide range of supervised and unsupervised learning algorithms through a consistent interface. It features many machine learning algorithms, such as support vector machines and random forests, as well as many utilities for general pre- and post-processing of data. It is not a neural network framework.

It is built on the NumPy, SciPy, and Matplotlib libraries.

Scikit-learn and TensorFlow

TensorFlow is more of a low-level library. It can be thought of as the Lego bricks (similar to NumPy and SciPy) from which we can build machine learning algorithms, whereas scikit-learn comes with off-the-shelf algorithms, e.g., algorithms for classification such as SVMs, random forests, logistic regression, and many more. TensorFlow really shines when we want to implement deep learning algorithms, since it allows us to take advantage of GPUs for more efficient training.

https://sebastianraschka.com/faq/docs/tensorflow-vs-scikitlearn.html

Scikit-learn Videos

An example of using scikit-learn:

https://www.youtube.com/watch?v=rvVkVsG49uU

Setting up a deployment pipeline with scikit-learn:

https://www.youtube.com/watch?v=MaKLWy5zXe8

Linear Regression Example

https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

Examples

https://scikit-learn.org/stable/auto_examples/index.html

Pre-Processing

https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

StandardScaler

Scikit-learn provides several built-in machine learning models, called estimators, and standardization of datasets is a common requirement for many of them. If the individual features do not more or less look like standard normally distributed data (zero mean and unit variance), an estimator may not perform as expected. To address this, scikit-learn offers StandardScaler.
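For example, a minimal sketch with a made-up feature matrix:

```python
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: rows are samples, columns are features.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has zero mean and unit variance

print(X_scaled)
print(scaler.mean_)  # per-feature means learned from X
```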

Scaling to a range

Scikit-learn provides another option: scaling a feature to lie between a given minimum and maximum value using MinMaxScaler.
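A minimal sketch, again on made-up data:

```python
from sklearn.preprocessing import MinMaxScaler

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

# feature_range defaults to (0, 1); shown explicitly here
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(X))  # each column rescaled into [0, 1]
```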

Normalizer

Each sample with at least one non-zero component is rescaled independently of other samples so that its norm equals one. Note that, unlike the scalers above, Normalizer operates on rows (samples) rather than columns (features).
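A small illustration, with toy values chosen so the result is easy to verify by hand:

```python
from sklearn.preprocessing import Normalizer

X = [[3.0, 4.0], [1.0, 0.0]]

# norm='l2' (the default) rescales each row to unit Euclidean length
normalizer = Normalizer(norm='l2')
print(normalizer.fit_transform(X))  # [[0.6, 0.8], [1.0, 0.0]]
```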

Encoding categorical data

Often data is not in continuous numerical form; such data is said to be categorical. For instance, features like ['Good', 'Bad'] or ['Male', 'Female'] can be encoded as numeric values such as [1, 0].

To convert categorical data into numerical form, one can use one-hot (also called dummy) encoding. OneHotEncoder transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them being 1 and all others 0.
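A minimal sketch (note that the sparse_output argument requires scikit-learn 1.2+; older versions use sparse=False instead):

```python
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with two possible categories
X = [['Good'], ['Bad'], ['Good']]

encoder = OneHotEncoder(sparse_output=False)  # return a dense array for readability
print(encoder.fit_transform(X))  # one binary column per category
print(encoder.categories_)       # the categories discovered per feature
```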

Splitting up Training and Testing data

Once data pre-processing is done, we move on to splitting the data into training and testing sets for our model. With scikit-learn, the data is randomly split into training and testing sets according to the given size.
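For example, using train_test_split on toy data:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]  # toy features
y = list(range(10))           # toy targets

# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 8 2
```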

Classification

Phases

Classification has two phases: a learning phase and an evaluation phase. In the learning phase, the classifier trains its model on a given dataset, and in the evaluation phase, the classifier's performance is tested. Performance is evaluated on the basis of various metrics such as accuracy, error, precision, and recall.

https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn
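A minimal sketch of the two phases, using the Gaussian Naive Bayes classifier from the tutorial above on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)      # learning phase: train on the training set
y_pred = clf.predict(X_test)   # evaluation phase: predict on unseen data
print(accuracy_score(y_test, y_pred))
```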

SVM Classification

Identifying which category an object belongs to.

https://scikit-learn.org/stable/modules/svm.html#svm-classification

Regression

Predicting a continuous-valued attribute associated with an object.

https://scikit-learn.org/stable/modules/svm.html#svm-regression
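A minimal sketch using SVR on made-up data (the kernel and C values below are arbitrary choices for illustration):

```python
from sklearn.svm import SVR

# Toy 1-D regression problem: y is roughly 2 * x
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0.1, 2.1, 3.9, 6.2, 7.8]

reg = SVR(kernel='rbf', C=10.0)
reg.fit(X, y)
print(reg.predict([[2.5]]))  # prediction for an unseen input
```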

Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. It predicts the value of a dependent variable (y) based on a given independent variable (x); in other words, this regression technique finds a linear relationship between the input x and the output y.

https://towardsdatascience.com/how-does-linear-regression-actually-work-3297021970dd
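A minimal sketch on made-up data:

```python
from sklearn.linear_model import LinearRegression

# Toy data: y is approximately 2x + 1 with a little noise
X = [[1], [2], [3], [4]]
y = [3.1, 4.9, 7.2, 8.8]

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # close to [2.0] and 1.0
print(model.predict([[5]]))           # close to 11.0
```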

Logistic Regression

Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities.

https://towardsdatascience.com/an-introduction-to-logistic-regression-8136ad65da2e
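A minimal sketch on a made-up binary problem:

```python
from sklearn.linear_model import LogisticRegression

# Toy data: the label is 1 when the feature is large
X = [[0.5], [1.0], [2.0], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[1.5], [3.2]]))  # expected: [0 1]
print(clf.predict_proba([[3.2]]))   # estimated class probabilities
```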

SVM

Support vector machines (SVMs) are a set of supervised learning methods used for classification and regression. With this algorithm, we plot each data item as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best separates the two classes. Follow this blog for an in-depth understanding of SVMs.

https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989
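A minimal sketch with two well-separated, made-up blobs:

```python
from sklearn.svm import SVC

# Two linearly separable groups of 2-D points
X = [[0, 0], [1, 1], [1, 0], [8, 8], [9, 9], [8, 9]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel='linear')  # a linear kernel finds a separating hyperplane
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [8.5, 8.5]]))  # expected: [0 1]
print(clf.support_vectors_)                   # the points that define the margin
```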

Decision Trees (DTs)

Decision Trees (DTs) are supervised learning methods used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052
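A minimal sketch on the built-in iris dataset; export_text prints the learned decision rules (the feature names below are shorthand chosen here for readability):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A shallow tree so the printed rules stay readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

print(export_text(
    clf, feature_names=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid']))
```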

Random Forest

Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

https://towardsdatascience.com/random-forest-3a55c3aca46d
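A minimal sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; each prediction is the majority vote across the forest
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out test set
print(clf.feature_importances_)   # per-feature importance estimates
```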

Clustering

Cluster analysis of unlabeled data can be performed with the module sklearn.cluster.

Each clustering algorithm comes in two variants: a class that implements the fit method to learn the clusters on training data, and a function that, given training data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute.

https://scikit-learn.org/stable/modules/clustering.html

KMeans

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a wide range of application areas in many different fields.

https://scikit-learn.org/stable/modules/clustering.html#k-means
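A minimal sketch on two obvious, made-up groups of points, showing the class variant and its labels_ attribute:

```python
from sklearn.cluster import KMeans

X = [[1, 1], [1.5, 2], [1, 2], [8, 8], [8, 9], [9, 8]]

# n_clusters must be specified up front
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)
print(km.labels_)           # cluster label for each training sample
print(km.cluster_centers_)  # coordinates of the two centroids
```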

Affinity Propagation

AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.

https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation
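A minimal sketch on the same made-up points; note that no cluster count is given:

```python
from sklearn.cluster import AffinityPropagation

X = [[1, 1], [1.5, 2], [1, 2], [8, 8], [8, 9], [9, 8]]

# The number of clusters is inferred from the data
ap = AffinityPropagation(random_state=0)
ap.fit(X)
print(ap.labels_)                   # cluster assignment per sample
print(ap.cluster_centers_indices_)  # indices of the chosen exemplars
```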

Model Evaluation

Classification Model Evaluation

Accuracy Score

Accuracy is one of the simplest and most widely used metrics. It is defined as the number of correct predictions made by the model divided by the total number of predictions.

Precision Score

Precision is the ability of a classifier not to label a negative sample as positive. It is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives.

Recall Score

Recall is the ability of a classifier to find all positive/relevant samples within a dataset. It is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives.

F1 Score

There is often a trade-off between recall and precision, so to get an optimal blend we combine the two metrics using the F1 score. It can be interpreted as the harmonic mean of precision and recall, where the F1 score reaches its best value at 1 and its worst at 0.
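A small sketch computing the four metrics above on made-up predictions (chosen so that tp = 3, fp = 1, fn = 1, and all four metrics come out to 0.75):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Made-up labels: tp = 3, fp = 1, fn = 1, tn = 3
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # (tp + tn) / total = 0.75
print(precision_score(y_true, y_pred))  # tp / (tp + fp) = 0.75
print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean = 0.75
```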

Classification report

Builds a text report showing the main classification metrics, such as precision, recall, and F1 score, for each class.
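For example, on made-up labels:

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]

# Prints per-class precision, recall, F1 score, and support in one text table
print(classification_report(y_true, y_pred))
```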

Confusion Matrix

A confusion matrix is a table often used to evaluate the performance of a classification model by comparing predicted labels against true labels.
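A minimal sketch on made-up binary labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Rows are true classes, columns are predicted classes:
# [[tn, fp],
#  [fn, tp]]
print(confusion_matrix(y_true, y_pred))
```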

Regression Model Evaluation

Mean Absolute Error (MAE)

MAE measures the average magnitude of errors in a set of predictions. It is the average of the absolute differences between predicted and actual values. One disadvantage of MAE is that it does not punish large errors disproportionately.

Mean Squared Error (MSE)

MSE addresses this disadvantage of MAE: instead of taking the absolute value, we square the difference between predicted and actual values, so large errors are penalized more heavily. However, another issue with MSE is that the unit of the target variable (y) gets squared too.

Root Mean Squared Error (RMSE)

To fix the unit issue with MSE, we have RMSE, which simply takes the square root of MSE, returning the error to the same units as the target variable.

R2 Score

R² (the coefficient of determination) is a regression score function. The best possible score is 1; it can also be negative in the worst cases.
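A small sketch computing all four regression metrics above on made-up values (RMSE is taken here as the square root of MSE, which keeps the code portable across scikit-learn versions):

```python
from math import sqrt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # squaring penalizes large errors
rmse = sqrt(mse)                            # back in the target's original units
r2 = r2_score(y_true, y_pred)               # 1.0 would be a perfect fit

print(mae, mse, rmse, r2)
```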