# Machine Learning - dennisholee/notes GitHub Wiki
Src: https://cloud.google.com/ml-engine/docs/scikit/ml-solutions-overview
## Data Preprocessing

### Imputation of Missing Data
Src: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
- mean: the arithmetic average of the values
- median: the middle value of the sorted values
- mode: the most frequent value
Src: https://www.purplemath.com/modules/meanmode.htm
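These three statistics map directly onto scikit-learn's `SimpleImputer` strategies (`most_frequent` corresponds to the mode); a minimal sketch with a toy column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature column with one missing value (np.nan).
X = np.array([[1.0], [2.0], [2.0], [np.nan], [5.0]])

for strategy in ("mean", "median", "most_frequent"):
    # Replace np.nan with the chosen statistic of the non-missing values.
    filled = SimpleImputer(strategy=strategy).fit_transform(X)
    print(strategy, filled.ravel())
# mean fills 2.5, median fills 2.0, most_frequent fills 2.0
```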
## Feature Engineering

### Representing categorical data

#### OneHotEncoder vs LabelEncoder
Src: https://www.kaggle.com/c/home-credit-default-risk/discussion/59873
One-hot encoding makes sense when a variable's classes have no meaningful order (i.e. are not ordinal); e.g. Red > Blue > Green does not signify anything. Label encoding assigns a numerical value to each class, e.g. Good=0, Better=1, Best=2. So depending on the variable, one can choose the appropriate encoding.
Example
Country |
---|
France |
Germany |
Italy |
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
...
ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(sparse_output=False), [0])
    ],
    remainder="passthrough")
...
```
0 | 1 | 2 |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
Notes
- A LabelEncoder step is implicitly applied first: each class of the specified feature is converted to a numerical value. The OneHotEncoder then creates one extra column per category, indicating its presence or absence, thus eliminating the implied numerical ordering.
### Representing text

### Representing image

## Linear Regression

### Multiple Linear Regression

Src: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis#Example
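The linked example is not reproduced here, but a minimal multiple linear regression sketch with scikit-learn looks like the following (the data is synthetic, with two predictors and known coefficients, chosen purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x1 + 3*x2 + 1, no noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

# With more than one column in X, this is multiple linear regression.
model = LinearRegression().fit(X, y)
print(model.coef_)       # ≈ [2. 3.]
print(model.intercept_)  # ≈ 1.0
```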
## Miscellaneous
Questions | Remarks |
---|---|
LinearRegression vs LogisticRegression | LinearRegression fits a continuous target (regression); LogisticRegression models class probabilities (classification). |
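As a toy contrast (made-up single-feature data, not from any linked source): LinearRegression predicts a continuous value, while LogisticRegression predicts a class label.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10, dtype=float).reshape(-1, 1)

# LinearRegression: continuous output (here the data follows y = 2x + 1 exactly).
reg = LinearRegression().fit(X, 2 * X.ravel() + 1)
print(reg.predict([[10.0]]))  # ≈ [21.]

# LogisticRegression: class output (here class 0 for x < 5, class 1 otherwise).
clf = LogisticRegression().fit(X, (X.ravel() >= 5).astype(int))
print(clf.predict([[10.0]]))  # [1]
```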
## Classification
### Confusion Matrix
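A minimal sketch using scikit-learn's `confusion_matrix` on toy labels (rows are actual classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# cm[i][j] = number of examples of actual class i predicted as class j:
# [[TN, FP],
#  [FN, TP]] for binary labels ordered [0, 1].
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 3]]
```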
## Big Query with SciKit Learn
Import the BigQuery library to Anaconda https://anaconda.org/conda-forge/google-cloud-bigquery
```shell
conda install -c conda-forge google-cloud-bigquery
```
## Glossary
- Batch size - Number of training examples utilised in one iteration. The batch size can be one of three modes:
  - batch mode: the batch size equals the total dataset size, so one iteration covers a full epoch.
  - mini-batch mode:
    - The batch size is greater than one but less than the total dataset size, usually a number that divides the dataset size evenly.
    - Training is performed on a part of the overall examples at a time.
  - stochastic mode:
    - The batch size is equal to one, so the gradient and the neural network parameters are updated after each sample.
    - Training is performed on one randomly selected example at a time.
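The three modes above differ only in how the dataset is split up per epoch; a pure-NumPy sketch (an illustrative helper, not from any library):

```python
import numpy as np

def make_batches(n_examples, batch_size):
    """Yield index arrays of (up to) batch_size covering the dataset once (one epoch)."""
    indices = np.random.permutation(n_examples)  # shuffle each epoch
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]

n = 8
print(len(list(make_batches(n, n))))  # batch mode: 1 update per epoch
print(len(list(make_batches(n, 4))))  # mini-batch mode: 2 updates per epoch
print(len(list(make_batches(n, 1))))  # stochastic mode: 8 updates per epoch
```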
- Epoch - One complete pass through the entire training dataset.
- Evaluation - Measuring the quality of a trained model on data held out from training.
- Gradient descent - Iterative optimisation that repeatedly adjusts parameters in the direction of the negative gradient of the loss.
- Stochastic - Having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.
- Training - The process of fitting a model's parameters to the training dataset.
- Weights - The learnable parameters of a model, adjusted during training.