Machine Learning - dennisholee/notes GitHub Wiki

Src: https://cloud.google.com/ml-engine/docs/scikit/ml-solutions-overview

Data Preprocessing

Imputation of Missing Data

Src: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4

  • mean: regular meaning of "average"
  • median: middle value
  • mode: most often

Src: https://www.purplemath.com/modules/meanmode.htm
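These three statistics map directly onto scikit-learn's `SimpleImputer` strategies; a minimal sketch with made-up data (the single-column array is an assumption for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# toy column with one missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0], [2.0]])

# strategy may be "mean", "median", or "most_frequent" (mode)
X_mean = SimpleImputer(strategy="mean").fit_transform(X)      # NaN -> 2.25
X_median = SimpleImputer(strategy="median").fit_transform(X)  # NaN -> 2.0
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)  # NaN -> 2.0
```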

Feature Engineering

Representing categorical data

OneHotEncoder vs LabelEncoder

Src: https://www.kaggle.com/c/home-credit-default-risk/discussion/59873

One-hot encoding makes sense when a variable's classes have no meaningful order (i.e. the variable is not ordinal); for example, Red > Blue > Green signifies nothing. Label encoding assigns a numerical value to each class, e.g. Good=0, Better=1, Best=2, which is appropriate when the classes are ordered. So depending on the variable, one can choose the encoding.

Example

| Country |
| ------- |
| France  |
| Germany |
| Italy   |

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

...
ct = ColumnTransformer(
    transformers=[
        # on scikit-learn < 1.2, use sparse=False instead of sparse_output=False
        ("onehot", OneHotEncoder(sparse_output=False), [0])
    ],
    remainder="passthrough")
...
```

Transforming the column yields one indicator column per category:

```
0 1 2
1 0 0
0 1 0
0 0 1
```

Notes

  • Label encoding is implicitly applied: each class of the specified feature is first converted to a numerical index. The OneHotEncoder then creates one extra column per category indicating the presence (1) or absence (0) of that category, eliminating any implied ordering among the numerical values.
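For comparison, a minimal `LabelEncoder` sketch (made-up labels). Note one caveat: `LabelEncoder` assigns codes by sorted order of the class names, not by their ordinal rank, so Good/Better/Best do not come out as 0/1/2 without an explicit mapping:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Good", "Better", "Best", "Good"])
# classes_ are sorted alphabetically: Best=0, Better=1, Good=2
```

For true ordinal features, `OrdinalEncoder(categories=[["Good", "Better", "Best"]])` lets you specify the order explicitly.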

Representing text

Representing image

Linear Regression

Multiple Linear Regression

http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis#Example
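A multiple linear regression fits a model of the form y = b0 + b1·x1 + b2·x2; a minimal scikit-learn sketch with made-up, noise-free data (not the data from the linked example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data generated from y = 1 + 2*x1 + 3*x2
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]])
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

model = LinearRegression().fit(X, y)
# model.coef_ recovers [2, 3] and model.intercept_ recovers 1
```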

Miscellaneous

| Questions | Remarks |
| --------- | ------- |
| LinearRegression vs LogisticRegression | |

Classification

Confusion Matrix
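A confusion matrix tabulates actual classes against predicted classes; a minimal scikit-learn sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# rows = actual class, columns = predicted class (labels sorted: 0, 1)
cm = confusion_matrix(y_true, y_pred)
# here: 2 true negatives, 1 false positive, 1 false negative, 2 true positives
```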

Big Query with SciKit Learn

Install the BigQuery client library into Anaconda: https://anaconda.org/conda-forge/google-cloud-bigquery

conda install -c conda-forge google-cloud-bigquery

Glossary

  • Batch size - Number of training examples utilised in one iteration. The batch size can be one of three options:
    1. batch mode: where the batch size equals the total dataset size, so one iteration completes one epoch
    2. mini-batch mode:
      • Where the batch size is greater than one but less than the total dataset size; usually a number that divides evenly into the dataset size.
      • Training on a part of the overall examples.
    3. stochastic mode:
      • Where the batch size is equal to one. Therefore the gradient and the neural network parameters are updated after each sample.
      • Performing training on one randomly selected example at a time
  • Epoch
  • Evaluation
  • Gradient descent
  • Stochastic - Having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.
  • Training
  • Weights
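The three batch modes above differ only in how many examples are used per parameter update; a minimal sketch of the resulting update counts per epoch (assuming a made-up dataset of 100 examples):

```python
import math

n_examples = 100

def iterations_per_epoch(batch_size):
    # number of parameter updates in one full pass over the data
    return math.ceil(n_examples / batch_size)

batch = iterations_per_epoch(100)      # batch mode: 1 update per epoch
mini_batch = iterations_per_epoch(20)  # mini-batch mode: 5 updates per epoch
stochastic = iterations_per_epoch(1)   # stochastic mode: 100 updates per epoch
```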