# Machine Learning - dennisholee/notes GitHub Wiki
Src: https://cloud.google.com/ml-engine/docs/scikit/ml-solutions-overview
## Data Preprocessing

### Imputation of Missing Data
Src: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
- mean: the arithmetic average of the values
- median: the middle value of the sorted values
- mode: the most frequent value
Src: https://www.purplemath.com/modules/meanmode.htm
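These three statistics map directly onto scikit-learn's `SimpleImputer` strategies (`most_frequent` corresponds to the mode); a minimal sketch with a toy column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature column with one missing value (np.nan).
X = np.array([[1.0], [2.0], [2.0], [np.nan], [5.0]])

for strategy in ("mean", "median", "most_frequent"):
    # Replace np.nan with the chosen statistic of the non-missing values.
    filled = SimpleImputer(strategy=strategy).fit_transform(X)
    print(strategy, filled.ravel())
# mean fills 2.5, median fills 2.0, most_frequent fills 2.0
```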
## Feature Engineering

### Representing categorical data

#### OneHotEncoder vs LabelEncoder
Src: https://www.kaggle.com/c/home-credit-default-risk/discussion/59873
One-hot encoding makes sense when a variable's classes have no meaningful order (i.e. are not ordinal); e.g. Red > Blue > Green does not signify anything. Label encoding assigns a numerical value to each class, e.g. Good=0, Better=1, Best=2. So depending on the variable, one can choose the appropriate encoding.
Example
Country |
---|
France |
Germany |
Italy |
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
...
ct = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(sparse_output=False), [0])
    ],
    remainder="passthrough")
...
```
0 | 1 | 2 |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
Notes
- A LabelEncoder step is implicitly applied first: each class of the specified feature is converted to a numerical value. The OneHotEncoder then creates one extra column per category, indicating its presence or absence, thus eliminating the implied numerical ordering.
### Representing text

### Representing image

## Linear Regression

### Multiple Linear Regression

Src: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis#Example
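The linked example is not reproduced here, but a minimal multiple linear regression sketch with scikit-learn looks like the following (the data is synthetic, with two predictors and known coefficients, chosen purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x1 + 3*x2 + 1, no noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

# With more than one column in X, this is multiple linear regression.
model = LinearRegression().fit(X, y)
print(model.coef_)       # ≈ [2. 3.]
print(model.intercept_)  # ≈ 1.0
```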
## Miscellaneous
Questions | Remarks |
---|---|
LinearRegression vs LogisticRegression | LinearRegression fits a continuous target (regression); LogisticRegression models class probabilities (classification). |
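As a toy contrast (made-up single-feature data, not from any linked source): LinearRegression predicts a continuous value, while LogisticRegression predicts a class label.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10, dtype=float).reshape(-1, 1)

# LinearRegression: continuous output (here the data follows y = 2x + 1 exactly).
reg = LinearRegression().fit(X, 2 * X.ravel() + 1)
print(reg.predict([[10.0]]))  # ≈ [21.]

# LogisticRegression: class output (here class 0 for x < 5, class 1 otherwise).
clf = LogisticRegression().fit(X, (X.ravel() >= 5).astype(int))
print(clf.predict([[10.0]]))  # [1]
```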
## Classification
### Confusion Matrix
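A minimal sketch using scikit-learn's `confusion_matrix` on toy labels (rows are actual classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels, purely for illustration.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# cm[i][j] = number of examples of actual class i predicted as class j:
# [[TN, FP],
#  [FN, TP]] for binary labels ordered [0, 1].
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 3]]
```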
## Big Query with SciKit Learn
Import the BigQuery library to Anaconda https://anaconda.org/conda-forge/google-cloud-bigquery
```shell
conda install -c conda-forge google-cloud-bigquery
```
## Glossary
- Batch size - Number of training examples utilised in one iteration. The batch size can be one of three modes:
  - batch mode: the batch size equals the total dataset size, so one iteration covers a full epoch.
  - mini-batch mode:
    - The batch size is greater than one but less than the total dataset size, usually a number that divides the dataset size evenly.
    - Training is performed on a part of the overall examples at a time.
  - stochastic mode:
    - The batch size is equal to one, so the gradient and the neural network parameters are updated after each sample.
    - Training is performed on one randomly selected example at a time.
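The three modes above differ only in how the dataset is split up per epoch; a pure-NumPy sketch (an illustrative helper, not from any library):

```python
import numpy as np

def make_batches(n_examples, batch_size):
    """Yield index arrays of (up to) batch_size covering the dataset once (one epoch)."""
    indices = np.random.permutation(n_examples)  # shuffle each epoch
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]

n = 8
print(len(list(make_batches(n, n))))  # batch mode: 1 update per epoch
print(len(list(make_batches(n, 4))))  # mini-batch mode: 2 updates per epoch
print(len(list(make_batches(n, 1))))  # stochastic mode: 8 updates per epoch
```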
- Epoch - One complete pass through the entire training dataset.
- Evaluation - Measuring the quality of a trained model on data held out from training.
- Gradient descent - Iterative optimisation that repeatedly adjusts parameters in the direction of the negative gradient of the loss.
- Stochastic - Having a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.
- Training - The process of fitting a model's parameters to the training dataset.
- Weights - The learnable parameters of a model, adjusted during training.