Part 1: Data Preprocessing

Data preprocessing is the initial and crucial step before applying any model, as real-world data tends to be incomplete, inconsistent, or inaccurate. It therefore requires cleaning, formatting, and organizing before it is ready to be used to train ML (Machine Learning) models. The data can come in various forms such as free text, images, video, or structured tables; since machines only understand data in binary terms (0s and 1s), it gets transformed and encoded into a state that the machine can easily parse.

What are Features in ML?

A dataset is a collection of data objects (e.g. records, points, events, cases, observations, vectors), each described by features that capture the basic characteristics of the object (e.g. the time at which an event occurred, or the recorded weight of a student). Some examples of features are color, voltage, grades, age, etc.

(Source: Pranjal Pandey, 2019)

(Source: graphpad.com, 2019)

Features can be:

  • Categorical (Qualitative): Features whose values come from a set of categories.

    • Nominal: Variables without any order (e.g. color of ball - blue/green/red).
    • Ordinal: Variables with implied natural order but no definite scale (e.g. shirt size - XXS/XS/S/M/L).
  • Numerical (Quantitative): Features that are continuous or integer-valued.

    • Interval: Variables where the intervals between each value are equally spaced (e.g. temperature in Fahrenheit or Celsius, pH, SAT score (200-800), or credit score (300-850)).
    • Ratio: Interval variables with a natural zero point (e.g. enzyme activity, reaction rate, flow rate, concentration, pulse, weight, length, temperature in Kelvin (where 0.0 means "no heat")).

Feature encoding

Categorical (qualitative) features need to be converted into numerical form before most models can use them. Two common approaches are Label encoding and One-Hot encoding.

Label encoding

(Source: Dinesh Yadav, 2019)

This approach is very simple and it involves converting each value in a column to a number. Although the label encoding is straightforward, the numeric values can be misinterpreted by algorithms as having some sort of hierarchy or order.

One-Hot encoding

(Source: Dinesh Yadav, 2019)

This approach attempts to resolve the Label encoding issue: each category value is converted into a new column and assigned a 1 or 0 in that column. However, unlike Label encoding, it has the drawback of causing the number of columns to grow with the number of categories.
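
As a rough sketch of both encodings (using pandas and scikit-learn; the color column is a made-up example):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy dataset with one categorical feature.
df = pd.DataFrame({"color": ["blue", "green", "red", "green"]})

# Label encoding: each category becomes an arbitrary integer (implies an order).
df["color_label"] = LabelEncoder().fit_transform(df["color"])

# One-Hot encoding: one new 0/1 column per category (no implied order).
one_hot = pd.get_dummies(df["color"], prefix="color")
df = pd.concat([df, one_hot], axis=1)

print(df)
```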

Feature sampling

Feature sampling means analyzing a subset of the dataset instead of the whole, as working with the complete dataset can be too expensive given memory and time constraints. Using a sampling algorithm helps reduce the size of the dataset so that better but more expensive ML algorithms can be used. The key principle is that sampling is done in such a way that the sample has approximately the same properties as the original dataset, which makes it a good representative.

Steps involved in Sampling

Points to note in sampling:

  • Population is a collection of elements about which we wish to make an inference (e.g. students at AAA University, state population).
  • Element (or observation unit) is an object on which a measurement is taken (e.g. student, resident).
  • Units are non-overlapping collections of elements from the population that cover the entire population (e.g. an individual student, a major which is a collection of students, or a household containing residents).
  • Frame is a list of the sampling units (e.g. if a student is the unit, the registry/records serve as the frame; if a major is the unit, the division of academic affairs serves as the frame; or if a household is the unit, a city serves as the frame).
  • Sample is a collection of sampling units selected from a single frame or from multiple frames.

Read more here and here (ppt).

(Source: Ronak Gangwal, 2019)

  1. Step 1: Clearly define the target population (e.g. only people aged 18 and above are able to vote, or only people with a terminal illness).
  2. Step 2: Specify the sampling frame and units (e.g. the list of people whose names appear on the voter list of the constituency, or the list of diseases to consider). The frame is the list of items from which the sample is taken, and the choice of unit decides whether, for example, an individual or a household is selected. Take care to prevent errors such as over-representation (e.g. sampling residents across states may over-represent residents who move between states - each unit should appear in a frame only once).
  3. Step 3: Selection of the sampling technique.
  4. Step 4: Determination of the sample size. This is the number of units in the sample, large enough to make inferences about the population with the desired level of accuracy and precision.
  5. Step 5: Start selecting the sample.

You may read a more detailed guide here.

Sampling techniques

(Source: Ronak Gangwal, 2019)

(Source: Ronak Gangwal, 2019 - Simple Random Sampling)

(Source: Ronak Gangwal, 2019 - Stratified Sampling)

  • Probability Sampling: Sampling in which every member of the population has a calculable and non-zero probability of being included in the sample. Probability sampling methods are generally preferred because every unit of the population has a known chance of being selected (e.g. every vote has equal value and can be included irrespective of socio-economic status, religion, or race).

    • Simple Random Sampling: Every unit is chosen entirely by chance and has an equal chance of being selected.
    • Systematic Sampling: The first unit is selected randomly and others are selected using a fixed "sampling interval" (e.g. <population size>/<sample size> intervals depending on what is set).
    • Stratified Sampling: Divide the population into subgroups (aka. strata) based on traits (e.g. gender) and select the units from these subgroups.
    • Cluster Sampling: Divide the population into clusters and randomly select entire clusters.
  • Non-Probability Sampling: Sampling in which the selection of units is based on factors other than random chance (aka. deliberate and purposive sampling)

    • Convenience Sampling: Units are selected based on their availability (e.g. if an individual is willing to take part).
    • Quota Sampling: Select units based on predetermined quotas for characteristics of the population (e.g. a quota for each race).
    • Judgment Sampling: Selection of units based on the judgment of the experts.
    • Snowball Sampling: Selection of units based on individual selection passed on like a baton to the next (e.g. individual A is selected and then A nominates B to be selected and so forth).

You may read more details here.
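
As a minimal sketch of two of the probability-based techniques above (assuming a pandas DataFrame with a hypothetical gender column as the stratum):

```python
import pandas as pd

# Hypothetical population of 1,000 people.
df = pd.DataFrame({
    "gender": ["M", "F"] * 500,
    "age": range(1000),
})

# Simple random sampling: every row has an equal chance of selection.
simple_random = df.sample(n=100, random_state=42)

# Stratified sampling: sample 10% from each subgroup (stratum) separately.
stratified = df.groupby("gender").sample(frac=0.1, random_state=42)

print(len(simple_random), len(stratified))
```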

Dimensionality reduction

(Source: turingfinance.com, n.d.)

Because most real-world datasets have a large number of features (e.g. few patients but many genes, which makes it harder to identify the relevant patterns), dimensionality reduction aims to reduce the number of features, for example through Feature selection (or subset selection). The higher the dimension (from 1D to 3D and beyond), the more difficult it becomes to visualize the data given the number of geometric planes the dataset lies in.

This phenomenon is known as The Curse of Dimensionality, where data analysis tasks become significantly harder as the dimensionality of the data increases (remember each dimension represents a feature). It also adds sparsity to the data (a matrix in which most of the elements, i.e. feature values, are zero), making the model more difficult to visualize.

(Source: Judy T Raj, 2019)

This could also result in overfitting, where the model corresponds too closely or exactly to a particular set of data and fails to fit or predict future observations reliably on real data. Therefore, avoiding overfitting is a major motivation for dimensionality reduction: the fewer features the training data has, the fewer assumptions the model makes.

Advantages:

  1. It improves model accuracy because it reduces misleading data.
  2. By reducing dimensions, it requires less computation (or processing time), memory, and data, which makes algorithms train faster.
  3. It also allows algorithms that are unsuitable for a large number of dimensions to be applied.
  4. Less data also means less storage space is required, which makes it more cost-efficient.
  5. It removes redundant features and, with them, misleading data.

Essentially, it maps the dataset to a lower-dimensional space (e.g. 2D); by reducing the dimensionality, it creates new features which are combinations of the old ones. One popular technique for this is Principal Component Analysis (PCA).
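
A minimal PCA sketch with scikit-learn (the choice of 2 components here is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a small 4-feature dataset and scale it first (PCA is scale-sensitive).
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features down to 2 new combined features.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component
```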

Fixing data quality issues

When working on an ML problem, more than half of the time can be spent dealing with data quality issues as it is unrealistic to expect perfect data. These can be caused by human error, measuring device limitations - basically flaws in the data collection process.

Some of these data issues are:

  1. Missing values: Scenarios such as data validation dropping some values before they are written.

    • You may consider eliminating rows with missing data if it only pertains to a few rows with lots of missing values, since those rows are essentially noise. However, this fails if many rows have missing values, which instead calls for re-evaluating the feature (we will cover that later).
    • Another strategy is to use interpolation methods to fill in these missing values (e.g. filling in the mean/mode/median of feature).
  2. Inconsistent values: Scenarios such as drug "Name" field containing both "Name" and "Dosage" values, or profile "Phone" value in "Address" field.

    • At this stage it's ideal to perform a data assessment to understand the data type of the features and find relationships.
  3. Duplicate values: Scenarios where a user submits a form more than once. Removal is essential as duplicates may give particular data objects an advantage or bias during ML training.

    • At this stage, algorithms can be applied to analyze and remove duplicates.

Read here for a detailed tutorial on cleaning up data.
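
A rough sketch of how these fixes might look with pandas (the column names are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical records with missing values and a duplicate row.
df = pd.DataFrame({
    "name": ["Drug A", "Drug B", "Drug B", "Drug C"],
    "dosage_mg": [50, np.nan, np.nan, 20],
})

# Missing values: here we fill with the feature mean
# (dropping mostly-empty rows is the other option).
df["dosage_mg"] = df["dosage_mg"].fillna(df["dosage_mg"].mean())

# Duplicate values: remove repeated rows so no record is given extra weight.
df = df.drop_duplicates()

print(df)
```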

Feature scaling

To understand why we need feature scaling, we would first go through the concepts of Cost function and Gradient descent.

Cost function

In our models, we want to create an estimator function $h_\theta(x)$ whose values $\hat{y} = h_\theta(x)$ are close to the actual values $y$ for most data points, to create an accurate prediction/estimation.

However, we need to find the right approximation for this function so that it produces the least errors or loss, and to do so, we need a cost function.

A cost function (also known as loss function) is a measure of how wrong the model is in terms of its ability to estimate the relationship between $X$ and $y$. It's expressed as the difference (or distance) between the predicted and actual value, which is known as the loss or error.

Mathematical explanation

(Source: Shivam Bhardwaj, 2020 - shows the least square error for point $(x_i, y_i)$)

The Least squares is a statistical procedure to find the best fit for a set of data points by minimizing the sum of the offsets or residuals of the points from the plotted curve. This is useful in data fitting, as the best fit in the least-squares sense minimizes the sum of squared residuals (the differences between the actual values $y_i$ and the predicted values $\hat{y}_i$ provided by the model). Here $m$ represents the number of data points.

Building upon this, it gives the following formula,

$$SSE = \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

Then we use the Mean-Squared error to measure the quality of the estimator function and derive the average of all squared errors (how far off the predictions are for each point on average).

This introduces the averaging factor $\frac{1}{m}$, which averages the value of the sum,

$$MSE = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

Then in our final function, the cost function becomes,

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x_i) - y_i\big)^2$$

The $\frac{1}{2}$ is an extra factor added so that the 2 produced by differentiation gets cancelled during the gradient descent formulation. This factor does not change where the minimum lies, so it can be included or left out without affecting the result.
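
A small numpy sketch of this cost function for a linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ (the sample numbers are made up):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error cost J(theta0, theta1) with the 1/(2m) factor."""
    m = len(y)
    predictions = theta0 + theta1 * x   # h_theta(x) for every point
    errors = predictions - y            # residuals
    return np.sum(errors ** 2) / (2 * m)

# Hypothetical data roughly following y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

print(cost(1.0, 2.0, x, y))   # small cost: close to the true relationship
print(cost(0.0, 0.0, x, y))   # large cost: predicts 0 everywhere
```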

Optimising the constants to get minimum errors on average

(Source: Lachlan Miller, 2018 - This shows a 1-parameter cost function $J(\theta_1)$ whose goal is ultimately to find the minimum error with $\theta_1$ optimized)

Now knowing the cost function, for our ML models we want to obtain $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$ by finding the values of the constants $\theta_0$ and $\theta_1$ that give us the minimum error for $J(\theta_0, \theta_1)$. This is done using the gradient descent algorithm.

In the above example, 1 parameter $\theta_1$ is optimized. However, in our case there are 2 parameters, $\theta_0$ and $\theta_1$, which would look something more like this:

(Source: Richter Brzeski, 2016)

(Source: Andrew Ng, n.d.)

To interpret the contour lines, points on a line of the same color belong to the same value of $J(\theta_0, \theta_1)$ regardless of the $\theta_0$ and $\theta_1$ at each point on the line.

Gradient descent algorithm

(Source: Richter Brzeski, 2016)

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. Its idea is to take repeated steps in the opposite direction of the gradient of the function at the current point, because this is the direction of steepest descent.

The point to note is that depending on the point where the algorithm starts, it can converge to different constants $\theta_0$ and $\theta_1$ and hence a different local minimum of $J$. Hence several runs from different starting values are done to find the lowest minimum.

Mathematical explanation

The algorithm repeats its iterations until it converges to the minimum. To describe the algorithm in equations,

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{(repeat for } j = 0, 1 \text{ until convergence)}$$

This can also be described as the equation below,

$$\theta_j^{(k+1)} = \theta_j^{(k)} - \alpha \frac{\partial}{\partial \theta_j} J\big(\theta_0^{(k)}, \theta_1^{(k)}\big)$$

Where the cost decreases monotonically over the iterations $k = 0, 1, \dots, K$ until $K$ (the total number of iterations until the minimum) is reached,

$$J\big(\theta^{(0)}\big) \geq J\big(\theta^{(1)}\big) \geq \dots \geq J\big(\theta^{(K)}\big)$$

Note that the gradient becomes zero upon reaching the minimum, and the iterations then stop. To understand the formula,

  • $\alpha$ is the learning step (aka. how long the step is and how fast it should descend).
  • $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ is the gradient (or rate of change) of the function.
  • The intuition behind subtracting the gradient is that the descent moves against the gradient (the direction in which the function increases).

Expanding the formula, it results in this (this is where the extra $\frac{1}{2}$ factor we mentioned in the cost function comes into play),

$$\frac{\partial}{\partial \theta_0} J = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x_i) - y_i\big), \qquad \frac{\partial}{\partial \theta_1} J = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x_i) - y_i\big)\, x_i$$

To understand how these derivatives are obtained, this is based on differentiation using the Chain rule.
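
A minimal gradient descent sketch for the two-parameter linear case above (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=2000):
    """Iteratively update theta0, theta1 against the gradient of J."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y   # h_theta(x_i) - y_i
        grad0 = np.sum(errors) / m           # dJ/dtheta0
        grad1 = np.sum(errors * x) / m       # dJ/dtheta1
        theta0 -= alpha * grad0              # step against the gradient
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(gradient_descent(x, y))   # should approach roughly (1, 2)
```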

Points to note on choosing step size

(Source: Sara Iris Garcia, 2018)

Choosing the optimal step size matters: a larger step size may progress quickly but can result in divergence instead of convergence to the minimum, while a smaller step size may converge more accurately to the minimum but progresses too slowly.

Why do we need Feature scaling?

(Source: Quinn Yan, 2018)

Algorithms such as distance-based algorithms (e.g. K-means) and gradient-descent algorithms (e.g. linear regression) may get impacted without feature scaling.

(Source: Baijayanta Roy, 2020 - Example of gradient descent without feature scaling)

For example, a feature with huge values of $X$ in the formula will affect the step size of the gradient descent algorithm, and differences in the ranges of features will cause different step sizes for each feature. For the gradient descent to move smoothly towards the minima and at the same rate for all features, scaling the data to a common scale helps.

For distance-based algorithms, if different features have different scales, there might be a chance that a higher weightage is given to features with a higher magnitude which will impact the performance of the ML algorithm and make it biased toward a feature.

Some algorithms may also assume that the data is centered at 0. For example, if we initialize the weights of a small multi-layer perceptron with tanh activation units to small values centered around 0, we also want the inputs centered around 0 so that the weights are updated "equally".

There are 2 types of common feature scaling,

  • Normalization: Good to use when data does not follow a Gaussian (Normal) distribution, which is useful in algorithms that do not assume any distribution of data (e.g. NN, KNN).

  • Standardization: Good to use when data follows a Gaussian distribution (not always true). It does not have a bounding range hence outliers will not be affected by standardization.

Ultimately there's no hard rule on which to choose, and it's always ideal to compare the performance of each to get the best results.

Normalization

This is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1 (known as min-max scaling):

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

Why would this be within the range of 0-1?

  1. $X_{max} - X_{min}$ is certainly the largest distance, so the denominator is at least as large as the numerator.
  2. $X_{min} \leq X \leq X_{max}$ for all $X$, so the numerator $X - X_{min}$ is non-negative.
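
A minimal min-max scaling sketch with scikit-learn (the numbers are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with an arbitrary range of values.
X = np.array([[10.0], [20.0], [35.0], [50.0]])

# Rescale so the minimum maps to 0 and the maximum maps to 1.
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm.ravel())   # [0.    0.25  0.625 1.   ]
```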

Standardization

This is a scaling technique in which values are centered around the mean with a unit standard deviation (the values are basically z-scores). This results in a distribution with a mean of 0 and a standard deviation of 1:

$$X' = \frac{X - \mu}{\sigma}$$

The intuition can be inferred from the normal distribution itself and how the z-score is determined.

(Source: Donna Roberts, n.d. - Example of the normal distribution where the $x$-axis is the z-score and the $y$-axis is the probability density)
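
A matching standardization sketch with the same arbitrary numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [35.0], [50.0]])

# Subtract the mean and divide by the standard deviation (z-scores).
X_std = StandardScaler().fit_transform(X)
print(X_std.ravel())            # mean ~0, standard deviation ~1
print(X_std.mean(), X_std.std())
```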

For more other types of feature scalers, read here.

Training, Validation, and Test datasets

Points to note:

(Source: baeldung.com, 2021)

  • Hyperparameter: These are the specific parameters of an ML algorithm chosen before it runs on data, and are model specific (e.g. epochs for a deep learning model, or the number of branches in a decision tree model)
  • Epoch: It indicates the number of passes over the entire training dataset the ML algorithm has completed. As datasets can be huge, they may be split into batches; an epoch represents all the batches being passed forward and backward once. Generally, the more epochs that are run, the more the weights are refined.
  • Iterations: The number of batch runs in an epoch (i.e. dataset size divided by batch size).
  • Batch: A slice/chunk that the entire dataset is split into; the batch size is the number of samples per batch.
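
A quick sketch of how these terms relate (the sizes are made up):

```python
# Hypothetical sizes to show the relationship between the terms.
dataset_size = 10_000   # total training samples
batch_size = 100        # samples per batch
epochs = 5              # full passes over the dataset

iterations_per_epoch = dataset_size // batch_size   # batches per epoch
total_iterations = iterations_per_epoch * epochs

print(iterations_per_epoch)   # 100 iterations per epoch
print(total_iterations)       # 500 weight updates in total
```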

(Source: Sajid Lhessani, 2019)

Depending on the project, the total dataset may be split into 3 types of datasets (or 2). Common splits for the training and test datasets range from 70%-30% to 50%-50% (see the split sketch after the list below).

There are 3 types of datasets:

  • Training: This is the dataset that would be used to learn features about the data and to fit a model (e.g. in the case of NN (Neural Networks), it adjusts the weights).

  • Validation: This is separate from the training dataset. Having this dataset is essential (it can be split from the training dataset) as it helps avoid overfitting (or underfitting) when any classification parameter needs to be adjusted. Adjustment of hyperparameters is done at this stage (e.g. number of epochs/branches). This is also where the choice of model to use on the test data is made. For example, given (1) a 5-layer CNN model - 80% accuracy, (2) a 5-layer transformer - 85% accuracy, and (3) a 10-layer CNN - 83% accuracy, model (2) would be chosen at this stage after each has been evaluated against the validation dataset.

  • Test: This is the dataset to test how accurate the model has been after running through the training dataset (and validation dataset). If the performance does not match the expectations (fails validations), the entire process has to be started from scratch.
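
A minimal sketch of splitting a dataset into the three sets with scikit-learn (the 60/20/20 ratio is an arbitrary example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out 20% as the test set...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ...then split 25% of the remainder off as the validation set (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 90, 30, 30
```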

Bias-variance tradeoff

Bias is the difference between the average prediction of the model and the correct value we are trying to predict (e.g. a model with high bias may underfit as it pays too little attention to the training data). This creates a poor fit on both the training and test datasets.

Variance is the variability of the prediction for a given data point, which tells us the spread of the predictions (e.g. a model with high variance may overfit as it pays too much attention to the training data). This creates a good fit on the training dataset but a poor fit on the test dataset.

Mathematical derivation

Assume we have a true function $f$ that captures the relationship between $X$ and $y$,

$$y = f(X) + \epsilon$$

Where $\epsilon$ represents the noise: variation in the (roughly linear) data that is irrelevant information for fitting a smooth line or curve. Its properties are $E[\epsilon] = 0$ and $Var(\epsilon) = \sigma_\epsilon^2$.

Let $\hat{f}(X)$ represent the function we estimate, modelled to be as close to $f(X)$ as possible.

Recall the following formulas for Variance,

$$Var(Z) = E\big[(Z - E[Z])^2\big] = E[Z^2] - (E[Z])^2$$

Which could be rewritten (for the first formula) as,

$$E[Z^2] = Var(Z) + (E[Z])^2$$

Then apply the above with $Z = y$ and $Z = \hat{f}(X)$ (using the fact that the noise $\epsilon$ is independent of $\hat{f}$, so $E[y\,\hat{f}(X)] = E[y]\,E[\hat{f}(X)]$),

$$E\big[(y - \hat{f}(X))^2\big] = E[y^2] + E\big[\hat{f}(X)^2\big] - 2\,E[y]\,E\big[\hat{f}(X)\big] = Var(y) + (E[y])^2 + Var\big(\hat{f}(X)\big) + \big(E[\hat{f}(X)]\big)^2 - 2\,f(X)\,E\big[\hat{f}(X)\big]$$

Note that $E[y] = f(X)$ because $f$ is deterministic (it takes its value with probability 1) and $E[\epsilon] = 0$, and therefore $Var(y) = \sigma_\epsilon^2$.

Resulting in,

$$E\big[(y - \hat{f}(X))^2\big] = \underbrace{\big(f(X) - E[\hat{f}(X)]\big)^2}_{\mathrm{Bias}^2} + \underbrace{Var\big(\hat{f}(X)\big)}_{\mathrm{Variance}} + \underbrace{\sigma_\epsilon^2}_{\mathrm{Irreducible\ error}}$$

The mean (or expectation) of the error over all data points becomes the expected squared error $E\big[(y - \hat{f}(X))^2\big]$. Some intuition:

  1. Notice that if you increase the bias (thus increasing the distance between the mean of $\hat{f}(X)$ and $f(X)$), the further the fitted line would generally be from most data points. The further the distance, the more likely underfitting becomes.
  2. A higher $Var\big(\hat{f}(X)\big)$ also spreads the predictions away from the line, creating overfitting.
  3. Ultimately, the smaller $E\big[(y - \hat{f}(X))^2\big]$ is for most data points, the more optimal the fit.
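
A rough simulation sketch of this decomposition (the true function, noise level, and polynomial degrees are made-up choices): fit a simple and a complex model on many resampled training sets and estimate bias² and variance at a fixed test point.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)   # assumed "true" relationship f(X)
sigma = 0.3               # noise standard deviation
x_test = 1.5              # fixed point at which we measure bias/variance

def fit_predict(degree):
    """Fit a polynomial of the given degree on a fresh noisy sample, predict at x_test."""
    x = rng.uniform(0, 3, 20)
    y = f(x) + rng.normal(0, sigma, 20)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_test)

for degree in (1, 9):     # simple vs. complex model
    preds = np.array([fit_predict(degree) for _ in range(500)])
    bias_sq = (preds.mean() - f(x_test)) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```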

Understanding the tradeoff

(Source: Seema Singh, 2018)

Using the bulls-eye as an analogy, the center of the target is the model that would perfectly predict the correct values. The further away the values are from the bulls-eye, the worse the predictions.

In supervised learning, underfitting occurs when the model is unable to capture the underlying pattern of the data; such models tend to have high bias and low variance. Overfitting occurs when the model captures the noise along with the underlying pattern in the data; such models (e.g. decision trees) tend to have low bias and high variance.

(Source: Seema Singh, 2018)

Therefore, a good balance without overfitting and underfitting the data is required, a trade-off between bias and variance, where an algorithm can’t be more complex and less complex at the same time. An optimal balance of its bias and variance would never overfit or underfit the model.