# ML2 ‐ Equations
## 1. Entropy 📉

- **What?**: Measures uncertainty or randomness in a dataset.
- **Formula**:

  $$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$

  - $p_i$: Proportion of class $i$ in the dataset.
  - $c$: Number of classes.
- **Steps to Solve**:
  1. Calculate the proportion of each class in the dataset ($p_i$).
  2. Compute $\log_2(p_i)$ for each class (a calculator helps here).
  3. Multiply each $p_i$ by its $\log_2(p_i)$.
  4. Sum the results for all classes.
  5. Multiply the sum by $-1$ to get the entropy.
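To check hand calculations, here is a minimal NumPy sketch of the entropy formula (the `entropy` helper name is ours, not a library function):

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)), in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # proportion p_i of each class
    return -np.sum(p * np.log2(p))   # classes with p_i = 0 never appear here

# Example: 9 positive and 5 negative examples
print(entropy(["yes"] * 9 + ["no"] * 5))  # ≈ 0.940
```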
## 2. Information Gain (IG) 📈

- **What?**: Measures the reduction in entropy after splitting the dataset on an attribute.
- **Formula**:

  $$IG(S, A) = Entropy(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$$

  - $S$: Dataset.
  - $A$: Attribute.
  - $S_v$: Subset of the data where attribute $A$ has value $v$.
- **Steps to Solve**:
  1. Calculate the entropy of the entire dataset ($Entropy(S)$).
  2. Split the dataset based on attribute $A$.
  3. Calculate the entropy of each subset ($Entropy(S_v)$).
  4. Compute the weighted sum of the subset entropies.
  5. Subtract the weighted sum from the original entropy.
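Building on the `entropy` sketch above, a hypothetical `information_gain` helper that follows these steps directly:

```python
import numpy as np

def information_gain(labels, attr_values):
    """IG(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v))."""
    labels = np.asarray(labels)
    attr_values = np.asarray(attr_values)
    weighted = 0.0
    for v in np.unique(attr_values):
        subset = labels[attr_values == v]                      # S_v
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted   # entropy() from the sketch above
```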
## 3. Gini Index 🎯

- **What?**: Measures impurity in a dataset.
- **Formula**:

  $$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2$$

  - $p_i$: Proportion of class $i$ in the dataset.
  - $c$: Number of classes.
- **Steps to Solve**:
  1. Calculate the proportion of each class ($p_i$).
  2. Square each proportion.
  3. Sum the squared proportions.
  4. Subtract the sum from 1.
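A matching NumPy sketch (again, `gini` is our own helper name):

```python
import numpy as np

def gini(labels):
    """Gini(S) = 1 - sum(p_i^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()     # proportion p_i of each class
    return 1.0 - np.sum(p ** 2)

# Same 9/5 split as the entropy example
print(gini(["yes"] * 9 + ["no"] * 5))  # ≈ 0.459
```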
## 4. Weighted Gini Index ⚖️

- **What?**: Measures impurity after splitting the dataset on an attribute.
- **Formula**:

  $$Gini_{split}(S, A) = \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\, Gini(S_v)$$

  - $S$: Dataset.
  - $A$: Attribute.
  - $S_v$: Subset of the data where attribute $A$ has value $v$.
- **Steps to Solve**:
  1. Split the dataset based on attribute $A$.
  2. Calculate the Gini Index of each subset ($Gini(S_v)$).
  3. Compute the weighted sum of the subset Gini Indices.
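A sketch of the split version, reusing the `gini` helper from above (the `gini_split` name is hypothetical):

```python
import numpy as np

def gini_split(labels, attr_values):
    """Gini_split(S, A) = sum(|S_v|/|S| * Gini(S_v))."""
    labels = np.asarray(labels)
    attr_values = np.asarray(attr_values)
    weighted = 0.0
    for v in np.unique(attr_values):
        subset = labels[attr_values == v]                      # S_v
        weighted += len(subset) / len(labels) * gini(subset)   # gini() from above
    return weighted
```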
## 5. Principal Component Analysis (PCA) 📊

- **What?**: Reduces dimensionality by transforming the data into uncorrelated components.
- **Steps to Solve**:
  1. **Standardize Data**:
     $$Z = \frac{X - \mu}{\sigma}$$
  2. **Compute Covariance Matrix**:
     $$\text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
  3. **Find Eigenvalues & Eigenvectors**: Solve
     $$\det(\text{Covariance Matrix} - \lambda I) = 0$$
  4. **Select Top PCs**: Keep the eigenvectors with the highest eigenvalues and project the data onto them.
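A minimal sketch mirroring these four steps, assuming `X` is an `(n_samples, n_features)` NumPy array (the `pca` name is ours; in practice scikit-learn's `PCA`, which uses SVD, is the usual tool):

```python
import numpy as np

def pca(X, n_components=2):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # 1. standardize
    cov = np.cov(Z, rowvar=False)              # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]          # sort by eigenvalue, descending
    top = eigvecs[:, order[:n_components]]     # 4. keep the top PCs
    return Z @ top                             # project the data onto them
```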
## 6. K-Means Clustering 🔢

- **What?**: A partitional clustering algorithm that groups data into $k$ clusters.
- **Steps to Solve**:
  1. Randomly initialize $k$ centroids.
  2. Assign each point to the nearest centroid.
  3. Recalculate each centroid as the mean of its assigned points.
  4. Repeat steps 2–3 until convergence (no change in centroids).
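The algorithm laid bare as a NumPy sketch (our own `kmeans` helper; it skips edge cases like empty clusters, which scikit-learn's `KMeans` handles for you):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. random init
    for _ in range(n_iters):
        # 2. assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recalculate centroids as the mean of assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):              # 4. converged
            break
        centroids = new_centroids
    return labels, centroids
```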
## 7. Gaussian Mixture Model (GMM) 🎲

- **What?**: A probabilistic model that represents the data as a mixture of Gaussians.
- **Formula**:

  $$p(x) = \sum_{i=1}^{k} \pi_i\, \mathcal{N}(x \mid \mu_i, \Sigma_i)$$

  - $\pi_i$: Mixing weight of component $i$.
  - $\mu_i$: Mean of component $i$.
  - $\Sigma_i$: Covariance of component $i$.
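The mixture density is just a weighted sum of Gaussian pdfs; a small sketch using SciPy (the `gmm_density` name is ours):

```python
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """p(x) = sum(pi_i * N(x | mu_i, Sigma_i))."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

# Two 1-D components: weights 0.6/0.4, means 0/5, unit variances
print(gmm_density(1.0, [0.6, 0.4], [0.0, 5.0], [1.0, 1.0]))  # ≈ 0.145
```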
## 8. Expectation-Maximization (EM) Algorithm 🔄

- **What?**: An iterative algorithm for estimating GMM parameters.
- **Steps to Solve**:
  1. **E-Step (Expectation)**: Compute the responsibilities:
     $$\tau(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{k} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
  2. **M-Step (Maximization)**: Update the parameters, with $N_k = \sum_{n=1}^{N} \tau(z_{nk})$:
     $$\mu_k^{new} = \frac{\sum_{n=1}^{N} \tau(z_{nk})\, x_n}{N_k}$$
     $$\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \tau(z_{nk})\, (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T$$
     $$\pi_k^{new} = \frac{N_k}{N}$$
  3. Alternate E- and M-steps until the parameters converge.
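One full EM iteration, written as a vectorized sketch under the assumption that `X` is an `(N, d)` array (the `em_step` name is ours; production code would use `sklearn.mixture.GaussianMixture`):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One EM iteration for a GMM: E-step then M-step."""
    N, k = len(X), len(weights)
    # E-step: responsibilities tau(z_nk); row n, column k
    tau = np.column_stack([
        weights[j] * multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
        for j in range(k)
    ])
    tau /= tau.sum(axis=1, keepdims=True)
    # M-step: N_k = sum_n tau(z_nk), then update mu, Sigma, pi
    Nk = tau.sum(axis=0)
    new_means = (tau.T @ X) / Nk[:, None]
    new_covs = []
    for j in range(k):
        d = X - new_means[j]
        new_covs.append((tau[:, j, None] * d).T @ d / Nk[j])
    return Nk / N, new_means, new_covs   # pi^new, mu^new, Sigma^new
```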
## Key Equations Summary 🔑

| Concept | Equation |
|---|---|
| Entropy | $Entropy(S) = -\sum p_i \log_2(p_i)$ |
| Information Gain | $IG(S, A) = Entropy(S) - \sum \frac{\lvert S_v \rvert}{\lvert S \rvert} Entropy(S_v)$ |
| Gini Index | $Gini(S) = 1 - \sum p_i^2$ |
| Weighted Gini | $Gini_{split}(S, A) = \sum \frac{\lvert S_v \rvert}{\lvert S \rvert} Gini(S_v)$ |
| PCA | $\text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$ |
| K-Means | Assign points to the nearest centroid, recalculate centroids. |
| GMM | $p(x) = \sum \pi_i\, \mathcal{N}(x \mid \mu_i, \Sigma_i)$ |
| EM Algorithm | E-Step: compute responsibilities; M-Step: update parameters. |
## Mind Map 🧠

```
Machine Learning Equations
├── Entropy (Uncertainty)
├── Information Gain (Reduction in Entropy)
├── Gini Index (Impurity)
├── Weighted Gini (Impurity after Split)
├── PCA (Dimensionality Reduction)
├── K-Means (Clustering)
├── GMM (Mixture of Gaussians)
└── EM Algorithm (Parameter Estimation)
```
## How to Solve Equations 🛠️

1. **Entropy**:
   - Calculate proportions ($p_i$).
   - Compute $\log_2(p_i)$ and multiply each $p_i$ by its $\log_2(p_i)$.
   - Sum the results for all classes and multiply by $-1$.
2. **Information Gain**:
   - Calculate entropy before and after the split.
   - Subtract the weighted sum of subset entropies.
3. **Gini Index**:
   - Calculate proportions ($p_i$) and square each one.
   - Sum the squared proportions and subtract the sum from 1.
4. **Weighted Gini**:
   - Split the dataset and calculate the Gini Index of each subset.
   - Compute the weighted sum.
5. **PCA**:
   - Standardize the data and compute the covariance matrix.
   - Find eigenvalues/eigenvectors and select the top PCs.
6. **K-Means**:
   - Initialize centroids, assign points, recalculate centroids.
7. **GMM**:
   - Model the data as a mixture of Gaussians.
8. **EM Algorithm**:
   - E-Step: compute responsibilities.
   - M-Step: update parameters.