ML2 - Equations

1. Entropy 📉

  • What?: Measures uncertainty or randomness in a dataset.

  • Formula:

    Entropy(S) = -\sum_{i=1}^c p_i \log_2(p_i)
    
    • p_i: Proportion of class i in the dataset.
    • c: Number of classes.
  • Steps to Solve:

    1. Calculate the proportion of each class in the dataset (p_i).
    2. Use a calculator to compute log₂(p_i) for each class (log₂ x = ln x / ln 2).
    3. Multiply each p_i by its log₂(p_i).
    4. Sum the results for all classes.
    5. Multiply the sum by -1 to get the entropy (see the sketch below).
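
A minimal Python sketch of these steps (the function name `entropy` and the toy labels are illustrative, not from the original page):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    # Step 1: proportion of each class (p_i).
    proportions = [count / n for count in Counter(labels).values()]
    # Steps 2-5: sum p_i * log2(p_i) over all classes, then negate.
    return -sum(p * log2(p) for p in proportions)

# Example: 9 positives, 5 negatives -> about 0.940 bits.
print(entropy(["+"] * 9 + ["-"] * 5))
```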

2. Information Gain (IG) 📈

  • What?: Measures the reduction in entropy after splitting the dataset.

  • Formula:

    IG(S, A) = Entropy(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} Entropy(S_v)
    
    • S: Dataset.
    • A: Attribute.
    • S_v: Subset of data where attribute A has value v.
  • Steps to Solve:

    1. Calculate the entropy of the entire dataset (Entropy(S)).
    2. Split the dataset based on attribute A.
    3. Calculate the entropy for each subset (Entropy(S_v)).
    4. Compute the weighted sum of subset entropies.
    5. Subtract the weighted sum from the original entropy (see the sketch below).
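
A minimal sketch of these steps (the attribute values and labels are toy data; `entropy` repeats the function from the previous sketch so the block runs on its own):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """IG(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v))."""
    # Step 2: split the labels by the attribute's value (the subsets S_v).
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)
    # Steps 3-4: weighted sum of the subset entropies.
    n = len(labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    # Step 5: subtract from the original entropy.
    return entropy(labels) - weighted

# A perfectly separating attribute recovers the full entropy (IG = 1 bit here).
print(information_gain(["a", "a", "b", "b"], ["+", "+", "-", "-"]))
```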

3. Gini Index 🎯

  • What?: Measures impurity in a dataset.

  • Formula:

    Gini(S) = 1 - \sum_{i=1}^c p_i^2
    
    • p_i: Proportion of class i in the dataset.
    • c: Number of classes.
  • Steps to Solve:

    1. Calculate the proportion of each class (p_i).
    2. Square each proportion using a calculator.
    3. Sum the squared proportions.
    4. Subtract the sum from 1 (see the sketch below).
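
A minimal sketch of the same computation in Python (names and labels are illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum over classes of p_i^2."""
    n = len(labels)
    # Steps 1-3: square each class proportion and sum.
    total = sum((count / n) ** 2 for count in Counter(labels).values())
    # Step 4: subtract from 1.
    return 1 - total

# Example: 9 positives, 5 negatives -> about 0.459.
print(gini(["+"] * 9 + ["-"] * 5))
```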

4. Weighted Gini Index ⚖️

  • What?: Measures impurity after splitting the dataset.

  • Formula:

    Gini_{split}(S, A) = \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} Gini(S_v)
    
    • S: Dataset.
    • A: Attribute.
    • S_v: Subset of data where attribute A has value v.
  • Steps to Solve:

    1. Split the dataset based on attribute A.
    2. Calculate the Gini Index for each subset (Gini(S_v)).
    3. Compute the weighted sum of the subset Gini Indices (see the sketch below).
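
A minimal sketch of the split computation (toy data; the `gini` helper repeats the function from the previous sketch so the block is self-contained):

```python
from collections import Counter, defaultdict

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels):
    """Gini_split(S, A) = sum(|S_v|/|S| * Gini(S_v))."""
    # Step 1: split by attribute value.
    subsets = defaultdict(list)
    for v, y in zip(values, labels):
        subsets[v].append(y)
    # Steps 2-3: weighted sum of the subset Gini Indices.
    n = len(labels)
    return sum(len(s) / n * gini(s) for s in subsets.values())

# Pure subsets give the best (lowest) possible weighted Gini: 0.
print(gini_split(["a", "a", "b", "b"], ["+", "+", "-", "-"]))
```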

5. Principal Component Analysis (PCA) 📊

  • What?: Reduces dimensionality by transforming data into uncorrelated components.
  • Steps to Solve:
    1. Standardize Data:
      Z = \frac{X - \mu}{\sigma}
      
    2. Compute Covariance Matrix:
      \text{Cov}(X, Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}
      
    3. Find Eigenvalues & Eigenvectors: Solve
      \text{det}(\text{Covariance Matrix} - \lambda I) = 0
      for the eigenvalues λ, then solve (\text{Covariance Matrix} - \lambda I)v = 0 for each eigenvector v.
      
    4. Select Top PCs: Keep the eigenvectors with the largest eigenvalues and project the data onto them (see the sketch below).
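
A minimal NumPy sketch of the four steps (the random toy data and `k = 2` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # toy data: 100 samples, 3 features

# Step 1: standardize, Z = (X - mu) / sigma.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (rowvar=False treats columns as variables).
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvalues/eigenvectors; eigh handles symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: keep the k eigenvectors with the largest eigenvalues and project.
k = 2
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
X_reduced = Z @ top
print(X_reduced.shape)                     # (100, 2)
```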

6. K-Means Clustering 🔢

  • What?: Partitional clustering algorithm that divides the data into k clusters, each represented by the mean (centroid) of its points.
  • Steps to Solve:
    1. Randomly initialize k centroids.
    2. Assign each point to the nearest centroid.
    3. Recalculate centroids as the mean of assigned points.
    4. Repeat steps 2-3 until convergence (no change in centroids); see the sketch below.
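
A minimal NumPy sketch of the loop (the toy data is illustrative; empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids, axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)
```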

7. Gaussian Mixture Model (GMM) 🎲

  • What?: Probabilistic model representing data as a mixture of Gaussians.
  • Formula:
    p(x) = \sum_{i=1}^k \pi_i N(x | \mu_i, \Sigma_i)
    
    • π_i: Mixing weight of component i (π_i ≥ 0 and ∑ π_i = 1).
    • μ_i: Mean of component i.
    • Σ_i: Covariance of component i (see the sketch below).
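
A minimal sketch of evaluating the mixture density p(x) in 1-D (the weights, means, and spreads are made-up numbers; note that `scipy.stats.norm` takes a standard deviation, i.e. √Σ_i):

```python
from scipy.stats import norm

pis = [0.6, 0.4]      # mixing weights (sum to 1)
mus = [0.0, 5.0]      # component means
sds = [1.0, 2.0]      # component standard deviations (sqrt of variances)

def mixture_pdf(x):
    """p(x) = sum_i pi_i * N(x | mu_i, sigma_i^2)."""
    return sum(pi * norm.pdf(x, loc=mu, scale=sd)
               for pi, mu, sd in zip(pis, mus, sds))

print(mixture_pdf(1.0))
```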

8. Expectation-Maximization (EM) Algorithm 🔄

  • What?: Iterative algorithm to estimate GMM parameters.
  • Steps to Solve:
    1. E-Step (Expectation): Compute responsibilities:
      \tau(z_{nk}) = \frac{\pi_k N(x_n | \mu_k, \Sigma_k)}{\sum_{j=1}^k \pi_j N(x_n | \mu_j, \Sigma_j)}
      
    2. M-Step (Maximization): Update parameters, where N_k = \sum_{n=1}^N \tau(z_{nk}) is the effective number of points assigned to component k (see the sketch below):
      \mu_k^{new} = \frac{\sum_{n=1}^N \tau(z_{nk}) x_n}{N_k}
      
      \Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^N \tau(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T
      
      \pi_k^{new} = \frac{N_k}{N}
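
A minimal NumPy/SciPy sketch of EM for a 1-D two-component GMM (the synthetic data and initial guesses are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
K, N = 2, len(x)

# Crude initial guesses.
pi = np.full(K, 1 / K)
mu = np.array([x.min(), x.max()])
sd = np.full(K, x.std())

for _ in range(50):
    # E-step: responsibilities tau[n, k] proportional to pi_k * N(x_n | mu_k, sd_k^2).
    tau = pi * norm.pdf(x[:, None], loc=mu, scale=sd)
    tau /= tau.sum(axis=1, keepdims=True)

    # M-step: update parameters, with N_k = sum_n tau[n, k].
    Nk = tau.sum(axis=0)
    mu = (tau * x[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((tau * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / N

print(mu, sd, pi)   # should recover means near 0 and 5
```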
      

Key Equations Summary 🔑

| Concept | Equation |
|---------|----------|
| Entropy | Entropy(S) = -∑ p_i log₂(p_i) |
| Information Gain | IG(S, A) = Entropy(S) - ∑ (\|S_v\|/\|S\|) Entropy(S_v) |
| Gini Index | Gini(S) = 1 - ∑ p_i² |
| Weighted Gini | Gini_split(S, A) = ∑ (\|S_v\|/\|S\|) Gini(S_v) |
| PCA | Cov(X, Y) = ∑ (X_i - X̄)(Y_i - Ȳ) / (n-1) |
| K-Means | Assign points to nearest centroid, recalculate centroids. |
| GMM | p(x) = ∑ π_i N(x \| μ_i, Σ_i) |
| EM Algorithm | E-Step: compute responsibilities. M-Step: update parameters. |

Mind Map 🧠

Machine Learning Equations
├── Entropy (Uncertainty)
├── Information Gain (Reduction in Entropy)
├── Gini Index (Impurity)
├── Weighted Gini (Impurity after Split)
├── PCA (Dimensionality Reduction)
├── K-Means (Clustering)
├── GMM (Mixture of Gaussians)
└── EM Algorithm (Parameter Estimation)

How to Solve Equations 🛠️

  1. Entropy:

    • Calculate proportions (p_i).
    • Use a calculator to compute log₂(p_i).
    • Multiply each p_i by its log₂(p_i).
    • Sum the results for all classes.
    • Multiply the sum by -1 to get the entropy.
  2. Information Gain:

    • Calculate the entropy before and after the split.
    • Subtract the weighted sum of subset entropies from the original entropy.
  3. Gini Index:

    • Calculate proportions (p_i).
    • Square each proportion using a calculator.
    • Sum the squared proportions.
    • Subtract the sum from 1.
  4. Weighted Gini:

    • Split dataset, calculate Gini for each subset.
    • Compute weighted sum.
  5. PCA:

    • Standardize data, compute covariance matrix.
    • Find eigenvalues/eigenvectors, select top PCs.
  6. K-Means:

    • Initialize centroids, assign points, recalculate centroids.
  7. GMM:

    • Model data as a mixture of Gaussians.
  8. EM Algorithm:

    • E-Step: Compute responsibilities.
    • M-Step: Update parameters.