7.2. Descriptive Statistics - JulTob/Mathematics GitHub Wiki
Descriptive Statistics
Statistical Measures
Descriptive statistics are all about summarizing and describing the main features of a dataset. These summaries often take the form of measures of central tendency (like the mean, median, and mode) and measures of dispersion (like variance and standard deviation). Let’s break these down!
Mean
The mean (or average) is the most common measure of central tendency. It gives us the “center” of the data by dividing the sum of all data points by the number of observations.
For non-grouped data (i.e., raw data):
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Here, $\bar{x}$ is the arithmetic mean, and $n$ is the number of observations.
For grouped data (data summarized by frequency):
\bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i
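Where $f_i$ is the frequency of the distinct value $x_i$, $k$ is the number of distinct values, and $n = \sum_{i=1}^{k} f_i$ is the total number of observations.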
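To see that the raw and grouped formulas agree, here is a minimal Python sketch. The function names `mean` and `grouped_mean` and the sample data are made up for illustration; the frequency table is assumed to be a dictionary mapping each distinct value to its count.

```python
from collections import Counter

def mean(values):
    """Arithmetic mean of raw observations: sum divided by count."""
    return sum(values) / len(values)

def grouped_mean(freq):
    """Arithmetic mean from a {value: frequency} table."""
    n = sum(freq.values())  # total number of observations
    return sum(f * x for x, f in freq.items()) / n

data = [2, 4, 4, 5, 5, 5, 10]  # made-up sample
table = Counter(data)          # {2: 1, 4: 2, 5: 3, 10: 1}

print(mean(data))           # 5.0
print(grouped_mean(table))  # 5.0
```

Both calls give the same result because the grouped formula just collects repeated values before summing.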
Harmonic Mean
The harmonic mean is another measure of central tendency, but it’s more useful when dealing with rates or ratios. It’s the reciprocal of the average of the reciprocals of the data points.
For non-grouped data:
H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}
This is great for datasets where you’re comparing things like speeds or rates.
For grouped data:
H = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{x_i}}
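A classic case is averaging speeds over equal distances. The sketch below assumes a trip driven at 40 km/h one way and 60 km/h back (made-up numbers); the helper names are hypothetical, while `statistics.harmonic_mean` is from the Python standard library and is used only as a cross-check.

```python
import statistics

def harmonic_mean(values):
    """n divided by the sum of reciprocals (values must be non-zero)."""
    return len(values) / sum(1 / x for x in values)

def grouped_harmonic_mean(freq):
    """Harmonic mean from a {value: frequency} table."""
    n = sum(freq.values())
    return n / sum(f / x for x, f in freq.items())

speeds = [40, 60]  # km/h over two legs of equal distance
h = harmonic_mean(speeds)
h_grouped = grouped_harmonic_mean({40: 1, 60: 1})
h_stdlib = statistics.harmonic_mean(speeds)

print(h, h_grouped, h_stdlib)  # all approximately 48, not the arithmetic mean 50
```

The harmonic mean comes out below the arithmetic mean because more time is spent on the slower leg.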
Geometric Mean
The geometric mean is often used for growth rates or anything that involves multiplication. It’s the n-th root of the product of the data points.
For non-grouped data:
G = \left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}
For grouped data:
G = \left( \prod_{i=1}^{k} x_i^{f_i} \right)^{\frac{1}{n}}
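Here is a small Python sketch of both forms, assuming positive values and a frequency table stored as a dictionary (the function names are made up). In the grouped form each value is raised to its frequency and the $n$-th root is taken over the total count $n$.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def grouped_geometric_mean(freq):
    """Geometric mean from a {value: frequency} table."""
    n = sum(freq.values())  # total number of observations
    return math.prod(x ** f for x, f in freq.items()) ** (1 / n)

g = geometric_mean([2, 8])                        # sqrt(16), about 4
g_grouped = grouped_geometric_mean({2: 2, 8: 2})  # 256 ** (1/4), about 4

# Average yearly growth factor for +10%, +20%, -10% (made-up figures)
avg_growth = geometric_mean([1.10, 1.20, 0.90])
print(g, g_grouped, avg_growth)
```

For long datasets the product can overflow or underflow, so summing logarithms (or using `statistics.geometric_mean` in Python 3.8+) is usually the safer route.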
Quadratic Mean
The quadratic mean, also known as the root mean square (RMS), is useful for datasets with both positive and negative values, where the arithmetic mean can cancel out to something close to zero. It’s the square root of the average of the squared values.
For non-grouped data:
C = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}
For grouped data:
C = \sqrt{\frac{1}{n} \sum_{i=1}^{k} f_i x_i^2}
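A quick sketch of why this helps with signed data: for a made-up alternating signal the arithmetic mean is zero, while the quadratic mean still reports the typical magnitude. The function names are hypothetical.

```python
import math

def quadratic_mean(values):
    """Square root of the mean of the squared values (RMS)."""
    return math.sqrt(sum(x * x for x in values) / len(values))

def grouped_quadratic_mean(freq):
    """Quadratic mean from a {value: frequency} table."""
    n = sum(freq.values())
    return math.sqrt(sum(f * x * x for x, f in freq.items()) / n)

signal = [-1, 1, -1, 1]  # made-up alternating values
plain_mean = sum(signal) / len(signal)
rms = quadratic_mean(signal)
rms_grouped = grouped_quadratic_mean({-1: 2, 1: 2})

print(plain_mean)  # 0.0
print(rms)         # 1.0
```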
Mode
The mode is the most frequently occurring value in a dataset. It’s particularly useful for categorical data where we want to know which category is the most common.
For non-grouped data:
\text{Mode} = \text{MODA}(number1, number2, \dots)
Here MODA is the Spanish-locale name of the spreadsheet function MODE.
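Outside a spreadsheet, the same idea can be sketched with Python's standard `statistics` module. The sample data is made up; `multimode` (Python 3.8+) is shown because a dataset can have more than one most frequent value.

```python
import statistics

colors = ["red", "blue", "red", "green"]  # categorical data works fine
most_common = statistics.mode(colors)

# With ties, multimode returns every most frequent value,
# in the order each was first seen.
ties = statistics.multimode([1, 1, 2, 2, 3])

print(most_common)  # red
print(ties)         # [1, 2]
```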
Median
The median is the middle value in a dataset when the values are arranged in ascending order. It’s a more robust measure than the mean when dealing with skewed data or outliers.
For non-grouped data:
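\text{Median} = \text{MEDIANA}(number1, number2, \dots)
Here MEDIANA is the Spanish-locale name of the spreadsheet function MEDIAN. With an even number of observations, the median is the average of the two middle values.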
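The following sketch uses Python's standard `statistics.median` on made-up data to show the odd and even cases, and how a single outlier moves the mean much more than the median.

```python
import statistics

odd = statistics.median([5, 1, 3])      # sorted: 1, 3, 5 -> middle value
even = statistics.median([1, 3, 5, 7])  # average of the two middle values

with_outlier = [1, 2, 3, 4, 100]        # 100 is an outlier
med = statistics.median(with_outlier)
avg = statistics.mean(with_outlier)

print(odd, even)  # 3 4.0
print(med, avg)   # 3 22
```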
Non-central Positional Measures
These are measures that tell us where a particular value lies in relation to the rest of the data, such as quartiles and percentiles.
Quartiles
\text{Q}_i = \text{CUARTIL}(data, i), \quad i \in \{0, 1, 2, 3, 4\}
Where CUARTIL is the Spanish-locale name of the spreadsheet function QUARTILE, and:
- $\text{Q}_0$ is the minimum,
- $\text{Q}_1$ is the first quartile (25th percentile),
- $\text{Q}_2$ is the median (50th percentile),
- $\text{Q}_3$ is the third quartile (75th percentile),
- $\text{Q}_4$ is the maximum.
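Percentiles
\text{Percentile} = \text{PERCENTIL}(data, k), \quad k \in [0, 1]
The percentile for a given $k$ is the value below which a fraction $k$ of the data points fall; for example, $k = 0.9$ gives the 90th percentile. PERCENTIL is the Spanish-locale name of the spreadsheet function PERCENTILE.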
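There are several conventions for computing percentiles when $k$ falls between two data points. The sketch below assumes the "inclusive" linear-interpolation convention, where $k = 0$ maps to the minimum and $k = 1$ to the maximum; as far as I know this is what spreadsheet PERCENTILE-style functions use, but treat that as an assumption. The function name `percentile` and the data are made up; `statistics.quantiles` (Python 3.8+) is used as a cross-check.

```python
import math
import statistics

def percentile(data, k):
    """Value at fraction k in [0, 1], with linear interpolation (inclusive)."""
    xs = sorted(data)
    pos = k * (len(xs) - 1)        # fractional index into the sorted data
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)  # clamp so k = 1 stays in range
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

data = [9, 1, 8, 2, 7, 3, 6, 4, 5]  # made-up sample, values 1..9

quartiles = [percentile(data, q) for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
stdlib_q = statistics.quantiles(data, n=4, method="inclusive")

print(quartiles)  # [1.0, 3.0, 5.0, 7.0, 9.0]
print(stdlib_q)   # [3.0, 5.0, 7.0]
```

The quartiles $Q_0$ to $Q_4$ are just the percentiles at $k = 0, 0.25, 0.5, 0.75, 1$.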
Dispersion Measures
Dispersion measures how spread out the data is. These are crucial when we want to understand variability within the dataset.
Mean Deviation
This measures the average distance of each data point from the mean:
DM = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|
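As a small worked example, here is the mean deviation computed step by step in Python. The function name and the data are made up for illustration.

```python
def mean_deviation(values):
    """Average absolute distance of each value from the arithmetic mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n

data = [2, 4, 6, 8]        # mean is 5
dm = mean_deviation(data)  # distances: 3, 1, 1, 3

print(dm)  # 2.0
```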
Variance
Variance is the average squared deviation of the data points from the mean. It’s a key concept in understanding variability. The population form, which divides by $n$, is:
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
This is the quantity the spreadsheet function VARP computes:
\text{VARP}(number1, number2, \dots) = \sigma^2
Standard Deviation
The standard deviation is simply the square root of the variance:
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
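The sketch below computes the population variance and standard deviation by hand (dividing by $n$, as in the formulas here) and compares them with `statistics.pvariance` and `statistics.pstdev` from the Python standard library. The helper name and data are made up.

```python
import math
import statistics

def population_variance(values):
    """Mean of squared deviations from the mean (divides by n)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample, mean is 5

var = population_variance(data)  # squared deviations sum to 32
std = math.sqrt(var)

print(var, std)                    # 4.0 2.0
print(statistics.pvariance(data))  # same value from the standard library
print(statistics.pstdev(data))
```

Note that `statistics.variance` and `statistics.stdev` (without the `p`) divide by $n - 1$ instead, so they give slightly larger values on the same data.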
Fisher’s Skewness Coefficient
Skewness tells us about the asymmetry of the data. Perfectly symmetrical data has a skewness of zero (although a skewness of zero does not by itself guarantee symmetry). Positive skewness indicates a longer tail on the right, and negative skewness a longer tail on the left.
For non-grouped data:
g_1 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma} \right)^3
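Here is a minimal sketch of this coefficient in Python, using the population standard deviation (dividing by $n$) as in the formula. The function name and datasets are made up, and the code assumes the values are not all identical (otherwise $\sigma = 0$).

```python
import math

def fisher_skewness(values):
    """Mean of cubed standardized deviations (population sigma)."""
    n = len(values)
    x_bar = sum(values) / n
    sigma = math.sqrt(sum((x - x_bar) ** 2 for x in values) / n)
    return sum(((x - x_bar) / sigma) ** 3 for x in values) / n

symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 3, 4, 10]   # one large value stretches the right tail
left_tail = [-10, -4, -3, -2, -1]

print(fisher_skewness(symmetric))   # approximately 0
print(fisher_skewness(right_tail))  # positive
print(fisher_skewness(left_tail))   # negative
```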
Fisher’s Kurtosis Coefficient
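Kurtosis tells us about the “tailedness” of the distribution. The coefficient $g_2$ is measured relative to the normal distribution, so a distribution can be:
- Platykurtic (fewer extreme values than a normal distribution, $g_2 < 0$),
- Mesokurtic (similar to a normal distribution, $g_2 \approx 0$),
- Leptokurtic (more extreme values than a normal distribution, $g_2 > 0$).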
For non-grouped data:
g_2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma} \right)^4 - 3
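The same pattern gives a sketch for the kurtosis coefficient. Subtracting 3 makes a normal distribution score about zero. The function name and datasets are made up, and the code again assumes the values are not all identical.

```python
import math

def fisher_kurtosis(values):
    """Mean of standardized deviations to the 4th power, minus 3."""
    n = len(values)
    x_bar = sum(values) / n
    sigma = math.sqrt(sum((x - x_bar) ** 2 for x in values) / n)
    return sum(((x - x_bar) / sigma) ** 4 for x in values) / n - 3

flat = [1, 2, 3, 4, 5]                          # evenly spread, light tails
heavy = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]       # mostly central, two extremes

g2_flat = fisher_kurtosis(flat)    # about -1.3 (platykurtic)
g2_heavy = fisher_kurtosis(heavy)  # about 2 (leptokurtic)

print(g2_flat, g2_heavy)
```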