7.2. Descriptive Statistics - JulTob/Mathematics GitHub Wiki

Descriptive Statistics

Statistical Measures

Descriptive statistics are all about summarizing and describing the main features of a dataset. These summaries often take the form of measures of central tendency (like the mean, median, and mode) and measures of dispersion (like variance and standard deviation). Let’s break these down!

Mean

The mean (or average) is the most common measure of central tendency. It gives us the “center” of the data by dividing the sum of all data points by the number of observations.

For non-grouped data (i.e., raw data):

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i 

Here, $\bar{x}$ is the arithmetic mean, and $n$ is the number of observations.

For grouped data (data summarized by frequency):

\bar{x} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i 

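Where $f_i$ is the frequency of the distinct value $x_i$, $k$ is the number of distinct values, and $n = \sum_{i=1}^{k} f_i$ is the total number of observations.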
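Both formulas can be checked with a minimal Python sketch. The function names `mean_raw` and `mean_grouped` and the sample data are illustrative assumptions, not part of the original page; the grouped version takes a value-to-frequency table.

```python
from collections import Counter

def mean_raw(values):
    """Arithmetic mean of raw observations: (1/n) * sum(x_i)."""
    return sum(values) / len(values)

def mean_grouped(freq):
    """Arithmetic mean from a {value: frequency} table: (1/n) * sum(f_i * x_i)."""
    n = sum(freq.values())  # total number of observations
    return sum(f * x for x, f in freq.items()) / n

data = [2, 4, 4, 5, 5, 5, 10]
table = Counter(data)  # frequencies: 2 -> 1, 4 -> 2, 5 -> 3, 10 -> 1

print(mean_raw(data))       # 5.0
print(mean_grouped(table))  # 5.0
```

Both calls agree because the frequency table is just a compressed form of the raw list.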

Harmonic Mean

The harmonic mean is another measure of central tendency, but it’s more useful when dealing with rates or ratios. It’s the reciprocal of the average of the reciprocals of the data points.

For non-grouped data:

H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}

This is the right average when the values are rates over a fixed amount of work, such as speeds over equal distances.

For grouped data:

H = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{x_i}}
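As an illustration of why the harmonic mean suits rates, the sketch below averages two speeds driven over equal distances. The helper names and the speed values are assumptions made for this example.

```python
import math

def harmonic_mean(values):
    """n divided by the sum of reciprocals (values must be non-zero)."""
    return len(values) / sum(1 / x for x in values)

def harmonic_mean_grouped(freq):
    """Same measure from a {value: frequency} table: n / sum(f_i / x_i)."""
    n = sum(freq.values())
    return n / sum(f / x for x, f in freq.items())

# Same distance driven once at 60 km/h and once at 40 km/h.
speeds = [60, 40]
average_speed = harmonic_mean(speeds)

print(round(average_speed, 6))  # 48.0, not the arithmetic mean of 50
```

The slower leg takes more time, so it weighs more in the overall average speed; the harmonic mean captures that.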

Geometric Mean

The geometric mean is often used for growth rates or anything that involves multiplication. It’s the n-th root of the product of the data points, so it is only meaningful when all values are positive.

For non-grouped data:

G = \left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}

For grouped data:

G = \left( \prod_{i=1}^{k} x_i^{f_i} \right)^{\frac{1}{n}}
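A small Python sketch of the geometric mean follows. It assumes positive data and averages logarithms instead of multiplying directly, which is mathematically equivalent; in the grouped version each value is weighted by its frequency and the n-th root is taken with $n = \sum f_i$. The function names and the growth factors are illustrative assumptions.

```python
import math

def geometric_mean(values):
    """n-th root of the product of the values (values must be positive)."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs positive values")
    # exp(mean(log x)) equals the n-th root of the product.
    return math.exp(sum(math.log(x) for x in values) / len(values))

def geometric_mean_grouped(freq):
    """Same measure from a {value: frequency} table."""
    if any(x <= 0 for x in freq):
        raise ValueError("geometric mean needs positive values")
    n = sum(freq.values())
    return math.exp(sum(f * math.log(x) for x, f in freq.items()) / n)

# Yearly growth factors: +10%, +20%, -5%.
growth = [1.10, 1.20, 0.95]
g = geometric_mean(growth)

# Applying g three times gives the same total growth as the three factors.
print(math.isclose(g ** 3, 1.10 * 1.20 * 0.95))  # True
```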

Quadratic Mean

The quadratic mean, also known as the root mean square (RMS), is useful for datasets with both positive and negative values, because squaring removes the signs. It’s the square root of the average of the squares of the values.

For non-grouped data:

C = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2} 

For grouped data:

C = \sqrt{\frac{1}{n} \sum_{i=1}^{k} f_i x_i^2}
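To show how the quadratic mean handles signed data, here is a minimal sketch. The function names and the sample values are assumptions chosen so that the arithmetic mean is zero while the quadratic mean is not.

```python
import math

def quadratic_mean(values):
    """Root mean square: sqrt((1/n) * sum(x_i^2))."""
    return math.sqrt(sum(x * x for x in values) / len(values))

def quadratic_mean_grouped(freq):
    """Same measure from a {value: frequency} table."""
    n = sum(freq.values())
    return math.sqrt(sum(f * x * x for x, f in freq.items()) / n)

signal = [-7, -1, 1, 7]

print(sum(signal) / len(signal))  # 0.0, positives and negatives cancel
print(quadratic_mean(signal))     # 5.0, magnitudes are preserved
```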

Mode

The mode is the most frequently occurring value in a dataset. It’s particularly useful for categorical data where we want to know which category is the most common.

For non-grouped data:

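\text{Mode} = \text{MODA}(number1, number2, \dots)

Here MODA is the Spanish-locale name of the spreadsheet function MODE. The other function names on this page (MEDIANA, CUARTIL, PERCENTIL) are likewise the Spanish names of MEDIAN, QUARTILE, and PERCENTILE.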
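Outside a spreadsheet, the same idea can be sketched with Python's standard `statistics` module. The sample data is made up for illustration; `statistics.multimode` requires Python 3.8 or later and returns every value tied for most frequent.

```python
import statistics

# Categorical data: the mode is the most common category.
colors = ["red", "blue", "red", "green", "blue", "red"]
most_common = statistics.mode(colors)

# When several values tie, multimode lists all of them.
ties = statistics.multimode([1, 1, 2, 2, 3])

print(most_common)  # red
print(ties)         # [1, 2]
```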

Median

The median is the middle value in a dataset when the values are arranged in ascending order. When the number of observations is even, it is the average of the two middle values. It’s a more robust measure than the mean when dealing with skewed data or outliers.

For non-grouped data:

\text{Median} = \text{MEDIANA}(number1, number2, \dots) 
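The sorting-and-picking procedure can be sketched in Python as below. The `median` helper and the sample lists are illustrative assumptions; the last list shows how a single outlier moves the mean but not the median.

```python
import statistics

def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

odd = [7, 1, 3]
even = [7, 1, 3, 5]
with_outlier = [1, 2, 3, 4, 1000]

print(median(odd))                    # 3
print(median(even))                   # 4.0
print(median(with_outlier))           # 3
print(statistics.mean(with_outlier))  # 202
```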

Non-central Positional Measures

These are measures that tell us where a particular value lies in relation to the rest of the data, such as quartiles and percentiles.

Quartiles

\text{Q}_i = \text{CUARTIL}(data, i), \quad i \in \{0, 1, 2, 3, 4\}

Where:

  • $\text{Q}_0$ is the minimum,
  • $\text{Q}_1$ is the first quartile (25th percentile),
  • $\text{Q}_2$ is the median (50th percentile),
  • $\text{Q}_3$ is the third quartile (75th percentile),
  • $\text{Q}_4$ is the maximum.

Percentiles

\text{Percentile} = \text{PERCENTIL}(data, k), \quad k \in [0, 1]

The percentile of order $k$ is the value below which a fraction $k$ of the data points fall. For example, $k = 0.25$ gives the first quartile $\text{Q}_1$.
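Several conventions exist for computing quartiles and percentiles on a finite sample. The sketch below assumes linear interpolation between order statistics at position $k(n-1)$, which is the inclusive convention commonly used by spreadsheet percentile functions; other tools may give slightly different values. The function names `percentile` and `quartile` are hypothetical helpers.

```python
import math

def percentile(data, k):
    """Value at fraction k in [0, 1], interpolating linearly between sorted values."""
    if not 0 <= k <= 1:
        raise ValueError("k must be between 0 and 1")
    xs = sorted(data)
    pos = k * (len(xs) - 1)  # fractional index into the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def quartile(data, i):
    """Q_i for i in {0, 1, 2, 3, 4}, expressed as a percentile of order i/4."""
    if i not in (0, 1, 2, 3, 4):
        raise ValueError("i must be 0, 1, 2, 3 or 4")
    return percentile(data, i / 4)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
quartiles = [quartile(data, i) for i in range(5)]

print(quartiles)  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

With this convention $\text{Q}_0$ and $\text{Q}_4$ are the minimum and maximum, and $\text{Q}_2$ coincides with the median.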

Dispersion Measures

Dispersion measures how spread out the data is. These are crucial when we want to understand variability within the dataset.

Mean Deviation

This measures the average absolute distance of the data points from the mean:

DM = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| 
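A direct translation of this formula into Python might look like the sketch below; the function name and the sample data are assumptions for illustration.

```python
import math

def mean_deviation(values):
    """Average absolute distance from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

data = [2, 4, 4, 5, 5, 5, 10]  # mean is 5
dm = mean_deviation(data)

# Absolute deviations are 3, 1, 1, 0, 0, 0, 5, which sum to 10.
print(math.isclose(dm, 10 / 7))  # True
```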

Variance

Variance is the average of the squared deviations from the mean. It’s a key concept in understanding variability. The formula below is the population variance (it divides by $n$), which is what the spreadsheet function VARP computes:

\sigma^2 = \text{VARP}(number1, number2, \dots) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2

Standard Deviation

The standard deviation is simply the square root of the variance, which puts it back in the same units as the data:

\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
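The population formulas for variance and standard deviation can be sketched in Python and compared against the standard library's `statistics.pvariance` and `statistics.pstdev`, which also divide by $n$. The hand-written helper names and the data are assumptions for this example.

```python
import math
import statistics

def population_variance(values):
    """(1/n) * sum((x_i - mean)^2)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0

# The standard library agrees with the hand-written versions.
print(math.isclose(statistics.pvariance(data), population_variance(data)))  # True
print(math.isclose(statistics.pstdev(data), population_std(data)))          # True
```

Note that `statistics.variance` and `statistics.stdev` (without the `p`) divide by $n - 1$ instead and therefore give slightly larger values.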

Fisher’s Skewness Coefficient

Skewness tells us about the asymmetry of the data. Perfectly symmetrical data has a skewness of zero. Positive skewness indicates a longer tail to the right, and negative skewness a longer tail to the left.

For non-grouped data:

g_1 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma} \right)^3
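The coefficient $g_1$ can be computed as in the following sketch, which uses the population standard deviation (dividing by $n$) for $\sigma$. The function name and both sample datasets are assumptions; the code also assumes the data is not constant, since $\sigma = 0$ would cause a division by zero.

```python
import math

def fisher_skewness(values):
    """g1: mean of the cubed standardized deviations (population sigma)."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / sigma) ** 3 for x in values) / n

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]  # one large value stretches the right tail

print(abs(fisher_skewness(symmetric)) < 1e-12)  # True, no asymmetry
print(fisher_skewness(right_tailed) > 0)        # True, tail to the right
```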

Fisher’s Kurtosis Coefficient

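Kurtosis tells us about the “tailedness” of the distribution. The coefficient $g_2$ is the excess kurtosis, measured relative to the normal distribution, so a distribution can be:

  • Platykurtic (fewer extreme values, $g_2 < 0$),
  • Mesokurtic (similar to the normal distribution, $g_2 \approx 0$),
  • Leptokurtic (more extreme values, $g_2 > 0$).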

For non-grouped data:

g_2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma} \right)^4 - 3
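As with skewness, $g_2$ can be sketched in a few lines of Python using the population standard deviation for $\sigma$. The function name and the two datasets are illustrative assumptions, and the data must not be constant.

```python
import math

def fisher_kurtosis(values):
    """g2: mean of the fourth-power standardized deviations, minus 3."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / sigma) ** 4 for x in values) / n - 3

flat = [1, 2, 3, 4, 5]                        # evenly spread, no extremes
heavy = [0, 0, 0, 0, 0, 0, 0, 0, 10, -10]     # mostly zeros, two extremes

print(round(fisher_kurtosis(flat), 6))   # -1.3, platykurtic
print(round(fisher_kurtosis(heavy), 6))  # 2.0, leptokurtic
```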