Basic Stat - mrprov12/DSPrep GitHub Wiki
Statistics: a collection of mathematics used to summarize, analyze, and interpret a group of numbers or observations.
Additional Resources
Measures of Center, Measures of Variance, Five Number Summary
what is central tendency?
The Median is Resistant to Outliers
The primary difference between the mean and the median is their resistance to outliers. The mean is not very resistant to outliers, especially when the outliers in a dataset are not symmetric. If a collection has extreme outliers, the mean may describe the "center" of the distribution inaccurately. Household income is a classic example: households with far greater incomes pull the mean upward to the point where it no longer accurately describes the dataset.
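As a rough illustration of the point above, the sketch below uses made-up household incomes (hypothetical figures, not real data) and Python's standard-library statistics module to show how a single extreme value moves the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical household incomes in dollars (illustrative values only)
incomes = [32_000, 41_000, 45_000, 52_000, 58_000]
# The same households plus one extreme outlier
incomes_with_outlier = incomes + [2_000_000]

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", mean(incomes_with_outlier), median(incomes_with_outlier))
```

With these numbers the mean jumps from 45,600 to roughly 371,333, while the median only moves from 45,000 to 48,500.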
The Mean is Preferable in Large Datasets with Few Outliers
There are some situations where the mean is considered a preferable measure to median; typically these are situations in which there are a large number of items in the collection, and there are not any outliers (or the outliers are symmetric). Also, inferential statistics are largely built upon measurements of the mean, so it is the statistic which is used most often.
Mode is Preferable When Using Categorical Data
In a collection with categorical data that is (generally) not ordinal in nature, the mode is the best measure of center, though the use of the term "center" may be taking a bit of liberty.
The mode can also be a useful descriptive statistic when there isn't one single central concentration of values. A common example of this would be the weights of household pets. If one were to take a sample of household pet weights, there would likely be a concentration of cats, each weighing between eight and twelve pounds, and a concentration of dogs weighing between twenty and thirty-five pounds. The mean or median may tell us that a typical household pet weighs fifteen pounds, but that figure doesn't accurately describe the typical weight of either cats or dogs. A distribution such as this is often referred to as bimodal.
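The pet example can be sketched with a few made-up weights (hypothetical values chosen to match the ranges described above). The point is only that the mean and median of the combined data land between the two clusters.

```python
from statistics import mean, median

# Hypothetical pet weights in pounds (illustrative values only)
cat_weights = [8, 9, 10, 11, 12]
dog_weights = [20, 24, 28, 30, 35]
pet_weights = cat_weights + dog_weights

overall_mean = mean(pet_weights)      # 18.7
overall_median = median(pet_weights)  # 16.0

# Check whether either summary falls inside the cat or the dog cluster
in_cat_range = min(cat_weights) <= overall_mean <= max(cat_weights)
in_dog_range = min(dog_weights) <= overall_mean <= max(dog_weights)
print(overall_mean, overall_median, in_cat_range, in_dog_range)
```

Both range checks come out False: neither summary describes a typical cat or a typical dog.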
mean
μ The lowercase Greek letter mu is the standard notation for a population mean
x̄ Pronounced "x-bar", this is the standard notation for a sample mean
X̄ Capitalized x-bar is a common notation for the sample mean, where X is a random variable
population vs sample
Statistics consists of the analysis and study of datasets, and there are two types of datasets: populations and samples. A population contains all of the possible data points or observations from a set of data. For example, a rancher who owns 1000 cattle could find the population mean of their weights by weighing all 1000 cattle and taking the mean. A sample, by contrast, does not contain every possible observation. For example, the rancher above could estimate the population mean by weighing a random sample of 100 of the cattle and taking the mean of those 100 observations. Statistics, and inferential statistics specifically, revolves around making inferences about populations from samples.
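The rancher example can be sketched as follows. The weights are simulated (a hypothetical herd drawn from a normal distribution with an assumed mean of 1200 pounds), so only the comparison between the two means matters, not the numbers themselves.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: weights (in pounds) of all 1000 cattle
population = [random.gauss(1200, 100) for _ in range(1000)]
population_mean = mean(population)

# Sample: weigh only 100 randomly chosen cattle
sample = random.sample(population, 100)
sample_mean = mean(sample)

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

The sample mean will usually land close to the population mean without matching it exactly, which is the gap inferential statistics tries to quantify.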
Note: alternatively, import numpy and call numpy.mean() instead of the function below.
def mean(lst):
    """
    A function which takes in a list as a parameter, and returns the arithmetic
    mean.

    Parameters
    ----------
    lst : list
        A list containing numeric data

    Returns
    -------
    mean : float
        A float representing the mean of the given list
    """
    return sum(lst) / len(lst)
median
Definition: Median Denoting or relating to a value or quantity lying at the midpoint of a frequency distribution of observed values or quantities, such that there is an equal probability of falling above or below it.
med(A) Where A is the collection on which to take the median
x̃ Lower-case x with a tilde over the top of it is often used to denote the median
The median is another measure of central tendency. The median can be considered the "middle" value of a sorted numerical collection. Half of the collection is less than or equal to the median, and half of the collection is greater than or equal to the median. When a collection has extreme outliers (specifically outliers that are not symmetrical), the median can be a more robust measure than the mean.
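median from odd-length collection
the value at position len(collection) / 2 + .5 of the sorted collection (counting from 1)
median from even-length collection
the mean of the values at positions len(collection) / 2 and len(collection) / 2 + 1 of the sorted collection (counting from 1)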
Note: again, you can import numpy and call numpy.median() instead of the function below.
import numpy as np

def median(lst):
    """
    A function which takes in a list as a parameter, and returns the median.

    Parameters
    ----------
    lst : list
        A list containing numeric data

    Returns
    -------
    median : float
        A float representing the median of the given list
    """
    length = len(lst)
    sorted_lst = sorted(lst)
    # Lists with an odd number of items: return the single middle value
    if length % 2 != 0:
        idx = length // 2
        return sorted_lst[idx]
    # Lists with an even number of items: average the two middle values
    else:
        idx1 = length // 2 - 1
        idx2 = idx1 + 1
        return np.mean([sorted_lst[idx1], sorted_lst[idx2]])
mode
mode(A) A is the collection on which to take the mode
Mo Also denotes the mode
The mode of a numerical collection is a different approach than mean or median. Instead of finding the center of a collection, the mode seeks to find the item with the greatest frequency. In other words, the mode describes the value that occurs most often.
It is worth noting that mode can be used for collections that are not numerical. The mode can determine frequency for nominal (categorical or named) data as well.
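For nominal data, a quick way to try this is the standard library's statistics module (multimode needs Python 3.8 or later). The survey answers below are hypothetical and only meant to show the mode working on strings.

```python
from statistics import mode, multimode

# Hypothetical nominal data: favorite colors from a small survey
colors = ["red", "blue", "red", "green", "blue", "red"]
print(mode(colors))       # red

# With a tie, multimode returns every most-frequent value
tied = ["cat", "dog", "cat", "dog", "fish"]
print(multimode(tied))    # ['cat', 'dog']
```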
finding the mode using a dictionary as a counter:
def mode(lst):
    """ A function which takes in a sequence (list, tuple, np.array) and returns the
    mode. If there is not a mode, the function returns None.

    Parameters
    ----------
    lst : list
        A list of items which are assumed to be a primitive data type (int, float, string)

    Returns
    -------
    modes : list
        Returns the mode(s) of the sequence in a list. If there is no mode, returns None
    """
    # Declare a dictionary to use as a counter
    dict_counter = {}
    # Iterate through the list of objects
    for item in lst:
        # If the dictionary key already exists (item has already been encountered)
        if item in dict_counter:
            dict_counter[item] += 1
        # If the dictionary key doesn't exist (1st time item's encountered)
        else:
            dict_counter[item] = 1
    # Find the maximum value from the dictionary counter
    # (realize that there may be more than one key that has this value)
    max_freq = max(dict_counter.values())
    # Collect a list of all the keys (items) with a value equal to max_freq
    modes = [item for item, freq in dict_counter.items() if freq == max_freq]
    # Return the mode(s); if every item occurs exactly once (no mode) return None
    if len(modes) == len(lst):
        return None
    else:
        return modes
finding the mode using collections.Counter() as a counter:
def mode(lst):
    """ A function which takes in a sequence (list, tuple, np.array) and returns the
    mode. If there is not a mode, the function returns None.

    Parameters
    ----------
    lst : list
        A list of items which are assumed to be a primitive data type (int, float, string)

    Returns
    -------
    modes : list
        Returns the mode(s) of the sequence in a list. If there is no mode, returns None
    """
    # Import the collections module
    import collections
    # Create a dictionary counter using .Counter()
    dict_counter = collections.Counter(lst)
    # Find the maximum value from the dictionary counter
    # (realize that there may be more than one key that has this value)
    max_freq = max(dict_counter.values())
    # Collect a list of all the keys (items) with a value equal to max_freq
    modes = [item for item, freq in dict_counter.items() if freq == max_freq]
    # Return the mode(s); if every item occurs exactly once (no mode) return None
    if len(modes) == len(lst):
        return None
    else:
        return modes
using scipy.stats.mode():
# Import the statistics module from scipy
from scipy import stats
# Returns the mode and its count; if several values tie for most frequent, only one of them is returned
stats.mode(lst)
The 5 number summary:
know how to code this for the TI
Definition: Five Number Summary The five-number summary is a set of descriptive statistics that provides information about a dataset. It consists of the five most important sample percentiles:
The minimum
The lower (first) quartile: Q1
The median
The upper (third) quartile: Q3
The maximum
Note: The values are often expressed in a tuple, as follows:
(min, Q1, median, Q3, max)
# Import numpy module to use the .median() function
import numpy as np

def five_summary(lst):
    # Sort list, find median
    sorted_lst = sorted(lst)
    med = np.median(sorted_lst)
    # If odd length, find index of median, partition data around it
    if len(lst) % 2 != 0:
        med_idx = len(lst) // 2
        low_subset = sorted_lst[:med_idx]
        high_subset = sorted_lst[med_idx + 1:]
    # If even length, split the data into two equal halves
    else:
        idx1 = len(lst) // 2 - 1
        idx2 = idx1 + 1
        low_subset = sorted_lst[:idx1 + 1]
        high_subset = sorted_lst[idx2:]
    # Define Q1, Q3
    q1 = np.median(low_subset)
    q3 = np.median(high_subset)
    # Return five number summary in a tuple
    return min(sorted_lst), q1, med, q3, max(sorted_lst)
# Imports
from numpy import percentile

def five_number_summary(lst):
    '''Calculates the five number summary of a list of given values
    and returns them as a tuple.

    PARAMETERS
    ----------
    lst : list
        A list of numerical values from which to calculate the five number
        summary.

    RETURNS
    -------
    five_nums : tuple
        A 5-tuple of values representing the five number summary
        from lowest to highest.
    '''
    # Calculate quartiles (Q1, Median, Q3)
    q1, median_, q3 = percentile(lst, [25, 50, 75])
    # Calculate min/max
    min_, max_ = min(lst), max(lst)
    return min_, q1, median_, q3, max_
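The two implementations above need not agree on Q1 and Q3. The first takes the medians of the lower and upper halves, while numpy's percentile, by default, interpolates linearly between neighboring sorted values. The sketch below reproduces both rules in plain Python to show the difference; quartiles_by_halves and linear_percentile are hypothetical helper names written for this illustration, not numpy functions.

```python
from statistics import median

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves.

    For odd lengths the middle value is left out of both halves.
    Needs at least two values.
    """
    s = sorted(values)
    half = len(s) // 2
    return median(s[:half]), median(s[-half:])

def linear_percentile(values, pct):
    """Percentile by linear interpolation between neighboring sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100   # fractional index into the sorted data
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

data = [1, 2, 3, 4, 5, 6, 7, 8]
print(quartiles_by_halves(data))                                  # (2.5, 6.5)
print(linear_percentile(data, 25), linear_percentile(data, 75))   # 2.75 6.25
```

Neither answer is wrong. Several quartile conventions are in use, so it is worth stating which one a summary relies on.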
variance and std dev
variance (population and sample)
s^2 generally refers to the variance of a sample
σ^2 generally refers to the variance of a population
Variance is the average of the squared differences between each point and the overall mean.
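Written out with the notation above, and taking N as the population size, n as the sample size, μ as the population mean and x̄ as the sample mean (the index i is introduced here only for the summation), the standard definitions are:

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\qquad
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The only structural difference is the divisor: N for a population, n - 1 for a sample.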
A note about the application of Bessel's correction: The difference between the sample and population variance formulas is Bessel's correction, which divides by n - 1 instead of n. In short, when one finds the variance of a population, every value, including every outlier, is included. In contrast, when sampling from a population there is a chance that very few (or none!) of the outliers will end up in the sample, and the differences are measured from the sample's own mean, which sits closer to the sampled points than the population mean does. Because of this, the uncorrected variance of a sample will tend to be smaller than the true variance of the population. Because the object is to make inferences about a population from a sample, applying Bessel's correction makes the sample variance a better estimate of the population variance.
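The effect can be checked with a small simulation. This is a sketch under assumptions: the population is simulated, the sample size of 5 and the 5000 repetitions are arbitrary choices, and the standard library's pvariance (divides by n) and variance (divides by n - 1) stand in for the uncorrected and corrected calculations.

```python
import random
from statistics import mean, pvariance, variance

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1000 values and its true variance
population = [random.gauss(0, 10) for _ in range(1000)]
true_variance = pvariance(population)

uncorrected, corrected = [], []
for _ in range(5000):
    # Draw a small sample (with replacement) from the population
    sample = random.choices(population, k=5)
    uncorrected.append(pvariance(sample))  # divides by n
    corrected.append(variance(sample))     # divides by n - 1

print(f"true variance:       {true_variance:.1f}")
print(f"average uncorrected: {mean(uncorrected):.1f}")
print(f"average corrected:   {mean(corrected):.1f}")
```

Averaged over many samples, the uncorrected figure should settle near (n - 1) / n of the true variance, while the corrected figure settles near the true variance itself.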
import numpy as np

def variance(lst):
    """Returns a float which represents the variance of a given list, this
    function does not implement Bessel's correction.

    Parameters
    ----------
    lst : list
        A list which represents a dataset that is comprised entirely of numerical values

    Returns
    -------
    variance : float
        A float which represents the variance of the dataset
    """
    mean = np.mean(lst)
    v_sum = 0
    # For each value in the list, find the squared difference
    for value in lst:
        v_sum += (value - mean)**2
    # Multiply by 1 / n
    return v_sum / len(lst)
import numpy as np
# Will return the variance of the numeric dataset w/o Bessel's correction
np.var(lst)
# Will return the variance of the numeric dataset w/ Bessel's correction
np.var(lst, ddof=1)
# Will return the standard deviation of the dataset w/o Bessel's correction
np.std(lst)
# Will return the standard deviation of the dataset w/ Bessel's correction
np.std(lst, ddof=1)
Notice above that NumPy's variance and standard deviation functions do not implement Bessel's correction by default. However, there is an additional parameter that can be used to implement the correction: ddof, short for "delta degrees of freedom". The functions divide the sum of squared differences by n - ddof.
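To make the n - ddof divisor concrete, here is a minimal plain-Python sketch. variance_ddof is a hypothetical helper that mirrors the parameter name, not a numpy function, and the results are compared against the standard library's pvariance and variance.

```python
from statistics import pvariance, variance

def variance_ddof(values, ddof=0):
    """Divide the sum of squared differences by n - ddof (hypothetical helper)."""
    n = len(values)
    center = sum(values) / n
    squared_diffs = sum((value - center) ** 2 for value in values)
    return squared_diffs / (n - ddof)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance_ddof(data), pvariance(data))         # both divide by n = 8
print(variance_ddof(data, ddof=1), variance(data))  # both divide by n - 1 = 7
```

For this data the sum of squared differences is 32, so the two results are 32 / 8 = 4.0 and 32 / 7.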
std dev
The standard deviation is the square root of the variance.
# The standard deviation is the variance raised to the 1/2 power (its square root)
def sd(lst):
    """Returns a float which represents the standard deviation of a given list, this
    function does not implement Bessel's correction.

    Parameters
    ----------
    lst : list
        A list which represents a dataset that is comprised entirely of numerical values

    Returns
    -------
    sd : float
        A float which represents the standard deviation of the dataset
    """
    return variance(lst)**.5