Week 7ate9 Simple Linear Regression
LEC 1 New Topics
TUT/HW Topics
- `import statsmodels.formula.api as smf`
- `smf.ols`
- `smf.ols("y~x", data=df).fit()` and `.params`
- $\hat \beta_k$ versus $\beta_k$
- `smf.ols("y~x", data=df).fit().summary()` and `.tables[1]` for Testing "On Average" Linear Association
LEC 2 New Topics / Extensions
- Two(2) unpaired samples group comparisons
- Two(2) unpaired sample permutation tests
- Two(2) unpaired sample bootstrapping
- Indicator variables and contrasts in linear regression
Out of scope:
- Material covered in future weeks
- Anything not substantively addressed above...
- ...such as all the stuff around multi/bivariate normal distribution and their covariance matrices, ellipses and their math and visual weirdness outside of a 1:1 aspect ratio, and eigenvectors and eigenvalues and major axis lines, etc...
- ...such as the mathematical formulas for correlation, while just noting that they sort of just look like formulas for variance...
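For reference only (since, again, the formulas themselves are out of scope), here is the standard sample correlation alongside the sample variance, which is why they "sort of just look like" each other:

$$r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^n (x_i - \bar x)^2}\,\sqrt{\sum_{i=1}^n (y_i - \bar y)^2}} \qquad \text{versus} \qquad s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2$$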
For two random variables, correlation quantifies how strongly they tend to move together; but if two random variables are correlated, that does not mean one causes the other. Take something like the classic "ice cream sales" example: a better explanation for such correlations is that "summer" is the reason for increased ice cream sales (and for whatever ice cream sales happen to be correlated with).
There are two other considerations to keep in mind as well regarding the "correlation is not causation" adage.

- Are the variables being measured the real causes? Or are they just proxies for what's really happening? For example, do parents' heights cause their children's heights? Well, no, it's actually their genes and those subsequent effects that actually contribute to a child's height. Parent height is just a proxy for their genes.
- Real causation is complicated, so there are ALWAYS MANY COMPLEX THINGS contributing to causal pathways. If we say "$X$ causes $Y$" we have already started to lose the more holistic perspective of the situation, which is actually the more correct view. We should avoid oversimplifications that lead us towards an incomplete view of what's actually happening. Just say $X$ and $Y$ correlate.
Correlation measures the strength of "straight line" linear association between two random variables.
The following is Anscombe's Quartet, the classic demonstration showing that correlation as a measure of the strength of "straight line" linear association only "works" if there is indeed linear association in the data. Below this figure is the "American football shape" that can be seen in Galton's famous "regression to the mean" data.
```python
import seaborn as sns
import plotly.graph_objects as go
import numpy as np
from plotly.subplots import make_subplots

# Load Anscombe's quartet from Seaborn
df = sns.load_dataset("anscombe")

# Function to calculate stats (correlation, mean, and std)
def calculate_stats(subset):
    correlation = np.corrcoef(subset['x'], subset['y'])[0, 1]
    mean_x = subset['x'].mean()
    mean_y = subset['y'].mean()
    std_x = subset['x'].std()
    std_y = subset['y'].std()
    return correlation, mean_x, mean_y, std_x, std_y

# Create a 2x2 subplot layout with Plotly
fig = make_subplots(rows=2, cols=2, vertical_spacing=0.1,
                    subplot_titles=[f"Dataset {d}" for d in df['dataset'].unique()])

# Plot scatter and regression line for each dataset
for i, dataset in enumerate(df['dataset'].unique()):
    subset = df[df['dataset'] == dataset]
    correlation, mean_x, mean_y, std_x, std_y = calculate_stats(subset)
    row, col = (i // 2) + 1, (i % 2) + 1

    # Add scatter plot
    fig.add_trace(go.Scatter(x=subset['x'], y=subset['y'],
                             mode='markers', marker=dict(size=10),
                             name=f"Dataset {dataset}"), row=row, col=col)

    # Add regression line
    slope, intercept = np.polyfit(subset['x'], subset['y'], 1)
    fig.add_trace(go.Scatter(x=subset['x'], y=slope * subset['x'] + intercept,
                             mode='lines', name=f"Fit {dataset}",
                             line=dict(color='red')), row=row, col=col)

    # Add a separate trace for the annotation in the bottom right corner
    stats_annotation = (f"Corr: {correlation:.2f}<br>"f"Mean(x): {mean_x:.2f}<br>"
                        f"Mean(y): {mean_y:.2f}<br>"f"Std(x): {std_x:.2f}<br>"f"Std(y): {std_y:.2f}")
    fig.add_trace(go.Scatter(x=[max(subset['x']) - 1.2],
                             y=[min(subset['y']) + 1],
                             mode='text', text=[stats_annotation],
                             showlegend=False), row=row, col=col)

fig.update_layout(height=600, width=800, title_text="Anscombe's Quartet with Correlation, Mean, and Std Dev",
                  margin=dict(l=50, r=50, t=50, b=50))
fig.show()
```
```python
import statsmodels.api as sm
import plotly.express as px
import plotly.graph_objects as go
import numpy as np

# Load Galton's parent-child height dataset
galton_data = sm.datasets.get_rdataset("GaltonFamilies", "HistData").data

# Select midparentHeight and childHeight for classic Galton analysis
Y = galton_data['midparentHeight']
x = galton_data['childHeight']
# Usually x "predicts" Y but it's here backwards for aspect ratio purposes:
# an ellipse shape is actually way more fkn weird to look at (for a human) than you would first think
#x = galton_data['midparentHeight']
#Y = galton_data['childHeight']

# Create a scatter plot with alpha transparency
fig = px.scatter(galton_data, x='childHeight', y='midparentHeight',
                 title='Midparent vs Child Height')
fig.update_traces(marker=dict(size=8, opacity=0.5))

# THE DETAILS OF THE FOLLOWING ARE OUT OF SCOPE
# - multi/bivariate normal distribution and their covariance matrices
# - ellipses and their math and visual weirdness outside of a 1:1 aspect ratio
# - eigenvectors and eigenvalues and major axis lines, etc. etc.
# ALL OF THESE ARE way BEYOND THE SCOPE OF STA130:
# They're just here so that I can have some pictures for illustration

# Function to calculate the ellipse points (for the bivariate normal ellipse)
def get_ellipse(mean, cov, n_std=1.0, num_points=100):
    """Generate coordinates for a 2D ellipse based on covariance matrix."""
    theta = np.linspace(0, 2 * np.pi, num_points)
    circle = np.array([np.cos(theta), np.sin(theta)])  # unit circle
    # Ellipse transformation: scale by sqrt(eigenvalues) and rotate by eigenvectors
    eigvals, eigvecs = np.linalg.eigh(cov)
    ellipse_coords = np.dot(eigvecs, np.sqrt(eigvals)[:, np.newaxis] * circle * n_std)
    # Shift ellipse to the mean
    ellipse_coords[0] += mean[0]
    ellipse_coords[1] += mean[1]
    return ellipse_coords, eigvecs

# Calculate covariance matrix and mean
cov_matrix = np.cov(x, Y)
mean_vals = [np.mean(x), np.mean(Y)]

# Get ellipse coordinates and eigenvectors
ellipse_coords, eigvecs = get_ellipse(mean_vals, cov_matrix, n_std=2)
ellipse_x, ellipse_Y = ellipse_coords

# Get the first eigenvector (for the primary direction)
primary_direction = eigvecs[:, 1]  # First eigenvector (primary direction)

# Plot the ellipse
ellipse_trace = go.Scatter(x=ellipse_x, y=ellipse_Y, mode='lines',
                           line=dict(color='green', width=2, dash='dash'),
                           name='American Football Shape')
fig.add_trace(ellipse_trace)

# THE DETAILS OF THE FOLLOWING ARE OUT OF SCOPE
# - multi/bivariate normal distribution and their covariance matrices
# - ellipses and their math and visual weirdness outside of a 1:1 aspect ratio
# - eigenvectors and eigenvalues and major axis lines, etc. etc.
# ALL OF THESE ARE way BEYOND THE SCOPE OF STA130:
# They're just here so that I can have some pictures for illustration

# Add correlation annotation
correlation = np.corrcoef(x, Y)[0, 1]
annotations = [dict(x=min(x), y=max(Y), xanchor='left', yanchor='top',
                    showarrow=False, text=f'Correlation: {correlation:.2f}',
                    font=dict(color='black'), bgcolor='white')]

# Update layout with corrected annotations
fig.update_layout(title="Galton's Midparent vs Child Height:<br>Regression to the mean IS NOT the Primary Direction",
                  xaxis_title="Child Height", yaxis_title="Midparent Height", annotations=annotations)

# Set square aspect ratio for the axes
fig.update_xaxes(scaleanchor="y")     # X-axis is anchored to Y-axis
fig.update_yaxes(constrain="domain")  # Constrain the Y-axis to the domain of the plot
#fig.update_xaxes(range=[55, 80])     # Fixed x limits
fig.update_layout(height=400, width=800)
fig.show()
```
Hopefully the equation of a line, $y = mx + b$ (a slope $m$ and an intercept $b$), doesn't feel too unfamiliar. What can be pointed out here is that when `trendline='ols'` is added to Galton's parent-child heights data cloud, there is certainly going to be a "straight line" linear association in the data, and it's given by exactly such an equation, as shown in the code below.
```python
# You can see this if you replace `fig = px.scatter(...` above with this
fig = px.scatter(galton_data, x='childHeight', y='midparentHeight',
                 trendline='ols',  # Add a linear trendline
                 title='Midparent vs Child Height with Trendline')

# And the line in this case happens to be the following,
# which you can confirm by including or not including the code below
m = 0.1616177526315975  # Slope
b = 58.4194455765732    # Intercept
trendline_x = np.array([55, 79])
trendline_Y = b + m*np.array([55, 79])
fig.add_trace(go.Scatter(x=trendline_x, y=trendline_Y, mode='lines',
                         line=dict(color='purple', width=2),
                         name='yhat = 58.42 + 0.16x'))
```
Simple Linear Regression is just a normal distribution. It just so happens to have a mean location parameter that is controlled by a line (a linear function of the predictor).
To simulate data from the Simple Linear Regression, and see what correlation it induces, you can use the following code.
```python
import numpy as np
import pandas as pd
import plotly.express as px

df = pd.DataFrame({'x': x, 'Y': Y})

# Calculate the correlation
correlation = np.corrcoef(df['x'], df['Y'])[0, 1]

fig = px.scatter(df, x='x', y='Y',
                 trendline='ols',  # Add a linear trendline
                 trendline_color_override='red',
                 title='Y vs x')
fig.update_traces(marker=dict(size=8, opacity=0.5))
fig.add_annotation(text=f'Correlation: {correlation:.2f}',
                   xref='paper', yref='paper',
                   x=0.05, y=0.95, showarrow=False,
                   font=dict(size=12, color='black'), bgcolor='white')
fig.show()
```
Here are a couple example specifications that could be used for the simulation. Feel free to play around with the different choices of the parameters a bit to get a sense of how this model can be used to simulate different relationships and correlations between $x$ and $Y$.
```python
from scipy import stats

# Version 1
n = 25
# Arbitrarily define x and then generate Y
x = stats.uniform(10, 10).rvs(size=n)
Y = 0 + x + stats.norm(loc=0, scale=10).rvs(size=n)

# Version 2
n = 934
# Arbitrarily define x and then generate Y
x = galton_data['midparentHeight']
# x = stats.norm(loc=x.mean(), scale=x.std()).rvs(size=n)
beta0 = -100  # galton_data['childHeight'].mean()
beta1 = 2
Y = beta0 + beta1*x + stats.norm(loc=0, scale=3).rvs(size=n)
```
Did you figure out how you can control the correlation with the simple linear model? Does it depend on the line itself (the intercept and slope), on the spread of the noise around the line, or on both?
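To help answer that, here's a minimal sketch (not part of the original specifications; the particular `sigma` values are arbitrary choices) showing that, for a fixed line, making the noise scale larger weakens the correlation it induces:

```python
import numpy as np
from scipy import stats

np.random.seed(130)
n = 1000
x = stats.uniform(10, 10).rvs(size=n)
# Same line every time (slope 1, intercept 0), but increasing noise scale sigma
for sigma in [1, 5, 10, 25]:
    Y = 0 + 1*x + stats.norm(loc=0, scale=sigma).rvs(size=n)
    print(f"sigma = {sigma:>2}: correlation = {np.corrcoef(x, Y)[0, 1]:.2f}")
```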
Here's some terminology that you need to know to talk about Simple Linear Regression.
- $Y_i$ is the so-called outcome
- $x_i$ is then the predictor
- $\beta_0$ is the intercept coefficient
- $\beta_1$ is the slope coefficient
- $\epsilon_i$ is called the error term
That's it.
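For reference, all of these pieces together make up the theoretical Simple Linear Regression model that is written out in full later in this section:

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \text{where} \quad \epsilon_i \sim \mathcal{N}(0, \sigma)$$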
But here's a little more commentary on these.
- Outcome $Y_i$ is a continuous numeric variable
- Outcome $Y_i$ can also be called a response, dependent, or endogenous variable in some domains and contexts
- Predictor variable $x_i$ is a numeric variable
- For now we'll consider $x_i$ to be a continuous numeric variable, but this is not necessary, and we will consider other versions of $x_i$ later
- Predictor variable $x_i$ can also be called an explanatory, independent, or exogenous variable, or a covariate or feature (which are the preferred terms in the statistics and machine learning domains, respectively)
- Intercept $\beta_0$ and slope $\beta_1$ coefficients are the two primary parameters of a Simple Linear Regression model
- Intercept and slope describe a linear ("straight line") relationship between outcome $Y_i$ and predictor variable $x_i$
- Error $\epsilon_i$ (also sometimes called the noise) makes Simple Linear Regression a statistical model by introducing a random variable with a distribution
- The $\sigma$ parameter is a part of the noise distribution and controls how much vertical variability/spread there is in the $Y_i$ data off of the line: $\sigma$ is an "auxiliary" parameter in the sense that there is usually more interest in $\beta_0$ and $\beta_1$ than in $\sigma$
- Errors $\epsilon_i$ (in conjunction with the linear form) define the assumptions of the Simple Linear Regression model specification, but these assumptions are not the focus of further detailed review here
Further details regarding the assumptions are as follows, but they are not the focus of further detailed review now.
The first four assumptions associated with the Simple Linear Regression model are that

- the $\epsilon_i$ errors (sometimes referred to as the noise) are normally distributed
- the $\epsilon_i$ errors are homoscedastic (so their distributional variance $\sigma$ does not change as a function of $x_i$)
- the linear form is [at least reasonably approximately] "true" (in the sense that the above two remain [at least reasonably approximately] "true"), so that the behavior of the $Y_i$ outcomes is represented/determined on average by the linear equation
- and the $\epsilon_i$ errors are statistically independent (so their values do not depend on each other)

and there are additional assumptions; but, a deeper reflection on these is "beyond the scope" of STA130; nonetheless, they are that

- the $x_i$ predictor variable is measured without error
- and the $\epsilon_i$ errors are unbiased relative to the expected value of the outcome $E[Y_i|x_i] = \beta_0 + \beta_1 x_i$ (which is equivalently stated by saying that the mean of the error distribution is $0$, or again equivalently, that the expected value of the errors is $E[\epsilon_i] = 0$)
The Python modules providing analogous functionality to the "base" R statistical programming language are the `scipy.stats` and `statsmodels` modules. The `scipy.stats` module provides functionality related to statistical distributions (which we've previously started exploring), but it also provides some hypothesis testing capabilities (which we've not examined since we've instead been exploring this topic through simulation). The `statsmodels` module, on the other hand, combines the distributional and hypothesis testing concepts of the `scipy.stats` module to allow us to "fit" linear regression models and examine various aspects of the fitted models, including analyzing them using hypothesis testing.

The `statsmodels` module is well-regarded for its comprehensive range of tools for statistical modeling and hypothesis testing, which help us evaluate and understand evidence about relationships between variables in many fields, such as economics, biology, and social sciences. While there are a number of ways to use the `statsmodels` module, for STA130 we'll always use its "formula" version.
```python
import statsmodels.formula.api as smf
```
The "ols" in `smf.ols()` stands for "ordinary least squares". The ordinary least squares methodology will be explored further in HW7ate9, but suffice it to say for now that it is "how linear regression models are fit to data" in STA130. To specify a simple linear regression model in the context of a pandas DataFrame object dataset `df`, where the ("independent" predictor) column `x` variable is used to predict the ("dependent" outcome) column `y` variable, use the "formula" `"y ~ x"`, which corresponds to
```python
linear_specification = "y ~ x"  # automatically assumes the "Intercept" beta0 will be included
model_data_specification = smf.ols(linear_specification, data=df)
```
In the example above, `x` and `y` would be names of columns in `df`. If `x` and `y` in `df` respectively corresponded to "parents average height" and "child's height" (as in Francis Galton's classical study, which popularized the concept of "regression to the mean" that all regression methodologies are now named after), then `model_data_specification` indicates using "parents average height" (`x`) to predict "child's height" (`y`). If, rather than `x` and `y`, the column names were just `parent_height` and `child_height` directly, then we would instead just use
```python
import pandas as pd
import statsmodels.formula.api as smf

# Simplified version of Galton's data: this dataset is small and simplified for illustrative purposes
# The actual dataset used by Galton was much larger and included more precise measurements and other variables; however,
# this should give you a sense of the type of data Galton worked with
francis_galton_like_data = {
    'parent_height': [70.5, 68.0, 65.5, 64.0, 63.5, 66.5, 67.0, 67.5, 68.0, 70.0,
                      71.5, 69.5, 64.5, 67.0, 68.5, 69.0, 66.0, 65.0, 67.5, 64.0],
    'child_height': [68.0, 66.5, 65.0, 64.0, 63.0, 65.5, 67.0, 67.5, 68.0, 70.0,
                     70.5, 69.5, 63.5, 66.0, 68.5, 69.0, 66.0, 64.5, 67.5, 63.5]
}
francis_galton_df = pd.DataFrame(francis_galton_like_data)

linear_specification = "child_height ~ parent_height"  # not `linear_specification = "y~x"`
model_data_specification = smf.ols(linear_specification, data=francis_galton_df)
```
The `linear_specification` is a so-called "R-style formula" which provides a simple way to specify the linear form of a regression model. A "formula" of the form `"y ~ x"` automatically assumes that the "Intercept" will be included in the model. `"child_height ~ parent_height"` is a simple linear regression specification because there is a single predictor variable; whereas, `"child_height ~ parent_height + nationality"` is a multiple linear regression specification because there is more than one predictor variable. We will return to multiple linear regression in Week 07, so consideration of this topic can be safely postponed for now.
If the columns referenced by the linear form defined in `linear_specification` have "spaces" or "special characters" (like an apostrophe), then a special "quote" `Q("...")` or backtick syntax is required to reference them. It might be easier and better for readability to just use `df.rename(...)` instead to change the name in these situations...
```python
linear_specification = 'Q("child\'s height") ~ Q("parents average height")'  # or
linear_specification = "`child's height` ~ `parents average height`"
```
After specifying the linear form and the dataset of a regression model with `smf.ols(...)`, the model is then estimated or "fit" using the `smf.ols(...).fit()` method, which provides the fitted model (an estimate of the theoretical model) based on the dataset.
```python
data_fitted_model = model_data_specification.fit()  # estimate model coefficients and perform related calculations
data_fitted_model.params  # the estimated model coefficients beta0 and beta1 (in the case of simple linear regression)
```
For `linear_specification = "child_height ~ parent_height"`:

- The rows of `data_fitted_model.params` will be `Intercept` ($\hat \beta_0$) and `parent_height` ($\hat \beta_1$)
- The `Intercept` $\hat \beta_0$ is automatically assumed to be a part of the linear form specification, and the name of the fitted slope coefficient $\hat \beta_1$ will match that of the "independent" predictor variable used in the model
  - The fitted slope coefficient $\hat \beta_1$ is an estimate of the average change in the "dependent" outcome variable for a "one unit" increase of the "independent" predictor variable
  - The fitted intercept coefficient $\hat \beta_0$ must then be the estimate of the average of the "dependent" outcome variable when the "independent" predictor variable has the value $0$
    - $\hat \beta_0$ may sometimes have a sensible interpretation, but often (as in the case of `parent_height`, for which an "average parents height of $0$" could never happen) it doesn't really correspond to anything meaningful in the real world
    - $\hat \beta_1$ is really what captures and represents the best estimate of the linear relationship between `x` and `y` based on the data you have, since this estimates the "average change in the outcome for a one-unit increase in the predictor"
- The fitted model is the "straight line" that (with respect to the ordinary least squares methodology) "best" reflects the linear association observed in the dataset between the "independent" predictor and "dependent" outcome variables, corresponding to the hypothesized linear association of the theoretical model
The fitted [(simple) linear regression] model $\hat y_i = \hat \beta_0 + \hat \beta_1 x_i$ estimates the theoretical [(simple) linear regression] model

$$Y_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i , \sigma) \quad \text{ or } \quad Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \text{ where } \quad \epsilon_i \sim \mathcal{N}(0, \sigma)$$

If the theoretical model were "true" with $\beta_1 = 2$, then an estimated coefficient of $\hat \beta_1 = 1.86$ would be the result of sample-to-sample variability. A different sample might produce an estimated coefficient of $\hat \beta_1 = 2.18$. In practice, $\hat \beta_1$ is used to make inferences about $\beta_1$.
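As a concrete illustration of this sample-to-sample variability, here's a minimal simulation sketch (the particular `beta0`, `sigma`, and `n` values are arbitrary choices for illustration, not from the notes):

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

np.random.seed(130)
beta0, beta1, sigma, n = 10, 2, 5, 50  # true slope beta1 = 2
estimated_slopes = []
for _ in range(5):
    # Each iteration is a fresh sample from the same theoretical model
    x = stats.uniform(0, 10).rvs(size=n)
    Y = beta0 + beta1*x + stats.norm(loc=0, scale=sigma).rvs(size=n)
    fit = smf.ols("Y ~ x", data=pd.DataFrame({'x': x, 'Y': Y})).fit()
    estimated_slopes.append(fit.params['x'])
print(estimated_slopes)  # each value estimates beta1 = 2, but varies from sample to sample
```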
The estimated `data_fitted_model.params` can be visualized using `px.scatter(..., trendline="ols")`, since `"ols"` indicates the same ordinary least squares fit of the data used by `smf.ols(...).fit()`.
```python
import plotly.express as px

# Create the scatter plot with the OLS fit line
px.scatter(francis_galton_df, x='parent_height', y='child_height',
           title="Simple Linear Regression", trendline="ols")
```
The fitted model coefficients from the `.params` attribute determine the model's "predicted" fitted values, which are available from the `.fittedvalues` attribute.
```python
data_fitted_model.fittedvalues  # model "predicted" *fitted values*, which for the current example is equivalent to
# y_hat = data_fitted_model.params[0] + data_fitted_model.params[1] * francis_galton_df['parent_height']
```
The fitted values are thus the model's "predictions" $\hat y_i = \hat \beta_0 + \hat \beta_1 x_i$ at each of the observed $x_i$ values. The notion that fitted values use $(y_i, x_i)$ to make their own "predictions" $(\hat y_i, x_i)$ raises the topic of in sample versus out of sample prediction. We will return to this topic and the use of the `.predict()` method (as opposed to the `.fittedvalues` attribute) in the context of multiple linear regression in Week 07.
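For a quick preview of the difference, here's a hedged sketch of `.predict()` producing predictions at new predictor values that were not in the original dataset; the `new_parents` values below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical new predictor values (not from the data above) to illustrate "out of sample" prediction
new_parents = pd.DataFrame({'parent_height': [64.0, 67.0, 70.0]})
data_fitted_model.predict(new_parents)  # yhat = beta0_hat + beta1_hat * parent_height for each new value
```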
The squared correlation between the "predicted" fitted values $\hat y_i$ and the observed "dependent" outcomes $y_i$ is available from the `.rsquared` attribute of a fitted model. While the mathematical details are beyond the scope of STA130, this R-squared measures "the proportion of variation explained by the model". So, e.g., an R-squared value of 0.7 means that 70% of the "variation" in the "dependent" outcome variable $y_i$ is "explained" by the model.
```python
import numpy as np

# This measures "the proportion of variation explained by the model"
data_fitted_model.rsquared  # model **R-squared**, which for the current example is equivalent to
# np.corrcoef(data_fitted_model.fittedvalues, francis_galton_df['child_height'])[0,1]**2
```
Analogously to the fitted values, the residuals of the fitted model, $\hat \epsilon_i = y_i - \hat y_i$, are available from the `.resid` attribute.
```python
data_fitted_model.resid
```
The residuals of the fitted model can be used to diagnostically assess the plausibility of the model's assumptions, as in the following.
```python
# If *residuals* do not appear approximately normally distributed, the "normality assumption" is implausible
residual_normality = px.histogram(data_fitted_model.resid, title="Histogram of Residuals")
residual_normality.update_layout(xaxis_title="Does this appear normally distributed?", yaxis_title="Count")
residual_normality.show()

# If *residuals* appear to systematically change over the range of y-hat, the "homoscedasticity assumption" is implausible
residual_heteroskedasticity = px.scatter(x=data_fitted_model.fittedvalues, y=data_fitted_model.resid,
                                         title="Does this systematically change over y-hat?")
residual_heteroskedasticity.update_layout(xaxis_title="Fitted Values", yaxis_title="Residuals")
residual_heteroskedasticity.add_hline(y=0, line_dash="dash", line_color="red")
residual_heteroskedasticity.show()

# If y versus y-hat does not follow the "linearity" of the "y=x" line, the "linearity assumption" is implausible
fitted_values = data_fitted_model.fittedvalues
residual_linearity = px.scatter(x=fitted_values, y=francis_galton_df['child_height'],
                                title="Is this relationship 'linear' in a 'y=x' way?")
residual_linearity.update_layout(xaxis_title="Fitted Values", yaxis_title="Outcome")
residual_linearity.add_scatter(x=[min(fitted_values), max(fitted_values)], y=[min(fitted_values), max(fitted_values)],
                               mode='lines', line=dict(color='red', dash='dash'), name="The 'y=x' line")
residual_linearity.show()
```
The information needed for statistical hypothesis testing analysis of the fitted model is accessed using the `.summary()` method.
```python
data_fitted_model.summary()  # generates a detailed report of statistical estimates and related information; but, ...
data_fitted_model.summary().tables[1]  # what we need for *hypothesis testing* is contained in its `.tables[1]` attribute
```
The `coef` column of `data_fitted_model.summary().tables[1]` corresponds to `data_fitted_model.params` discussed above; but, the additional `P>|t|` column includes the p-values associated with a hypothesis test for each of the coefficients.
- Specifically, the values in `P>|t|` are p-values for a null hypothesis that the true coefficient value is zero.
- In the case of the true slope coefficient $\beta_1$ representing the "average change in the outcome for a one-unit increase in the predictor",
  - this null hypothesis corresponds to the assumption of "no linear association between the outcome and predictor variables 'on average'".
  - Formally, this "no linear relationship" null hypothesis would then be stated as $H_0: \beta_1 = 0$, where $\beta_1$ is the hypothesized parameter value which the associated statistic $\hat \beta_1$ estimates.
  - And rarely would there be interest in $H_0: \beta_0 = 0$, since an intercept value of $0$ does not generally have a meaningful interpretation, and the context of linear regression is typically just concerned with evaluating the relationship between the outcome and predictor variables.
Recalling the definition of a p-value, in the current context a p-value would be "the probability that a dataset would result in an estimated coefficient as or more extreme than the observed coefficient if the null hypothesis (that the true coefficient value is zero) were true". The familiar "as or more extreme" definition of the p-value is reflected in the `P>|t|` column name, which can further be seen to indicate a "two-sided" hypothesis test specification.
A small p-value for the slope coefficient therefore provides evidence against the null hypothesis of "no linear association between the outcome and predictor variables 'on average'".
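If you'd rather pull the p-value out programmatically than read it off the table, one way (a small sketch, assuming the `data_fitted_model` fit above) is the `.pvalues` attribute, which is indexed by coefficient name:

```python
print(data_fitted_model.pvalues)                   # p-values for `Intercept` and `parent_height`
print(data_fitted_model.pvalues['parent_height'])  # the p-value for the H0: beta1 = 0 "no linear association" test
```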
The `std err` and `t` columns relate to the sampling distribution "under the null", and the `[0.025 0.975]` columns provide inference for estimated coefficients in the form of confidence intervals, but all of this is beyond the scope of STA130, so we'll just concern ourselves with appropriately using the p-values from the `P>|t|` column.
- one, paired, and two sample tests
So far in the course we've seen paired samples turned into paired sample differences, and we've seen the one sample "coin flipping" null hypothesis. Both of these are really just treated as single sample analyses. The following table breaks this down in order to demonstrate the natural generalization of this situation where we ACTUALLY really have two independent samples.
|  | Paired Samples | Unpaired Samples |
|---|---|---|
| One Sample | Paired Difference (continuous), Paired Comparisons (binary) | Coin Flipping |
| Two Sample | Treated as One Sample | NEW CATEGORY TO CONSIDER!!! |
By two independent samples we mean we have two unpaired samples representing two different groups. In this context it is natural for interest to focus on potential differences between the two populations from which the two samples are drawn. Interestingly, this is not so different from the interest in the paired sample context. The difference is that there the two groups may be the same individuals before and after some intervention, or some other natural pairing such as twins, husbands and wives, a parent and a child, etc. But again, interest in these contexts lies in examining differences or changes between two "populations" in some abstract sense. The real difference, then, between paired and unpaired samples is that continuous paired differences or binary paired comparisons allow a paired sample context to be treated in the manner of a single sample analysis.
But this then shows that the real difference in consideration is the difference between one sample and two sample analysis contexts. But interestingly, there is again a similarity across these two different modes of analysis that is worth emphasizing. Namely, characterizing evidence using hypothesis testing and performing statistical inference using bootstrapped confidence intervals are (perhaps as should be expected) available for both one sample and two sample data analysis contexts. The next two sections will discuss these new methods, which are respectively (obviously) referred to as Permutation Testing (based on label shuffling) and what we'll call just "Double" Bootstrapping for the purposes of our course (even though that's not an "official" name).
|  | One Sample | Two Sample |
|---|---|---|
| Hypothesis Testing using p-values | Sampling Distribution | Permutation Test (label shuffling) |
| Confidence Intervals for Inference | Bootstrapping | "Double" Bootstrapping |
The idea of a permutation test starts with the null hypothesis that there is no difference between the two groups. If this null hypothesis is true, then the group labels would not actually have any meaning, so "label shuffling" would not change anything about the data in any meaningful way. So, with respect to this null hypothesis, shuffled group labels are just as plausible as the actually observed group labels, as illustrated in the following table.
| Observation value | Actual group label | Shuffled group label 1 | Shuffled group label 2 | Shuffled group label 3 |
|---|---|---|---|---|
| 7 | A | A | B | B |
| 4 | A | A | A | B |
| 7 | B | A | A | A |
| 9 | B | B | A | B |
| 1 | A | B | B | A |
| 1 | B | B | B | A |
However, this label shuffling then provides a mechanism to simulate samples under the assumption that the null hypothesis is true. And of course these can be used to simulate the sampling distribution of our statistic of interest under the assumption that the null hypothesis is true. And then, from there, we can estimate the p-value of the actually observed statistic relative to this sampling distribution.
To illustrate this, suppose we're VERY interested in understanding the Pokémon universe, and in particular we REALLY care to learn if "Water" and "Fire" Pokémon represent an apartheid-esque patriarchal tyranny. As we all no doubt are.
```python
import pandas as pd
import plotly.express as px

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
# fail https://github.com/KeithGalli/pandas/blob/master/pokemon_data.csv
pokeaman = pd.read_csv(url)

fire = pokeaman['Type 1'] == 'Fire'
water = pokeaman['Type 1'] == 'Water'
pokeaman[ fire | water ]
```
In this case then, what we're interested in asking is whether or not the difference observed in the box plot below could just be due to the random chance of the sample of Pokémon from the Pokémon universe that we happen to know about here on earth thanks to "[t]he Pokémon Company (株式会社ポケモン, Kabushiki Gaisha Pokemon, TPC)", which according to https://en.wikipedia.org/wiki/The_Pokémon_Company is "a Japanese company responsible for" letting humans here on earth learn about the Pokémon universe thanks to their efforts regarding "brand management, production, publishing, marketing, and licensing of the Pokémon franchise, which consists of video games, a trading card game, anime television series, films, manga, home entertainment products, merchandise, and other ventures."
```python
display(pd.DataFrame(pokeaman[ fire | water ].groupby('Type 1')['Sp. Atk'].mean()))
display(pd.DataFrame(pokeaman[ fire | water ].groupby('Type 1')['Sp. Atk'].mean().diff()))
print(pokeaman[ fire | water ].groupby('Type 1')['Sp. Atk'].mean().diff().values[1])

fig = px.box(pokeaman[ fire | water ], x="Type 1", y="Sp. Atk",
             title="Distribution of 'Sp. Atk' between Water and Fire Type Pokémon: Are These Different??")
fig.show()  # USE `fig.show(renderer="png")` FOR ALL GitHub and MarkUs SUBMISSIONS

# https://stackoverflow.com/questions/52771328/plotly-chart-not-showing-in-jupyter-notebook
# import plotly.offline as pyo
# # Set notebook mode to work in offline
# pyo.init_notebook_mode()
```
So, indeed, how DO we determine if the average difference observed in the above box plots is likely to reflect ACTUAL Pokémon racism? Or if it might just reflect the random chance of the adventures of Ash Ketchum, Pikachu, Serena, Misty, Togepi, and Brock?

That group DOES drink a lot (it is known) and they don't really seem to actually ever walk very straight towards any sensible objective of tryin' to get somewhere (or anywhere for that matter) in the show, just sayin'. And I mean, getting "random battles" in the game is, indeed, as one would say, pretty random.

Well, we do a bunch of label shuffling to simulate samples under the assumption that the null hypothesis is true. Ah, like-a-so, where from here we simulate the sampling distribution of our sample statistic of interest under the assumption that the null hypothesis (that there is no difference between the groups, so labels don't actually matter) is true.
```python
import numpy as np

groups4racism = pokeaman[ fire | water ].copy()

# Set parameters for the simulation
n_bootstraps = 1000  # Number of simulated "label shuffling" permutations
sample_size = len(groups4racism)  # Sample size matches the original dataset

label_permutation_mean_differences = np.zeros(n_bootstraps)
for i in range(n_bootstraps):
    # Shuffle the labels: sampling *without* replacement gives a random permutation of 'Type 1'
    groups4racism['Shuffled Pokeaman Race'] = \
        groups4racism['Type 1'].sample(n=sample_size, replace=False).values
    label_permutation_mean_differences[i] = \
        groups4racism.groupby('Shuffled Pokeaman Race')['Sp. Atk'].mean().diff().values[1]
```
So the above code assumes there's no racism in the Pokémon universe, and based on this, and the label shuffling we can therefore do under this assumption, we can then compute the p-value of our observed sample statistic (relative to the assumption of this null hypothesis under consideration) in order to complete the permutation test.
```python
fig = px.histogram(pd.DataFrame({"label_permutation_mean_differences": label_permutation_mean_differences}), nbins=30,
                   title="Mean Difference Sampling under SHUFFLED Pokeaman Race")

mean_difference_statistic = groups4racism.groupby('Type 1')['Sp. Atk'].mean().diff().values[1]
fig.add_vline(x=mean_difference_statistic, line_dash="dash", line_color="red",
              annotation_text=f".<br><br>Shuffled Statistic >= Observed Statistic: {mean_difference_statistic:.2f}",
              annotation_position="top right")
fig.add_vline(x=-mean_difference_statistic, line_dash="dash", line_color="red",
              annotation_text=f"Shuffled Statistic <= Observed Statistic: {-mean_difference_statistic:.2f}<br><br>.",
              annotation_position="top left")
fig.show()  # USE `fig.show(renderer="png")` FOR ALL GitHub and MarkUs SUBMISSIONS

print("p-value",
      (abs(label_permutation_mean_differences) >= abs(mean_difference_statistic)).sum()/n_bootstraps)
```
Hmm... that p-value amounts to STRONG evidence against the null hypothesis according to the table below. I'd therefore REJECT the null hypothesis assumption that the (Fire or Water) type of Pokémon doesn't matter when it comes to the "Sp. Atk" power that you're gonna have. Pokémon def probably racist, y'all. Sorry to burst your bubble.
| p-value | Evidence |
|---|---|
| $p > 0.1$ | No evidence against the null hypothesis |
| $0.1 \ge p > 0.05$ | Weak evidence against the null hypothesis |
| $0.05 \ge p > 0.01$ | Moderate evidence against the null hypothesis |
| $0.01 \ge p > 0.001$ | Strong evidence against the null hypothesis |
| $0.001 \ge p$ | Very strong evidence against the null hypothesis |
As always, making a decision regarding a null hypothesis based on the strength of evidence executes the scientific process that hypothesis testing is. But it doesn't tell us the full story. We get that by considering statistical inference. And as you will hopefully recall, the way we can provide statistical inference is through confidence intervals. And of course the way we create confidence intervals in our class is through bootstrapping. The trick to confidence intervals in the two sample context is to simply perform so-called "double" bootstrapping. Now don't get confused, "double" bootstrapping IS NOT double `for` loops. Actually, "double" bootstrapping is just a name that got made up (literally yesterday) because it describes how bootstrapping works in the two independent sample context. Namely, EACH SAMPLE is bootstrapped separately, side-by-side. So, bootstrapping for the two independent sample context is thus "double" bootstrapping. The code below illustrates exactly what is meant by this.
```python
within_group_bootstrapped_mean_differences = np.zeros(n_bootstraps)
for i in range(n_bootstraps):
    double_bootstrap = \
        groups4racism.groupby("Type 1")[["Type 1","Sp. Atk"]].sample(frac=1, replace=True)
    within_group_bootstrapped_mean_differences[i] = \
        double_bootstrap.groupby('Type 1')["Sp. Atk"].mean().diff().values[1]

np.quantile(within_group_bootstrapped_mean_differences, [0.05, 0.95])
```
Are you able to tell what's happening here? First, the original data is separated into its groups by the `groupby`. Then the labels and the data are selected (so all other columns in the data frame are removed). Then, within each group, each subgroup sample is bootstrapped in the usual manner. The two sample means and their difference are then computed. This of course is the mean difference statistic of interest. And finally, a 90% "double" bootstrapped confidence interval is created. If you don't like this confidence level, screw you. Sue me. Ah, but you can't. 'Cuz it's Canada, not the US. Too bad so sad, sorry I'm not sorry (but VERY Canadian since I'm saying sorry when I don't really mean it).
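And here's a small follow-up sketch (just an illustration, not part of the original code) of how such a "double" bootstrapped confidence interval can be used to make a decision: check whether the "no difference" value of 0 falls inside it.

```python
# If 0 is NOT inside the 90% "double" bootstrapped interval, the "no difference between groups"
# hypothesis is not plausible at the corresponding confidence level
lower, upper = np.quantile(within_group_bootstrapped_mean_differences, [0.05, 0.95])
print((lower, upper), "... contains 0?", lower <= 0 <= upper)
```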
So, with permutation testing and the "double" bootstrap methods above we can provide evidence against a null hypothesis and inference for a population parameter. And these two methods are quite great. But is there another way, you ask? Why yes, there is. How did you know? Maybe because there's probably actually about a million other ways. But they're all based exactly on the idea of a sampling distribution of a statistic. That is, ALL STATISTICAL METHODS ARE BASED ON either (A) the sampling distribution of a statistic under a null hypothesis (in order to compute p-values and perform hypothesis testing) or (B) a sampling distribution of the sample statistic (in order to provide inference about the corresponding population parameter). For the latter task (of inference), we'll often rely on the bootstrapped sampling distribution of the sample statistic which we've (hopefully) become increasingly familiar and comfortable with through the methods of our course. But, in later statistics courses you'll additionally learn how to create other kinds of confidence intervals based on approximations derived from theoretical derivations (based on some assumptions about the data). In fact, the "statistic plus and minus two standard errors" is exactly such a theoretically derived confidence interval approximation (based on some assumptions about the data). This style of "statistic plus and minus two standard errors" confidence interval is in fact provided in the output of fitted simple linear regression models. So we'll note that later. But first, to start with, let's begin by considering how to use simple linear regression models for hypothesis testing in the two independent samples context.
The idea that we need to do this is what's called an indicator variable. The simple linear regression model we've considered so far is

$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \text{where} \quad \epsilon_i \sim \mathcal{N}(0, \sigma)$$

But here's what amounts to being essentially the same model, except a qualitative categorical variable $k_i$ (a group label) now plays the role of the predictor through an indicator variable:

$$Y_i = \beta_0 + \beta_1 \times 1_{[\text{"the group"}]}(k_i) + \epsilon_i, \quad \text{where} \quad \epsilon_i \sim \mathcal{N}(0, \sigma)$$

So under this specification, the indicator $1_{[\text{"the group"}]}(k_i)$ evaluates to either `True` or `False` for each observation. If `True`, then the binary variable is $1$; if `False`, then the binary variable is $0$. Since the indicator is `True` for exactly one of the two groups, that group is "the group" of the indicator, and "everybody else" is the "other group" where the indicator is `False` (or where an indicator for the "other group" would instead be `True`). So if the indicator is $0$, the model for that observation is just $Y_i = \beta_0 + \epsilon_i$, and the "other group" serves as the baseline. If the indicator is $1$, the model for that observation is $Y_i = \beta_0 + \beta_1 + \epsilon_i$.

So, the contrast $\beta_1$ captures the difference between the two groups "on average".
```python
import statsmodels.formula.api as smf

# Predict "Sp. Atk" based on "Type 1" (the Fire/Water group indicator)
model_fit = smf.ols(formula="Q('Sp. Atk') ~ Q('Type 1')", data=groups4racism).fit()
print(model_fit.rsquared)
model_fit.summary().tables[1]
```
| beta0-hat / beta1-hat | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 88.9808 | 4.070 | 21.860 | 0.000 | 80.943 | 97.019 |
| Q('Type 1')[T.Water] | -14.1683 | 4.926 | -2.876 | 0.005 | -23.895 | -4.442 |
From this output we see that "the group" of the indicator variable is `Water` (as shown by the `Q('Type 1')[T.Water]` row label). So the "other group" is `Fire` (just because that's the only other kind of pokeamans in this data). So what is the `Intercept`? It's $\hat \beta_0 = 88.9808$, the estimated average of the outcome (`Sp. Atk`) in the `Fire` group. So then $\hat \beta_1 = -14.1683$ is the estimated "contrast", i.e., the change going from the average of the `Fire` group to the average of the `Water` group (which is therefore estimated to be $88.9808 - 14.1683 = 74.8125$).

And what else can we say here? We have a p-value, relative to the null hypothesis $H_0: \beta_1 = 0$ of "no difference between the groups on average", of `0.005`.
Again this indicates strong evidence against the null hypothesis.
This should by now feel very reminiscent of the permutation testing we previously considered. Indeed, what we have here is a theoretical p-value based on an approximation derived on the basis of the assumptions of the regression model (which are encoded in the distributional assumption on the error terms, $\epsilon_i \sim \mathcal{N}(0, \sigma)$).
But now remember that ALL STATISTICAL METHODS ARE BASED ON either (A) the sampling distribution of a statistic under a null hypothesis (in order to compute p-values and perform hypothesis testing) or (B) a sampling distribution of the sample statistic (in order to provide inference about the corresponding population parameter).
Regarding (A), the definition of the p-value used to give evidence against the null hypothesis is reflected in the `P>|t|` column given in the output table of the fitted model.

- A p-value is "the probability of seeing a statistic as or more extreme than what was observed if the null hypothesis were true", and that is exactly what `P>|t|` expresses.
The difference is that the permutation test uses "label shuffling" to simulate the sampling distribution of the sample statistic under the assumption of the null hypothesis (of no difference between the groups), and computes a p-value from there; whereas, the indicator variable formulation of the simple linear regression specification instead assumes that the assumption regarding the normality of the error terms, etc. is correct, which allows for a theoretical approximation of the sampling distribution of $\hat \beta_1$ under the null hypothesis.
- The assumption regarding the normality of the error terms, etc. can be diagnostically assessed by examining the distribution of the residuals $\hat \epsilon_i = Y_i - \hat y_i$; specifically, the appropriateness of the claim that this distribution appears to reasonably be characterized as approximately normally distributed.
- Would you be able to assess this assumption in the context of the current analysis? If so, what is your opinion of this assumption in the context of the current analysis? Does it appear to be a reasonable assumption? Or does it appear to be egregiously unlikely and misguided?
To conclude on a high note here, we've seen that we can do hypothesis testing examining group differences using simple linear regression on the basis of indicator variables. But inference using confidence intervals WILL ALWAYS BE PREFERABLE TO HYPOTHESIS TESTING FRAMEWORKS, since it can itself be used to make decisions with associated confidence levels just as classical hypothesis testing does, AND IT ALSO additionally provides a plausible range of values (where "plausible" is defined in terms of the confidence level corresponding to the confidence interval) for what the actual true value of the parameter of interest might be according to the evidence available in the data. This allows us to meaningfully understand and interpret exactly what it is that we understand about the population on the basis of the sample data at hand. This kind of inference is so, so much more usefully actionable than simply characterizing the evidence against the null (which of course confidence intervals CAN STILL EVEN DO). So now you're asking, "okay, so what's this have to do with the indicator version of simple linear regression that we're considering here?". Well, let me tell you. What do you think the columns `[0.025` and `0.975]` mean in the output table of a fitted linear regression model?
I rest my case.
So, bottom line: pokeaman Pokémon is probably maybe racist no doubt. Strong evidence to believe that. Trust.