Biostatistics Refresher - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
Module 1.2: Biostatistics Refresher
Prerequisites: DNA, RNA & Protein Basics
Next: Intro to Biopython & Bioconductor
1. Concept & Motivation
Statistical analysis underpins nearly every bioinformatics workflow—from identifying differentially expressed genes to modeling genotype–phenotype associations. In this module we’ll cover:
- Descriptive statistics: mean, median, variance, standard deviation.
- Data visualization: histograms, boxplots.
- Hypothesis testing: two‐sample t‐test.
- Simple linear regression: fitting and interpreting a trend line.
2. Data Description
We’ll simulate two small datasets:
- Group A and Group B: expression values for a hypothetical gene under two conditions (50 samples each).
- X & Y: paired continuous variables with a linear relationship plus noise.
All data are generated on‐the‐fly with NumPy, so you don’t need any external files.
3. Hands-On Code
Tip: If you haven’t installed SciPy yet, run
!pip install scipy
in a notebook cell first.
3.1 Generate & Peek at Data
import numpy as np
import pandas as pd
# Reproducible randomness
np.random.seed(42)
# Simulate expression values
groupA = np.random.normal(loc=5.0, scale=1.0, size=50)
groupB = np.random.normal(loc=6.0, scale=1.2, size=50)
# Build a DataFrame
df = pd.DataFrame({
'Expression': np.concatenate([groupA, groupB]),
'Group': ['A'] * 50 + ['B'] * 50
})
# Peek at the first 5 rows (head() returns 5 rows by default)
df.head()
Output: [table: first five rows of df, with columns Expression and Group]
3.2 Descriptive Statistics
# Aggregate mean, median, std, min, max by group
desc = df.groupby('Group')['Expression'].agg(['mean','median','std','min','max'])
desc
Output: [table: mean, median, std, min and max of Expression for Groups A and B]
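One detail is easy to miss when comparing these numbers with NumPy: the pandas `std` aggregation uses the sample formula (dividing by n − 1, i.e. `ddof=1`), whereas `np.std` defaults to the population formula (`ddof=0`). The sketch below regenerates Group A with the same seed as section 3.1 and computes the sample standard deviation by hand; the variable names are illustrative and not part of the tutorial code.

```python
import numpy as np
import pandas as pd

# Same seed and parameters as Group A in section 3.1
np.random.seed(42)
group_a = np.random.normal(loc=5.0, scale=1.0, size=50)

# Sample standard deviation from its definition (divide by n - 1)
n = group_a.size
mean = group_a.sum() / n
sample_var = ((group_a - mean) ** 2).sum() / (n - 1)
sample_sd = np.sqrt(sample_var)

pandas_sd = pd.Series(group_a).std()       # pandas default: ddof=1
numpy_sd_default = np.std(group_a)         # NumPy default: ddof=0
numpy_sd_sample = np.std(group_a, ddof=1)  # matches pandas

print(f"by hand       = {sample_sd:.4f}")
print(f"pandas        = {pandas_sd:.4f}")
print(f"numpy ddof=0  = {numpy_sd_default:.4f}")
print(f"numpy ddof=1  = {numpy_sd_sample:.4f}")
```

With 50 samples the two conventions differ only slightly, but the gap grows as n shrinks, so it helps to state which one a reported value uses.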
3.3 Visualizing Distributions
import matplotlib.pyplot as plt
# Histogram
plt.hist([groupA, groupB],
bins=10,
label=['Group A','Group B'],
alpha=0.7)
plt.title("Expression Distributions")
plt.xlabel("Expression value")
plt.ylabel("Frequency")
plt.legend()
plt.tight_layout()
plt.show()
# Boxplot
df.boxplot(column='Expression', by='Group')
plt.title("Expression by Group")
plt.suptitle("") # remove automatic title
plt.ylabel("Expression value")
plt.tight_layout()
plt.show()
Output: [figures: histogram of Group A and Group B expression values; boxplot of Expression by Group]
3.4 Two-Sample t-Test
from scipy import stats
t_stat, p_val = stats.ttest_ind(groupA, groupB, equal_var=False)
print(f"T-statistic = {t_stat:.3f}")
print(f"P-value = {p_val:.3e}")
Output:
```text
T-statistic = -6.277
P-value = 9.766e-09
```
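Interpretation: Because equal_var=False, this is Welch's t-test, which does not assume equal variances in the two groups. The p-value here (about 9.8e-09) is far below the conventional 0.05 threshold, so we reject the null hypothesis of equal mean expression. The negative t-statistic reflects that Group A's sample mean is lower than Group B's.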
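To make the test less of a black box, here is a minimal sketch that computes Welch's t-statistic, its approximate degrees of freedom (the Welch–Satterthwaite formula), and the two-sided p-value by hand, then compares them with `stats.ttest_ind`. The helper name `welch_t` is hypothetical and introduced only for illustration.

```python
import numpy as np
from scipy import stats

# Regenerate the two groups exactly as in section 3.1
np.random.seed(42)
groupA = np.random.normal(loc=5.0, scale=1.0, size=50)
groupB = np.random.normal(loc=6.0, scale=1.2, size=50)

def welch_t(a, b):
    """Welch's two-sample t-test computed from its textbook formulas."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each group mean (sample variance / n)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch–Satterthwaite approximation to the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the t distribution
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p

t_manual, dof_manual, p_manual = welch_t(groupA, groupB)
t_scipy, p_scipy = stats.ttest_ind(groupA, groupB, equal_var=False)

print(f"manual: t = {t_manual:.3f}, df = {dof_manual:.1f}, p = {p_manual:.3e}")
print(f"scipy : t = {t_scipy:.3f}, p = {p_scipy:.3e}")
```

The two rows should agree, which confirms that the sign of t depends only on the order in which the groups are passed.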
3.5 Simple Linear Regression
# Simulate x and y
x = np.random.uniform(0, 10, size=50)
y = 2.5 * x + np.random.normal(scale=5.0, size=50)
# Fit regression
slope, intercept, r_value, p_val_lr, std_err = stats.linregress(x, y)
print(f"Slope = {slope:.2f}")
print(f"Intercept = {intercept:.2f}")
print(f"R² = {r_value**2:.3f}")
# Plot
plt.scatter(x, y, label="Data")
plt.plot(x, intercept + slope * x, color='black', label="Fit")
plt.title("Linear Regression of Y ~ X")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.tight_layout()
plt.show()
Output: [printed slope, intercept and R²; scatter plot of Y against X with the fitted line]
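For readers who want to see where the fitted numbers come from, the following sketch computes the least-squares slope, intercept and R² from their closed-form definitions and checks them against `stats.linregress`. It draws its own data with `np.random.default_rng(0)` (a seed chosen only for this illustration), so the values will not match the tutorial run above.

```python
import numpy as np
from scipy import stats

# Illustrative data with the same generating model as section 3.5
rng = np.random.default_rng(0)  # assumption: seed picked for this sketch only
x = rng.uniform(0, 10, size=50)
y = 2.5 * x + rng.normal(scale=5.0, size=50)

# Ordinary least squares in closed form
x_mean, y_mean = x.mean(), y.mean()
slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
intercept = y_mean - slope * x_mean

# R² = 1 - (residual sum of squares) / (total sum of squares)
residuals = y - (intercept + slope * x)
ss_res = (residuals ** 2).sum()
ss_tot = ((y - y_mean) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

fit = stats.linregress(x, y)
print(f"manual: slope = {slope:.3f}, intercept = {intercept:.3f}, R² = {r_squared:.3f}")
print(f"scipy : slope = {fit.slope:.3f}, intercept = {fit.intercept:.3f}, R² = {fit.rvalue ** 2:.3f}")
```

Because the noise standard deviation (5.0) is large relative to the range of x, R² will typically sit well below 1 even though the underlying relationship is exactly linear.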
4. Interpretation & Discussion
- Descriptives (mean, std) summarise central tendency and spread.
- Histograms and boxplots reveal distribution shape and outliers.
- t-test assesses whether group means differ beyond random chance.
- Regression quantifies and tests the strength of a linear relationship (R² close to 1 indicates a strong fit).
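R² describes fit quality, while the "tests" part of the last point comes from the slope's standard error. As a rough sketch, the slope's 95% confidence interval and the p-value for the null hypothesis "slope = 0" can be derived from the `stderr` field returned by `stats.linregress`, using a t distribution with n − 2 degrees of freedom. The data are simulated here with an arbitrary seed, so the numbers are illustrative only.

```python
import numpy as np
from scipy import stats

# Illustrative data; same generating model as section 3.5
rng = np.random.default_rng(0)  # assumption: seed picked for this sketch only
x = rng.uniform(0, 10, size=50)
y = 2.5 * x + rng.normal(scale=5.0, size=50)

fit = stats.linregress(x, y)

# Simple linear regression estimates two parameters, leaving n - 2 degrees of freedom
dof = x.size - 2
t_crit = stats.t.ppf(0.975, dof)  # two-sided 95% critical value

ci_low = fit.slope - t_crit * fit.stderr
ci_high = fit.slope + t_crit * fit.stderr

# The reported p-value tests slope = 0; reproduce it from slope / stderr
t_slope = fit.slope / fit.stderr
p_manual = 2 * stats.t.sf(abs(t_slope), dof)

print(f"slope = {fit.slope:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"p (manual) = {p_manual:.3e}, p (linregress) = {fit.pvalue:.3e}")
```

An interval that excludes 0 corresponds to a slope p-value below 0.05, which is a more informative summary than R² alone.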
5. Exercises
- One-sample t-test: Test whether Group A’s mean differs from 5.0.
- Non-parametric test: Use stats.mannwhitneyu on the two groups and compare results to the t-test.
- Multiple regression: Simulate two predictors (X1, X2) and use statsmodels to fit a multiple linear regression model.
- Effect of sample size: Repeat the simulation with n = 20 and n = 100, and observe how p-values and confidence intervals change.
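As a possible starting point for the sample-size exercise, the sketch below wraps the simulation and Welch's t-test in a function and reports the p-value together with a 95% confidence interval for the difference in means. The function name `simulate_once`, the seed, and the choice of a Welch-style interval are assumptions made for illustration; adapt them as you see fit.

```python
import numpy as np
from scipy import stats

def simulate_once(n, rng):
    """Simulate both groups with n samples each and run Welch's t-test."""
    a = rng.normal(loc=5.0, scale=1.0, size=n)
    b = rng.normal(loc=6.0, scale=1.2, size=n)
    t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)

    # 95% confidence interval for mean(A) - mean(B), Welch-style
    va = a.var(ddof=1) / n
    vb = b.var(ddof=1) / n
    se = np.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (n - 1) + vb ** 2 / (n - 1))
    half_width = stats.t.ppf(0.975, dof) * se
    diff = a.mean() - b.mean()
    return p_val, diff - half_width, diff + half_width

rng = np.random.default_rng(1)  # assumption: arbitrary seed for reproducibility
results = {}
for n in (20, 50, 100):
    p_val, ci_low, ci_high = simulate_once(n, rng)
    results[n] = (p_val, ci_low, ci_high)
    print(f"n = {n:3d}: p = {p_val:.3e}, 95% CI for A - B = ({ci_low:.2f}, {ci_high:.2f})")
```

A single run per sample size is noisy, so repeating each setting many times and summarising the distribution of p-values gives a clearer picture.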
End of Module 1.2