Biostatistics Refresher - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

Module 1.2: Biostatistics Refresher
Prerequisites: DNA, RNA & Protein Basics
Next: Intro to Biopython & Bioconductor

1. Concept & Motivation

Statistical analysis underpins nearly every bioinformatics workflow—from identifying differentially expressed genes to modeling genotype–phenotype associations. In this module we’ll cover:

Descriptive statistics: mean, median, variance, standard deviation.
Data visualization: histograms, boxplots.
Hypothesis testing: two‐sample t‐test.
Simple linear regression: fitting and interpreting a trend line.

2. Data Description

We’ll simulate two small datasets:

Group A and Group B: expression values for a hypothetical gene under two conditions (50 samples each).
X & Y: paired continuous variables with a linear relationship plus noise.

All data are generated on‐the‐fly with NumPy, so you don’t need any external files.

3. Hands-On Code

Tip: If you haven’t installed SciPy yet, run !pip install scipy in a notebook cell first.

3.1 Generate & Peek at Data

import numpy as np
import pandas as pd

# Reproducible randomness
np.random.seed(42)

# Simulate expression values
groupA = np.random.normal(loc=5.0, scale=1.0, size=50)
groupB = np.random.normal(loc=6.0, scale=1.2, size=50)

# Build a DataFrame
df = pd.DataFrame({
    'Expression': np.concatenate([groupA, groupB]),
    'Group': ['A'] * 50 + ['B'] * 50
})

# Peek at the first 5 rows (head() defaults to n=5)
df.head()

Output: the first five rows of df, with columns Expression and Group.

3.2 Descriptive Statistics

# Aggregate mean, median, std, min, max by group
desc = df.groupby('Group')['Expression'].agg(['mean','median','std','min','max'])
desc

Output: a table with one row per group (A, B) and columns mean, median, std, min, max.
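
The std column above is the sample standard deviation (pandas divides by n - 1), whereas NumPy's np.std divides by n unless told otherwise. The following is a minimal sketch, assuming the same seed and Group A parameters as section 3.1, that computes the mean and sample standard deviation by hand and checks them against NumPy with ddof=1. The variable names are illustrative only.

```python
import numpy as np

# Regenerate Group A as in section 3.1 so this block runs on its own
np.random.seed(42)
groupA = np.random.normal(loc=5.0, scale=1.0, size=50)

n = groupA.size
mean = groupA.sum() / n

# Sample variance divides by n - 1 (Bessel's correction)
var_sample = ((groupA - mean) ** 2).sum() / (n - 1)
std_sample = np.sqrt(var_sample)

# np.std defaults to ddof=0 (divide by n); ddof=1 matches the pandas default
assert np.isclose(std_sample, np.std(groupA, ddof=1))
assert np.std(groupA, ddof=0) < std_sample

print(f"mean = {mean:.3f}, sample std = {std_sample:.3f}")
```

With only 50 samples the two conventions differ by about 1%, which is small here but worth keeping in mind when comparing numbers across libraries.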

3.3 Visualizing Distributions

import matplotlib.pyplot as plt

# Histogram
plt.hist([groupA, groupB],
         bins=10,
         label=['Group A','Group B'],
         alpha=0.7)
plt.title("Expression Distributions")
plt.xlabel("Expression value")
plt.ylabel("Frequency")
plt.legend()
plt.tight_layout()
plt.show()

# Boxplot
df.boxplot(column='Expression', by='Group')
plt.title("Expression by Group")
plt.suptitle("")  # remove automatic title
plt.ylabel("Expression value")
plt.tight_layout()
plt.show()

Output: a histogram showing both groups side by side, followed by a boxplot of Expression by Group.

3.4 Two-Sample t-Test

from scipy import stats

# Welch's t-test: equal_var=False does not assume equal group variances
t_stat, p_val = stats.ttest_ind(groupA, groupB, equal_var=False)
print(f"T-statistic = {t_stat:.3f}")
print(f"P-value     = {p_val:.3e}")

Output:

```text
T-statistic = -6.277
P-value     = 9.766e-09
```

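Interpretation: Here p ≈ 9.8e-09 is far below the conventional 0.05 threshold, so we reject the null hypothesis that the two groups have equal mean expression. The negative t-statistic reflects that Group A’s sample mean is lower than Group B’s. Because equal_var=False, this is Welch’s t-test, which does not assume equal variances in the two groups.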
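
To make the test less of a black box, here is a minimal sketch that rebuilds the Welch statistic by hand and compares it with SciPy. It assumes the same seed and parameters as section 3.1. The helper names (se2_A, se2_B, dof, t_manual, p_manual) are illustrative and not part of the original module. The statistic is the difference in means divided by the square root of the summed squared standard errors, and the degrees of freedom come from the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy import stats

# Regenerate both groups as in section 3.1 so this block runs on its own
np.random.seed(42)
groupA = np.random.normal(loc=5.0, scale=1.0, size=50)
groupB = np.random.normal(loc=6.0, scale=1.2, size=50)

# Squared standard error of each group mean: s^2 / n
se2_A = groupA.var(ddof=1) / groupA.size
se2_B = groupB.var(ddof=1) / groupB.size

# Welch's t statistic: difference in means over the combined standard error
t_manual = (groupA.mean() - groupB.mean()) / np.sqrt(se2_A + se2_B)

# Welch-Satterthwaite approximation to the degrees of freedom
dof = (se2_A + se2_B) ** 2 / (
    se2_A ** 2 / (groupA.size - 1) + se2_B ** 2 / (groupB.size - 1)
)

# Two-sided p-value from the t distribution's survival function
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# Cross-check against SciPy's implementation
t_scipy, p_scipy = stats.ttest_ind(groupA, groupB, equal_var=False)
assert np.isclose(t_manual, t_scipy)
assert np.isclose(p_manual, p_scipy, rtol=1e-6, atol=0)

print(f"t = {t_manual:.3f}, df = {dof:.1f}, p = {p_manual:.3e}")
```

The degrees of freedom are generally not an integer and fall below n_A + n_B - 2 when the group variances differ, which is what makes the Welch variant more conservative than the pooled-variance test.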

3.5 Simple Linear Regression

# Simulate x and y
x = np.random.uniform(0, 10, size=50)
y = 2.5 * x + np.random.normal(scale=5.0, size=50)

# Fit regression
slope, intercept, r_value, p_val_lr, std_err = stats.linregress(x, y)
print(f"Slope     = {slope:.2f}")
print(f"Intercept = {intercept:.2f}")
print(f"R²        = {r_value**2:.3f}")

# Plot
plt.scatter(x, y, label="Data")
plt.plot(x, intercept + slope * x, color='black', label="Fit")
plt.title("Linear Regression of Y ~ X")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.tight_layout()
plt.show()

Output: the printed slope, intercept, and R², followed by a scatter plot of the data with the fitted line.

4. Interpretation & Discussion

Descriptives (mean, std) summarise central tendency and spread.
Histograms and boxplots reveal distribution shape and outliers.
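The t-test assesses whether the difference between group means is larger than sampling variation alone would explain.
Regression quantifies a linear relationship through the slope and tests it through the slope’s p-value; R² is the fraction of the variance in Y explained by X, so a value close to 1 indicates a strong fit.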
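
The R² in the last point can be made concrete with a short sketch. It assumes data simulated in the same way as section 3.5, but with its own seed (0, chosen arbitrarily), so the numbers will not match that section's output. R² is computed as one minus the ratio of the residual sum of squares to the total sum of squares, then checked against the squared correlation coefficient that stats.linregress reports.

```python
import numpy as np
from scipy import stats

# Assumption: an arbitrary seed; section 3.5 draws from a different random state
np.random.seed(0)
x = np.random.uniform(0, 10, size=50)
y = 2.5 * x + np.random.normal(scale=5.0, size=50)

result = stats.linregress(x, y)
y_hat = result.intercept + result.slope * x  # fitted values

ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean of y
r2 = 1 - ss_res / ss_tot

# For simple linear regression this equals the squared Pearson correlation
assert np.isclose(r2, result.rvalue ** 2)

print(f"R² = {r2:.3f}, slope = {result.slope:.2f} ± {result.stderr:.2f} (standard error)")
```

Because the simulated noise is large relative to the range of x, R² will sit well below 1 even though the underlying relationship is exactly linear.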

5. Exercises

One-sample t-test: Test whether Group A’s mean differs from 5.0.
Non-parametric test: Use stats.mannwhitneyu on the two groups and compare results to the t-test.
Multiple regression: Simulate two predictors (X1, X2) and use statsmodels to fit a multiple linear regression model.
Effect of sample size: Repeat the simulation with n = 20, n = 100, and observe how p-values and confidence intervals change.
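
As a possible starting point for the first two exercises, the sketch below shows only the SciPy calls involved. It assumes the same seed and parameters as section 3.1. Interpreting the results, and comparing them with the Welch test, is left as the exercise.

```python
import numpy as np
from scipy import stats

# Regenerate both groups as in section 3.1 so this block runs on its own
np.random.seed(42)
groupA = np.random.normal(loc=5.0, scale=1.0, size=50)
groupB = np.random.normal(loc=6.0, scale=1.2, size=50)

# Exercise 1: one-sample t-test of Group A against a hypothesised mean of 5.0
t1, p1 = stats.ttest_1samp(groupA, popmean=5.0)
print(f"One-sample t = {t1:.3f}, p = {p1:.3f}")

# Exercise 2: Mann-Whitney U test, a rank-based alternative to the t-test
u_stat, p_u = stats.mannwhitneyu(groupA, groupB, alternative='two-sided')
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.3e}")
```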

End of Module 1.2