Conceptual Overview

Motivation

Modern biological experiments frequently generate high-dimensional datasets, such as:

  • Neural population recordings
  • Single-cell RNA sequencing
  • Imaging data
  • Behavioral measurements

These datasets are often collected under two experimental conditions, and the goal is to identify patterns that differ between conditions.

Dimensionality reduction methods such as Principal Component Analysis (PCA) are commonly used to simplify high-dimensional data. PCA identifies directions of maximum variance in a dataset, which often correspond to meaningful structure such as correlated neural populations or gene expression programs.

However, PCA operates on a single dataset. When comparing two conditions, PCA typically identifies shared sources of variance, rather than differences between conditions. As a result, PCA is often poorly suited for contrasting experimental datasets.
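
To make this limitation concrete, here is a minimal NumPy sketch (synthetic data, independent of the gcPCA package): both conditions share one high-variance direction, condition A additionally carries a weaker condition-specific direction, and PCA on condition A recovers the shared direction first while missing the condition-specific one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orthogonal unit directions in a 10-dimensional feature space
shared = rng.normal(size=10)                   # high variance in BOTH conditions
shared /= np.linalg.norm(shared)
specific = rng.normal(size=10)                 # extra variance only in condition A
specific -= (specific @ shared) * shared       # orthogonalize against the shared direction
specific /= np.linalg.norm(specific)

n = 1000
noise = 0.5
A = (rng.normal(size=(n, 1)) * 5.0) @ shared[None, :] \
    + (rng.normal(size=(n, 1)) * 2.0) @ specific[None, :] \
    + rng.normal(size=(n, 10)) * noise
B = (rng.normal(size=(n, 1)) * 5.0) @ shared[None, :] + rng.normal(size=(n, 10)) * noise

# PCA on condition A: top eigenvector of its covariance matrix
evals, evecs = np.linalg.eigh(np.cov(A, rowvar=False))
pc1 = evecs[:, -1]                             # np.linalg.eigh sorts eigenvalues ascending

print(abs(pc1 @ shared))    # close to 1: PC1 tracks the shared direction
print(abs(pc1 @ specific))  # close to 0: the condition-specific direction is missed
```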


The Goal of gcPCA

Generalized contrastive PCA (gcPCA) is designed to identify low-dimensional patterns that differ between two datasets.

Instead of maximizing overall variance, gcPCA finds directions that:

  • Capture high variance in one condition
  • Capture low variance in the other condition

This allows gcPCA to isolate condition-specific structure while suppressing shared variation.

Conceptually:

  • PCA finds dominant structure
  • gcPCA finds differential structure

Why Not Use Classification Methods?

Methods such as linear discriminant analysis (LDA) can distinguish between two datasets, but they answer a different question.

Classification methods identify mean differences between conditions, such as:

  • Which neurons fire more in condition A
  • Which genes are upregulated in condition B

gcPCA instead identifies covariance differences, such as:

  • Which neurons become more correlated in condition A
  • Which gene networks co-activate in condition B

These differences often reveal more subtle and biologically meaningful structure.
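
As a toy illustration (synthetic data, plain NumPy, with two features playing the role of hypothetical neurons): the two conditions below have identical means, so a mean-based method has nothing to separate, yet their correlation structure differs sharply.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
cov_A = [[1.0, 0.9], [0.9, 1.0]]   # the two "neurons" are strongly correlated in A
cov_B = [[1.0, 0.0], [0.0, 1.0]]   # ...and uncorrelated in B
A = rng.multivariate_normal([0.0, 0.0], cov_A, size=n)
B = rng.multivariate_normal([0.0, 0.0], cov_B, size=n)

print(A.mean(axis=0), B.mean(axis=0))                  # means ~equal: no signal for LDA
print(np.corrcoef(A.T)[0, 1], np.corrcoef(B.T)[0, 1])  # correlations differ: ~0.9 vs ~0.0
```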


Relationship to Contrastive PCA (cPCA)

Contrastive PCA (cPCA) was previously proposed for comparing two datasets, but it requires a contrast hyperparameter (commonly called α) that must be tuned manually.

This creates two problems:

  • Multiple possible solutions
  • No objective way to determine which solution is correct

Additionally, cPCA is asymmetric, treating one dataset as foreground and the other as background.
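
The α problem can be seen in a short sketch. The published cPCA algorithm takes the top eigenvectors of C_fg − α·C_bg (foreground covariance minus a weighted background covariance); the synthetic data below show the leading contrastive direction changing as α changes.

```python
import numpy as np

rng = np.random.default_rng(3)
fg = rng.normal(size=(1000, 10))   # "foreground" condition
fg[:, 0] *= 2.0                    # foreground-specific variance in feature 0
fg[:, 1] *= 3.0                    # strong variance in feature 1...
bg = rng.normal(size=(1000, 10))   # "background" condition
bg[:, 1] *= 3.0                    # ...that the background shares

C_fg = np.cov(fg, rowvar=False)
C_bg = np.cov(bg, rowvar=False)

for alpha in (0.1, 1.0, 10.0):
    evals, evecs = np.linalg.eigh(C_fg - alpha * C_bg)
    top = evecs[:, -1]                       # leading contrastive direction for this alpha
    print(alpha, np.argmax(np.abs(top)))     # dominant feature flips from 1 to 0 as alpha grows
```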

gcPCA addresses these limitations by:

  • Removing hyperparameter tuning
  • Allowing symmetric comparisons
  • Producing a single interpretable solution

Intuition Behind gcPCA v4

gcPCA v4 identifies dimensions that maximize relative differences in variance between two datasets.

Conceptually, for each candidate dimension, gcPCA v4 evaluates the ratio:

(variance in A − variance in B) / (variance in A + variance in B)

This normalization is important because it:

  • Reduces bias toward high-variance dimensions
  • Improves robustness to noise
  • Produces interpretable values between −1 and 1

Interpretation:

  • +1 → variance only in condition A
  • −1 → variance only in condition B
  • 0 → equal variance in both conditions

This makes gcPCA v4 symmetric and easy to interpret: swapping the two datasets yields the same components, with the objective values flipped in sign.
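
To see how such a ratio can be optimized in closed form, here is a simplified sketch under the assumption that the objective is exactly the ratio above (the method the package actually implements is described on the Mathematical Formulation page): maximizing v′(C_A − C_B)v / v′(C_A + C_B)v over directions v is a generalized Rayleigh quotient, which a symmetric generalized eigendecomposition solves directly.

```python
import numpy as np
from scipy.linalg import eigh

def gcpca_v4_sketch(A, B):
    """A, B: centered (samples x features) arrays with matching features.
    Solves (Ca - Cb) v = lambda (Ca + Cb) v; an assumed reading of the ratio,
    not the package's exact algorithm."""
    Ca = A.T @ A / len(A)                    # condition A covariance
    Cb = B.T @ B / len(B)                    # condition B covariance
    evals, evecs = eigh(Ca - Cb, Ca + Cb)    # generalized eigendecomposition
    order = np.argsort(evals)[::-1]          # +1 end: A-specific; -1 end: B-specific
    return evals[order], evecs[:, order]

rng = np.random.default_rng(2)
A = rng.normal(size=(1000, 10)); A[:, 0] *= 3.0   # extra variance in feature 0 only in A
B = rng.normal(size=(1000, 10)); B[:, 1] *= 3.0   # extra variance in feature 1 only in B
vals, loadings = gcpca_v4_sketch(A - A.mean(0), B - B.mean(0))
print(vals[0], vals[-1])   # roughly +0.8 and -0.8: variance (9 vs 1) concentrated in one condition
```

Because the denominator bounds the quotient, every eigenvalue lies in [−1, 1], matching the interpretation above.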


Neuroscience Example

Consider neural recordings during two conditions:

  • Condition A: Task performance
  • Condition B: Rest

Both datasets may contain shared sources of variance:

  • Global brain state
  • Movement artifacts
  • Slow drift

PCA typically identifies these shared components first.

gcPCA instead identifies:

  • Task-specific neural ensembles
  • Changes in neural correlations
  • Population dynamics unique to task performance

In the gcPCA manuscript, this approach was used to identify hippocampal replay in neural recordings. gcPCA extracted patterns corresponding to replayed neural trajectories without requiring prior knowledge of the replay structure.

This demonstrates that gcPCA can reveal meaningful neural population structure in an unsupervised manner.


What gcPCA Produces

gcPCA returns:

  • gcPC loadings
  • Scores for each dataset
  • Objective values

Conceptually:

  • Loadings → Which features drive differences
  • Scores → How samples differ across conditions
  • Objective values → Strength of condition-specific structure
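
The following minimal sketch (plain NumPy) shows how the three outputs relate; `loadings` here is a placeholder for whatever the package returns, so the names are illustrative (see the Code Reference page for the real API).

```python
import numpy as np

def gcpca_outputs(A, B, loadings):
    """A, B: centered (samples x features); loadings: (features x components)."""
    scores_A = A @ loadings                        # expression of each gcPC in each sample of A
    scores_B = B @ loadings                        # same projection for condition B
    var_A = scores_A.var(axis=0)
    var_B = scores_B.var(axis=0)
    objective = (var_A - var_B) / (var_A + var_B)  # per component, bounded in [-1, 1]
    return scores_A, scores_B, objective
```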

When to Use gcPCA

gcPCA is useful when:

  • Comparing two experimental conditions
  • Identifying condition-specific neural activity
  • Comparing biological states
  • Analyzing multi-condition experiments
  • Studying population dynamics

When Not to Use gcPCA

gcPCA may be less useful when:

  • Only one dataset is available
  • Conditions are nearly identical
  • Sample sizes are extremely small
  • Data are dominated by noise

In these cases, PCA or other dimensionality reduction methods may be more appropriate.


Summary

Generalized contrastive PCA (gcPCA) is a dimensionality reduction method designed for comparing datasets.
It identifies low-dimensional patterns enriched in one condition relative to another, enabling discovery of condition-specific structure in high-dimensional data.

Links to other pages

1. Quickstart Guide
2. Installation
4. Mathematical Formulation
5. Code Reference
6. Input Data Guidelines
7. Interpreting Results