Conceptual Overview

Motivation

Modern biological experiments frequently generate high-dimensional datasets, such as:

  • Neural population recordings
  • Single-cell RNA sequencing
  • Imaging data
  • Behavioral measurements

These datasets are often collected under two experimental conditions, and the goal is to identify patterns that differ between conditions.

Dimensionality reduction methods such as Principal Component Analysis (PCA) are commonly used to simplify high-dimensional data. PCA identifies directions of maximum variance in a dataset, which often correspond to meaningful structure such as correlated neural populations or gene expression programs.

However, PCA operates on a single dataset. When comparing two conditions, PCA typically identifies shared sources of variance, rather than differences between conditions. As a result, PCA is often poorly suited for contrasting experimental datasets.
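
To make this limitation concrete, here is a minimal NumPy sketch (synthetic data, independent of the gcPCA package): both conditions share one high-variance direction, condition A additionally carries a weaker condition-specific direction, and PCA on condition A recovers the shared direction first while missing the condition-specific one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orthogonal unit directions in a 10-dimensional feature space
shared = rng.normal(size=10)                   # high variance in BOTH conditions
shared /= np.linalg.norm(shared)
specific = rng.normal(size=10)                 # extra variance only in condition A
specific -= (specific @ shared) * shared       # orthogonalize against the shared direction
specific /= np.linalg.norm(specific)

n = 1000
noise = 0.5
A = (rng.normal(size=(n, 1)) * 5.0) @ shared[None, :] \
    + (rng.normal(size=(n, 1)) * 2.0) @ specific[None, :] \
    + rng.normal(size=(n, 10)) * noise
B = (rng.normal(size=(n, 1)) * 5.0) @ shared[None, :] + rng.normal(size=(n, 10)) * noise

# PCA on condition A: top eigenvector of its covariance matrix
evals, evecs = np.linalg.eigh(np.cov(A, rowvar=False))
pc1 = evecs[:, -1]                             # np.linalg.eigh sorts eigenvalues ascending

print(abs(pc1 @ shared))    # close to 1: PC1 tracks the shared direction
print(abs(pc1 @ specific))  # close to 0: the condition-specific direction is missed
```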


The Goal of gcPCA

Generalized contrastive PCA (gcPCA) is designed to identify low-dimensional patterns that differ between two datasets.

Instead of maximizing overall variance, gcPCA finds directions that:

  • Capture high variance in one condition
  • Capture low variance in the other condition

This allows gcPCA to isolate condition-specific structure while suppressing shared variation.

Conceptually:

  • PCA finds dominant structure
  • gcPCA finds differential structure

Why Not Use Classification Methods?

Methods such as linear discriminant analysis (LDA) can distinguish between two datasets, but they answer a different question.

Classification methods identify mean differences between conditions, such as:

  • Which neurons fire more in condition A
  • Which genes are upregulated in condition B

gcPCA instead identifies covariance differences, such as:

  • Which neurons become more correlated in condition A
  • Which gene networks co-activate in condition B

These differences often reveal more subtle and biologically meaningful structure.
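
As a toy illustration (synthetic data, plain NumPy, with two features playing the role of hypothetical neurons): the two conditions below have identical means, so a mean-based method has nothing to separate, yet their correlation structure differs sharply.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
cov_A = [[1.0, 0.9], [0.9, 1.0]]   # the two "neurons" are strongly correlated in A
cov_B = [[1.0, 0.0], [0.0, 1.0]]   # ...and uncorrelated in B
A = rng.multivariate_normal([0.0, 0.0], cov_A, size=n)
B = rng.multivariate_normal([0.0, 0.0], cov_B, size=n)

print(A.mean(axis=0), B.mean(axis=0))                  # means ~equal: no signal for LDA
print(np.corrcoef(A.T)[0, 1], np.corrcoef(B.T)[0, 1])  # correlations differ: ~0.9 vs ~0.0
```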


Relationship to Contrastive PCA (cPCA)

Contrastive PCA (cPCA) was previously proposed for comparing two datasets, but it requires a contrast hyperparameter (commonly called α) that must be tuned manually.

This creates two problems:

  • Multiple possible solutions
  • No objective way to determine which solution is correct

Additionally, cPCA is asymmetric, treating one dataset as foreground and the other as background.
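
The α problem can be seen in a short sketch. The published cPCA algorithm takes the top eigenvectors of C_fg − α·C_bg (foreground covariance minus a weighted background covariance); the synthetic data below show the leading contrastive direction changing as α changes.

```python
import numpy as np

rng = np.random.default_rng(3)
fg = rng.normal(size=(1000, 10))   # "foreground" condition
fg[:, 0] *= 2.0                    # foreground-specific variance in feature 0
fg[:, 1] *= 3.0                    # strong variance in feature 1...
bg = rng.normal(size=(1000, 10))   # "background" condition
bg[:, 1] *= 3.0                    # ...that the background shares

C_fg = np.cov(fg, rowvar=False)
C_bg = np.cov(bg, rowvar=False)

for alpha in (0.1, 1.0, 10.0):
    evals, evecs = np.linalg.eigh(C_fg - alpha * C_bg)
    top = evecs[:, -1]                       # leading contrastive direction for this alpha
    print(alpha, np.argmax(np.abs(top)))     # dominant feature flips from 1 to 0 as alpha grows
```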

gcPCA addresses these limitations by:

  • Removing hyperparameter tuning
  • Allowing symmetric comparisons
  • Producing a single interpretable solution

Intuition Behind gcPCA v4

gcPCA v4 identifies dimensions that maximize relative differences in variance between two datasets.

Conceptually, for each candidate dimension, gcPCA v4 evaluates the ratio:

(variance in A − variance in B) / (variance in A + variance in B)

This normalization is important because it:

  • Reduces bias toward high-variance dimensions
  • Improves robustness to noise
  • Produces interpretable values between −1 and 1

Interpretation:

  • +1 → variance only in condition A
  • −1 → variance only in condition B
  • 0 → equal variance in both conditions

This makes gcPCA v4 symmetric and easy to interpret: swapping the two datasets yields the same components, with the objective values flipped in sign.
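
To see how such a ratio can be optimized in closed form, here is a simplified sketch under the assumption that the objective is exactly the ratio above (the method the package actually implements is described on the Mathematical Formulation page): maximizing v′(C_A − C_B)v / v′(C_A + C_B)v over directions v is a generalized Rayleigh quotient, which a symmetric generalized eigendecomposition solves directly.

```python
import numpy as np
from scipy.linalg import eigh

def gcpca_v4_sketch(A, B):
    """A, B: centered (samples x features) arrays with matching features.
    Solves (Ca - Cb) v = lambda (Ca + Cb) v; an assumed reading of the ratio,
    not the package's exact algorithm."""
    Ca = A.T @ A / len(A)                    # condition A covariance
    Cb = B.T @ B / len(B)                    # condition B covariance
    evals, evecs = eigh(Ca - Cb, Ca + Cb)    # generalized eigendecomposition
    order = np.argsort(evals)[::-1]          # +1 end: A-specific; -1 end: B-specific
    return evals[order], evecs[:, order]

rng = np.random.default_rng(2)
A = rng.normal(size=(1000, 10)); A[:, 0] *= 3.0   # extra variance in feature 0 only in A
B = rng.normal(size=(1000, 10)); B[:, 1] *= 3.0   # extra variance in feature 1 only in B
vals, loadings = gcpca_v4_sketch(A - A.mean(0), B - B.mean(0))
print(vals[0], vals[-1])   # roughly +0.8 and -0.8: variance (9 vs 1) concentrated in one condition
```

Because the denominator bounds the quotient, every eigenvalue lies in [−1, 1], matching the interpretation above.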


Neuroscience Example

Consider neural recordings during two conditions:

  • Condition A: Task performance
  • Condition B: Rest

Both datasets may contain shared sources of variance:

  • Global brain state
  • Movement artifacts
  • Slow drift

PCA typically identifies these shared components first.

gcPCA instead identifies:

  • Task-specific neural ensembles
  • Changes in neural correlations
  • Population dynamics unique to task performance

In the gcPCA manuscript, this approach was used to identify hippocampal replay in neural recordings. gcPCA extracted patterns corresponding to replayed neural trajectories without requiring prior knowledge of the replay structure.

This demonstrates that gcPCA can reveal meaningful neural population structure in an unsupervised manner.


What gcPCA Produces

gcPCA returns:

  • gcPC loadings
  • Scores for each dataset
  • Objective values

Conceptually:

  • Loadings → Which features drive differences
  • Scores → How samples differ across conditions
  • Objective values → Strength of condition-specific structure
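
The following minimal sketch (plain NumPy) shows how the three outputs relate; `loadings` here is a placeholder for whatever the package returns, so the names are illustrative (see the Code Reference page for the real API).

```python
import numpy as np

def gcpca_outputs(A, B, loadings):
    """A, B: centered (samples x features); loadings: (features x components)."""
    scores_A = A @ loadings                        # expression of each gcPC in each sample of A
    scores_B = B @ loadings                        # same projection for condition B
    var_A = scores_A.var(axis=0)
    var_B = scores_B.var(axis=0)
    objective = (var_A - var_B) / (var_A + var_B)  # per component, bounded in [-1, 1]
    return scores_A, scores_B, objective
```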

When to Use gcPCA

gcPCA is useful when:

  • Comparing two experimental conditions
  • Identifying condition-specific neural activity
  • Comparing biological states
  • Analyzing multi-condition experiments
  • Studying population dynamics

When Not to Use gcPCA

gcPCA may be less useful when:

  • Only one dataset is available
  • Conditions are nearly identical
  • Sample sizes are extremely small
  • Data are dominated by noise

In these cases, PCA or other dimensionality reduction methods may be more appropriate.


Summary

Generalized contrastive PCA (gcPCA) is a dimensionality reduction method designed for comparing datasets.
It identifies low-dimensional patterns enriched in one condition relative to another, enabling discovery of condition-specific structure in high-dimensional data.

Links to other pages

1. Quickstart Guide
2. Installation
4. Mathematical Formulation
5. Code Reference
6. Input Data Guidelines
7. Interpreting Results