6. Input Data Guidelines - SjulsonLab/generalized_contrastive_PCA GitHub Wiki
This page describes recommended data formatting and preprocessing steps for gcPCA.
Following these guidelines helps ensure stable and interpretable results.
Data Format Requirements
gcPCA requires two datasets:
- Ra — Condition A
- Rb — Condition B
Both datasets must have the following format:
- Rows = samples
- Columns = features
Matrix shapes:
- Ra: (ma × p)
- Rb: (mb × p)
Where:
- ma, mb = number of samples
- p = number of shared features
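As a minimal sketch of the expected shapes, the matrices below use hypothetical dimensions (120 and 80 samples, 30 shared features) and random values standing in for real recordings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: ma = 120 samples in condition A,
# mb = 80 samples in condition B, p = 30 shared features.
# Sample counts may differ; the feature count must match.
ma, mb, p = 120, 80, 30
Ra = rng.standard_normal((ma, p))  # condition A: samples x features
Rb = rng.standard_normal((mb, p))  # condition B: samples x features

assert Ra.shape[1] == Rb.shape[1]  # same number of shared features
```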
Example
Neuroscience example:
- Rows → trials or time points
- Columns → neurons
Ra = trials × neurons (task condition)
Rb = trials × neurons (baseline condition)
Genomics example:
- Rows → cells
- Columns → genes
Ra = cells × genes (disease)
Rb = cells × genes (control)
Matching Features Between Conditions
Mismatched features are one of the most common sources of error.
Requirements:
- Same number of features
- Same feature order
- Same preprocessing pipeline
Incorrect examples:
- Different neuron ordering
- Missing neurons in one dataset
- Different gene sets
gcPCA assumes each column corresponds to the same feature in both datasets. It is fine for the datasets to have different numbers of samples.
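When each dataset comes with its own feature labels (neuron or gene IDs, for example), the two matrices can be aligned to a shared, consistently ordered feature set before running gcPCA. This is a sketch with made-up IDs, not part of the gcPCA API:

```python
import numpy as np

# Hypothetical feature IDs recorded alongside each dataset.
ids_a = np.array(["n1", "n2", "n3", "n4"])
ids_b = np.array(["n3", "n1", "n4", "n5"])  # different order, one non-shared ID

rng = np.random.default_rng(1)
Ra = rng.standard_normal((10, ids_a.size))
Rb = rng.standard_normal((12, ids_b.size))

# Keep only features present in BOTH datasets, in one canonical order.
shared = np.intersect1d(ids_a, ids_b)  # sorted shared IDs
cols_a = [np.where(ids_a == f)[0][0] for f in shared]
cols_b = [np.where(ids_b == f)[0][0] for f in shared]
Ra_aligned = Ra[:, cols_a]
Rb_aligned = Rb[:, cols_b]

# Columns now refer to the same feature in both matrices.
assert Ra_aligned.shape[1] == Rb_aligned.shape[1] == shared.size
```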
Normalization and Preprocessing
gcPCA operates on covariance structure, so preprocessing can affect results.
Recommended
- Mean-center features
- Z-score features
Python and R implementations perform normalization by default.
MATLAB normalization can be controlled with optional parameters.
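If you need to normalize manually (for example, to match what the built-in normalization does before disabling it), per-feature z-scoring is a few lines of numpy. The `eps` guard is an assumption added here to avoid dividing by zero for constant features:

```python
import numpy as np

def zscore_features(R, eps=1e-12):
    """Mean-center each column (feature) and scale it to unit variance."""
    mu = R.mean(axis=0)
    sd = R.std(axis=0)
    return (R - mu) / (sd + eps)  # eps guards against constant features

rng = np.random.default_rng(2)
Ra = rng.standard_normal((100, 5)) * 10 + 3  # arbitrary scale and offset
Ra_z = zscore_features(Ra)
# Each column of Ra_z now has mean ~0 and standard deviation ~1.
```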
When to Use Custom Normalization
Users may want to disable normalization when:
- Data already normalized
- Working with firing rates or standardized signals
- Using PCA-reduced data as input
Using PCA Before gcPCA
Applying PCA before gcPCA is not required, but may be helpful when:
- Feature dimensionality is extremely high
- Near-zero-variance dimensions make the solution unstable
- Numerical stability is a concern
In this case:
1. Apply PCA
2. Use PCA scores as input to gcPCA
This reduces dimensionality while preserving the dominant covariance structure.
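The two steps above can be sketched with a plain SVD. One detail the steps leave open is which data to fit the PCA basis on; fitting a single basis on the stacked data, as done here, is one reasonable choice (an assumption, not a prescription) that keeps the reduced columns comparable across conditions. `k = 20` is likewise an arbitrary example value:

```python
import numpy as np

rng = np.random.default_rng(3)
Ra = rng.standard_normal((60, 200))  # p >> n example
Rb = rng.standard_normal((50, 200))

# Fit one PCA basis on the stacked, mean-centered data so both
# conditions are projected onto the SAME components.
mu = np.vstack([Ra, Rb]).mean(axis=0)
X = np.vstack([Ra - mu, Rb - mu])
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 20                # hypothetical number of retained components
basis = Vt[:k].T      # (p x k) loading matrix

# PCA scores: use these as the inputs to gcPCA.
Ra_red = (Ra - mu) @ basis
Rb_red = (Rb - mu) @ basis
```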
Sample Size Considerations
gcPCA works well with:
- Moderate to large sample sizes
- High-dimensional datasets
gcPCA can handle:
- Different sample sizes between conditions
- p >> n settings (more features than samples)
Balancing sample sizes between conditions is good practice, but not required.
Handling Missing Data
gcPCA does not support missing values.
Before running gcPCA:
- Remove samples with missing values
- Impute missing values
- Interpolate if appropriate
No NaN values are allowed in Ra or Rb.
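The first two options above (dropping affected samples, or imputing) look like this in numpy; mean imputation is used here only as a simple illustration, and whether it is appropriate depends on your data:

```python
import numpy as np

rng = np.random.default_rng(4)
Ra = rng.standard_normal((8, 4))
Ra[2, 1] = np.nan
Ra[5, 3] = np.nan

# Option 1: drop samples (rows) that contain any missing value.
Ra_dropped = Ra[~np.isnan(Ra).any(axis=1)]

# Option 2: impute with the per-feature mean of the observed values.
col_means = np.nanmean(Ra, axis=0)
Ra_imputed = np.where(np.isnan(Ra), col_means, Ra)

assert not np.isnan(Ra_dropped).any()
assert not np.isnan(Ra_imputed).any()
```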
Scaling and Units
Because gcPCA analyzes covariance:
- Feature scaling affects results
- Large-scale features dominate
Examples:
- firing rates vs normalized activity
- gene counts vs log-transformed counts
Normalization helps ensure balanced contributions.
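A quick synthetic demonstration of the scaling problem: a raw-count feature carries orders of magnitude more variance than a unit-scale signal, so it would dominate any covariance-based analysis until the columns are rescaled. The Poisson counts and Gaussian signal below are invented stand-ins for the examples named above:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Two features in different units: raw counts vs a normalized signal.
counts = rng.poisson(1000, size=n).astype(float)  # variance ~1000
signal = rng.standard_normal(n)                   # variance ~1

R = np.column_stack([counts, signal])
var = R.var(axis=0)
# The large-scale feature dwarfs the other in total variance.
assert var[0] > 100 * var[1]

# Z-scoring puts both features on an equal footing.
R_z = (R - R.mean(axis=0)) / R.std(axis=0)
```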
Common Pitfalls
Feature Mismatch
Different features between conditions:
- different neurons
- different genes
- different channels
This produces incorrect results.
Unequal Preprocessing
Example:
- Ra normalized
- Rb not normalized
This introduces artificial differences.
Too Few Samples
Small sample sizes can produce:
- noisy components
- unstable loadings
Highly Noisy Data
If noise dominates:
- gcPCs become harder to interpret
- consider smoothing or preprocessing
Quick Checklist
Before running gcPCA:
- Same features in Ra and Rb
- Samples in rows
- Features in columns
- No missing values
- Normalized or appropriately scaled
- Sufficient sample size
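The checklist can be turned into a small pre-flight function; `check_gcpca_inputs` and its `min_samples` threshold are hypothetical names introduced here, not part of the gcPCA package:

```python
import numpy as np

def check_gcpca_inputs(Ra, Rb, min_samples=10):
    """Hypothetical pre-flight checks mirroring the checklist above."""
    assert Ra.ndim == 2 and Rb.ndim == 2, "inputs must be samples x features"
    assert Ra.shape[1] == Rb.shape[1], "feature counts must match"
    assert not np.isnan(Ra).any() and not np.isnan(Rb).any(), "no missing values"
    assert Ra.shape[0] >= min_samples and Rb.shape[0] >= min_samples, "too few samples"
    return True

rng = np.random.default_rng(6)
ok = check_gcpca_inputs(rng.standard_normal((40, 12)),
                        rng.standard_normal((30, 12)))
```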
Summary
gcPCA works best when datasets:
- Share the same features
- Are consistently preprocessed
- Contain sufficient samples
- Are properly normalized
Following these guidelines improves interpretability and numerical stability.
Links to Other Pages
1. Quickstart Guide
2. Installation
3. Conceptual Overview
4. Mathematical Formulation
5. Code Reference
7. Interpreting Results