6. Input Data Guidelines - SjulsonLab/generalized_contrastive_PCA GitHub Wiki
This page describes recommended data formatting and preprocessing steps for gcPCA.
Following these guidelines helps ensure stable and interpretable results.
Data Format Requirements
gcPCA requires two datasets:
- Ra — Condition A
- Rb — Condition B
Both datasets must have the following format:
- Rows = samples
- Columns = features
Matrix shapes:
- Ra: (ma × p)
- Rb: (mb × p)
Where:
- ma, mb = number of samples
- p = number of shared features
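As a minimal sketch of the expected shapes, the matrices below use hypothetical dimensions (120 and 80 samples, 30 shared features) and random values standing in for real recordings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: ma = 120 samples in condition A,
# mb = 80 samples in condition B, p = 30 shared features.
# Sample counts may differ; the feature count must match.
ma, mb, p = 120, 80, 30
Ra = rng.standard_normal((ma, p))  # condition A: samples x features
Rb = rng.standard_normal((mb, p))  # condition B: samples x features

assert Ra.shape[1] == Rb.shape[1]  # same number of shared features
```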
Example
Neuroscience example:
- Rows → trials or time points
- Columns → neurons
Ra = trials × neurons (task condition)
Rb = trials × neurons (baseline condition)
Genomics example:
- Rows → cells
- Columns → genes
Ra = cells × genes (disease)
Rb = cells × genes (control)
Matching Features Between Conditions
Mismatched features are one of the most common sources of error.
Requirements:
- Same number of features
- Same feature order
- Same preprocessing pipeline
Incorrect examples:
- Different neuron ordering
- Missing neurons in one dataset
- Different gene sets
gcPCA assumes each column corresponds to the same feature in both datasets. It is fine for the datasets to have different numbers of samples.
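When each dataset comes with its own feature labels (neuron or gene IDs, for example), the two matrices can be aligned to a shared, consistently ordered feature set before running gcPCA. This is a sketch with made-up IDs, not part of the gcPCA API:

```python
import numpy as np

# Hypothetical feature IDs recorded alongside each dataset.
ids_a = np.array(["n1", "n2", "n3", "n4"])
ids_b = np.array(["n3", "n1", "n4", "n5"])  # different order, one non-shared ID

rng = np.random.default_rng(1)
Ra = rng.standard_normal((10, ids_a.size))
Rb = rng.standard_normal((12, ids_b.size))

# Keep only features present in BOTH datasets, in one canonical order.
shared = np.intersect1d(ids_a, ids_b)  # sorted shared IDs
cols_a = [np.where(ids_a == f)[0][0] for f in shared]
cols_b = [np.where(ids_b == f)[0][0] for f in shared]
Ra_aligned = Ra[:, cols_a]
Rb_aligned = Rb[:, cols_b]

# Columns now refer to the same feature in both matrices.
assert Ra_aligned.shape[1] == Rb_aligned.shape[1] == shared.size
```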
Normalization and Preprocessing
gcPCA operates on covariance structure, so preprocessing can affect results.
Recommended
- Mean-center features
- Z-score features
Python and R implementations perform normalization by default.
MATLAB normalization can be controlled with optional parameters.
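If you need to normalize manually (for example, to match what the built-in normalization does before disabling it), per-feature z-scoring is a few lines of numpy. The `eps` guard is an assumption added here to avoid dividing by zero for constant features:

```python
import numpy as np

def zscore_features(R, eps=1e-12):
    """Mean-center each column (feature) and scale it to unit variance."""
    mu = R.mean(axis=0)
    sd = R.std(axis=0)
    return (R - mu) / (sd + eps)  # eps guards against constant features

rng = np.random.default_rng(2)
Ra = rng.standard_normal((100, 5)) * 10 + 3  # arbitrary scale and offset
Ra_z = zscore_features(Ra)
# Each column of Ra_z now has mean ~0 and standard deviation ~1.
```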
When to Use Custom Normalization
Users may want to disable normalization when:
- Data already normalized
- Working with firing rates or standardized signals
- Using PCA-reduced data as input
Using PCA Before gcPCA
Applying PCA before gcPCA is not required, but may be helpful when:
- Feature dimensionality is extremely high
- Near-zero-variance dimensions make the solution unstable
- Numerical stability is a concern
In this case:
1. Apply PCA
2. Use PCA scores as input to gcPCA
This reduces dimensionality while preserving the dominant covariance structure.
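The two steps above can be sketched with a plain SVD. One detail the steps leave open is which data to fit the PCA basis on; fitting a single basis on the stacked data, as done here, is one reasonable choice (an assumption, not a prescription) that keeps the reduced columns comparable across conditions. `k = 20` is likewise an arbitrary example value:

```python
import numpy as np

rng = np.random.default_rng(3)
Ra = rng.standard_normal((60, 200))  # p >> n example
Rb = rng.standard_normal((50, 200))

# Fit one PCA basis on the stacked, mean-centered data so both
# conditions are projected onto the SAME components.
mu = np.vstack([Ra, Rb]).mean(axis=0)
X = np.vstack([Ra - mu, Rb - mu])
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 20                # hypothetical number of retained components
basis = Vt[:k].T      # (p x k) loading matrix

# PCA scores: use these as the inputs to gcPCA.
Ra_red = (Ra - mu) @ basis
Rb_red = (Rb - mu) @ basis
```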
Sample Size Considerations
gcPCA works well with:
- Moderate to large sample sizes
- High-dimensional datasets
gcPCA can handle:
- Different sample sizes between conditions
- p >> n settings (more features than samples)
Balancing sample sizes between conditions is good practice, but not required.
Handling Missing Data
gcPCA does not support missing values.
Before running gcPCA:
- Remove samples with missing values
- Impute missing values
- Interpolate if appropriate
No NaN values are allowed in Ra or Rb.
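The first two options above (dropping affected samples, or imputing) look like this in numpy; mean imputation is used here only as a simple illustration, and whether it is appropriate depends on your data:

```python
import numpy as np

rng = np.random.default_rng(4)
Ra = rng.standard_normal((8, 4))
Ra[2, 1] = np.nan
Ra[5, 3] = np.nan

# Option 1: drop samples (rows) that contain any missing value.
Ra_dropped = Ra[~np.isnan(Ra).any(axis=1)]

# Option 2: impute with the per-feature mean of the observed values.
col_means = np.nanmean(Ra, axis=0)
Ra_imputed = np.where(np.isnan(Ra), col_means, Ra)

assert not np.isnan(Ra_dropped).any()
assert not np.isnan(Ra_imputed).any()
```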
Scaling and Units
Because gcPCA analyzes covariance:
- Feature scaling affects results
- Large-scale features dominate
Examples:
- firing rates vs normalized activity
- gene counts vs log-transformed counts
Normalization helps ensure balanced contributions.
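A quick synthetic demonstration of the scaling problem: a raw-count feature carries orders of magnitude more variance than a unit-scale signal, so it would dominate any covariance-based analysis until the columns are rescaled. The Poisson counts and Gaussian signal below are invented stand-ins for the examples named above:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# Two features in different units: raw counts vs a normalized signal.
counts = rng.poisson(1000, size=n).astype(float)  # variance ~1000
signal = rng.standard_normal(n)                   # variance ~1

R = np.column_stack([counts, signal])
var = R.var(axis=0)
# The large-scale feature dwarfs the other in total variance.
assert var[0] > 100 * var[1]

# Z-scoring puts both features on an equal footing.
R_z = (R - R.mean(axis=0)) / R.std(axis=0)
```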
Common Pitfalls
Feature Mismatch
Different features between conditions:
- different neurons
- different genes
- different channels
This produces incorrect results.
Unequal Preprocessing
Example:
- Ra normalized
- Rb not normalized
This introduces artificial differences.
Too Few Samples
Small sample sizes can produce:
- noisy components
- unstable loadings
Highly Noisy Data
If noise dominates:
- gcPCs become harder to interpret
- consider smoothing or preprocessing
Quick Checklist
Before running gcPCA:
- Same features in Ra and Rb
- Samples in rows
- Features in columns
- No missing values
- Normalized or appropriately scaled
- Sufficient sample size
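The checklist can be turned into a small pre-flight function; `check_gcpca_inputs` and its `min_samples` threshold are hypothetical names introduced here, not part of the gcPCA package:

```python
import numpy as np

def check_gcpca_inputs(Ra, Rb, min_samples=10):
    """Hypothetical pre-flight checks mirroring the checklist above."""
    assert Ra.ndim == 2 and Rb.ndim == 2, "inputs must be samples x features"
    assert Ra.shape[1] == Rb.shape[1], "feature counts must match"
    assert not np.isnan(Ra).any() and not np.isnan(Rb).any(), "no missing values"
    assert Ra.shape[0] >= min_samples and Rb.shape[0] >= min_samples, "too few samples"
    return True

rng = np.random.default_rng(6)
ok = check_gcpca_inputs(rng.standard_normal((40, 12)),
                        rng.standard_normal((30, 12)))
```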
Summary
gcPCA works best when datasets:
- Share the same features
- Are consistently preprocessed
- Contain sufficient samples
- Are properly normalized
Following these guidelines improves interpretability and numerical stability.
Links to Other Pages
1. Quickstart Guide
2. Installation
3. Conceptual Overview
4. Mathematical Formulation
5. Code Reference
7. Interpreting Results