Identify Key Residues Driving Ligand Association - k-ngo/CATMD GitHub Wiki

Identify Key Residues Driving Ligand Association

Overview and Methodology

What It Does

This tool identifies residues critical for ligand association by analyzing molecular interactions across simulation frames and applying statistical models to rank their relevance to binding behavior.

How It Works

  • Objective: Highlight residues whose interaction patterns are most strongly associated with overall ligand binding.
  • Process:
    • Loads interaction matrices from the ▶️ (Run First) Extract Pairwise Interaction from Trajectory step.
    • Applies custom scale factors to different interaction types to reflect their relative importance.
    • Aggregates per-frame interaction strengths per residue.
    • Correlates each residue’s interaction profile with a global ligand association metric.
    • Ranks residues using Spearman correlation, Random Forest regression, and Mutual Information.
    • Visualizes key residues using bar plots and heatmaps.
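The scaling and aggregation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not CATMD's actual code: the array layout (frames by residues, one array per interaction type), the dictionary keys, the helper name `aggregate_interactions`, and the scale values are all hypothetical.

```python
import numpy as np

# Hypothetical per-type interaction counts, shape (n_frames, n_residues).
# The real tool loads these from the extraction step's CSV files.
raw = {
    "hbonds": np.array([[1, 0], [2, 0], [0, 1]], dtype=float),
    "hydrophobic": np.array([[0, 2], [1, 1], [0, 3]], dtype=float),
}

# Illustrative scale factors (assumed values, not the tool's defaults).
scale = {"hbonds": 1.0, "hydrophobic": 0.5}


def aggregate_interactions(raw, scale):
    """Sum scaled interaction counts over all types, per frame and residue."""
    total = None
    for name, counts in raw.items():
        scaled = counts * scale.get(name, 1.0)
        total = scaled if total is None else total + scaled
    return total


# One scaled interaction-strength time series per residue (columns).
per_residue = aggregate_interactions(raw, scale)
print(per_residue)
```

Each column of `per_residue` is the time series that the later ranking steps compare against the global ligand association metric.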

Configuration and Inputs

Prerequisites

  • Requires a loaded trajectory.
  • Requires interaction CSVs generated using the same sel1_name and sel2_name from the extraction step.

Key Configuration Options

  • Selection Labels:

    • sel1_name, sel2_name: Labels identifying the two selections being analyzed (e.g., protein and ligand). They must match the labels used in the extraction step.
  • Interaction Mode:

    • interaction_mode: Define the interaction scope:
      • interchain + intrachain: All interactions.
      • interchain: Only between different chains.
      • intrachain: Only within the same chain.
  • Interaction Types and Weighting:

    • Enable or disable interaction types such as:
      • Hydrogen Bonds, Hydrophobic Contacts, Salt Bridges, π–π Stacking, Cation–π.
    • Adjust scale factors to control their contribution to the ligand association score.
  • Filtering Options:

    • pct_hide_threshold: Hide residues that form contacts in fewer than this percentage of frames (e.g., a value of 20 excludes residues with contacts in <20% of frames).
  • Statistical Modeling:

    • top_n_residues: Number of top-ranked residues to display.
    • n_estimators: Number of decision trees in the Random Forest regression model, which learns to predict ligand association intensity from residue-level interaction patterns. Residues whose interactions the trees rely on most for accurate predictions receive higher importance. More trees (e.g., 100–500) typically give more stable importance rankings with less run-to-run variance; fewer trees run faster but can produce noisier, less reliable feature importances.
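To make the role of n_estimators concrete, here is a minimal sketch using scikit-learn's `RandomForestRegressor`. Whether CATMD uses this exact class and these settings is an assumption; the data is synthetic, and the helper name `rf_importance` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 frames x 4 residues of interaction intensity.
X = rng.random((200, 4))
# Synthetic target that depends mostly on residue 0 and weakly on residue 2.
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + 0.05 * rng.standard_normal(200)


def rf_importance(X, y, n_estimators=200, seed=0):
    """Fit a Random Forest regressor and return per-residue importances."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    model.fit(X, y)
    return model.feature_importances_


importances = rf_importance(X, y)
ranking = np.argsort(importances)[::-1]  # residue indices, most important first
print(ranking, importances.round(3))
```

Fixing `random_state` makes a single run reproducible. Rerunning with different seeds and a small `n_estimators` is a quick way to see how much the ranking fluctuates.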

Output

  • Ranking Results:

    • CSV report of residues with their:
      • Spearman Correlation
      • P-Value
      • Random Forest Importance
      • Mutual Information Score
      • Composite Score (normalized average of all three metrics)
  • Saved Files:

    • CSV file: key_residues_<sel1>_<sel2>.csv
    • Plot images:
      • key_residues_<sel1>_<sel2>.png: Individual ranking bar charts
      • key_residues_heatmap_<sel1>_<sel2>.png: Time-resolved contact heatmap
      • combined_importance_<sel1>_<sel2>.png: Composite score plots by selection group
      • contact_count_heatbar_<sel1>_<sel2>.png: Legend for heatmap intensity
  • Visualizations:

    • Bar charts of top residues by each metric.
    • Time-resolved heatmap showing per-frame interaction intensity for top residues.
    • Colorbar legend with contact intensity scale.

Interpreting the Results

  • What Makes a Residue “Key”?

    • Key residues are those whose interaction activity over time correlates best with the overall ligand association metric. They are not necessarily the residues with the highest total number of contacts.
  • Why Contact Count Alone Is Not Enough:

    • A residue might frequently contact the ligand but only when the ligand is loosely bound or transitioning.
    • Conversely, less frequent but highly predictive interactions may reflect binding onset, stabilization, or gating roles, and thus carry higher biological relevance.
    • Statistical models capture temporal coordination and predictive value, not just frequency.
  • Quantifying Ligand Association as a Metric:

    • To determine which residues are most influential in ligand binding, each residue’s interaction activity across simulation frames is statistically compared to an aggregated ligand association metric. This metric serves as a proxy for how "bound" the ligand is over time and is calculated as follows:

      • Ligand association is quantified by computing the total interaction intensity between the ligand and its partner residues at each simulation frame.
      • Specifically, the sum of all scaled interactions involving the ligand (e.g., hydrogen bonds, salt bridges, π–π contacts) is computed for each frame.
      • This results in a time series that reflects how strongly the ligand is interacting at each time point. Peaks in this series indicate stronger or more numerous interactions, interpreted as frames where the ligand is more fully associated with the binding site.
      • Residue-specific time series are then correlated with this global association metric, revealing which residues’ interaction patterns most closely track ligand binding strength.
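The metric described above reduces to a per-frame sum, sketched below. The input array and the helper name `association_metric` are hypothetical; the sketch assumes the scaled per-residue intensities are already available as a frames-by-residues array.

```python
import numpy as np

# Hypothetical scaled interaction intensities, shape (n_frames, n_residues).
per_residue = np.array([
    [1.0, 1.0, 0.0],
    [2.5, 0.5, 1.0],
    [0.0, 2.5, 0.0],
    [0.0, 0.0, 0.0],  # ligand has no contacts in this frame
])


def association_metric(per_residue):
    """Total scaled interaction intensity involving the ligand, per frame."""
    return per_residue.sum(axis=1)


assoc = association_metric(per_residue)
most_bound_frame = int(np.argmax(assoc))
print(assoc, most_bound_frame)
```

Frames with a high value in `assoc` are the ones interpreted as more fully associated; a value of zero means no counted interactions in that frame.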
  • Statistical Measures Used:

    • Spearman Correlation

      • Measures how well a residue’s interaction activity increases or decreases in step with the ligand association score across time.
      • Does not assume linearity; useful for detecting monotonic trends.
      • A high positive value indicates that the residue's interactions strengthen when ligand association is strong, rather than simply being present throughout the simulation.
    • Random Forest Importance

      • Uses a machine learning model to predict ligand association from residue contact profiles.
      • Captures nonlinear relationships, interaction synergies, and contextual dependencies.
      • Residues with high importance scores contribute significantly to predicting when the ligand is strongly associated.
    • Mutual Information

      • Measures how much information is shared between a residue’s interaction time series and the ligand association metric.
      • Does not rely on monotonic or linear relationships.
      • Especially powerful for identifying residues that convey binding-relevant signals, even if the relationship is complex or discontinuous.
    • These three metrics complement each other:

      • Spearman highlights temporally coordinated interaction patterns.
      • Random Forest surfaces predictive combinations of residues.
      • Mutual Information reveals non-obvious but relevant interaction signals.
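The difference between Spearman correlation and Mutual Information can be seen on synthetic data. The sketch below uses `scipy.stats.spearmanr` and scikit-learn's `mutual_info_regression`; that CATMD calls these particular functions is an assumption, and the three residue profiles are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
n_frames = 300

assoc = rng.random(n_frames)       # stand-in ligand association metric
monotonic = assoc ** 3             # rises with assoc, but not linearly
u_shaped = (assoc - 0.5) ** 2      # depends on assoc, but not monotonically
unrelated = rng.random(n_frames)   # independent of assoc

X = np.column_stack([monotonic, u_shaped, unrelated])

# Spearman correlation and p-value for each residue profile.
rho, pvals = [], []
for j in range(X.shape[1]):
    r, p = spearmanr(X[:, j], assoc)
    rho.append(float(r))
    pvals.append(float(p))

# Mutual information between each residue profile and the metric.
mi = mutual_info_regression(X, assoc, random_state=0)

print(np.round(rho, 3), np.round(mi, 3))
```

The U-shaped profile gets a Spearman value near zero yet a clearly higher Mutual Information than the unrelated profile, which is the kind of signal a purely monotonic measure would miss.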
  • Composite Score:

    • A composite score is calculated by normalizing and averaging all three metrics, providing a robust consensus for ranking residues driving ligand association.
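One way to compute such a composite is min-max normalization followed by a plain average, sketched below. The source does not state which normalization CATMD applies or how negative Spearman values are treated, so both choices here are assumptions, as are the helper names and the example values.

```python
import numpy as np


def min_max(values):
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span


def composite_score(spearman, rf_importance, mutual_info):
    """Average of the three min-max normalized metrics, per residue."""
    parts = [min_max(spearman), min_max(rf_importance), min_max(mutual_info)]
    return np.mean(parts, axis=0)


# Hypothetical metric values for three residues.
spearman = [0.8, 0.2, 0.5]
rf = [0.6, 0.1, 0.3]
mi = [0.9, 0.0, 0.3]

scores = composite_score(spearman, rf, mi)
order = np.argsort(scores)[::-1]  # residue indices, highest composite first
print(scores.round(3), order)
```

Normalizing first keeps one metric's numeric range from dominating the average.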
  • Persistence Threshold:

    • Raising pct_hide_threshold focuses on stable contributors.
    • Lowering it includes more transient but potentially critical residues.
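A persistence filter of this kind might look like the sketch below. It assumes that a residue counts as "in contact" in a frame when its scaled intensity is above zero and that the comparison is inclusive; neither detail is confirmed by the source, and the helper name `persistent_residues` is hypothetical.

```python
import numpy as np


def persistent_residues(per_residue, pct_hide_threshold):
    """Return kept residue indices and each residue's contact percentage."""
    in_contact = per_residue > 0                  # assumed contact criterion
    pct_frames = 100.0 * in_contact.mean(axis=0)  # % of frames with contact
    keep = np.flatnonzero(pct_frames >= pct_hide_threshold)
    return keep, pct_frames


# Hypothetical intensities: 5 frames x 3 residues.
per_residue = np.array([
    [1.0, 0.0, 0.5],
    [2.0, 0.0, 0.0],
    [1.5, 1.0, 1.0],
    [0.5, 0.0, 0.0],
    [1.0, 0.0, 2.0],
])

keep_strict, pct = persistent_residues(per_residue, pct_hide_threshold=50)
keep_loose, _ = persistent_residues(per_residue, pct_hide_threshold=10)
print(pct.round(1), keep_strict, keep_loose)
```

With the higher threshold the residue seen in only one of five frames is hidden; with the lower threshold it is retained.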
  • Sidechain Labels:

    • Residue names may include * to denote interactions that originate from the sidechain, as labeled during the extraction step.
  • How Can Some Residues Have Half a Contact?

    • The code multiplies interaction values by user-defined scale factors (e.g., hbonds_scale, hydrophobic_scale) before summing them. The result is a non-integer contact count whenever a scale factor is not a whole number (e.g., 0.5) or the raw interaction values are themselves fractional.
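A short worked example shows how a fractional count arises. The parameter names hbonds_scale and hydrophobic_scale come from the text above; the raw counts, the scale values, and the helper name `scaled_contacts` are made up for illustration.

```python
def scaled_contacts(raw_counts, scales):
    """Sum raw interaction counts after applying per-type scale factors."""
    return sum(count * scales[name] for name, count in raw_counts.items())


# Hypothetical raw counts for one residue in one frame.
raw_counts = {"hbonds": 3, "hydrophobic": 2}
# Illustrative scale factors (assumed values, not the tool's defaults).
scales = {"hbonds": 0.5, "hydrophobic": 1.0}

total = scaled_contacts(raw_counts, scales)
print(total)  # 3 * 0.5 + 2 * 1.0 = 3.5
```

Three hydrogen bonds weighted at 0.5 contribute 1.5, so the residue ends up with 3.5 "contacts" even though every raw count is an integer.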

Example Scenarios

Detecting Binding Hotspots

  • Scenario: Identify amino acids in a binding pocket that stabilize the ligand throughout the trajectory.
  • Observation: Key residues exhibit both frequent and well-timed interactions aligning with ligand dwell time.
  • Interpretation: Candidates for mutational testing or pharmacophore modeling.

Evaluating Gating Residues

  • Scenario: Discover residues outside the direct binding pocket that influence ligand entry or retention.
  • Observation: Low contact frequency but high correlation or mutual information.
  • Interpretation: Suggests conformational switches or allosteric regulation points.

Validating Selectivity Determinants

  • Scenario: Compare toxin binding to different channel isoforms.
  • Observation: Isoform-specific key residues highlight structural motifs underlying specificity.
  • Interpretation: Useful for selectivity engineering or drug design.

Usage Tips

  • Avoid Overweighting:

    • Use scale factors carefully to avoid biasing results toward one interaction type.
  • Cross-Check:

    • Validate top residues by checking that they rank consistently across the Spearman correlation, Random Forest, and Mutual Information scores.
