Bias and Overfitting in Road-Network GNNs

This page discusses how bias and overfitting affect Graph Neural Networks (GNNs), particularly in the context of your thesis: predicting NO₂ levels using road network data with a transductive learning setup.


🔍 1. Bias in GNNs

Bias refers to the built-in assumptions a model uses to simplify learning; these help when they match the data and hurt when they do not.

1.1 Homophily Bias

GNNs assume that connected nodes have similar values (homophily).
In road networks: "Nearby segments have similar air quality."

✅ Often valid due to spatial correlation in NO₂
❌ Problematic for sudden transitions (e.g., from residential to highway)

Risks:

  • Oversmoothing: Node features become indistinguishable across layers
  • Overrepresentation: Densely sampled regions dominate learning

Mitigation:

  • Use only 2–3 GNN layers so each node's receptive field stays local
  • Apply residual connections or Jumping Knowledge (see the sketch below)
  • Use multi-resolution graphs to balance local/global patterns
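
A minimal PyTorch Geometric sketch of the first two mitigations. The layer count, hidden size, and JK mode (`'max'`) are illustrative assumptions, not the thesis configuration:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, JumpingKnowledge

class ShallowGCN(torch.nn.Module):
    """2-layer GCN with residual connections and Jumping Knowledge
    to counteract oversmoothing."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.lin_in = torch.nn.Linear(in_dim, hidden_dim)  # project input so residuals align
        self.conv1 = GCNConv(hidden_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.jk = JumpingKnowledge(mode='max')             # element-wise max over layer outputs
        self.head = torch.nn.Linear(hidden_dim, 1)         # NO2 regression head

    def forward(self, x, edge_index):
        h0 = self.lin_in(x)
        h1 = F.relu(self.conv1(h0, edge_index)) + h0       # residual connection
        h2 = F.relu(self.conv2(h1, edge_index)) + h1       # residual connection
        return self.head(self.jk([h1, h2])).squeeze(-1)
```

The residual additions keep successive layer inputs distinguishable, and Jumping Knowledge lets the output draw on whichever depth is most informative for each node.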

1.2 Message Passing Bias

GNNs rely on graph edges to pass information.

Implications:

  • Isolated segments receive few messages, so predictions for them may be poor
  • Noisy features from some neighbors can propagate through the graph

Solutions used:

  • Graph construction based on physical connectivity, plus optional similarity edges (sketched below)
  • Multi-resolution graphs to expand message reach
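
A sketch of that graph construction with toy tensors; the segment count, covariate matrix, and `k` are placeholders, not the thesis values:

```python
import torch
from torch_geometric.nn import knn_graph
from torch_geometric.utils import coalesce

# Toy stand-ins: 6 road segments with 4 covariates each.
feature_matrix = torch.randn(6, 4)

# Physical connectivity: index pairs of segments that share a junction.
physical_edges = torch.tensor([[0, 1, 2, 3, 4],
                               [1, 2, 3, 4, 5]])

# Optional similarity edges: link segments with similar covariates so that
# physically isolated segments still receive messages.
similarity_edges = knn_graph(feature_matrix, k=2)

# Union of both edge sets; coalesce() removes any duplicate edges.
edge_index = coalesce(torch.cat([physical_edges, similarity_edges], dim=1))
```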

⚠️ 2. Overfitting in Transductive GNNs

Overfitting happens when a model memorizes noise in the training data instead of learning patterns that generalize.

2.1 Leakage Risk in Transductive Setting

In transductive learning:

  • All nodes (train/test) are visible during training
  • Only test target values are masked

⚠️ Test-node features still participate in message passing during training, so information about the test set leaks into the learned representations (see the masking sketch below)

✅ Fine for interpolation across the same road network
❌ Not ideal for generalization to unseen areas
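
A sketch of how the masking works in practice. The function and variable names are hypothetical; `y` holds NO₂ targets for all nodes, and `train_mask` is a boolean node vector:

```python
import torch
import torch.nn.functional as F

def train_step(model, x, edge_index, y, train_mask, optimizer):
    model.train()
    optimizer.zero_grad()
    # Message passing runs over the FULL graph, test nodes included ...
    pred = model(x, edge_index)
    # ... but the loss only sees training targets; test labels stay masked out.
    loss = F.mse_loss(pred[train_mask], y[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because `pred` is computed on the full graph, test-node features still shape the training-node embeddings; that is exactly the leakage flagged above.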


🧪 3. Train/Test Split Evaluation

You tested robustness across different train/test split ratios:

Baseline GCN

| Split (train/test) | Pearson R | RMSE | MAE  |
|--------------------|-----------|------|------|
| 80/20              | 0.71      | 8.81 | 6.30 |
| 60/40              | 0.72      | 8.76 | 6.26 |
| 40/60              | 0.71      | 8.73 | 6.25 |
| 20/80              | 0.71      | 8.80 | 6.28 |

Baseline GAT

| Split (train/test) | Pearson R | RMSE | MAE  |
|--------------------|-----------|------|------|
| 80/20              | 0.76      | 8.18 | 5.79 |
| 60/40              | 0.76      | 8.17 | 5.77 |
| 40/60              | 0.76      | 8.14 | 5.76 |
| 20/80              | 0.75      | 8.24 | 5.81 |

Insight:
Performance is stable across split ratios, which points to low variance; in a transductive setting, however, this stability can also reflect information leakage rather than genuine generalization.
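
A sketch of the metric computation behind these tables (NumPy/SciPy; the array names are placeholders):

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(pred, target):
    """Pearson R, RMSE, and MAE over the held-out nodes of one split."""
    pred, target = np.asarray(pred), np.asarray(target)
    r, _ = pearsonr(pred, target)
    rmse = float(np.sqrt(np.mean((pred - target) ** 2)))
    mae = float(np.mean(np.abs(pred - target)))
    return r, rmse, mae
```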


✅ 4. Summary

  • Your GNN uses a transductive approach: full graph visible during training, targets masked
  • Overfitting is controlled with dropout, early stopping, and graph design (see the early-stopping sketch below)
  • External test locations (e.g., Palmes Tubes) are needed for true generalization tests
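
For reference, a minimal early-stopping loop in the spirit of the setup above. It reuses the hypothetical `train_step` from Section 2.1 and assumes a boolean `val_mask` held out from the training nodes; the patience and epoch budget are placeholders:

```python
import copy
import torch
import torch.nn.functional as F

best_loss, best_state, patience, stale = float('inf'), None, 50, 0
for epoch in range(1000):
    train_step(model, x, edge_index, y, train_mask, optimizer)
    model.eval()
    with torch.no_grad():
        val_loss = F.mse_loss(model(x, edge_index)[val_mask], y[val_mask]).item()
    if val_loss < best_loss:          # new best: snapshot the weights
        best_loss, best_state, stale = val_loss, copy.deepcopy(model.state_dict()), 0
    else:                             # no improvement this epoch
        stale += 1
        if stale >= patience:         # stop after 50 stale epochs
            break
model.load_state_dict(best_state)    # restore the best checkpoint
```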