# Bias and Overfitting in Road-Network GNNs
This page discusses how bias and overfitting affect Graph Neural Networks (GNNs), particularly in the context of your thesis: predicting NO₂ levels using road network data with a transductive learning setup.
## 🔍 1. Bias in GNNs
Bias refers to the simplifying assumptions a model builds into learning. In GNNs, the two most relevant built-in assumptions are homophily and message passing along edges.
### 1.1 Homophily Bias
GNNs assume that connected nodes have similar values (homophily). In a road network this translates to: "nearby segments have similar air quality."

- ✅ Often valid, because NO₂ is spatially correlated
- ❌ Problematic at sharp transitions (e.g., from a residential street to a highway)
**Risks:**
- Oversmoothing: node representations become indistinguishable after too many message-passing layers
- Overrepresentation: densely sampled regions dominate the learning signal
**Mitigation:**
- Use only 2–3 GNN layers
- Apply residual connections or Jumping Knowledge (see the sketch below)
- Use multi-resolution graphs to balance local and global patterns
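A minimal sketch of the residual/Jumping-Knowledge idea in PyTorch Geometric (the class name, layer sizes, and the one-dimensional regression head are illustrative assumptions, not the thesis code):

```python
# Illustrative sketch, not the thesis implementation.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, JumpingKnowledge

class ResidualGCN(torch.nn.Module):
    """Shallow GCN with residual connections and Jumping Knowledge."""
    def __init__(self, in_dim, hidden_dim=64, num_layers=3):
        super().__init__()
        self.input_proj = torch.nn.Linear(in_dim, hidden_dim)
        self.convs = torch.nn.ModuleList(
            [GCNConv(hidden_dim, hidden_dim) for _ in range(num_layers)]
        )
        # 'cat' keeps every layer's output, so early (local) representations
        # survive next to deeper (smoothed) ones.
        self.jk = JumpingKnowledge(mode='cat')
        self.head = torch.nn.Linear(num_layers * hidden_dim, 1)  # NO2 regression

    def forward(self, x, edge_index):
        x = self.input_proj(x)
        outs = []
        for conv in self.convs:
            x = F.relu(conv(x, edge_index)) + x  # residual connection
            outs.append(x)
        return self.head(self.jk(outs)).squeeze(-1)
```

Keeping the layer count low and concatenating all intermediate representations are two complementary defenses against oversmoothing: the first limits how far features get averaged, the second lets the output layer fall back on less-smoothed features.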
### 1.2 Message Passing Bias
GNNs can only propagate information along graph edges.

**Implications:**
- Isolated or poorly connected segments receive little information and may be learned badly
- Noise from neighboring segments can propagate into a node's representation
**Solutions used:**
- Graph construction based on physical connectivity, with optional similarity edges (see the sketch below)
- Multi-resolution graphs to expand the reach of message passing
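A sketch of how similarity edges could be added on top of the physical road graph (assuming PyTorch Geometric with `torch_cluster` installed for `knn_graph`; `road_edge_index` and the feature matrix `x` are assumed inputs, not names from the project):

```python
# Illustrative sketch: union of physical road edges and optional
# feature-similarity edges, so sparsely connected segments still
# receive messages.
import torch
from torch_geometric.nn import knn_graph  # requires torch_cluster

def build_edges(road_edge_index, x, k=5, add_similarity=True):
    """road_edge_index: [2, E] physical connectivity; x: [N, F] features."""
    if not add_similarity:
        return road_edge_index
    # Connect each segment to its k most similar segments in feature space.
    sim_edge_index = knn_graph(x, k=k, loop=False)
    edge_index = torch.cat([road_edge_index, sim_edge_index], dim=1)
    # Remove duplicate edges created by the union of the two sets.
    return torch.unique(edge_index, dim=1)
```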
## ⚠️ 2. Overfitting in Transductive GNNs
Overfitting happens when a model memorizes noise in the training data instead of learning patterns that generalize.
### 2.1 Leakage Risk in the Transductive Setting
In transductive learning:

- All nodes (train and test) are visible during training
- Only the test nodes' target values are masked

⚠️ Test nodes may still receive messages from training neighbors, causing information leakage (see the training-step sketch below)

- ✅ Fine for interpolation across the same road network
- ❌ Not ideal for judging generalization to unseen areas
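A sketch of what this setup looks like in code (a generic PyTorch Geometric-style training step; `model`, `data`, and the masks are assumed to exist). Note how the forward pass runs on the full graph while the loss only touches training nodes, which is exactly where the leakage described above enters:

```python
# Illustrative sketch of transductive training with masked targets.
import torch
import torch.nn.functional as F

def train_step(model, data, optimizer):
    model.train()
    optimizer.zero_grad()
    # Forward pass over ALL nodes: test nodes participate in message
    # passing even though their targets are hidden (the leakage risk).
    pred = model(data.x, data.edge_index)
    loss = F.mse_loss(pred[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def test_rmse(model, data):
    model.eval()
    pred = model(data.x, data.edge_index)
    return F.mse_loss(pred[data.test_mask], data.y[data.test_mask]).sqrt().item()
```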
## 🧪 3. Train/Test Split Evaluation
You tested robustness across different train/test split ratios:
### Baseline GCN

| Train/test split | Pearson R | RMSE | MAE |
|---|---|---|---|
| 80/20 | 0.71 | 8.81 | 6.30 |
| 60/40 | 0.72 | 8.76 | 6.26 |
| 40/60 | 0.71 | 8.73 | 6.25 |
| 20/80 | 0.71 | 8.80 | 6.28 |
### Baseline GAT

| Train/test split | Pearson R | RMSE | MAE |
|---|---|---|---|
| 80/20 | 0.76 | 8.18 | 5.79 |
| 60/40 | 0.76 | 8.17 | 5.77 |
| 40/60 | 0.76 | 8.14 | 5.76 |
| 20/80 | 0.75 | 8.24 | 5.81 |
**Insight:**
Performance is nearly constant across splits, even with only 20% of nodes used for training. This suggests low variance, but the stability may partly reflect transductive leakage rather than genuine generalization (see the sketch below).
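A sketch of how such a split-robustness check can be scripted (assuming NumPy arrays and a hypothetical `train_and_predict` helper that fits the model and returns predictions for all nodes; neither is from the project):

```python
# Illustrative sketch: retrain on random train fractions and report
# Pearson R, RMSE, and MAE on the held-out nodes.
import numpy as np
from scipy.stats import pearsonr

def evaluate_split(data, train_frac, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data.y)  # data.y assumed to be a NumPy array of NO2 targets
    train_mask = np.zeros(n, dtype=bool)
    train_mask[rng.choice(n, size=int(train_frac * n), replace=False)] = True

    pred = train_and_predict(data, train_mask)  # hypothetical helper
    y_true, y_pred = data.y[~train_mask], pred[~train_mask]

    r, _ = pearsonr(y_true, y_pred)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    return r, rmse, mae

# Sweep the same split ratios as the tables above (data is assumed to exist):
for frac in (0.8, 0.6, 0.4, 0.2):
    print(frac, evaluate_split(data, frac))
```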
## ✅ 4. Summary
- Your GNN uses a transductive approach: the full graph is visible during training, with only target values masked
- Overfitting is controlled with dropout, early stopping, and careful graph design
- External test locations (e.g., Palmes tube measurements) are needed for a true generalization test