# Bias and Overfitting in Road-Network GNNs
This page discusses how bias and overfitting affect Graph Neural Networks (GNNs), particularly in the context of your thesis: predicting NO₂ levels using road network data with a transductive learning setup.
## 🔍 1. Bias in GNNs
Bias refers to the simplifying assumptions a model builds into learning. In GNNs, the two most relevant built-in assumptions are homophily and message passing along edges.
### 1.1 Homophily Bias
GNNs assume that connected nodes have similar values (homophily). In a road network this translates to: "nearby segments have similar air quality."

- ✅ Often valid, because NO₂ is spatially correlated
- ❌ Problematic at sharp transitions (e.g., from a residential street to a highway)
**Risks:**
- Oversmoothing: node representations become indistinguishable after too many message-passing layers
- Overrepresentation: densely sampled regions dominate the learning signal
**Mitigation:**
- Use only 2–3 GNN layers
- Apply residual connections or Jumping Knowledge (see the sketch below)
- Use multi-resolution graphs to balance local and global patterns
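A minimal sketch of the residual/Jumping-Knowledge idea in PyTorch Geometric (the class name, layer sizes, and the one-dimensional regression head are illustrative assumptions, not the thesis code):

```python
# Illustrative sketch, not the thesis implementation.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, JumpingKnowledge

class ResidualGCN(torch.nn.Module):
    """Shallow GCN with residual connections and Jumping Knowledge."""
    def __init__(self, in_dim, hidden_dim=64, num_layers=3):
        super().__init__()
        self.input_proj = torch.nn.Linear(in_dim, hidden_dim)
        self.convs = torch.nn.ModuleList(
            [GCNConv(hidden_dim, hidden_dim) for _ in range(num_layers)]
        )
        # 'cat' keeps every layer's output, so early (local) representations
        # survive next to deeper (smoothed) ones.
        self.jk = JumpingKnowledge(mode='cat')
        self.head = torch.nn.Linear(num_layers * hidden_dim, 1)  # NO2 regression

    def forward(self, x, edge_index):
        x = self.input_proj(x)
        outs = []
        for conv in self.convs:
            x = F.relu(conv(x, edge_index)) + x  # residual connection
            outs.append(x)
        return self.head(self.jk(outs)).squeeze(-1)
```

Keeping the layer count low and concatenating all intermediate representations are two complementary defenses against oversmoothing: the first limits how far features get averaged, the second lets the output layer fall back on less-smoothed features.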
### 1.2 Message Passing Bias
GNNs can only propagate information along graph edges.

**Implications:**
- Isolated or poorly connected segments receive little information and may be learned badly
- Noise from neighboring segments can propagate into a node's representation
**Solutions used:**
- Graph construction based on physical connectivity, with optional similarity edges (see the sketch below)
- Multi-resolution graphs to expand the reach of message passing
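A sketch of how similarity edges could be added on top of the physical road graph (assuming PyTorch Geometric with `torch_cluster` installed for `knn_graph`; `road_edge_index` and the feature matrix `x` are assumed inputs, not names from the project):

```python
# Illustrative sketch: union of physical road edges and optional
# feature-similarity edges, so sparsely connected segments still
# receive messages.
import torch
from torch_geometric.nn import knn_graph  # requires torch_cluster

def build_edges(road_edge_index, x, k=5, add_similarity=True):
    """road_edge_index: [2, E] physical connectivity; x: [N, F] features."""
    if not add_similarity:
        return road_edge_index
    # Connect each segment to its k most similar segments in feature space.
    sim_edge_index = knn_graph(x, k=k, loop=False)
    edge_index = torch.cat([road_edge_index, sim_edge_index], dim=1)
    # Remove duplicate edges created by the union of the two sets.
    return torch.unique(edge_index, dim=1)
```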
## ⚠️ 2. Overfitting in Transductive GNNs
Overfitting happens when a model memorizes noise in the training data instead of learning patterns that generalize.
### 2.1 Leakage Risk in the Transductive Setting
In transductive learning:

- All nodes (train and test) are visible during training
- Only the test nodes' target values are masked

⚠️ Test nodes may still receive messages from training neighbors, causing information leakage (see the training-step sketch below)

- ✅ Fine for interpolation across the same road network
- ❌ Not ideal for judging generalization to unseen areas
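A sketch of what this setup looks like in code (a generic PyTorch Geometric-style training step; `model`, `data`, and the masks are assumed to exist). Note how the forward pass runs on the full graph while the loss only touches training nodes, which is exactly where the leakage described above enters:

```python
# Illustrative sketch of transductive training with masked targets.
import torch
import torch.nn.functional as F

def train_step(model, data, optimizer):
    model.train()
    optimizer.zero_grad()
    # Forward pass over ALL nodes: test nodes participate in message
    # passing even though their targets are hidden (the leakage risk).
    pred = model(data.x, data.edge_index)
    loss = F.mse_loss(pred[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def test_rmse(model, data):
    model.eval()
    pred = model(data.x, data.edge_index)
    return F.mse_loss(pred[data.test_mask], data.y[data.test_mask]).sqrt().item()
```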
## 🧪 3. Train/Test Split Evaluation
You tested robustness across different train/test split ratios:
### Baseline GCN

| Train/test split | Pearson R | RMSE | MAE |
|---|---|---|---|
| 80/20 | 0.71 | 8.81 | 6.30 |
| 60/40 | 0.72 | 8.76 | 6.26 |
| 40/60 | 0.71 | 8.73 | 6.25 |
| 20/80 | 0.71 | 8.80 | 6.28 |
### Baseline GAT

| Train/test split | Pearson R | RMSE | MAE |
|---|---|---|---|
| 80/20 | 0.76 | 8.18 | 5.79 |
| 60/40 | 0.76 | 8.17 | 5.77 |
| 40/60 | 0.76 | 8.14 | 5.76 |
| 20/80 | 0.75 | 8.24 | 5.81 |
**Insight:**
Performance is nearly constant across splits, even with only 20% of nodes used for training. This suggests low variance, but the stability may partly reflect transductive leakage rather than genuine generalization (see the sketch below).
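A sketch of how such a split-robustness check can be scripted (assuming NumPy arrays and a hypothetical `train_and_predict` helper that fits the model and returns predictions for all nodes; neither is from the project):

```python
# Illustrative sketch: retrain on random train fractions and report
# Pearson R, RMSE, and MAE on the held-out nodes.
import numpy as np
from scipy.stats import pearsonr

def evaluate_split(data, train_frac, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data.y)  # data.y assumed to be a NumPy array of NO2 targets
    train_mask = np.zeros(n, dtype=bool)
    train_mask[rng.choice(n, size=int(train_frac * n), replace=False)] = True

    pred = train_and_predict(data, train_mask)  # hypothetical helper
    y_true, y_pred = data.y[~train_mask], pred[~train_mask]

    r, _ = pearsonr(y_true, y_pred)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    return r, rmse, mae

# Sweep the same split ratios as the tables above (data is assumed to exist):
for frac in (0.8, 0.6, 0.4, 0.2):
    print(frac, evaluate_split(data, frac))
```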
## ✅ 4. Summary
- Your GNN uses a transductive approach: the full graph is visible during training, with only target values masked
- Overfitting is controlled with dropout, early stopping, and careful graph design
- External test locations (e.g., Palmes tube measurements) are needed for a true generalization test