David's Graph Design - davidlabee/Graph4Air GitHub Wiki
Graph Augmentation for Road Network GNNs
In dense urban settings like Amsterdam, road networks are primarily local: most nodes (road segments) are only connected to immediate neighbors. This leads to long-range information bottlenecks in Graph Neural Networks (GNNs). To address this, we developed a feature-similarity-based augmentation pipeline that adds semantic shortcut edges—i.e., connections between distant but similar locations. This reduces the number of hops required for relevant information to propagate through the graph.
Motivation
This approach is based on the homophily principle: nodes that are similar in features often share target values (e.g., NO₂ concentration). By connecting functionally analogous road segments (e.g., two residential roads with similar traffic), the GNN can learn shared patterns more effectively—leading to improved predictive performance.
Conceptual Pipeline
- Group nodes by urban function (e.g. industrial, natural, traffic-heavy).
- For each group:
- Select top nodes by intensity (e.g. highest industrial area proportion).
- Extract feature vectors for each node.
- Run cosine-based kNN to find semantically similar nodes.
- Filter edges using the following:
- Euclidean distance ≥
min_dist
and ≤max_dist
- Graph-hop distance >
hop_thresh
- Not on same road (
ROAD_FID
) - Not already connected
- Per-node and global caps
- Euclidean distance ≥
- Add to graph: All accepted edges are added to the original road topology as virtual edges labeled
feature_sim
.
Efficiency
- kNN in feature space is computed using
sklearn.neighbors.NearestNeighbors
(cosine metric), which is efficient and scalable to large graphs (~50k nodes). - Graph shortest-paths are queried only locally (≤ hop_thresh).
- Total complexity is sub-quadratic due to sparse candidate generation and pruning.
Results
Using grid search over augmentation parameters:
- Best GCN: RMSE ≈ 8.48, R² ≈ 0.524
- Best GAT: RMSE ≈ 8.08, R² ≈ 0.567
- Best parameters: neighbors=500, sim_thresh=0.995, hop_thresh=4, max_dist=4000, per_node_cap=2
This confirms long-range, high-similarity connections improve performance.
Best Performing Results (Top 10 by GCN RMSE)
Rank | Neighbors | Max Dist (m) | Hop Thresh | Max Edges | Cap/Node | GCN RMSE | GCN R² | GAT RMSE | GAT R² |
---|---|---|---|---|---|---|---|---|---|
1 | 500.0 | 4000 | 4.0 | 1000.0 | 2.0 | 8.48 | 0.524 | 8.08 | 0.567 |
2 | 500.0 | 2000 | 4.0 | 1000.0 | 2.0 | 8.61 | 0.509 | 8.23 | 0.552 |
3 | 500.0 | 2000 | 4.0 | 2000.0 | 2.0 | 8.65 | 0.491 | 8.09 | 0.554 |
4 | 1000.0 | 2000 | 4.0 | 1000.0 | 2.0 | 8.66 | 0.511 | 8.23 | 0.559 |
5 | 500.0 | 1000 | 4.0 | 1000.0 | 2.0 | 8.72 | 0.506 | 8.29 | 0.554 |
6 | 1000.0 | 4000 | 4.0 | 1000.0 | 2.0 | 8.74 | 0.490 | 8.08 | 0.564 |
7 | 500.0 | 8000 | 4.0 | 1000.0 | 2.0 | 8.74 | 0.517 | 8.31 | 0.564 |
8 | 500.0 | 1000 | 4.0 | 2000.0 | 2.0 | 8.75 | 0.503 | 8.19 | 0.565 |
9 | 1000.0 | 8000 | 4.0 | 1000.0 | 2.0 | 8.76 | 0.508 | 8.21 | 0.568 |
10 | 1000.0 | 2000 | 4.0 | 2000.0 | 2.0 | 8.81 | 0.495 | 8.17 | 0.565 |
Visual of the newly added edges in Amsterdam
Literature Support
- Homophily: Similar nodes should be connected. (McPherson et al., 2001)
- Over-squashing relief: Alleviate information bottlenecks. (Alon & Yahav, 2021)
- Sparse, semantic augmentation improves performance and avoids over-smoothing. (Topping et al., 2022; Giraldo et al., 2022)
- Practical GNN augmentation in real cities. (Do et al., 2020; Geng et al., 2019; Ma et al., 2024)
Output Files
grid_search_results.csv
: all metricsfinal_params.json
: best parametersroad_segments_with_predictions.geojson
: GCN & GAT predictionspalmes_external_validation.geojson
: external NO₂ validation