David's Graph Design - davidlabee/Graph4Air GitHub Wiki

Graph Augmentation for Road Network GNNs

In dense urban settings like Amsterdam, road networks are primarily local: most nodes (road segments) are only connected to immediate neighbors. This leads to long-range information bottlenecks in Graph Neural Networks (GNNs). To address this, we developed a feature-similarity-based augmentation pipeline that adds semantic shortcut edges—i.e., connections between distant but similar locations. This reduces the number of hops required for relevant information to propagate through the graph.

Motivation

This approach is based on the homophily principle: nodes that are similar in features often share target values (e.g., NO₂ concentration). By connecting functionally analogous road segments (e.g., two residential roads with similar traffic), the GNN can learn shared patterns more effectively—leading to improved predictive performance.

Conceptual Pipeline

  1. Group nodes by urban function (e.g. industrial, natural, traffic-heavy).
  2. For each group:
    • Select top nodes by intensity (e.g. highest industrial area proportion).
    • Extract feature vectors for each node.
    • Run cosine-based kNN to find semantically similar nodes.
  3. Filter edges using the following:
    • Euclidean distance ≥ min_dist and ≤ max_dist
    • Graph-hop distance > hop_thresh
    • Not on same road (ROAD_FID)
    • Not already connected
    • Per-node and global caps
  4. Add to graph: All accepted edges are added to the original road topology as virtual edges labeled feature_sim.

Efficiency

  • kNN in feature space is computed using sklearn.neighbors.NearestNeighbors (cosine metric), which is efficient and scalable to large graphs (~50k nodes).
  • Graph shortest-paths are queried only locally (≤ hop_thresh).
  • Total complexity is sub-quadratic due to sparse candidate generation and pruning.

Results

Using grid search over augmentation parameters:

  • Best GCN: RMSE ≈ 8.48, R² ≈ 0.524
  • Best GAT: RMSE ≈ 8.08, R² ≈ 0.567
  • Best parameters: neighbors=500, sim_thresh=0.995, hop_thresh=4, max_dist=4000, per_node_cap=2

This confirms long-range, high-similarity connections improve performance.

Best Performing Results (Top 10 by GCN RMSE)

Rank Neighbors Max Dist (m) Hop Thresh Max Edges Cap/Node GCN RMSE GCN R² GAT RMSE GAT R²
1 500.0 4000 4.0 1000.0 2.0 8.48 0.524 8.08 0.567
2 500.0 2000 4.0 1000.0 2.0 8.61 0.509 8.23 0.552
3 500.0 2000 4.0 2000.0 2.0 8.65 0.491 8.09 0.554
4 1000.0 2000 4.0 1000.0 2.0 8.66 0.511 8.23 0.559
5 500.0 1000 4.0 1000.0 2.0 8.72 0.506 8.29 0.554
6 1000.0 4000 4.0 1000.0 2.0 8.74 0.490 8.08 0.564
7 500.0 8000 4.0 1000.0 2.0 8.74 0.517 8.31 0.564
8 500.0 1000 4.0 2000.0 2.0 8.75 0.503 8.19 0.565
9 1000.0 8000 4.0 1000.0 2.0 8.76 0.508 8.21 0.568
10 1000.0 2000 4.0 2000.0 2.0 8.81 0.495 8.17 0.565

Visual of the newly added edges in Amsterdam

Literature Support

  • Homophily: Similar nodes should be connected. (McPherson et al., 2001)
  • Over-squashing relief: Alleviate information bottlenecks. (Alon & Yahav, 2021)
  • Sparse, semantic augmentation improves performance and avoids over-smoothing. (Topping et al., 2022; Giraldo et al., 2022)
  • Practical GNN augmentation in real cities. (Do et al., 2020; Geng et al., 2019; Ma et al., 2024)

Output Files

  • grid_search_results.csv: all metrics
  • final_params.json: best parameters
  • road_segments_with_predictions.geojson: GCN & GAT predictions
  • palmes_external_validation.geojson: external NO₂ validation