Potentially handling outliers - davidlabee/Graph4Air GitHub Wiki

Outlier Detection and Handling

Motivation

Mobile sensors occasionally record extremely high NO₂ spikes, e.g. when passing behind a diesel truck or entering a tunnel. These spikes reflect transient local events rather than the underlying spatial pattern and can bias training. We therefore detect and filter such local outliers.

Method

  1. Compute the neighborhood mean: for each node $i$, take the mean NO₂ value of its 1-hop neighbors in the graph.

  2. Compute the residual: $r_i = x_i - \bar{x}_{\mathcal{N}(i)}$, where $x_i$ is the node's NO₂ value and $\bar{x}_{\mathcal{N}(i)}$ is its neighborhood mean.

  3. Compute the MAD (Median Absolute Deviation) of the residuals.

  4. Flag node $i$ as an outlier if:

    $$ |r_i - \text{median}(r)| > k \times \text{MAD}(r) $$

    where k is group-dependent (e.g., highways: k = 9, local roads: k = 5).
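The flagging rule in steps 2–4 can be sketched in a few lines. The residual values below are synthetic, stand-ins for "actual minus 1-hop neighbor mean":

```python
import numpy as np

# Synthetic residuals (actual NO2 minus 1-hop neighbor mean);
# one node has a large positive spike.
residuals = np.array([0.5, -1.2, 0.3, 14.0, -0.8, 0.1])

# Robust center and spread of the residual distribution
med = np.median(residuals)
mad = np.median(np.abs(residuals - med))

# k is group-dependent in practice (e.g. 9 for highways, 5 for local roads)
k = 5.0
is_outlier = np.abs(residuals - med) > k * mad

print(is_outlier)  # only the 14.0 spike is flagged
```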

Implementation

We detect outliers and remove them only from training and testing labels, not from the graph. This preserves connectivity.

```python
train_mask[outliers] = False
test_mask[outliers]  = False
```
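A minimal sketch of this masking step, assuming the masks are boolean NumPy arrays aligned with the graph's node ordering and that node labels must first be mapped to array positions (all names and values below are illustrative):

```python
import numpy as np

# Illustrative node ordering, as returned by G.nodes()
nodes = ['a', 'b', 'c', 'd', 'e']
node_idx = {n: i for i, n in enumerate(nodes)}

# Boolean masks aligned with the node ordering
train_mask = np.array([True, True, False, True, False])
test_mask  = np.array([False, False, True, False, True])

# Node labels flagged by detect_outliers(...), mapped to positions
outliers = ['b', 'e']
out_pos = [node_idx[n] for n in outliers]

# Drop outliers from both splits; the graph itself is untouched
train_mask[out_pos] = False
test_mask[out_pos]  = False

print(train_mask, test_mask)
```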

Outlier Detection Code

```python
import networkx as nx
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def detect_outliers(
    G, gdf, target_col='NO2d',
    hop=1,
    default_thresh=3.0,
    group_col=None,
    group_thresholds=None,
    plot=True
):
    """Flag nodes whose residual vs. the local neighborhood mean
    exceeds a (group-dependent) multiple of the MAD."""
    nodes = list(G.nodes())
    vals, neigh_means = [], []
    for n in nodes:
        v = gdf.at[n, target_col]
        if pd.isna(v):
            vals.append(np.nan)
            neigh_means.append(np.nan)
            continue
        # All nodes within `hop` steps, excluding the node itself
        sp = nx.single_source_shortest_path_length(G, n, cutoff=hop)
        neigh = [m for m in sp if m != n]
        nbr_vals = gdf.loc[neigh, target_col].dropna().values
        mean_nb = np.nan if nbr_vals.size == 0 else nbr_vals.mean()
        vals.append(v)
        neigh_means.append(mean_nb)

    vals = np.array(vals)
    neigh_means = np.array(neigh_means)
    valid = ~np.isnan(vals) & ~np.isnan(neigh_means)
    residuals = vals[valid] - neigh_means[valid]
    valid_nodes = np.array(nodes)[valid]

    # Robust center and spread of the residual distribution
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))

    # Per-node cutoff: group-specific multiplier if provided, else default
    cutoffs = []
    for n in valid_nodes:
        if group_col and group_thresholds:
            grp = gdf.at[n, group_col]
            mult = group_thresholds.get(grp, default_thresh)
        else:
            mult = default_thresh
        cutoffs.append(mult * mad)
    cutoffs = np.array(cutoffs)

    is_outlier = np.abs(residuals - med) > cutoffs
    outliers = valid_nodes[is_outlier].tolist()

    if plot:
        ax = gdf.plot(color='lightgray', linewidth=0.3, figsize=(10, 10))
        gdf.loc[outliers].plot(ax=ax, color='red', markersize=6, label='Outliers')
        plt.title("NO₂ Outliers")
        plt.legend()
        plt.axis('equal')
        plt.show()

    print(f"⚠️ Found {len(outliers)} outliers (they won't be used in train/test).")
    return outliers
```

Example usage

```python
gdf['is_highway'] = gdf['TRAFMAJOR'] > 20000
group_thresh = {True: 9.0, False: 5.0}

outliers = detect_outliers(
    G, gdf,
    target_col='NO2d',
    hop=1,
    default_thresh=3.0,
    group_col='is_highway',
    group_thresholds=group_thresh,
    plot=True
)
```

Visual of the selected nodes marked as outliers

Testing Results

We evaluated performance using the same data, models, and architecture, training once with outliers kept and once with them masked from the train/test labels.

| Model | Outliers | Validation Type   | Pearson | RMSE | MAE  |      |
|-------|----------|-------------------|---------|------|------|------|
| GCN   | Yes      | Internal (80/20)  | 0.72    | 8.67 | 6.20 |      |
| GCN   | Yes      | External (83 pts) | 0.74    | 5.72 | 4.33 | 0.18 |
| GCN   | No       | Internal (80/20)  | 0.75    | 7.73 | 5.65 |      |
| GCN   | No       | External (83 pts) | 0.74    | 5.21 | 4.16 | 0.32 |
| GAT   | Yes      | Internal (80/20)  | 0.74    | 8.35 | 5.94 |      |
| GAT   | Yes      | External (83 pts) | 0.74    | 5.72 | 4.33 | 0.18 |
| GAT   | No       | Internal (80/20)  | 0.78    | 7.34 | 5.33 |      |
| GAT   | No       | External (83 pts) | 0.74    | 5.21 | 4.16 | 0.32 |

Visual Comparison (GAT)

With Outliers

GAT with outliers

Without Outliers

GAT without outliers

Conclusion

Removing outliers improves both internal and external validation performance. The gains are especially notable in RMSE and MAE, suggesting that outliers distort the loss and prediction behavior. The approach is non-destructive: it affects only the training/test splits, not the graph topology.

Next Steps

We will:

  • Test other values for k in outlier filtering
  • Compare with other smoothing-based and neighborhood MAD approaches
  • Investigate if prediction variance decreases with outlier removal
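For the first item, a small sweep over candidate k values shows how the number of flagged nodes shrinks as the threshold loosens. The residuals below are synthetic (a normal bulk plus three injected spikes), so only the mechanics carry over:

```python
import numpy as np

# Synthetic residuals: 500 well-behaved values plus three injected spikes
rng = np.random.default_rng(0)
residuals = np.concatenate([rng.normal(0, 1, 500), [12.0, -15.0, 20.0]])

# Same median/MAD rule as in detect_outliers
med = np.median(residuals)
mad = np.median(np.abs(residuals - med))

# Count how many nodes each candidate k would flag
for k in (3.0, 5.0, 7.0, 9.0):
    n_flagged = int((np.abs(residuals - med) > k * mad).sum())
    print(f"k={k}: {n_flagged} outliers")
```

Larger k keeps only the most extreme spikes; the three injected values remain flagged even at k = 9.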

Literature

  • Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383–393.
  • Hubert, M., Rousseeuw, P. J., & Van Aelst, S. (2008). High-breakdown robust multivariate methods. Statistical Science, 23(1), 92–119.