# Outlier Detection and Handling
## Motivation
Mobile sensors occasionally record extremely high NO₂ spikes, e.g. when passing behind a diesel truck or entering a tunnel. Such spikes reflect transient local events rather than stable spatial patterns and can bias training. We therefore aim to detect and filter these local outliers.
## Method
- Compute the neighborhood mean: for each node, take the mean NO₂ value of its 1-hop neighbors in the graph.
- Compute the residual: `residual = actual - neighbor_mean`.
- Compute the MAD (median absolute deviation) of the residuals (Hampel, 1974).
- Flag a node as an outlier if

  $$ |r_i - \text{median}(r)| > k \times \text{MAD}(r) $$

  where $r_i$ is the node's residual and $k$ is group-dependent (e.g., highways = 9, local roads = 5). A small worked example follows this list.
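To make the flagging rule concrete, here is a minimal sketch with made-up residual values (illustrative numbers only, not project data):

```python
import numpy as np

# Illustrative residuals (actual - neighbor_mean); values are made up
residuals = np.array([-2.0, -1.0, 0.5, 1.0, 1.5, 25.0])

med = np.median(residuals)                # 0.75
mad = np.median(np.abs(residuals - med))  # robust scale estimate (1.25)
k = 5.0                                   # threshold for local roads

is_outlier = np.abs(residuals - med) > k * mad
print(is_outlier)  # only the 25.0 spike is flagged
```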
## Implementation
We detect outliers and exclude them only from the training and test labels, not from the graph itself, which preserves connectivity:

```python
train_mask[outliers] = False
test_mask[outliers] = False
```
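For readers wiring this into a PyTorch-style pipeline, a minimal sketch of the masking step, assuming node IDs are consecutive integers that double as positions in boolean mask tensors (the tensor setup here is illustrative, not the project's actual training code):

```python
import torch

# Assumption: node IDs are consecutive integers 0..N-1 that index the masks
num_nodes = len(gdf)
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)

# ... masks filled by the usual train/test split ...

# Drop flagged nodes from both label sets; graph edges stay untouched
outlier_idx = torch.tensor(outliers, dtype=torch.long)
train_mask[outlier_idx] = False
test_mask[outlier_idx] = False
```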
## Outlier Detection Code
```python
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd


def detect_outliers(
    G, gdf, target_col='NO2d',
    hop=1,
    default_thresh=3.0,
    group_col=None,
    group_thresholds=None,
    plot=True
):
    """Return node IDs whose value deviates from the neighborhood mean
    by more than a (possibly group-specific) multiple of the MAD."""
    nodes = list(G.nodes())
    vals, neigh_means = [], []
    for n in nodes:
        v = gdf.at[n, target_col]
        if pd.isna(v):
            vals.append(np.nan)
            neigh_means.append(np.nan)
            continue
        # All nodes within `hop` steps, excluding the node itself
        sp = nx.single_source_shortest_path_length(G, n, cutoff=hop)
        neigh = [m for m in sp if m != n]
        nbr_vals = gdf.loc[neigh, target_col].dropna().values
        mean_nb = np.nan if nbr_vals.size == 0 else nbr_vals.mean()
        vals.append(v)
        neigh_means.append(mean_nb)

    vals = np.array(vals)
    neigh_means = np.array(neigh_means)
    valid = ~np.isnan(vals) & ~np.isnan(neigh_means)

    # Residual: observed value minus the mean of its neighbors
    residuals = vals[valid] - neigh_means[valid]
    valid_nodes = np.array(nodes)[valid]

    # Robust centre and scale of the residual distribution
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))

    # Per-node cutoff: group-specific multiplier where available
    cutoffs = []
    for n in valid_nodes:
        if group_col and group_thresholds:
            grp = gdf.at[n, group_col]
            mult = group_thresholds.get(grp, default_thresh)
        else:
            mult = default_thresh
        cutoffs.append(mult * mad)
    cutoffs = np.array(cutoffs)

    is_outlier = np.abs(residuals - med) > cutoffs
    outliers = valid_nodes[is_outlier].tolist()

    if plot:
        ax = gdf.plot(color='lightgray', linewidth=0.3, figsize=(10, 10))
        gdf.loc[outliers].plot(ax=ax, color='red', markersize=6, label='Outliers')
        plt.title("NO₂ Outliers")
        plt.legend()
        plt.axis('equal')
        plt.show()

    print(f"⚠️ Found {len(outliers)} outliers (they won't be used in train/test).")
    return outliers
```
### Example usage
```python
# Treat heavily trafficked segments as "highway" for threshold purposes
gdf['is_highway'] = gdf['TRAFMAJOR'] > 20000
group_thresh = {True: 9.0, False: 5.0}

outliers = detect_outliers(
    G, gdf,
    target_col='NO2d',
    hop=1,
    default_thresh=3.0,
    group_col='is_highway',
    group_thresholds=group_thresh,
    plot=True,
)
```
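As an optional sanity check (a sketch building on the snippet above), one can look at how the flagged nodes split across the two road groups:

```python
# How do the flagged nodes split across road groups?
flagged = gdf.loc[outliers]
print(flagged['is_highway'].value_counts())
print(f"Flagged {len(outliers) / len(gdf):.1%} of all segments")
```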
*Figure: the selected nodes marked as outliers.*
## Testing Results
We evaluated performance with the same data, models, and architecture, once with outliers kept and once with them masked.
| Model | Outliers | Validation Type | Pearson | RMSE | MAE | R² |
|---|---|---|---|---|---|---|
| GCN | Yes | Internal (80/20) | 0.72 | 8.67 | 6.20 | — |
| GCN | Yes | External (83 pts) | 0.74 | 5.72 | 4.33 | 0.18 |
| GCN | No | Internal (80/20) | 0.75 | 7.73 | 5.65 | — |
| GCN | No | External (83 pts) | 0.74 | 5.21 | 4.16 | 0.32 |
| GAT | Yes | Internal (80/20) | 0.74 | 8.35 | 5.94 | — |
| GAT | Yes | External (83 pts) | 0.74 | 5.72 | 4.33 | 0.18 |
| GAT | No | Internal (80/20) | 0.78 | 7.34 | 5.33 | — |
| GAT | No | External (83 pts) | 0.74 | 5.21 | 4.16 | 0.32 |
## Visual Comparison (GAT)

*With outliers (figure)*

*Without outliers (figure)*
## Conclusion
Removing outliers improves both internal and external validation performance. The gains are most notable in RMSE and MAE, suggesting that outliers distort the loss and prediction behavior. The approach is non-destructive: it affects only the training/test splits, not the graph topology.
## Next Steps
We will:

- Test other values for `k` in outlier filtering (see the sketch below)
- Compare with other smoothing-based and neighborhood MAD approaches
- Investigate whether prediction variance decreases with outlier removal
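For the first item, a minimal sketch of a sensitivity sweep over `k`, here applied uniformly via `default_thresh` with group thresholds disabled:

```python
# Sweep k and record how many nodes each threshold flags
for k in [3.0, 5.0, 7.0, 9.0]:
    flagged = detect_outliers(
        G, gdf,
        target_col='NO2d',
        hop=1,
        default_thresh=k,
        plot=False,
    )
    print(f"k = {k}: {len(flagged)} outliers flagged")
```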
## Literature
- Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383–393.
- Hubert, M., Rousseeuw, P. J., & Van Aelst, S. (2008). High-breakdown robust multivariate methods. Statistical Science, 23(1), 92–119.