# Outlier Detection and Handling
## Motivation
Mobile sensors occasionally record extremely high NO₂ spikes, e.g. when passing behind a diesel truck or entering a tunnel. Such spikes reflect transient local events rather than stable spatial patterns and can bias training. We therefore aim to detect and filter these local outliers.
## Method
- Compute the neighborhood mean: for each node, take the mean NO₂ value of its 1-hop neighbors in the graph.
- Compute the residual: `residual = actual - neighbor_mean`.
- Compute the MAD (median absolute deviation) of the residuals (Hampel, 1974).
- Flag a node as an outlier if

  $$ |r_i - \text{median}(r)| > k \times \text{MAD}(r) $$

  where $r_i$ is the node's residual and $k$ is group-dependent (e.g., highways = 9, local roads = 5). A small worked example follows this list.
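To make the flagging rule concrete, here is a minimal sketch with made-up residual values (illustrative numbers only, not project data):

```python
import numpy as np

# Illustrative residuals (actual - neighbor_mean); values are made up
residuals = np.array([-2.0, -1.0, 0.5, 1.0, 1.5, 25.0])

med = np.median(residuals)                # 0.75
mad = np.median(np.abs(residuals - med))  # robust scale estimate (1.25)
k = 5.0                                   # threshold for local roads

is_outlier = np.abs(residuals - med) > k * mad
print(is_outlier)  # only the 25.0 spike is flagged
```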
## Implementation
We detect outliers and exclude them only from the training and test labels, not from the graph itself, which preserves connectivity:

```python
train_mask[outliers] = False
test_mask[outliers] = False
```
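For readers wiring this into a PyTorch-style pipeline, a minimal sketch of the masking step, assuming node IDs are consecutive integers that double as positions in boolean mask tensors (the tensor setup here is illustrative, not the project's actual training code):

```python
import torch

# Assumption: node IDs are consecutive integers 0..N-1 that index the masks
num_nodes = len(gdf)
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask = torch.zeros(num_nodes, dtype=torch.bool)

# ... masks filled by the usual train/test split ...

# Drop flagged nodes from both label sets; graph edges stay untouched
outlier_idx = torch.tensor(outliers, dtype=torch.long)
train_mask[outlier_idx] = False
test_mask[outlier_idx] = False
```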
## Outlier Detection Code
```python
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd


def detect_outliers(
    G, gdf, target_col='NO2d',
    hop=1,
    default_thresh=3.0,
    group_col=None,
    group_thresholds=None,
    plot=True
):
    """Return node IDs whose value deviates from the neighborhood mean
    by more than a (possibly group-specific) multiple of the MAD."""
    nodes = list(G.nodes())
    vals, neigh_means = [], []
    for n in nodes:
        v = gdf.at[n, target_col]
        if pd.isna(v):
            vals.append(np.nan)
            neigh_means.append(np.nan)
            continue
        # All nodes within `hop` steps, excluding the node itself
        sp = nx.single_source_shortest_path_length(G, n, cutoff=hop)
        neigh = [m for m in sp if m != n]
        nbr_vals = gdf.loc[neigh, target_col].dropna().values
        mean_nb = np.nan if nbr_vals.size == 0 else nbr_vals.mean()
        vals.append(v)
        neigh_means.append(mean_nb)

    vals = np.array(vals)
    neigh_means = np.array(neigh_means)
    valid = ~np.isnan(vals) & ~np.isnan(neigh_means)

    # Residual: observed value minus the mean of its neighbors
    residuals = vals[valid] - neigh_means[valid]
    valid_nodes = np.array(nodes)[valid]

    # Robust centre and scale of the residual distribution
    med = np.median(residuals)
    mad = np.median(np.abs(residuals - med))

    # Per-node cutoff: group-specific multiplier where available
    cutoffs = []
    for n in valid_nodes:
        if group_col and group_thresholds:
            grp = gdf.at[n, group_col]
            mult = group_thresholds.get(grp, default_thresh)
        else:
            mult = default_thresh
        cutoffs.append(mult * mad)
    cutoffs = np.array(cutoffs)

    is_outlier = np.abs(residuals - med) > cutoffs
    outliers = valid_nodes[is_outlier].tolist()

    if plot:
        ax = gdf.plot(color='lightgray', linewidth=0.3, figsize=(10, 10))
        gdf.loc[outliers].plot(ax=ax, color='red', markersize=6, label='Outliers')
        plt.title("NO₂ Outliers")
        plt.legend()
        plt.axis('equal')
        plt.show()

    print(f"⚠️ Found {len(outliers)} outliers (they won't be used in train/test).")
    return outliers
```
### Example usage
```python
# Treat heavily trafficked segments as "highway" for threshold purposes
gdf['is_highway'] = gdf['TRAFMAJOR'] > 20000
group_thresh = {True: 9.0, False: 5.0}

outliers = detect_outliers(
    G, gdf,
    target_col='NO2d',
    hop=1,
    default_thresh=3.0,
    group_col='is_highway',
    group_thresholds=group_thresh,
    plot=True,
)
```
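As an optional sanity check (a sketch building on the snippet above), one can look at how the flagged nodes split across the two road groups:

```python
# How do the flagged nodes split across road groups?
flagged = gdf.loc[outliers]
print(flagged['is_highway'].value_counts())
print(f"Flagged {len(outliers) / len(gdf):.1%} of all segments")
```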
*Figure: the selected nodes marked as outliers.*
## Testing Results
We evaluated performance with the same data, models, and architecture, once with outliers kept and once with them masked.
| Model | Outliers | Validation Type | Pearson | RMSE | MAE | R² |
|---|---|---|---|---|---|---|
| GCN | Yes | Internal (80/20) | 0.72 | 8.67 | 6.20 | — |
| GCN | Yes | External (83 pts) | 0.74 | 5.72 | 4.33 | 0.18 |
| GCN | No | Internal (80/20) | 0.75 | 7.73 | 5.65 | — |
| GCN | No | External (83 pts) | 0.74 | 5.21 | 4.16 | 0.32 |
| GAT | Yes | Internal (80/20) | 0.74 | 8.35 | 5.94 | — |
| GAT | Yes | External (83 pts) | 0.74 | 5.72 | 4.33 | 0.18 |
| GAT | No | Internal (80/20) | 0.78 | 7.34 | 5.33 | — |
| GAT | No | External (83 pts) | 0.74 | 5.21 | 4.16 | 0.32 |
## Visual Comparison (GAT)

*With outliers (figure)*

*Without outliers (figure)*
## Conclusion
Removing outliers improves both internal and external validation performance. The gains are most notable in RMSE and MAE, suggesting that outliers distort the loss and prediction behavior. The approach is non-destructive: it affects only the training/test splits, not the graph topology.
## Next Steps
We will:

- Test other values for `k` in outlier filtering (see the sketch below)
- Compare with other smoothing-based and neighborhood MAD approaches
- Investigate whether prediction variance decreases with outlier removal
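For the first item, a minimal sketch of a sensitivity sweep over `k`, here applied uniformly via `default_thresh` with group thresholds disabled:

```python
# Sweep k and record how many nodes each threshold flags
for k in [3.0, 5.0, 7.0, 9.0]:
    flagged = detect_outliers(
        G, gdf,
        target_col='NO2d',
        hop=1,
        default_thresh=k,
        plot=False,
    )
    print(f"k = {k}: {len(flagged)} outliers flagged")
```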
## Literature
- Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383–393.
- Hubert, M., Rousseeuw, P. J., & Van Aelst, S. (2008). High-breakdown robust multivariate methods. Statistical Science, 23(1), 92–119.