Flight_Data_Treatment - Flight-Path-Analysis/FlightPathAnalysis GitHub Wiki
Repeated Data Points
As mentioned in the Caveats of Basic Open Sky Data Query, OpenSky fills in the previous value whenever data is not available for a given second.
One treatment applied to the data is to go over the values of `lat` and `lon` and delete any data point where they repeat, which yields more accurate interpolations later on.
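A minimal sketch of this treatment is shown here. It assumes each data point is a dict with `lat` and `lon` keys and that a "repeat" means a point whose position equals that of the point immediately before it; the function name `drop_repeated_positions` and the sample track are hypothetical, not taken from the repository.

```python
def drop_repeated_positions(points):
    """Return the points with consecutive duplicate (lat, lon) pairs removed.

    Assumption: a repeat is a point whose lat AND lon both equal those of
    the immediately preceding point (i.e. a forward-filled value).
    """
    cleaned = []
    previous = None
    for point in points:
        position = (point["lat"], point["lon"])
        if position != previous:
            cleaned.append(point)
        previous = position
    return cleaned


# Illustrative data only: the second point repeats the first position.
track = [
    {"time": 0, "lat": 40.10, "lon": -73.50},
    {"time": 1, "lat": 40.10, "lon": -73.50},  # forward-filled repeat
    {"time": 2, "lat": 40.12, "lon": -73.48},
]
cleaned_track = drop_repeated_positions(track)
```

Only the first point of each run of identical positions is kept, so later interpolation works from timestamps at which the position was actually updated.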
Starts Late / Ends Early
When attempting to categorize different flight paths, t-SNE consistently revealed two clusters separate from the rest, as can be seen below.
Looking at the flight paths corresponding to those clusters revealed that they were flights that, more often than not, "started late" or "ended early", as can be seen below in the plots for Cluster 2 and Cluster 3.
To address this issue, the `coordinate-distance-thresh` attribute was added to `config.yml` to set a threshold on how far a flight's recorded coordinates may be from the airport's coordinates. The issue is also evident in the distributions of starting and ending longitudes and latitudes for the clusters.
A good value for `coordinate-distance-thresh` can be inferred from the table below, which shows the standard deviation of the starting and ending longitude and latitude for each cluster.
A threshold of ~$3\sigma$ of the valid clusters (all except 2 and 3) should suffice for most cases, so a default value of `coordinate-distance-thresh = 0.12` should work well. That is, if a flight's latitude or longitude differs from the airport's nominal coordinates by more than that value, the flight is treated as invalid.
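The check can be sketched as follows, reading the threshold as a maximum allowed offset from the airport in each coordinate. This is an illustration under stated assumptions, not the repository's implementation: the function names, the `(lat, lon)` tuple layout, the use of degrees as the unit, and the sample coordinates are all hypothetical, and longitude wrap-around at ±180° is ignored.

```python
# Default suggested in the text; the unit (degrees) is an assumption.
COORDINATE_DISTANCE_THRESH = 0.12


def within_threshold(point, airport, thresh=COORDINATE_DISTANCE_THRESH):
    """True if both lat and lon of `point` are within `thresh` of `airport`.

    Both arguments are (lat, lon) tuples. Longitude wrap-around is ignored.
    """
    return (abs(point[0] - airport[0]) <= thresh
            and abs(point[1] - airport[1]) <= thresh)


def starts_and_ends_at_airports(track, origin, destination,
                                thresh=COORDINATE_DISTANCE_THRESH):
    """True if the first point is near `origin` and the last near `destination`."""
    if not track:
        return False
    return (within_threshold(track[0], origin, thresh)
            and within_threshold(track[-1], destination, thresh))


# Illustrative coordinates only, not real airports.
origin, destination = (10.0, 20.0), (12.0, 25.0)
complete_track = [(10.05, 20.02), (11.0, 22.5), (11.95, 25.1)]
late_start_track = [(10.5, 21.0), (11.0, 22.5), (11.95, 25.1)]

complete_ok = starts_and_ends_at_airports(complete_track, origin, destination)
late_ok = starts_and_ends_at_airports(late_start_track, origin, destination)
```

Under these assumptions the first track passes, while the second is rejected because its first recorded point is 0.5 away from the origin in latitude, which is how a "started late" flight would show up.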