Flight_Data_Treatment - Flight-Path-Analysis/FlightPathAnalysis GitHub Wiki

Repeated Data Points

As mentioned in the Caveats of Basic Open Sky Data Query, when a value is not available for a given second, OpenSky fills it in with the previous value.

[image]

One treatment applied to the data is therefore to go over the values of lat and lon and, whenever there is a repeat, delete that data point. This yields more accurate interpolations later on.
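The repeat-removal step can be sketched as follows. This is a minimal illustration and not the repository's actual code: the function name `drop_repeated_positions`, the list-of-dicts layout, and the `time` key are assumptions, and a "repeat" is taken to mean a point whose lat and lon both equal those of the point immediately before it.

```python
def drop_repeated_positions(points):
    """Return the points with consecutive repeats of (lat, lon) removed.

    The first occurrence of each position is kept; later points that carry
    the same lat and lon as their predecessor are treated as filled values.
    """
    cleaned = []
    last_position = None
    for point in points:
        position = (point["lat"], point["lon"])
        if position != last_position:
            cleaned.append(point)
            last_position = position
    return cleaned


# Hypothetical track: the second point repeats the first one's position.
track = [
    {"time": 0, "lat": 40.64, "lon": -73.78},
    {"time": 1, "lat": 40.64, "lon": -73.78},  # forward-filled repeat
    {"time": 2, "lat": 40.65, "lon": -73.80},
]
cleaned_track = drop_repeated_positions(track)
```

Only consecutive repeats are dropped here, so an aircraft that genuinely returns to an earlier position later in the flight keeps that point.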

Starts Late / Ends Early

When attempting to categorize different flight paths, t-SNE consistently revealed two clusters separate from the rest, as can be seen below. Looking at the flight paths corresponding to those clusters revealed that they were flights that, more often than not, "started late" or "ended early", as can be seen below in the plots for Cluster 2 and Cluster 3.

[image]

This is further evident when looking at the distributions of starting and ending longitudes and latitudes for the clusters.

[image]

To address this issue, the coordinate-distance-thresh attribute was added to config.yml in order to set a threshold on how far a flight may start from the starting airport. A good value for it can be inferred from the table below, which shows the standard deviation of the starting and ending longitude and latitude for the clusters. A threshold of ~$3\sigma$ of the valid clusters (all except 2 and 3) should suffice for most cases, so a default value of coordinate-distance-thresh = 0.12 should work well. That is, if a flight's latitude or longitude is farther than that value from the airport's nominal coordinates, the flight is not considered valid.
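A minimal sketch of how such a threshold check could look is given below. It assumes that a flight is kept only when its first and last points lie within the threshold of the airports' nominal coordinates, compared separately on latitude and longitude. The function names, the `(lat, lon)` tuple layout, the airport coordinates, and the application of the same threshold to the destination airport are all assumptions made for illustration, not the repository's implementation.

```python
def within_threshold(point, airport, thresh=0.12):
    """Check that a (lat, lon) point is within `thresh` degrees of an airport.

    Latitude and longitude are compared independently.
    """
    lat_ok = abs(point[0] - airport[0]) <= thresh
    lon_ok = abs(point[1] - airport[1]) <= thresh
    return lat_ok and lon_ok


def is_complete_flight(path, origin, destination, thresh=0.12):
    """Return True if the path starts near `origin` and ends near `destination`."""
    if not path:
        return False
    starts_on_time = within_threshold(path[0], origin, thresh)
    ends_on_time = within_threshold(path[-1], destination, thresh)
    return starts_on_time and ends_on_time


# Hypothetical nominal airport coordinates as (lat, lon).
origin = (40.64, -73.78)
destination = (33.94, -118.41)

complete = [(40.70, -73.80), (37.00, -95.00), (33.95, -118.40)]
started_late = [(39.90, -75.50), (37.00, -95.00), (33.95, -118.40)]

complete_ok = is_complete_flight(complete, origin, destination)
started_late_ok = is_complete_flight(started_late, origin, destination)
```

In this example the first path passes the check, while the second is rejected because its first recorded point is about 0.74 degrees of latitude away from the origin airport.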