Basic_OpenSky_Data_Query - Flight-Path-Analysis/FlightPathAnalysis GitHub Wiki
A Quick Guide To OpenSky's Impala Shell
Opensky Network can be accessed through ssh -p 2230 -l andrerg01 data.opensky-network.org (change andrerg01 to whatever username is proper). One needs credentials, it can be acquired by contacting the administrators of the website.
Once ssh-ing there, you are thrown in the Impala shell. Look at A Quick Guide To OpenSky's Impala Shell for more details
The tables ending with _data4 are the most up-to-date and accurate ones, the others are deprecated.
In order to query flight data, we'll need the table flights_data4. Below is a description of the table.
+----------------------------------+----------------------+-----------------------------+
| name | type | comment |
+----------------------------------+----------------------+-----------------------------+
| icao24 | string | Inferred from Parquet file. |
| firstseen | int | Inferred from Parquet file. |
| estdepartureairport | string | Inferred from Parquet file. |
| lastseen | int | Inferred from Parquet file. |
| estarrivalairport | string | Inferred from Parquet file. |
| callsign | string | Inferred from Parquet file. |
| track | array<struct< | Inferred from Parquet file. |
| | time:int, | |
| | latitude:double, | |
| | longitude:double, | |
| | altitude:double, | |
| | heading:float, | |
| | onground:boolean | |
| | >> | |
| serials | array<int> | Inferred from Parquet file. |
| estdepartureairporthorizdistance | int | Inferred from Parquet file. |
| estdepartureairportvertdistance | int | Inferred from Parquet file. |
| estarrivalairporthorizdistance | int | Inferred from Parquet file. |
| estarrivalairportvertdistance | int | Inferred from Parquet file. |
| departureairportcandidatescount | int | Inferred from Parquet file. |
| arrivalairportcandidatescount | int | Inferred from Parquet file. |
| otherdepartureairportcandidates | array<struct< | Inferred from Parquet file. |
| | icao:string, | |
| | horizdistance:int, | |
| | vertdistance:int | |
| | >> | |
| otherarrivalairportcandidates | array<struct< | Inferred from Parquet file. |
| | icao:string, | |
| | horizdistance:int, | |
| | vertdistance:int | |
| | >> | |
| day | int | |
+----------------------------------+----------------------+-----------------------------+
Note that the track entry has issues and I believe is deprecated.
The departure and arrival airports are estimated by the OpenSky network as the closest airport to the aircraft once it stopped broadcasting.
An example of a query of interest would be:
SELECT callsign, icao24, firstseen, lastseen, estdepartureairport, estarrivalairport
FROM flights_data4
WHERE estdepartureairport LIKE '%JFK'
AND (
(day > 1691712000 AND day < 1692835200)
OR
(day > 1692921600 AND day < 1694062800)
)
LIMIT 10;
The example output of this query should be
+----------+--------+------------+------------+---------------------+-------------------+
| callsign | icao24 | firstseen | lastseen | estdepartureairport | estarrivalairport |
+----------+--------+------------+------------+---------------------+-------------------+
| AAL2622 | aa6ccb | 1693257803 | 1693267095 | KJFK | KMIA |
| DAL156 | a14872 | 1693243542 | 1693246060 | KJFK | NULL |
| RAM203J | 0200cb | 1693238426 | 1693259844 | KJFK | NULL |
| JBU1677 | aafca8 | 1693254873 | 1693261148 | KJFK | NULL |
| JBU1759 | ac89f7 | 1693255542 | 1693264886 | KJFK | NULL |
| DAL253 | a8e279 | 1693247187 | 1693266473 | KJFK | SKBO |
| CPA3295 | 780233 | 1693241557 | 1693267016 | KJFK | PANC |
| JBU479 | a721a4 | 1693251523 | 1693261192 | KJFK | NULL |
| ETH513 | 040087 | 1693190608 | 1693193388 | KJFK | NULL |
| JBU1965 | a19d0d | 1693183093 | 1693199632 | KJFK | NULL |
+----------+--------+------------+------------+---------------------+-------------------+
There are some dates that will return an error like the one below:
WARNINGS: Disk I/O error: Error reading from HDFS file: hdfs://nameservice1/user/opensky/tables_v4/flights/day=1692835200/part-r-00169-45edfbff-ab00-4bd8-91dd-bf5de17cc22e.snappy.parquet
Error(255): Unknown error 255
Root cause: BlockMissingException: Could not obtain block: BP-2086186090-192.168.6.170-1416410368441:blk_1719909554_888542574 file=/user/opensky/tables_v4/flights/day=1692835200/part-r-00169-45edfbff-ab00-4bd8-91dd-bf5de17cc22e.snappy.parquet
This means that the file containing day=1692835200 is somehow corrupted or not accessible. A work around is to skip that day by adding the condition day != 1692835200 (or whatever UNIX time is giving you trouble).
That's done automatically by the querying functions.
In order to query the state vectors of a flight, we'll need the table state_vectors_data4. Below is a description of the table.
+---------------+------------+-----------------------------+
| name | type | comment |
+---------------+------------+-----------------------------+
| time | int | Inferred from Parquet file. |
| icao24 | string | Inferred from Parquet file. |
| lat | double | Inferred from Parquet file. |
| lon | double | Inferred from Parquet file. |
| velocity | double | Inferred from Parquet file. |
| heading | double | Inferred from Parquet file. |
| vertrate | double | Inferred from Parquet file. |
| callsign | string | Inferred from Parquet file. |
| onground | boolean | Inferred from Parquet file. |
| alert | boolean | Inferred from Parquet file. |
| spi | boolean | Inferred from Parquet file. |
| squawk | string | Inferred from Parquet file. |
| baroaltitude | double | Inferred from Parquet file. |
| geoaltitude | double | Inferred from Parquet file. |
| lastposupdate | double | Inferred from Parquet file. |
| lastcontact | double | Inferred from Parquet file. |
| serials | array<int> | Inferred from Parquet file. |
| hour | int | |
+---------------+------------+-----------------------------+
An example of a query of interest would be:
SELECT time, lat, lon, velocity, heading, baroaltitude, geoaltitude, onground, hour
FROM state_vectors_data4
WHERE icao24 = 'aa25a6'
AND (time > 1688174225 AND time < 1688178021)
AND (hour > 1688170625 AND hour < 1688181621)
The hour limits is simply 3600 seconds above and below the time. It's important to define the limits in hour, as that's the primary quantity the files are ordered by, and the querying will be looking at useless files otherwise.
The example output of this query should be
+------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+----------+------------+
| time | lat | lon | velocity | heading | baroaltitude | geoaltitude | onground | hour |
+------------+-------------------+--------------------+-------------------+-------------------+-------------------+-------------------+----------+------------+
| 1688174226 | 30.52625753111758 | -91.14334716796873 | 82.22418174666292 | 133.9860212772147 | 76.2 | NULL | false | 1688173200 |
| 1688174227 | 30.52625753111758 | -91.14334716796873 | 82.5951016782392 | 133.7381824611139 | 76.2 | 68.58 | false | 1688173200 |
| 1688174228 | 30.52528381347656 | -91.14214728860293 | 82.95160250131399 | 133.9949139947458 | 83.82000000000001 | 68.58 | false | 1688173200 |
| 1688174229 | 30.52528381347656 | -91.14214728860293 | 82.95160250131399 | 133.9949139947458 | 83.82000000000001 | 76.2 | false | 1688173200 |
| 1688174230 | 30.52423095703125 | -91.14090863396143 | 82.95160250131399 | 133.9949139947458 | 106.68 | 76.2 | false | 1688173200 |
| 1688174231 | 30.52423095703125 | -91.14090863396143 | 82.95160250131399 | 133.9949139947458 | 106.68 | 76.2 | false | 1688173200 |
...
It seems that OpenSky always offers data in a second-by-second basis, and if data was not available for a given second if fills that point with the values of the previous second, resulting in unreliable data.
A simple measure to take care of this is to ignore any times where the latitude and longitude (lat and lon) are unchanged, resulting in the "true" measured data which can be properly interpolated, as shown below

There are some dates that will return an error like the one below:
ERROR: Disk I/O error: Failed to open HDFS file hdfs://nameservice1/user/opensky/tables_v4/state_vectors/hour=1687968000/part-r-00169-a43ca51b-0b0c-40eb-831b-f087ade2ea7c.snappy.parquet
Error(2): No such file or directory
Root cause: RemoteException: File does not exist: /user/opensky/tables_v4/state_vectors/hour=1687968000/part-r-00169-a43ca51b-0b0c-40eb-831b-f087ade2ea7c.snappy.parquet
...
This means that the file containing hour = 1687968000 is somehow corrupted or not accessible. A work around is to skip that hour by adding the condition hour != 1687968000 (or whatever UNIX time is giving you trouble).
That's done automatically by the querying functions.
This could have also happened if no hour range was specified in the query, they are necessary in order to avoid this happening.