geo - nextstrain/flora GitHub Wiki
Geographical data
Current situation
location is uploaded to the database, which then allows other fields such as country & region to be filled in (updated) via files such as this, this and this. While lat/longs are placed in the DB, augur doesn't use this information, instead looking at the fauna files.
Proposed situation
In each row of the isolates table, data should (must?) have country (admin0).
Optionally division (admin1) or subdivision (admin2) can be provided, but only if the higher admins are also provided - for example, providing "Vancouver" as subdivision without "WA" as the division isn't allowed.
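The hierarchy rule above can be sketched as a small row check. This is only an illustration: the function name admin_level_problems is hypothetical, rows are assumed to be plain dicts keyed by the field names used here, and country is treated as mandatory even though the text leaves "should (must?)" open.

```python
# Admin levels from highest to lowest, using the field names in the text.
ADMIN_LEVELS = ("country", "division", "subdivision")


def admin_level_problems(row):
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    # Assumption: country (admin0) is treated as mandatory.
    if not row.get("country"):
        problems.append("country (admin0) is required")
    # A lower level is only allowed when every higher level is present.
    for i, level in enumerate(ADMIN_LEVELS[1:], start=1):
        if row.get(level):
            missing = [h for h in ADMIN_LEVELS[:i] if not row.get(h)]
            if missing:
                problems.append(f"{level} given without {', '.join(missing)}")
    return problems


# "Vancouver" as subdivision without a division is rejected.
print(admin_level_problems({"country": "USA", "subdivision": "Vancouver"}))
```

Returning a list of problems rather than raising on the first one lets an uploader report every issue in a row at once.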
Separately, a GPS location for the actual sample location can be provided.
GPS data for the admin levels should not be stored in the isolates table.
Region can be easily mapped from country and shouldn't be specified.
Q: Should region be stored in the isolates table? This would allow different region definitions per pathogen.
The geo data essentially links admin levels to GPS co-ords or shape data.
The encoding of the key must include higher admin levels, e.g. USA-WA-VANCOUVER not VANCOUVER.
Q: Should all data be UPPERCASE? lowercase? CamelCase?
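One way to build such a key is sketched below. The function name geo_key, the "-" separator and the upper-casing are assumptions taken from the USA-WA-VANCOUVER example; the casing question is still open, so treat this as one possible convention rather than a decision.

```python
def geo_key(country, division=None, subdivision=None, sep="-"):
    """Join the admin levels, highest first, into a single lookup key."""
    parts = []
    gap = False
    for value in (country, division, subdivision):
        if not value:
            gap = True
            continue
        if gap:
            # A lower level was supplied after a missing higher one.
            raise ValueError("lower admin level given without the higher one")
        # Assumption: keys are upper-cased, as in the example in the text.
        parts.append(value.strip().upper())
    if not parts:
        raise ValueError("country is required")
    return sep.join(parts)


print(geo_key("USA", "WA", "Vancouver"))
```

Note that a place name containing the separator itself would make the key ambiguous, so a real encoding would need a rule for that case.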
There should also be a mapping between country and region, which could be dataset specific.
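A dataset-specific mapping could be layered over a shared default, as in the sketch below. All names (DEFAULT_REGIONS, DATASET_REGIONS, region_for), the dataset name and the region labels are made up for illustration and do not reflect any existing fauna or augur definitions.

```python
# Shared default mapping from country to region (illustrative values only).
DEFAULT_REGIONS = {
    "USA": "North America",
    "Canada": "North America",
    "Brazil": "South America",
}

# Hypothetical per-dataset overrides; only the differences are listed.
DATASET_REGIONS = {
    "example_dataset": {"Brazil": "Latin America"},
}


def region_for(country, dataset=None):
    """Look up a region, preferring the dataset's own definition if it has one."""
    overrides = DATASET_REGIONS.get(dataset, {})
    if country in overrides:
        return overrides[country]
    # Returns None when the country is unknown.
    return DEFAULT_REGIONS.get(country)


print(region_for("Brazil"))
print(region_for("Brazil", dataset="example_dataset"))
```

Keeping only the differences per dataset means the isolates table never has to carry a region column for this to work.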
Where should this data live?
Since this data is unchanging it stands to reason that it should only exist in one place (which is effectively what happens now - the tsv files in fauna).
I propose to store this data as a series of JSON files (effectively lookups) in a separate repo (nextstrain/geo).
This would allow augur to access this data (needed for analysis), and also allow sacra to check if provided admin levels are valid.
The expense would be the added complexity of having another repo.
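As a rough illustration of how both tools might consume such lookups, the sketch below parses a JSON document keyed by the hierarchical key and offers a validity check (the sacra use case) and a coordinate lookup (the augur use case). The JSON layout, the field names lat and long, and the function names are assumptions; the coordinates are rough approximations for illustration only, and a real setup would read the files from the repo rather than from an inline string.

```python
import json

# Stand-in for the contents of one lookup file (hypothetical layout).
GEO_JSON = """
{
  "USA": {"lat": 39.8, "long": -98.6},
  "USA-WA": {"lat": 47.4, "long": -120.5},
  "USA-WA-VANCOUVER": {"lat": 45.64, "long": -122.66}
}
"""


def load_lookup(text):
    """Parse the JSON text into a dict keyed by admin-level key."""
    return json.loads(text)


def is_known(lookup, key):
    """Validity check: does this admin-level key exist in the lookup?"""
    return key in lookup


def coords(lookup, key):
    """Return (lat, long) for a key, or None if the key is unknown."""
    entry = lookup.get(key)
    if entry is None:
        return None
    return (entry["lat"], entry["long"])


lookup = load_lookup(GEO_JSON)
print(is_known(lookup, "USA-WA-VANCOUVER"), is_known(lookup, "VANCOUVER"))
print(coords(lookup, "USA-WA-VANCOUVER"))
```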
Data Sources
http://www.gadm.org/ provides worldwide free-for-academic-use shapefiles for all admins down to admin2. But it is missing GPS co-ords, so something else is needed.