Geographical Location - prekijpatel/MetaMiner GitHub Wiki
Location metadata often comes with a high level of inconsistency 🧐. Some users provide only country names, others include a full location path, and many use inconsistent formats (e.g., semicolons, commas, brackets), abbreviations, or even the name of the sequencing institute instead of the actual geographic location.
One major source of complexity is the use of local-language or alternate names for geographic divisions. For instance, a region in France may appear in its native French (e.g., "Provence-Alpes-Côte d’Azur") or in English (e.g., "Provence Alps"), with both referring to the same place. Left unstandardized, this kind of variation can cause confusion, duplication, or gaps in data analysis.
With the existing inconsistencies:
- Grouping isolates by location becomes flawed ❌
- Regional trends and epidemiological interpretations lose accuracy 📉
By standardizing such metadata, MetaMiner makes it possible to create meaningful visualizations such as choropleth maps 🗺️ and helps improve the accuracy of large-scale, location-based analyses.
How does MetaMiner do it? ✅
MetaMiner uses an algorithm in which the raw `geo_loc` data is parsed into separate entities, and each entity is then resolved step by step to identify the `Country` and `State`.
1. Parsing Raw Entries 🧩
Each raw location string is broken down into logical components by splitting it into terms on known delimiters (colons, commas, semicolons, pipes, dashes, underscores, etc.).
For example: "India: MH, Pune, BJ Medical College" becomes candidate terms: ["India", "MH", "Pune", "BJ Medical College"]
This allows us to identify geographical clues even when mixed with non-geographic ones.
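The splitting step can be sketched roughly as follows. This is a minimal illustration, not MetaMiner's actual parser: the function name `split_location` and the exact delimiter set are assumptions. In particular, this sketch splits on a dash only when it is surrounded by spaces, so that hyphenated names such as "Provence-Alpes-Côte d’Azur" stay in one piece.

```python
import re

# Assumed delimiter set (hypothetical, not taken from MetaMiner's source):
# colon, semicolon, comma, pipe, underscore, slash, brackets, and a dash
# that has whitespace on both sides.
_DELIMITERS = re.compile(r"[:;,|_/()\[\]]|\s-\s")


def split_location(raw: str) -> list[str]:
    """Split a raw location string into trimmed, non-empty candidate terms."""
    parts = _DELIMITERS.split(raw)
    return [part.strip() for part in parts if part.strip()]


print(split_location("India: MH, Pune, BJ Medical College"))
```

Empty fragments are dropped, so doubled delimiters such as "India:: MH" do not produce blank terms.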
2. Country and State Matching using PyCountry 🌍
MetaMiner then uses PyCountry to identify matches for these terms. PyCountry supports fuzzy matching across both countries and sub-national entities like states or provinces. It recognizes names in multiple languages, ISO-standardized codes, and common variants, making it ideal for interpreting messy data.
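A possible shape for this matching step is sketched below. It is an illustration under assumptions, not MetaMiner's code: the function names are hypothetical, only exact lookups are shown (PyCountry's fuzzy search is left out for brevity), and a reasonably recent `pycountry` release is assumed, where `subdivisions.get` returns `None` for an unknown code. The lookup functions are passed in as parameters so the matching loop can be exercised without the library.

```python
def pycountry_country(term):
    """Look up a country by name or ISO code; return None when nothing matches."""
    import pycountry  # third-party dependency, imported lazily

    try:
        return pycountry.countries.lookup(term)
    except LookupError:
        return None


def pycountry_state(alpha_2, term):
    """Look up a subdivision of one country by ISO 3166-2 suffix or exact name."""
    import pycountry

    by_code = pycountry.subdivisions.get(code=f"{alpha_2}-{term.upper()}")
    if by_code is not None:
        return by_code
    for sub in pycountry.subdivisions.get(country_code=alpha_2) or []:
        if sub.name.casefold() == term.casefold():
            return sub
    return None


def match_country_and_state(terms, find_country=pycountry_country,
                            find_state=pycountry_state):
    """Return (country, state) for the first term that resolves to a country.

    Either element may be None. The remaining terms are tried as states
    of the matched country only.
    """
    terms = list(terms)
    for i, term in enumerate(terms):
        country = find_country(term)
        if country is None:
            continue
        for other in terms[:i] + terms[i + 1:]:
            state = find_state(country.alpha_2, other)
            if state is not None:
                return country, state
        return country, None
    return None, None
```

Restricting the state search to the matched country matters because short abbreviations are ambiguous: "MH" is the ISO 3166-2 suffix for Maharashtra, but it is also the alpha-2 code of the Marshall Islands. A sketch like this therefore depends on the country term appearing before such abbreviations.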
3. Fallback to Nominatim 🧭
If PyCountry fails to resolve a location, MetaMiner passes the data to Nominatim, a geolocation API based on OpenStreetMap. This gives us a secondary route for less common or highly ambiguous entries 💪.
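The fallback could look roughly like the sketch below, which queries the public Nominatim search endpoint using only the standard library. The function names, the `User-Agent` string, and the choice of the public server are assumptions for illustration; how MetaMiner actually calls Nominatim is not shown on this page. The `ISO3166-2-lvl4` address key is read only if the response contains it.

```python
import json
import urllib.parse
import urllib.request

# Public Nominatim instance (assumed here; a self-hosted server would also work).
NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def build_query(term: str) -> str:
    """Build a search URL asking for one result with English address details."""
    params = {
        "q": term,
        "format": "jsonv2",
        "addressdetails": 1,
        "limit": 1,
        "accept-language": "en",
    }
    return f"{NOMINATIM_URL}?{urllib.parse.urlencode(params)}"


def parse_first_hit(payload: list) -> dict:
    """Extract country and state details from the first search result, if any."""
    if not payload:
        return {}
    address = payload[0].get("address", {})
    return {
        "country": address.get("country"),
        "country_code": (address.get("country_code") or "").upper() or None,
        "state": address.get("state"),
        "state_iso_code": address.get("ISO3166-2-lvl4"),
    }


def nominatim_lookup(term: str, user_agent: str = "metaminer-example/0.1") -> dict:
    """Query Nominatim for one term (performs a network request)."""
    request = urllib.request.Request(
        build_query(term), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_first_hit(json.load(response))
```

The public Nominatim server asks clients to send an identifying `User-Agent` and to stay at or below one request per second, so caching resolved terms is worthwhile when processing large metadata tables.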
4. Standardized Output Fields
Once resolved, MetaMiner structures every location entry with consistent fields:
- country_name: Full ISO country name
- country_common_name: Widely used or informal name
- country_three_lettered_code: ISO 3166-1 alpha-3 code
- state_name: Name of the sub-national region in English
- state_iso_code: ISO 3166-2 state/province code (if resolvable)
Note: MetaMiner normalizes locations only up to the state/province level (ADM1). Some entries include district- or city-level data, but it is not consistently available across all entries. To keep the algorithm simple (and for our sanity!😅🧘), we standardize only up to the state level.
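One way to represent the five output fields is a small record type, sketched below. The class and function names are hypothetical, and the mapping from PyCountry-style attributes (`name`, `common_name`, `alpha_3`, `code`) to the output fields is an assumption, not MetaMiner's confirmed logic. Unresolved values are left as `None`, which corresponds to the "-" entries in the table that follows.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class StandardLocation:
    """The five standardized fields; None means 'not resolvable'."""
    country_name: Optional[str] = None
    country_common_name: Optional[str] = None
    country_three_lettered_code: Optional[str] = None
    state_name: Optional[str] = None
    state_iso_code: Optional[str] = None


def to_standard_location(country=None, state=None) -> StandardLocation:
    """Build a record from PyCountry-style country and subdivision objects."""
    if country is None:
        return StandardLocation()
    return StandardLocation(
        country_name=country.name,
        # Assumption: fall back to the ISO name when no common name exists.
        country_common_name=getattr(country, "common_name", country.name),
        country_three_lettered_code=country.alpha_3,
        state_name=getattr(state, "name", None),
        state_iso_code=getattr(state, "code", None),
    )
```

Calling `asdict()` on the result gives a plain dictionary that can be written out as one table row per isolate.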
Example of standardized output
Raw Metadata Entry | Country Name | Common Name | Country Code | State Name | State ISO Code |
---|---|---|---|---|---|
India: MH; Pune; BJ Medical College | India | India | IND | Maharashtra | IN-MH |
Brasil: São Paulo | Brazil | Brazil | BRA | São Paulo | BR-SP |
CHN: Beijing Shi | China | China | CHN | Beijing | CN-BJ |
Germany: Bayern | Germany | Germany | DEU | Bavaria | DE-BY |
US: CA | United States | USA | USA | California | US-CA |
France: Provence-Alpes-Côte d’Azur | France | France | FRA | Provence-Alpes-Côte d’Azur | FR-PAC |
South Korea:Seoul | Republic of Korea | South Korea | KOR | Seoul | KR-11 |
India | India | India | IND | - | - |
Unknown; Not available | - | - | - | - | - |
National Genomics Facility, Canada | Canada | Canada | CAN | - | - |
Wuhan Institute of Virology | China | China | CHN | Hubei Sheng | CN-HB |