Geographical Location - prekijpatel/MetaMiner GitHub Wiki

Location metadata is often highly inconsistent 🧐. Some users provide only country names, others include a full location path, and many use inconsistent formats (e.g., semicolons, commas, brackets), abbreviations, or even the names of sequencing institutes instead of an actual geographic location.

One major source of complexity is the use of local-language or alternate names for geographic divisions. For instance, a region in France may appear in its native French (e.g., "Provence-Alpes-Côte d’Azur") or in English (e.g., "Provence Alps"), with both referring to the same place. Left unstandardized, this kind of variation can cause confusion, duplication, or gaps in data analysis.

With the existing inconsistencies:

  • Grouping the isolates based on location becomes flawed ❌

  • Regional trends and epidemiological interpretations lose accuracy 📉

By standardizing such metadata, MetaMiner makes it possible to create meaningful visualizations like choropleth maps 🗺️ and helps improve the accuracy of large-scale, location-based analyses.

How does MetaMiner do it? ✅

MetaMiner uses an algorithm where raw geo_loc data is parsed into separate entities, and then each of them is resolved step by step to identify the Country and State.

1. Parsing Raw Entries 🧩

Each raw location string is broken down into logical components by splitting it into terms on known delimiters (colons, commas, semicolons, pipes, dashes, underscores, etc.).

For example: "India: MH, Pune, BJ Medical College" becomes candidate terms: ["India", "MH", "Pune", "BJ Medical College"]

This allows us to identify geographical clues even when mixed with non-geographic ones.
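The splitting step can be sketched as follows. This is a minimal illustration, not MetaMiner's actual code: the function name `split_terms` and the exact delimiter set are assumptions. In particular, this sketch splits on a dash only when it is surrounded by whitespace, so that hyphenated names such as "Provence-Alpes-Côte d’Azur" stay intact.

```python
import re

# Assumed delimiter set: colon, semicolon, comma, pipe, underscore,
# plus a dash that has whitespace on both sides.
_DELIMITERS = re.compile(r"\s*[:;,|_]\s*|\s+-\s+")


def split_terms(raw):
    """Split a raw location string into non-empty candidate terms."""
    parts = _DELIMITERS.split(raw)
    # Drop empty fragments left by repeated or trailing delimiters.
    return [part.strip() for part in parts if part and part.strip()]


print(split_terms("India: MH, Pune, BJ Medical College"))
# ['India', 'MH', 'Pune', 'BJ Medical College']
```

Keeping the original order of the terms matters for the later steps, because the country usually appears first and helps disambiguate the terms that follow it.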

2. Country and State Matching using PyCountry 🌍

MetaMiner then uses PyCountry to identify matches for these terms. PyCountry supports fuzzy matching across both countries and sub-national entities like states or provinces. It recognizes names in multiple languages, ISO-standardized codes, and common variants—making it ideal for interpreting messy data.
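A rough sketch of how candidate terms might be resolved with pycountry is shown below. The helper names `resolve_country` and `resolve_state` are hypothetical, and the sketch uses exact lookups (`pycountry.countries.lookup` and `pycountry.subdivisions.get`) rather than the fuzzy matching described above, so it is a simplification and not MetaMiner's implementation. The lookup sources are injectable so the logic can be exercised without pycountry installed.

```python
def resolve_country(terms, lookup=None):
    """Return the first term that resolves to a country, or None."""
    if lookup is None:
        import pycountry  # third-party; imported lazily
        lookup = pycountry.countries.lookup
    for term in terms:
        try:
            return lookup(term)  # matches names and ISO codes
        except LookupError:
            continue
    return None


def resolve_state(terms, alpha_2, subdivisions=None):
    """Return the first subdivision of country `alpha_2` matching a term."""
    if subdivisions is None:
        import pycountry
        subdivisions = pycountry.subdivisions.get(country_code=alpha_2) or []
    for term in terms:
        wanted = term.casefold()
        for sub in subdivisions:
            # Compare against the name, the full code ("IN-MH"),
            # and the part after the country prefix ("MH").
            short_code = sub.code.split("-", 1)[-1]
            candidates = (sub.name, sub.code, short_code)
            if wanted in (c.casefold() for c in candidates):
                return sub
    return None
```

Resolving the country first is a deliberate choice in this sketch: short terms are ambiguous on their own (for example, "CA" is the ISO alpha-2 code for Canada as well as a common abbreviation for California), so restricting the state search to the subdivisions of an already-resolved country avoids many false matches. For looser matching, pycountry also offers `pycountry.countries.search_fuzzy`.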

3. Fallback to Nominatim 🧭

If PyCountry fails to resolve a location, MetaMiner passes the data to Nominatim, a geocoding API based on OpenStreetMap. This gives us a secondary route for less common or highly ambiguous entries 💪.
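The fallback could look roughly like the sketch below, which queries the public Nominatim `/search` endpoint using only the standard library. The function names, the choice of response fields, and the use of the public instance are assumptions for illustration; MetaMiner's actual request logic may differ.

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def build_search_url(query):
    """Build a free-text search URL that asks for address details in English."""
    params = {
        "q": query,
        "format": "jsonv2",
        "addressdetails": 1,
        "limit": 1,
        "accept-language": "en",
    }
    return NOMINATIM_URL + "?" + urllib.parse.urlencode(params)


def extract_country_state(results):
    """Pull country and state from a decoded Nominatim result list."""
    if not results:
        return None
    address = results[0].get("address", {})
    code = address.get("country_code")
    return {
        "country_code": code.upper() if code else None,
        "country": address.get("country"),
        "state": address.get("state"),  # not present for every result
    }


def geocode(query, user_agent):
    """Query Nominatim over the network and return country/state, or None."""
    request = urllib.request.Request(
        build_search_url(query), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_country_state(json.load(response))
```

The public Nominatim instance expects an identifying User-Agent and limits clients to about one request per second, so caching resolved strings and calling `geocode` only for entries that PyCountry could not resolve keeps the request volume low.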

4. Standardized Output Fields

Once resolved, MetaMiner structures every location entry with consistent fields:

  • country_name: Full ISO country name

  • country_common_name: Widely used or informal name

  • country_three_lettered_code: ISO 3166-1 alpha-3 code

  • state_name: Name of the sub-national region in English

  • state_iso_code: ISO 3166-2 state/province code (if resolvable)

Note: MetaMiner normalizes locations up to the state/province level (ADM1) only. District- or city-level data exists for some entries, but it is not consistently available across all of them. Thus, to avoid further complications and keep the algorithm simple (and for our sanity!😅🧘), we standardize only up to the state level.
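One way to represent the standardized output in code is a small record type. The sketch below is illustrative only: the class name `StandardizedLocation` and the `as_row` helper are hypothetical, while the field names follow the list above. Unresolved fields default to `None` and are rendered as "-" when exported.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class StandardizedLocation:
    """One resolved location, down to state/province (ADM1) level."""

    country_name: Optional[str] = None
    country_common_name: Optional[str] = None
    country_three_lettered_code: Optional[str] = None
    state_name: Optional[str] = None
    state_iso_code: Optional[str] = None

    def as_row(self):
        """Return a dict in which unresolved fields are shown as '-'."""
        return {
            key: "-" if value is None else value
            for key, value in asdict(self).items()
        }


record = StandardizedLocation(
    country_name="India",
    country_common_name="India",
    country_three_lettered_code="IND",
)
print(record.as_row()["state_name"])  # -
```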

Example of standardization

| Raw Metadata Entry | Country Name | Common Name | Country Code | State Name | State ISO Code |
|---|---|---|---|---|---|
| India: MH; Pune; BJ Medical College | India | India | IND | Maharashtra | IN-MH |
| Brasil: São Paulo | Brazil | Brazil | BRA | São Paulo | BR-SP |
| CHN: Beijing Shi | China | China | CHN | Beijing | CN-BJ |
| Germany: Bayern | Germany | Germany | DEU | Bavaria | DE-BY |
| US: CA | United States | USA | USA | California | US-CA |
| France: Provence-Alpes-Côte d’Azur | France | France | FRA | Provence-Alpes-Côte d’Azur | FR-PAC |
| South Korea:Seoul | Republic of Korea | South Korea | KOR | Seoul | KR-11 |
| India | India | India | IND | - | - |
| Unknown; Not available | - | - | - | - | - |
| National Genomics Facility, Canada | Canada | Canada | CAN | - | - |
| Wuhan Institute of Virology | China | China | CHN | Hubei Sheng | CN-HB |