Isolation Source - prekijpatel/MetaMiner GitHub Wiki

Metadata describing the isolation source—from where an isolate came—is important for pathogen surveillance, environmental monitoring, and epidemiological studies. Whether an organism was isolated from human stool, river water, soil, or a medical device can significantly influence how we interpret microbial diversity, transmission patterns, antimicrobial resistance (AMR), and other health risks.

But here's the problem: in our experience, the isolation source metadata is the most chaotic and frustrating to deal with. 😵‍💫 People use all sorts of terms to describe the isolation source—from standard medical lingo to full-blown sample backstories.

For example, if an organism was isolated from the blood of a septic patient, the information could appear as "blood", "blood stream", "hemoculture", "blood culture", or a descriptive entry like "blood sample from a patient (age 55) with bacteremia and was given X antibiotic". We are in no way saying that giving more information is bad; it actually helps in understanding the context better. But when we are looking at hundreds or thousands of genomes and want to sort them quickly by certain criteria, these inconsistencies become very challenging.

MetaMiner works through these inconsistencies in isolation source data and arranges the data into structured, semantically rationalized categories.

How does MetaMiner do it? 🔍

So, to make sense of this chaos, we experimented with multiple metadata fields to derive the structured isolation source. After plenty of trial and error, we narrowed it down to three metadata fields:

  • host,
  • host_disease, and
  • isolation_source

In many cases, values for one or more of these fields are missing or vague. Thus, MetaMiner does not treat these as isolated fields. Instead, it takes into account all available metadata across these three fields and considers them collectively for normalization. The normalization process works as follows:
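As a rough illustration of treating the three fields collectively, here is a minimal sketch that cleans each field and bundles them into one key. The `MISSING` set and the function names `clean` and `combined_key` are assumptions made for this example, not names taken from MetaMiner's codebase.

```python
# Placeholder strings treated as "missing"; this set is an illustrative assumption.
MISSING = {"", "na", "n/a", "none", "unknown", "missing", "not available", "not reported"}


def clean(value):
    """Lowercase and trim a raw field; return '' for missing or placeholder values."""
    text = (value or "").strip().lower()
    return "" if text in MISSING else text


def combined_key(host, host_disease, isolation_source):
    """Return the three cleaned fields as one tuple so they are judged together."""
    return (clean(host), clean(host_disease), clean(isolation_source))


print(combined_key("Homo sapiens", "N/A", " Blood culture "))
# ('homo sapiens', '', 'blood culture')
```

Keeping an empty string for a missing field (rather than dropping it) preserves the position of each field, so a later step can still tell which field a value came from.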

MetaMiner's custom-built Algorithm:

Every raw data entry (combination of host, host_disease, and isolation_source) is matched against a huge internal dictionary. This dictionary is a map of known terms, synonyms, misspellings, strange abbreviations, and all the weird stuff that we've seen in the wild 🐾. The matching is done using the rapidfuzz library along with empirical thresholds to handle close-but-not-quite matches.
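To give a feel for threshold-based fuzzy matching against a term dictionary, here is a small sketch. MetaMiner uses rapidfuzz; this example swaps in Python's standard-library `difflib.SequenceMatcher` so it runs without extra dependencies. The tiny `TERM_MAP`, the function names, and the threshold of 85 are all assumptions for illustration, not MetaMiner's actual dictionary or cutoffs.

```python
from difflib import SequenceMatcher

# Tiny hypothetical dictionary: known variant -> normalized term.
TERM_MAP = {
    "blood": "Blood",
    "blood culture": "Blood",
    "hemoculture": "Blood",
    "urine": "Urine",
    "sputum": "Sputum",
}


def similarity(a, b):
    """Similarity score on a 0-100 scale (difflib stand-in for a rapidfuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def match_term(raw, threshold=85.0):
    """Return the normalized term for the closest known variant, or None."""
    query = raw.strip().lower()
    best_key = max(TERM_MAP, key=lambda known: similarity(query, known))
    if similarity(query, best_key) >= threshold:
        return TERM_MAP[best_key]
    return None


print(match_term("Hemoculture"))   # exact match after lowercasing -> Blood
print(match_term("blod culture"))  # close misspelling -> Blood
print(match_term("river water"))   # nothing close enough -> None
```

The threshold is the knob that trades missed matches against wrong ones, which is why it has to be tuned empirically on real metadata.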

From the raw entry, the algorithm derives a four-leveled classification. The first level broadly assigns each entry to a host—such as Hospital, Animal, Environmental, Laboratory, and so on. The remaining three levels vary depending on the identified host (described in the next section!).

The Local Isolation Source Database (aka our sanity saver😇):

The algorithm described above works well, but when you're handling thousands of genomes, matching every entry to the dictionary can take a bit of time. (We’re talking minutes, not hours, but we’d love to get it down to a few seconds 🫣) Thus, to speed things up, we have built a local database. This database contains pre-processed combinations of host, host_disease, and isolation_source: essentially a giant key-value map of real-world metadata we've already seen and dealt with.

We built this by running MetaMiner's algorithm on a fairly large collection of metadata, followed by a bit of manual curation. So now, when MetaMiner encounters a combination that exists in the local database, it just grabs the pre-normalized result: fast and easy.

If the combination is new, it falls back to the full dictionary-driven normalization algorithm. This hybrid approach is meant to keep things both flexible and fast.
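The lookup-then-fallback idea can be sketched in a few lines. Everything here is hypothetical: `LOCAL_DB` is a one-entry stand-in for the local database, and `placeholder_algorithm` stands in for the dictionary-driven algorithm; the real storage format and function names may differ.

```python
# Hypothetical pre-normalized entries keyed by (host, host_disease, isolation_source).
LOCAL_DB = {
    ("homo sapiens", "uti", "urine sample"): (
        "Hospital-associated", "Human Clinical Sample", "UTIs", "Urine",
    ),
}


def normalize(key, local_db, run_algorithm):
    """Return the stored result for a known key; otherwise run the slow path."""
    cached = local_db.get(key)
    if cached is not None:
        return cached
    return run_algorithm(key)


def placeholder_algorithm(key):
    """Stand-in for the dictionary-driven algorithm; always answers Unknown."""
    return ("Unknown",)


known = normalize(("homo sapiens", "uti", "urine sample"), LOCAL_DB, placeholder_algorithm)
new = normalize(("", "", "river water"), LOCAL_DB, placeholder_algorithm)
print(known[0], "|", new[0])  # Hospital-associated | Unknown
```

A dictionary lookup is a constant-time operation, so the fuzzy matching cost is only paid for combinations that have not been seen before.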

Categorization of Isolation Sources: Four-Level Hierarchy

In order to facilitate consistency in isolation source data, we employ a four-level classification system. Each level provides increasing granularity, beginning with the host and culminating in the specific anatomical or environmental origin of the isolate. This is roughly outlined in the following figure.

[Figure: Isolation source hierarchy]
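One way to picture a normalized result is as a small record with up to four levels. The sketch below is only an illustration: the `IsolationSource` class and its field names are assumptions, and the deeper levels are optional because, as described in the sections that follow, some categories use fewer than four levels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IsolationSource:
    """Hypothetical container for one normalized entry, broadest level first."""
    level1: str
    level2: Optional[str] = None
    level3: Optional[str] = None
    level4: Optional[str] = None

    def path(self):
        """Return only the levels that are filled in, in order."""
        levels = (self.level1, self.level2, self.level3, self.level4)
        return [lvl for lvl in levels if lvl is not None]


urine = IsolationSource("Hospital-associated", "Human Clinical Sample", "UTIs", "Urine")
lab = IsolationSource("Laboratory-associated")

print(" > ".join(urine.path()))
print(" > ".join(lab.path()))
```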

Now, let's clarify what each category contains.


Hospital-associated Isolates

Hospital-associated isolates in MetaMiner refer to bacteria derived either from human patients or from hospital environments. The contents of each classification level for this host are detailed below.

Level 1: Host = Hospital-associated

This is the top-level classification that broadly defines the source as being related to a hospital setting. It includes any isolate obtained from patients or from hospital infrastructure or equipment. All such isolates are grouped under the umbrella term Hospital-associated.

Level 2: host_category = identified category

Here, MetaMiner distinguishes the origin of the sample within the hospital context. It includes the following three broad categories:

  1. Human Clinical Samples: Isolates obtained from adult or general patient populations.

  2. Pediatric Clinical Samples: Isolates obtained from infants, children, or neonatal ICU cases.

  3. Hospital Environment Samples: Isolates obtained from inanimate sources within the hospital, such as surfaces, drains, catheters, or equipment.

This distinction helps in tracking not just infections but also contamination patterns.
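To make the three-way split concrete, here is a deliberately naive keyword sketch. The hint lists, the order in which they are checked, and the function name `hospital_category` are assumptions for illustration only; MetaMiner's real dictionary is far larger and its decision logic may differ.

```python
# Hypothetical keyword hints; substrings are matched case-insensitively.
ENVIRONMENT_HINTS = ("surface", "drain", "catheter", "equipment", "ventilator")
PEDIATRIC_HINTS = ("infant", "child", "neonat", "pediatric")


def hospital_category(text):
    """Pick a Level 2 category for text already known to be hospital-associated."""
    lowered = text.lower()
    if any(hint in lowered for hint in ENVIRONMENT_HINTS):
        return "Hospital Environment Sample"
    if any(hint in lowered for hint in PEDIATRIC_HINTS):
        return "Pediatric Clinical Sample"
    # Default bucket: adult or general patient populations.
    return "Human Clinical Sample"


for raw in ("Infant, sepsis case, blood", "ventilator tube", "Homo sapiens, UTI, urine sample"):
    print(raw, "->", hospital_category(raw))
```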

Level 3: host_disease = identified_disease

  • For clinical samples (human or pediatric), this corresponds to the disease condition of the patient, like:

    • Urinary Tract Infection (UTI)
    • Bacteremia
    • Meningitis
    • Wound infections, etc.
  • For hospital environmental samples, this level broadly groups the hospital setting or object from which the sample was collected, like:

    • Infusion and IV equipment
    • Surgical and Therapeutic equipment
    • Medical Imaging Equipment, etc.

This level helps in better linking pathogens with clinical or environmental conditions inside healthcare facilities.

Level 4: Actual isolation_source

This is the finest level of granularity. It defines the exact material or object from which the isolate was cultured, for example:

  • Clinical:

    • Blood
    • Urine
    • Sputum
    • Pus
    • Wound swab
  • Environmental:

    • Drain water
    • Ventilator
    • Catheter
    • Bedside table

Example Table for Hospital-associated Isolates

| Raw Metadata | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Homo sapiens, UTI, urine sample | Hospital-associated | Human Clinical Sample | UTIs | Urine |
| Infant, sepsis case, blood | Hospital-associated | Pediatric Clinical Sample | Sepsis/Septic Shock | Blood |
| Wound infection, pus swab from patient | Hospital-associated | Human Clinical Sample | Soft Tissue Infections/Colonization | Pus |
| ventilator tube | Hospital-associated | Hospital Environment Sample | Surgical and Therapeutic Equipment | Ventilator |

Animal-associated Isolates

Animal-associated isolates refer to bacterial strains derived from animals, whether domesticated, wild, aquatic, or even insects. Since the animal world introduces a large variety of possible sources, the categories here are slightly different from those used for Hospital-associated isolates.

Level 1: host = Animal-associated

This is the top level, which identifies the isolate’s origin as linked to an animal.

Level 2: animal_category = identified_category

This level categorizes animals based on their general ecosystem or role:

  1. Livestock: Domesticated animals reared for agriculture, dairy, or meat—e.g., cattle, poultry, pigs, goats, etc.

  2. Wild: Animals found in natural habitats, including forest-dwelling mammals and birds.

  3. Aquatic: Fish, amphibians, crustaceans, etc.—both freshwater and marine.

  4. Companion Animals: Pets like dogs, cats, rodents, etc.

  5. Insects: Insects and bugs, Insects and bugs, Insects and bugs! 😛

Level 3: animal_subcategory = identified_subcategory

This level adds further refinement by identifying specific groups within Level 2. For example:

  • Livestock:

    • Bovine (e.g., cattle, buffalo)
    • Porcine (e.g., pigs)
    • Poultry (e.g., chicken, turkey), etc.
  • Wild:

    • Rodents, reptiles, wild birds, etc.
  • Aquatic:

    • Fish, Shrimp, Squid, etc.
  • Companion:

    • Canine, feline, etc.
  • Insects:

    • Beetles, flies, mosquitoes, etc.

Level 4: Actual Isolation Source

This level captures the exact biological material or sample site. Examples include:

  • Pus from an infected site
  • Feces or intestine samples
  • Swabs from animal wounds
  • Surface swabs from animal markets

Example Table for Animal-associated Isolates

| Raw Metadata | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Bos taurus, abscess, pus | Animal-associated | Livestock | Bovine | Pus |
| Gallus gallus, fecal sample | Animal-associated | Livestock | Poultry | Feces |
| Canine wound swab | Animal-associated | Companion | Canine | Swab |
| chicken meatball | Animal-associated | Livestock | Poultry | Meat/Organ |
| Boot Swab from Poultry | Animal-associated | Livestock | Poultry | Breeding/Hospital Environment Sample |
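The first three levels for animal entries can be pictured as a lookup from a host term to a (category, subcategory) pair. The sketch below is a toy version under stated assumptions: `ANIMAL_GROUPS` holds only a handful of hypothetical terms, and plain substring matching stands in for the fuzzy, dictionary-driven matching described earlier.

```python
# Hypothetical host terms mapped to (Level 2, Level 3).
ANIMAL_GROUPS = {
    "bos taurus": ("Livestock", "Bovine"),
    "cattle": ("Livestock", "Bovine"),
    "gallus gallus": ("Livestock", "Poultry"),
    "chicken": ("Livestock", "Poultry"),
    "poultry": ("Livestock", "Poultry"),
    "canine": ("Companion", "Canine"),
}


def animal_levels(text):
    """Return (Level 1, Level 2, Level 3) if a known animal term appears, else None."""
    lowered = text.lower()
    for term, (category, subcategory) in ANIMAL_GROUPS.items():
        if term in lowered:
            return ("Animal-associated", category, subcategory)
    return None


print(animal_levels("Gallus gallus, fecal sample"))
print(animal_levels("Canine wound swab"))
print(animal_levels("river water"))
```

Level 4 (the actual sample material, such as feces or pus) would be resolved separately from the isolation_source text.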

Environment-associated Isolates

Environment-associated isolates belong to the most diverse and challenging category for normalization. Plus, in our experience, metadata entries for these isolates tend to be more descriptive, and often more inconsistent, than those for animal- or hospital-associated samples. This category encompasses a wide range of sources, including rivers, soil, air, industrial waste, sewage, and even bioreactors. Because of this, introducing finer granularity would have required an exceptionally large and detailed dictionary, which is currently not feasible. Therefore, for Environment-associated isolates, MetaMiner uses a three-level classification structure.

Level 1: host = Environment-associated

This top-level category identifies that the isolate has been sourced from the natural or built environment, rather than from biological hosts.

Level 2: environment_category = Broad Environmental Category

This level defines broad domains of environmental sources, including:

  • Soil (e.g., agricultural, garden, forest floor)
  • Water (e.g., river, lake, marine, sewage)
  • Air (e.g., indoor air, outdoor air samples)
  • Waste (e.g., landfill, bioreactor sludge)

Level 3: Specific Environmental Source

At this level, entries are refined to capture more granular context, such as:

  • Soil Types:

    • Agricultural soil
    • Forest soil
    • Manure-enriched soil, etc.
  • Residential Areas:

    • Household
    • Care Facility, etc.
  • Food:

    • Contaminated Herbs
    • Dairy
    • Contaminated Vegetables
    • Restaurants
    • Animal Feed, etc.

Example Table for Environment-associated Isolates

| Raw Metadata | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| Agricultural soil from paddy field | Environment-associated | Soil | Paddy Soil |
| River sediment sample | Environment-associated | Soil | Sedimentary Soil |
| Municipal wastewater sludge | Environment-associated | Water | Wastewater sludge |

Laboratory-based Isolates

Laboratory-associated isolates refer to bacterial isolates that originate or are primarily processed in research laboratory environments. These include:

  • Experimental strains used in research
  • Model organisms frequently cultured in labs (e.g., E. coli K-12)
  • Reference strains like ATCC (American Type Culture Collection) strains
  • Genetically modified variants of clinical isolates
  • Strains with synthetic modifications or antibiotic resistance experiments

Unlike other categories, laboratory-associated isolates are treated as a flat classification, meaning they have no hierarchy below Level 1. This is because metadata for these isolates rarely contains meaningful biological or environmental context beyond their lab-based origin.

Example Table for Laboratory-based Isolates

| Raw Metadata | Level 1 |
|---|---|
| E. coli K-12 MG1655 | Laboratory-associated |
| ATCC 25922, antimicrobial susceptibility test | Laboratory-associated |
| Synthetic variant of Salmonella Typhi | Laboratory-associated |

Unknown/Unclassified Isolates

This category is used when the metadata is missing, unclear, or inconsistent, such that a confident assignment to the hospital-, animal-, environment-, or laboratory-associated categories is not possible. These entries typically include:

  • Ambiguous textual data (e.g., "unknown", "not reported", "N/A")
  • Only institute names or partial identifiers
  • Conflicting fields (e.g., human in host but environmental keyword in isolation source)

Such entries are assigned the "Unknown" label by default until they are manually corrected or clarified by additional metadata.
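Two of the situations listed above (placeholder-only entries and conflicting fields) can be sketched as simple checks. The `PLACEHOLDERS`, `HUMAN_HOSTS`, and `ENV_KEYWORDS` collections and the function names are assumptions for illustration; the actual rules in the codebase are more extensive and may work differently.

```python
# Hypothetical vocabularies used only for this example.
PLACEHOLDERS = {"", "unknown", "not reported", "n/a", "not available"}
HUMAN_HOSTS = {"homo sapiens", "human"}
ENV_KEYWORDS = ("river", "soil", "sewage", "lake")


def is_placeholder(value):
    """True when a field is empty or only says that nothing is known."""
    return (value or "").strip().lower() in PLACEHOLDERS


def should_be_unknown(host, host_disease, isolation_source):
    """Flag entries with no usable information or with conflicting fields."""
    if all(is_placeholder(f) for f in (host, host_disease, isolation_source)):
        return True
    host_l = (host or "").strip().lower()
    source_l = (isolation_source or "").strip().lower()
    # Conflict: a human host paired with an environmental keyword in the source.
    return host_l in HUMAN_HOSTS and any(k in source_l for k in ENV_KEYWORDS)


print(should_be_unknown("N/A", None, "not reported"))      # True
print(should_be_unknown("Homo sapiens", "", "river water"))  # True
print(should_be_unknown("Homo sapiens", "UTI", "urine"))     # False
```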

Example Table

| Raw Metadata | Level 1 |
|---|---|
| Not available | Unknown |
| Sequenced at VMMC, Urine | Unknown |
| Isolate info not provided | Unknown |
| Fecal Sample | Unknown |

Note: If you're interested in knowing exactly what terms or categories are included—say, the list of human diseases, animal classifications, or environmental sources—please refer to our codebase. The full dictionary is extensive, and including it here would be both overwhelming and redundant.

Also, a little heads-up: this normalization involves a bunch of carefully thought-out assumptions. Please do take a look at those before jumping to any conclusions. We’ve genuinely tried to make sense of this madness—but hey, if you’ve got smarter or saner ways to handle this, we’d love to hear from you!