Assumptions for Categorizing Isolation Source - prekijpatel/MetaMiner GitHub Wiki

Normalizing the isolation source data was quite difficult because the entries were very inconsistent. To achieve meaningful standardization, we had to make several assumptions and define rules based on recurring patterns observed in the dataset. While some of these assumptions might seem complex when explained, they allowed us to simplify and categorize highly heterogeneous data into formats suitable for downstream analysis.

Below, we present the key assumptions and strategies used in our normalization pipeline, along with examples where relevant.

If you know of better approaches to handle any of the scenarios we've tackled—or others we might have overlooked—we'd love to hear from you. Your suggestions can help us improve the process and make it more robust and useful for everyone.🙌

For host = Hospital- or Animal-associated

Fluids with unclear or vague descriptions are categorized under Uncategorized Fluids

  • For sample descriptions containing terms like "fluid," "secretions," "aspirate," or "bodily fluid," where the exact context or origin was unclear or difficult to determine, these entries were categorized under Uncategorized Fluids.
  • When the context of the fluid type couldn't be easily discerned, grouping them under Uncategorized Fluids helped avoid misclassification and ensured consistency in the dataset.

Note: Some entries that may refer to unusual or less-defined sample sites, which do not yet belong to an established category, are also placed under Uncategorized Fluids. These may require further refinement in the future as additional context is gathered.
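The fallback described above can be sketched as a keyword check. This is only an illustration, not the pipeline's actual code: the function name `categorize_fluid`, the `SPECIFIC_FLUIDS` map, and the exact keyword list are assumptions made for the example.

```python
from typing import Optional

# Vague terms named in the rule above (hypothetical list, matched as substrings).
VAGUE_FLUID_TERMS = ("fluid", "secretion", "aspirate")

# Hypothetical examples of fluids that already have an established category.
SPECIFIC_FLUIDS = {
    "pleural fluid": "Pleural Fluid",
    "synovial fluid": "Synovial Fluid",
}

def categorize_fluid(isolation_source: str) -> Optional[str]:
    """Return a fluid category, or None if the entry is not a fluid."""
    text = isolation_source.strip().lower()
    # Prefer a specific, established category when one matches.
    for phrase, category in SPECIFIC_FLUIDS.items():
        if phrase in text:
            return category
    # Otherwise fall back to the catch-all bucket for vague descriptions.
    if any(term in text for term in VAGUE_FLUID_TERMS):
        return "Uncategorized Fluids"
    return None
```

Checking specific fluids first keeps the catch-all bucket from absorbing entries that could be placed more precisely.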


Samples without a specified host are categorized as host = Unknown

  • If a sample is provided without a clear indication of its host (e.g., "Urine" without specifying whether it’s from a human or animal), it is categorized as Unknown.
  • This approach helps prevent misclassification, as the exact origin of the sample is unclear.

If only a joint is mentioned without an isolation source, the sample is categorized as Synovial Fluid.

  • If the sample description includes only a joint without specifying the isolation source, it is assumed to be Synovial Fluid, because synovial fluid is commonly associated with joint-related infections or inflammation. For instance:

| Host | Host Disease | Isolation Source | Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- | --- | --- | --- |
| Homo sapiens | Missing | Knee Fluid | Hospital-associated | Human Clinical Sample | Bone Infection/Inflammation | Synovial Fluid |

  • In the other scenario, if the sample description includes a specific term like "swab," it is interpreted as being from a soft tissue or external surface, unless context suggests otherwise.

| Host | Host Disease | Isolation Source | Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- | --- | --- | --- |
| Homo sapiens | Missing | Knee Swab | Hospital-associated | Human Clinical Sample | Soft Tissue Infections/Colonization | Skin Swab |
  • This is the only case where host disease is often missing, making assumptions necessary. However, in most other cases, the host_disease provides helpful context for deciding the sample category.
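The two knee examples above could be expressed roughly as follows. This is a minimal sketch that assumes the host disease is missing; the `JOINT_TERMS` set and the function name are hypothetical, and the real pipeline may recognize joints differently.

```python
import re
from typing import Optional, Tuple

# Hypothetical list of joint names; extend as needed.
JOINT_TERMS = {"knee", "hip", "elbow", "shoulder", "ankle", "wrist", "joint"}

def categorize_joint_sample(isolation_source: str) -> Optional[Tuple[str, str]]:
    """Return (Level 3, Level 4) for a joint-only description, else None."""
    # Match whole words so that "hip" does not fire on "chip".
    words = set(re.findall(r"[a-z]+", isolation_source.lower()))
    if not words & JOINT_TERMS:
        return None
    if "swab" in words:
        # A swab is read as a soft tissue or external surface sample.
        return ("Soft Tissue Infections/Colonization", "Skin Swab")
    # A bare joint (or joint fluid) is assumed to be synovial fluid.
    return ("Bone Infection/Inflammation", "Synovial Fluid")
```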

Samples from abscesses, boils, or furuncles are categorized as Pus.

  • Isolation sources described as abscess, abscess fluid, abscess swab, boil, furuncle, or any variant thereof are categorized as pus. This is because all these terms indicate a localized collection of purulent material (pus).

Conjunctivitis implies Eye Swab if sample is unspecified.

  • For both human and animal-associated entries, when the host disease is listed as "conjunctivitis" but the specific isolation source (sample) is not provided, it is assumed to be an Eye Swab. This is based on the common diagnostic practice for conjunctivitis.
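As a rough illustration of this rule, the sample can be inferred from the host disease only when the isolation source is absent. The set of placeholder values treated as "missing" and the function name `infer_sample` are assumptions for this sketch.

```python
from typing import Optional

# Hypothetical placeholders that count as "no isolation source provided".
MISSING_VALUES = {"", "missing", "unknown", "not collected", "not applicable"}

def infer_sample(host_disease: Optional[str], isolation_source: Optional[str]) -> Optional[str]:
    """Keep a real isolation source; otherwise infer Eye Swab for conjunctivitis."""
    source = (isolation_source or "").strip()
    if source.lower() not in MISSING_VALUES:
        return source  # an explicit sample always wins
    if "conjunctivitis" in (host_disease or "").lower():
        return "Eye Swab"
    return None
```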

For host = Hospital-associated

As far as possible, multi-site/systemic pathogens (like Salmonella) are categorized by the site of sample collection, not by the general disease name.

  • When the pathogen is known to cause systemic or multi-organ infections, such as Salmonella, the disease classification is determined by the specific site from which the sample was collected — not the overall pathogen profile. This prevents overgeneralization and anchors the disease context in the biological relevance of the isolation site, enhancing accuracy for downstream analysis.
  • Here's a list of identified diseases based on sample:
| Identified Sample(s) | Disease Classification |
| --- | --- |
| Blood, Plasma, Serum | Bacteremia |
| Urine, Urethral Swab, Bladder | UTIs |
| Sputum, Pleural Fluid, BAL, Tracheal, Nasal, Throat, Laryngeal | Respiratory Infections/Illnesses |
| Pus, Wound Swab, Skin Swab, Skin, Axillary Swab, Inguinal Swab, etc. | Soft Tissue Infections/Colonization |
| Bile | Gallbladder and Biliary Tract Disorders |
| Stool, Peritoneal Fluid, Rectal Swab, Intestinal, Gastric, Abdominal | Gastrointestinal Disorders |
| CSF | CNS Infections |
| Bone, Synovial Fluid, Cartilage | Bone Infection/Inflammation |
| Eye Swab, Tears, Vitreous Humor | Eye Infection |
| Ear Swab | Ear Infection |
| Oral Swab, Saliva, Root Canal, Tooth | Oral Infection/Colonization |
| Cervical/Vaginal Swab, IUDs, Genital Swab, Penile Swab, etc. | Reproductive System Infections/Illnesses |
| Bursa Sample | Bursitis |
| Lymph Node, Lymph | Lymph Node Infections |
| Pericardial | Cardiac Disorders |
| Implant | Implant-associated Infections |

Note: This applies only when host_disease is not clearly defined and is given in general terms, such as "Salmonella infection".
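The sample-to-disease table above lends itself to a plain lookup. The sketch below includes only a few rows from the table, and the `GENERIC_DISEASES` set and the function name `resolve_disease` are hypothetical; how the pipeline actually decides that a disease is "generic" is not specified here.

```python
# A subset of the sample-to-disease table above, keyed by normalized sample name.
SAMPLE_TO_DISEASE = {
    "Blood": "Bacteremia",
    "Urine": "UTIs",
    "Sputum": "Respiratory Infections/Illnesses",
    "Stool": "Gastrointestinal Disorders",
    "CSF": "CNS Infections",
    "Bile": "Gallbladder and Biliary Tract Disorders",
}

# Hypothetical examples of host_disease values considered too general.
GENERIC_DISEASES = {"salmonella infection"}

def resolve_disease(host_disease: str, sample: str) -> str:
    """Use the sample site only when the stated disease is a generic term."""
    if host_disease.strip().lower() in GENERIC_DISEASES:
        # Fall back to the original text if the sample is not in the table.
        return SAMPLE_TO_DISEASE.get(sample, host_disease)
    return host_disease
```

A specific host_disease is left untouched, which matches the note that the table is a fallback rather than an override.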


Handling multiple diseases in humans based on sample indication

  • When multiple diseases are listed for a human host, and the sample indicates only one of the diseases, the disease not indicated by the sample is still recorded under the host_disease field. This ensures that relevant information regarding the potential disease context is retained.

For example, for a case with "Bacteremia and colorectal cancer" where the sample is blood, bacteremia is already implied by the sample type (Blood), so colorectal cancer is kept in the host_disease field to retain more information about the possible disease context.

| Host | Host Disease | Isolation Source | Level 1 | Level 2 | Level 3 | Level 4 |
| --- | --- | --- | --- | --- | --- | --- |
| Homo sapiens | Bacteremia, Colorectal Cancer | Blood | Hospital-associated | Human Clinical Sample | Solid Tumors | Blood |
| Homo sapiens | VAP, Cystitis | Urine | Hospital-associated | Human Clinical Sample | Respiratory Infections/Illnesses | Urine |

Although categorizing the host based on either disease would not have been wrong, the idea is to accommodate as much context as possible. However, if more than two diseases are listed in the raw Host Disease data, there is nothing we can do about it (so far!).
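One way to read the two example rows above is that Level 3 takes the disease the sample does not already imply. The sketch below encodes that reading; the function name is hypothetical, it assumes each disease has already been mapped to its category (for instance, that Cystitis maps to UTIs), and the fallback to the first listed disease is an assumption rather than a documented rule.

```python
from typing import List, Optional

def pick_level3(disease_categories: List[str], sample_category: Optional[str]) -> Optional[str]:
    """Choose the Level 3 category when up to two diseases are listed.

    disease_categories: one category per listed disease, in the original order.
    sample_category: the category implied by the sample, if any.
    """
    if not disease_categories:
        return None
    if len(disease_categories) == 2 and sample_category in disease_categories:
        # The sample already conveys one disease, so keep the other one.
        others = [c for c in disease_categories if c != sample_category]
        if others:
            return others[0]
    # Assumption: otherwise fall back to the first listed disease.
    return disease_categories[0]
```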


Hemolytic Uremic Syndrome (HUS) is categorized under Hematologic Disorders and not as STEC Infection

HUS is typically a complication of STEC/EHEC infections, which are gastrointestinal diseases. However, to avoid confusion with pneumococcal HUS (which is associated with pneumonia and other respiratory issues), HUS is classified under Hematologic Disorders. That said, if you are browsing the metadata of Escherichia coli and see HUS, it can most likely be inferred to be an STEC Infection.


HIV and AIDS are not categorized together.

  • HIV is categorized separately, while AIDS (Acquired Immunodeficiency Syndrome) is categorized under Immunodeficiency Disorders. This is to account for the possibility of rare cases where AIDS can develop without HIV.

Complications with Catheters

  • Catheter-related entries are handled in the following ways:
    1. Catheter-associated infections: When it is clearly indicated that the infection is related to the catheter (e.g., catheter-associated bloodstream infection), the host's disease is identified as Catheter-associated Infections.
    2. When a catheter is given as the isolation source along with a specific host disease, the sample is identified as Catheter.
    3. When a catheter is given as the isolation source but no disease is indicated, the entry moves from Human Clinical Sample to Hospital Environment Sample.
| host | host_disease | isolation_source | level1 | level2 | level3 | level4 |
| --- | --- | --- | --- | --- | --- | --- |
| Homo sapiens | Klebsiella pneumoniae BSI | venous catheter | Hospital-associated | Human Clinical Sample | Bacteremia | Catheter |
| Homo sapiens | Venous catheter associated infection | venous catheter | Hospital-associated | Hospital Environment Sample | Catheter-associated Infections | Catheter |
| Homo sapiens | Pyelonephritis, Kidney Failure, Chronic cystitis | Catheter(Urine) | Hospital-associated | Hospital Environment Sample | Kidney Disorders | Catheter |
| Homo sapiens | Unknown | venous catheter | Hospital-associated | Hospital Environment Sample | Infusion and IV Equipment | Catheter |
| Homo sapiens | Unknown | venous catheter tip | Hospital-associated | Hospital Environment Sample | Infusion and IV Equipment | Catheter |
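A rough sketch of the three written catheter rules follows. It sets only the fields those rules state explicitly and leaves the others as None; it does not attempt to reproduce every row of the example table. The function name and the placeholder values treated as "no disease" are assumptions.

```python
from typing import Dict, Optional

# Hypothetical placeholders meaning "no disease indicated".
NO_DISEASE = {"", "unknown", "missing"}

def classify_catheter_entry(host_disease: Optional[str]) -> Dict[str, Optional[str]]:
    """Apply the three catheter rules to an entry whose isolation source is a catheter."""
    disease = (host_disease or "").strip().lower()
    # In every case the sample itself is identified as Catheter.
    result: Dict[str, Optional[str]] = {"level2": None, "level3": None, "level4": "Catheter"}
    if disease in NO_DISEASE:
        # Rule 3: no disease, so the entry is treated as an environment sample.
        result["level2"] = "Hospital Environment Sample"
    elif "catheter" in disease:
        # Rule 1: the disease text itself ties the infection to the catheter.
        result["level3"] = "Catheter-associated Infections"
    # Rule 2 (a specific disease): level3 would come from the disease mapping,
    # which is outside the scope of this sketch.
    return result
```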

Constitution of Hospital Environmental Sample


Abdominal infections are generally categorized under Gastrointestinal Disorders unless otherwise specified.

  • Many abdominal-related infections likely stem from intra-abdominal issues, typically originating from the gastrointestinal (GI) tract. Cases clearly linked to a particular organ, such as the liver or pancreas, were categorized under Liver Disorders or Pancreatic Disorders. However, more general conditions like peritonitis or intra-abdominal infections, and cases where the origin was ambiguous, were categorized under Gastrointestinal Disorders.

For host = Animal-associated

Food-based samples are categorized based on the origin of the food — animal-derived vs. others.

  • Food-related isolates are classified under either Animal-associated or Environment-associated, depending on the source of the food.
  • If the food is meat or meat-derived (e.g., chicken, beef, pork, mutton), it is considered Animal-associated, as it directly originates from animals.
  • All other food types (e.g., vegetables, fruits, rice, spices, dairy products) are considered Environment-associated, since they are either plant-derived or processed items that may not directly reflect animal hosts.
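The meat versus non-meat split above can be sketched as a word-level check. The `MEAT_TERMS` set holds only the examples named in the rule plus the word "meat" and is an assumption; a real list would need to be longer.

```python
import re

# Meat examples named in the rule above, plus "meat" itself (hypothetical list).
MEAT_TERMS = {"meat", "chicken", "beef", "pork", "mutton"}

def classify_food(food_description: str) -> str:
    """Return the Level 1 class for a food-related isolation source."""
    # Compare whole words to avoid accidental substring matches.
    words = set(re.findall(r"[a-z]+", food_description.lower()))
    if words & MEAT_TERMS:
        return "Animal-associated"
    # Everything else, including dairy, is treated as Environment-associated.
    return "Environment-associated"
```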

If the sample is given only as mammary gland, it is categorized as a swab.

  • When the isolation source is given as mammary gland and no specific sample type is mentioned, it is assumed to be a swab, specifically in the context of bovine (cattle) samples. Of course, the actual sample could be a biopsy, secretion, or aspirate, but swab collection is far more common in bovine mastitis or udder infection cases.

Animal rectal swabs are categorized under feces-associated samples.

  • When the isolation source is rectal swab from non-human animals, it is considered feces-associated, even if the term "feces" is not directly mentioned.
  • This is because
    • rectal samples typically reflect fecal content, and
    • there is significant diversity in anatomy and how samples are collected for birds, reptiles, wild animals, and insects. They all have differing gastrointestinal structures and terminologies. Instead of creating many specific categories for each animal type and anatomical nuance, these samples are uniformly classified under feces-associated.

Placenta and umbilical cord samples from animals are categorized under tissues.

  • When the isolation source is "placenta" or "umbilical cord" and the host is an animal, the sample is categorized under the general class of tissues. (We are in the process of adding them!)
  • However, in Human-associated cases, "placenta" and "umbilical cord" are given their own distinct category, as they are often studied independently in clinical and neonatal microbiology, and thus merit finer granularity.

Deer samples are categorized as Livestock.

  • While it is understood that deer can be either wild or farm-bred, the current normalization does not distinguish between wild and domesticated deer.

All swab samples are categorized simply as Swab in animals.

  • For animal-associated data, all sample types containing "swab" (e.g., rectal swab, nasal swab, skin swab, etc.) are categorized under a single type: Swab.
  • This approach is adopted to simplify the classification due to the high diversity of animal species and anatomical variations across taxa (livestock, wild animals, birds, insects, etc.).

For host = Environment-associated

Estuarine and Mangrove Soils Are Categorized as Coastal Soils.

  • Soil types labeled as "estuarine soil" or "mangrove soil" are grouped under "Coastal Soil" in environmental context. This is due to their proximity to coastal ecosystems and shared saline/water-logged characteristics.

Soil samples associated with Fruits or Vegetables are categorized as Agricultural Soil.

  • Soil samples linked to fruits or vegetables (whether from farming, cultivation, or harvest environments) are categorized under "Agricultural Soil".
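The two soil rules in this section could be combined into a small ordered rule list, sketched below. The keyword tuples are assumptions: real entries usually name a specific crop rather than the word "fruit" or "vegetable", so a fuller keyword list would be needed in practice.

```python
from typing import Optional

# (keywords, category) pairs checked in order; keywords are hypothetical.
SOIL_RULES = (
    (("estuarine", "mangrove"), "Coastal Soil"),
    (("fruit", "vegetable"), "Agricultural Soil"),
)

def categorize_soil(description: str) -> Optional[str]:
    """Return a soil category for a soil description, or None if no rule applies."""
    text = description.lower()
    if "soil" not in text:
        return None  # these rules only cover soil samples
    for keywords, category in SOIL_RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return None
```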