Assumptions for Categorizing Isolation Source - prekijpatel/MetaMiner GitHub Wiki
Normalizing the isolation source data was quite difficult because the entries were very inconsistent. To achieve meaningful standardization, we had to make several assumptions and define rules based on recurring patterns observed in the dataset. While some of these assumptions might seem complex when explained, they allowed us to simplify and categorize highly heterogeneous data into formats suitable for downstream analysis.
Below, we present the key assumptions and strategies used in our normalization pipeline, along with examples where relevant.
If you know of better approaches to handle any of the scenarios we've tackled—or others we might have overlooked—we'd love to hear from you. Your suggestions can help us improve the process and make it more robust and useful for everyone.🙌
For `host` = Hospital- or Animal-associated
Fluids with unclear or vague descriptions are categorized under `Uncategorized Fluids`
- For sample descriptions containing terms like "fluid," "secretions," "aspirate," or "bodily fluid," where the exact context or origin was unclear or difficult to determine, these entries were categorized under `Uncategorized Fluids`.
- When the context of the fluid type couldn't be easily discerned, grouping them under `Uncategorized Fluids` helped avoid misclassification and ensured consistency in the dataset.
Note: Some entries that may refer to unusual or less-defined sample sites, which do not yet belong to an established category, are also placed under `Uncategorized Fluids`. These may require further refinement in the future as additional context is gathered.
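The keyword rule above could be sketched roughly as follows. This is only an illustration: the function name `categorize_fluid` and the `SPECIFIC_FLUIDS` lookup are hypothetical and are not taken from the MetaMiner code.

```python
# Vague fluid terms quoted in the rule above ("bodily fluid" is covered by "fluid").
VAGUE_FLUID_TERMS = ("fluid", "secretions", "aspirate")

# Hypothetical examples of fluids whose origin is clear; the real list may differ.
SPECIFIC_FLUIDS = {
    "synovial fluid": "Synovial Fluid",
    "pleural fluid": "Pleural Fluid",
    "peritoneal fluid": "Peritoneal Fluid",
}


def categorize_fluid(isolation_source):
    """Return a fluid category, or None when the entry does not mention a fluid."""
    text = isolation_source.strip().lower()
    # A clearly named fluid keeps its own category.
    for name, category in SPECIFIC_FLUIDS.items():
        if name in text:
            return category
    # Anything else that only uses a vague fluid term falls into the catch-all.
    if any(term in text for term in VAGUE_FLUID_TERMS):
        return "Uncategorized Fluids"
    return None
```

Checking specific fluids first keeps the catch-all from swallowing entries that already have a clear category.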
Samples without a specified host are categorized as `host` = `Unknown`
- If a sample is provided without a clear indication of its host (e.g., "Urine" without specifying whether it's from a human or an animal), it is categorized as `Unknown`.
- This approach helps prevent misclassification, as the exact origin of the sample is unclear.
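A minimal sketch of this fallback, assuming that missing hosts show up either as empty values or as placeholder strings. The `MISSING_HOST_VALUES` set and the function name `resolve_host` are assumptions made for illustration.

```python
# Placeholder strings treated as "no host given"; this set is an assumption.
MISSING_HOST_VALUES = {"", "missing", "unknown", "not provided", "not applicable"}


def resolve_host(host):
    """Return "Unknown" when no usable host is given, else the cleaned host string."""
    if host is None:
        return "Unknown"
    cleaned = host.strip()
    if cleaned.lower() in MISSING_HOST_VALUES:
        return "Unknown"
    return cleaned
```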
If only a `joint` is mentioned without an isolation source, the sample is categorized as `Synovial Fluid`.
- If the sample description includes only a joint without specifying the isolation source, it is assumed to be `Synovial Fluid`. This assumption is made because `Synovial fluid` is commonly associated with joint-related infections or inflammation. For instance,
Host | Host Disease | Isolation Source | Level 1 | Level 2 | Level 3 | Level 4 |
---|---|---|---|---|---|---|
Homo sapiens | Missing | Knee Fluid | Hospital-associated | Human Clinical Sample | Bone Infection/Inflammation | Synovial Fluid |
- In the other scenario, if the sample description includes a specific term like "swab," it is interpreted as being from a soft tissue or external surface, unless context suggests otherwise.
Host | Host Disease | Isolation Source | Level 1 | Level 2 | Level 3 | Level 4 |
---|---|---|---|---|---|---|
Homo sapiens | Missing | Knee Swab | Hospital-associated | Human Clinical Sample | Soft Tissue Infections/Colonization | Skin Swab |
- This is the only case where host disease is often missing, making assumptions necessary. However, in most other cases, the `host_disease` provides helpful context for deciding the sample category.
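The two joint scenarios above can be sketched as a small lookup. The list of joint names and the function name `categorize_joint_sample` are assumptions; the returned pairs are the Level 3 and Level 4 values from the two example tables.

```python
import re

# Joint names to look for; this list is an assumption, not taken from MetaMiner.
JOINT_TERMS = {"knee", "hip", "shoulder", "elbow", "ankle", "wrist", "joint"}


def categorize_joint_sample(isolation_source):
    """Return (level 3, level 4) for a joint description, or None if no joint is named."""
    # Compare whole words so that, e.g., "hip" does not match inside "chip".
    words = set(re.findall(r"[a-z]+", isolation_source.lower()))
    if not words & JOINT_TERMS:
        return None
    # A swab is read as a soft-tissue or surface sample.
    if "swab" in words:
        return ("Soft Tissue Infections/Colonization", "Skin Swab")
    # Otherwise the joint alone is assumed to mean synovial fluid.
    return ("Bone Infection/Inflammation", "Synovial Fluid")
```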
Samples from abscesses, boils, or furuncles are categorized as `Pus`.
- Isolation sources described as abscess, abscess fluid, abscess swab, boil, furuncle, or any variant thereof are categorized as `Pus`. This is because all these terms indicate a localized collection of purulent material (pus).
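One way to express this rule is a single regular expression over the isolation source. This is a sketch under the assumption that simple word matching is enough; the pattern and the name `is_pus` are illustrative only.

```python
import re

# Matches the terms named in the rule above, including simple plurals.
PUS_PATTERN = re.compile(r"\b(abscess(es)?|boils?|furuncles?)\b", re.IGNORECASE)


def is_pus(isolation_source):
    """Return True when the description points to a collection of pus."""
    return bool(PUS_PATTERN.search(isolation_source))
```

The word boundaries keep unrelated words such as "boiled" from matching.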
`Conjunctivitis` implies `Eye Swab` if the sample is unspecified.
- For both `human and animal-associated` entries, when the host disease is listed as "conjunctivitis" but the specific isolation source (sample) is not provided, it is assumed to be an `Eye Swab`. This is based on the common diagnostic practice for conjunctivitis.
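As a rough sketch, the inference could look like the function below. The placeholder values in `MISSING_VALUES` and the name `infer_isolation_source` are assumptions.

```python
# Placeholder strings treated as "no sample given"; this set is an assumption.
MISSING_VALUES = {"", "missing", "unknown", "not provided"}


def infer_isolation_source(host_disease, isolation_source):
    """Return the given sample, "Eye Swab" for conjunctivitis without a sample, else None."""
    source = (isolation_source or "").strip()
    # A sample that is actually provided is never overridden.
    if source.lower() not in MISSING_VALUES:
        return source
    if "conjunctivitis" in (host_disease or "").lower():
        return "Eye Swab"
    return None
```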
For `host` = Hospital-associated
As far as possible, multi-site/systemic pathogens (like Salmonella) are categorized by the site of sample collection, and not by the general disease name.
- When the pathogen is known to cause systemic or multi-organ infections, such as Salmonella, the disease classification is determined by the specific site from which the sample was collected — not the overall pathogen profile. This prevents overgeneralization and anchors the disease context in the biological relevance of the isolation site, enhancing accuracy for downstream analysis.
- Here's a list of identified diseases based on sample:
Identified Sample(s) | Disease Classification |
---|---|
Blood, Plasma, Serum | Bacteremia |
Urine, Urethral Swab, Bladder | UTIs |
Sputum, Pleural Fluid, BAL, Tracheal, Nasal, Throat, Laryngeal | Respiratory Infections/Illnesses |
Pus, Wound Swab, Skin Swab, Skin, Axillary Swab, Inguinal Swab, etc. | Soft Tissue Infections/Colonization |
Bile | Gallbladder and Biliary Tract Disorders |
Stool, Peritoneal Fluid, Rectal Swab, Intestinal, Gastric, Abdominal | Gastrointestinal Disorders |
CSF | CNS Infections |
Bone, Synovial Fluid, Cartilage | Bone Infection/Inflammation |
Eye Swab, Tears, Vitreous Humor | Eye Infection |
Ear Swab | Ear Infection |
Oral Swab, Saliva, Root Canal, Tooth | Oral Infection/Colonization |
Cervical/Vaginal Swab, IUDs, Genital Swab, Penile Swab, etc. | Reproductive System Infections/Illnesses |
Bursa Sample | Bursitis |
Lymph Node, Lymph | Lymph Node Infections |
Pericardial | Cardiac Disorders |
Implant | Implant-associated Infections |
Note: This applies only when `host_disease` is not clearly defined and is given in general terms, such as "Salmonella infection".
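
The table above is essentially a lookup from sample to disease classification. A minimal sketch using a subset of its rows follows; the dictionary name, the function name, and the case-insensitive matching are assumptions.

```python
# A subset of the sample-to-disease table above, keyed by lowercase sample name.
SAMPLE_TO_DISEASE = {
    "blood": "Bacteremia",
    "plasma": "Bacteremia",
    "serum": "Bacteremia",
    "urine": "UTIs",
    "urethral swab": "UTIs",
    "sputum": "Respiratory Infections/Illnesses",
    "stool": "Gastrointestinal Disorders",
    "rectal swab": "Gastrointestinal Disorders",
    "csf": "CNS Infections",
    "synovial fluid": "Bone Infection/Inflammation",
    "eye swab": "Eye Infection",
    "ear swab": "Ear Infection",
}


def disease_from_sample(sample):
    """Return the disease classification for a normalized sample, or None if unlisted."""
    return SAMPLE_TO_DISEASE.get(sample.strip().lower())
```

For example, `disease_from_sample("Blood")` returns `"Bacteremia"`, which would be used when the host disease is only given as something general like "Salmonella infection".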
Handling `Multiple Diseases` in humans based on sample indication
- When multiple diseases are listed for a human host, and the sample indicates only one of the diseases, the disease not indicated by the sample is still recorded under the host_disease field. This ensures that relevant information regarding the potential disease context is retained.
For example, in a case with "Bacteremia and colorectal cancer" where the sample is blood, bacteremia is already implied by the sample type (Blood), so colorectal cancer is recorded in the host_disease field to retain more information about the possible disease context.
Host | Host Disease | Isolation Source | Level 1 | Level 2 | Level 3 | Level 4 |
---|---|---|---|---|---|---|
Homo sapiens | Bacteremia, Colorectal Cancer | Blood | Hospital-associated | Human Clinical Sample | Solid Tumors | Blood |
Homo sapiens | VAP, Cystitis | Urine | Hospital-associated | Human Clinical Sample | Respiratory Infections/Illnesses | Urine |
Although categorizing the host based on either disease would not have been wrong, the idea is to accommodate as much context as possible. However, if more than two diseases are available in the raw `Host Disease` data, there is nothing we can do about it (so far)!
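The two-disease rule can be sketched as follows. The `DISEASE_TO_SAMPLE` lookup is hypothetical (in particular, mapping VAP to Sputum is an assumption), as is the function name; the two example rows from the table above are used as checks.

```python
import re

# Hypothetical lookup of which normalized sample each disease points to.
DISEASE_TO_SAMPLE = {
    "bacteremia": "Blood",
    "cystitis": "Urine",
    "vap": "Sputum",
}


def disease_to_retain(host_disease, sample):
    """For exactly two diseases, return the one NOT implied by the sample, else None."""
    diseases = [d.strip() for d in re.split(r",|\band\b", host_disease) if d.strip()]
    if len(diseases) != 2:
        return None  # more than two diseases are not resolved by this rule
    first, second = diseases
    first_implied = DISEASE_TO_SAMPLE.get(first.lower()) == sample
    second_implied = DISEASE_TO_SAMPLE.get(second.lower()) == sample
    # The rule only applies when the sample points to exactly one of the two.
    if first_implied == second_implied:
        return None
    return second if first_implied else first
```

With these assumptions, `disease_to_retain("VAP, Cystitis", "Urine")` returns `"VAP"`, matching the second example row, where Level 3 is a respiratory category.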
Hemolytic Uremic Syndrome (HUS) is categorized under `Hematologic Disorders` and not as `STEC Infection`
HUS is typically a complication of "STEC/EHEC Infections", which are gastrointestinal diseases. However, to avoid confusion with pneumococcal HUS (which is associated with pneumonia and other respiratory issues), HUS is classified under Hematologic Disorders. That said, if you are browsing the metadata of Escherichia coli and see HUS, it can most likely be inferred to be an STEC infection.
`HIV` and `AIDS` are not categorized together.
- HIV is categorized separately, while AIDS (Acquired Immunodeficiency Syndrome) is categorized under `Immunodeficiency Disorders`. This accounts for rare records in which an AIDS-like immunodeficiency is reported without a confirmed HIV infection.
Complications with Catheters
- Catheter-related entries are handled in the following ways:
  - Catheter-associated infections: When it is clearly indicated that the infection is related to the catheter, e.g., catheter-associated bloodstream infection, the host's disease is identified as `Catheter-associated Infections`.
  - When a catheter is given as the `isolation source` along with a specific `host disease`, the sample is identified as `Catheter`.
  - When a catheter is given as the `isolation source` but no disease is indicated, the entry moves from `Human Clinical Sample` to `Hospital Environment Sample`.
host | host_disease | isolation_source | level1 | level2 | level3 | level4 |
---|---|---|---|---|---|---|
Homo sapiens | Klebsiella pneumoniae BSI | venous catheter | Hospital-associated | Human Clinical Sample | Bacteremia | Catheter |
Homo sapiens | Venous catheter associated infection | venous catheter | Hospital-associated | Hospital Environment Sample | Catheter-associated Infections | Catheter |
Homo sapiens | Pyelonephritis, Kidney Failure, Chronic cystitis | Catheter(Urine) | Hospital-associated | Hospital Environment Sample | Kidney Disorders | Catheter |
Homo sapiens | Unknown | venous catheter | Hospital-associated | Hospital Environment Sample | Infusion and IV Equipment | Catheter |
Homo sapiens | Unknown | venous catheter tip | Hospital-associated | Hospital Environment Sample | Infusion and IV Equipment | Catheter |
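
The three bulleted catheter rules can be sketched as a function that returns only the levels each rule decides. The key names follow the table header (`level2`, `level3`, `level4`); the function name and the `NO_DISEASE` placeholder set are assumptions, and the sketch deliberately leaves every other level untouched.

```python
# Placeholder strings treated as "no disease given"; this set is an assumption.
NO_DISEASE = {"", "missing", "unknown"}


def apply_catheter_rules(host_disease, isolation_source):
    """Return only the level overrides that the catheter rules imply."""
    disease = (host_disease or "").strip().lower()
    source = (isolation_source or "").strip().lower()
    overrides = {}
    # Rule 1: the disease itself names the catheter as the cause.
    if "catheter" in disease:
        overrides["level3"] = "Catheter-associated Infections"
    if "catheter" in source:
        # Rule 2: a catheter given as isolation source is the sample type.
        overrides["level4"] = "Catheter"
        # Rule 3: with no disease at all, the entry becomes an environment sample.
        if disease in NO_DISEASE:
            overrides["level2"] = "Hospital Environment Sample"
    return overrides
```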
Constitution of Hospital Environmental Sample
Abdominal infections are generally categorized under `Gastrointestinal Disorders` unless otherwise specified.
- Many abdominal infections likely stem from intra-abdominal issues, typically originating from the gastrointestinal (GI) tract. Cases clearly linked to a particular organ, such as the liver or pancreas, were categorized under `Liver Disorders` or `Pancreatic Disorders`. However, more general conditions like peritonitis or intra-abdominal infections, or cases where the origin was ambiguous, were categorized under Gastrointestinal Disorders.
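A minimal sketch of this fallback, assuming the input is already known to be an abdominal condition. The keyword checks (including "hepatic") and the function name are assumptions.

```python
def abdominal_category(host_disease):
    """Pick a category for an abdominal condition; GI is the default."""
    text = host_disease.lower()
    # Organ-specific conditions get their own category.
    if "liver" in text or "hepatic" in text:
        return "Liver Disorders"
    if "pancrea" in text:  # matches "pancreas", "pancreatic", "pancreatitis"
        return "Pancreatic Disorders"
    # Peritonitis, intra-abdominal infections, and ambiguous cases fall back to GI.
    return "Gastrointestinal Disorders"
```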
For `host` = Animal-associated
Food-based samples are categorized based on the origin of the food: animal-derived vs. others.
- Food-related isolates are classified under either Animal-associated or Environment-associated, depending on the source of the food.
- If the food is meat or meat-derived (e.g., chicken, beef, pork, mutton), it is considered Animal-associated, as it directly originates from animals.
- All other food types (e.g., vegetables, fruits, rice, spices, dairy products) are considered Environment-associated, since they are either plant-derived or processed items that may not directly reflect animal hosts.
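The meat-versus-everything-else split could be sketched like this. The term list only contains the meats named above plus the generic word "meat"; the names `MEAT_TERMS` and `food_category` are hypothetical.

```python
import re

# Meat terms named in the rule above, plus the generic word "meat".
MEAT_TERMS = {"chicken", "beef", "pork", "mutton", "meat"}


def food_category(food_description):
    """Return the Level 1 category for a food-related isolation source."""
    words = set(re.findall(r"[a-z]+", food_description.lower()))
    if words & MEAT_TERMS:
        return "Animal-associated"
    # Vegetables, fruits, rice, spices, dairy products, and so on.
    return "Environment-associated"
```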
If the sample is just `mammary gland`, it is categorized as a swab.
- When the isolation source is given as mammary gland and no specific sample type is mentioned, it is assumed to be a swab, specifically in the context of bovine (cattle) samples. Of course, the actual sample could be a biopsy, secretion, or aspirate, but swab collection is far more common in bovine mastitis or udder infection cases.
Animal `Rectal swabs` are categorized under `feces`-associated samples.
- When the isolation source is a rectal swab from a non-human animal, it is considered feces-associated, even if the term "feces" is not directly mentioned.
- This is because
  - rectal samples typically reflect fecal content, and
  - anatomy and sample collection differ widely across birds, reptiles, wild animals, and insects, which all have differing gastrointestinal structures and terminologies. Instead of creating many specific categories for each animal type and anatomical nuance, these samples are uniformly classified as `feces`-associated.
`Placenta` and `umbilical cord` samples from animals are categorized under `tissues`.
- When the isolation source is "placenta" or "umbilical cord" and the host is an animal, the sample is categorized under the general class of `tissues`. (We are in the process of adding them!)
- However, in `Human-associated` cases, "placenta" and "umbilical cord" are given their own distinct category, as they are often studied independently in clinical and neonatal microbiology, and thus merit finer granularity.
Deer samples are categorized as `Livestock`.
- While it is understood that deer can be either wild or farm-bred, the current normalization does not distinguish between wild and domesticated deer.
All swab samples are categorized simply as `Swab` in animals.
- For `animal-associated` data, all sample types containing "swab" (e.g., rectal swab, nasal swab, skin swab, etc.) are categorized under a single type: `Swab`.
- This approach is adopted to simplify the classification due to the high diversity of animal species and anatomical variations across taxa (livestock, wild animals, birds, insects, etc.).
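A small sketch of the animal swab rule, combined with the mammary gland assumption described earlier on this page. The function name is hypothetical, and other animal sample types are left to other rules.

```python
def animal_sample_type(isolation_source):
    """Return "Swab" for animal swab entries and bare "mammary gland", else None."""
    text = isolation_source.strip().lower()
    # Any kind of swab collapses into a single type for animals.
    if "swab" in text:
        return "Swab"
    # A bare "mammary gland" entry is assumed to be a swab.
    if text == "mammary gland":
        return "Swab"
    return None
```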
For `host` = Environment-associated
`Estuarine` and `Mangrove Soils` are categorized as `Coastal Soils`.
- Soil types labeled as "estuarine soil" or "mangrove soil" are grouped under "Coastal Soil" in the environmental context. This is due to their proximity to coastal ecosystems and shared saline/water-logged characteristics.
Soil samples associated with `Fruits` or `Vegetables` are categorized as `Agricultural Soil`.
- Soil samples linked to fruits or vegetables (whether from farming, cultivation, or harvest environments) are categorized under "Agricultural Soil".
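The two soil rules in this section could be sketched together as below. The crop word list is an assumption and far from complete, and the function name `categorize_soil` is hypothetical; soils that match neither rule are left undecided.

```python
# Example crop words; this list is an assumption and far from complete.
CROP_TERMS = ("fruit", "vegetable", "tomato", "apple", "orchard")


def categorize_soil(isolation_source):
    """Return "Coastal Soil" or "Agricultural Soil", or None when neither rule applies."""
    text = isolation_source.lower()
    if "soil" not in text:
        return None
    # Estuarine and mangrove soils share saline, water-logged conditions.
    if "estuarine" in text or "mangrove" in text:
        return "Coastal Soil"
    # Soil tied to fruit or vegetable cultivation.
    if any(term in text for term in CROP_TERMS):
        return "Agricultural Soil"
    return None
```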