Home - prekijpatel/MetaMiner GitHub Wiki
The Problem
As the volume of publicly available genomes continues to grow, researchers increasingly rely on large-scale comparative studies to address important biological questions. Whether the goal is to investigate the spread of antimicrobial resistance, trace host-specific adaptations, or construct high-quality pan-genomes, working with hundreds or even thousands of genomes has become standard practice.
However, a major challenge often arises at the very first step: identifying and retrieving the relevant genomes for a study. This is largely because many metadata fields are free-form text, so the same concept can be recorded in many different ways.
For example, imagine you're looking for E. coli genomes from stool samples. The metadata entries you encounter might include “Fecal sample,” “feces,” “stool,” “gut,” “faecal matter,” “intestinal content,” or “patient feces.” These terms may all refer to the same source, yet they won’t necessarily appear together in a search.
And that’s just one example.
Take a look at what is typically encountered:
Entry | Isolation source | L50 | Host | Geo_loc | Host disease | Sequencing technology | Coverage |
---|---|---|---|---|---|---|---|
A | stool sample | 96 | Human | USA:CA | NA | Illumina NovaSeq | 50x |
B | fecal material | 87 | Homo sapiens | United States: Hospital | diarrhea | HiSeq2000 | Not provided |
C | gut | 2 | NA | Delhi, India | NA | Oxford Nanopore MinION | 18x |
D | GIT content (mouse) | 1 | Mus musculus | GER-Bavaria | NA | PacBio RSII | 100x |
E | human_stool | 115 | Homo Sapien | India: Maharashtra | fever (reported) | Illumina MiSeq | 0x |
F | Sample from intestine | 1 | h. sapiens | New York, USA | gastroenteritis | Illumina + Nanopore (hybrid) | 200x |
G | feces | 93 | Homo sapiens | unknown | suspected cholera | Not stated | 75x |
H | faecal (unsure source) | 100 | human | UK (rural village) | NA | Illumina NextSeq | 500000x |
I | fecal swab | 42 | Human | China-Taiwan | none | BGI | 60x |
J | lower gut isolate | 105 | Homo sapiens | São Paulo, Brazil | enteritis | Ion Torrent | 30x |
Even within this small subset, inconsistencies are evident: synonyms and spelling variants ("stool" vs "feces" vs "faecal"), inconsistent or malformed host names ("Human," "h. sapiens," "Homo Sapien"), ambiguous source descriptions ("gut," "Sample from intestine"), and missing or implausible values (coverage of "Not provided," "0x," or "500000x"). Mapping these to proper ontologies is difficult even at this scale, and in larger, more complex datasets the issues compound rapidly. And when you're just starting out, this messiness can ruin the momentum.
The consequences of poor metadata are not merely procedural — they can compromise the integrity of downstream analyses.
For instance, a study aiming to include only human-associated E. coli strains may inadvertently include mouse or environmental isolates due to vague metadata. Conversely, a relevant genome might be excluded because the host field says "Homo_Sapien" instead of "Homo sapiens," or the isolation source is recorded as “Lung Fluid” rather than “Pleural Fluid.” These small inconsistencies can accumulate, leading to biased datasets, misleading trends, or incorrect conclusions.
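To make the exclusion problem concrete, here is a minimal Python sketch using the Host column from the table above. The `HUMAN_VARIANTS` set and the `canonical_host` helper are made up for this example; they are an assumption about how such variants could be handled, not MetaMiner's actual code or rules.

```python
import re

# Host values for entries A-J, copied from the table above.
hosts = {
    "A": "Human", "B": "Homo sapiens", "C": "NA", "D": "Mus musculus",
    "E": "Homo Sapien", "F": "h. sapiens", "G": "Homo sapiens",
    "H": "human", "I": "Human", "J": "Homo sapiens",
}

# Hypothetical, hand-assembled list of spellings that mean "human".
HUMAN_VARIANTS = {"human", "homo sapiens", "homo sapien", "h sapiens"}


def canonical_host(raw):
    """Lower-case, replace punctuation/underscores with spaces, then look up."""
    key = re.sub(r"[^a-z]+", " ", raw.lower()).strip()
    return "Homo sapiens" if key in HUMAN_VARIANTS else raw


# Naive filter: only rows whose host is spelled exactly "Homo sapiens".
exact = sorted(e for e, h in hosts.items() if h == "Homo sapiens")

# Same filter after normalizing each host value first.
normalized = sorted(
    e for e, h in hosts.items() if canonical_host(h) == "Homo sapiens"
)

print("exact match:     ", exact)
print("normalized match:", normalized)
```

With these ten entries, the exact match keeps only B, G, and J, while the normalized match also recovers A, E, F, H, and I. Entry C ("NA") stays out either way, which is a reminder that normalization cannot fill in information that was never recorded.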
To avoid these problems, the best (and often only😅) solution is manual curation — painstakingly reviewing records one by one. However, this can quickly become an unexpectedly laborious and frustrating process given the number of genomes.
Yes, there are ongoing efforts to improve metadata submission for newer genomes. But the older entries — which make up a significant part of the data — remain messy, unstructured, and difficult to work with.
Adding to the challenge, the metadata often can’t be filtered on several parameters at once in a straightforward way.
Let’s say you’re looking for E. coli genomes isolated from fecal samples with an L50 score below 100. You might find 500 fecal isolates, only to discover that just 25 meet your assembly criteria. You’ll need to adjust your filters repeatedly — loosening or tightening thresholds — to arrive at a usable subset. Doing this across multiple parameters, using plain-text metadata files, quickly becomes tedious and error-prone.
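The loosen-and-tighten loop described above can be sketched in a few lines of Python. The records reuse the isolation source and L50 columns from the table earlier on this page; the keyword list and the `is_fecal` and `select` helpers are hypothetical and only meant to show the shape of the problem.

```python
# Isolation source and L50 for entries A-J, copied from the table above.
records = [
    {"entry": "A", "source": "stool sample", "l50": 96},
    {"entry": "B", "source": "fecal material", "l50": 87},
    {"entry": "C", "source": "gut", "l50": 2},
    {"entry": "D", "source": "GIT content (mouse)", "l50": 1},
    {"entry": "E", "source": "human_stool", "l50": 115},
    {"entry": "F", "source": "Sample from intestine", "l50": 1},
    {"entry": "G", "source": "feces", "l50": 93},
    {"entry": "H", "source": "faecal (unsure source)", "l50": 100},
    {"entry": "I", "source": "fecal swab", "l50": 42},
    {"entry": "J", "source": "lower gut isolate", "l50": 105},
]

# Hypothetical keyword list; a real one would need far more variants.
FECAL_KEYWORDS = ("stool", "fecal", "faecal", "feces")


def is_fecal(source):
    """True if any fecal keyword appears in the lower-cased source text."""
    text = source.lower()
    return any(keyword in text for keyword in FECAL_KEYWORDS)


def select(records, max_l50):
    """Entries that look fecal AND have an L50 strictly below max_l50."""
    return [
        r["entry"] for r in records
        if is_fecal(r["source"]) and r["l50"] < max_l50
    ]


# Re-run the same two-parameter filter while adjusting the threshold.
for threshold in (90, 100, 120):
    print(threshold, select(records, threshold))
```

Here the cutoff of 100 keeps four of the six fecal-looking entries (A, B, G, I), and tightening it to 90 leaves only B and I. Every extra parameter (host, coverage, location) multiplies the number of such re-runs, which is where doing this by hand gets tedious.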
Altogether, these issues make selecting the required genome subset from the entire pool an uphill battle.
Solution MetaMiner offers
MetaMiner is built to take the edge off this initial chaos 🧹.
It helps you retrieve genome metadata in bulk from public databases and organizes it in a way that’s easier to work with 🧾. Instead of scrolling through inconsistent entries, you get a cleaned-up, normalized version alongside the original raw values, so nothing is lost, just made more usable.
It simply (okay, not that simply 😅) brings scattered, inconsistent metadata into a more unified form. For example:
- Isolation sources like “blood sample,” “human blood,” and “blood isolate” all map to a single category, "Blood" 🩸
- Geographic locations are structured into country and state fields 🌍
- Sequencing platforms and other technical metadata are similarly organized 🧬
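As a rough illustration of the first two mappings, here is a small Python sketch. It is not MetaMiner's implementation: the rule table, the field names (`isolation_source`, `geo_loc`, `isolation_source_norm`, `country`, `state`), and the assumption that locations arrive in the colon-separated form seen in the table ("USA:CA", "India: Maharashtra") are all ours, chosen for the example.

```python
# Hypothetical rule table: (category, keywords that trigger it).
SOURCE_RULES = [
    ("Blood", ("blood",)),
]


def categorize_source(raw):
    """Return the first category whose keyword occurs in the raw text."""
    text = raw.lower()
    for category, keywords in SOURCE_RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "Unclassified"  # leave anything unmatched for manual review


def split_geo(raw):
    """Split 'country: state'; anything without a colon is left unparsed."""
    if ":" not in raw:
        return {"country": None, "state": None}
    country, state = (part.strip() for part in raw.split(":", 1))
    return {"country": country or None, "state": state or None}


def normalize(record):
    """Add normalized fields next to the raw ones instead of overwriting."""
    out = dict(record)  # raw values are copied through unchanged
    out["isolation_source_norm"] = categorize_source(record["isolation_source"])
    out.update(split_geo(record["geo_loc"]))
    return out


example = normalize(
    {"isolation_source": "human blood", "geo_loc": "India: Maharashtra"}
)
print(example)
```

The raw `isolation_source` and `geo_loc` values stay in the output next to the new fields, so a questionable mapping can always be traced back to what the submitter actually wrote. Inputs such as "Delhi, India" fall through `split_geo` unparsed in this sketch; handling them would need extra rules.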
MetaMiner also features an intuitive, Dash-powered interactive dashboard called meta-mined!. This interface enables you to explore, filter, and analyze genome metadata across multiple parameters — all in one place. Instead of relying on spreadsheets or custom scripts, you get a streamlined, visual overview of the entire dataset which may help you make decisions faster with more clarity🎯.
Important caveats
While MetaMiner does a lot to simplify metadata exploration, it’s important to keep in mind a few limitations.
Genome metadata, especially from public repositories, is wildly inconsistent. Since many fields are filled in as free-form text, the same concept can be written in dozens of different ways. We've done our best to normalize and group similar entries, using educated guesses and patterns drawn from noisy inputs. This is helpful but definitely not foolproof. Currently, MetaMiner uses a combination of rule-based logic and manual curation (and soon, we might even bring in LLMs 😎). This means certain assumptions have been made, especially when categorizing tricky fields like isolation source, in order to bring structure to the chaos. Please take a moment to review these assumptions before diving into detailed filtering.
Also, while the normalization is designed to make filtering easier, it does not replace domain expertise. So, please cross-check your results. We recommend taking a bird's-eye view of the filtered datasets before drawing conclusions or proceeding with downstream analysis. This tool is meant to give you a strong head start — to reduce manual wrangling and speed up the initial exploration process.
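One way to take that bird's-eye view is to list which raw values ended up under each normalized category. The Python sketch below assumes you have exported (raw value, category) pairs in some form; the pairs shown, including the "Feces" category, are invented for illustration and do not come from MetaMiner's output.

```python
from collections import Counter, defaultdict

# Hypothetical (raw value, assigned category) pairs.
assignments = [
    ("stool sample", "Feces"),
    ("feces", "Feces"),
    ("feces", "Feces"),
    ("GIT content (mouse)", "Feces"),
    ("human blood", "Blood"),
    ("blood isolate", "Blood"),
]


def summarize(assignments):
    """Group raw values under their assigned category, with counts."""
    groups = defaultdict(Counter)
    for raw, category in assignments:
        groups[category][raw] += 1
    # most_common() lists the most frequent raw values first
    return {cat: counts.most_common() for cat, counts in groups.items()}


summary = summarize(assignments)
for category, raw_values in summary.items():
    print(category, raw_values)
```

Scanning a summary like this makes questionable groupings stand out quickly. In the made-up data above, "GIT content (mouse)" sits under "Feces", and whether that is acceptable depends on your study.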
Collaborations 🤝
MetaMiner was born out of a real need. We're not software engineers by training—we're microbiologists trying to bridge a gap we've personally felt. MetaMiner reflects that spirit: practical, imperfect, and built from the perspective of people who’ve wrestled with messy metadata one too many times.
That said, there’s still a lot MetaMiner could do. From smoother visuals to a snappier interface and more efficient pipelines — we have got a wish list 😅. But honestly, some of the most game-changing ideas often come from fresh eyes and different perspectives — maybe yours. So if you're a bioinformatician, a fellow microbiologist, a developer, or someone in between, we’d love to hear from you. Whether it’s feedback, feature ideas, bug reports, or collaboration proposals — we are all ears.