Sequencing Technology - prekijpatel/MetaMiner GitHub Wiki
In raw metadata, the sequencing technology is often recorded inconsistently: as informal names, abbreviations, or tool-specific identifiers.
For example, "illumina myseq", "miseq", "short-read solexa", and "miSeq" all refer to the Illumina platform. Similarly, "nanopore", "minion", or "oxford" might all refer to Oxford Nanopore Technologies.
Unless these entries are mapped to a unified terminology, grouping or filtering the data by sequencing platform is laborious.
How does MetaMiner do it?
MetaMiner uses a keyword-based matching strategy to normalize sequencing platform entries:
- Keyword Lists: For each major sequencing platform, a curated list of possible keywords or variations is maintained. These lists include common abbreviations, misspellings, and alternate names found in public datasets.
- Iterative Search: Each metadata entry is scanned for these keywords. If a keyword is found, the entry is mapped to a standardized name (e.g., "Illumina" or "Oxford Nanopore").
- Multiple Technologies: If a metadata entry indicates multiple sequencing technologies (e.g., hybrid assemblies), MetaMiner combines the matching platforms using " and " to reflect the hybrid nature (e.g., "Illumina and Oxford Nanopore").
- Fallback Handling: If no known keyword is matched, or no sequencing technology was defined in the raw metadata, the entry is labeled "Unknown".
This process ensures that all downstream analysis and visualizations rely on a consistent and clean representation of sequencing technologies.
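The steps above can be sketched in a few lines of Python. This is a minimal illustration, not MetaMiner's actual code: the `PLATFORM_KEYWORDS` table, its short keyword lists, and the `normalize_platform` function name are all assumptions made for this example. It also assumes case-insensitive substring matching, and that combined platforms are listed in the fixed order of the keyword table.

```python
# Hypothetical keyword table for illustration; the curated lists MetaMiner
# maintains are not reproduced here.
PLATFORM_KEYWORDS = {
    "BGI": ["bgi"],
    "Illumina": ["illumina", "miseq", "myseq", "hiseq", "solexa"],
    "Oxford Nanopore": ["nanopore", "minion", "gridion", "oxford"],
    "PacBio": ["pacbio", "pacific biosciences"],
}


def normalize_platform(raw):
    """Map a raw sequencing-technology entry to a standardized name."""
    # Fallback: missing or empty entries are labeled "Unknown".
    if raw is None or not str(raw).strip():
        return "Unknown"

    text = str(raw).lower()  # assumption: matching is case-insensitive

    # Scan the entry for every platform's keywords (substring match).
    matched = [
        platform
        for platform, keywords in PLATFORM_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]

    # Several platforms matched: join them with " and " (hybrid data).
    return " and ".join(matched) if matched else "Unknown"


for entry in ["illumina myseq", "MinION/GridION", "myseq; minion", "", None]:
    print(repr(entry), "->", normalize_platform(entry))
```

Plain substring matching is simple but can over-match when a keyword is short (for instance, "oxford" appearing in an institution name), so a real keyword list needs to be curated with that in mind.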
Examples of normalization
Raw Metadata Entry | Normalized Sequencing Platform |
---|---|
miseq | Illumina |
illumina myseq | Illumina |
MiSeq (Illumina) | Illumina |
MinION/GridION | Oxford Nanopore |
nanopore | Oxford Nanopore |
Oxford | Oxford Nanopore |
PacBio RSII | PacBio |
Pacific Biosciences | PacBio |
illumina and nanopore | Illumina and Oxford Nanopore |
myseq; minion | Illumina and Oxford Nanopore |
nanopore + pacbio | Oxford Nanopore and PacBio |
BGI | BGI |
BGISEQ-500 | BGI |
Oxford + Illumina + PacBio | Illumina and Oxford Nanopore and PacBio |