Sequencing Technology - prekijpatel/MetaMiner GitHub Wiki

In raw metadata, sequencing technology is often entered in inconsistent ways - informal names, abbreviations or tool-specific identifiers.

For example, "illumina myseq", "miseq", "short-read solexa", and "miSeq" all refer to Illumina sequencing. Similarly, "nanopore", "minion", or "oxford" might all refer to Oxford Nanopore Technologies.

Unless these entries are mapped to a unified terminology, grouping or filtering the data by sequencing platform is laborious.

How does MetaMiner do it?

MetaMiner uses a keyword-based matching strategy to normalize sequencing platform entries:

  1. Keyword Lists: For each major sequencing platform, a curated list of possible keywords or variations is maintained. These lists include common abbreviations, misspellings, and alternate names found in public datasets.

  2. Iterative Search: Each metadata entry is scanned for these keywords. If a keyword is found, the entry is mapped to the corresponding standardized name (e.g., "Illumina" or "Oxford Nanopore").

  3. Multiple Technologies: If a metadata entry indicates multiple sequencing technologies (e.g., hybrid assemblies), MetaMiner combines the matching platforms using " and " to reflect the hybrid nature (e.g., "Illumina and Oxford Nanopore").

  4. Fallback Handling: If no known keyword matches, or no sequencing technology was defined in the raw metadata, the entry is labeled "Unknown".

This process ensures that all downstream analysis and visualizations rely on a consistent and clean representation of sequencing technologies.
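The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not MetaMiner's actual code: the `PLATFORM_KEYWORDS` dictionary, its keyword lists, and the `normalize_platform` function name are assumptions made for this example, and the real curated lists are likely longer. Matching is assumed to be a case-insensitive substring test.

```python
# Hypothetical keyword lists (step 1). The platform order here also fixes
# the order in which names are joined for hybrid entries.
PLATFORM_KEYWORDS = {
    "Illumina": ["illumina", "miseq", "myseq", "solexa"],
    "Oxford Nanopore": ["nanopore", "minion", "gridion", "oxford"],
    "PacBio": ["pacbio", "pacific biosciences"],
    "BGI": ["bgi"],
}


def normalize_platform(raw):
    """Map a free-text sequencing technology entry to a standardized name."""
    # Step 4 (fallback): missing or empty entries are labeled "Unknown".
    if raw is None or not raw.strip():
        return "Unknown"

    text = raw.lower()

    # Step 2 (iterative search): collect every platform with a keyword hit.
    matches = [
        platform
        for platform, keywords in PLATFORM_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]

    # Step 3 (multiple technologies): join hits with " and ".
    # Step 4 (fallback): no hit at all also yields "Unknown".
    return " and ".join(matches) if matches else "Unknown"


for entry in ["illumina myseq", "MinION/GridION", "myseq; minion", "Sanger", ""]:
    print(repr(entry), "->", normalize_platform(entry))
```

Because the platforms are checked in a fixed order, a hybrid entry receives the same label regardless of how the raw text orders the technologies. One caveat of plain substring matching is that short keywords such as "bgi" can match unrelated text, so a stricter implementation might match on word boundaries instead.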

Examples of normalization

| Raw Metadata Entry | Normalized Sequencing Platform |
| --- | --- |
| miseq | Illumina |
| illumina myseq | Illumina |
| MiSeq (Illumina) | Illumina |
| MinION/GridION | Oxford Nanopore |
| nanopore | Oxford Nanopore |
| Oxford | Oxford Nanopore |
| PacBio RSII | PacBio |
| Pacific Biosciences | PacBio |
| illumina and nanopore | Illumina and Oxford Nanopore |
| myseq; minion | Illumina and Oxford Nanopore |
| nanopore + pacbio | Oxford Nanopore and PacBio |
| BGI | BGI |
| BGISEQ-500 | BGI |
| Oxford + Illumina + PacBio | Illumina and Oxford Nanopore and PacBio |