Auto Species Matching - ajmoore143/KEGGBLAST GitHub Wiki

Auto Species Matching

KEGGBLAST’s fuzzy matching lets you type an approximate or misspelled species name and still get the correct KEGG ID. Here’s how it works:

  1. Load All KEGG Species

    • On first run, load_species_data() queries the KEGG REST API (http://rest.kegg.jp/list/organism) and builds a DataFrame with these columns:
      • organism_code (e.g. hsa)
      • taxonomy_id (e.g. 9606)
      • scientific_name (e.g. Homo sapiens)
      • common_name (e.g. human)
    • The DataFrame is then cached as ~/.keggblast_species_cache.csv and reused on future runs.
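
The load-and-cache step above can be sketched roughly as follows. This is not KEGGBLAST's actual code: `parse_organism_listing`, `load_species`, and the `fetch_listing` callback are hypothetical names, and plain dicts with the `csv` module stand in for the DataFrame to keep the sketch dependency-free. It also assumes each row is tab-separated, with an identifier, the organism code, and a name field of the form `Scientific name (common name)`; check the real listing before relying on that layout.

```python
import csv
import os

FIELDS = ["organism_code", "taxonomy_id", "scientific_name", "common_name"]


def parse_organism_listing(text):
    """Parse tab-separated organism rows into dicts (layout is an assumption)."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        identifier, code, name = parts[0], parts[1], parts[2].strip()
        # Split "Scientific name (common name)"; the common name is optional.
        if name.endswith(")") and " (" in name:
            scientific, _, common = name[:-1].rpartition(" (")
        else:
            scientific, common = name, ""
        rows.append({
            "organism_code": code,
            "taxonomy_id": identifier,
            "scientific_name": scientific,
            "common_name": common,
        })
    return rows


def load_species(cache_path, fetch_listing):
    """Return species rows from the CSV cache, fetching only if it is missing."""
    if os.path.exists(cache_path):
        with open(cache_path, newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))
    # fetch_listing is a hypothetical callable that returns the raw listing text.
    rows = parse_organism_listing(fetch_listing())
    with open(cache_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Passing the download as a callback keeps the caching logic testable without network access. In real use, `fetch_listing` would wrap an HTTP GET against the URL above.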
  2. Fuzzy Match Algorithm

    • map_species_from_single_input(...) takes your raw string (e.g. "hamo sapiens") and computes the Levenshtein distance between it and every scientific_name in the DataFrame.
    • It picks the entry with the smallest distance. If several entries tie, it picks the one with the highest KEGG usage frequency.
    • Returns:
      • matched_name (exact string from the table, e.g. "Homo sapiens")
      • species_id (e.g. hsa)
      • gene_list (list of genes under that species from your parsed KO table)
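
As a rough illustration of the matching step, here is a minimal pure-Python sketch of nearest-name lookup by Levenshtein distance. It is not the actual body of map_species_from_single_input: `levenshtein`, `match_species`, and the three-row `SPECIES` table are hypothetical, the comparison is lowercased (an assumption), ties simply fall back to table order because the usage-frequency data is not modelled here, and the gene_list lookup is left out.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Illustrative stand-in for the cached species table.
SPECIES = [
    {"organism_code": "hsa", "scientific_name": "Homo sapiens"},
    {"organism_code": "eco", "scientific_name": "Escherichia coli"},
    {"organism_code": "ath", "scientific_name": "Arabidopsis thaliana"},
]


def match_species(query, species_rows):
    """Return (matched_name, species_id) for the closest scientific name."""
    q = query.strip().lower()
    # min() keeps the first minimal row, so ties resolve by table order here.
    best = min(species_rows,
               key=lambda row: levenshtein(q, row["scientific_name"].lower()))
    return best["scientific_name"], best["organism_code"]


print(match_species("hamo sapiens", SPECIES))       # ('Homo sapiens', 'hsa')
print(match_species("arabidos thaliana", SPECIES))  # ('Arabidopsis thaliana', 'ath')
```

A full scan like this costs one distance computation per species row, each proportional to the product of the two string lengths.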
  3. Examples

Input:   "hamo sapiens"
Match:   "Homo sapiens"  → KEGG ID `hsa`

Input:   "E. coliii"  
Match:   "Escherichia coli" → KEGG ID `eco`

Input:   "arabidos thaliana"  
Match:   "Arabidopsis thaliana" → KEGG ID `ath`