Auto Species Matching

KEGGBLAST’s fuzzy matching allows you to type approximate species names (or common misspellings) and still get the correct KEGG ID. Here’s how:

Load All KEGG Species
- On first run, load_species_data() queries the KEGG REST API (http://rest.kegg.jp/list/organism) and builds a DataFrame with these columns:
  - organism_code (e.g. hsa)
  - taxonomy_id (e.g. 9606)
  - scientific_name (e.g. Homo sapiens)
  - common_name (e.g. human)
- The DataFrame is then cached at ~/.keggblast_species_cache.csv and reused on later runs.
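
The cache-on-first-run pattern described above can be sketched as follows. This is a minimal illustration, not KEGGBLAST's actual code: the function name `load_species_cached` and its `fetch` parameter are hypothetical, and it uses a plain list of dicts with the standard `csv` module instead of a DataFrame so that it runs without extra dependencies.

```python
from __future__ import annotations

import csv
from pathlib import Path
from typing import Callable

# Column names taken from the list above.
COLUMNS = ["organism_code", "taxonomy_id", "scientific_name", "common_name"]


def load_species_cached(cache_path: Path, fetch: Callable[[], list[dict]]) -> list[dict]:
    """Return species rows from the cache file, fetching and caching them if it is missing.

    `fetch` is a hypothetical callable standing in for the KEGG REST request;
    it must return dicts keyed by COLUMNS.
    """
    if cache_path.exists():
        # Later runs: read the cached CSV (all values come back as strings).
        with cache_path.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))

    # First run: fetch once, then write the cache for future runs.
    rows = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Passing the download step in as a callable keeps the caching logic testable offline. Deleting the cache file forces a fresh download on the next call.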
Fuzzy Match Algorithm
- map_species_from_single_input(...) takes your raw input string (e.g. "hamo sapiens") and computes its Levenshtein distance to every scientific_name in the DataFrame.
- It picks the entry with the smallest distance. If several entries tie, it picks the one with the highest KEGG usage frequency.
- Returns:
  - matched_name (exact string from the table, e.g. "Homo sapiens")
  - species_id (e.g. hsa)
  - gene_list (list of genes under that species from your parsed KO table)
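
The nearest-name lookup above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the KEGGBLAST implementation: `levenshtein` and `closest_species` are hypothetical helper names, the comparison is made case-insensitively (an assumption), and ties are resolved by taking the first entry, because the usage-frequency data used for tie-breaking is not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_species(query: str, species: list) -> tuple:
    """Return the (scientific_name, organism_code) pair nearest to `query`.

    Assumption: case-insensitive comparison; ties go to the first entry.
    """
    q = query.strip().lower()
    return min(species, key=lambda entry: levenshtein(q, entry[0].lower()))


# Small illustrative table; the real one comes from the cached species data.
species = [
    ("Homo sapiens", "hsa"),
    ("Escherichia coli", "eco"),
    ("Arabidopsis thaliana", "ath"),
]
name, code = closest_species("hamo sapiens", species)
print(name, code)  # Homo sapiens hsa
```

Scanning every name costs one distance computation per species. That is simple, but it may be slow for very large tables.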
Examples

Input:   "hamo sapiens"
Match:   "Homo sapiens"  → KEGG ID `hsa`

Input:   "E. coliii"  
Match:   "Escherichia coli" → KEGG ID `eco`

Input:   "arabidos thaliana"  
Match:   "Arabidopsis thaliana" → KEGG ID `ath`