Auto Species Matching - ajmoore143/KEGGBLAST GitHub Wiki
Auto Species Matching
KEGGBLAST’s fuzzy matching allows you to type approximate species names (or common misspellings) and still get the correct KEGG ID. Here’s how:
- Load All KEGG Species
  - On first run, `load_species_data()` hits KEGG REST (http://rest.kegg.jp/list/organism) and builds a DataFrame with columns:
    - `organism_code` (e.g. `hsa`)
    - `taxonomy_id` (e.g. `9606`)
    - `scientific_name` (e.g. `Homo sapiens`)
    - `common_name` (e.g. `human`)
  - Caches this as `~/.keggblast_species_cache.csv` for future runs.
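
The cache-on-first-run behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not KEGGBLAST's actual code: `load_cached_species` and its `fetch_rows` callback are hypothetical names, the callback stands in for the KEGG REST request, and plain `csv` is used in place of a DataFrame so the example runs without extra dependencies.

```python
import csv
from pathlib import Path

# Column names taken from the description above.
COLUMNS = ["organism_code", "taxonomy_id", "scientific_name", "common_name"]


def load_cached_species(cache_path, fetch_rows):
    """Return species rows as a list of dicts keyed by COLUMNS.

    Hypothetical helper: if cache_path exists, read it and skip the fetch;
    otherwise call fetch_rows() (a stand-in for the KEGG REST request),
    write the result to cache_path, and return it.
    """
    cache_path = Path(cache_path).expanduser()

    # Later runs: the cache file is already there, so no fetch is needed.
    if cache_path.exists():
        with cache_path.open(newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))

    # First run: fetch once, then persist the rows for future runs.
    rows = fetch_rows()
    with cache_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Passing the fetch step in as a callback keeps the caching logic testable offline; deleting the cache file forces a fresh download on the next call.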
- Fuzzy Match Algorithm
  - `map_species_from_single_input(...)` takes your raw string (e.g. `"hamo sapiens"`) and computes the Levenshtein distance against every `scientific_name` in the DataFrame.
  - Picks the entry with the smallest distance. If several entries tie, it picks the one with the highest KEGG usage frequency.
  - Returns:
    - `matched_name` (exact string from the table, e.g. `"Homo sapiens"`)
    - `species_id` (e.g. `hsa`)
    - `gene_list` (list of genes under that species from your parsed KO table)
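
The matching step above can be illustrated with a small pure-Python sketch. It is not KEGGBLAST's implementation: `levenshtein`, `closest_species`, and `SAMPLE_SPECIES` are hypothetical names, the comparison is lowercased (an assumption), ties are broken by table order rather than by KEGG usage frequency, and the `gene_list` lookup is left out because it depends on your parsed KO table.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Tiny stand-in for the species table; the codes come from the examples on this page.
SAMPLE_SPECIES = [
    {"organism_code": "hsa", "scientific_name": "Homo sapiens"},
    {"organism_code": "eco", "scientific_name": "Escherichia coli"},
    {"organism_code": "ath", "scientific_name": "Arabidopsis thaliana"},
]


def closest_species(query, species_rows):
    """Return (matched_name, species_id) for the row nearest to query.

    Assumption: comparison is case-insensitive. min() keeps the first of
    several equally close rows, so ties fall back to table order here.
    """
    if not species_rows:
        raise ValueError("species table is empty")
    normalized = query.strip().lower()
    best = min(
        species_rows,
        key=lambda row: levenshtein(normalized, row["scientific_name"].lower()),
    )
    return best["scientific_name"], best["organism_code"]
```

With this sample table, `closest_species("hamo sapiens", SAMPLE_SPECIES)` returns `("Homo sapiens", "hsa")`, since a single substitution separates the lowercased strings.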
- Examples
  - Input: "hamo sapiens"
    Match: "Homo sapiens" → KEGG ID `hsa`
  - Input: "E. coliii"
    Match: "Escherichia coli" → KEGG ID `eco`
  - Input: "arabidos thaliana"
    Match: "Arabidopsis thaliana" → KEGG ID `ath`