Usage - ajmoore143/KEGGBLAST GitHub Wiki
Below is a complete, end-to-end pipeline example. Copy–paste each section into your Python script or notebook.
from keggblast.utils import (
fetch_kegg_orthology,
parse_gene_table,
load_species_data,
map_species_from_single_input,
extract_genes_for_species,
fetch_gene_entry,
extract_sequence,
write_fasta_file
)
from keggblast.blast_gget import run_gget_blast_all
from keggblast.blast_ncbi import run_ncbi_blast_all
-
fetch_kegg_orthology("KXXXXX") → retrieves the raw KO entry (text) for “KXXXXX”
-
parse_gene_table(ko_entry) → parses that entry into a Pandas DataFrame of gene IDs
-
load_species_data() → loads (or auto-caches) the KEGG species dictionary (name ↔ KEGG ID)
-
map_species_from_single_input(...) → fuzzy-matches your input string (e.g. “hamo sapiens”) to a KEGG ID (e.g. hsa) and returns the list of genes for that species
-
fetch_gene_entry(...), extract_sequence(...), write_fasta_file(...) → download each gene’s AASEQ/NTSEQ and write FASTA files
-
run_gget_blast_all(...) / run_ncbi_blast_all(...) → blast each FASTA file against nr (or your chosen DB), with optional taxonomy filters
ko_entry = fetch_kegg_orthology("K09252")
gene_df = parse_gene_table(ko_entry)
- gene_df is a DataFrame with columns like entry_id, gene_id, species_code, etc.
species_df = load_species_data()
matched_name, species_id, gene_list = map_species_from_single_input(
species_df, gene_df, user_input="hamo sapiens"
)
-
Input: anything resembling a Latin species name (even if misspelled)
-
Output:
-
matched_name: the corrected scientific name (e.g., “homo sapiens”)
-
species_id: KEGG species code (e.g., hsa)
-
gene_list: list of gene identifiers for that species under KO K09252
-
import os
output_dir = f"fasta_output/{species_id}"
os.makedirs(output_dir, exist_ok=True)
for gene in gene_list:
entry = fetch_gene_entry(f"{species_id}:{gene}")
aa_seq = extract_sequence(entry, "AASEQ")
nt_seq = extract_sequence(entry, "NTSEQ")
if aa_seq:
write_fasta_file(f"{output_dir}/{gene}_amino.fasta", gene, aa_seq)
if nt_seq:
write_fasta_file(f"{output_dir}/{gene}_gene.fasta", gene, nt_seq)
-
Creates a folder fasta_output/hsa/, then:
-
_amino.fasta (AA sequence)
-
_gene.fasta (NT sequence)
-
run_gget_blast_all(
program="blastp",
database="nr",
query_folder=f"fasta_output/{species_id}",
output_folder="blast_results_gget"
)
-
Pros: Quick JSON output, no API key needed
-
Cons: Cannot restrict by taxonomy
run_ncbi_blast_all(
program="blastp",
database="nr",
query_folder=f"fasta_output/{species_id}",
output_folder="blast_results_ncbi",
tax_query="txid4751[ORGN]" # Example: Fungi
)
-
Pros: You can do txid####[ORGN] filters (e.g., bacteria, fungi, plants)
-
Cons: Requires NCBI_API_KEY (environment variable) for higher rate limits