Usage & Examples

Below is a complete, end-to-end pipeline example. Copy–paste each section into your Python script or notebook.

1. Import Tools

from keggblast.utils import (
    fetch_kegg_orthology,
    parse_gene_table,
    load_species_data,
    map_species_from_single_input,
    extract_genes_for_species,
    fetch_gene_entry,
    extract_sequence,
    write_fasta_file
)
from keggblast.blast_gget import run_gget_blast_all
from keggblast.blast_ncbi import run_ncbi_blast_all

What these do:

fetch_kegg_orthology("KXXXXX") → retrieves the raw KO entry (text) for “KXXXXX”
parse_gene_table(ko_entry) → parses that entry into a Pandas DataFrame of gene IDs
load_species_data() → loads (or auto-caches) the KEGG species dictionary (name ↔ KEGG ID)
map_species_from_single_input(...) → fuzzy-matches your input string (e.g. “hamo sapiens”) to a KEGG ID (e.g. hsa) and returns the list of genes for that species
fetch_gene_entry(...), extract_sequence(...), write_fasta_file(...) → download each gene’s AASEQ/NTSEQ and write FASTA files
run_gget_blast_all(...) / run_ncbi_blast_all(...) → blast each FASTA file against nr (or your chosen DB), with optional taxonomy filters

2. Fetch KO Entry & Parse

ko_entry = fetch_kegg_orthology("K09252")
gene_df = parse_gene_table(ko_entry)

gene_df is a DataFrame with columns like entry_id, gene_id, species_code, etc.

3. Match Species

species_df = load_species_data()
matched_name, species_id, gene_list = map_species_from_single_input(
    species_df, gene_df, user_input="hamo sapiens"
)

Input: anything resembling a Latin species name (even if misspelled)
Output:
- matched_name: the corrected scientific name (e.g., “homo sapiens”)
- species_id: KEGG species code (e.g., hsa)
- gene_list: list of gene identifiers for that species under KO K09252

4. Download FASTA Sequences

import os

output_dir = f"fasta_output/{species_id}"
os.makedirs(output_dir, exist_ok=True)

for gene in gene_list:
    entry = fetch_gene_entry(f"{species_id}:{gene}")
    aa_seq = extract_sequence(entry, "AASEQ")
    nt_seq = extract_sequence(entry, "NTSEQ")

    if aa_seq:
        write_fasta_file(f"{output_dir}/{gene}_amino.fasta", gene, aa_seq)
    if nt_seq:
        write_fasta_file(f"{output_dir}/{gene}_gene.fasta", gene, nt_seq)

Creates a folder fasta_output/hsa/, then:
- _amino.fasta (AA sequence)
- _gene.fasta (NT sequence)

5. Run BLAST

Option A: gget BLAST (JSON, no tax filter)

run_gget_blast_all(
    program="blastp",
    database="nr",
    query_folder=f"fasta_output/{species_id}",
    output_folder="blast_results_gget"
)

Pros: Quick JSON output, no API key needed
Cons: Cannot restrict by taxonomy

Option B: NCBI BLAST (JSON, optional tax filter)

run_ncbi_blast_all(
    program="blastp",
    database="nr",
    query_folder=f"fasta_output/{species_id}",
    output_folder="blast_results_ncbi",
    tax_query="txid4751[ORGN]"  # Example: Fungi
)

Pros: You can do txid####[ORGN] filters (e.g., bacteria, fungi, plants)
Cons: Requires NCBI_API_KEY (environment variable) for higher rate limits

Usage - ajmoore143/KEGGBLAST GitHub Wiki

Usage & Examples

1. Import Tools

What these do:

2. Fetch KO Entry & Parse

3. Match Species

4. Download FASTA Sequences

5. Run BLAST

Option A: gget BLAST (JSON, no tax filter)

Option B: NCBI BLAST (JSON, optional tax filter)

⚠️ GitHub.com Fallback ⚠️

Usage - ajmoore143/KEGGBLAST GitHub Wiki

Usage & Examples

1. Import Tools

What these do:

2. Fetch KO Entry & Parse

3. Match Species

4. Download FASTA Sequences

5. Run BLAST

Option A: gget BLAST (JSON, no tax filter)

Option B: NCBI BLAST (JSON, optional tax filter)

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️