Usage - ajmoore143/KEGGBLAST GitHub Wiki

Usage & Examples

Below is a complete, end-to-end pipeline example. Copy–paste each section into your Python script or notebook.

1. Import Tools

from keggblast.utils import (
    fetch_kegg_orthology,
    parse_gene_table,
    load_species_data,
    map_species_from_single_input,
    extract_genes_for_species,
    fetch_gene_entry,
    extract_sequence,
    write_fasta_file
)
from keggblast.blast_gget import run_gget_blast_all
from keggblast.blast_ncbi import run_ncbi_blast_all

What these do:

  • fetch_kegg_orthology("KXXXXX") → retrieves the raw KO entry (text) for “KXXXXX”

  • parse_gene_table(ko_entry) → parses that entry into a Pandas DataFrame of gene IDs

  • load_species_data() → loads (or auto-caches) the KEGG species dictionary (name ↔ KEGG ID)

  • map_species_from_single_input(...) → fuzzy-matches your input string (e.g. “hamo sapiens”) to a KEGG ID (e.g. hsa) and returns the list of genes for that species

  • fetch_gene_entry(...), extract_sequence(...), write_fasta_file(...) → download each gene’s AASEQ/NTSEQ and write FASTA files

  • run_gget_blast_all(...) / run_ncbi_blast_all(...) → blast each FASTA file against nr (or your chosen DB), with optional taxonomy filters

2. Fetch KO Entry & Parse

ko_entry = fetch_kegg_orthology("K09252")
gene_df = parse_gene_table(ko_entry)
  • gene_df is a DataFrame with columns like entry_id, gene_id, species_code, etc.

3. Match Species

species_df = load_species_data()
matched_name, species_id, gene_list = map_species_from_single_input(
    species_df, gene_df, user_input="hamo sapiens"
)
  • Input: anything resembling a Latin species name (even if misspelled)

  • Output:

    • matched_name: the corrected scientific name (e.g., “homo sapiens”)

    • species_id: KEGG species code (e.g., hsa)

    • gene_list: list of gene identifiers for that species under KO K09252

4. Download FASTA Sequences

import os

output_dir = f"fasta_output/{species_id}"
os.makedirs(output_dir, exist_ok=True)

for gene in gene_list:
    entry = fetch_gene_entry(f"{species_id}:{gene}")
    aa_seq = extract_sequence(entry, "AASEQ")
    nt_seq = extract_sequence(entry, "NTSEQ")

    if aa_seq:
        write_fasta_file(f"{output_dir}/{gene}_amino.fasta", gene, aa_seq)
    if nt_seq:
        write_fasta_file(f"{output_dir}/{gene}_gene.fasta", gene, nt_seq)
  • Creates a folder fasta_output/hsa/, then:

    • _amino.fasta (AA sequence)

    • _gene.fasta (NT sequence)

5. Run BLAST

Option A: gget BLAST (JSON, no tax filter)

run_gget_blast_all(
    program="blastp",
    database="nr",
    query_folder=f"fasta_output/{species_id}",
    output_folder="blast_results_gget"
)
  • Pros: Quick JSON output, no API key needed

  • Cons: Cannot restrict by taxonomy

Option B: NCBI BLAST (JSON, optional tax filter)

run_ncbi_blast_all(
    program="blastp",
    database="nr",
    query_folder=f"fasta_output/{species_id}",
    output_folder="blast_results_ncbi",
    tax_query="txid4751[ORGN]"  # Example: Fungi
)
  • Pros: You can do txid####[ORGN] filters (e.g., bacteria, fungi, plants)

  • Cons: Requires NCBI_API_KEY (environment variable) for higher rate limits

⚠️ **GitHub.com Fallback** ⚠️