FASTA Extraction Logic - ajmoore143/KEGGBLAST GitHub Wiki

FASTA Extraction Logic

KEGGBLAST’s goal is to gather both amino acid (AASEQ) and nucleotide (NTSEQ) sequences for each gene in a KO entry, specifically for the matched species. Here’s how it works under the hood:

  1. Fetch KO → Gene Table

    • fetch_kegg_orthology("KXXXXX") returns the KO entry as raw text.
    • parse_gene_table(...) parses that text into a DataFrame with one row per gene (gene ID plus species).
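
To make the parsing step concrete, here is a minimal sketch of how a gene table could be pulled out of a KO entry. It is not KEGGBLAST's actual code: the function name `parse_gene_rows` is hypothetical, it returns a plain list of dicts rather than a DataFrame, and it assumes the KEGG flat-file layout in which the GENES field lists one organism code per line followed by gene IDs with optional names in parentheses.

```python
def parse_gene_rows(ko_text):
    """Collect one {"species", "gene_id"} row per gene from a KO entry.

    Assumed layout (not taken from KEGGBLAST itself):
        GENES       HSA: 3098(HK1) 3099(HK2)
                    PTR: 450505(HK1)
    """
    rows = []
    species = None
    in_genes = False
    for line in ko_text.splitlines():
        if line.startswith("GENES"):
            in_genes = True
            body = line[len("GENES"):]
        elif in_genes and line.startswith(" "):
            # Indented lines continue the GENES field.
            body = line
        else:
            # Any other field header (or blank line) ends the GENES field.
            in_genes = False
            continue

        tokens = body.split()
        if tokens and tokens[0].endswith(":"):
            # Organism codes are lowercased here on the assumption that
            # gene entries are later requested as "<species_id>:<gene_id>".
            species = tokens[0][:-1].lower()
            tokens = tokens[1:]
        for token in tokens:
            gene_id = token.split("(", 1)[0]  # drop the "(NAME)" suffix
            rows.append({"species": species, "gene_id": gene_id})
    return rows


if __name__ == "__main__":
    sample = (
        "ENTRY       K00844                      KO\n"
        "GENES       HSA: 3098(HK1) 3099(HK2)\n"
        "            PTR: 450505(HK1)\n"
        "///\n"
    )
    for row in parse_gene_rows(sample):
        print(row["species"], row["gene_id"])
```

A list of dicts like this can be handed to `pandas.DataFrame(rows)` if a DataFrame is wanted downstream.
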
  2. Match Species

    • load_species_data() either reads from a local cache or hits KEGG once to get all (species_name ↔ KEGG_ID) pairs.
    • map_species_from_single_input(...) uses fuzzy search (edit distance) to compare your input (e.g., “hamo sapiens”) against all available KEGG species names, so small typos still resolve to the intended species.
    • Once the best match (lowest edit distance) is found, it keeps only the rows for that species from the DataFrame.
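
The fuzzy-matching idea can be sketched as follows. This is an illustration, not KEGGBLAST's implementation: `edit_distance` and `best_species_match` are hypothetical names, the distance is plain Levenshtein, and the species dictionary is a hand-written stand-in for whatever load_species_data() returns.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def best_species_match(query, species_to_id):
    """Return (species_name, kegg_id) for the name closest to the query."""
    query = query.strip().lower()
    name = min(species_to_id, key=lambda n: edit_distance(query, n.lower()))
    return name, species_to_id[name]


if __name__ == "__main__":
    # Stand-in data; the real list would come from KEGG or the local cache.
    species = {
        "Homo sapiens": "hsa",
        "Mus musculus": "mmu",
        "Pan troglodytes": "ptr",
    }
    print(best_species_match("hamo sapiens", species))
```

Taking the minimum always returns some species, even for unrelated input, so a real tool may also want a distance threshold above which the match is rejected.
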
  3. For Each Gene:

    • fetch_gene_entry("<species_id>:<gene_id>") → raw “gene entry text” (which contains labeled blocks like AASEQ and NTSEQ).
    • extract_sequence(entry, "AASEQ") → pulls out the block of text starting at AASEQ and ending at the next blank line or block header.
    • extract_sequence(entry, "NTSEQ") → similarly for nucleotides.
    • If a block is missing (some genes might lack AASEQ or NTSEQ), KEGGBLAST simply skips writing that FASTA.
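
A minimal sketch of the block extraction described above, under stated assumptions: the function name `extract_block` is hypothetical, the label line is assumed to carry only the sequence length (as in KEGG flat-file gene entries), and the sequence itself is assumed to sit on the indented lines that follow.

```python
def extract_block(entry_text, label):
    """Return the sequence under `label` as one string, or None if absent.

    Assumed layout (not taken from KEGGBLAST itself):
        AASEQ       12
                    MIAAQLLAYYFT
        NTSEQ       36
                    atgatcgccgcgcaactactggcctattacttcacc
    """
    parts = []
    inside = False
    for line in entry_text.splitlines():
        if not inside:
            if line.startswith(label):
                inside = True  # the label line holds the length, not residues
            continue
        # A blank line or a new field header (no leading space) ends the block.
        if not line.strip() or not line[0].isspace():
            break
        parts.append(line.strip())
    return "".join(parts) or None


if __name__ == "__main__":
    entry = (
        "ENTRY       3098              CDS       T01001\n"
        "AASEQ       12\n"
        "            MIAAQL\n"
        "            LAYYFT\n"
        "///\n"
    )
    for label in ("AASEQ", "NTSEQ"):
        sequence = extract_block(entry, label)
        if sequence is None:
            print(f"{label}: missing, skipping FASTA")
        else:
            print(f"{label}: {sequence}")
```

Returning None for a missing block lets the caller skip that FASTA file without special-casing empty strings.
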
  4. Write FASTA

    • write_fasta_file(path, gene_id, sequence_string)
    • Internally writes a FASTA header line (">gene_id") followed by the raw sequence, wrapped at 60 characters per line by default.
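
The FASTA formatting step can be sketched like this. The names `format_fasta` and `write_fasta`, and the `width` parameter, are illustrative assumptions; only the ">gene_id" header and the 60-character default wrap come from the description above.

```python
def format_fasta(gene_id, sequence, width=60):
    """Build a FASTA record: a ">gene_id" header, then wrapped sequence lines."""
    lines = [f">{gene_id}"]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


def write_fasta(path, gene_id, sequence, width=60):
    """Write a single-record FASTA file to `path`."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_fasta(gene_id, sequence, width))


if __name__ == "__main__":
    # 130 residues wrap into lines of 60, 60, and 10 characters.
    print(format_fasta("hsa:3098", "M" * 130), end="")
```

Keeping the formatting separate from the file write makes the wrapping logic easy to check without touching the disk.
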