FASTA Extraction Logic

KEGGBLAST’s goal is to gather both amino acid (AASEQ) and nucleotide (NTSEQ) sequences for each gene in a KO entry, specifically for the matched species. Here’s how it works under the hood:

Fetch KO → Gene Table
- fetch_kegg_orthology("KXXXXX") returns a text blob.
- parse_gene_table(...) builds a DataFrame where each row is a gene ID + species.
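
To make the gene-table step concrete, here is a minimal sketch of pulling (species, gene) pairs out of a KO text blob. It is not KEGGBLAST's actual parse_gene_table: the name parse_gene_rows, the sample text, and the lowercase species code are assumptions made for illustration, based on the usual KEGG flat-file layout (a 12-character key column with indented continuation lines). The resulting list of dicts could be handed to pandas.DataFrame to get one row per gene.

```python
import re

# Hypothetical sample in the KEGG flat-file layout (assumed, not real output).
SAMPLE_KO_TEXT = """\
ENTRY       K00844                      KO
SYMBOL      HK
GENES       HSA: 3098(HK1) 3099(HK2)
            MMU: 15275(Hk1)
REFERENCE   (placeholder)
///
"""

def parse_gene_rows(ko_text):
    """Return one {"species_id", "gene_id"} dict per gene in the GENES block."""
    rows = []
    in_genes = False
    for line in ko_text.splitlines():
        key, value = line[:12].strip(), line[12:].strip()
        if key:
            # A non-empty key column starts a new labeled block.
            in_genes = (key == "GENES")
        if not in_genes or ":" not in value:
            continue
        species, genes = value.split(":", 1)
        for token in genes.split():
            # Drop a trailing "(HK1)"-style gene symbol, keeping the bare ID.
            gene_id = re.sub(r"\(.*\)$", "", token)
            rows.append({"species_id": species.strip().lower(),
                         "gene_id": gene_id})
    return rows

rows = parse_gene_rows(SAMPLE_KO_TEXT)
```

Each row carries enough information to build the "<species_id>:<gene_id>" identifier used when fetching individual gene entries.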
Match Species
- load_species_data() either reads from a local cache or hits KEGG once to get all (species_name ↔ KEGG_ID) pairs.
- map_species_from_single_input(...) uses fuzzy search (edit distance) to compare your input (e.g., “hamo sapiens”) against all available KEGG species names.
- Once the best match (lowest edit distance) is found, it keeps only the rows of the gene DataFrame that belong to that species.
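
The matching step can be sketched with a plain Levenshtein distance. This is an illustration of the idea rather than KEGGBLAST's own code: edit_distance, best_species_match, and the three-entry SPECIES table are hypothetical, and real KEGG species names are longer than the simplified ones used here.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def best_species_match(query, species_to_id):
    """Return (species_name, kegg_id) for the name closest to the query."""
    q = query.strip().lower()
    name = min(species_to_id, key=lambda n: edit_distance(q, n.lower()))
    return name, species_to_id[name]

# Hypothetical, heavily shortened species table.
SPECIES = {"Homo sapiens": "hsa", "Mus musculus": "mmu", "Danio rerio": "dre"}

match = best_species_match("hamo sapiens", SPECIES)
```

With the misspelled input "hamo sapiens", the closest name is "Homo sapiens" at distance 1, so the sketch resolves to the KEGG ID "hsa". A production version would probably also want a maximum-distance cutoff so that unrelated input is rejected instead of silently matched.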
For Each Gene
- fetch_gene_entry("<species_id>:<gene_id>") → raw “gene entry text” (which contains labeled blocks like AASEQ and NTSEQ).
- extract_sequence(entry, "AASEQ") → pulls out the block of text starting at AASEQ and ending at the next blank line or block header.
- extract_sequence(entry, "NTSEQ") → similarly for nucleotides.
- If a block is missing (some genes might lack AASEQ or NTSEQ), KEGGBLAST simply skips writing that FASTA.
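
The block extraction described above can be sketched as follows. The function name extract_block and the sample entry are made up for illustration; the sketch assumes that the header line (for example AASEQ followed by a length) carries no sequence data and that sequence lines are indented beneath it, which matches the usual KEGG flat-file layout but is not confirmed here as KEGGBLAST's exact logic.

```python
# Made-up gene entry in the assumed flat-file layout.
SAMPLE_ENTRY = """\
ENTRY       0000              CDS       T00000
AASEQ       12
            MKTAYIAKQRQI
NTSEQ       36
            atgaaaaccgcgtatattgcgaaacagcgccagatt
///
"""

def extract_block(entry_text, label):
    """Return the sequence under `label` as one string, or None if absent."""
    collected = []
    inside = False
    for line in entry_text.splitlines():
        if inside:
            # Sequence lines are indented; a blank line or a new label
            # starting in column 0 ends the block.
            if not line.strip() or not line[0].isspace():
                break
            collected.append(line.strip())
        elif line.startswith(label):
            inside = True  # the header line itself holds only the length
    return "".join(collected) or None

aa = extract_block(SAMPLE_ENTRY, "AASEQ")
nt = extract_block(SAMPLE_ENTRY, "NTSEQ")
missing = extract_block(SAMPLE_ENTRY, "MOTIF")
```

Returning None for an absent block lets the caller skip that FASTA file with a simple `if sequence is None: continue` check.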
Write FASTA
- write_fasta_file(path, gene_id, sequence_string)
- Internally formats a FASTA header: ">gene_id" plus the raw sequence (wrapped at 60 chars/line by default).
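
A minimal sketch of the formatting step, assuming a header of ">gene_id" and fixed-width wrapping. The names format_fasta and write_fasta and the width parameter are hypothetical and may not match the real write_fasta_file signature.

```python
def format_fasta(gene_id, sequence, width=60):
    """Build a FASTA record: a '>' header line plus the wrapped sequence."""
    body = "\n".join(sequence[i:i + width]
                     for i in range(0, len(sequence), width))
    return f">{gene_id}\n{body}\n"

def write_fasta(path, gene_id, sequence, width=60):
    """Write a single-record FASTA file to `path`."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_fasta(gene_id, sequence, width))

# A 130-residue dummy sequence wraps into lines of 60, 60, and 10 characters.
record = format_fasta("hsa:3098", "A" * 130)
```

Keeping the formatting separate from the file write makes the wrapping easy to check without touching the disk.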
