FASTA Extraction Logic - ajmoore143/KEGGBLAST GitHub Wiki
KEGGBLAST’s goal is to gather both amino acid (AASEQ) and nucleotide (NTSEQ) sequences for each gene in a KO entry, specifically for the matched species. Here’s how it works under the hood:
- Fetch KO → Gene Table
  - fetch_kegg_orthology("KXXXXX") returns a text blob.
  - parse_gene_table(...) builds a DataFrame where each row is a gene ID + species.
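As a rough illustration of the parsing step, the sketch below turns a KO text blob into (species, gene) rows. It is not KEGGBLAST's own parse_gene_table: the helper name parse_gene_rows, the sample text, and the assumption that the blob uses the KEGG flat-file layout (a 12-column label field, with a GENES section listing `ORG: gene(symbol) ...`) are all illustrative, and it returns plain dicts rather than a DataFrame to stay dependency-free.

```python
def parse_gene_rows(ko_text):
    """Collect one row per gene from the GENES section of a KO entry.

    Assumes the KEGG flat-file layout: the first 12 columns hold the
    section label (blank on continuation lines), the rest is the body.
    """
    rows = []
    in_genes = False
    for line in ko_text.splitlines():
        label, body = line[:12].strip(), line[12:].strip()
        if label:
            # A non-blank label starts a new section.
            in_genes = (label == "GENES")
        if not in_genes or ":" not in body:
            continue
        org, genes = body.split(":", 1)
        for token in genes.split():
            # Drop the optional "(symbol)" suffix, e.g. "3098(HK1)" -> "3098".
            gene_id = token.split("(", 1)[0]
            rows.append({"species_id": org.strip().lower(), "gene_id": gene_id})
    return rows


# Hypothetical, abbreviated KO entry used only for demonstration.
sample = "\n".join([
    "ENTRY".ljust(12) + "K00844                      KO",
    "GENES".ljust(12) + "HSA: 3098(HK1) 3099(HK2)",
    " " * 12 + "MMU: 15275(Hk1)",
    "REFERENCE".ljust(12) + "PMID:1234",
])
rows = parse_gene_rows(sample)
```

If a DataFrame is wanted, a list of dicts like this can be passed straight to pandas.DataFrame(rows).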
- Match Species
  - load_species_data() either reads from a local cache or hits KEGG once to get all (species_name ↔ KEGG_ID) pairs.
  - map_species_from_single_input(...) uses fuzzy search (edit distance) to compare your input (e.g., “hamo sapiens”) against all available KEGG species names.
  - Once the best match (lowest edit distance) is found, it extracts that species row from the DataFrame.
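The fuzzy-matching idea can be sketched with a plain Levenshtein distance. This is a minimal illustration, not the code behind map_species_from_single_input: the function names edit_distance and best_species_match and the short species list are hypothetical, and the real implementation may normalize names or break ties differently.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def best_species_match(query, species_names):
    """Return the name with the lowest edit distance to the query."""
    q = query.strip().lower()
    return min(species_names, key=lambda name: edit_distance(q, name.lower()))


# Hypothetical candidate list; KEGGBLAST would use the full KEGG species table.
candidates = ["Homo sapiens", "Mus musculus", "Pan troglodytes"]
match = best_species_match("hamo sapiens", candidates)
```

Comparing in lowercase keeps a capitalization difference from counting as an edit, so a typo such as "hamo sapiens" is one substitution away from "Homo sapiens".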
- For Each Gene:
  - fetch_gene_entry("<species_id>:<gene_id>") → raw “gene entry text” (which contains labeled blocks like AASEQ and NTSEQ).
  - extract_sequence(entry, "AASEQ") → pulls out the block of text starting at AASEQ and ending at the next blank line or block header.
  - extract_sequence(entry, "NTSEQ") → similarly for nucleotides.
  - If a block is missing (some genes might lack AASEQ or NTSEQ), KEGGBLAST simply skips writing that FASTA.
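The block extraction described above might look roughly like the following. The helper name extract_block and the sample entry are hypothetical, and the sketch assumes the KEGG flat-file layout in which the header line carries the label plus a length, sequence lines are indented, and the next header (or the `///` terminator) starts in column one. It returns None for a missing block so the caller can skip that FASTA.

```python
def extract_block(entry_text, label):
    """Return the sequence under `label` as one string, or None if absent."""
    collected = []
    inside = False
    for line in entry_text.splitlines():
        if not inside:
            # The header line ("AASEQ       8") only marks the start;
            # its length field is not part of the sequence.
            if line.startswith(label):
                inside = True
            continue
        # Stop at a blank line or at the next unindented block header.
        if not line.strip() or not line[0].isspace():
            break
        collected.append(line.strip())
    return "".join(collected) or None


# Hypothetical, heavily shortened gene entry for demonstration only.
entry = "\n".join([
    "ENTRY".ljust(12) + "3098  CDS",
    "AASEQ".ljust(12) + "8",
    " " * 12 + "MIAA",
    " " * 12 + "QLLA",
    "NTSEQ".ljust(12) + "6",
    " " * 12 + "atgatc",
    "///",
])
aa = extract_block(entry, "AASEQ")
nt = extract_block(entry, "NTSEQ")
missing = extract_block("ENTRY".ljust(12) + "3098  CDS\n///", "NTSEQ")
```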
- Write FASTA
  - write_fasta_file(path, gene_id, sequence_string)
  - Internally formats a FASTA header: ">gene_id" plus the raw sequence (wrapped at 60 chars/line by default).
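A minimal sketch of the header-plus-wrapped-sequence format follows. The names format_fasta and write_fasta and the width parameter are illustrative, not the signature of write_fasta_file; only the ">gene_id" header and the 60-character default come from the description above.

```python
def format_fasta(gene_id, sequence, width=60):
    """Build a FASTA record: a '>' header line, then fixed-width sequence lines."""
    lines = [f">{gene_id}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def write_fasta(path, gene_id, sequence, width=60):
    """Write a single-record FASTA file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(format_fasta(gene_id, sequence, width))


# A 130-residue dummy sequence wraps into lines of 60, 60 and 10 characters.
record = format_fasta("3098", "A" * 130)
```

Keeping the formatting in a pure function separate from the file write makes the wrapping easy to check without touching the filesystem.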