Fasta Tools - ZhaoL-Bio/BioToolKits GitHub Wiki

extract_fasta.py

This Python script filters a FASTA file to extract the sequences for specific genes. Gene names are matched fuzzily, so "geneA" matches "geneA", "geneA.1", "geneA.2", and so on. An optional flag restricts the output to the longest sequence for each gene match.
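The matching behaviour described above can be illustrated with a minimal sketch. It is not the script's actual code: the function names `matches_gene` and `filter_records` are hypothetical, and the rule used here (exact name, or the name followed by a "."-separated suffix) is an assumption inferred from the "geneA.1" example. Records are plain (ID, sequence) pairs so the sketch runs without Biopython.

```python
def matches_gene(record_id, gene):
    """Assumed matching rule: the ID equals the gene name, or is the
    gene name followed by a '.'-separated suffix (e.g. 'geneA.1')."""
    return record_id == gene or record_id.startswith(gene + ".")


def filter_records(records, genes, longest=False):
    """Select (record_id, sequence) pairs matching any gene name.

    records: list of (record_id, sequence) pairs
    genes:   list of gene names
    longest: if True, keep only the longest match per gene
    """
    selected = []
    for gene in genes:
        hits = [(rid, seq) for rid, seq in records if matches_gene(rid, gene)]
        if longest and hits:
            # max() keeps the first of several equally long sequences
            hits = [max(hits, key=lambda pair: len(pair[1]))]
        selected.extend(hits)
    return selected


# Illustrative data only
records = [
    ("geneA", "ATGC"),
    ("geneA.1", "ATGCATGC"),
    ("geneA.2", "ATG"),
    ("geneAB", "ATGCCC"),   # not matched by "geneA" under the assumed rule
    ("geneB.1", "GGCC"),
]

all_hits = filter_records(records, ["geneA"])
longest_hits = filter_records(records, ["geneA", "geneB"], longest=True)
```

Under this assumed rule, "geneAB" is not treated as a match for "geneA"; the real script may use a looser or stricter rule.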

Prerequisites

  • Python 3.6 or higher
  • Biopython library

Installation

pip install biopython

Usage

The script is run from the command line with the following syntax:

python extract_fasta.py <input_fasta> <gene_names_file> <output_fasta> [--longest]

  • <input_fasta>: Path to the input FASTA file.
  • <gene_names_file>: Path to a file containing a list of gene names to filter by, one per line.
  • <output_fasta>: Path to the output file where filtered sequences will be saved.
  • --longest: Optional flag to only output the longest sequence for each gene match.
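
For readers who want to see how an interface like this can be declared, the following is a rough sketch using the standard library's argparse. It mirrors the arguments listed above but is only an assumption about how the script might be organized; `build_parser` is a hypothetical helper name, and the file names in the example call are placeholders.

```python
import argparse


def build_parser():
    """Declare three positional arguments and one optional flag,
    matching the usage line of extract_fasta.py."""
    parser = argparse.ArgumentParser(
        description="Extract sequences for selected genes from a FASTA file."
    )
    parser.add_argument("input_fasta", help="path to the input FASTA file")
    parser.add_argument(
        "gene_names_file", help="file listing gene names, one per line"
    )
    parser.add_argument("output_fasta", help="path to the output FASTA file")
    parser.add_argument(
        "--longest",
        action="store_true",
        help="only output the longest sequence for each gene match",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the example
# can be run as-is; the file names are placeholders.
args = build_parser().parse_args(
    ["input.fasta", "genes.txt", "output.fasta", "--longest"]
)
print(args.input_fasta, args.gene_names_file, args.output_fasta, args.longest)
```

In a complete script, the parsed paths would typically be handed to Biopython, for example `SeqIO.parse(args.input_fasta, "fasta")` to read records and `SeqIO.write(records, args.output_fasta, "fasta")` to save the filtered set.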