SeqBuddy - mendessoares/BuddySuite GitHub Wiki

___

## A friend to take care of your sequence files

SeqBuddy is a command line program and Python3 API for quickly and easily reading, writing, analyzing, and manipulating sequence files in common formats, including FASTA, GenBank, and NEXUS. The emphasis is on simplicity and interoperability: formats are detected automatically, and input can be file paths, file handles, pipes, or even plain text typed right into your terminal window.

The SeqBuddy tools can be broadly grouped into two classes: tools that manipulate your data and return a new sequence file, and tools that perform some analysis and return a non-sequence result. Each of the 50+ tools currently implemented in the command line UI is documented in these wiki pages, including use cases that demonstrate the tools in action. The flags are meant to be intuitive, and care has been taken to minimize the number of positional arguments to keep the learning curve as shallow as possible.
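To give a feel for what automatic format detection involves, here is a minimal Python sketch that guesses a format from the first non-blank line of some pasted text. It is an illustration only, not SeqBuddy's detection logic: the function name `sniff_format` is hypothetical, and the heuristic relies solely on the conventional leading tokens of the three formats named above (`>` for FASTA, `LOCUS` for GenBank, `#NEXUS` for NEXUS).

```python
def sniff_format(text):
    """Guess a sequence format from the first non-blank line of `text`.

    Hypothetical helper for illustration; SeqBuddy's real detection may differ.
    Returns "fasta", "genbank", "nexus", or None if nothing matches.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("LOCUS"):
            return "genbank"
        if stripped.upper().startswith("#NEXUS"):
            return "nexus"
        return None  # first non-blank line matched no known format
    return None  # empty input


print(sniff_format(">seq1\nATGC\n"))  # prints: fasta
```

A real detector has to handle many more formats and malformed input, which is one reason the `-f` flag exists for forcing a format.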

## Command line sequence manipulation tools

### Functions

| Function | Flag | Parameters | Brief Description |
| --- | --- | --- | --- |
| annotate | -ano | `<name> <location> [strand] [qualifiers] [regex_pattern]` | Add a feature (annotation) to selected sequences. |
| ave_seq_length | -asl | `['clean']` | Find the average length of all sequences in an input file |
| back_translate | -btr | None | Convert amino acid sequences into codons. Select mode/species with -p flag `[{'random', 'optimized'}] [{'human', 'mouse', 'yeast', 'ecoli'}]` |
| bl2seq | -bl2s | None | All-by-all blast among sequences using bl2seq. Only returns the top hit from each search |
| blast | -bl | `<BLAST database>` | BLAST your sequence file using common blast settings, return the hits from blastdb |
| clean_seq | -cs | `['strict'] [replacement character]` | Strip out non-sequence characters, such as stops (`*`) and gaps (`-`) |
| complement | -cmp | None | Return complement of nucleotide sequence |
| concat_seqs | -cts | `['clean']` | Concatenate a bunch of sequences into a single solid string |
| count_codons | -cc | `['concatenate']` | Return codon frequency statistics. |
| count_residues | -cr | None | Generate a table of sequence compositions. |
| delete_features | -df | `<regex> [regex ...]` | Remove specified features from all records |
| delete_large | -dlg | `<threshold (int)>` | Delete sequences with length above threshold |
| delete_metadata | -dm | None | Remove meta-data from file (only IDs are retained) |
| delete_records | -dr | `<regex> [regex ...] [path] [cols (int)]` | Remove records from a file (deleted IDs are sent to stderr) |
| delete_repeats | -drp | `[scope {'all', 'ids', 'seqs'}] [columns (int)]` | Strip out repeat records (ids and/or identical sequences) |
| delete_small | -dsm | `<threshold (int)>` | Delete sequences with length below threshold |
| extract_regions | -er | `<positions (str)> [positions] ...` | Pull out sub-sequences |
| find_CpG | -fcpg | None | Predict regions under strong purifying selection based on high CpG content |
| find_pattern | -fp | `<regex> [regex ...] ['ambig']` | Search for sub-sequences, returning match start positions. |
| find_repeats | -frp | `[columns (int)]` | Identify whether a file contains repeat sequences and/or sequence ids |
| find_restriction_sites | -frs | `[enzymes {'commercial', 'all', <specific>} ...] [min cuts (int)] [max cuts (int)] [order {'position', 'alpha'}]` | Returns a dictionary of all of the restriction sites and their indices for each sequence in the file |
| group_by_prefix | -gbp | `[Split Pattern [Split pattern ...]] [length (int)] [out dir]` | Sort sequences into separate files based on prefix |
| group_by_regex | -stf | `<regex> [regex ...] [Out dir (path)]` | Group sequences by ID into new files based on some search criteria |
| guess_alphabet | -ga | None | Return the alphabet type found in the input file |
| guess_format | -gf | None | Guess the flatfile format of the input file |
| hash_seq_ids | -hsi | `[hash length (int)]` | Rename all identifiers to random hashes |
| insert_seq | -is | `<sequence> <location {front, rear, index (int)}>` | Insert a sequence at the desired location |
| isoelectric_point | -ip | None | Calculate isoelectric points |
| list_features | -lf | None | Print a pretty list of sequence annotations |
| list_ids | -li | `[columns (int)]` | Output list of sequence identifiers in one (default) or more columns |
| lowercase | -lc | None | Convert sequences to lowercase |
| make_ids_unique | -miu | `[separator (string)] [padding (int)]` | Add a number at the end of replicate ids to make them unique |
| map_features_nucl2prot | -fn2p | None | Transfer annotations from cDNA/mRNA sequences onto protein sequences |
| map_features_prot2nucl | -fp2n | None | Transfer annotations from protein sequences onto cDNA/mRNA sequences |
| merge | -mrg | None | Group multiple sequence files together |
| molecular_weight | -mw | None | Computes the molecular weight of each sequence |
| num_seqs | -ns | None | Counts how many sequences are present |
| order_features_alphabetically | -ofa | `['rev']` | Change the output order of sequence features, sorted alphabetically |
| order_features_by_position | -ofp | `['rev']` | Change the output order of sequence features, based on sequence position |
| order_ids | -oi | `['rev']` | Sort all sequences by id in alpha-numeric order (reverse with 'rev') |
| order_ids_randomly | -oir | None | Randomly reorder the position of records in the file |
| pull_random_record | -prr | `[number (int)]` | Extract random sequence(s) |
| pull_records | -pr | `<regex> [regex ...] ['full'] [path]` | Get all the records with ids containing a given string |
| pull_record_ends | -pre | `<amount (int)>` | Get the ends of all sequences |
| purge | -prg | `<Max BLAST bit-score (int)>` | Delete sequences with high similarity |
| rename_ids | -ri | `<regex> <subs (str)> [num] ['store']` | Replace a pattern in IDs with a new string |
| replace_subseq | -rs | `<regex> [regex ...] [replacement]` | Replace a sequence pattern with something new |
| reverse_complement | -rc | None | Return reverse complement of nucleotide sequences |
| reverse_transcribe | -r2d | None | Convert RNA sequences to DNA |
| screw_formats | -sf | `<new format>` | Change the file format to something else |
| select_frame | -sfr | `<frame {1, 2, 3}>` | Change the reading frame of sequences by deleting characters from the front |
| shuffle_seqs | -ss | None | Randomly reorder primary sequence |
| translate | -tr | None | Convert coding sequences into amino acid sequences |
| translate6frames | -tr6 | None | Translate nucleotide sequences into all six reading frames |
| transcribe | -d2r | None | Convert DNA sequences to RNA |
| transmembrane_domains | -tmd | `[Job ID]` | Identify and annotate transmembrane domains using the TOPCONS web service |
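
For readers unfamiliar with what a few of the nucleotide tools listed above compute, the following plain-Python sketch mirrors the ideas behind `complement`, `reverse_complement`, `transcribe`, and `reverse_transcribe`. It is not SeqBuddy's implementation and does not use its API; the function bodies are an independent illustration that operates on bare strings and only handles the unambiguous bases (any other character passes through unchanged).

```python
# Translation table for the unambiguous DNA bases, upper- and lowercase.
# Characters not listed here (e.g. N or gaps) are left untouched.
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return seq.translate(_DNA_COMPLEMENT)


def reverse_complement(seq):
    """Complement the sequence, then reverse it to read the opposite strand."""
    return complement(seq)[::-1]


def transcribe(seq):
    """DNA to RNA: replace thymine with uracil, preserving case."""
    return seq.replace("T", "U").replace("t", "u")


def reverse_transcribe(seq):
    """RNA to DNA: replace uracil with thymine, preserving case."""
    return seq.replace("U", "T").replace("u", "t")


print(complement("ATGC"))          # prints: TACG
print(reverse_complement("ATGC"))  # prints: GCAT
print(transcribe("ATGC"))          # prints: AUGC
```

The command line tools work on whole sequence files and keep record IDs and annotations alongside the sequences, which this string-level sketch ignores.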

### Modifying flags

| Flag | Brief Description |
| --- | --- |
| -f, --format | Force read a specific BioPython format. This may allow you to use some of the tools on some formats not auto-read by SeqBuddy (no promises) |
| -i, --in_place | Rewrites the FIRST input file with the final output. Be careful! |
| -k, --keep_temp | Specify a directory to store files produced by an alignment tool (for generate_alignment) |
| -o, --out_format | Specify the format you want the output returned in |
| -q, --quiet | Suppress stderr messages (not fully implemented yet) |
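
As a rough illustration of how a function flag and the modifying flags might be combined, the sketch below assembles a command line in Python without running it. Several details are assumptions rather than facts from this page: the executable name `seqbuddy`, the argument order (input files first, then the tool flag and its parameters), the helper name `build_command`, the file name `my_seqs.gb`, and `fasta` as an accepted value for `-o`. Check the individual tool pages for the exact invocation.

```python
import shlex


def build_command(inputs, tool_flag, params=(), out_format=None,
                  in_place=False, quiet=False, executable="seqbuddy"):
    """Assemble an argument list for a SeqBuddy-style call (hypothetical helper).

    Assumes the layout: executable, input file(s), tool flag, tool parameters,
    then any modifying flags. Nothing is executed here.
    """
    cmd = [executable, *inputs, tool_flag, *(str(p) for p in params)]
    if out_format:
        cmd += ["-o", out_format]  # -o / --out_format from the table above
    if in_place:
        cmd.append("-i")           # -i rewrites the FIRST input file
    if quiet:
        cmd.append("-q")           # -q suppresses stderr messages
    return cmd


# Drop sequences shorter than 100 residues and request FASTA output.
cmd = build_command(["my_seqs.gb"], "-dsm", params=[100], out_format="fasta")
print(shlex.join(cmd))  # prints: seqbuddy my_seqs.gb -dsm 100 -o fasta
```

Building the argument list separately makes it easy to inspect a command before handing it to `subprocess.run`, which matters when `-i` is involved because it overwrites the first input file.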