SB Find pattern - mendessoares/BuddySuite GitHub Wiki
--find_pattern, -fp
Description
Search for all occurrences of one or more sub-sequences within the input sequences. The start positions of all matches are written to stderr. Depending on the output format selected, the matches are either shown directly in the sequences in UPPERCASE (non-matching sequence is in lowercase) or recorded as annotated 'match' features (GenBank and EMBL formats).
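The behaviour described above can be approximated in a few lines of Python. This is only a minimal sketch built on the standard `re` module, not SeqBuddy's actual implementation; the function name `find_pattern` is hypothetical, and the use of 0-based start positions and non-overlapping matches is an assumption.

```python
import re


def find_pattern(seq, pattern):
    """Return (start_positions, marked_sequence) for one pattern.

    Assumptions (not confirmed by the SeqBuddy source): positions are
    0-based, matching is case insensitive, and overlapping matches are
    not reported because re.finditer() scans for non-overlapping hits.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    marked = list(seq.lower())          # non-matching residues stay lowercase
    starts = []
    for match in regex.finditer(seq):
        starts.append(match.start())
        for i in range(match.start(), match.end()):
            marked[i] = marked[i].upper()   # matched residues become UPPERCASE
    return starts, "".join(marked)


starts, marked = find_pattern("MAAVLLLKT", "lll")
print(starts)   # [4]
print(marked)   # maavLLLkt
```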
Arguments
One or more sequence patterns ( str )
Simple strings (case insensitive) are acceptable input, but regular expressions are also understood for more advanced searches.
ambig ( exact string )
This feature will be released with V1.2. It is currently available in the development branch of the git repo.
Optional: Both nucleotide and protein sequences have ambiguity codes (see below), which can be used in place of (or in combination with) regular expressions if desired. SeqBuddy treats all characters as literal by default, which means the regular expression ATGN{194,1994}(TGA|TAG|TAA) will look for a start codon, followed by 194 to 1994 literal 'N' characters, followed by a stop codon. In this case it would probably make more sense to write the 'N' character as [ATCG], which would match any open reading frame between 200 and 2000 residues long. Simply pass in the argument 'ambig' to allow ambiguous characters to represent any of their subset of residues (see example 3).
Nucleotide Code:  Bases:
----------------  -----
R.................A or G
Y.................C or T/U
S.................G or C
W.................A or T/U
K.................G or T/U
M.................A or C
B.................C or G or T/U
D.................A or G or T/U
H.................A or C or T/U
V.................A or C or G
N/X...............any base
Amino Acid Code:  Three letter Code:  Amino Acids:
----------------  ------------------  -----------
B.................Asx.................Aspartic acid or Asparagine
Z.................Glx.................Glutamine or Glutamic acid
X.....................................Any amino acid
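One way to picture what 'ambig' does is as a rewrite of the pattern in which each ambiguity code becomes a regular-expression character class. The sketch below is an illustration under that assumption and is not taken from the SeqBuddy source; the names `NUCLEOTIDE_CODES`, `PROTEIN_CODES`, and `expand_ambiguity` are hypothetical, and the 20-letter expansion of protein 'X' is an assumption. The two tables are transcribed from the lists above.

```python
import re

# Transcribed from the ambiguity tables above (T/U both included).
NUCLEOTIDE_CODES = {
    "R": "AG", "Y": "CTU", "S": "GC", "W": "ATU", "K": "GTU", "M": "AC",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG",
    "N": "ACGTU", "X": "ACGTU",
}
# Assumption: 'X' expands to the 20 standard amino acids.
PROTEIN_CODES = {"B": "DN", "Z": "QE", "X": "ACDEFGHIKLMNPQRSTVWY"}


def expand_ambiguity(pattern, codes):
    """Replace each ambiguity code in `pattern` with a character class.

    Limitation of this sketch: backslash escapes such as \\S or \\D are
    not recognised, so their letters would be expanded as well.
    """
    expanded = []
    in_class = False                    # True while inside [...]
    for char in pattern:
        if char == "[":
            in_class = True
        elif char == "]":
            in_class = False
        residues = codes.get(char.upper())
        if residues is None:
            expanded.append(char)       # ordinary residue or regex syntax
        elif in_class:
            expanded.append(residues)   # already bracketed, add bare residues
        else:
            expanded.append("[" + residues + "]")
    return "".join(expanded)


print(expand_ambiguity("ATGN{194,1994}(TGA|TAG|TAA)", NUCLEOTIDE_CODES))
# ATG[ACGTU]{194,1994}(TGA|TAG|TAA)
print(expand_ambiguity("[bz]x{50,100}[bz]", PROTEIN_CODES))
# [DNQE][ACDEFGHIKLMNPQRSTVWY]{50,100}[DNQE]
```

The expanded string can then be compiled with `re.IGNORECASE` and searched like any other pattern.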
Examples
Input file: Drosophila.fa
>Dme-Panxδ3
GFIKIDNMVFRCHYRITAILFTCCIIVTANNLIGDPISCIIPMHVINTFCWITYTYTVAG
PGLEKHSYYQWVPFVLFFQGLMFYVPHWVWKMDGKIRMITGVDDRDRILKYFVNNTHNGY
SFYFFCELLNFINVIVNIFMVDKFLGGAFMSYGTDVLKFSNMDQDRFDPMIEIFPRLTKC
TFHKFGPSGSVQKHDTLCVLALNILNEKIYIFLWFWFIILATISGVAVLYSVVITRTIRK
EGDFLILHFLSQNLSTRSYSDMLQ
>Dme-Panxδ2
MDVFGSVKGLLKIDQVDNNVFRMHYKATVIILIAFSLLVTSRQYIGDPIDCIVEIPLGVM
DTYCWIYSTFTVPEGRDVQPGSEKYHKYYQWVCFVLFFQAILFYVPRYLWKSWEGGRLKM
LVDLSVNDKDRKIVDYFGNLNRHNFYAFFFVCEALNFVNVIGQIYFVDFFLDGEFSTYGS
DVLKFTELEPDERIDPMARVFPKVTKCTFHKYGPSGSVQTHDGLCVLPLNIVNEKIYVFL
WFWFIILSIMSISLIYRIAVAPKLRHLLLRARSRAESEVEVAIGDWFLLYQLGKNIDPLI
YKEVISDLEMG
>Dme-Panxδ4
MAAVKPLSKYLQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPIQCFGDKDMDA
FCWIYGAYLQCAVSKVVENYITYYQWVVLVLLLESFVFYMPAFLWKIWEGGRLKHLCDFK
RTHRVLVNYFETHFRYFVYVFCEILNLSISILNFLLLDVFFGGFWGRYRNALYNQWIAVF
PKCAKCEYKGGPSGSSNIYDYLCLLPLNILNEKIFAFLWIWFILAMLISLKFLYRLAVLY
PMRLQLLRPKKHLQVALNCSFGDWFVLMRVGNNISPELFRKLLEEL
Usage example 1
$: sb Drosophila.fa -fp "LLL" "LLY"
Output
#### 5 matches found across 4 sequences for pattern 'LLL' ####
Dme-Panxδ3: None
Dme-Panxδ2: 266
Dme-Panxδ4: 31, 90, 154
#### 1 matches found across 4 sequences for pattern 'LLY' ####
Dme-Panxδ3: None
Dme-Panxδ2: 285
Dme-Panxδ4: None
>Dme-Panxδ3
gfikidnmvfrchyritailftcciivtannligdpisciipmhvintfcwitytytvag
pglekhsyyqwvpfvlffqglmfyvphwvwkmdgkirmitgvddrdrilkyfvnnthngy
sfyffcellnfinvivnifmvdkflggafmsygtdvlkfsnmdqdrfdpmieifprltkc
tfhkfgpsgsvqkhdtlcvlalnilnekiyiflwfwfiilatisgvavlysvvitrtirk
egdflilhflsqnlstrsysdmlq
>Dme-Panxδ2
mdvfgsvkgllkidqvdnnvfrmhykatviiliafsllvtsrqyigdpidciveiplgvm
dtycwiystftvpegrdvqpgsekyhkyyqwvcfvlffqailfyvprylwksweggrlkm
lvdlsvndkdrkivdyfgnlnrhnfyafffvcealnfvnvigqiyfvdffldgefstygs
dvlkftelepderidpmarvfpkvtkctfhkygpsgsvqthdglcvlplnivnekiyvfl
wfwfiilsimsisliyriavapklrLLLarsraesevevaigdwLLYlgknidpliykev
isdlemg
>Dme-Panxδ4
maavkplskylqfkvhiydaiftlhskvtvLLLctfllsskqyfgdpiqcfgdkdmdafc
wiygaylqcavskvvenyityyqwvvlLLLsfvfympaflwkiweggrlkhlcdfkrthr
vlvnyfethfryfvyvfceilnlsisilnLLLvffggfwgryrnalynqwiavfpkcakc
eykggpsgssniydylcllplnilnekifaflwiwfilamlislkflyrlavlypmrlql
lrpkkhlqvalncsfgdwfvlmrvgnnispelfrklleel
Usage example 2
$: sb Drosophila.fa -fp "[LIY]{3}" -o genbank
Output
#### 17 matches found across 4 sequences for pattern '[LIY]{3}' ####
Dme-Panxδ3: 208, 217, 244
Dme-Panxδ2: 29, 244, 253, 266, 287, 298
Dme-Panxδ4: 31, 90, 154
LOCUS Dme-Panxδ3 258 aa UNK 01-JAN-1980
DEFINITION Dme-Panxδ3
ACCESSION Dme-Panxδ3
VERSION Dme-Panxδ3
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
match 209..211
/added_by="SeqBuddy"
/regex="[LIY]{3}"
match 218..220
/added_by="SeqBuddy"
/regex="[LIY]{3}"
match 245..247
/added_by="SeqBuddy"
/regex="[LIY]{3}"
ORIGIN
1 gfikidnmvf rchyritail ftcciivtan nligdpisci ipmhvintfc witytytvag
61 pglekhsyyq wvpfvlffqg lmfyvphwvw kmdgkirmit gvddrdrilk yfvnnthngy
121 sfyffcelln finvivnifm vdkflggafm sygtdvlkfs nmdqdrfdpm ieifprltkc
181 tfhkfgpsgs vqkhdtlcvl alnilneiyi lwfwiiltis gvavlysvvi trtirkegdl
241 ilflsqnlst rsysdmlq
//
LOCUS Dme-Panxδ2 299 aa UNK 01-JAN-1980
DEFINITION Dme-Panxδ2
ACCESSION Dme-Panxδ2
VERSION Dme-Panxδ2
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
match 30..32
/added_by="SeqBuddy"
/regex="[LIY]{3}"
match 245..247
/added_by="SeqBuddy"
/regex="[LIY]{3}"
match 254..256
/added_by="SeqBuddy"
/regex="[LIY]{3}"
match 267..269
/added_by="SeqBuddy"
/regex="[LIY]{3}"
match 288..290
/added_by="SeqBuddy"
/regex="[LIY]{3}"
match 299..301
/added_by="SeqBuddy"
/regex="[LIY]{3}"
ORIGIN
1 mdvfgsvkgl lkidqvdnnv frmhykatii lafsllvtsr qyigdpidci veiplgvmdt
61 ycwiystftv pegrdvqpgs ekyhkyyqwv cfvlffqail fyvprylwks weggrlkmlv
121 dlsvndkdrk ivdyfgnlnr hnfyafffvc ealnfvnvig qiyfvdffld gefstygsdv
181 lkftelepde ridpmarvfp kvtkctfhky gpsgsvqthd glcvlplniv nekiyvflwf
241 wiilimsili yiavapklrl llarsraese vevaigdwll ylgknidliy evisdlemg
//
LOCUS Dme-Panxδ4 280 aa UNK 01-JAN-1980
DEFINITION Dme-Panxδ4
ACCESSION Dme-Panxδ4
VERSION Dme-Panxδ4
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
match 32..34
/added_by="SeqBuddy"
/regex="[LIY]{3}"
match 91..93
/added_by="SeqBuddy"
/regex="[LIY]{3}"
match 155..157
/added_by="SeqBuddy"
/regex="[LIY]{3}"
ORIGIN
1 maavkplsky lqfkvhiyda iftlhskvtv lllctfllss kqyfgdpiqc fgdkdmdafc
61 wiygaylqca vskvvenyit yyqwvvllll sfvfympafl wkiweggrlk hlcdfkrthr
121 vlvnyfethf ryfvyvfcei lnlsisilnl llvffggfwg ryrnalynqw iavfpkcakc
181 eykggpsgss niydylcllp lnilnekifa flwiwfilam lislkflyrl avlypmrlql
241 lrpkkhlqva lncsfgdwfv lmrvgnnisp elfrklleel
//
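Comparing the two parts of the output above, each position in the stderr summary is one less than the start of the corresponding 'match' feature (for example, 208 versus 209..211). That is consistent with 0-based start positions in the summary and the 1-based, inclusive coordinates that GenBank feature tables use, although this page does not state so explicitly. A small sketch of the conversion, using a hypothetical helper name:

```python
def genbank_location(start, length):
    """Convert a 0-based match start and a match length into a GenBank
    style 1-based, inclusive location string.

    Assumption: the positions in the stderr summary are 0-based.
    """
    first = start + 1            # shift from 0-based to 1-based
    last = start + length        # inclusive end in 1-based coordinates
    return f"{first}..{last}"


print(genbank_location(208, 3))  # 209..211
```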
Usage example 3
Include the argument ambig to search with IUPAC ambiguity codes instead of literal letters.
$: sb Drosophila.fa -fp "[bz]x{50,100}[bz]" "ambig"
Output
#### 7 matches found across 3 sequences for pattern '[bz]x{50,100}[bz]' ####
Dme-Panxδ3: 5, 113
Dme-Panxδ2: 1, 113, 218
Dme-Panxδ4: 11, 117
>Dme-Panxδ3
gfikiDNMVFRCHYRITAILFTCCIIVTANNLIGDPISCIIPMHVINTFCWITYTYTVAG
PGLEKHSYYQWVPFVLFFQGLMFYVPHWVWKMDGKIRMITGVDDRDrilkyfvNNTHNGY
SFYFFCELLNFINVIVNIFMVDKFLGGAFMSYGTDVLKFSNMDQDRFDPMIEIFPRLTKC
TFHKFGPSGSVQKHDTLCVLALNILNEkiyiflwfwfiilatisgvavlysvvitrtirk
egdflilhflsqnlstrsysdmlq
>Dme-Panxδ2
mDVFGSVKGLLKIDQVDNNVFRMHYKATVIILIAFSLLVTSRQYIGDPIDCIVEIPLGVM
DTYCWIYSTFTVPEGRDVQPGSEKYHKYYQWVCFVLFFQailfyvprylwkswEGGRLKM
LVDLSVNDKDRKIVDYFGNLNRHNFYAFFFVCEALNFVNVIGQIYFVDFFLDGEFSTYGS
DVLKFTELEPDERIDpmarvfpkvtkctfhkygpsgsvQTHDGLCVLPLNIVNEKIYVFL
WFWFIILSIMSISLIYRIAVAPKLRHLLLRARSRAESEVEVAIGDWFLLYQLGKNIDPLI
YKEVISDLEmg
>Dme-Panxδ4
maavkplskylQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPIQCFGDKDMDA
FCWIYGAYLQCAVSKVVENYITYYQWVVLVLLLESFVFYMPAFLWKIWEggrlkhlcDFK
RTHRVLVNYFETHFRYFVYVFCEILNLSISILNFLLLDVFFGGFWGRYRNALYNQWIAVF
PKCAKCEYKGGPSGSSNIYDYLCLLPLNILNEkifaflwiwfilamlislkflyrlavly
pmrlqllrpkkhlqvalncsfgdwfvlmrvgnnispelfrklleel