SB Find pattern - mendessoares/BuddySuite GitHub Wiki

--find_pattern, -fp

Description

Search for all occurrences of one or more sub-sequences within the input sequences. The start positions of all matches are written to stderr. Depending on the output format selected, the matches are either represented directly in the sequences using UPPERCASE (non-matching sequence is written in lowercase) or recorded as annotated 'match' features (GenBank and EMBL formats).

Arguments

One or more sequence patterns ( str )

Simple strings (case insensitive) are acceptable input, but regular expressions are also understood for more advanced searches.
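
As a rough illustration of what such a search involves, the sketch below uses Python's standard `re` module to collect match start positions and to mark matches in uppercase. It is not SeqBuddy's implementation; the function name `mark_matches` is hypothetical. Python reports zero-based start positions, which appears to agree with the positions shown in the examples on this page.

```python
import re


def mark_matches(sequence, pattern):
    """Return (start positions, sequence with matches in UPPERCASE).

    Hypothetical helper for illustration only; not part of SeqBuddy.
    """
    regex = re.compile(pattern, re.IGNORECASE)  # simple strings are valid regexes
    starts = []
    marked = sequence.lower()  # non-matching residues stay lowercase
    for match in regex.finditer(sequence):
        starts.append(match.start())  # zero-based start of this match
        marked = (marked[:match.start()]
                  + match.group().upper()
                  + marked[match.end():])
    return starts, marked


starts, marked = mark_matches("MKVALLLACTFLLY", "[liy]{3}")
print(starts)  # [4, 11]
print(marked)  # mkvaLLLactfLLY
```

Because the pattern is compiled as a regular expression either way, a plain string such as "LLL" and a character class such as "[LIY]{3}" go through the same code path.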

ambig ( exact string )

This feature will be released with V1.2. It is currently available in the development branch of the Git repository.

Optional: Both nucleotide and protein sequences have ambiguity codes (see below), which can be used in place of (or in combination with) regular expressions if desired. SeqBuddy treats all characters as literal by default, which means the regular expression ATGN{194,1994}(TGA|TAG|TAA) will look for a start codon, followed by 194 to 1994 literal 'N' characters, followed by a stop codon. In this case it would probably make more sense to write the 'N' character as [ATCG], which would match any open reading frame between 200 and 2000 residues long. Simply pass in the argument 'ambig' to allow each ambiguous character to represent any residue in its subset (see example 3).
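
The difference between a literal 'N' and the class [ATCG], and the arithmetic behind the 200 to 2000 residue range, can be checked with a scaled-down pattern. This is a minimal sketch using Python's `re` module directly; the names `orf_length_bounds`, `literal_n` and `any_base` are hypothetical, and the quantifier is shortened to {3,6} so that a toy sequence is enough.

```python
import re

START, STOP = "ATG", "(TGA|TAG|TAA)"


def orf_length_bounds(min_inner, max_inner):
    """Total match length = start codon (3) + inner residues + stop codon (3)."""
    return 3 + min_inner + 3, 3 + max_inner + 3


# 'N' is an ordinary character here, so it only matches the letter N.
literal_n = re.compile(START + "N{3,6}" + STOP)
# The character class matches any one of the four bases.
any_base = re.compile(START + "[ATCG]{3,6}" + STOP)

sequence = "CCATGGCATAGCC"
print(orf_length_bounds(194, 1994))      # (200, 2000)
print(literal_n.search(sequence))        # None
print(any_base.search(sequence).span())  # (2, 11)
```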

Nucleotide Code:  Bases:
----------------  -----
R.................A or G
Y.................C or T/U
S.................G or C
W.................A or T/U
K.................G or T/U
M.................A or C
B.................C or G or T/U
D.................A or G or T/U
H.................A or C or T/U
V.................A or C or G
N/X...............any base

Amino Acid Code:  Three letter Code:  Amino Acids:
----------------  ------------------  -----------
B.................Asx.................Aspartic acid or Asparagine
Z.................Glx.................Glutamine or Glutamic acid
X.....................................Any amino acid
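
One way to picture what 'ambig' does is as a rewrite of the pattern in which each ambiguity code becomes a character class built from the tables above. The sketch below is an assumption about the general idea, not SeqBuddy's code: `expand_ambiguity`, `DNA_CODES` and `PROTEIN_CODES` are hypothetical names, and letting a code also match its own letter and U is a choice made here for illustration. Letters inside {m,n} quantifiers, after a backslash, or next to a '-' inside a class are left untouched.

```python
import re

# Hypothetical lookup tables transcribed from the ambiguity tables above.
DNA_CODES = {
    "R": "AG", "Y": "CTU", "S": "GC", "W": "ATU", "K": "GTU", "M": "AC",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG",
    "N": "ACGTU", "X": "ACGTU",
}
PROTEIN_CODES = {"B": "DN", "Z": "EQ", "X": "A-Z"}


def expand_ambiguity(pattern, codes):
    """Rewrite ambiguity codes in a regex as explicit character classes."""
    out = []
    in_class = in_braces = False
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "\\":
            out.append(pattern[i:i + 2])  # keep escapes such as \d untouched
            i += 2
            continue
        if char in "[]":
            in_class = char == "["
        elif char in "{}":
            in_braces = char == "{"
        # A letter that is an endpoint of a range like [A-K] is left alone.
        neighbours = pattern[max(i - 1, 0):i] + pattern[i + 1:i + 2]
        in_range = in_class and "-" in neighbours
        members = codes.get(char.upper())
        if members and not in_braces and not in_range:
            members = char.upper() + members  # the code letter matches itself too
            out.append(members if in_class else "[" + members + "]")
        else:
            out.append(char)
        i += 1
    return "".join(out)


print(expand_ambiguity("ATGN{194,1994}(TGA|TAG|TAA)", DNA_CODES))
# ATG[NACGTU]{194,1994}(TGA|TAG|TAA)
print(expand_ambiguity("[bz]x{50,100}[bz]", PROTEIN_CODES))
# [BDNZEQ][XA-Z]{50,100}[BDNZEQ]
```

The expanded pattern would then be compiled with `re.IGNORECASE`, as for any other search.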

Examples

Input file: Drosophila.fa

>Dme-Panxδ3
GFIKIDNMVFRCHYRITAILFTCCIIVTANNLIGDPISCIIPMHVINTFCWITYTYTVAG
PGLEKHSYYQWVPFVLFFQGLMFYVPHWVWKMDGKIRMITGVDDRDRILKYFVNNTHNGY
SFYFFCELLNFINVIVNIFMVDKFLGGAFMSYGTDVLKFSNMDQDRFDPMIEIFPRLTKC
TFHKFGPSGSVQKHDTLCVLALNILNEKIYIFLWFWFIILATISGVAVLYSVVITRTIRK
EGDFLILHFLSQNLSTRSYSDMLQ
>Dme-Panxδ2
MDVFGSVKGLLKIDQVDNNVFRMHYKATVIILIAFSLLVTSRQYIGDPIDCIVEIPLGVM
DTYCWIYSTFTVPEGRDVQPGSEKYHKYYQWVCFVLFFQAILFYVPRYLWKSWEGGRLKM
LVDLSVNDKDRKIVDYFGNLNRHNFYAFFFVCEALNFVNVIGQIYFVDFFLDGEFSTYGS
DVLKFTELEPDERIDPMARVFPKVTKCTFHKYGPSGSVQTHDGLCVLPLNIVNEKIYVFL
WFWFIILSIMSISLIYRIAVAPKLRHLLLRARSRAESEVEVAIGDWFLLYQLGKNIDPLI
YKEVISDLEMG
>Dme-Panxδ4
MAAVKPLSKYLQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPIQCFGDKDMDA
FCWIYGAYLQCAVSKVVENYITYYQWVVLVLLLESFVFYMPAFLWKIWEGGRLKHLCDFK
RTHRVLVNYFETHFRYFVYVFCEILNLSISILNFLLLDVFFGGFWGRYRNALYNQWIAVF
PKCAKCEYKGGPSGSSNIYDYLCLLPLNILNEKIFAFLWIWFILAMLISLKFLYRLAVLY
PMRLQLLRPKKHLQVALNCSFGDWFVLMRVGNNISPELFRKLLEEL

Usage example 1

$: sb Drosophila.fa -fp "LLL" "LLY"

Output

#### 4 matches found across 3 sequences for pattern 'LLL' ####
Dme-Panxδ3: None
Dme-Panxδ2: 266
Dme-Panxδ4: 31, 90, 154

#### 1 matches found across 3 sequences for pattern 'LLY' ####
Dme-Panxδ3: None
Dme-Panxδ2: 287
Dme-Panxδ4: None

>Dme-Panxδ3
gfikidnmvfrchyritailftcciivtannligdpisciipmhvintfcwitytytvag
pglekhsyyqwvpfvlffqglmfyvphwvwkmdgkirmitgvddrdrilkyfvnnthngy
sfyffcellnfinvivnifmvdkflggafmsygtdvlkfsnmdqdrfdpmieifprltkc
tfhkfgpsgsvqkhdtlcvlalnilnekiyiflwfwfiilatisgvavlysvvitrtirk
egdflilhflsqnlstrsysdmlq
>Dme-Panxδ2
mdvfgsvkgllkidqvdnnvfrmhykatviiliafsllvtsrqyigdpidciveiplgvm
dtycwiystftvpegrdvqpgsekyhkyyqwvcfvlffqailfyvprylwksweggrlkm
lvdlsvndkdrkivdyfgnlnrhnfyafffvcealnfvnvigqiyfvdffldgefstygs
dvlkftelepderidpmarvfpkvtkctfhkygpsgsvqthdglcvlplnivnekiyvfl
wfwfiilsimsisliyriavapklrhLLLrarsraesevevaigdwfLLYqlgknidpli
ykevisdlemg
>Dme-Panxδ4
maavkplskylqfkvhiydaiftlhskvtvaLLLactfllsskqyfgdpiqcfgdkdmda
fcwiygaylqcavskvvenyityyqwvvlvLLLesfvfympaflwkiweggrlkhlcdfk
rthrvlvnyfethfryfvyvfceilnlsisilnfLLLdvffggfwgryrnalynqwiavf
pkcakceykggpsgssniydylcllplnilnekifaflwiwfilamlislkflyrlavly
pmrlqllrpkkhlqvalncsfgdwfvlmrvgnnispelfrklleel

Usage example 2

$: sb Drosophila.fa -fp "[LIY]{3}" -o genbank

Output

#### 12 matches found across 3 sequences for pattern '[LIY]{3}' ####
Dme-Panxδ3: 208, 217, 244
Dme-Panxδ2: 29, 244, 253, 266, 287, 298
Dme-Panxδ4: 31, 90, 154

LOCUS       Dme-Panxδ3               264 aa                     UNK 01-JAN-1980
DEFINITION  Dme-Panxδ3
ACCESSION   Dme-Panxδ3
VERSION     Dme-Panxδ3
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     match           209..211
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
     match           218..220
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
     match           245..247
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
ORIGIN
        1 gfikidnmvf rchyritail ftcciivtan nligdpisci ipmhvintfc witytytvag
       61 pglekhsyyq wvpfvlffqg lmfyvphwvw kmdgkirmit gvddrdrilk yfvnnthngy
      121 sfyffcelln finvivnifm vdkflggafm sygtdvlkfs nmdqdrfdpm ieifprltkc
      181 tfhkfgpsgs vqkhdtlcvl alnilnekiy iflwfwfiil atisgvavly svvitrtirk
      241 egdflilhfl sqnlstrsys dmlq
//
LOCUS       Dme-Panxδ2               311 aa                     UNK 01-JAN-1980
DEFINITION  Dme-Panxδ2
ACCESSION   Dme-Panxδ2
VERSION     Dme-Panxδ2
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     match           30..32
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
     match           245..247
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
     match           254..256
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
     match           267..269
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
     match           288..290
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
     match           299..301
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
ORIGIN
        1 mdvfgsvkgl lkidqvdnnv frmhykatvi iliafsllvt srqyigdpid civeiplgvm
       61 dtycwiystf tvpegrdvqp gsekyhkyyq wvcfvlffqa ilfyvprylw ksweggrlkm
      121 lvdlsvndkd rkivdyfgnl nrhnfyafff vcealnfvnv igqiyfvdff ldgefstygs
      181 dvlkftelep deridpmarv fpkvtkctfh kygpsgsvqt hdglcvlpln ivnekiyvfl
      241 wfwfiilsim sisliyriav apklrhlllr arsraeseve vaigdwflly qlgknidpli
      301 ykevisdlem g
//
LOCUS       Dme-Panxδ4               286 aa                     UNK 01-JAN-1980
DEFINITION  Dme-Panxδ4
ACCESSION   Dme-Panxδ4
VERSION     Dme-Panxδ4
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     match           32..34
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
     match           91..93
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
     match           155..157
                     /added_by="SeqBuddy"
                     /regex="[LIY]{3}"
ORIGIN
        1 maavkplsky lqfkvhiyda iftlhskvtv alllactfll sskqyfgdpi qcfgdkdmda
       61 fcwiygaylq cavskvveny ityyqwvvlv lllesfvfym paflwkiweg grlkhlcdfk
      121 rthrvlvnyf ethfryfvyv fceilnlsis ilnfllldvf fggfwgryrn alynqwiavf
      181 pkcakceykg gpsgssniyd ylcllplnil nekifaflwi wfilamlisl kflyrlavly
      241 pmrlqllrpk khlqvalncs fgdwfvlmrv gnnispelfr klleel
//
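
Comparing the start positions reported above with the feature table suggests that the reported positions are zero-based, while GenBank feature locations are one-based and inclusive (a start of 208 corresponds to the location 209..211). The conversion can be sketched as follows; `genbank_location` is a hypothetical helper written for this page, not a SeqBuddy function.

```python
def genbank_location(start, match_length):
    """Convert a zero-based start and a match length into 'first..last'.

    GenBank locations count from 1 and include the last residue.
    """
    first = start + 1
    last = start + match_length
    return f"{first}..{last}"


for start in (208, 217, 244):
    print(start, "->", genbank_location(start, 3))
# 208 -> 209..211
# 217 -> 218..220
# 244 -> 245..247
```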

Usage example 3

Include the argument ambig to search with IUPAC ambiguity codes instead of literal letters.

$: sb Drosophila.fa -fp "[bz]x{50,100}[bz]" "ambig"

Output

#### 7 matches found across 3 sequences for pattern '[bz]x{50,100}[bz]' ####
Dme-Panxδ3: 5, 113
Dme-Panxδ2: 1, 113, 218
Dme-Panxδ4: 11, 117

>Dme-Panxδ3
gfikiDNMVFRCHYRITAILFTCCIIVTANNLIGDPISCIIPMHVINTFCWITYTYTVAG
PGLEKHSYYQWVPFVLFFQGLMFYVPHWVWKMDGKIRMITGVDDRDrilkyfvNNTHNGY
SFYFFCELLNFINVIVNIFMVDKFLGGAFMSYGTDVLKFSNMDQDRFDPMIEIFPRLTKC
TFHKFGPSGSVQKHDTLCVLALNILNEkiyiflwfwfiilatisgvavlysvvitrtirk
egdflilhflsqnlstrsysdmlq
>Dme-Panxδ2
mDVFGSVKGLLKIDQVDNNVFRMHYKATVIILIAFSLLVTSRQYIGDPIDCIVEIPLGVM
DTYCWIYSTFTVPEGRDVQPGSEKYHKYYQWVCFVLFFQailfyvprylwkswEGGRLKM
LVDLSVNDKDRKIVDYFGNLNRHNFYAFFFVCEALNFVNVIGQIYFVDFFLDGEFSTYGS
DVLKFTELEPDERIDpmarvfpkvtkctfhkygpsgsvQTHDGLCVLPLNIVNEKIYVFL
WFWFIILSIMSISLIYRIAVAPKLRHLLLRARSRAESEVEVAIGDWFLLYQLGKNIDPLI
YKEVISDLEmg
>Dme-Panxδ4
maavkplskylQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPIQCFGDKDMDA
FCWIYGAYLQCAVSKVVENYITYYQWVVLVLLLESFVFYMPAFLWKIWEggrlkhlcDFK
RTHRVLVNYFETHFRYFVYVFCEILNLSISILNFLLLDVFFGGFWGRYRNALYNQWIAVF
PKCAKCEYKGGPSGSSNIYDYLCLLPLNILNEkifaflwiwfilamlislkflyrlavly
pmrlqllrpkkhlqvalncsfgdwfvlmrvgnnispelfrklleel