AB Concatenate Alignments - mendessoares/BuddySuite GitHub Wiki
--concat_alignments, -cta
Description
Concatenates two or more alignments into a single alignment.
Records from each alignment are grouped together based on a shared identifier in their record IDs (e.g., an organism name), and each identifier may appear at most once in each alignment. As explained below, there are three ways to specify how sequences are grouped: auto-detection, a fixed-length prefix/suffix, or a regular expression.
Arguments
If you pass in no arguments, this tool will analyze the IDs of each sequence and select a prefix with the minimum length necessary to ensure unique identification within each alignment, and then use these prefixes to group records among alignments (see example 1).
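The auto-detection idea can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not AlignBuddy's actual code: the function name `shortest_unique_prefix_len` is hypothetical, and using a single prefix length shared by all alignments is an assumption.

```python
def shortest_unique_prefix_len(alignments):
    """Return the smallest n such that rec_id[:n] is unique within every alignment.

    `alignments` is a list of lists of record IDs (hypothetical input shape).
    """
    longest = max(len(rec_id) for ids in alignments for rec_id in ids)
    for n in range(1, longest + 1):
        # A length works if no two IDs in the same alignment share that prefix.
        if all(len({rec_id[:n] for rec_id in ids}) == len(ids) for ids in alignments):
            return n
    raise ValueError("IDs are not unique within at least one alignment")


alignments = [
    ["Bfo-Panxα1", "Hca-Panxα1", "Mle-Panxα1"],
    ["Bfo-Panxα4", "Hca-Panxα4", "Mle-Panxα4"],
    ["Bfo-Panxα8", "Hca-Panxα8", "Mle-Panxα8"],
]
n = shortest_unique_prefix_len(alignments)
print(n, [rec_id[:n] for rec_id in alignments[0]])
```

For the example input used on this page, a single character is enough to tell the three records in each alignment apart.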
Grouping pattern ( regex or int )
Optional. Passing in a positive integer will use a fixed-length prefix from each record ID to group sequences among alignments. If the defining string is at the end of each sequence ID, pass in a negative number to specify a fixed-length suffix. Alternatively, passing in a regular expression allows for very precise control of the groupings in cases where a simple prefix/suffix is insufficient.
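As a rough mental model, the three forms of the grouping pattern map onto string slicing and a regex search. The sketch below uses a hypothetical helper, `group_key`, and plain Python `re`; it is an approximation of the behavior described here, not the tool's implementation.

```python
import re


def group_key(rec_id, pattern):
    """Return the grouping key for rec_id, or None if a regex finds no match."""
    if isinstance(pattern, int):
        # Positive integer: fixed-length prefix; negative integer: fixed-length suffix.
        return rec_id[:pattern] if pattern > 0 else rec_id[pattern:]
    match = re.search(pattern, rec_id)
    return match.group(0) if match else None


print(group_key("Bfo-Panxα1", 3))                # prefix of length 3
print(group_key("Bfo-Panxα1", -2))               # suffix of length 2
print(group_key("Bfo-Panxα1", "[a-z]{2}-Panx"))  # regex match
```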
Alignment names ( regex or int )
Optional. AlignBuddy annotates the position of each subsequence onto the final concatenated sequence, and this information is written to rich formats like GenBank and EMBL. By default the original record ID is used as the annotation, although this can be overridden with a sub-identifier (specified by integer or regular expression) if you prefer. This works much like the grouping pattern described above, except that the sign of an integer is interpreted the opposite way (see example 6). Note that the order in which these arguments are passed matters, so you cannot specify an alignment name without first specifying a grouping pattern.
Examples
Input file: Panx_C-term.physr
3 62
Bfo-Panxα1 DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL--
Hca-Panxα1 --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS
Mle-Panxα1 DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--
3 68
Bfo-Panxα4 -----EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--
Hca-Panxα4 -------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG
Mle-Panxα4 GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---
3 61
Bfo-Panxα8 GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
Hca-Panxα8 -DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
Mle-Panxα8 ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
Usage example 1
Pass in zero arguments and AlignBuddy will detect the shortest possible identifier for each new concatenated sequence (in this case, "B", "H", and "M").
$: alb Panx_C-term.physr -cta
Output
3 191
B DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
H --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS-------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
M DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
Usage example 2
Group records by the three-letter prefix found in each ID by passing in a positive integer as the first argument.
$: alb Panx_C-term.physr -cta 3
Output
3 191
Bfo DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
Hca --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS-------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
Mle DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
Usage example 3
Use a regular expression to group records instead of a set-length prefix. Here, the two letter species code is the unique component of the IDs that groups are based on.
$: alb Panx_C-term.physr -cta "[a-z]{2}-Panx"
Output
3 191
fo-Panx DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
ca-Panx --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS-------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
le-Panx DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL--GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
Usage example 4
If the group pattern does not find a match in a given alignment, then gaps are filled in for that component of the concatenated alignment.
$: alb Panx_C-term.physr -cta ".*1|..."
Output
6 191
Bfo-Panxα1 DPHYKKVYYKIGTSGRVILNVLASSISPACFQEIMNNVCPRLIRAHVSRKGRNLGDDPNL-----------------------------------------------------------------------------------------------------------------------------------
Hca-Panxα1 --HYKKVYYKIGTSGRVILNVIASSIAPSAFQEIMNNVCPRLIRTHVSKKGRNLIDDPDLIS---------------------------------------------------------------------------------------------------------------------------------
Mle-Panxα1 DPHYKKVYYKIGTSGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL-----------------------------------------------------------------------------------------------------------------------------------
Bfo -------------------------------------------------------------------EIIQVMTDNTNPLFFSKIFNELTNLLIETSSDQAGKVVENLAMQG-DEDTIVDLDTSSSRT--GDSKLKYIYFNCGTTGRTYLHLIAKNINPRIFEQLIIKLKNDLVEEKNKQHLKQTK-EMPV
Hca ---------------------------------------------------------------------LQVLMANTHPVIFTRIFDELTFRLVTKASMD-CEAVKNLQAEGQIGETAIDLEPNLGKAVG-DNKLKYIYFNCGTTGRTYLHLIANNVNPRVFEQLVIRLSKDLVEEKNKAHLKKAEGEANV
Mle --------------------------------------------------------------GAGGREIVQILTDNSNPLLFSKIFDDLTNLLITTSKN--ADVIENLSKL---DSSVIELGSKDSI---ENSKLKFIYFNCGTTGRTYLHLIAKNVNPRIFEQLIIKLSADLVEEKNKQHLKGSK-DILV
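The gap-filling behavior can be illustrated with a small Python sketch. It assumes each alignment is a plain dict of record ID to aligned sequence (a hypothetical data shape), uses the hypothetical function name `concat`, and silently skips records that the regex does not match at all, which is a choice made for this sketch only.

```python
import re


def concat(alignments, pattern, gap="-"):
    """Concatenate alignments, padding with gaps where a group has no record."""
    widths = [len(next(iter(aln.values()))) for aln in alignments]
    keyed, order = [], []
    for aln in alignments:
        by_key = {}
        for rec_id, seq in aln.items():
            match = re.search(pattern, rec_id)
            if match is None:
                continue  # sketch-only choice: ignore unmatched records
            key = match.group(0)
            if key in by_key:
                raise ValueError(f"{key!r} occurs more than once in an alignment")
            by_key[key] = seq
            if key not in order:
                order.append(key)
        keyed.append(by_key)
    # A group missing from an alignment contributes a run of gaps of that alignment's width.
    return {
        key: "".join(by_key.get(key, gap * width) for by_key, width in zip(keyed, widths))
        for key in order
    }


toy = [
    {"Bfo-Panxα1": "DPH", "Hca-Panxα1": "--H"},
    {"Bfo-Panxα4": "EIIQ", "Hca-Panxα4": "-LQV"},
]
result = concat(toy, ".*1|...")
for key, seq in result.items():
    print(key, seq)
```

With the pattern `.*1|...`, IDs ending in `1` match in full while all other IDs fall back to their first three characters, so the first alignment never shares a group with the rest and both sides are padded with gaps.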
Usage example 5
The location of each component of the concatenated alignment is stored when using this tool, and will be annotated as a feature if outputting to a rich format like GenBank or EMBL.
$: alb Panx_C-term.physr -cta 3 -o genbank
Output
LOCUS Bfo 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Bfo
VERSION Bfo
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Bfo-Panxα1 1..62
Bfo-Panxα4 63..130
Bfo-Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
181 hlkqtk-emp v
//
LOCUS Hca 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Hca
VERSION Hca
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Hca-Panxα1 1..62
Hca-Panxα4 63..130
Hca-Panxα8 131..191
ORIGIN
1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
181 hlkkaegean v
//
LOCUS Mle 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Mle
VERSION Mle
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Mle-Panxα1 1..62
Mle-Panxα4 63..130
Mle-Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
181 hlkgsk-dil v
//
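The feature coordinates in the output above follow directly from the widths of the three input alignments (62, 68, and 61 columns). A minimal sketch of that arithmetic, using a hypothetical helper name `feature_spans`:

```python
def feature_spans(widths):
    """Return 1-based, inclusive (start, end) spans for consecutive blocks."""
    spans = []
    start = 1
    for width in widths:
        end = start + width - 1
        spans.append((start, end))
        start = end + 1  # the next block begins right after this one
    return spans


spans = feature_spans([62, 68, 61])
print(spans)
```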
Usage example 6
Note in the above example that each component is annotated with the original sequence ID. You can shorten this by passing in an integer as the second argument, in which case that number of trailing characters will be used as the alignment name. Passing in a negative number will take the characters from the front of the sequence ID (this is the opposite of the grouping pattern argument).
$: alb Panx_C-term.physr -cta ".{3}" 6 -o genbank
Output
LOCUS Bfo 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Bfo
VERSION Bfo
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
181 hlkqtk-emp v
//
LOCUS Hca 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Hca
VERSION Hca
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Panxα8 131..191
ORIGIN
1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
181 hlkkaegean v
//
LOCUS Mle 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Mle
VERSION Mle
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
181 hlkgsk-dil v
//
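Because the sign convention for integer alignment names is the reverse of the grouping pattern, it may help to see it as slicing. The helper `alignment_name_from_int` below is hypothetical and assumes a non-zero integer; it only mirrors the behavior described in this example.

```python
def alignment_name_from_int(rec_id, n):
    """Positive n: last n characters; negative n: first |n| characters.

    n is assumed to be non-zero in this sketch.
    """
    return rec_id[-n:] if n > 0 else rec_id[:-n]


print(alignment_name_from_int("Bfo-Panxα1", 6))   # trailing six characters
print(alignment_name_from_int("Bfo-Panxα1", -3))  # leading three characters
```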
Usage example 7
Alignment names can also be specified with a regular expression. If no match is found, then the name reverts to the whole record ID.
$: alb Panx_C-term.physr -cta 3 "Panxα[1-5]" -o gb
Output
LOCUS Bfo 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Bfo
VERSION Bfo
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Bfo-Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
181 hlkqtk-emp v
//
LOCUS Hca 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Hca
VERSION Hca
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Hca-Panxα8 131..191
ORIGIN
1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
181 hlkkaegean v
//
LOCUS Mle 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION Mle
VERSION Mle
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Panxα1 1..62
Panxα4 63..130
Mle-Panxα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
181 hlkgsk-dil v
//
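The fallback to the whole record ID can be pictured as follows. This is a sketch with a hypothetical helper, `alignment_name_from_regex`, built on Python's `re` module rather than on AlignBuddy's own code.

```python
import re


def alignment_name_from_regex(rec_id, pattern):
    """Use the regex match as the name; fall back to the full ID if nothing matches."""
    match = re.search(pattern, rec_id)
    return match.group(0) if match else rec_id


names = [alignment_name_from_regex(rec_id, "Panxα[1-5]")
         for rec_id in ("Bfo-Panxα1", "Bfo-Panxα4", "Bfo-Panxα8")]
print(names)
```

The character class `[1-5]` does not cover `8`, which is why the third feature in each record above keeps its full ID.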
Usage example 8
If you want to get really fancy with your regular expressions, include parenthesized capture groups. Only the text matched inside the parentheses is used in the final names, with the captured pieces joined together in order.
$: alb Panx_C-term.physr -cta "^(.).{3}([^0-9]+)" "(P)anx(α.*)" -o genbank
Output
LOCUS BPanxα 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION BPanxα
VERSION BPanxα
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Pα1 1..62
Pα4 63..130
Pα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln vlassispac fqeimnnvcp rlirahvsrk grnlgddpnl
61 -------eii qvmtdntnpl ffskifnelt nllietssdq agkvvenlam qg-dedtivd
121 ldtsssrt-- gdsklkyiyf ncgttgrtyl hliakninpr ifeqliiklk ndlveeknkq
181 hlkqtk-emp v
//
LOCUS HPanxα 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION HPanxα
VERSION HPanxα
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Pα1 1..62
Pα4 63..130
Pα8 131..191
ORIGIN
1 --hykkvyyk igtsgrviln viassiapsa fqeimnnvcp rlirthvskk grnliddpdl
61 is-------l qvlmanthpv iftrifdelt frlvtkasmd -ceavknlqa egqigetaid
121 lepnlgkavg -dnklkyiyf ncgttgrtyl hliannvnpr vfeqlvirls kdlveeknka
181 hlkkaegean v
//
LOCUS MPanxα 191 aa UNK 01-JAN-1980
DEFINITION .
ACCESSION MPanxα
VERSION MPanxα
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
Pα1 1..62
Pα4 63..130
Pα8 131..191
ORIGIN
1 dphykkvyyk igtsgrviln mlaasisptc fqeimnnvcp rlirahvskk grnlgddpll
61 --gaggreiv qiltdnsnpl lfskifddlt nllittskn- -advienlsk l---dssvie
121 lgskdsi--- ensklkfiyf ncgttgrtyl hliaknvnpr ifeqliikls adlveeknkq
181 hlkgsk-dil v
//
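To see how the capture groups in this example produce `BPanxα` and `Pα1`, the joining step can be approximated in Python. The helper `joined_groups` is hypothetical, and concatenating the groups in order is inferred from the output shown above rather than taken from the tool's source.

```python
import re


def joined_groups(rec_id, pattern):
    """Join all capture groups of the first match; use the whole match if there are none."""
    match = re.search(pattern, rec_id)
    if match is None:
        return None
    groups = match.groups()
    if not groups:
        return match.group(0)
    # Groups that did not participate in the match are None and are skipped.
    return "".join(g for g in groups if g is not None)


print(joined_groups("Bfo-Panxα1", "^(.).{3}([^0-9]+)"))  # grouping pattern
print(joined_groups("Bfo-Panxα1", "(P)anx(α.*)"))         # alignment name pattern
```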