AB Clean sequence - mendessoares/BuddySuite GitHub Wiki
--clean_seq, -cs
Description
Remove all non-alignment characters from the input. This includes spaces, numbers, stop characters (e.g., '*'), and so on, but not dashed gap characters ('-'). Passing the word 'strict' also replaces ambiguous/degenerate characters in nucleotide sequences with 'N'.
Nucleotide sequences: ATGCURYWSMKHBVDNX will be retained. If 'strict' is specified, only ATGCXNU will be retained.
Protein sequences: ACDEFGHIKLMNPQRSTVWXY will be retained. The 'strict' argument has no effect on protein sequences.
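The retained character sets can be illustrated with a short Python sketch. This is a literal reading of the description, not BuddySuite's own code: the function name `strip_non_alignment` and the case-insensitive matching are assumptions made for illustration. For alignments, Usage example 1 shows stop characters becoming gaps rather than being dropped, which this sketch does not attempt.

```python
import re

# Retained alphabets as listed in the description; '-' (gap) is always kept.
NUCLEOTIDE = "ATGCURYWSMKHBVDNX"
PROTEIN = "ACDEFGHIKLMNPQRSTVWXY"


def strip_non_alignment(seq, alphabet):
    """Drop every character that is neither in `alphabet` nor a gap."""
    # Case-insensitive so lowercase residues survive (an assumption).
    return re.sub("[^" + alphabet + r"\-]", "", seq, flags=re.IGNORECASE)


print(strip_non_alignment("ATG 12C-R*", NUCLEOTIDE))  # ATGC-R
print(strip_non_alignment("MK*L-1 V", PROTEIN))       # MKL-V
```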
Arguments
'strict' ( exact string )
Optional. By default, ambiguous nucleotide characters (i.e., the degenerate alphabet) are retained, but these can cause problems in some downstream analyses. Include the word 'strict' to replace every ambiguous character with a single character ('N' by default).
Replacement character ( char )
Optional. If 'N' is not the desired replacement character for degenerate residues, specify a different one.
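A minimal Python sketch of the 'strict' behaviour with a configurable replacement character follows. The function name `strict_nucleotides` and the parameter `rep_char` are hypothetical. The sketch follows the example outputs on this page, where every character other than A, T, G, C, U and '-' (including 'N' and 'X' in the input) ends up as the replacement character; the actual tool may differ.

```python
import re


def strict_nucleotides(seq, rep_char="N"):
    """Replace everything except ATGCU and gaps with `rep_char`."""
    # Assumption: matching is case-insensitive, so lowercase atgcu is kept.
    return re.sub(r"[^ATGCU\-]", rep_char, seq, flags=re.IGNORECASE)


print(strict_nucleotides("RRRRRRRRRRRRCAACTC"))  # NNNNNNNNNNNNCAACTC
print(strict_nucleotides("YCTGTC---", "X"))      # XCTGTC---
```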
Examples
Input file: Mle-Panx_align.fa
>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFMFGSNISCIGF
EKLERNFVEEYCWTQGIYTSKAAYNMP-LHTPYPGIAPCVPEYDPVTQKYWLPCG----V
EEEDKAYHLWYQWVPFYFLAVAVGYYLPFLILKGSKLHQVKPLITYLMNQRNLETDPNHL
VGKLSHWIFRQLVYSRFAATSTIRMYWHDWGLVLLVCSVKILYLTVSLIHLFATAKMFHI
GNWFTYGIMFARR---SNSHTTHVKDVFFPKMVACKIETWSFTGKNHLHGMCVLALNVMN
QYLFLIVWYVNVIIIFLNSISCIYTIVKFCSPNIVHHRIVNSSSLDDHHDFTRMFGYVGP
SGRIILAKMSEHMPGYMLKQVAKKVTEKIDIENEKNRGRAPTIKFTKVNGQPSELARQPL
MHLNALMLGMVPQNLPEPKIQNIQRSQKKVRFLV*
>Mle-Panxα11
M--LISSLVQFSRLSPFKEITIDDGWDQLNRSFMFVLMVICGTIVTVRQHTGNIISCNGF
TKYDGSFSEDYCWTQGLYTIREAYHVSDVNVPYPGV---IPEEIPLCLGDNC---DKLAN
SNTTRVYHLWYQWIPFYFWLASAAFFLPYLIYKRYGFGDIKPLIHMLYNPLDGDEGVKAD
SEKASIWLYHRFS-IYMNEHSMYANFMERHGIGILVIAIKVMYLIISVLLMVMTAMMFEL
ADFKQYGIVWAQQWPDPPANVTGIKDLLFPKMVACEIKRWGPTGLEDENGMCVLAPNVIN
QYIFLILWWALVFTIVSNVFNVLAGVIRIVFIYGSYRRMLASAFLRDDPHYKKVYYKIGT
SGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDD----------------
------------PLL*-------------------
Usage example 1
Convert protein stop characters into gaps
$: alb Mle-Panx_align.fa -cs
Output
>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFMFGSNISCIGF
EKLERNFVEEYCWTQGIYTSKAAYNMP-LHTPYPGIAPCVPEYDPVTQKYWLPCG----V
EEEDKAYHLWYQWVPFYFLAVAVGYYLPFLILKGSKLHQVKPLITYLMNQRNLETDPNHL
VGKLSHWIFRQLVYSRFAATSTIRMYWHDWGLVLLVCSVKILYLTVSLIHLFATAKMFHI
GNWFTYGIMFARR---SNSHTTHVKDVFFPKMVACKIETWSFTGKNHLHGMCVLALNVMN
QYLFLIVWYVNVIIIFLNSISCIYTIVKFCSPNIVHHRIVNSSSLDDHHDFTRMFGYVGP
SGRIILAKMSEHMPGYMLKQVAKKVTEKIDIENEKNRGRAPTIKFTKVNGQPSELARQPL
MHLNALMLGMVPQNLPEPKIQNIQRSQKKVRFLV-
>Mle-Panxα11
M--LISSLVQFSRLSPFKEITIDDGWDQLNRSFMFVLMVICGTIVTVRQHTGNIISCNGF
TKYDGSFSEDYCWTQGLYTIREAYHVSDVNVPYPGV---IPEEIPLCLGDNC---DKLAN
SNTTRVYHLWYQWIPFYFWLASAAFFLPYLIYKRYGFGDIKPLIHMLYNPLDGDEGVKAD
SEKASIWLYHRFS-IYMNEHSMYANFMERHGIGILVIAIKVMYLIISVLLMVMTAMMFEL
ADFKQYGIVWAQQWPDPPANVTGIKDLLFPKMVACEIKRWGPTGLEDENGMCVLAPNVIN
QYIFLILWWALVFTIVSNVFNVLAGVIRIVFIYGSYRRMLASAFLRDDPHYKKVYYKIGT
SGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDD----------------
------------PLL--------------------
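Turning a stop character into a gap, rather than deleting it, keeps every row of the alignment the same width. The Python sketch below illustrates that point on row endings trimmed from the example above; `stops_to_gaps` is a hypothetical helper, not part of BuddySuite.

```python
def stops_to_gaps(row):
    """Swap each '*' for '-' so the row keeps its original length."""
    return row.replace("*", "-")


# Row endings trimmed from the two records in the example above.
rows = ["KKVRFLV*", "PLL*----"]
cleaned = [stops_to_gaps(r) for r in rows]

print(cleaned)  # ['KKVRFLV-', 'PLL-----']
# Column count is unchanged, so the rows stay aligned.
print([len(r) for r in rows] == [len(r) for r in cleaned])  # True
```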
Input file: ambiguous_cds.fa
>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
RRRRRRRRRRRRCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
YCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
WGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
SATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
MGTACGACGGTACGAGGTTAAAGT------------------CCAGACCCTGATCAGTTG
KTGTCACCGACGCGGATATCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTG
HCGGCTGCTGCCTTCTTCATGCCCTACCTTCTGTACA------TTGGCATGGGAGATATC
BAGCCTCTCGTGAG------ACACAATCCAGTAGAATCAGACCAGGAGTTAAAGAAGATG
VCAGACAAGGCTGCAACATGGCTGTTCTACAAGTTTGACCTGTACATGAGCGAACAGTCG
DTCCTAGCAAGTCTCACCAGAAAACACGGTCTTGGTCTATCCATGGTCTTTGTAAAGATC
NTATACGCCGCAGTGTCGTTCGGGTGTTTCCTCCTGACCGCTGAGATGTTCTCAATTGGA
XATTTTAAAACCTATGGATCAGAATGGATCAAGAAGTTAAAGTTGGAAGATAATCTAGCT
TAG---------------------------------------------------------
Usage example 2
Restrict alignment characters to the unambiguous character set and 'N'
$: alb ambiguous_cds.fa -cs strict
Output
>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
NNNNNNNNNNNNCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
NCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
NGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
NATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
NGTACGACGGTACGAGGTTAAAGT------------------CCAGACCCTGATCAGTTG
NTGTCACCGACGCGGATATCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTG
NCGGCTGCTGCCTTCTTCATGCCCTACCTTCTGTACA------TTGGCATGGGAGATATC
NAGCCTCTCGTGAG------ACACAATCCAGTAGAATCAGACCAGGAGTTAAAGAAGATG
NCAGACAAGGCTGCAACATGGCTGTTCTACAAGTTTGACCTGTACATGAGCGAACAGTCG
NTCCTAGCAAGTCTCACCAGAAAACACGGTCTTGGTCTATCCATGGTCTTTGTAAAGATC
NTATACGCCGCAGTGTCGTTCGGGTGTTTCCTCCTGACCGCTGAGATGTTCTCAATTGGA
NATTTTAAAACCTATGGATCAGAATGGATCAAGAAGTTAAAGTTGGAAGATAATCTAGCT
TAG---------------------------------------------------------
Usage example 3
Replace ambiguous characters with 'X' instead of 'N'
$: alb ambiguous_cds.fa -cs strict X
Output
>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
XXXXXXXXXXXXCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
XCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
XGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
XATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
XGTACGACGGTACGAGGTTAAAGT------------------CCAGACCCTGATCAGTTG
XTGTCACCGACGCGGATATCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTG
XCGGCTGCTGCCTTCTTCATGCCCTACCTTCTGTACA------TTGGCATGGGAGATATC
XAGCCTCTCGTGAG------ACACAATCCAGTAGAATCAGACCAGGAGTTAAAGAAGATG
XCAGACAAGGCTGCAACATGGCTGTTCTACAAGTTTGACCTGTACATGAGCGAACAGTCG
XTCCTAGCAAGTCTCACCAGAAAACACGGTCTTGGTCTATCCATGGTCTTTGTAAAGATC
XTATACGCCGCAGTGTCGTTCGGGTGTTTCCTCCTGACCGCTGAGATGTTCTCAATTGGA
XATTTTAAAACCTATGGATCAGAATGGATCAAGAAGTTAAAGTTGGAAGATAATCTAGCT
TAG---------------------------------------------------------