SB Clean sequence - mendessoares/BuddySuite GitHub Wiki
--clean_seq, -cs
Description
Remove all non-sequence characters from input. This will include any spaces, numbers, gap characters (e.g. '-'), stop characters (e.g. '*'), etc. Passing in the word 'strict' will also replace ambiguous/degenerate characters in nucleotide sequences with 'N'.
Nucleotide sequences: ATGCURYWSMKHBVDNX will be retained. If 'strict' is specified, only ATGCXNU will be retained.
Protein sequences: ACDEFGHIKLMNPQRSTVWXY will be retained. The 'strict' argument has no effect on protein sequences.
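The filtering described above can be pictured as a whitelist applied to each residue string. The following is only a minimal sketch of that idea, not SeqBuddy's actual implementation: the function name `clean_sequence` and the two constants are hypothetical, and matching case-insensitively is an assumption, since the description lists upper-case characters only.

```python
import re

# Alphabets copied from the description above.
NUCLEOTIDE_ALPHABET = "ATGCURYWSMKHBVDNX"
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWXY"


def clean_sequence(seq, alphabet):
    """Return seq with every character outside `alphabet` removed.

    Hypothetical helper for illustration; case-insensitive matching
    is an assumption, not documented behavior.
    """
    pattern = "[^" + re.escape(alphabet) + "]"
    return re.sub(pattern, "", seq, flags=re.IGNORECASE)


# Gaps, spaces, digits and the stop character are all dropped.
print(clean_sequence("MP-LHT PYPG 42 RFLV*", PROTEIN_ALPHABET))
# Degenerate codes such as R and Y survive the default (non-strict) clean.
print(clean_sequence("AAGT------CCAGRY--TAG", NUCLEOTIDE_ALPHABET))
```

The first call prints `MPLHTPYPGRFLV` and the second prints `AAGTCCAGRYTAG`.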
Arguments
'strict' ( exact string )
Optional. By default, ambiguous nucleotide characters (i.e., the degenerate alphabet) are retained, but these can cause problems in some downstream analyses. Include the word 'strict' to replace every ambiguous character with a single character ('N' by default).
Replacement character ( char )
Optional. To replace degenerate residues with a character other than 'N', specify it after 'strict' (e.g., 'strict X').
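The effect of 'strict' and the optional replacement character can be sketched as a substitution over the degenerate codes. This is an illustration under stated assumptions, not the tool's own code: `replace_ambiguous` and `DEGENERATE_CODES` are hypothetical names, and the set of codes is derived from the description (the default nucleotide alphabet minus ATGCXNU). How the tool treats 'X' and 'N' themselves is left out of this sketch.

```python
import re

# Assumed set of degenerate codes: the default nucleotide alphabet
# (ATGCURYWSMKHBVDNX) minus the characters listed as retained under
# 'strict' (ATGCXNU).
DEGENERATE_CODES = "RYWSMKHBVD"


def replace_ambiguous(seq, rep_char="N"):
    """Replace each degenerate nucleotide code in seq with rep_char.

    Hypothetical helper for illustration only.
    """
    if len(rep_char) != 1:
        raise ValueError("replacement must be a single character")
    return re.sub("[" + DEGENERATE_CODES + "]", rep_char, seq)


# Roughly analogous to '-cs strict'
print(replace_ambiguous("RRRCAACTYCTG"))
# Roughly analogous to '-cs strict X'
print(replace_ambiguous("RRRCAACTYCTG", "X"))
```

The two calls print `NNNCAACTNCTG` and `XXXCAACTXCTG` respectively.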
Examples
Input file: Mle-Panx_align.fa
>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFMFGSNISCIGF
EKLERNFVEEYCWTQGIYTSKAAYNMP-LHTPYPGIAPCVPEYDPVTQKYWLPCG----V
EEEDKAYHLWYQWVPFYFLAVAVGYYLPFLILKGSKLHQVKPLITYLMNQRNLETDPNHL
VGKLSHWIFRQLVYSRFAATSTIRMYWHDWGLVLLVCSVKILYLTVSLIHLFATAKMFHI
GNWFTYGIMFARR---SNSHTTHVKDVFFPKMVACKIETWSFTGKNHLHGMCVLALNVMN
QYLFLIVWYVNVIIIFLNSISCIYTIVKFCSPNIVHHRIVNSSSLDDHHDFTRMFGYVGP
SGRIILAKMSEHMPGYMLKQVAKKVTEKIDIENEKNRGRAPTIKFTKVNGQPSELARQPL
MHLNALMLGMVPQNLPEPKIQNIQRSQKKVRFLV*
>Mle-Panxα11
M--LISSLVQFSRLSPFKEITIDDGWDQLNRSFMFVLMVICGTIVTVRQHTGNIISCNGF
TKYDGSFSEDYCWTQGLYTIREAYHVSDVNVPYPGV---IPEEIPLCLGDNC---DKLAN
SNTTRVYHLWYQWIPFYFWLASAAFFLPYLIYKRYGFGDIKPLIHMLYNPLDGDEGVKAD
SEKASIWLYHRFS-IYMNEHSMYANFMERHGIGILVIAIKVMYLIISVLLMVMTAMMFEL
ADFKQYGIVWAQQWPDPPANVTGIKDLLFPKMVACEIKRWGPTGLEDENGMCVLAPNVIN
QYIFLILWWALVFTIVSNVFNVLAGVIRIVFIYGSYRRMLASAFLRDDPHYKKVYYKIGT
SGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDD----------------
------------PLL*-------------------
Usage example 1
$: sb Mle-Panx_align.fa -cs
Output
>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFMFGSNISCIGF
EKLERNFVEEYCWTQGIYTSKAAYNMPLHTPYPGIAPCVPEYDPVTQKYWLPCGVEEEDK
AYHLWYQWVPFYFLAVAVGYYLPFLILKGSKLHQVKPLITYLMNQRNLETDPNHLVGKLS
HWIFRQLVYSRFAATSTIRMYWHDWGLVLLVCSVKILYLTVSLIHLFATAKMFHIGNWFT
YGIMFARRSNSHTTHVKDVFFPKMVACKIETWSFTGKNHLHGMCVLALNVMNQYLFLIVW
YVNVIIIFLNSISCIYTIVKFCSPNIVHHRIVNSSSLDDHHDFTRMFGYVGPSGRIILAK
MSEHMPGYMLKQVAKKVTEKIDIENEKNRGRAPTIKFTKVNGQPSELARQPLMHLNALML
GMVPQNLPEPKIQNIQRSQKKVRFLV
>Mle-Panxα11
MLISSLVQFSRLSPFKEITIDDGWDQLNRSFMFVLMVICGTIVTVRQHTGNIISCNGFTK
YDGSFSEDYCWTQGLYTIREAYHVSDVNVPYPGVIPEEIPLCLGDNCDKLANSNTTRVYH
LWYQWIPFYFWLASAAFFLPYLIYKRYGFGDIKPLIHMLYNPLDGDEGVKADSEKASIWL
YHRFSIYMNEHSMYANFMERHGIGILVIAIKVMYLIISVLLMVMTAMMFELADFKQYGIV
WAQQWPDPPANVTGIKDLLFPKMVACEIKRWGPTGLEDENGMCVLAPNVINQYIFLILWW
ALVFTIVSNVFNVLAGVIRIVFIYGSYRRMLASAFLRDDPHYKKVYYKIGTSGRVILNML
AASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL
Input file: ambiguous_cds.fa
>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
RRRRRRRRRRRRCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
YCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
WGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
SATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
MGTACGACGGTACGAGGTTAAAGT------------------CCAGACCCTGATCAGTTG
KTGTCACCGACGCGGATATCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTG
HCGGCTGCTGCCTTCTTCATGCCCTACCTTCTGTACA------TTGGCATGGGAGATATC
BAGCCTCTCGTGAG------ACACAATCCAGTAGAATCAGACCAGGAGTTAAAGAAGATG
VCAGACAAGGCTGCAACATGGCTGTTCTACAAGTTTGACCTGTACATGAGCGAACAGTCG
DTCCTAGCAAGTCTCACCAGAAAACACGGTCTTGGTCTATCCATGGTCTTTGTAAAGATC
NTATACGCCGCAGTGTCGTTCGGGTGTTTCCTCCTGACCGCTGAGATGTTCTCAATTGGA
XATTTTAAAACCTATGGATCAGAATGGATCAAGAAGTTAAAGTTGGAAGATAATCTAGCT
TAG---------------------------------------------------------
Usage example 2
$: sb ambiguous_cds.fa -cs
Output
>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
RRRRRRRRRRRRCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
YCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
WGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
SATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
MGTACGACGGTACGAGGTTAAAGTCCAGACCCTGATCAGTTGKTGTCACCGACGCGGATA
TCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTGHCGGCTGCTGCCTTCTTC
ATGCCCTACCTTCTGTACATTGGCATGGGAGATATCBAGCCTCTCGTGAGACACAATCCA
GTAGAATCAGACCAGGAGTTAAAGAAGATGVCAGACAAGGCTGCAACATGGCTGTTCTAC
AAGTTTGACCTGTACATGAGCGAACAGTCGDTCCTAGCAAGTCTCACCAGAAAACACGGT
CTTGGTCTATCCATGGTCTTTGTAAAGATCNTATACGCCGCAGTGTCGTTCGGGTGTTTC
CTCCTGACCGCTGAGATGTTCTCAATTGGAXATTTTAAAACCTATGGATCAGAATGGATC
AAGAAGTTAAAGTTGGAAGATAATCTAGCTTAG
Usage example 3
$: sb ambiguous_cds.fa -cs strict
Output
>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
NNNNNNNNNNNNCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
NCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
NGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
NATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
NGTACGACGGTACGAGGTTAAAGTCCAGACCCTGATCAGTTGNTGTCACCGACGCGGATA
TCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTGNCGGCTGCTGCCTTCTTC
ATGCCCTACCTTCTGTACATTGGCATGGGAGATATCNAGCCTCTCGTGAGACACAATCCA
GTAGAATCAGACCAGGAGTTAAAGAAGATGNCAGACAAGGCTGCAACATGGCTGTTCTAC
AAGTTTGACCTGTACATGAGCGAACAGTCGNTCCTAGCAAGTCTCACCAGAAAACACGGT
CTTGGTCTATCCATGGTCTTTGTAAAGATCNTATACGCCGCAGTGTCGTTCGGGTGTTTC
CTCCTGACCGCTGAGATGTTCTCAATTGGANATTTTAAAACCTATGGATCAGAATGGATC
AAGAAGTTAAAGTTGGAAGATAATCTAGCTTAG
Usage example 4
$: sb ambiguous_cds.fa -cs strict X
Output
>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
XXXXXXXXXXXXCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
XCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
XGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
XATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
XGTACGACGGTACGAGGTTAAAGTCCAGACCCTGATCAGTTGXTGTCACCGACGCGGATA
TCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTGXCGGCTGCTGCCTTCTTC
ATGCCCTACCTTCTGTACATTGGCATGGGAGATATCXAGCCTCTCGTGAGACACAATCCA
GTAGAATCAGACCAGGAGTTAAAGAAGATGXCAGACAAGGCTGCAACATGGCTGTTCTAC
AAGTTTGACCTGTACATGAGCGAACAGTCGXTCCTAGCAAGTCTCACCAGAAAACACGGT
CTTGGTCTATCCATGGTCTTTGTAAAGATCXTATACGCCGCAGTGTCGTTCGGGTGTTTC
CTCCTGACCGCTGAGATGTTCTCAATTGGAXATTTTAAAACCTATGGATCAGAATGGATC
AAGAAGTTAAAGTTGGAAGATAATCTAGCTTAG