SB Group by regex - mendessoares/BuddySuite GitHub Wiki
--group_by_regex, -gbr
Description
Group sequences based on shared characteristics in their record IDs, using regular expressions. Each group is written to its own file in the current working directory or in another pre-existing directory.
Arguments
Regular expression(s) ( regex )
One or more regular expressions can be used to specify how to group IDs. If an ID contains multiple matches, only the first match is used. Records whose IDs contain no match are written to a separate file named 'Unknown'.
The pattern "^.*$" can be used to separate every record into its own file.
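The grouping rule described above can be sketched in Python with the standard `re` module. This is only an illustration of the documented behaviour, not SeqBuddy's own code; the function name `group_ids` is hypothetical.

```python
import re
from collections import defaultdict


def group_ids(ids, pattern):
    """Group record IDs by the first regex match found in each ID (illustrative only)."""
    groups = defaultdict(list)
    for rec_id in ids:
        match = re.search(pattern, rec_id)  # re.search returns the first (leftmost) match
        key = match.group(0) if match else "Unknown"  # unmatched IDs share one group
        groups[key].append(rec_id)
    return dict(groups)


ids = ["Dme~Panxδ1", "Dme~Panxδ4", "Mle-Panxα1", "Mle-Panxα9"]
groups = group_ids(ids, r"Panx.[1-3]")
for name, members in groups.items():
    print(name, members)
```

Each key of the resulting dictionary corresponds to one output file name, with 'Unknown' collecting the IDs that the pattern did not match.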
Output directory ( path )
Optional. By default, all new files will be written to the current working directory. If you wish to send the output elsewhere, provide a path to an existing directory (new directories will not be created for you).
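Because the output directory must already exist, it can be convenient to create it before calling `sb`. The following is a minimal sketch of such a preparatory step, using only the Python standard library; the helper name `ensure_output_dir` is hypothetical and is not part of BuddySuite.

```python
import os
import tempfile


def ensure_output_dir(path):
    """Create `path` (and any missing parents) and return its absolute form."""
    path = os.path.abspath(os.path.expanduser(path))
    os.makedirs(path, exist_ok=True)  # no error if the directory is already there
    return path


# Demonstration inside a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_output_dir(os.path.join(tmp, "foo", "bar"))
    created = os.path.isdir(out_dir)
    print(out_dir, created)
```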
Examples
Input file: C-terms.fa
>Dme~Panxδ1
YKLLGSLKSYLKWQIQTDNAVFRLHNSFTTVLLLTCSLIITATQYVGQPI
>Dme~Panxδ2
MDVFGSVKGLLKIDQVDNNVFRMHYKATVIILIAFSLLVTSRQYIGDPID
>Dme~Panxδ3
GFIKIDNMVFRCHYRITAILFTCCIIVTANNLIGDPISCIIPMHVINTFC
>Dme~Panxδ4
MAAVKPLSKYLQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPI
>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFM
>Mle-Panxα5
MIYWVWAVFKRMAPFKVVTLDDRWDQMNRSFMMPLTMSFAYLIDYGIIAG
>Mle-Panxα6
MLLEILANFKGATPFKEIVLDDKWDQINRCYMFLLCVIFGTVVTFRQYTG
>Mle-Panxα9
MLDILSKFKGVTPFKGITIDDGWDQLNRSFMFVLLVVMGTTVTVRQYTGS
Usage example 1
Simple regular expression matching the characters "P", "a", "n", "x", followed by one more character (dot operator '.')
$: sb C-terms.fa -gbr "Panx."
Output
New file: /path/to/cwd/Panxδ.fa
New file: /path/to/cwd/Panxα.fa
Usage example 2
Regular expression that does not match all IDs
$: sb C-terms.fa -gbr "Panx.[1-3]"
Output
New file: /path/to/cwd/Unknown.fa
New file: /path/to/cwd/Panxδ1.fa
New file: /path/to/cwd/Panxδ2.fa
New file: /path/to/cwd/Panxδ3.fa
New file: /path/to/cwd/Panxα1.fa
Usage example 3
Multiple regular expressions
$: sb C-terms.fa -gbr "Dme.*δ" "Panx[αδ]"
Output
New file: /path/to/cwd/Dme~Panxδ.fa
New file: /path/to/cwd/Panxα.fa
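One reading that is consistent with the output above is that the patterns are tried in the order given and the first one that matches supplies the group name. That ordering is an assumption made for illustration, not a documented detail of SeqBuddy, and the function name `first_pattern_match` is hypothetical.

```python
import re


def first_pattern_match(rec_id, patterns):
    """Return the matched text of the first pattern that hits, else 'Unknown'."""
    for pattern in patterns:  # assumed: patterns are tried in the order supplied
        match = re.search(pattern, rec_id)
        if match:
            return match.group(0)
    return "Unknown"


patterns = ["Dme.*δ", "Panx[αδ]"]
names = {
    rec_id: first_pattern_match(rec_id, patterns)
    for rec_id in ["Dme~Panxδ1", "Mle-Panxα1"]
}
print(names)
```

Under this reading the `Dme` records never reach the second pattern, which is why no separate `Panxδ` group appears in the output.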
Usage example 4
Use parentheses notation to extract parts of your match (results from multiple sets of parentheses are concatenated)
$: sb C-terms.fa -gbr "([MD]).*([αδ])"
Output
New file: /path/to/cwd/Mα.fa
New file: /path/to/cwd/Dδ.fa
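The concatenation of parenthesised groups can be reproduced with Python's `re` module. This is a sketch of the behaviour shown above, not SeqBuddy's implementation; the function name `group_name` is hypothetical.

```python
import re


def group_name(rec_id, pattern):
    """Build a group name from capture groups, or from the whole match if there are none."""
    match = re.search(pattern, rec_id)
    if match is None:
        return "Unknown"
    if match.groups():
        # Join the text captured by each set of parentheses, in order.
        return "".join(g for g in match.groups() if g is not None)
    return match.group(0)


d_name = group_name("Dme~Panxδ1", "([MD]).*([αδ])")
m_name = group_name("Mle-Panxα5", "([MD]).*([αδ])")
print(d_name, m_name)
```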
Usage example 5
Write every record out to its own file by passing in the pattern "^.*$", which matches the entire ID
$: sb C-terms.fa -gbr "^.*$"
Output
New file: /path/to/cwd/Dme~Panxδ1.fa
New file: /path/to/cwd/Dme~Panxδ2.fa
New file: /path/to/cwd/Dme~Panxδ3.fa
New file: /path/to/cwd/Dme~Panxδ4.fa
New file: /path/to/cwd/Mle-Panxα1.fa
New file: /path/to/cwd/Mle-Panxα5.fa
New file: /path/to/cwd/Mle-Panxα6.fa
New file: /path/to/cwd/Mle-Panxα9.fa
Usage example 6
Specify a pre-existing directory to change where the files are written
$: sb C-terms.fa -gbr "~/foo/bar/" "Mle"
Output
New file: /home/foo/bar/Unknown.fa
New file: /home/foo/bar/Mle.fa