SB Group by regex - mendessoares/BuddySuite GitHub Wiki

--group_by_regex, -gbr

Description

Group sequences together based on shared characteristics in record IDs using regular expressions. The groups are written to files in the current working directory or some other pre-existing directory.

Arguments

Regular expression(s) ( regex )

One or more regular expressions can be used to specify how to group IDs. Records whose IDs produce the same matched text are written to the same file, which is named after that text. If an ID contains multiple matches, only the first match is used, and any records whose IDs do not contain a match are sent to a separate file called 'Unknown'.

The pattern "^.*$" can be used to separate every record into its own file.
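The grouping rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not SeqBuddy's actual implementation: the function name `group_ids` is hypothetical, and it works on plain ID strings rather than sequence records.

```python
import re


def group_ids(record_ids, pattern):
    """Group IDs by the first match of `pattern`; IDs without a match go to 'Unknown'."""
    groups = {}
    for rec_id in record_ids:
        # re.search returns only the first (leftmost) match in the ID
        match = re.search(pattern, rec_id)
        label = match.group(0) if match else "Unknown"
        groups.setdefault(label, []).append(rec_id)
    return groups


ids = ["Dme~Panxδ1", "Dme~Panxδ4", "Mle-Panxα1", "Mle-Panxα9"]
grouped = group_ids(ids, r"Panx.[1-3]")
for label, members in grouped.items():
    print(label, members)
```

Each key of the returned dictionary corresponds to one output file name (without the extension).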

Output directory ( path )

Optional. By default, all new files will be written to the current working directory. If you wish to send the output elsewhere, provide a path to an existing directory (new directories will not be created for you).
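Because the output directory must already exist, it may be convenient to create it before calling the tool. A minimal sketch using only the Python standard library follows; the helper name `ensure_output_dir` is hypothetical and is not part of BuddySuite, and a temporary directory is used here so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_output_dir(path):
    """Create the directory (and any missing parents) if it does not exist yet."""
    out_dir = Path(path).expanduser()  # expand a leading '~' to the home directory
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


# Demonstrate inside a throwaway temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_output_dir(Path(tmp) / "foo" / "bar")
    created = target.is_dir()
print(created)
```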

Examples

Input file: C-terms.fa

>Dme~Panxδ1
YKLLGSLKSYLKWQIQTDNAVFRLHNSFTTVLLLTCSLIITATQYVGQPI
>Dme~Panxδ2
MDVFGSVKGLLKIDQVDNNVFRMHYKATVIILIAFSLLVTSRQYIGDPID
>Dme~Panxδ3
GFIKIDNMVFRCHYRITAILFTCCIIVTANNLIGDPISCIIPMHVINTFC
>Dme~Panxδ4
MAAVKPLSKYLQFKVHIYDAIFTLHSKVTVALLLACTFLLSSKQYFGDPI
>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFM
>Mle-Panxα5
MIYWVWAVFKRMAPFKVVTLDDRWDQMNRSFMMPLTMSFAYLIDYGIIAG
>Mle-Panxα6
MLLEILANFKGATPFKEIVLDDKWDQINRCYMFLLCVIFGTVVTFRQYTG
>Mle-Panxα9
MLDILSKFKGVTPFKGITIDDGWDQLNRSFMFVLLVVMGTTVTVRQYTGS

Usage example 1

Simple regular expression matching the characters "P", "a", "n", "x", followed by one more character (dot operator '.')

$: sb C-terms.fa -gbr "Panx."

Output

New file: /path/to/cwd/Panxδ.fa
New file: /path/to/cwd/Panxα.fa

Usage example 2

Regular expression that does not match all IDs

$: sb C-terms.fa -gbr "Panx.[1-3]"

Output

New file: /path/to/cwd/Unknown.fa
New file: /path/to/cwd/Panxδ1.fa
New file: /path/to/cwd/Panxδ2.fa
New file: /path/to/cwd/Panxδ3.fa
New file: /path/to/cwd/Panxα1.fa

Usage example 3

Multiple regular expressions

$: sb C-terms.fa -gbr "Dme.*δ" "Panx[αδ]"

Output

New file: /path/to/cwd/Dme~Panxδ.fa
New file: /path/to/cwd/Panxα.fa
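The output above is consistent with each ID being labelled by the first pattern that matches it. The sketch below illustrates that reading. It is an assumption about the behaviour, not a description of SeqBuddy's internals, and `label_for` is a hypothetical name.

```python
import re


def label_for(record_id, patterns):
    """Return the matched text of the first pattern that matches, else 'Unknown'."""
    for pattern in patterns:  # assumption: patterns are tried in the order given
        match = re.search(pattern, record_id)
        if match:
            return match.group(0)
    return "Unknown"


patterns = ["Dme.*δ", "Panx[αδ]"]
labels = [label_for(rec_id, patterns) for rec_id in ("Dme~Panxδ1", "Mle-Panxα5")]
print(labels)
```

Under this assumption the Dme records are labelled by the first pattern ("Dme~Panxδ"), while the Mle records fall through to the second pattern ("Panxα").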

Usage example 4

Use parentheses notation to extract parts of your match (results from multiple sets of parentheses are concatenated)

$: sb C-terms.fa -gbr "([MD]).*([αδ])"

Output

New file: /path/to/cwd/Mα.fa
New file: /path/to/cwd/Dδ.fa
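The concatenation of parenthesised groups can be reproduced with Python's `re` module. The following is a hedged sketch of the idea rather than the tool's own code, and `label_from_groups` is a hypothetical helper.

```python
import re


def label_from_groups(record_id, pattern):
    """Build a group label from capture groups, or from the whole match if there are none."""
    match = re.search(pattern, record_id)
    if match is None:
        return "Unknown"
    if match.groups():
        # Join the text captured by each set of parentheses, skipping unused groups
        return "".join(g for g in match.groups() if g is not None)
    return match.group(0)


print(label_from_groups("Dme~Panxδ1", "([MD]).*([αδ])"))
print(label_from_groups("Mle-Panxα6", "([MD]).*([αδ])"))
```

Only the captured parts ("D" and "δ", or "M" and "α") end up in the label; everything matched by ".*" between them is discarded.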

Usage example 5

Write every single record out to its own file by passing in the pattern "^.*$", which matches the entire ID

$: sb C-terms.fa -gbr "^.*$"

Output

New file: /path/to/cwd/Dme~Panxδ1.fa
New file: /path/to/cwd/Dme~Panxδ2.fa
New file: /path/to/cwd/Dme~Panxδ3.fa
New file: /path/to/cwd/Dme~Panxδ4.fa
New file: /path/to/cwd/Mle-Panxα1.fa
New file: /path/to/cwd/Mle-Panxα5.fa
New file: /path/to/cwd/Mle-Panxα6.fa
New file: /path/to/cwd/Mle-Panxα9.fa

Usage example 6

Specify a pre-existing folder to change where the files are written to

$: sb C-terms.fa -gbr "~/foo/bar/" "Mle"

Output

New file: /home/foo/bar/Unknown.fa
New file: /home/foo/bar/Mle.fa