group_by_cluster - biologyguy/RD-MCL GitHub Wiki
Given a set of clusters, it can be nice to compile those clusters into individual sequence files, multiple sequence
alignments, consensus sequences, or lists of metadata. All of this can be done with group_by_cluster.
$: group_by_cluster rdmcl_dir <args>
rdmcl_dir: Path to the directory with all of the RD-MCL output files.
args: All flagged arguments are explained in detail below.
Switch between output types. The valid options are explained in more detail below (default=list)
list → Default value. Appends the description line of every sequence to its ID
group_0_5
BOL-PanxαD Bo_species|m.22 and m.4 and m.14|ML25998|866p 2.
Bab-PanxαE Be_abyssicola|m.28|ML25998|352p 2.
Bfo-PanxαI Ba_fosteri|m.74|ML25998|352 2.
Dgl-PanxαD Dr_glandiformis|m.27 and m.51|ML25998|406 2.
group_0_6
BOL-PanxαH Bo_species|m.33|ML07312|1211 2.
Dgl-PanxαH Dr_glandiformis|m.38 and m.2|ML07312|587p 2.
Edu-PanxαC Eu_dunlapae|m.6 and m.26|ML07312|450 2.
group_0_7
Bfo-PanxαG Ba_fosteri|m.53|ML078817 2.
Dgl-PanxαA Dr_glandiformis|m.1|ML078817| 2.
group_0_19
Hru-PanxαC Ha_rubra|m.15 and m.42|ML078817|p 2.
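If you want to work with the list output programmatically, it can be read back into a dictionary. The following is only a minimal sketch: it assumes that group names start with `group_` and that every other non-empty line is a sequence ID followed by whitespace and its description, as in the example above. The function name `parse_list_output` is made up for illustration and is not part of RD-MCL.

```python
def parse_list_output(text):
    """Map each group name to a list of (sequence ID, description) tuples.

    Assumption: group header lines start with 'group_' and member lines
    are 'ID<whitespace>description'. Adjust if your output differs.
    """
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("group_"):
            # A new group header: start collecting its members
            current = line
            groups[current] = []
        elif current is not None:
            # Split the ID from the rest of the line (the description)
            parts = line.split(None, 1)
            description = parts[1] if len(parts) > 1 else ""
            groups[current].append((parts[0], description))
    return groups


example = """\
group_0_7
Bfo-PanxαG Ba_fosteri|m.53|ML078817 2.
Dgl-PanxαA Dr_glandiformis|m.1|ML078817| 2.
group_0_19
Hru-PanxαC Ha_rubra|m.15 and m.42|ML078817|p 2.
"""

groups = parse_list_output(example)
for name, members in groups.items():
    print(name, [seq_id for seq_id, _ in members])
```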
seqs, sequences → Group cluster sequence records together
>BOL-PanxαD group_0_5 BOL-PanxαD Bo_species|m.22 and m.4 and m.14|ML25998|866p 2.
MVLDLISGNFKNLLQIKSVSIDDQWDQLNRTYLVMFCILSGTIMTFKQNLGSIIHCIGDS
RSGEGSFAEVHDTFVQDYCAAQGLYTVKEV
>Bab-PanxαE group_0_5 Bab-PanxαE Be_abyssicola|m.28|ML25998|352p 2.
MVVDLISGNFKGLFAVKSVSIDDGWDQLNRNYMVMFCIMSGTIMTLRQNLGTIINCVGDS
ARDTGANFANDNDNFVTDYCSAQGLFTLMSWPDEI
>Bfo-PanxαI group_0_5 Bfo-PanxαI Ba_fosteri|m.74|ML25998|352 2.
MVFEIISGNFKSLLTVKSISIDDKWDQCNRTYLVMFCIFSGTIMTLRQQLGSIIHCMGHV
GNENAKEGDFVETNNVFVNDYCSAQGLYTYKEMYTLSWPD
>Dgl-PanxαD group_0_5 Dgl-PanxαD Dr_glandiformis|m.27 and m.51|ML25998|406 2.
MVLDLISGNFKTFFAIKSVSIDDKWDQLNRNYLVMFCILSGSIMTLRQNLGSIIECIGDT
SGDKDFANENSVFVSDYCSAVPWPSEIPY
>BOL-PanxαH group_0_6 BOL-PanxαH Bo_species|m.33|ML07312|1211 2.
MVLEVLALFPRLAPFKVITLDDVWDQWNRSFMFIMTVLFGSIVTIRSYTGSVIECDGFLK
VPVEFAKDYCWTQGIYTLREGYDYHSSILPYPGVFP
>Dgl-PanxαH group_0_6 Dgl-PanxαH Dr_glandiformis|m.38 and m.2|ML07312|587p 2.
MVLEVLALFPRLAPFKVITLDDGWDQWNRSFMFIVCVLFGSVVTIRSYTGSVIECDGFIK
VPPDFAKDYCWTQGIYTLLEGYDYHTDMLPYPGVFPEDAP
>Edu-PanxαC group_0_6 Edu-PanxαC Eu_dunlapae|m.6 and m.26|ML07312|450 2.
MVLEILALFPRLAPFKVITLDDGWDQLNRSFMFILCVLFGSIVTIRCYTGSVIECDGFVK
VPDEFAKDYCWTQGIYTIKEAYDVPGSSIPYPGIAP
>Bfo-PanxαG group_0_7 Bfo-PanxαG Ba_fosteri|m.53|ML078817 2.
MAYFLATGLEKMRGAIPFKDSIDDTIGQINRNTMTRVMGMWAVLSTFTQLIGENISCLSF
KKFSRDFAQQFCWTQGMYTNIAPCVT
>Dgl-PanxαA group_0_7 Dgl-PanxαA Dr_glandiformis|m.1|ML078817| 2.
MYWFYEIHQQIARGNNSRKNAMDDPPDWLSRILMPMLMFIFFTLSTFTQLIGQPISCLGF
QKFNREFAEQYCWTQGMFTDRRSYLTYPGITPCVREW
>Hru-PanxαC group_0_19 Hru-PanxαC Ha_rubra|m.15 and m.42|ML078817|p 2.
TILDEVRKAHGYKKHAIDGPAEWMNRIFVPMLMTVFFIISTISLLVGQPVSCVGFDKDDM
GFAEEYCWTQGIFTNRRAYDMTGSIPYPGVLDTK
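In the sequence output above, the group name appears as the second whitespace-separated field of each FASTA header. Assuming that layout holds for your output, a small sketch like the one below can regroup the records by cluster. The helper name `group_fasta_records` is hypothetical and not part of RD-MCL.

```python
def group_fasta_records(text):
    """Return {group name: {sequence ID: sequence}} from seqs-mode FASTA text.

    Assumption: headers look like '>ID group_name description...',
    as in the example output shown above.
    """
    groups = {}
    seq_id = group = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id, group = fields[0], fields[1]
            groups.setdefault(group, {})[seq_id] = ""
        elif seq_id is not None:
            # Sequence lines are wrapped, so join them back together
            groups[group][seq_id] += line
    return groups


example = """\
>Bfo-PanxαG group_0_7 Bfo-PanxαG Ba_fosteri|m.53|ML078817 2.
MAYFLATGLEKMRGAIPFKDSIDDTIGQINRNTMTRVMGMWAVLSTFTQLIGENISCLSF
KKFSRDFAQQFCWTQGMYTNIAPCVT
>Hru-PanxαC group_0_19 Hru-PanxαC Ha_rubra|m.15 and m.42|ML078817|p 2.
TILDEVRKAHGYKKHAIDGPAEWMNRIFVPMLMTVFFIISTISLLVGQPVSCVGFDKDDM
GFAEEYCWTQGIFTNRRAYDMTGSIPYPGVLDTK
"""

records = group_fasta_records(example)
for group_name, seqs in records.items():
    for name, seq in seqs.items():
        print(group_name, name, len(seq))
```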
aln, alignment → Create a multiple sequence alignment for each cluster
4 106
BOL-PanxαD MVLDLISGNF KNLLQIKSVS IDDQWDQLNR TYLVMFCILS GTIMTFKQNL
Bab-PanxαE MVVDLISGNF KGLFAVKSVS IDDGWDQLNR NYMVMFCIMS GTIMTLRQNL
Bfo-PanxαI MVFEIISGNF KSLLTVKSIS IDDKWDQCNR TYLVMFCIFS GTIMTLRQQL
Dgl-PanxαD MVLDLISGNF KTFFAIKSVS IDDKWDQLNR NYLVMFCILS GSIMTLRQNL
GSIIHC---I GDSRSGE--G SFAEVHDTFV QDYCAAQGLY T-----V---
GTIINC---V GDS-ARDTGA NFANDNDNFV TDYCSAQGLF T-----LMSW
GSIIHCMGHV GNENAKE--G DFVETNNVFV NDYCSAQGLY TYKEMYTLSW
GSIIEC---I GDT-SGD--K DFANENSVFV SDYCSA---- -------VPW
-KEV--
PDEI--
PD----
PSEIPY
3 100
BOL-PanxαH MVLEVLALFP RLAPFKVITL DDVWDQWNRS FMFIMTVLFG SIVTIRSYTG
Dgl-PanxαH MVLEVLALFP RLAPFKVITL DDGWDQWNRS FMFIVCVLFG SVVTIRSYTG
Edu-PanxαC MVLEILALFP RLAPFKVITL DDGWDQLNRS FMFILCVLFG SIVTIRCYTG
SVIECDGFLK VPVEFAKDYC WTQGIYTLRE GYDYHSSILP YPGVFP----
SVIECDGFIK VPPDFAKDYC WTQGIYTLLE GYDYHTDMLP YPGVFPEDAP
SVIECDGFVK VPDEFAKDYC WTQGIYTIKE AYDVPGSSIP YPGIAP----
2 97
Bfo-PanxαG MAYFLATGLE KMRGAIPFKD SIDDTIGQIN RNTMTRVMGM WAVLSTFTQL
Dgl-PanxαA MYWFYEIHQQ IARGNNSRKN AMDDPPDWLS RILMPMLMFI FFTLSTFTQL
IGENISCLSF KKFSRDFAQQ FCWTQGMYTN ---------I APCVT--
IGQPISCLGF QKFNREFAEQ YCWTQGMFTD RRSYLTYPGI TPCVREW
1 94
Hru-PanxαC TILDEVRKAH GYKKHAIDGP AEWMNRIFVP MLMTVFFIIS TISLLVGQPV
SCVGFDKDDM GFAEEYCWTQ GIFTNRRAYD MTGSIPYPGV LDTK
cons, consensus → Create multiple sequence alignments and then compress them into a weighted consensus sequence
>group_0_5
MVLDLISGNFKSLLAVKSVSIDDKWDQLNRTYLVMFCILSGTIMTLRQNLGSIIHCVGDS
AGEGDFAETNDVFVNDYCSAQGLYTTLSWPDEI
>group_0_6
MVLEVLALFPRLAPFKVITLDDGWDQWNRSFMFILCVLFGSIVTIRSYTGSVIECDGFVK
VPDEFAKDYCWTQGIYTLKEGYDYHGSSLPYPGVFP
>group_0_7
MXXFXXXXXXXXRGXXXXKXXXDDXXXXXXRXXMXXXMXXXXXLSTFTQLIGXXISCLXF
XKFXRXFAXQXCWTQGMXTXXXXXXXXXXIXPCVXXX
>group_0_19
TILDEVRKAHGYKKHAIDGPAEWMNRIFVPMLMTVFFIISTISLLVGQPVSCVGFDKDDM
GFAEEYCWTQGIFTNRRAYDMTGSIPYPGVLDTK
A file in any supported format containing sequence records for all sequence IDs present in the cluster_file. Using this
flag will override the default behaviour of looking for the input_seq.fa file in rdmcl_dir.
$: group_by_cluster rdmcl_dir -s seq_file
If using alignment mode, you can specify the alignment program to use.
$: group_by_cluster rdmcl_dir aln -a '/path/to/mafft'
$: group_by_cluster rdmcl_dir aln -a clustalo
You can restrict which groups are returned by naming them explicitly
$: group_by_cluster rdmcl_dir -g group_0_5 group_0_6 group_0_7 group_0_19
Restrict which groups are returned to only those smaller than or equal to the given value
$: group_by_cluster rdmcl_dir -max 50
Restrict which groups are returned to only those larger than or equal to the given value
$: group_by_cluster rdmcl_dir -min 5
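Both size filters are inclusive bounds ("smaller than or equal" and "larger than or equal"). As a rough illustration of that behaviour, and not the tool's actual code, the sketch below filters a dictionary of groups by member count. The function and parameter names (`filter_by_size`, `min_size`, `max_size`) are assumptions chosen for the example.

```python
def filter_by_size(groups, min_size=None, max_size=None):
    """Keep groups whose member count lies within the inclusive bounds."""
    kept = {}
    for name, members in groups.items():
        size = len(members)
        if min_size is not None and size < min_size:
            continue  # too small
        if max_size is not None and size > max_size:
            continue  # too large
        kept[name] = members
    return kept


# Group sizes taken from the example clusters shown earlier on this page
groups = {
    "group_0_5": ["BOL-PanxαD", "Bab-PanxαE", "Bfo-PanxαI", "Dgl-PanxαD"],
    "group_0_6": ["BOL-PanxαH", "Dgl-PanxαH", "Edu-PanxαC"],
    "group_0_7": ["Bfo-PanxαG", "Dgl-PanxαA"],
    "group_0_19": ["Hru-PanxαC"],
}

selected = filter_by_size(groups, min_size=2, max_size=3)
print(sorted(selected))
```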
If using alignment mode, you can clean up gappy columns by running trimal.
The following values are valid:
- all (remove all columns with any gaps)
- gappyout (auto thresholding algorithm)
- int (specify max number of gaps per column)
- float (specify max percentage of columns with gaps, must be ≥0 and ≤1)
You can pass in multiple values in decreasing stringency to get the best results. The program will not apply trimal if any sequence is reduced to zero non-gap residues, or if the average sequence length is reduced by more than 50%.
$: group_by_cluster rdmcl_dir -trm 'all' 3 0.5 'gappyout'
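To make the fallback behaviour concrete, here is a simplified pure-Python sketch of one plausible reading of the rules above: try each threshold in the order given and keep the first result that passes both safety checks. It does not call trimal, it does not implement `gappyout`, and it interprets a float as the maximum fraction of gaps allowed per column; all of these are assumptions, as are the function names.

```python
def trim_columns(alignment, threshold):
    """Drop gappy columns from a list of equal-length aligned strings.

    threshold: 'all' (no gaps allowed), an int (max gaps per column),
    or a float (assumed here: max fraction of gaps per column).
    """
    if threshold == "gappyout":
        raise NotImplementedError("trimal's gappyout heuristic is not sketched here")
    if threshold == "all":
        max_gaps = 0
    elif isinstance(threshold, float):
        max_gaps = threshold * len(alignment)
    else:
        max_gaps = int(threshold)
    keep = [i for i, col in enumerate(zip(*alignment)) if col.count("-") <= max_gaps]
    return ["".join(seq[i] for i in keep) for seq in alignment]


def acceptable(original, trimmed):
    """Reject if any sequence is emptied or the average length drops by more than 50%."""
    orig_lens = [len(seq.replace("-", "")) for seq in original]
    new_lens = [len(seq.replace("-", "")) for seq in trimmed]
    if any(length == 0 for length in new_lens):
        return False
    return sum(new_lens) / len(new_lens) >= 0.5 * sum(orig_lens) / len(orig_lens)


def trim_with_fallback(alignment, thresholds):
    """Return (threshold used, trimmed alignment), or (None, original) if none pass."""
    for threshold in thresholds:
        trimmed = trim_columns(alignment, threshold)
        if acceptable(alignment, trimmed):
            return threshold, trimmed
    return None, alignment


# Toy alignment: every column has at least one gap, so 'all' empties it
aln = ["MKT-----", "----AILV", "MKTWAILV"]
used, result = trim_with_fallback(aln, ["all", 1, 0.5])
print(used, result)
```

In this toy case the strictest setting would remove every column, so the sketch falls through to the next value, which is why listing thresholds from most to least stringent is useful.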
If passing in a sequence file with the -s flag, and the sequences do not contain the taxa prefix required by
RD-MCL, you can pass in the --strip_taxa flag to remove the prefix from the final_clusters.txt file.
$: group_by_cluster rdmcl_dir -st
RD-MCL removes reciprocal best hit cliques of sequences at the beginning of a run to reduce noise introduced by recent gene duplications within individual taxa. These sequences are then placed back into the appropriate cluster at the end, but sometimes it's nice to leave them out of further downstream analyses as well. This flag will leave only a single representative sequence for each paralog clique.
$: group_by_cluster rdmcl_dir -ep
Appends the size of each orthogroup (in parentheses) to its respective group name. This only has an effect on the consensus and list modes.
$: group_by_cluster rdmcl_dir -ic cons
>group_0_1_0(15)
MVIDILSGFKGITPFKGITLDDGWDQINRSFMFVLCVLMGTVVTVRQYAGGIISCDGFTK
YSGSFSEDYCWTQGLYTIKEAYDHLLANVPYPGVIPEEIPACIERELINGGKVSCPDPED
VKPPTRVYHLWYQWVPFYFWLAAAAFFFPYLIYKHFGVGDLKPLIQMLHNPIVDEGDQNA
MAEKASMWLFYKLN
>group_0_1_1(7)
VAMGVETFLSFGGTHLSRFLPTASTVDDVGIQTNRSLLVMILMVFGATVTLNTYIGNPIS
CIGFDKVNDKDKNFPLDYCWTQGLYTIKEVYDDTSGKIPYPGIIPEDIPACLGRVGCVEK
EVKPFTRVYHLWYQWVPFYFWL
>group_0_2(2)
MYFFIMXXTEEVRKAHNCRXXXXXXPADWLNRIFMPTLMIIFCFINLSQMWSQDDANISC
VGFKDYKDFAEEYCWTQGIYTNRLAYHLPEGXVPYPGVVPCVGVLDPRSGGTRFKCSAAG
KEEDHXYHLWXQWVPFXYT
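When the count is appended, downstream scripts may need to split the group name from its size again. A small sketch, assuming headers always end in `(N)` as in the output above; the helper name `split_name_and_size` is hypothetical:

```python
import re

# Optional leading '>', then the group name, then the size in parentheses
HEADER = re.compile(r"^>?(?P<name>\S+)\((?P<size>\d+)\)$")


def split_name_and_size(header):
    """Return (group name, size) from a header such as '>group_0_1_0(15)'."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"No size suffix found in {header!r}")
    return match.group("name"), int(match.group("size"))


for header in (">group_0_1_0(15)", ">group_0_1_1(7)", ">group_0_2(2)"):
    print(split_name_and_size(header))
```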
Specify a directory where you would like the output to be written. A new file will be created for each orthogroup (in practice, this is usually most useful when using the alignment mode).
$: group_by_cluster rdmcl_dir aln -w '/path/to/outdir'