compare_homolog_groups - biologyguy/RD-MCL GitHub Wiki

Colour code the final clusters from RD-MCL compared to expected clusters

This program is primarily for development purposes, allowing us to quickly visualize how well RD-MCL has performed on a well described data set.

Generalized usage

$: compare_homolog_groups true_clusters_file rdmcl_clusters_file <args>

true_clusters_file: Path to curated clusters. Each cluster should be on a new line, and each sequence ID separated by whitespace.

$: cat true_clusters.tsv

BOL-PanxαE Bab-PanxαC Bch-PanxαD Bfo-PanxαC Bfr-PanxαB Cfu-PanxαA Cfu-PanxαB Cfu-PanxαD Cfu-PanxαF
BOL-PanxαC Bab-PanxαD Bch-PanxαE Bfo-PanxαD Bfr-PanxαC Cfu-PanxαC Dgl-PanxαG Edu-PanxαB Hca-PanxαH
BOL-PanxαA Bab-PanxαB Bch-PanxαC Bfo-PanxαB Dgl-PanxαE Edu-PanxαA Hca-PanxαB Hru-PanxαA Lcr-PanxαH
Hvu-PanxβA Hvu-PanxβB Hvu-PanxβC Hvu-PanxβD Hvu-PanxβE Hvu-PanxβF Hvu-PanxβG
BOL-PanxαB Bab-PanxαA Bch-PanxαA Bfo-PanxαE Bfr-PanxαA Dgl-PanxαI Edu-PanxαE
Bfo-PanxαA Bfo-PanxαG Dgl-PanxαA Edu-PanxαD Edu-PanxαI Hru-PanxαC
BOL-PanxαD Bab-PanxαE Bfo-PanxαI Dgl-PanxαD
BOL-PanxαH Dgl-PanxαH Edu-PanxαC

rdmcl_clusters_file: Path to RD-MCL final_clusters.txt file. These files can look a little bit different than the example above.

$: cat final_clusters.txt

group_0_1  70.0358 BOL-PanxαC Bab-PanxαD Bch-PanxαE Bfo-PanxαD Bfr-PanxαC Cfu-PanxαC Dgl-PanxαG Edu-PanxαB Hca-PanxαH
group_0_3  45.8238 BOL-PanxαA Bab-PanxαB Bch-PanxαC Bfo-PanxαB Dgl-PanxαE Edu-PanxαA Hca-PanxαB Hru-PanxαA Lcr-PanxαH
group_0_0  75.9846 Bab-PanxαC Bch-PanxαD Bfo-PanxαC Bfr-PanxαB Cfu-PanxαA Cfu-PanxαD Cfu-PanxαF
group_0_18 2.0734  Hvu-PanxβA Hvu-PanxβB Hvu-PanxβC Hvu-PanxβD Hvu-PanxβE Hvu-PanxβF Hvu-PanxβG
group_0_2  54.1235 BOL-PanxαB Bab-PanxαA Bch-PanxαA Bfo-PanxαE Bfr-PanxαA Dgl-PanxαI Edu-PanxαE
group_0_5  20.471  BOL-PanxαD Bab-PanxαE Bfo-PanxαI Dgl-PanxαD
group_0_6  13.037  BOL-PanxαH Dgl-PanxαH Edu-PanxαC
group_0_7  9.316   Bfo-PanxαG Dgl-PanxαA
group_0_19 2.9556  Hru-PanxαC
group_0_20 1.642   Edu-PanxαI
group_0_23 1.642   Edu-PanxαD
group_0_26 2.463   Cfu-PanxαB
group_0_30 1.4778  Bfo-PanxαA

args: All flagged arguments are explained in detail below.

Arguments

-s, --score

Output a table with a set of similarity metrics.

Metric	Description
Precision	TP / (TP + FP)
Recall	TP / (TP + FN)
Accuracy	((TP + TN) / #seqs) * (#clusters / #seqs)
tn rate	(TN / (TN + FP)) * (#clusters / #seqs)
Query Score	Σ(RD-MCL cluster scores)
True Score	Σ(RD-MCL cluster scores)

Where:

TP = True positives
FP = False positives
TN = True negatives
FN = False negatives
#seqs = Total number of records included
#clusters = Total number of clusters in the final_clusters file

Example with expected output

$: compare_homolog_groups true_clusters_file rdmcl_clusters_file

$: compare_homolog_groups true_clusters_file rdmcl_clusters_file --score

Precision:    88.89%
Recall:       83.54%
Accuracy:     97.22%
tn rate:      97.98%
Query score:  121.21
True score:   119.94