4 Quality and evaluation

4.1. Rule quality

An important factor determining the performance and comprehensibility of the resulting model is the selection of a rule quality measure. RuleKit provides the user with a number of state-of-the-art measures calculated on the basis of the confusion matrix; additionally, it is possible to define one's own measures. The confusion matrix consists of the number of positive and negative examples in the entire training set (P and N) and the number of positive and negative examples covered by the rule (p and n). The measures based on the confusion matrix can be used for classification and regression problems (note that for the former, P and N are fixed for each analyzed class, while for the latter, P and N are determined for every rule on the basis of the covered examples). In the case of survival problems, the log-rank statistic is always used for determining rule quality (for simplicity, all examples are assumed positive, thus N and n equal 0). Below one can find all built-in measures together with their formulas.

| Quality measure | Formula |
|-----------------|---------|
| Accuracy | p - n |
| BinaryEntropy | entropy-based measure; the probabilities can be calculated straightforwardly from the confusion matrix |
| C1 | Coleman * (2 + Kappa) / 3 |
| C2 | Coleman * (P + p) / (2 * P) |
| CFoil | p * (log2(p / (p + n)) - log2(P / (P + N))) |
| CNSignificance | 2 * (p * ln(p * (P + N) / ((p + n) * P)) + n * ln(n * (P + N) / ((p + n) * N))) |
| Coleman | (N * p - P * n) / (N * (p + n)) |
| Correlation | (p * N - n * P) / sqrt(P * N * (p + n) * (P - p + N - n)) |
| Coverage | p / P |
| FBayesianConfirmation | (p * N - n * P) / (p * N + n * P) |
| FMeasure | (β^2 + 1) * Precision * Sensitivity / (β^2 * Precision + Sensitivity) |
| FullCoverage | (p + n) / (P + N) |
| GeoRSS | sqrt((p / P) * (1 - n / N)) |
| GMeasure | p / (p + n + g) |
| JMeasure | (p * ln(p * (P + N) / ((p + n) * P)) + n * ln(n * (P + N) / ((p + n) * N))) / (P + N) |
| Kappa | 2 * (p * N - n * P) / (P * (P + N) + (p + n) * (N - P)) |
| Klosgen | ((p + n) / (P + N))^ω * (p / (p + n) - P / (P + N)) |
| Laplace | (p + 1) / (p + n + 2) |
| Lift | p * (P + N) / ((p + n) * P) |
| LogicalSufficiency | p * N / (n * P) |
| MEstimate | (p + m * P / (P + N)) / (p + n + m) |
| MutualSupport | p / (n + P) |
| Novelty | p / (P + N) - P * (p + n) / (P + N)^2 |
| OddsRatio | p * (N - n) / (n * (P - p)) |
| OneWaySupport | (p / (p + n)) * ln(p * (P + N) / ((p + n) * P)) |
| PawlakDependencyFactor | (p * (P + N) - P * (p + n)) / (p * (P + N) + P * (p + n)) |
| Q2 | |
| Precision | p / (p + n) |
| RelativeRisk | (p / (p + n)) * ((P + N - p - n) / (P - p)) |
| Ripper | (p - n) / (p + n) |
| RuleInterest | p - (p + n) * P / (P + N) |
| RSS | p / P - n / N |
| SBayesian | p / (p + n) - (P - p) / (P - p + N - n) |
| Sensitivity | p / P |
| Specificity | (N - n) / N |
| TwoWaySupport | (p / (P + N)) * ln(p * (P + N) / ((p + n) * P)) |
| WeightedLaplace | (p + 1) * (P + N) / ((p + n + 2) * P) |
| WeightedRelativeAccuracy | ((p + n) / (P + N)) * (p / (p + n) - P / (P + N)) |
| YAILS | |

Here, β, g, m, and ω denote configurable parameters of the corresponding measures.
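
All of these measures are simple functions of p, n, P, and N. For intuition, the sketch below (plain Python, not part of RuleKit itself, which is implemented in Java) evaluates a few of them for a hypothetical rule:

```python
# Minimal sketch (not RuleKit code): selected rule quality measures
# computed from the confusion-matrix values p, n, P, N.
from math import sqrt

def precision(p, n, P, N):
    return p / (p + n)

def correlation(p, n, P, N):
    return (p * N - n * P) / sqrt(P * N * (p + n) * (P - p + N - n))

def c2(p, n, P, N):
    coleman = (N * p - P * n) / (N * (p + n))
    return coleman * (P + p) / (2 * P)

# Hypothetical rule: covers 40 of 50 positives and 5 of 100 negatives.
p, n, P, N = 40, 5, 50, 100
print(precision(p, n, P, N))    # 0.888... - purity of the covered set
print(correlation(p, n, P, N))  # ~0.77 - association with the class
print(c2(p, n, P, N))           # 0.75 - C2 trades precision for coverage
```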

4.2. Model characteristics

These indicators are common to all types of problems; their values are established during model construction.

  • time_total_s - algorithm execution time in seconds,
  • time_growing_s - growing time in seconds,
  • time_pruning_s - pruning time in seconds,
  • #rules - number of rules,
  • #conditions_per_rule - average number of conditions per rule,
  • #induced_conditions_per_rule - average number of induced conditions per rule (before pruning),
  • avg_rule_coverage - average rule full coverage defined as (p + n) / (P + N),
  • avg_rule_precision - average rule precision defined as p / (p + n),
  • avg_rule_quality - average value of the voting measure,
  • avg_pvalue - average rule p-value (see details below),
  • avg_FDR_pvalue - average rule p-value after false discovery rate (FDR) correction,
  • avg_FWER_pvalue - average rule p-value after family-wise error (FWER) correction,
  • fraction_0.05_significant - fraction of significant rules at α = 0.05,
  • fraction_0.05_FDR_significant - fraction of significant rules at 0.05 level (with FDR correction),
  • fraction_0.05_FWER_significant - fraction of significant rules at 0.05 level (with FWER correction).

Rule p-values are determined during model construction using the following tests:

  • classification: Fisher's exact test for comparing confusion matrices,
  • regression: χ² test for comparing the label variance of covered vs. uncovered examples,
  • survival: log-rank test for comparing survival functions of covered vs. uncovered examples.
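
For classification rules, the same test can be reproduced outside RuleKit; below is a minimal sketch using SciPy (an assumption for illustration only, since RuleKit performs this test internally in Java):

```python
# Sketch: significance of a classification rule via Fisher's exact test
# on the 2x2 table of covered/uncovered vs. positive/negative examples.
from scipy.stats import fisher_exact

p, n, P, N = 40, 5, 50, 100       # hypothetical rule statistics
table = [[p, n],                  # covered:   positives, negatives
         [P - p, N - n]]          # uncovered: positives, negatives
odds_ratio, p_value = fisher_exact(table)
print(p_value)                    # below 0.05 -> rule counted as significant
```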

4.3. Performance metrics

Performance metrics are established on the basis of the model outcome and the true example labels. They are specific to the investigated problem type.

Classification

  • accuracy - relative number of correctly classified examples among all examples,
  • classification_error - equal to 1 minus accuracy,
  • balanced_accuracy - averaged relative numbers of correctly classified examples from all classes,
  • kappa - kappa statistic for multi-class problems,
  • #rules_per_example - average number of rules covering an example,
  • #voting_conflicts - number of voting conflicts (example covered by rules pointing to different classes),
  • #negative_voting_conflicts - number of voting conflicts resolved incorrectly,
  • cross-entropy - cross-entropy for the predictions of a classifier,
  • margin - the margin of a classifier, defined as the minimal confidence for the correct label,
  • soft_margin_loss - the soft margin loss of a classifier, defined as the average over all 1 - y * f(x),
  • logistic_loss - the logistic loss of a classifier, defined as the average over all ln(1 + exp(-y * f(x))).
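
For clarity, the three loss-style metrics can be written out directly; the sketch below assumes labels y in {-1, +1} and a real-valued confidence f(x) for the positive class (an illustration, not RuleKit's internal code):

```python
# Sketch of the loss-style metrics, assuming y in {-1, +1} and f(x)
# a real-valued confidence for the positive class.
import math

def margin(ys, fs):
    # minimal confidence for the correct label
    return min(y * f for y, f in zip(ys, fs))

def soft_margin_loss(ys, fs):
    # average over all 1 - y * f(x), as defined above
    return sum(1 - y * f for y, f in zip(ys, fs)) / len(ys)

def logistic_loss(ys, fs):
    # average over all ln(1 + exp(-y * f(x)))
    return sum(math.log(1 + math.exp(-y * f)) for y, f in zip(ys, fs)) / len(ys)
```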

In binary classification problems, some additional metrics are computed:

  • precision (positive predictive value, PPV) - relative number of examples correctly classified as positive among all examples classified as positive,
  • sensitivity (recall, true positive rate, TPR) - relative number of examples correctly classified as positive among all positive examples,
  • specificity (selectivity, true negative rate, TNR) - relative number of examples correctly classified as negative among all negative examples,
  • negative_predictive_value (NPV) - relative number of examples correctly classified as negative among all examples classified as negative,
  • fallout (false positive rate, FPR) - relative number of examples incorrectly classified as positive among all negative examples,
  • youden - the sum of sensitivity and specificity minus 1,
  • geometric_mean - geometric mean of sensitivity and specificity,
  • psep - the sum of the positive predictive value and the negative predictive value minus 1,
  • lift - the lift of the positive class,
  • f_measure - F1-score; combination of precision and recall: F1 = 2 * PPV * TPR / (PPV + TPR),
  • false_positive - absolute number of examples incorrectly classified as positive,
  • false_negative - absolute number of examples incorrectly classified as negative,
  • true_positive - absolute number of examples correctly classified as positive,
  • true_negative - absolute number of examples correctly classified as negative.
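
All of the binary metrics derive from the four absolute counts; a short self-contained sketch (hypothetical helper, not RuleKit's API):

```python
# Sketch: derived binary classification metrics from the four counts.
def binary_metrics(tp, fp, tn, fn):
    sens = tp / (tp + fn)   # sensitivity (recall, TPR)
    spec = tn / (tn + fp)   # specificity (TNR)
    ppv = tp / (tp + fp)    # precision (PPV)
    npv = tn / (tn + fn)    # negative predictive value
    return {
        "sensitivity": sens,
        "specificity": spec,
        "youden": sens + spec - 1,
        "geometric_mean": (sens * spec) ** 0.5,
        "psep": ppv + npv - 1,
        "f_measure": 2 * ppv * sens / (ppv + sens),
    }

print(binary_metrics(tp=40, fp=5, tn=95, fn=10))
```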

Regression

  • absolute_error - the average absolute difference between the prediction and the actual value,
  • relative_error - the average absolute difference between the prediction and the actual value divided by the actual value; the relative error of label 0 and prediction 0 is defined as 0, the relative error of label 0 and prediction != 0 is infinite,
  • relative_error_lenient - the average absolute difference between the prediction and the actual value divided by the maximum of the actual value and the prediction,
  • relative_error_strict - the average absolute difference between the prediction and the actual value divided by the minimum of the actual value and the prediction,
  • normalized_absolute_error - the absolute error normalized by the error of simply predicting the average of the actual values,
  • squared_error - the average of the squared differences between the prediction and the actual value,
  • root_mean_squared_error - the square root of the average squared error,
  • root_relative_squared_error - the total squared error made relative to what the error would have been if the prediction had been the average of the actual values,
  • correlation - the empirical correlation coefficient r between label and prediction,
  • squared_correlation - the square of the empirical correlation coefficient r between label and prediction.
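
The regression metrics follow the usual definitions; a NumPy sketch (illustration only, with made-up numbers):

```python
# Sketch: regression error metrics for labels y and predictions yhat.
import numpy as np

y = np.array([1.0, 2.0, 4.0, 8.0])      # actual labels (made-up)
yhat = np.array([1.5, 1.5, 4.5, 7.0])   # predictions (made-up)

absolute_error = np.mean(np.abs(yhat - y))
squared_error = np.mean((yhat - y) ** 2)
root_mean_squared_error = np.sqrt(squared_error)
# absolute error relative to always predicting the label average
normalized_absolute_error = np.sum(np.abs(yhat - y)) / np.sum(np.abs(np.mean(y) - y))
relative_error_lenient = np.mean(np.abs(yhat - y) / np.maximum(np.abs(y), np.abs(yhat)))
correlation = np.corrcoef(y, yhat)[0, 1]
```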

Survival

  • integrated_brier_score (IBS) - the Brier score (BS) represents the squared difference between the true event status at time T and the predicted event status at that time; the integrated Brier score summarizes the prediction error over all observations and over all times in a test set.
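
A simplified sketch of the idea (censoring weights, which a complete IBS implementation must apply, are deliberately omitted here):

```python
# Simplified sketch of the integrated Brier score (IBS). The true event
# status at time t is 1 while an observation is still event-free and 0
# afterwards; censoring (IPCW) weights are omitted for brevity.
import numpy as np

def brier_score(t, event_times, surv_prob_at_t):
    status = (event_times > t).astype(float)        # true status at time t
    return np.mean((status - surv_prob_at_t) ** 2)  # squared difference

def integrated_brier_score(times, event_times, surv_probs):
    # surv_probs[i]: predicted survival probabilities at times[i]
    bs = [brier_score(t, event_times, s) for t, s in zip(times, surv_probs)]
    return np.trapz(bs, times) / (times[-1] - times[0])
```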