3 R package - adaa-polsl/RuleKit GitHub Wiki

3.1. Installation

The RuleKit R package requires R 3.3.x or later. On Linux, the curl, ssl, and xml system development packages are additionally required to build it. For instance, under Ubuntu 18.04, execute the following in a terminal:

sudo apt-get install libcurl4-gnutls-dev
sudo apt-get install libssl-dev
sudo apt-get install libxml2-dev

In other distributions, package names may differ slightly. To build the package, download the rulekit-<version>-all.jar file from the releases folder and copy it to the ./r-package/inst/java/ directory of the repository. Then open the ./r-package/rulekit.Rproj project in RStudio and install all required dependencies:

install.packages(c('RWeka','XML','caret','rprojroot','devtools'))

and build the package with the Install and Restart button (the appropriate version of RTools will be downloaded automatically if it is not present on the target platform). RuleKit will be installed in the default R package directory.
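Before building, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch of such a check, not part of RuleKit itself. It assumes the working directory is the root of the cloned repository; the variable names are illustrative only.

```r
# Minimum R version stated in this section.
r_ok <- getRversion() >= "3.3.0"
if (!r_ok) message("R 3.3.x or later is required, found ", getRversion())

# R packages named in the install.packages() call above.
deps <- c("RWeka", "XML", "caret", "rprojroot", "devtools")
missing_deps <- setdiff(deps, rownames(installed.packages()))

if (length(missing_deps) > 0) {
  message("Missing packages: ", paste(missing_deps, collapse = ", "))
  # install.packages(missing_deps)  # uncomment to install them
} else {
  message("All R dependencies are installed.")
}

# Look for the rulekit-<version>-all.jar file in the expected directory.
# Assumption: the script is run from the repository root.
jars <- list.files("./r-package/inst/java",
                   pattern = "^rulekit-.*-all\\.jar$")
if (length(jars) == 0) {
  message("No rulekit-<version>-all.jar found in ./r-package/inst/java/")
}
```

The check only reports what is missing; installation and the build itself still follow the steps described above.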

3.2. Usage

After installation, the package is loaded with the library(adaa.rules) command. Input datasets have to be standard R data frames in which columns represent variables and rows represent examples.
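As an illustration of the expected layout, here is a small hand-made data frame with one row per example and one column per variable. All column names and values are invented for this sketch. Storing nominal attributes as factors is a common R convention used here, not a documented RuleKit requirement.

```r
# Toy dataset: four examples (rows) described by three variables (columns).
# All names and values here are made up for illustration.
toy <- data.frame(
  age     = c(34, 51, 47, 29),
  smoker  = factor(c("no", "yes", "yes", "no")),
  outcome = factor(c("healthy", "ill", "ill", "healthy"))
)

str(toy)  # shows the type of every variable
dim(toy)  # number of examples, number of variables
```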

Training of the rule-based model is done with the following function:

learn_rules(formula, control = NULL, train_data, test_data = train_data)

Parameters:

  • formula - R model formula. In general, formulae have the form response variable ~ predictor variables. RuleKit, however, ignores the predictor variables part and performs the analysis on the basis of all conditional attributes unless user-guided induction is enabled. Thus, in the case of classification and regression problems, use response variable ~ . as the formula. In the case of survival problems, the user must additionally specify the attribute representing survival status in the formula: survival::Surv(survival time, survival status) ~ .
  • control - named list of induction parameters. When control argument is not specified, default algorithm configuration is used. This parameter is also used when specifying user's requirements in user-guided induction.
  • train_data - training dataset.
  • test_data - testing dataset. If not specified, training set is used for evaluation.
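The parameters above could be put together as in the following sketch. It uses the built-in iris data purely as a stand-in classification dataset, which is an assumption of this example and not something the wiki prescribes. The call to learn_rules is guarded so that the snippet also runs where the package is not installed.

```r
# Classification or regression: response on the left, '.' on the right.
# 'Species' is the class column of the built-in iris data (illustrative).
class_formula <- Species ~ .

# Survival: wrap the time and status columns in survival::Surv().
# The column names are placeholders for those in your own data.
surv_formula <- survival::Surv(survival_time, survival_status) ~ .

# Induction parameters are passed as a named list; min_rule_covered is
# the parameter used in the example section of this page.
control <- list(min_rule_covered = 5)

# Run the induction only if the package is available.
if (requireNamespace("adaa.rules", quietly = TRUE)) {
  # test_data is omitted, so the training set is used for evaluation.
  results <- adaa.rules::learn_rules(class_formula, control,
                                     train_data = iris)
  print(names(results))
}
```

Creating a formula does not evaluate it, so the survival formula can be defined before any dataset is loaded.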

3.3. Example

In this subsection we present a survival analysis of the BMT-Ch dataset with the RuleKit R package. The dataset concerns the analysis of factors contributing to patients’ survival following bone marrow transplants. After loading the package, the survival time and survival status variables are specified and the induction parameters are set. Note that in survival problems, the log-rank statistic is always used as the rule quality measure.

library(adaa.rules)
formula <- survival::Surv(survival_time, survival_status) ~ .
control <- list(min_rule_covered = 5)

In the next step, the analysis is initialized (training and testing performed on the same set) and the results are gathered.

# 'bone_marrow' is assumed to be the data frame holding the BMT-Ch dataset
results <- learn_rules(formula, control, bone_marrow)

rules <- results[["rules"]]            # list of rules
cov <- results[["train-coverage"]]     # coverage of training examples by rules
surv <- results[["estimator"]]         # data frame with survival function estimates
perf <- results[["test-performance"]]  # data frame with performance metrics

Survival function estimates for the entire dataset and for the rules are then plotted (Figure 3.1).

library(ggplot2)
library(reshape2)

# melt dataset for automatic plotting of multiple series
meltedSurv <- melt(surv, id.vars="time")

ggplot(meltedSurv, aes(x=time, y=value, color=variable)) +
  geom_line(size=1.0) +
  xlab("time") + ylab("survival probability") +
  theme_bw() + theme(legend.title=element_blank())

Figure 3.1. Survival function estimates for the entire BMT dataset and for the induced rules.
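To make the melting step more concrete, the sketch below performs the same wide-to-long conversion in base R on a tiny invented table. It assumes the estimator data frame has a time column plus one column per series. The column names entire_set and r1 and all values are hypothetical.

```r
# Hypothetical wide table: one row per time point, one column per series.
toy_surv <- data.frame(
  time       = c(0, 10, 20),
  entire_set = c(1.0, 0.8, 0.6),
  r1         = c(1.0, 0.9, 0.85)
)

# Every column other than 'time' is treated as a separate series.
series <- setdiff(names(toy_surv), "time")

# Long format: one row per (time, series) pair. This mirrors the
# time / variable / value columns that the ggplot call above expects.
long <- data.frame(
  time     = rep(toy_surv$time, times = length(series)),
  variable = factor(rep(series, each = nrow(toy_surv)), levels = series),
  value    = unlist(toy_surv[series], use.names = FALSE)
)

print(long)
```

With the data in this shape, mapping color to variable draws one line per series.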

The complete R script for the survival analysis of the BMT dataset can be found here.
