1 Batch interface

1.1. General information

In order to use the batch mode, please download the rulekit-<version>-all.jar file from the releases folder. Alternatively, one can build the package from the sources by running the following command in the adaa.analytics.rules directory of this repository. Windows:

gradlew -b build.gradle rjar

Linux:

./gradlew -b build.gradle rjar

The JAR file will be placed in the adaa.analytics.rules/build/libs subdirectory.

To run the analysis, execute the following command:

java -jar rulekit-<version>-all.jar experiments.xml

where experiments.xml is an XML file describing the experimental setting. Ignore the SLF4J warning reported on the console - it does not affect the procedure. The batch mode allows investigating multiple datasets with multiple sets of induction parameters. The general XML structure is as follows:

<experiment>
	<parameter_sets>
		<parameter_set name="paramset_1">...</parameter_set>
		<parameter_set name="paramset_2">...</parameter_set>
		...
	</parameter_sets>

	<datasets>
		<dataset>...</dataset>
		<dataset>...</dataset>
		...
	</datasets>
</experiment>

1.2. Parameter set definition

This section allows the user to specify induction parameters. The package enables testing multiple parameter sets in a single run. The definition of a single parameter set is presented below. Every parameter has a default value, thus only selected parameters need to be specified by the user.

<parameter_set name="paramset_1">
  	<param name="min_rule_covered">...</param>
  	<param name="max_uncovered_fraction">...</param>
	<param name="max_growing">...</param>
  	<param name="induction_measure">...</param>
	<param name="pruning_measure">...</param>
  	<param name="voting_measure">...</param>
	<param name="user_induction_equation">...</param>
	<param name="user_pruning_equation">...</param>
	<param name="user_voting_equation">...</param>
	<param name="ignore_missing">...</param>
	<param name="select_best_candidate">...</param>
	<param name="complementary_conditions">...</param>
	<param name="mean_based_regression">...</param>
	<param name="control_apriori_precision">...</param>
	<param name="max_rule_count">...</param>
	<param name="approximate_induction">...</param>
	<param name="approximate_bins_count">...</param>
</parameter_set>

where:

  • min_rule_covered (aliases: minsupp_new, mincov_new) - a minimum number (or fraction, if the value is < 1.0) of previously uncovered examples that must be covered by a new rule (positive examples for classification problems); default: 5,
  • max_uncovered_fraction - a floating-point number from the [0,1] interval representing the maximum fraction of examples that may remain uncovered by the rule set; default: 0,
  • max_growing - a non-negative integer representing the maximum number of conditions which can be added to a rule in the growing phase (use this parameter for large datasets if the execution time is prohibitive); 0 indicates no limit; default: 0,
  • induction_measure/pruning_measure/voting_measure - name of the rule quality measure used during growing/pruning/voting (ignored in survival analysis, where the log-rank statistic is used); default: Correlation,
  • user_induction_equation/user_pruning_equation/user_voting_equation (RuleKit < 2.0) - equation of a user-defined quality measure; applies only when the corresponding measure parameter has the value UserDefined; the equation must be a mathematical expression built from the p, n, P, N literals (elements of the confusion matrix), operators, numbers, and library functions (sin, log, etc.),
  • user_induction_class/user_pruning_class/user_voting_class (RuleKit >= 2.0) - names of classes implementing a user-defined quality measure; each class must implement the adaa.analytics.rules.logic.quality.IUserMeasure interface; applies only when the corresponding measure parameter has the value UserDefined (see the snippet after this list),
  • ignore_missing - a boolean indicating whether missing values should be ignored (by default, a missing value of a given attribute is always considered as not fulfilling the condition built upon that attribute); default: false,
  • select_best_candidate - if enabled, the rule of the highest quality encountered during growing is returned instead of the fully grown rule; default: false,
  • complementary_conditions - if enabled, complementary conditions of the form a = !{value} for nominal attributes are supported; default: false,
  • mean_based_regression - enables fast induction of mean-based regression rules instead of the default median-based ones; default: true,
  • control_apriori_precision - when inducing classification rules, verifies whether the candidate precision is higher than the apriori precision of the investigated class; default: true,
  • max_rule_count - the maximum number of rules to be generated (for classification data sets it applies to a single class); the rules are generated so as to cover at least a 1 - max_uncovered_fraction fraction of the examples; 0 indicates no rule limit; default: 0,
  • approximate_induction - use an approximate induction heuristic which does not check all possible splits; note: this is an experimental feature which currently works only for classification data sets, and its results may change in the future; default: false,
  • approximate_bins_count - the maximum number of bins for an attribute evaluated in the approximate induction; default: 100.
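For instance, the fragment below sketches how a user-defined measure may be wired into a parameter set. The class name com.example.MyPrecision is a hypothetical placeholder for the user's own implementation, and the two variants are alternatives depending on the RuleKit version:

<parameter_set name="user-defined pruning">
	<param name="pruning_measure">UserDefined</param>
	<!-- RuleKit < 2.0: the measure is given as an equation over p, n, P, N -->
	<param name="user_pruning_equation">p / (p + n)</param>
	<!-- RuleKit >= 2.0: the measure is given as a class implementing IUserMeasure -->
	<param name="user_pruning_class">com.example.MyPrecision</param>
</parameter_set>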

Additional parameters concerning user-guided generation of rules are described in this section.

1.3. Dataset definition

Definition of a dataset has the following form.

<dataset>
	<label>...</label>
	<weight>...</weight>
	<survival_time>...</survival_time>
	<contrast_attribute>...</contrast_attribute>
	<ignore>...</ignore>
	<out_directory>...</out_directory>

	<training>
		<report_file>...</report_file>
		<train>
			<in_file>...</in_file>
			<model_file>...</model_file>
			<model_csv>...</model_csv>
		</train>
		...
	</training>

	<prediction>
		<performance_file>...</performance_file>
		<predict>
			<model_file>...</model_file>
			<test_file>...</test_file>
			<predictions_file>...</predictions_file>
		</predict>
		...
	</prediction>
</dataset>

There are three main parts of the dataset definition: the general properties, the training section, and the prediction section. The general parameters and at least one of the latter two sections must be specified.

General properties

The general dataset properties are listed below (a short sketch follows the list):

  • label - name of the label attribute,
  • weight - optional weight attribute,
  • survival_time - name of the survival time attribute; its presence indicates a survival analysis problem,
  • contrast_attribute - name of the contrast group attribute; its presence indicates a contrast set mining problem,
  • ignore - an optional comma-separated list of attributes to be ignored,
  • out_directory - output directory for storing results; subdirectories for all parameter sets are created automatically inside it.
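As an illustration, the fragment below sketches the general properties of a hypothetical survival dataset; all attribute names and the output path are placeholders:

<dataset>
	<label>survival_status</label>
	<survival_time>survival_time</survival_time>
	<ignore>patient_id</ignore>
	<out_directory>./results/survival</out_directory>
	...
</dataset>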

Training section

The training section allows generating models on the specified training sets. It consists of the report_file field and any number of train subsections. Each train subsection is defined by:

  • in_file - full path to the training file (in ARFF, CSV, or XLS format),
  • model_file - name of the output binary model file (without the full path); for each parameter set, a separate model is generated under the location <out_directory>/<parameter_set name>/<model_file>,
  • model_csv - name of the output tabular model file (without the full path); the table contains a list of rules with some additional statistics; for each parameter set, a separate model is generated under the location <out_directory>/<parameter_set name>/<model_csv>.

The report_file is created for each parameter set under the <out_directory>/<parameter_set name>/<report_file> location. It contains a common text report for all training files: rule sets, model characteristics, detailed coverage information, training set prediction quality, KM-estimators (for survival problems), etc. Details on its content can be found here.
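For example, a training section with two training files (e.g., two pre-existing cross-validation splits; the file names below are placeholders) produces two models and one common report per parameter set:

<training>
	<report_file>training.log</report_file>
	<train>
		<in_file>../data/deals/deals-train-fold1.arff</in_file>
		<model_file>deals-fold1.mdl</model_file>
	</train>
	<train>
		<in_file>../data/deals/deals-train-fold2.arff</in_file>
		<model_file>deals-fold2.mdl</model_file>
	</train>
</training>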

Prediction section

The prediction section allows making predictions on the specified testing sets using models generated in the training section. It consists of the performance_file field and any number of predict subsections. Each predict subsection is defined by:

  • model_file - name of the input binary model file generated in the training part; for each parameter set, the model is looked up under the location <out_directory>/<parameter_set name>/<model_file>,
  • test_file - full path to the testing file (in ARFF, CSV, or XLS format),
  • predictions_file - name of the output data file with predictions (without the full path); for each parameter set, a prediction file is generated under the location <out_directory>/<parameter_set name>/<predictions_file>.

The performance_file is created for each parameter set under the <out_directory>/<parameter_set name>/<performance_file> location. It contains a common CSV report for all testing files with the values of performance measures. All the information concerning the performance report can be found in this section.
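A matching prediction section applies each of the models to its corresponding testing file; the file names are placeholders consistent with the training sketch above:

<prediction>
	<performance_file>performance.csv</performance_file>
	<predict>
		<model_file>deals-fold1.mdl</model_file>
		<test_file>../data/deals/deals-test-fold1.arff</test_file>
		<predictions_file>deals-pred-fold1.arff</predictions_file>
	</predict>
	<predict>
		<model_file>deals-fold2.mdl</model_file>
		<test_file>../data/deals/deals-test-fold2.arff</test_file>
		<predictions_file>deals-pred-fold2.arff</predictions_file>
	</predict>
</prediction>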

1.4. Example

Here we present how to prepare the XML experiment file for an example classification analysis. The investigated dataset, named deals, concerns the problem of predicting whether a person making a purchase will be a future customer. The entire dataset is split into train and test parts and can be found here.

Let the user be interested in two parameter sets:

  • mincov = 5 with RSS measure used for growing, pruning, and voting,
  • mincov = 8 with BinaryEntropy measure used for growing, user-defined measure described by the equation 2p/n for pruning, and C2 for voting.

The corresponding parameter set definition is as follows:

<parameter_sets>
	<parameter_set name="mincov=5, RSS">
		<param name="min_rule_covered">5</param>
		<param name="induction_measure">RSS</param>
		<param name="pruning_measure">RSS</param>
		<param name="voting_measure">RSS</param>
	</parameter_set>
	
	<parameter_set name="mincov=8, Entropy_User_C2">
		<param name="min_rule_covered">8</param>
		<param name="induction_measure">BinaryEntropy</param>
		<param name="pruning_measure">UserDefined</param>
		<param name="user_pruning_equation">2 * p / n</param>
		<param name="voting_measure">C2</param>
	</parameter_set>
	
</parameter_sets>

The experiment will be performed according to the following scheme:

  • name of the label attribute: Future Customer,
  • no weighting,
  • output directory: ./results-minimal/deals,
  • training set: ../data/deals/deals-train.arff,
  • training report file: training.log,
  • testing set: ../data/deals/deals-test.arff,
  • testing performance file: performance.csv.

The corresponding dataset definition is as follows:

<dataset>
	<label>Future Customer</label>
	<out_directory>./results-minimal/deals</out_directory>
	<training>
		<report_file>training.log</report_file>
		<train>
			<in_file>../data/deals/deals-train.arff</in_file>
			<model_file>deals.mdl</model_file>
		</train>
	</training>
	<prediction>
		<performance_file>performance.csv</performance_file>
		<predict>
			<model_file>deals.mdl</model_file>
			<test_file>../data/deals/deals-test.arff</test_file>
			<predictions_file>deals-pred.arff</predictions_file>
		</predict>
	</prediction>
</dataset>

In the training phase, RuleKit generates a subdirectory in the output directory for every investigated parameter set. Each of these subdirectories contains models (one per training file) and a common text report. Therefore, the following files are produced as a result of the training:

  • ./results-minimal/deals/mincov=5, RSS/deals.mdl
  • ./results-minimal/deals/mincov=5, RSS/training.log
  • ./results-minimal/deals/mincov=8, Entropy_User_C2/deals.mdl
  • ./results-minimal/deals/mincov=8, Entropy_User_C2/training.log

In the prediction phase, the previously generated models are applied to the specified testing sets, producing the following files:

  • ./results-minimal/deals/mincov=5, RSS/deals-pred.arff
  • ./results-minimal/deals/mincov=5, RSS/performance.csv
  • ./results-minimal/deals/mincov=8, Entropy_User_C2/deals-pred.arff
  • ./results-minimal/deals/mincov=8, Entropy_User_C2/performance.csv

The complete experiment definition in the XML format is available in the minimal-deals.xml file.

The presented approach can easily be adapted to a cross-validation experiment with existing splits by specifying multiple train and predict subsections. An example can be seen in the guider-seismic-bumps.xml file. The problem concerns forecasting high-energy seismic bumps in coal mines. Note that since the dataset is large and there are several parameter sets to be investigated (including user-guided ones), this example may take a long time to finish.
