9 Contrast set mining

RuleKit includes an algorithm for contrast set (CS) identification (Gudyś et al., 2024). Currently, this mode is available only through the XML batch interface.

Parameter set definition

Below we present an example parameter set suitable for contrast set mining:

<parameter_set name="Contrast set mining">		
	<param name="induction_measure">Correlation</param>
	<param name="pruning_measure">Correlation</param>
	<param name="voting_measure">Correlation</param>
	<param name="minsupp_all">0.8 0.5 0.2 0.1</param>
	<param name="minsupp_new">0.1</param>
	<param name="max_neg2pos">0.5</param>
	<param name="max_passes_count">5</param>
	<param name="penalty_strength">0.5</param>
	<param name="penalty_saturation">0.2</param>
	
</parameter_set>

Parameter meaning (symbols from the RuleKit-CS manuscript are given in parentheses):

induction_measure / pruning_measure / voting_measure - name of the rule quality measure used during growing/pruning/voting (ignored in regression and survival analysis, where a special measure is used); recommended: Correlation,
minsupp_all - a minimum positive support of a contrast set (p/P). When multiple values are specified, a metainduction is performed; recommended sequence: 0.8, 0.5, 0.2, 0.1,
minsupp_new - a minimum positive support of a contrast set calculated with respect to previously uncovered examples (p_new/P); an alias of the min_rule_covered parameter in traditional rule induction; recommended: 0.1,
max_neg2pos - a maximum ratio of negative to positive supports (nP/pN); recommended: 0.5,
max_passes_count (max-passes) - a maximum number of sequential covering passes for a single minsupp-all value; recommended: 5,
penalty_strength (s) - penalty strength; recommended: 0.5,
penalty_saturation - the value of p_new/P at which penalty reward saturates; recommended: 0.2.
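
The support constraints listed above can be written out explicitly. The block below is a reading aid, not a description of the tool's implementation. It assumes the usual covering notation: p and n are the positive and negative examples covered by a contrast set, P and N are all positive and negative examples, and p_new counts the covered positives not covered by earlier contrast sets. The numbers in the second formula are hypothetical and chosen only for illustration.

```latex
% Constraints implied by minsupp_all, minsupp_new and max_neg2pos
\[
\begin{aligned}
\frac{p}{P} &\ge \texttt{minsupp\_all}, \\
\frac{p_{\text{new}}}{P} &\ge \texttt{minsupp\_new}, \\
\frac{n/N}{p/P} = \frac{nP}{pN} &\le \texttt{max\_neg2pos}.
\end{aligned}
\]

% Hypothetical example: P = 100, N = 200, p = 60, n = 30, p_new = 15
\[
\frac{p}{P} = 0.6, \qquad
\frac{p_{\text{new}}}{P} = 0.15, \qquad
\frac{nP}{pN} = \frac{30 \cdot 100}{60 \cdot 200} = 0.25 .
\]
```

With the recommended values, such a contrast set would fail the support constraint at minsupp_all = 0.8 but satisfy it at 0.5, while also meeting minsupp_new = 0.1 (0.15 ≥ 0.1) and max_neg2pos = 0.5 (0.25 ≤ 0.5).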

Dataset definition

To enable contrast set mining, one needs to specify the contrast_attribute tag in the data set description; it indicates the group attribute. Note that the prediction section is ignored in contrast set mining, so it can be omitted.

In the case of traditional contrast sets, the group attribute must be the same as the discrete label:

<dataset>
	<label>class</label>
	<contrast_attribute>class</contrast_attribute>
	<out_directory>./classification/final/anneal</out_directory>
	<training>
		<report_file>training.log</report_file>
		<train>
			<in_file>../data/classification/anneal.arff</in_file>
			<model_file>anneal.mdl</model_file>
			<model_csv>anneal.csv</model_csv>
		</train>
	</training>
	
</dataset>

For regression problems, one needs to specify a discrete group attribute and a continuous label:

<dataset>
	<label>class</label>
	<contrast_attribute>group</contrast_attribute>
	<out_directory>./regression/final/plastic</out_directory>
	<training>
		<report_file>plastic.train.log</report_file>
		<train>
			<in_file>../data/regression/plastic.arff</in_file>
			<model_file>plastic.mdl</model_file>
			<model_csv>plastic.csv</model_csv>
		</train>
	</training>
</dataset>

When analyzing survival data, the following attributes are needed: a discrete group attribute, a binary survival status (label), and a continuous survival time:

<dataset>
	<survival_time>survival_time</survival_time>
	<label>survival_status</label>
	<contrast_attribute>group</contrast_attribute>
	<out_directory>./survival/final/actg320</out_directory>
	<training>
		<report_file>actg320.train.log</report_file>
		<train>
			<in_file>../data/survival/actg320.arff</in_file>
			<model_file>actg320.mdl</model_file>
			<model_csv>actg320.csv</model_csv>
		</train>
	</training>
</dataset>
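
The parameter set and data set fragments above are parts of a single experiment file. A minimal sketch of how they might be assembled is given below, using the classification example. The `<experiment>`, `<parameter_sets>`, and `<datasets>` wrapper elements are an assumption based on the general layout of the XML batch interface and should be checked against the batch interface documentation; everything inside them is copied from the fragments on this page.

```xml
<?xml version="1.0"?>
<!-- Assumed wrapper elements: experiment, parameter_sets, datasets -->
<experiment>
	<parameter_sets>
		<!-- Parameter set from the "Parameter set definition" section -->
		<parameter_set name="Contrast set mining">
			<param name="induction_measure">Correlation</param>
			<param name="pruning_measure">Correlation</param>
			<param name="voting_measure">Correlation</param>
			<param name="minsupp_all">0.8 0.5 0.2 0.1</param>
			<param name="minsupp_new">0.1</param>
			<param name="max_neg2pos">0.5</param>
			<param name="max_passes_count">5</param>
			<param name="penalty_strength">0.5</param>
			<param name="penalty_saturation">0.2</param>
		</parameter_set>
	</parameter_sets>
	<datasets>
		<!-- Traditional contrast sets: the group attribute equals the label -->
		<dataset>
			<label>class</label>
			<contrast_attribute>class</contrast_attribute>
			<out_directory>./classification/final/anneal</out_directory>
			<training>
				<report_file>training.log</report_file>
				<train>
					<in_file>../data/classification/anneal.arff</in_file>
					<model_file>anneal.mdl</model_file>
					<model_csv>anneal.csv</model_csv>
				</train>
			</training>
			<!-- No prediction section: it is ignored in contrast set mining -->
		</dataset>
	</datasets>
</experiment>
```

For regression or survival data, only the `<dataset>` element would change, following the corresponding fragments above.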

Dataset availability and experiments

All the data sets investigated in the RuleKit-CS study can be downloaded from here. This location also contains an XLS spreadsheet with detailed information on the data sets. The XML files defining the experiments from the paper are available in the examples/contrast-sets repository folder. To reproduce the analyses, copy the RuleKit jar file into that folder and run:

java -jar rulekit-<version>-all.jar <experiments.xml>

with <experiments.xml> being the name of the XML file for a particular experiment.

References

Gudyś, A., Sikora, M., Wróbel, Ł. (2024) Separate and conquer heuristic allows robust mining of contrast sets in classification, regression, and survival data. Expert Systems with Applications, 248: 123376.