Python Library Usage - xcsf-dev/xcsf GitHub Wiki

Version 1.4.7

The XCSF Python library has been designed to be as familiar as possible to users of scikit-learn and Keras. This document contains a brief guide to the functions available and the ways in which it can be customised.

Older versions:

Help

Documentation can be displayed within Python:

import xcsf
help(xcsf.xcsf)

Constructor

Specifying the input dimensionality x_dim, the output dimensionality y_dim, and the number of actions (or classes) n_actions is required. Other parameters may be specified when changing from the default value.

Example

import xcsf

xcs = xcsf.XCS(x_dim=1, y_dim=1, n_actions=1)

Library Stub

def __init__(
    self,
    #############################################
    # Required
    #############################################
    x_dim: int = 1,  # input dimensionality
    y_dim: int = 1,  # output dimensionality
    n_actions: int = 1,  # number of actions or classes; 1 for supervised learning
    #############################################
    # General
    #############################################
    omp_num_threads: int = 8,  # number of CPU cores to use
    random_state: int | None = None,  # sets random number seed; uses current time if None or negative
    population_file: str = "",  # JSON file with an initial population set
    pop_init: bool = True,  # whether to seed the population with random rules
    max_trials: int = 100000,  # number of trials to execute for each xcs.fit()
    perf_trials: int = 1000,  # number of trials to average performance output
    pop_size: int = 2000,   # maximum population size
    set_subsumption: bool = False,  # whether to perform set subsumption
    theta_sub: int = 100,  # minimum experience of a classifier to become a subsumer
    #############################################
    # Error Function
    #############################################
    loss_func: str = "mae",
    # "mae" = mean absolute error
    # "mse" = mean squared error
    # "rmse" = root mean squared error
    # "log" = log loss (cross-entropy)
    # "binary_log" = binary log loss
    # "onehot" = one-hot encoding classification error
    # "huber" = Huber error
    huber_delta: float = 1,  # delta parameter for Huber error calculation
    #############################################
    # Multi-step Problems
    #############################################
    teletransportation: int = 50,  # num steps to reset a multistep problem if goal not found
    gamma: float = 0.95,  # discount factor in calculating the reward for multistep problems
    p_explore: float = 0.9,  # probability of exploring vs. exploiting in a multistep trial
    #############################################
    # Classifier
    #############################################
    e0: float = 0.01,  # target error, under which accuracy is set to 1
    alpha: float = 0.1,  # accuracy offset for rules above E0 (1=disabled)
    nu: float = 5,  # accuracy slope for rules with error above E0
    beta: float = 0.1,  # learning rate for updating error, fitness, and set size
    delta: float = 0.1,  # fraction of least fit classifiers to increase deletion vote
    theta_del: int = 20,  # min experience before fitness used in probability of deletion
    init_fitness: float = 0.01,  # initial classifier fitness
    init_error: float = 0,  # initial classifier error
    m_probation: int = 10000,  # trials since creation a rule must match at least 1 input or be deleted
    stateful: bool = True,  # whether classifiers should retain state across trials
    compaction: bool = False,  # if enabled and sys err < E0, the largest of 2 roulette spins is deleted
    #############################################
    # Evolutionary Algorithm
    #############################################
    ea: dict = {
        "select_type": "roulette",  # parental selection; options = {"roulette", "tournament"}
        "select_size": 0.4,  # fraction of set size for tournament parental selection
        "theta_ea": 50,  # average set time between EA invocations
        "lambda": 2,  # number of offspring to create each EA invocation (use multiples of 2)
        "p_crossover": 0.8,  # probability of applying crossover (usage depends on representation)
        "err_reduc": 1.0,  # amount to reduce an offspring error (1=disabled)
        "fit_reduc": 0.1,  # amount to reduce an offspring fitness (1=disabled)
        "subsumption": False,  # whether to try and subsume offspring classifiers
        "pred_reset": False,  # whether to reset offspring predictions instead of copying
    },  
    #############################################
    # Action: See below for options.
    #############################################
    action: dict = {
        "type": "integer",
    },
    #############################################
    # Condition: See below for options.
    #############################################
    condition: dict = {
        "type": "hyperrectangle_csr",
        "args": {
            "eta": 0.0,
            "min": 0.0,
            "max": 1.0,
            "spread_min": 0.1,
        },
    },
    #############################################
    # Prediction: See below for options.
    #############################################
    prediction: dict = {
        "type": "nlms_linear",
        "args": {
            "x0": 1.0,
            "eta": 0.1,
            "evolve_eta": True,
            "eta_min": 1e-05,
        },
    },
) -> None: ...

Notes

If n_actions > 1 the system will be configured for Reinforcement Learning and action sets will be constructed from the match sets using the chosen representation as in traditional XCS. If n_actions = 1 the system will be configured for Supervised Learning and the action component will be removed. That is, only match sets will be created as in traditional XCSF.

The loss_func is used when calculating each classifier's error for fitness determination, and therefore affects the reproductive success with the evolutionary algorithm (EA). It is also used when measuring the system performance; for example, when comparing the fitness weighted match set predictions with the true values in score().

However, the loss_func is not used when updating classifier predictions since these updates depend on the representation chosen. For example, Constant predictions minimise the mean absolute error, whereas NLMS and RLS minimise the mean squared error. Neural Networks also minimise the mean squared error, however if using a one-hot encoding and a softmax layer, it will minimise the mean log loss (cross-entropy).

A given random_state is not guaranteed to generate reproducible behaviour across different platforms (e.g., Linux vs. Windows) due to compiler optimisations, etc.

Hyperparameters

The default hyperparameters are not intended as general values suitable for all problems and must be set appropriately for the specific learning task. This is best done in a systematic way, e.g., as either a grid search or using a hyperparameter tuning framework such as Optuna. However, below are some general hints that may help finding optimal values.

One of the most important hyperparameters pop_size will vary considerably depending on the representation chosen (e.g., evolving hyperrectangle conditions will require larger rule sets than neural networks) and also based on problem complexity. Faster convergence will generally be seen with larger values, although with a higher resultant computational cost. Note that LCS are notorious for requiring much larger population sizes than typical EAs. For example, Butz et al. (2008) use pop_size = 6400 to solve a number of sine function regression tasks with hyperellipsoidal conditions and RLS predictions. The first priority should be to achieve acceptable and stable performance, and then rule compaction algorithms can be applied subsequently if necessary.

Another of the most important hyperparameters is max_trials, which specifies the total number of "trials" to execute. For supervised learning, a trial is a single iteration where a single sample is drawn and a match set created and updated; see Fitting for more information. For reinforcement learning, a trial (also known as an episode) consists of one or more steps, where each step involves creating a match set and action sets based on the current environment state. Similar to pop_size, the value of max_trials will vary significantly based on both the representation and problem complexity.

Increasing theta_ea will slow down the rate of EA invocation. In simple problems this will lead to slower convergence, but in more complex ones it may be necessary to provide enough samples for the classifier predictions to converge correctly and therefore provide a more accurate error/fitness calculation. In simple problems, values of 25 or 50 are typically used, whereas in more complex problems, values of 100, 500, or even larger may be required.

The hyperparameter theta_del is closely related to theta_ea because it controls the minimum experience a classifier can have before fitness is used in the probability of deletion. As a general rule of thumb, these two parameters should usually be similar values. Also closely related is beta which controls the update rate for classifier metrics such as error and set size. Therefore, small values of theta_ea will require larger values of beta to more quickly update classifier metrics, but with larger values of theta_ea it can be reduced to allow a smoother and more accurate fitness determination. A good starting point for beta is 0.1; in very simple problems a value of 0.2 may converge faster, but in more complex problems values of 0.05 and smaller may be beneficial with larger theta_ea.

For regression tasks, it is typical to set alpha to 1, which disables the accuracy offset since there is almost always no cliff edge in the fitness landscape as compared with predicting a different label for classification. In reinforcement learning however, the offset can be useful for stabilising performance and typically a value of 0.1 is used.

The hyperparameter nu controls the exponent used in the accuracy calculation, which affects the slope of the accuracy curve. Larger values therefore increase the difference in fitness between rules of similar errors. This increases the pressure toward evolving accurate rules and consequently increase the pressure to niche during the early stages of learning. In some early papers comparing roulette wheel and tournament selection for EA parental select_type there was a debate about whether tournament selection was faster and more robust, but it was later shown that when nu is correctly set the differences are minimal. nu can therefore be seen to have a similar affect on roulette wheel selection as select_size does for tournament. Generally nu = 5 provides a good performance without compromising diversity by increasing the fitness pressure too high, however there are occasionally some problems where higher values such as 10, 20, or even 50, can increase performance.

A value of 0.1 for delta is almost always used, which results in the 10% least fit classifiers in the population receiving a bigger deletion vote and are therefore more likely to be removed from the population. Disabling this by using a value of 0 will likely slow convergence.

Setting Parameters

The parameters listed above can be modified on an instantiated xcsf.XCS object using set_params()

For example:

xcs.set_params(beta=0.5)

Note that while EA parameters require specifying a dictionary, only the values that should be modified need to be present, for example only changing theta_ea:

xcs.set_params(ea={"theta_ea": 100})

However, setting the action, condition, and prediction requires providing a complete dict including the type and args.

Library Stub:

def set_params(**kwargs) -> xcsf.XCS: ...

Initialising Conditions

Always match (dummy)

The use of always matching conditions results in the match set being equal to the population set, i.e., [M] = [P]. The EA and classifier updates are thus performed within [P], and global models are designed (e.g., neural networks) that cover the entire state-space. This configuration operates as a more traditional EA, which can be useful for debugging and benchmarking.

Additionally, a single global model (e.g., a linear regression) can be fit by also setting pop_size = 1 and disabling the EA by setting the invocation frequency to a larger number than will ever be executed, e.g., ea = {"theta_ea": 5000000}. This can also be useful for debugging and benchmarking.

condition = {
    "type": "dummy",
}

Ternary Bitstrings

With ternary bitstrings, each classifier's condition is represented as $cl.C \in \{0,1,\#\}^L$ where the length of the string $L$ is equal to the x_dim multiplied by the number of encoding bits.

For binary problems, the number of encoding bits is simply: bits = 1. For real-valued inputs, the values are binarised to the specified number of bits with the assumption that the inputs are in the range [0,1]. For example with bits = 2, an input vector [0.23,0.76,0.45,0.5] will be converted to [0,0,1,1,0,1,0,1] before being tested for matching with the ternary bitstring using the alphabet {0,1,#} where the don't care symbol # matches either bit.

Uniform crossover is applied with probability p_crossover and a single self-adaptive mutation rate (log normal) is used.

condition = {
    "type": "ternary",
    "args": {
        "bits": 2, # number of bits per float to binarise inputs
        "p_dontcare": 0.5, # don't care probability during covering
    }
}

Related Literature:

S. W. Wilson (1995) Classifier fitness based on accuracy

Hyperrectangles and Hyperellipsoids

Hyperellipsoids currently use the center-spread representation (and axis-rotation is not yet implemented.)

Hyperrectangles currently implement the center-spread and unordered-bound representations.

With the hyperrectangle center-spread representation, each classifier condition is represented as a concatenation of interval predicates, $cl.C = (c_i, s_i)^L$ where $L$ is equal to the x_dim and $c_i, s_i \in \mathbb{R}$. $c_i$ encodes the center of the interval and $s_i$ encodes the spread (or width.) A classifier matches an input $x$ with attributes $x_i$ if and only if $(c_i - s_i) \le x_i \le (c_i + s_i)$ for all $x_i$.

With the hyperrectangle unordered-bound representation, each classifier condition is represented as a concatenation of interval predicates, $cl.C = (p_i, q_i)^L$ where $L$ is equal to the x_dim and $p_i, q_i \in \mathbb{R}$. A classifier matches an input $x$ with attributes $x_i$ if and only if $min(p_i, q_i) \le x_i \le max(p_i, q_i)$ for all $x_i$.

Uniform crossover is applied with probability p_crossover. A single self-adaptive mutation rate (log normal) specifies the standard deviation used to sample a random Gaussian (with zero mean) which is added to each center and spread value (or bound for unordered-bounds).

For center-spread representations, if eta > 0 each classifier's centers are adjusted at rate $\eta$ towards the mean of the observed inputs during each update (see Tamee et al., 2007). That is, $c_i \leftarrow c_i + \eta (x_i - c_i)$.

condition = {
    "type": "hyperrectangle_csr", # center-spread
    # "type": "hyperrectangle_ubr", # unordered-bound
    # "type": "hyperellipsoid", # center-spread
    "args": {
        "min": 0, # minimum value of a center/bound
        "max": 1, # maximum value of a center/bound
        "spread_min": 0.1, # minimum initial spread
        "eta": 0, # gradient descent rate for moving centers to mean inputs matched
    }
}

The input features will need to be numpy arrays of type numpy.float64 and when using hyperrectangles or hyperellipsoids it is recommended to scale the input features in the range [0,1], for example as below. Monitor the average number of matching classifiers with the mset metric to make sure that a sufficient number of rules are covering the input domain.

from sklearn.preprocessing import MinMaxScaler                                    
scaler = MinMaxScaler(feature_range=(0, 1))
x = scaler.fit_transform(x)

Related Literature:

S. W. Wilson (2000) Get real! XCS with continuous-valued inputs
C. Stone and L. Bull (2003) For real! XCS with continuous-valued inputs
M. V. Butz (2005) Kernel-based, ellipsoidal conditions in the real-valued XCS classifier system
M. V. Butz, P.-L. Lanzi, and S. W. Wilson (2006) Hyper-ellipsoidal conditions in XCS: rotation, linear approximation, and solution structure
K. Tamee, L. Bull, and O. Pinngern (2007) Towards clustering with XCS

GP Trees

GP trees currently use arithmetic operators from the set {+,-,/,*}. Return values from each node are clamped [-1000,1000]. The rule matches if the output node is greater than 0.5. Subsumption is not implemented.

Sub-tree crossover is applied with probability p_crossover. A single self-adaptive mutation rate (rate selection) is used to specify the per allele probability of performing mutation where terminals are randomly replaced with other terminals and functions randomly replaced with other functions.

condition = {
    "type": "tree_gp",
    "args": {
        "min_constant": 0, # minimum value of a constant
        "max_constant": 1, # maximum value of a constant
        "n_constants": 100, # number of (global) constants available
        "init_depth": 5, # initial depth of a tree
        "max_len": 10000, # maximum initial length of a tree 
    }
}

DGP Graphs

Temporally dynamic graphs with fuzzy symbolic functions selected from the CFMQVS set: {fuzzy NOT, fuzzy AND, fuzzy OR}. Each graph is initialised with a randomly selected function assigned to each node and random connectivity (including recurrent connections) and is synchronously updated in parallel for T cycles before sampling the output node(s). These graphs can exhibit inherent memory by retaining state across inputs. Inputs must be in the range [0,1].

Currently implements a fixed number of nodes with the connectivity and update cycles evolved along with the function for each node. Log normal self-adaptive mutation is used for node function and connectivity and uniform self-adaptive mutation for the number of update cycles.

When used as conditions, the number of nodes n must be at least 1 and the rule matches a given input if the state of that node is greater than 0.5 after updating the graph T times. When used as condition + action rules, the action is encoded as binary (discretising the node outputs with threshold 0.5); for example with 8 actions, a minimum of 3 additional nodes are required. Subsumption is not implemented.

condition = {
    "type": "dgp",
    # "type": "rule_dgp" # conditions + actions in single DGP graphs
    "args": {
        "max_k": 2, # number of connections per node
        "max_t": 10, # maximum number of cycles to update graphs
        "n": 20, # number of nodes in the graph
        "evolve_cycles": True, # whether to evolve the number of update cycles
    }
}

Neural Networks

Condition output layers should be set to a single neuron, i.e., "n_init": 1. A classifier matches an input if this output neuron is greater than 0.5.

When used to represent conditions and actions within a single network ("rules") the output layers should be "n_init": 1 + binary where binary is the number of outputs required to output binary actions. For example, for 8 actions, 3 binary outputs are required and the output layer should contain 4 neurons. Again, the neuron states of the action outputs are discretised with threshold 0.5. Subsumption is not implemented.

See Neural Network Initialisation.

condition = {
    "type": "neural",
    "args": layer_args,
}

condition = {
    "type": "rule_neural" # conditions + actions in single neural nets
    "args": layer_args,
}

Related Literature:

L. Bull (2002) On using constructivism in neural classifier systems
R. J. Preen, S. W. Wilson, and L. Bull (2021) Autoencoding with a classifier system
R. J. Preen and L. Bull (2021) Deep learning with a classifier system: Initial results

Initialising Actions

Integers

A constant integer value. A single self-adaptive mutation rate (log normal) specifies the probability of randomly reselecting the value.

action = {
    "type": "integer",
}

Related Literature:

S. W. Wilson (1995) Classifier fitness based on accuracy

Neural Networks

Output layer should be a softmax.

See Neural Network Initialisation.

action = {
    "type": "neural",
    "args": layer_args,
}

Related Literature:

T. O'Hara and L. Bull (2005) A memetic accuracy-based neural learning classifier system
P.-L. Lanzi and D. Loiacono (2007) Classifier systems that compute action mappings
D. Howard, L. Bull, and P.-L. Lanzi (2015) A cognitive architecture based on a learning classifier system with spiking classifiers

Initialising Predictions

Constant

Original XCS behaviour can be specified with (piece-wise) constant predictions where each classifier maintains a scalar $cl.p$ for each predicted variable $y$. These are updated at learning rate $\beta$. For example, for each predicted variable:

if $cl.exp < 1 / \beta$:
- $cl.p \leftarrow (cl.p \times (cl.exp - 1) + y) / cl.exp$
otherwise:
- $cl.p \leftarrow cl.p + \beta (y - cl.p)$

beta = 0.1  # classifier update rate includes constant predictions

prediction = {
    "type": "constant",
}

Related Literature:

S. W. Wilson (1995) Classifier fitness based on accuracy

Normalised Least Mean Squares

Each classifier maintains a vector of weights that are used to compute the prediction from the input state $\vec{x}$. For linear prediction, there are $n$ linear weights plus an offset/bias for each predicted variable $y$, where $n$ is equal to x_dim. Therefore, the length of the weight vector is equal to (x_dim + 1) * y_dim.

For quadratic prediction, there are 1 (offset) + $n$ linear + $n$ quadratic + $n(n-1)/2$ mixed terms weights for each predicted variable.

For linear prediction, each classifier computes its prediction:

$$cl.P(\vec{x}) = cl.w_0 \times x_0 + \sum_{i>0}cl.w_i \times x_i$$

Where $x_0$ is the constant x0 and $cl.w$ is the classifier's weight vector.

When updating the classifier, each weight is adjusted by: $$\Delta w_i = \frac{\eta}{|\vec{x}|^2} (y - cl.P(\vec{x}))x_i$$

Where $|\vec{x}|^2$ is the norm of the input vector and $\eta$ is the learning rate eta.

The classifier weight vector is then updated: $$cl.w_i \leftarrow cl.w_i + \Delta w_i$$

If eta is evolved, the rate is initialised uniformly random [eta_min, eta]. Offspring inherit the rate and a single (log normal) self-adaptive mutation rate specifies the standard deviation used to sample a random Gaussian (with zero mean) which is added to eta (similar to evolution strategies).

prediction = {
    "type": "nlms_linear", # options: {"nlms_linear", "nlms_quadratic"}
    "args": {
        "x0": 1, # offset value
        "eta": 0.1, # gradient descent update rate (maximum value, if evolved)
        "eta_min": 0.0001, # minimum gradient descent update rate (if evolved)
        "evolve_eta": True, # whether to evolve the gradient descent rate
    }
}

Related Literature:

S. W. Wilson (2001) Function approximation with a classifier system
S. W. Wilson (2002) Classifiers that approximate functions
P.-L. Lanzi, D. Loiacono, S. W. Wilson, and D. E. Goldberg (2005) XCS with computed prediction for the learning of Boolean functions
P.-L. Lanzi, D. Loiacono, S. W. Wilson, and D. E. Goldberg (2005) XCS with computed prediction in multistep environments
P.-L. Lanzi, D. Loiacono, S. W. Wilson, and D. E. Goldberg (2005) Extending XCSF beyond linear approximation

Recursive Least Mean Squares

RLS predictions use the same number of weights as for NLMS and are computed in the same way. However, the weight vectors (coefficients) are updated using the recursive least squares algorithm via the use of a gain matrix, which converges extremely quickly, although with a high computational cost.

prediction = {
    "type": "rls_linear", # options: {"rls_linear", "rls_quadratic"}
    "args": {
        "x0": 1, # offset value
        "scale_factor": 1000, # initial diagonal values of the gain-matrix
        "lambda": 1, # forget rate (small values may be unstable)
    }
}

Related Literature:

P.-L. Lanzi, D. Loiacono, S. W. Wilson, and D. E. Goldberg (2006) Prediction update algorithms for XCSF: RLS, Kalman filter, and gain adaptation
D. Loiacono and P.-L. Lanzi (2007) Recursive least squares and quadratic prediction in continuous multistep problems
M. V. Butz, P.-L. Lanzi, and S. W. Wilson (2008) Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction
D. Loiacono and P.-L. Lanzi (2008) Computed prediction in binary multistep problems
D. Loiacono and P.-L. Lanzi (2009) Recursive least squares and quadratic prediction in continuous multistep problems

Neural Networks

Output layer should be "n_init": y_dim.

See Neural Network Initialisation.

prediction = {
    "type": "neural",
    "args": layer_args, 
}

Related Literature:

P.-L. Lanzi and D. Loiacono (2006) XCSF with neural prediction
T. O'Hara and L. Bull (2007) Backpropagation in accuracy-based neural learning classifier systems
H. H. Dam, H. A. Abbass, C. Lokan, and X. Yao (2008) Neural-based learning classifier systems
R. J. Preen, S. W. Wilson, and L. Bull (2021) Autoencoding with a classifier system
R. J. Preen and L. Bull (2021) Deep learning with a classifier system: Initial results

Neural Network Initialisation

Neural networks support self-adaptive mutation with a number of different operators that may be enabled, however crossover is not implemented. Stochastic gradient descent may be enabled for predictions. Details may be found in:

R. J. Preen, S. W. Wilson, and L. Bull (2021) Autoencoding with a classifier system
R. J. Preen and L. Bull (2021) Deep learning with a classifier system: Initial results

General Network Specification

layer_args = {
    "layer_0": { # first hidden layer
        "type": "connected", # layer type
        ..., # layer specific parameters
    },
    ..., # as many layers as desired
    "layer_n": { # output layer
        "type": "connected", # layer type
        ..., # layer specific parameters
    },          
}

Activation Functions

Note: Neuron states are clamped [-100,100] before activations are applied. Weights are clamped [-10,10].

"logistic", # logistic [0,1]
"relu", # rectified linear unit [0,inf]
"tanh", # tanh [-1,1]
"linear", # linear [-inf,inf]
"gaussian", # Gaussian (0,1]
"sin", # sine [-1,1]
"cos", # cosine [-1,1]
"softplus", # soft plus [0,inf]
"leaky", # leaky rectified linear unit [-inf,inf]
"selu", # scaled exponential linear unit [-1.7581,inf]
"loggy", # logistic [-1,1]

Connected Layers

layer_args = {
    "layer_0": {
        "type": "connected", # layer type
        "activation": "relu", # activation function
        "evolve_weights": True, # whether to evolve weights
        "evolve_connect": True, # whether to evolve connectivity
        "evolve_functions": True, # whether to evolve activation function
        "evolve_neurons": True, # whether to evolve the number of neurons
        "max_neuron_grow": 5, # maximum number of neurons to add or remove per mut
        "n_init": 10, # initial number of neurons
        "n_max": 100, # maximum number of neurons (if evolved)
        "sgd_weights": True, # whether to use gradient descent (only for predictions)
        "evolve_eta": True, # whether to evolve the gradient descent rate   
        "eta": 0.1, # gradient descent update rate (maximum value, if evolved)
        "eta_min": 0.0001, # minimum gradient descent update rate (if evolved)
        "momentum": 0.9, # momentum for gradient descent update
        "decay": 0, # weight decay during gradient descent update
    },       
}

Recurrent Layers

layer_args = {
    "layer_0": {
        "type": "recurrent",
        ..., # other parameters same as for connected layers
    }
}

LSTM Layers

layer_args = {
    "layer_0": {
        "type": "lstm",
        "activation": "tanh", # activation function
        "recurrent_activation": "logistic", # recurrent activation function
        ..., # other parameters same as for connected layers
    }
}

Softmax Layers

Softmax layers can be composed of a linear connected layer and softmax:

layer_args = {
    "layer_0": {
        "type": "connected",
        "activation": "linear",
        "n_init": N_ACTIONS, # number of (softmax) outputs
        ..., # other parameters same as for connected layers
    },       
    "layer_1": {
        "type": "softmax",
        "scale": 1, # softmax temperature
    },       
}

Dropout Layers

layer_args = {
    "layer_0": {
        "type": "dropout",
        "probability": 0.2, # probability of dropping an input
    }
}

Noise Layers

Gaussian noise adding layers.

layer_args = {
    "layer_0": {
        "type": "noise",
        "probability": 0.2, # probability of adding noise to an input
        "scale": 1.0, # standard deviation of Gaussian noise added
    }
}

Convolutional Layers

Convolutional layers require image inputs and produce image outputs. If used as the first layer, the width, height, and number of channels must be specified. RGB images must be in the array format: R1,R2,R3,...,G1,G2,G3,...,B1,B2,B3,..., this is in contrast with Keras for example, which requires R1,G1,B1,R2,G2,B2, etc.

If "evolve_neurons": True the number of filters will be evolved using an initial number of filters "n_init" and maximum number "n_max".

layer_args = {
    "layer_0": {
        "type": "convolutional",
        "activation": "relu", # activation function
        "height": 16, # input height
        "width": 16, # input width
        "channels": 1, # number of input channels
        "n_init": 6, # number of convolutional kernel filters
        "size": 3, # the size of the convolution window
        "stride": 1, # the stride of the convolution window
        "pad": 1, # the padding of the convolution window
        ..., # other parameters same as for connected layers
    },       
    "layer_1": {
        "type": "convolutional",
        ..., # parameters same as above; height, width, channels not needed
    },       
}

Max-pooling Layers

Max-pooling layers require image inputs and produce image outputs. If used as the first layer, the width, height, and number of channels must be specified.

layer_args = {
    "layer_0": {
        "type": "maxpool",
        "height": 16, # input height
        "width": 16, # input width
        "channels": 1, # number of input channels
        "size": 2, # the size of the maxpooling operation
        "stride": 2, # the stride of the maxpooling operation
        "pad": 0, # the padding of the maxpooling operation
    },       
    "layer_1": {
        "type": "maxpool",
        "size": 2,
        "stride": 2,
        "pad": 0,
    },       
}

Average-pooling Layers

Average-pooling layers require image inputs. If used as the first layer, the width, height, and number of channels must be specified. Outputs an average for each input channel.

layer_args = {
    "layer_0": {
        "type": "avgpool",
        "height": 16, # input height
        "width": 16, # input width
        "channels": 1, # number of input channels
    },       
    "layer_1": {
        "type": "avgpool",
    },       
}

Upsampling Layers

Upsampling layers require image inputs and produce image outputs. If used as the first layer, the width, height, and number of channels must be specified.

layer_args = {
    "layer_0": {
        "type": "upsample",
        "height": 16, # input height
        "width": 16, # input width
        "channels": 1, # number of input channels
        "stride": 2, # the stride of the upsampling operation
    },       
    "layer_1": {
        "type": "upsample",
        "stride": 2,
    },       
}

Note

When using layers that include random numbers during forward propagation (e.g., dropout and noise), results are not guaranteed to be reproducible with a given random_state if parallel CPU processing is enabled.

Saving and Loading XCSF

XCSF provides support for pickle and also provides the following functions for serializing to a binary file.

Example saving the entire current state of XCSF to a binary file:

xcs.save("saved_name.bin")

Example loading the entire state of XCSF from a binary file:

xcs.load("saved_name.bin")

Functions return the total number of elements written or read.

Library Stub:

def save(self, filename: str) -> int: ...
def load(self, filename: str) -> int: ...

Storing and Retrieving XCSF

Example storing the current XCSF population in memory for later retrieval, overwriting any previously stored population:

xcs.store()

Example retrieving the previously stored XCSF population from memory:

xcs.retrieve()

Library Stub:

def store(self) -> None: ...
def retrieve(self) -> None: ...

Printing XCSF

Example printing the current XCSF parameters:

print(json.dumps(xcs.internal_params(), indent=4))

Example printing the current XCSF population:

xcs.print_pset()

Library Stub:

def print_pset(self, condition: bool = True, action: bool = True, prediction: bool = True) -> None: ...

Learning Metrics

The learning performance metrics dictionary is updated after perf_trials number of trials during fit() and can be accessed through the following method:

metrics: dict = xcs.get_metrics()

trials: list[int] = metrics["trials"]  # cumulative total number of learning trials performed at each update
train: list[float] = metrics["train"]  # system error on the training set at each update
val: list[float] = metrics["val"]  # system error on the validation set at each update
psize: list[float] = metrics["psize"]  # population set size (macro-classifiers) at each update
msize: list[float] = metrics["msize"]  # average match set size (macro-classifiers) at each update
mfrac: list[float] = metrics["mfrac"]  # fraction of samples matched by the best rule at each update

XCSF Getters

Library Stub:

# General
def get_metrics(self) -> dict: ...  # returns a dictionary of performance metrics
def aset_size(self) -> float: ...  # returns the mean action set size
def mfrac(self) -> float: ...  # returns the mean fraction of inputs matched by the best rule
def mset_size(self) -> float: ...  # returns the mean match set size
def pset_mean_cond_size(self) -> float: ...  # returns the mean condition size
def pset_mean_pred_size(self) -> float: ...  # returns the mean prediction size
def pset_num(self) -> int: ...  # returns the mean population numerosity
def pset_size(self) -> int: ...  # returns the mean population size
def time(self) -> int: ...  # returns the current EA time

# Neural network specific - population set averages
# "layer" argument is an integer specifying the location of a layer: first layer=0
def pset_mean_cond_connections(self, layer: int) -> float: ...  # returns the number of active connections for a condition layer
def pset_mean_cond_layers(self) -> float: ...  # returns the mean number of layers in the condition networks
def pset_mean_cond_neurons(self, layer: int) -> float: ...   # returns the mean number of neurons for a condition layer
def pset_mean_pred_connections(self, layer: int) -> float: ...  # returns the number of active connections for a prediction layer
def pset_mean_pred_eta(self, layer: int) -> float: ...   # returns the mean eta for a prediction layer
def pset_mean_pred_layers(self) -> float: ...  # returns the mean number of layers in the prediction networks
def pset_mean_pred_neurons(self, layer: int) -> float: ...  # returns the mean number of neurons for a prediction layer

Getting the Population

import json
json_string = xcs.json()  # returns the current population as JSON
parsed = json.loads(json_string)  # convert to dict

Then to print the current population:

print(json.dumps(parsed, indent=4))

Example printing ternary conditions, integer actions, and fitnesses:

fitness = [cl["fitness"] for cl in parsed["classifiers"]]
ternary = [cl["condition"]["string"] for cl in parsed["classifiers"]]
actions = [cl["action"]["action"] for cl in parsed["classifiers"]]
for i in range(len(fitness)):
    print("%s %d %.5f" % (ternary[i], actions[i], fitness[i]))

Printing and returning the individual weights from neural networks is disabled by default. To enable, change the flags in the neural_json_export() functions in cond_neural.c, pred_neural.c, etc.

Library Stub:

def json(self, condition: bool = True, action: bool = True, prediction: bool = True) -> str: ...

Getting the Parameters

xcs.get_params()  # returns external parameters (used for sklearn)
xcs.internal_params()  # returns the internal parameters actually used by XCSF

Example getting and printing the current parameters:

print(json.dumps(xcs.internal_params(), indent=4))

Library Stub:

def get_params(self) -> dict: ...
def internal_params(self) -> dict: ...

Seeding the Population

Classifiers can be inserted into the population in a number of ways.

Firstly, the initial population can be set with a JSON file containing an existing population set. This can be specified in the constructor with the argument population_file.

The json_insert_cl() function can be used to insert a single new classifier into the population. The new classifier is initialised with a random condition, action, prediction, and then any supplied properties overwrite these values. This means that all properties are optional. If the population set numerosity exceeds pop_size after inserting the rule, the standard roulette wheel deletion mechanism will be invoked to maintain the population limit.

GP trees and neural networks are not yet implemented.

See notebook example.

Note: when manually adding classifiers, be careful that the keys are correct because if an exact match is not found it will be ignored silently.

Multiple classifiers can be added through the same mechanism as a single JSON string with json_insert().

Additionally, the entire population set can be written in JSON format to a plain text file:

xcs.json_write("pset.json")

And read into the population with:

xcs.json_read("pset.json")

Notes:

Make sure to set warm_start=True in fit() to continue using the population.

This is not the recommended way to backup the system to persistent storage since temporary memory buffers (e.g., update matrices) and parameters are not saved and reloaded. For this purpose, see Saving and Loading XCSF.

Library Stub:

def json_insert(self, clset_json: str) -> None: ...
def json_insert_cl(self, cl_json: str) -> None: ...
def json_read(self, filename: str) -> None: ...
def json_write(self, filename: str) -> None: ...

Visualising GP Trees

The TreeViz class from viz.py will generate a tree with graphviz. The first argument must be the tree array; and the second, the filename to save the output as a pdf. Optionally accepts a list of strings representing the feature_names. Optionally accepts a string note, which will add a note/caption at the bottom.

Example plotting the first classifier condition:

import json
from xcsf.utils.viz import TreeViz
parsed = json.loads(xcs.json())
trees = [cl["condition"]["tree"]["array"] for cl in parsed["classifiers"]]
TreeViz(trees[0], "test")

Note this will require the graphviz package installed with:

$ pip install graphviz

TreeViz Stub:

def __init__(self, 
    tree: list[str], 
    filename: str, 
    note: str | None = None, 
    feature_names: list[str] | None = None,
) -> None: ...

Visualising DGP Graphs

The DGPViz class from viz.py will generate a graph with graphviz. The first argument must be the graph; and the second, the filename to save the output as a pdf. Optionally accepts a list of strings representing the feature_names. Optionally accepts a string note, which will add a note/caption at the bottom.

Example plotting the first classifier condition and passing the error as a note:

import json
from xcsf.utils.viz import DGPViz
parsed = json.loads(xcs.json())
errors = [cl["error"] for cl in parsed["classifiers"]]
graphs = [cl["condition"]["graph"] for cl in parsed["classifiers"]]
note = "Error = %.5f" % errors[0]
DGPViz(graphs[0], "test", note=note)

DGPViz Stub:

def __init__(self, 
    graph: dict, 
    filename: str, 
    note: str | None = None, 
    feature_names: list[str] | None = None,
) -> None: ...

Reinforcement Learning

Initialisation

Initialise XCSF with y_dim = 1 for predictions to estimate the scalar reward. The number of actions n_actions must be greater than 1 otherwise the system will be configured for supervised learning.

import xcsf
xcs = xcsf.XCS(x_dim=X_DIM, y_dim=1, n_actions=N_ACTIONS)

Method 1

The standard method involves the basic loop as shown below. state must be a 1-D numpy array representing the feature values of a single instance; reward must be a scalar value representing the current environmental reward for having performed the action; and done must be a boolean value representing whether the environment is currently in a terminal state.

state = env.reset()
xcs.init_trial()
for cnt in range(xcs.TELETRANSPORTATION):
    xcs.init_step()
    action = xcs.decision(state, explore) # explore specifies whether to explore/exploit
    next_state, reward, done = env.step(action)
    xcs.update(reward, done) # update the current action set and/or previous action set
    err += xcs.error(reward, done, env.max_payoff()) # system prediction error
    xcs.end_step()
    if done:
        break
    state = next_state
cnt += 1
xcs.end_trial()

See notebook example.

Library Stub:

def init_step(self) -> None: ...
def init_trial(self) -> None: ...
def end_step(self) -> None: ...
def end_trial(self) -> None: ...
def error(self) -> float: ...
def update(self, reward: float, done: bool) -> None: ...
def decision(
    self,
    state: np.ndarray[Any, np.dtype[np.float64]],  # shape = (x_dim, )
    explore: bool,
) -> int: ...

Method 2

The fit() function may be used as below to execute one single-step learning trial, i.e., creation of the match and action sets, updating the action set and running the EA as appropriate. The vector state must be a 1-D numpy array representing the feature values of a single instance; action must be an integer representing the selected action (and therefore the action set to update); and reward must be a scalar value representing the current environmental reward for having performed the action.

xcs.fit(state, action, reward)

The entire prediction array for a given state can be returned using the supervised predict() function, which must receive a 2-D numpy array. For example:

prediction_array = xcs.predict(state.reshape(1,-1))[0]

See notebook example.

Library Stub:

@typing.overload
def fit(
    self,
    state: np.ndarray[Any, np.dtype[np.float64]],  # shape = (x_dim, )
    action: int,
    reward: float,
) -> float: ...

def predict(
    self,
    X_predict: np.ndarray[Any, np.dtype[np.float64]],  # shape = (n_samples, x_dim)
) -> np.ndarray[Any, np.dtype[np.float64]]: ...   # shape = (n_samples, y_dim)

Method 3

The supervised fit() and predict() functions can be used for reinforcement learning without action sets, i.e., [A] = [M].

See notebook example using experience replay.

Related Literature:

A. Stein, R. Maier, L. Rosenbauer, and J. Hähner (2020) XCS classifier system with experience replay
L. Rosenbauer, A. Stein, D. Pätzel, and J. Hähner (2020) XCSF with experience replay for automatic test case prioritization

Supervised Learning

Initialisation

Initialise XCSF with n_actions=1, set x_dim and y_dim as needed. Set conditions and predictions as desired. In this configuration [A] = [M]. That is, no action sets are actually constructed: the action component of classifiers is not used. A match set [M] is constructed and the prediction array of length y_dim is calculated using the usual fitness weighted method. Updates are performed in [M].

Example:

import xcsf
xcs = xcsf.XCS(x_dim=10, y_dim=2, n_actions=1)

Fitting

The fit() function may be used as below to execute max_trials number of learning iterations (i.e., match set creations) using a supplied training set. Note that while the training data is supplied as a batch, learning proceeds in the usual online way: one sample at a time. To execute a single trial simply pass a batch size of one by reshaping the data and set max_trials=1 and warm_start=True.

If shuffle=True, for each iteration a single random training example is selected (with replacement) and a single match set is created and updated. This is known as online learning or stochastic training. It's similar to doing SGD with a batch size of 1, where training examples are sampled randomly with replacement. These samples are drawn max_trials number of times in total, but for displaying metrics, the system performance is averaged over the most recent perf_trials number of iterations.

Here we refer to an "epoch" as perf_trials number of iterations. The system performance will thus be displayed after each epoch and any callbacks used will run then. The total number of possible epochs will be equal to ceil(max_trials/perf_trials) as defined in the constructor.

The input arrays X_train and y_train must be 2-D numpy arrays of the shape (n_samples, x_dim) and (n_samples, y_dim).

Example:

xcs.fit(
    X_train, # numpy array of shape (n_samples, x_dim). 
    y_train, # numpy array of shape (n_samples, y_dim). 
    shuffle=True,  # whether to randomly draw samples
    warm_start=False,  # whether to continue using the existing population
    verbose=True,  # whether to display performance
    validation_data=(X_val, y_val),  # optional validation data
    callbacks=[callback],  # optional callbacks
)

Library Stub:

@typing.overload
def fit(
    self,
    X_train: np.ndarray[Any, np.dtype[np.float64]],  # shape = (n_samples, x_dim)
    y_train: np.ndarray[Any, np.dtype[np.float64]],  # shape = (n_samples, y_dim)
    shuffle: bool = True,
    warm_start: bool = False,
    verbose: bool = True,
    validation_data: tuple[np.ndarray, np.ndarray] | None = None,
    callbacks: list | None = None,
) -> xcsf.XCS: ...

Callbacks

Several callbacks can be provided to fit() to perform tasks based on the learning performance at runtime. Callbacks are executed after each "epoch", where an epoch is equal to perf_trials number of learning trials.

EarlyStoppingCallback

Stop training when a monitored metric has stopped improving. Either training or validation error can be chosen.

If the training error is selected, the match set fitness weighted prediction for each of the most recent perf_trials samples is compared with the true value and the mean error as defined by loss_func is used. If the validation error is selected, the mean system error is computed over the entire validation data provided (the same as if score() had been called with the validation set).

The patience and start_from parameters are defined in number of trials but since the callback is only checked after perf_trials, their values should be a multiple of perf_trials.

Example:

callback = xcsf.EarlyStoppingCallback(
    # note: perf_trials is considered an "epoch" for callbacks
    monitor="val",  # which metric to monitor: {"train", "val"}
    patience=20000,  # trials with no improvement after which training will be stopped
    restore_best=True,  # whether to restore the best population after terminating
    min_delta=0,  # minimum change to qualify as an improvement
    start_from=0,  # trials to wait before starting to monitor improvement
    verbose=True,  # whether to display when an action is taken
)

xcs.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[callback])

CheckpointCallback

Callback to save XCSF at some frequency.

Example:

callback = xcsf.CheckpointCallback(
    # note: perf_trials is considered an "epoch" for callbacks
    monitor="val",  # which metric to monitor: {"train", "val"}
    filename="xcsf.bin",  # name of the file to save XCSF
    save_best_only=False,  # Whether to only save the best population.
    save_freq=0,  # Trial frequency to (possibly) make checkpoints.
    verbose=True,  # whether to display when an action is taken
)

xcs.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[callback])

Scoring

The score() function may be used as below to calculate the prediction error over a single pass of a supplied data set without updates or the EA being invoked (e.g., for scoring a validation set). An argument N may be supplied that specifies the maximum number of iterations performed; if this value is less than the number of instances supplied, samples will be drawn randomly. Returns a scalar representing the error. 2-D numpy arrays are expected as inputs.

Note that if the match set is empty for a given sample, the prediction will be set to zeros. An optional argument cover can be used to specify the values to use as system output instead of returning zeros. cover must be an array of length y_dim.

val_error = xcs.score(X_val, y_val)

val_error = xcs.score(X_val, y_val, N=1000, cover=[0.1])

Library Stub:

def score(
    self,
    X_val: np.ndarray[Any, np.dtype[np.float64]],  # shape = (n_samples, x_dim)
    y_val: np.ndarray[Any, np.dtype[np.float64]],  # shape = (n_samples, y_dim)
    N: int = 0,  # max number of samples to use
    cover: Optional[np.ndarray[Any, np.dtype[np.float64]]],  # shape = (1, y_dim)
) -> float: ...

Predicting

The predict() function may be used as below to calculate the XCSF predictions for a supplied data set. That is, the fitness weighted match set prediction for each sample. No updates or EA invocations are performed. The input vector must be a 2-D numpy array of the shape (n_samples, x_dim). Returns a 2-D numpy array of shape (n_samples, y_dim).

Note that similar to score(), if the match set is empty for a given sample, the prediction will be set to zeros. An optional argument cover can be used to specify the values to use as system output instead of returning zeros. cover must be an array of length y_dim.

predictions = xcs.predict(X_test)

predictions = xcs.predict(X_test, cover=[0.1])

Library Stub:

def predict(
    self, 
    X_test: np.ndarray[Any, np.dtype[np.float64]],  # shape = (n_samples, x_dim)
    cover: Optional[np.ndarray[Any, np.dtype[np.float64]]],  # shape = (1, y_dim)
) -> np.ndarray[Any, np.dtype[np.float64]]: ...   # shape = (n_samples, y_dim)

Notebook Examples

Notes

Self-adaptive mutation

Currently 3 self-adaptive mutation methods are implemented and their use is defined within the various implementations of conditions, actions, and predictions. The smallest allowable mutation rate MU_EPSILON = 0.0005.

Uniform adaptation: selects rates from a uniform random distribution. Initially the rate is drawn at random ~U[MU_EPSILON,1]. Offspring inherit the parent's rate, but with 10% probability the rate is randomly redrawn.
Log normal adaptation: selects rates using a log normal method (similar to evolution strategies). Initially the rate is selected at random from a uniform distribution ~U[MU_EPSILON,1]. Offspring inherit the parent's rate, before applying log normal adaptation: $\mu \leftarrow \mu e^{\mathcal{N}(0,1)}$.
Rate selection adaptation: selects rates from the following set of 10 values: {0.0005, 0.001, 0.002, 0.003, 0.005, 0.01, 0.015, 0.02, 0.05, 0.1}. Initially the rate is selected at random. Offspring inherit the parent's rate, but with 10% probability the rate is randomly reselected.

Related Literature:

L. Bull and J. Hurst (2003) A neural learning classifier system with self-adaptive constructivism
G. D. Howard, L. Bull, and P.-L. Lanzi (2008) Self-adaptive constructivism in neural XCS and XCSF
M. V. Butz, P. O. Stalph, and P.-L. Lanzi (2008) Self-adaptive mutation in XCSF
M. Serpell and J. E. Smith (2010) Self-adaptation of mutation operator and probability for permutation representations in genetic algorithms