# API 2.0. NeuralDice
The `NeuralDice` class defines a unified interface for continuous stationary distribution correction estimation (DICE) algorithms. It is meant to be extended by specific continuous DICE estimators; a minimal subclass sketch is shown after the list below.
It includes utilities for
- preprocessing input data,
- defining neural network architectures,
- computing losses, and
- running training loops.
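A minimal sketch of how a child class plugs into this interface, assuming the class can be imported from the package (the module path here is a placeholder) and filling in only the abstract members documented below (`__name__` and `get_loss`):

```python
from some_module import NeuralDice  # placeholder import, as in the usage example at the bottom of this page


class MyNeuralDice(NeuralDice):
    """Hypothetical child estimator; everything except the abstract members is inherited."""

    @property
    def __name__(self):
        return "MyNeuralDice"

    def get_loss(self, v_init, v, v_next, w):
        # A real estimator returns its saddle-point objective J(v, w) here.
        raise NotImplementedError
```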
```python
def __init__(
        self,
        gamma, seed, batch_size,
        learning_rate, hidden_dimensions,
        obs_min, obs_max, n_act, obs_shape,
        dataset, preprocess_obs=None, preprocess_act=None, preprocess_rew=None,
        dir=None, get_recordings=None, other_hyperparameters=None, save_interval=100):
```
Args:

- `gamma` (float): Discount factor $\gamma$.
- `seed` (int): Random seed for reproducibility.
- `batch_size` (int): Number of samples per batch.
- `learning_rate` (float or tf.keras.optimizers.schedules.LearningRateSchedule): Learning rate.
- `hidden_dimensions` (tuple): Number of neurons per hidden layer in the neural networks.
- `obs_min` (np.ndarray): Minimum bounds for observations (states) $s_\min$.
- `obs_max` (np.ndarray): Maximum bounds for observations (states) $s_\max$.
- `n_act` (int): Number of actions $|A|$.
- `obs_shape` (tuple): Shape of the observation space $S$.
- `dataset` (pd.DataFrame): Dataset with columns (a sketch of such a dataset follows this list):
    - `obs_init` (int or NDArray[float]): Initial observation (state) $s_0$.
    - `obs` (int or NDArray[float]): Current observation (state) $s$.
    - `act` (int): Action $a$.
    - `rew` (float): Reward $R(s, a, s')$.
    - `obs_next` (int or NDArray[float]): Next observation (state) $s'$.
    - `probs_next_evaluation` or `probs_next` (NDArray[float]): Action probabilities under the target policy at the next state $\pi(\cdot \mid s')$.
- `preprocess_obs` (callable, optional): Preprocessing function for observations (states).
- `preprocess_act` (callable, optional): Preprocessing function for actions.
- `preprocess_rew` (callable, optional): Preprocessing function for rewards.
- `dir` (str, optional): Directory for saving logs.
- `get_recordings` (callable, optional): Function to get additional logging information.
- `other_hyperparameters` (dict, optional): Any other (algorithm-specific) hyperparameters to label the logs.
- `save_interval` (int, optional): Interval (in steps) at which to save logs and the model.
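A hedged sketch of a compliant `dataset` for a vector-valued observation space and a discrete action space; the column names follow the list above, while the values (random observations, a uniform target policy) are purely illustrative:

```python
import numpy as np
import pandas as pd

n_act, obs_dim, n_samples = 4, 8, 1000
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "obs_init":   [rng.standard_normal(obs_dim) for _ in range(n_samples)],  # s_0
    "obs":        [rng.standard_normal(obs_dim) for _ in range(n_samples)],  # s
    "act":        rng.integers(0, n_act, size=n_samples),                    # a
    "rew":        rng.standard_normal(n_samples),                            # R(s, a, s')
    "obs_next":   [rng.standard_normal(obs_dim) for _ in range(n_samples)],  # s'
    "probs_next": [np.full(n_act, 1 / n_act) for _ in range(n_samples)],     # pi(. | s')
})
```

In practice, the transitions come from logged behavior-policy trajectories, and `probs_next` is evaluated under the target policy $\pi$.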
All `preprocess_*` functions should take (a pd.Series of) an `obs`, `act`, or `rew` as an argument and return a preprocessed version. For `obs`, `self.preprocess_probs` is applied by default, since `probs` is usually of the same type as `obs`. For `act` and `rew`, no preprocessing is applied by default.
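As a hedged example of a custom preprocessor, the closure below rescales observations to $[0, 1]$ using the same `obs_min` and `obs_max` arrays passed to `__init__`; the helper name and the min-max choice are illustrative, not part of the API:

```python
import numpy as np

def make_preprocess_obs(obs_min, obs_max):
    lo, span = np.asarray(obs_min), np.asarray(obs_max) - np.asarray(obs_min)

    def preprocess_obs(obs):
        # `obs` may be a single observation or a pd.Series of observations.
        arr = np.stack(list(obs)) if hasattr(obs, "iloc") else np.asarray(obs)
        return (arr - lo) / span

    return preprocess_obs

# passed to the constructor as preprocess_obs=make_preprocess_obs(obs_min, obs_max)
```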
Define `get_recordings` as

```python
def get_recordings(
    estimator,
    obs_init, obs, act, obs_next,
    probs_init, probs_next,
    values, loss, gradients,
    pv_s, pv_w,
):
    ...
```

and let it return a `dict[str, float]`. More information on the arguments is found below.
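A hedged example of such a callback that logs a few scalar diagnostics per step; the dict keys are arbitrary, and `values` is unpacked as `(v_init, v, v_next, w)` as documented for `get_values` below:

```python
import tensorflow as tf

def get_recordings(
        estimator,
        obs_init, obs, act, obs_next,
        probs_init, probs_next,
        values, loss, gradients,
        pv_s, pv_w):
    v_init, v, v_next, w = values
    return {
        "loss":   float(loss),                # current training loss J(v, w)
        "w_mean": float(tf.reduce_mean(w)),   # mean dual value (correction weight) in the batch
        "v_mean": float(tf.reduce_mean(v)),   # mean primal value in the batch
    }
```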
Creates a dict `self.hyperparameters` with

- `"name"` (str): specific name of the child class estimator,
- `"gamma"`,
- `"seed"`,
- `"batch_size"`,
- `"learning_rate"`,
- `"hidden_dimensions"`,
- `"other"` (dict): `other_hyperparameters`.

It then sets the seed and calls `self.set_up_recording` and `self.set_up_networks`.
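For illustration, an estimator constructed as in the usage example at the bottom of this page would hold roughly the following dict (the `"name"` value and the treatment of an omitted `other_hyperparameters` are assumptions):

```python
estimator.hyperparameters
# {
#     "name": "SomeNeuralDiceChildClass",
#     "gamma": 0.99,
#     "seed": 0,
#     "batch_size": 64,
#     "learning_rate": 0.001,
#     "hidden_dimensions": (64, 64),
#     "other": None,
# }
```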
```python
def set_up_recording(self):
```

Initializes the logging and recording setup at `self.save_dir`, including TensorBoard writers and unique run identifiers, and generates a unique evaluation id `self.id` via `datetime.now().isoformat()`.
```python
def save_hyperparameters(self):
```

Saves the model's hyperparameters and the unique evaluation id to a JSON file `evaluation.json` at `self.dir`.
```python
def set_up_networks(self):
```

Initializes the value networks `v` and `w` and their corresponding SGD optimizers.
```python
@property
@abstractmethod
def __name__(self):
```

Should return the name of the specific estimator class.
```python
@property
def save_dir(self):
```

Joins the directory `self.dir` with the evaluation id `self.id`, provided that `self.id` is not `None`.
```python
@property
def output_activation_fn_dual(self):
```

Returns the activation function applied to the output of the dual value network. Defaults to `tf.identity`, but should be overridden in a child class if necessary.
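As a hedged example, an estimator whose derivation requires non-negative dual values $w(s, a) \geq 0$ might override the property with a squaring activation; whether a particular child class in this repository does so depends on its loss:

```python
import tensorflow as tf
# NeuralDice imported as in the subclass sketch near the top of this page.

class PositiveDualDice(NeuralDice):  # hypothetical child class; other members omitted

    @property
    def output_activation_fn_dual(self):
        # Square the raw network output so that w(s, a) >= 0.
        return tf.math.square
```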
```python
def evaluate_loop(self, n_steps, verbosity=1, pbar_keys=None):
```

Runs the training loop over a specified number of steps, recording and logging progress.

Args:

- `n_steps` (int): Number of training steps.
- `verbosity` (int): Verbosity level for logging.
- `pbar_keys` (list[str], optional): Keys to display in the progress bar.
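A small usage sketch, assuming `get_recordings` returns a `"loss"` entry as in the callback example above (it is an assumption here that progress-bar keys refer to recorded quantities):

```python
estimator.evaluate_loop(
    n_steps=10_000,
    verbosity=1,
    pbar_keys=["loss"],  # assumed to match keys produced during recording
)
```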
```python
def solve_pv(self, weighted):
```

Estimates the policy value $\hat \rho^\pi$ (see the formula below).

Args:

- `weighted` (bool): Whether to normalize by the sum of weights $\sum_{i=1}^n \hat w_{\pi / D}(s_i, a_i)$ instead of the number of samples $n$.

Returns:

- `pv` (float): Estimated policy value $\hat \rho^\pi$.
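For reference, this corresponds to the usual weighted-average estimators over the $n$ dataset transitions with rewards $r_i$ (a hedged reading of the `weighted` flag):

$$
\hat \rho^\pi = \frac{1}{n} \sum_{i=1}^n \hat w_{\pi / D}(s_i, a_i)\, r_i
\qquad \text{or, for } \texttt{weighted=True}, \qquad
\hat \rho^\pi = \frac{\sum_{i=1}^n \hat w_{\pi / D}(s_i, a_i)\, r_i}{\sum_{i=1}^n \hat w_{\pi / D}(s_i, a_i)}.
$$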
```python
@abstractmethod
def get_loss(self, v_init, v, v_next, w):
```

Should compute and return the training loss $J(v, w)$.

Args:

- `v_init` (tf.Tensor): Initial primal value $v(s_0, a_0)$.
- `v` (tf.Tensor): Current primal value $v(s, a)$.
- `v_next` (tf.Tensor): Next primal value $v(s', a')$.
- `w` (tf.Tensor): Current dual value $w(s, a)$.

Returns:

- `loss` (tf.Tensor): Loss scalar $J(v, w)$.
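As a hedged illustration only (not the loss of any particular estimator in this repository), a child class could implement the plain linear Lagrangian of the DICE saddle-point problem; `self.gamma` is assumed to hold the discount factor passed to `__init__`:

```python
import tensorflow as tf

def get_loss(self, v_init, v, v_next, w):
    # Bellman residual gamma * v(s', a') - v(s, a) on the dataset transitions.
    bellman_residual = self.gamma * v_next - v
    # Linear Lagrangian of the DICE saddle point:
    # minimized over the primal network v, maximized over the dual network w via SGDA.
    return (1.0 - self.gamma) * tf.reduce_mean(v_init) + tf.reduce_mean(w * bellman_residual)
```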
```python
@tf.function(jit_compile=True)
def get_gradients(
        self,
        obs_init, obs, act, obs_next,
        probs_init, probs_next,
        batch_size):
```

Computes gradients of the loss with respect to all trainable variables using stochastic gradient descent-ascent (SGDA).

Args:

- `obs_init` (tf.Tensor): Initial observations (states) $s_0$.
- `obs` (tf.Tensor): Current observations (states) $s$.
- `act` (tf.Tensor): Actions $a$.
- `obs_next` (tf.Tensor): Next observations (states) $s'$.
- `probs_init` (tf.Tensor): Array of initial action probabilities under the target policy $\pi(a_0 \mid s_0)$.
- `probs_next` (tf.Tensor): Array of next action probabilities under the target policy $\pi(a' \mid s')$.
- `batch_size` (int): Batch size.

Returns:

- `result` (tuple): `(values, loss, gradients)`.
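The descent-ascent structure can be summarized as follows; this is a minimal sketch under the assumption that the loss is minimized in the primal network $v$ and maximized in the dual network $w$, with function and variable names invented for illustration:

```python
import tensorflow as tf

def sgda_step(loss_fn, vars_v, vars_w, optimizer_v, optimizer_w):
    # Compute gradients of the saddle-point objective w.r.t. both networks.
    with tf.GradientTape(persistent=True) as tape:
        loss = loss_fn()
    grads_v = tape.gradient(loss, vars_v)
    grads_w = tape.gradient(loss, vars_w)
    del tape
    # Descent step on v ...
    optimizer_v.apply_gradients(zip(grads_v, vars_v))
    # ... ascent step on w (apply the negated gradients).
    optimizer_w.apply_gradients(zip([-g for g in grads_w], vars_w))
    return loss
```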
```python
def preprocess_obs(self, obs):
```

Applies preprocessing to raw observations (states) and converts them to a TensorFlow tensor.

Args:

- `obs` (float or tensor): Raw observation (state) data.

Returns:

- `obs_preprocessed` (tf.Tensor): Preprocessed observation (state) tensor.
```python
def preprocess_act(self, act):
```

Applies preprocessing to raw actions and converts them to a TensorFlow tensor.

Args:

- `act` (int or tensor): Raw action data.

Returns:

- `act_preprocessed` (tf.Tensor): Preprocessed action tensor.
```python
def preprocess_rew(self, rew):
```

Applies preprocessing to raw rewards and converts them to a TensorFlow tensor.

Args:

- `rew` (float or tensor): Raw reward data.

Returns:

- `rew_preprocessed` (tf.Tensor): Preprocessed reward tensor.
```python
def preprocess_probs(self, probs):
```

Applies preprocessing to raw action probabilities and converts them to a stacked TensorFlow float tensor.

Args:

- `probs` (float or tensor): Raw action probabilities.

Returns:

- `probs_preprocessed` (tf.Tensor): Preprocessed action probabilities as a tensor.
```python
def get_value(self, network, obs, act):
```

Returns the state-action value for a specific state-action pair from a value network.

Args:

- `network` (ValueNetwork): Network to query, $v$ or $w$.
- `obs` (tf.Tensor): Preprocessed observation (state) tensor $s$.
- `act` (tf.Tensor): Preprocessed action tensor $a$.

Returns:

- `value` (tf.Tensor): Output from the network, $v(s, a)$ or $w(s, a)$.
```python
def get_average_value(self, network, obs, probs):
```

Computes the expected state-action value for an observation (state) from a value network.

Args:

- `network` (ValueNetwork): Value network to query, $v$ or $w$.
- `obs` (tf.Tensor): Observations (states) $s$.
- `probs` (tf.Tensor): Array of action probabilities under the target policy $\pi(a \mid s)$.

Returns:

- `average_value` (tf.Tensor): Expected output from the network, $E_{a \sim \pi(s)}[v(s, a)]$ or $E_{a \sim \pi(s)}[w(s, a)]$.
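For a finite action space this expectation reduces to a probability-weighted sum over the actions,

$$
E_{a \sim \pi(\cdot \mid s)}[v(s, a)] = \sum_{a \in A} \pi(a \mid s)\, v(s, a),
$$

and analogously for $w$.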
```python
def get_values(
        self,
        obs_init, obs, act, obs_next,
        probs_init, probs_next):
```

Queries the relevant networks to return all value components needed for loss computation.

Args:

- `obs_init` (tf.Tensor): Initial observations (states) $s_0$.
- `obs` (tf.Tensor): Current observations (states) $s$.
- `act` (tf.Tensor): Actions $a$.
- `obs_next` (tf.Tensor): Next observations (states) $s'$.
- `probs_init` (tf.Tensor): Array of initial action probabilities under the target policy $\pi(a_0 \mid s_0)$.
- `probs_next` (tf.Tensor): Array of next action probabilities under the target policy $\pi(a' \mid s')$.

Returns:

- `values` (tuple): Network outputs `(v_init, v, v_next, w)`.
```python
def evaluate_step(
        self,
        obs_init, obs, act, obs_next,
        probs_init, probs_next, batch_size):
```

Performs a single optimization step using a batch from the dataset.

Args:

- `obs_init` (tf.Tensor): Initial observations (states) $s_0$.
- `obs` (tf.Tensor): Current observations (states) $s$.
- `act` (tf.Tensor): Actions $a$.
- `obs_next` (tf.Tensor): Next observations (states) $s'$.
- `probs_init` (tf.Tensor): Array of initial action probabilities under the target policy $\pi(a_0 \mid s_0)$.
- `probs_next` (tf.Tensor): Array of next action probabilities under the target policy $\pi(a' \mid s')$.
- `batch_size` (int): Batch size.

Returns:

- `result` (tuple): `(values, loss, gradients)`.
```python
def get_sample(self):
```

Samples a training batch from the dataset and applies preprocessing.

Returns:

- `sample` (tuple): `(obs_init, obs, act, obs_next, probs_init, probs_next)`.
Example usage of a `NeuralDice` child class:

```python
from some_module import SomeNeuralDiceChildClass

estimator = SomeNeuralDiceChildClass(
    gamma=0.99,
    seed=0,
    batch_size=64,
    learning_rate=1e-3,
    hidden_dimensions=(64, 64),
    obs_min=obs_min,
    obs_max=obs_max,
    n_act=4,
    obs_shape=(8,),
    dataset=df,
    dir="./logs",
)

estimator.evaluate_loop(n_steps=10_000)
rho_hat = estimator.solve_pv(weighted=True)
```