# CB with Graph Feedback
What: Contextual Bandits (CB) with graph feedback can be used in scenarios where some actions, when taken, reveal information about other actions (not taken), or reveal no information at all. If prior knowledge of this relationship between actions exists, it can be used to make exploration and learning more efficient.
CB with graph feedback is a constrained optimization problem: regret is minimized while information gathering is maximized (less exploration waste) by following the leads the feedback graph provides.
Publication: https://arxiv.org/abs/2302.08631
Some scenarios include spam filtering (aka apple tasting), first-price auction bidding, and inventory management.
## Expected input
Input is the same as `cb_explore_adf` input for Vowpal Wabbit, but it now includes the graph. The other change is that if an action is taken and it reveals information about other actions, all of the influenced actions can be given a label, thus accelerating learning.
This tutorial is a good example of how to build VW examples, especially the section on understanding the format.
The `shared` example has to change to accommodate the graph. The graph is supplied as an NxN matrix, where N is the number of actions in the input. Each cell represents the probability that taking the action in the column reveals information about the action in the row (see the example below). The graph is supplied in coordinate format `(row, col, val)`, and only the non-zero values need to be specified.
For example, the graph below

```
0 1
0 1
```

indicates that taking action 0 reveals no information, either for itself or for action 1, whereas taking action 1 reveals information for both actions (e.g. categorizing an email as spam vs. non-spam).
This is represented in the Vowpal Wabbit input as:

- for text input:

  ```
  shared graph 0,1,1 1,1,1 | <shared features>
  ```

- for JSON input:

  ```json
  "_graph" : [{"row": 0, "col": 1, "val": 1}, {"row": 1, "col": 1, "val": 1}]
  ```
## Expected Labels
In the above example, if action 1 is taken then we can have a cost function that determines, given the features, the cost of taking action 1 on action 1, and the cost of taking action 1 on action 0. We can set a label with an `action:cost:probability` tuple for both actions when learning. The important thing to get right here is the `cost` of each action and what information is revealed about it when an action is taken. `action` can be set to `0` as it will be ignored, and `probability` can be set to `1`.
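As a sketch of a learning example under the graph above, suppose action 1 was taken and its observed feedback also revealed the cost of action 0 (the feature names and cost values here are hypothetical):

```
shared graph 0,1,1 1,1,1 | s_feature_1 s_feature_2
0:1.0:1 | action_0_features
0:0.5:1 | action_1_features
```

Both action lines carry an `action:cost:probability` label, with `action` set to `0` and `probability` set to `1` as described above. Had action 0 been taken instead, the graph says nothing is revealed, so neither line would carry a label.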
## Available reduction options
- `--cb_explore_adf --graph_feedback` activates the algorithm
- `gamma_scale` defaults to `0.5` and `gamma_exponent` defaults to `1.0`; these values are used to set `gamma = [gamma_scale] * [num examples]^[gamma_exponent]`

`gamma` will increase as the number of examples consumed grows: a small gamma means the resulting probability distribution over the available actions is influenced more heavily by the graph, and as gamma increases the distribution is influenced less by the graph and more by the learned policy.
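As a usage sketch (the data file name is hypothetical), the reduction can be invoked from the command line as:

```
vw --cb_explore_adf --graph_feedback --gamma_scale 0.5 --gamma_exponent 1.0 -d train.dat
```

With these defaults, after 1000 examples gamma = 0.5 * 1000^1.0 = 500, so the learned policy progressively dominates the graph's influence on the action distribution.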