# CB with Graph Feedback
What: Contextual Bandits (CB) with graph feedback can be used in scenarios where some actions, when taken, reveal information about other actions (not taken), or reveal no information at all. If prior knowledge of this relationship between actions exists, it can be used to make exploration and learning more efficient.
CB with graph feedback is a constrained optimization problem: regret is minimized while information gathering is maximized (less exploration waste) by following the leads the feedback graph provides.
Publication: https://arxiv.org/abs/2302.08631
Some scenarios include spam filtering (aka apple tasting), first-price auction bidding, and inventory management.
## Expected input
Input is the same as `cb_explore_adf` input for Vowpal Wabbit, but it now includes the graph. The other change is that if an action is taken and it reveals information about other actions, all of the influenced actions can be given a label, thus accelerating learning.
This tutorial is a good example of how to build VW examples, especially the section on understanding the format.
The `shared` example has to change to accommodate the graph. The graph is supplied as an NxN matrix, where N is the number of actions in the input. Each cell represents the probability that taking the action in the column reveals information about the action in the row (see the example below). The graph is supplied in coordinate format `(row, col, val)`, and only the non-zero values need to be specified.
For example, the graph below

```
0 1
0 1
```

indicates that taking action 0 reveals no information, either for itself or for action 1, whereas taking action 1 reveals information for both actions (e.g. categorizing an email as spam vs. non-spam).
This is represented in the Vowpal Wabbit input as:

- for text input:

  ```
  shared graph 0,1,1 1,1,1 | <shared features>
  ```

- for JSON input:

  ```json
  "_graph" : [{"row": 0, "col": 1, "val": 1}, {"row": 1, "col": 1, "val": 1}]
  ```
## Expected Labels
In the above example, if action 1 is taken then we can have a cost function that determines, given the features, the cost of taking action 1 on action 1, and the cost of taking action 1 on action 0. We can set a label with an `action:cost:probability` tuple for both actions when learning. The important thing to get right here is the `cost` of each action and what information is revealed about it when an action is taken. `action` can be set to `0` as it will be ignored, and `probability` can be set to `1`.
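As a sketch of a learning example under the graph above, suppose action 1 was taken and its observed feedback also revealed the cost of action 0 (the feature names and cost values here are hypothetical):

```
shared graph 0,1,1 1,1,1 | s_feature_1 s_feature_2
0:1.0:1 | action_0_features
0:0.5:1 | action_1_features
```

Both action lines carry an `action:cost:probability` label, with `action` set to `0` and `probability` set to `1` as described above. Had action 0 been taken instead, the graph says nothing is revealed, so neither line would carry a label.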
## Available reduction options
- `--cb_explore_adf --graph_feedback` activates the algorithm
- `gamma_scale` defaults to `0.5` and `gamma_exponent` defaults to `1.0`; these values are used to set `gamma = [gamma_scale] * [num examples]^[gamma_exponent]`

`gamma` will increase as the number of examples consumed grows: a small gamma means the resulting probability distribution over the available actions is influenced more heavily by the graph, and as gamma increases the distribution is influenced less by the graph and more by the learned policy.
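As a usage sketch (the data file name is hypothetical), the reduction can be invoked from the command line as:

```
vw --cb_explore_adf --graph_feedback --gamma_scale 0.5 --gamma_exponent 1.0 -d train.dat
```

With these defaults, after 1000 examples gamma = 0.5 * 1000^1.0 = 500, so the learned policy progressively dominates the graph's influence on the action distribution.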