Original Reward Function

The Reward Function

From Coordinate to Cooperate or Compete: Abstract Goals and Joint Intentions in Social Interaction (Kleiman-Weiner et al., 2016; full citation at the end of this page)

Context:

  • the goal of this game is to secure the reward
  • the blue and yellow dots have the choice of going for the reward cell at the edge or in the center
  • going to the edges symbolizes cooperating, going to the middle symbolizes competing (since if one dot’s in the middle the other cannot access their central reward cell)
  • moving costs the dots one point, while staying still and waiting does not; this forces the dots to minimize the number of moves they spend (a minimal reward sketch follows this list)
    • this also makes the central reward cell more attractive than the edge cells
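
A minimal sketch of this step reward, written as Python. The goal payout value and the cell representation are assumptions; only the one-point movement cost and the free "wait" action come from the notes above.

```python
# Hedged sketch of the step reward described in the list above.
# `goal_value` is an assumed parameter, not a value from the paper.
def step_reward(current_pos, next_pos, reward_cells, goal_value=10.0):
    reward = 0.0
    if next_pos != current_pos:   # moving costs one point; waiting is free
        reward -= 1.0
    if next_pos in reward_cells:  # an edge cell (cooperate) or the center cell (compete)
        reward += goal_value
    return reward
```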

Cooperating

The paper defines a Q function for the group's cooperative behavior:

  • a cooperative action is defined as one that maximizes reward for all agents
    • this motivates the idea of a “group agent” whose reward is the sum of all agents’ rewards, weighted by how much the group agent cares about each agent
      • in the simplest case, this is just the sum of each agent’s individual reward
    • for our purposes we can weight each agent equally
  • agents pick the best action from this group policy by finding the best action for themselves while considering all of the actions the rest of the group can take (sketched below)
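
A minimal sketch of this group-agent planning step, assuming a tabular q_group lookup from (state, joint action) to value; all names and data structures here are illustrative, not taken from the paper.

```python
# Hedged sketch: the group agent's reward is a weighted sum of the individual
# agents' rewards (equal weights by default), and each agent picks its own
# action by maximizing the group value over all joint actions.
def group_reward(individual_rewards, weights=None):
    """Group reward = weighted sum of each agent's reward."""
    if weights is None:
        weights = [1.0 / len(individual_rewards)] * len(individual_rewards)
    return sum(w * r for w, r in zip(weights, individual_rewards))

def best_own_action(q_group, state, my_actions, other_actions):
    """Choose my action a_i that maximizes q_group over joint actions (a_i, a_j)."""
    best_action, best_value = None, float("-inf")
    for a_i in my_actions:
        for a_j in other_actions:
            value = q_group[(state, (a_i, a_j))]
            if value > best_value:
                best_action, best_value = a_i, value
    return best_action
```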

Competing

  • agents have some default “level 0” behavior, which is then built upon each round based on the prior round (see the sketch at the end of this section)
    • the “level 0” behavior is to greedily select the action with the highest expected value, summing P(s’ | s, a) * (Utility(s’) + decay factor * utility of future actions) over future states s’

  • from this, they can develop a recursively defined behavior
    • it looks much like the “level 0” behavior; the difference is that the policy (π) has been shaped by the prior rounds of play

  • the other agent is treated as part of the environment, which is factored into P(s’ | s, a), since the other agent is part of the current state s
    • P(s’ | s, a) is really the probability of arriving at a future state times the probability that the other agent will take the action that allows that state to be reached, i.e. the sum over the other agent’s actions a’ of P(a’ | s) * P(s’ | s, a, a’)

  • just because these agents are selfishly planning doesn’t mean that they are actively opposing each other; it is possible that agents all develop norms that collectively help everyone
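
A minimal sketch of the “level 0” selfish planner described in this section, assuming hypothetical transition, other_policy, utility, and future_value callables (none of these names come from the paper, and GAMMA is an assumed value).

```python
# Hedged sketch of the "level 0" selfish planner. All names and values below
# are illustrative assumptions.
GAMMA = 0.95  # decay factor for future utility (assumed value)

def level0_action(state, my_actions, other_actions,
                  transition, other_policy, utility, future_value):
    """Greedily pick the action with the highest expected discounted value,
    treating the other agent as part of the environment:
        Q(s, a) = sum over a' of P(a'|s) * sum over s' of P(s'|s, a, a')
                  * (Utility(s') + GAMMA * future_value(s'))
    """
    def q_value(a):
        total = 0.0
        for a_other in other_actions:
            p_other = other_policy(state, a_other)                # P(a'|s)
            for s_next, p_next in transition(state, a, a_other):  # pairs (s', P(s'|s, a, a'))
                total += p_other * p_next * (utility(s_next) + GAMMA * future_value(s_next))
        return total
    return max(my_actions, key=q_value)
```

Recursively defined higher-level planners would reuse the same computation, with other_policy replaced by the policy inferred from prior rounds.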

Coordinating Cooperating and Competing:

  • An agent can derive the intentions (I) of other agents from these planning functions
    • in the case of cooperative planning, the agents are given the other agents’ actions from the group-agent function
    • in the case of competitive planning, the agents take their actions using an expectation for what the other agents’ actions will be
  • From there, simple Bayesian inference, P(I|D) ∝ P(D|I)P(I), lets an agent infer another agent’s intention (I) from its observed actions (D), where the likelihood P(D|I) is given by the policy that intention implies (see the sketch after this list)
  • From there, any sort of strategy (e.g. tit-for-tat learning) can be used to pick which action the agent should choose
    • choice 1: tit-for-tat learning
    • choice 2: Bush-Mosteller
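
A minimal sketch of the intention-inference step, P(I|D) ∝ P(D|I)P(I), assuming a hypothetical policy_likelihood callable that returns the probability of an action under a given intention.

```python
# Hedged sketch: posterior over intentions given observed (state, action) pairs.
# `intentions`, `prior`, and `policy_likelihood` are illustrative placeholders.
def infer_intention(observed_actions, intentions, prior, policy_likelihood):
    """P(I|D) ∝ P(D|I) * P(I), with P(D|I) taken as the product of the
    per-step action probabilities under intention I's policy."""
    posterior = {}
    for intention in intentions:
        likelihood = 1.0
        for state, action in observed_actions:
            likelihood *= policy_likelihood(intention, state, action)
        posterior[intention] = prior[intention] * likelihood
    normalizer = sum(posterior.values()) or 1.0  # guard against an all-zero posterior
    return {i: p / normalizer for i, p in posterior.items()}
```

The resulting posterior over intentions (e.g. cooperate vs. compete) is what a strategy like tit-for-tat or Bush-Mosteller can then condition on.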

Bush-Mosteller gives each player a probability of cooperating in the public goods game (this is the model from Ezaki et al.).

The stimulus (s) is based on whether or not the player’s actual payoff met some aspiration level A (i.e. an expectation).

  • the basic idea is that whether or not a person contributes depends on whether they feel the group’s payoff met their expectation
    • the initial value p1 is drawn from the uniform distribution on [0, 1], independently for each player
    • pt is the player’s expected contribution in round t; the actual contribution at is drawn from a Gaussian with mean pt and standard deviation 0.2, truncated to [0, 1] (draws outside [0, 1] are discarded and redrawn)
    • a threshold contribution value X, distinct from the aspiration A, classifies each action as either cooperative or defective (see the sketch after this list)
      • X = 0.3 and 0.4 for conditional cooperative behavior
      • beta = 0.4, A = 0.9
  • we use Bush-Mosteller to decide whether to be competitive or cooperative, and then use Q-Learning to tune the exact contribution
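
A minimal sketch of this Bush-Mosteller style update, assuming the common aspiration-based form with a tanh stimulus; the exact functional form used by Ezaki et al. should be checked against the paper, and the default parameter values below just echo the ones listed above.

```python
import math
import random

def draw_contribution(p, sd=0.2):
    """Draw the round-t contribution from Gaussian(mean=p, sd=0.2),
    redrawing until it falls inside [0, 1] (truncation by rejection)."""
    while True:
        a = random.gauss(p, sd)
        if 0.0 <= a <= 1.0:
            return a

def bush_mosteller_update(p, cooperated, payoff, aspiration=0.9, beta=0.4):
    """Update the cooperation tendency p after one round.

    The stimulus s is positive when the realized payoff met the aspiration A
    and negative otherwise; beta controls how strongly the player reacts.
    (Assumed form: s = tanh(beta * (payoff - aspiration)).)
    """
    s = math.tanh(beta * (payoff - aspiration))
    if cooperated:                     # the contribution met the threshold X
        return p + (1 - p) * s if s >= 0 else p + p * s
    else:
        return p - p * s if s >= 0 else p - (1 - p) * s
```

Here `cooperated` would be contribution >= X, with X = 0.3 or 0.4 as listed above.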

How This Applies to Our Game

  • actions are simply how much an agent decides to contribute
  • tit-for-tat learning applies to the public goods game once we reduce the choices down to cooperate or defect: players do best collectively when everyone contributes, but an individual is better off defecting entirely than contributing alongside others who defect (see the sketch below)
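
A minimal sketch of that reduction, assuming the threshold X from the previous section is what separates “cooperate” from “defect” (the default threshold value simply echoes the one quoted above).

```python
# Hedged sketch: reduce a continuous contribution to the binary choice that
# tit-for-tat needs, using a threshold X (0.3 is the value quoted above).
def classify_action(contribution, x_threshold=0.3):
    return "cooperate" if contribution >= x_threshold else "defect"

def tit_for_tat(opponent_last_action=None):
    """Cooperate on the first round, then copy the opponent's last label."""
    return opponent_last_action if opponent_last_action else "cooperate"
```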

How Ezaki et al. Verified Their Model

Citations

  • Kleiman-Weiner, M., Ho, M. K., Austerweil, J. L., Littman, M. L., & Tenenbaum, J. B. (2016). Coordinate to cooperate or compete: Abstract goals and joint intentions in social interaction. CogSci. http://par.nsf.gov/biblio/10026426
  • Ezaki, T., Horita, Y., Takezawa, M., & Masuda, N. (2016). Reinforcement learning explains conditional cooperation and its moody cousin. PLoS Computational Biology, 12(7), e1005034. https://doi.org/10.1371/journal.pcbi.1005034