Analysis Report
Question
How do different information dissemination (feedback) schemes influence the behavior of conditionally cooperating reinforcement learning agents in a public goods game?
Motivation
Per Kurzban and Houser (2005), around 63% of people are conditional cooperators in groups: their contributions depend on how much they think the other members of their group will contribute. Understanding how they gauge the cooperativeness of their group helps us figure out what drives people to contribute, and how the information available to a group can be organized to produce the most cooperative outcome.
Related Work
- Kleiman-Weiner, Max, Ho, Mark K., Austerweil, Joseph L., Littman, Michael L., & Tenenbaum, Joshua B. (2016). Coordinate to cooperate or compete: Abstract goals and joint intentions in social interaction. COGSCI. Retrieved from http://par.nsf.gov/biblio/10026426
The lower-level reinforcement learning function was taken from this paper. It studies how two agents in an environment behave when cooperating and competing; it does not cover in great detail how to switch between the cooperating and competing modes, instead leaving the choice of switching mechanism open.
- Ezaki T, Horita Y, Takezawa M, Masuda N (2016) Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin. PLoS Comput Biol 12(7): e1005034. https://doi.org/10.1371/journal.pcbi.1005034
The upper-level reinforcement learning function was taken from this paper. It studies how agents playing the public goods game switch between cooperating and competing using an aspiration-based model.
RL Model Specification
Cooperating
- a cooperative action is defined as one that maximizes reward for all agents
- this motivates the idea of a “group agent” whose reward is the sum of all agents’ rewards, weighted by how much the group agent cares about each agent
- for our purposes we weight each agent equally
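To make this concrete, here is a minimal sketch of the group-agent reward under the equal-weighting assumption (the function name and signature are illustrative, not taken from the codebase):

```python
import numpy as np

def group_agent_reward(agent_rewards, weights=None):
    """Reward of the hypothetical "group agent": a weighted sum of the
    individual agents' rewards. With equal weights (our case) this is
    simply the total reward across all agents."""
    agent_rewards = np.asarray(agent_rewards, dtype=float)
    if weights is None:
        weights = np.ones_like(agent_rewards)  # weigh every agent equally
    return float(np.dot(weights, agent_rewards))
```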
Competing
- agents have some default “level 0” behavior, which is then updated each round based on the previous round
- from this, they can develop a recursively defined behavior
- the other agent is treated as part of the environment and is factored in when calculating P(s'|s, a), since the other agent is part of the current state (s)
- just because these agents plan selfishly does not mean they actively oppose each other; it is possible for the agents to develop norms that collectively help everyone
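As one way to picture “the other agents are part of the environment”, here is a hedged sketch (not the formulation from Kleiman-Weiner et al.): a selfish planner picks the contribution that maximizes its own expected payoff, marginalizing over a predicted distribution of the other agents’ total contribution, e.g. estimated from their level-0 behavior in the previous round. All names, and the payoff form (multiplier 1.6, even split), are assumptions carried over from the game description below.

```python
import numpy as np

# discrete contribution levels: 0.0, 0.1, ..., 1.0
ACTIONS = np.round(np.arange(0.0, 1.01, 0.1), 1)

def selfish_best_response(predicted_others_total, n_players, multiplier=1.6):
    """Pick the contribution that maximizes this agent's own expected payoff,
    treating the other agents as a fixed part of the environment.

    predicted_others_total: dict {total contribution of the others: probability},
    e.g. estimated from the previous round's observations.
    """
    def expected_payoff(a):
        return sum(
            prob * (multiplier * (a + others) / n_players - a)
            for others, prob in predicted_others_total.items()
        )
    return max(ACTIONS, key=expected_payoff)
```

Because the per-capita return here (1.6 / 25) is below 1, this purely selfish planner always free-rides; conditional cooperation only emerges from the aspiration-based switching described below.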
Coordinating Cooperating and Competing:
- An agent can derive the intentions of other agents from these functions (I)
- in the case of cooperative planning, agents are given the other agents’ actions by the group-agent function
- in the case of competitive planning, agents take their actions using an expectation of what the other agents’ actions will be
- From there, use Bayesian inference, P(I|D) ∝ P(D|I)P(I), where the likelihood P(D|I) is given by the policy, to infer an agent’s intention (I) from its observed behavior (D)
- From there, any strategy (e.g. tit-for-tat learning) can be used to pick which action the agent should choose
- We will use the Bush-Mosteller approach:
- the basic idea is that whether or not a person contributes depends on whether they think the group’s outcome met an expectation (aspiration) of theirs
- the initial condition p1 is drawn from the uniform density on [0, 1], independently for each player
- we define pt as the expected contribution the player makes in round t; the actual contribution at is drawn from a truncated Gaussian with mean pt and standard deviation 0.2 (if at falls outside [0, 1], it is discarded and redrawn until it falls within [0, 1])
- a threshold contribution value X, distinct from the aspiration level A, determines whether an action is regarded as cooperative or defective
- X = 0.3 and 0.4 give conditionally cooperative behavior
- β = 0.4, A = 0.9
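Below is a minimal sketch of this aspiration-based update, assuming the standard Bush-Mosteller form: the stimulus is s = tanh[β(π − A)], where π is the round payoff and A the aspiration level, the expected contribution p is reinforced toward or away from the last action depending on the sign of the stimulus, and at ≥ X counts as cooperation. The exact equations should be checked against Ezaki et al. (2016); everything here is a reconstruction, not the project’s implementation.

```python
import numpy as np

rng = np.random.default_rng()

def draw_contribution(p):
    """Draw the actual contribution from a Gaussian with mean p and sd 0.2,
    redrawing until it falls inside [0, 1] (truncated Gaussian)."""
    while True:
        a = rng.normal(loc=p, scale=0.2)
        if 0.0 <= a <= 1.0:
            return a

def bush_mosteller_update(p, a, payoff, A=0.9, beta=0.4, X=0.3):
    """One aspiration-based update of the expected contribution p.

    s > 0 means the round payoff exceeded the aspiration A; the last action
    counts as cooperation if a >= X and as defection otherwise (assumption).
    """
    s = np.tanh(beta * (payoff - A))            # stimulus in (-1, 1)
    if a >= X:                                  # cooperated
        p_next = p + (1 - p) * s if s >= 0 else p + p * s
    else:                                       # defected
        p_next = p - p * s if s >= 0 else p - (1 - p) * s
    return float(np.clip(p_next, 0.0, 1.0))
```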
How This Applies to Our Game
- actions are simply how much an agent decides to contribute
Information Dissemination Schemes: Base Game
- agents are not exposed to any information whatsoever
Treatment 1
- agents are exposed to the sum of contributions for each round
Treatment 2
- agents are exposed to the individual contributions of every other agent in each round
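In code terms, the three schemes amount to three different observation functions over a round’s contribution vector; a hypothetical sketch (the scheme labels and function name are ours):

```python
def make_observation(contributions, agent_index, scheme):
    """What agent `agent_index` is shown at the end of a round.

    contributions: list of every player's contribution this round.
    scheme: "base", "treatment1", or "treatment2".
    """
    if scheme == "base":
        return None                                    # no information at all
    if scheme == "treatment1":
        return sum(contributions)                      # only the round's total
    if scheme == "treatment2":
        # every other agent's individual contribution
        return [c for i, c in enumerate(contributions) if i != agent_index]
    raise ValueError(f"unknown scheme: {scheme}")
```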
Method
Which Parameters are Used (for each type of game)
- A = 0.9, β = 0.4, X = 0.3
- A = 0.9, β = 0.4, X = 0.4
- A = 0.9, β = 0.4, X = 0.5
Our modeling policies/experimental set-up: Base Game
- agents are not exposed to any information whatsoever
Treatment 1
- agents are exposed to the sum of contributions for each round
Treatment 2
- agents are exposed to the individual contributions of every other agent in each round
The parameters varied:
- A in range(0, 2.1, 0.1),
- β in range(0, 3.1, 0.1),
- X = 0.3, 0.4, 0.5,
- I = {Base Game, Treatment 1, Treatment 2}
Total number of games: 21 × 31 × 3 × 3 = 5,859. Each game is run 100 times.
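As a sketch, the full factorial grid (21 A-values × 31 β-values × 3 X-values × 3 schemes = 5,859 game conditions) could be built as follows; `play_game` is a hypothetical runner, not an existing function in the repository:

```python
import itertools
import numpy as np

A_values = np.round(np.arange(0.0, 2.1, 0.1), 1)     # 21 values: 0.0 .. 2.0
beta_values = np.round(np.arange(0.0, 3.1, 0.1), 1)  # 31 values: 0.0 .. 3.0
X_values = [0.3, 0.4, 0.5]
schemes = ["base", "treatment1", "treatment2"]

grid = list(itertools.product(A_values, beta_values, X_values, schemes))
assert len(grid) == 21 * 31 * 3 * 3                   # 5,859 game conditions

# for A, beta, X, scheme in grid:
#     for run in range(100):                 # each condition simulated 100 times
#         play_game(A, beta, X, scheme)      # hypothetical game runner
```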
Base Game sanity check and sensitivity analysis
- Play with 5 players only: examine the contribution per round per player
- 4 fixed agents with maximum contributions and one reinforcement learning agent
- Vary number of players
How Each Game Is Run
- 25 players, 12 rounds
- 3 unconditional cooperators (contribute from uniform distribution in [0.8, 1])
- 5 unconditional free-riders (contribute from uniform distribution in [0, 0.2])
- 17 conditional cooperators trained on the reward function we’ve designated
- contributions are in [0, 1] and discrete with intervals of 0.1 (e.g. 0.1, 0.2, 0.3 are valid but 0.01 is not)
- p is initialized as a random number from the uniform distribution on [0, 1] (as in the Ezaki paper)
- the reinforcement learning strategies for cooperation and competition are trained to produce what each considers the best first-round contribution, one for the cooperating case and one for the competing case
- at the end of each round, the contributions are multiplied by 1.6 and evenly redistributed to the players (as in the Ezaki paper and the supporting Nature paper)
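A minimal sketch of one round under these settings (discrete 0.1 contribution levels, the 1.6 multiplier, even redistribution); all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng()

MULTIPLIER = 1.6
LEVELS = np.round(np.arange(0.0, 1.01, 0.1), 1)   # valid contribution levels

def snap_to_grid(x):
    """Round a contribution to the nearest 0.1 increment in [0, 1]."""
    return float(LEVELS[np.argmin(np.abs(LEVELS - x))])

def play_round(contributions):
    """Multiply the pot by 1.6, split it evenly, and return each player's payoff."""
    contributions = np.asarray(contributions, dtype=float)
    pot = MULTIPLIER * contributions.sum()
    return pot / len(contributions) - contributions   # even share minus own contribution

# example draws for the fixed agent types:
coop_contribution = snap_to_grid(rng.uniform(0.8, 1.0))   # unconditional cooperator
rider_contribution = snap_to_grid(rng.uniform(0.0, 0.2))  # unconditional free-rider
```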
Number of simulations: 50 to 100 per game, i.e. up to 100 × 21 × 31 × 3 × 3 game runs in total.
Outcome Measures Examined
- contribution behavior of individual agents (examined with time series plots per game)
- contribution behavior of all conditionally cooperating agents in each game (examined with time series plots per game)
Validation Technique
- when plotting each agent’s average contribution against the average contribution of the other agents, for the condition where individual contributions are visible, we should see a graph similar to the conditional-cooperation plot in Ezaki et al.
What Data Is Collected (for each condition):
- each agent’s contributions in each round for all games
- their type: unconditional cooperator, free-rider, or conditional cooperator (reciprocator)
- game condition (A, β, X, I)
- the mean/sd of contributions for each agent for a particular game condition
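A hypothetical layout for this data, one row per agent per round per run, which makes the summaries and plots below straightforward (all column names are ours):

```python
import pandas as pd

record = {
    "run": 0, "round": 1, "agent_id": 7,
    "agent_type": "conditional",     # "cooperator", "free_rider", or "conditional"
    "A": 0.9, "beta": 0.4, "X": 0.3, "scheme": "treatment1",
    "contribution": 0.5,
}
df = pd.DataFrame([record])

# mean/sd of contributions per agent for a given game condition
summary = (df.groupby(["A", "beta", "X", "scheme", "agent_id"])["contribution"]
             .agg(["mean", "std"]))
```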
Exploratory analysis: What Data Is Plotted:
- time series plot for each agent in each game (line graph, x-axis is time, y-axis is contribution)
- average agent contribution vs. other agents’ contribution (plot from Ezaki) for each information dissemination scheme
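A sketch of the per-agent time series plot with matplotlib; the trajectories below are random placeholder data standing in for one game’s contributions (25 agents × 12 rounds), not simulation output:

```python
import matplotlib.pyplot as plt
import numpy as np

# placeholder data: one row per agent, one column per round
contributions = np.random.default_rng(0).uniform(0.0, 1.0, size=(25, 12))

fig, ax = plt.subplots()
for series in contributions:
    ax.plot(range(1, 13), series, alpha=0.5)
ax.set_xlabel("Round")
ax.set_ylabel("Contribution")
ax.set_title("Per-agent contributions over the 12 rounds of one game")
plt.show()
```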
Statistical modeling:
ANOVA / t-tests comparing the different treatments, etc.
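For instance, a one-way ANOVA across the three information schemes, with pairwise t-tests as follow-ups, could be run with SciPy; the arrays below are placeholder values standing in for per-run mean contributions, not real results:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# placeholder per-run mean contributions under each information scheme
base       = rng.uniform(0.0, 0.4, size=100)
treatment1 = rng.uniform(0.1, 0.5, size=100)
treatment2 = rng.uniform(0.2, 0.6, size=100)

f_stat, p_value = stats.f_oneway(base, treatment1, treatment2)   # one-way ANOVA
t_stat, p_pair = stats.ttest_ind(base, treatment2)               # pairwise follow-up
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.3g}; base vs. treatment 2: p={p_pair:.3g}")
```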
Results
- tbd