Py Pommerman Past Competition Entries
Pommerman: A Multi-Agent Playground
https://arxiv.org/pdf/1809.07124.pdf
Pommerman - 11x11 grid, 4 agents
- 6 actions - STOP, UP, DOWN, LEFT, RIGHT, LAY BOMB (see the enum sketch after this list)
- 3 tile types - passage, rigid wall, wooden wall; a destroyed wooden wall has a 50% chance of becoming a passage and a 50% chance of revealing a hidden power-up
- Power-ups - extra ammo, extra range, can kick (bombs)
- Tie - the match is rerun in case of a tie; if still tied, walls collapse until there is a single winner
- Communication - every turn the agent emits a message of 2 words from a dictionary of size 8
- Agent sees a 7x7 area centered on its position (possibly only in the NIPS competition)
- Teammates spawn diagonally from each other
- Maps are procedurally generated; there is a guaranteed path between every pair of agents
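For reference, the six actions as a small Python enum; the 0-5 ordering is an assumption matching the response format described in the NIPS 2018 section below:

```python
# Illustrative action enum; the 0..5 ordering is an assumption about the
# playground's encoding, not taken from official documentation.
from enum import IntEnum

class Action(IntEnum):
    STOP = 0
    UP = 1
    DOWN = 2
    LEFT = 3
    RIGHT = 4
    BOMB = 5
```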
Input:
- Board - 11x11 ints, flattened
- Fog has value 5 in the partially observable setting
- The agent gets a purview of 5x5
- Position - 2 ints, x and y in [0, 10]
- Ammo - 1 int
- Blast strength - 1 int
- Can kick - 1 int, [0, 1]
- Teammate - 1 int, [-1, 3]; a non-teammate has value -1
- Enemies - 3 ints, [-1, 3]; in the team variant, the teammate has value -1
- Bomb blast strength - list of ints, one for each bomb in the agent's purview
- Bomb life - list of ints
- Message - 2 ints, [0, 8]; both ints are 0 when the teammate is dead
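Put together, a single observation could look like the sketch below; the key names follow the Python playground's conventions, but treat the exact names and types as assumptions:

```python
# Hedged sketch of one observation, mirroring the fields listed above.
observation = {
    "board": [[0] * 11 for _ in range(11)],  # 11x11 ints; fog cells are 5
    "position": (1, 1),                      # x, y in [0, 10]
    "ammo": 1,
    "blast_strength": 2,
    "can_kick": 0,                           # 0 or 1
    "teammate": -1,                          # -1 when there is no teammate
    "enemies": [1, 2, 3],                    # 3 ints; teammate slot is -1 in the team variant
    "bomb_blast_strength": [2],              # one entry per bomb in the purview
    "bomb_life": [10],
    "message": (0, 0),                       # (0, 0) when the teammate is dead
}
```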
FFA - single-agent submission, no teams
NIPS 2018 - required the submission of 2 agents, packaged with Docker (2 agents, 2 Docker files)
- The act function takes in a dictionary of observations
- HTTP is used for communication
- The agent's response is a single int in [0, 5] representing the action to take; the team variant adds 2 extra ints in [0, 8] representing the message. On timeout (100 ms) a stop action with message (0, 0) is issued. The stop-on-timeout rule is specific to the competition, not the framework itself.
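As an illustration, a minimal HTTP agent in the spirit of this protocol is sketched below; the route, port and JSON schema are assumptions, not the official competition API:

```python
# Minimal sketch of an HTTP agent: receives an observation dict as JSON,
# replies with an action int in [0, 5] plus a message for the team variant.
# Port and reply schema are assumptions for illustration only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        obs = json.loads(self.rfile.read(length))  # observation dictionary
        action = 0                                 # STOP; a real agent computes this from obs
        body = json.dumps({"action": action, "message": [0, 0]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 10080), AgentHandler).serve_forever()
```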
Mechanics
- The agent starts with 1 ammo; laying a bomb costs 1 ammo (-1) and the ammo is returned when the bomb explodes (+1)
- Blast strength is initially 2
- Bomb life - 10 steps
- Wooden wall - 50% chance of hiding a power-up
- Power-ups
  - Extra bomb
  - Increase range
  - Can kick
    - permanently allows kicking bombs by moving into them
    - a kicked bomb travels 1 unit per time step in its direction until it hits a player, a wall or another bomb
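As a toy illustration of the kick rule, one step of a kicked bomb might be resolved like this; the board encoding and names are hypothetical:

```python
# Toy sketch of the kick mechanic: a kicked bomb advances one cell per time
# step in its direction until a player, wall or another bomb blocks it.
def step_kicked_bomb(board, pos, direction):
    """Return the bomb's next cell, or pos unchanged if it is blocked."""
    dr, dc = direction                                # e.g. (0, 1) for RIGHT
    r, c = pos[0] + dr, pos[1] + dc
    if not (0 <= r < len(board) and 0 <= c < len(board[0])):
        return pos                                    # board edge stops it
    if board[r][c] in ("wall", "bomb", "agent"):      # hypothetical encoding
        return pos
    return (r, c)
```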
5. Skynet
- 2nd place in the learning track, 5th in the global ranking at NIPS 2018 https://www.borealisai.com/en/blog/pommerman-team-competition-or-how-we-learned-stop-worrying-and-love-battle/
- Flames have a lifetime of 2 game steps
- Noisy reward - when an opponent commits suicide, the agent receives +1 as reward
- The lack of a fast forward simulator makes vanilla search algorithms ineffective
- SimpleAgent uses Dijkstra's algorithm to get away from bomb explosions
- The trained agent learned to exploit a bug in SimpleAgent by forcing it to commit suicide
- Input 14x11x11 - includes the previous observation for memory
- The competition's top-3 agents used a forward model for search
- They filtered the action space - to avoid suicide, to not place a bomb when a teammate is nearby, and to escape when the agent's position is covered by a bomb's blast (see the sketch below)
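A hedged sketch of what such an action filter could look like; the unsafe-cell set and move table are assumptions, not Skynet's actual code:

```python
# Sketch of action filtering: drop moves that land on a cell covered by an
# imminent blast. `unsafe_cells` would be computed from bomb_life and
# bomb_blast_strength; here it is assumed to be given.
MOVES = {0: (0, 0), 1: (-1, 0), 2: (1, 0), 3: (0, -1), 4: (0, 1)}

def filter_actions(legal_actions, position, unsafe_cells):
    safe = []
    for a in legal_actions:
        dr, dc = MOVES.get(a, (0, 0))
        nxt = (position[0] + dr, position[1] + dc)
        if nxt not in unsafe_cells:
            safe.append(a)
    return safe or legal_actions  # never hand back an empty action set
```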
4. Continual Match Based Training in Pommerman: Technical Report
https://arxiv.org/pdf/1812.07297.pdf
- Navocado - 4th place overall, winner of the learning track
- COMBAT - randomly pick 4 agents, play them, remove converged weak ones and create new ones
- Population-based A2C agents; round-robin matches with an Elo score for ranking (see the rating sketch after this list)
- State space - an 11x11x11 encoding of all information on the map
- Local optimum - the agent learns not to lay bombs at all, which motivated reshaping the action space as below
- Modified action space - 122 dimensions: 121 for the flattened board (target position) plus one for laying a bomb
- Dijkstra is used to plan the path to the predicted destination position
- Network: 3x3 conv layers with 16, 32 and 64 filters, flattened; hyperbolic tangent activation; A2C for the policy
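The Elo update used for such round-robin rankings is a standard formula; a minimal sketch (the K-factor of 32 is an arbitrary choice):

```python
# Standard Elo rating update for one match between players A and B.
def elo_update(r_a, r_b, score_a, k=32):
    """score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```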
Won the 2018 NeurIPS learning competition. Trained using round-robin competitions against other agents, ranking them continuously. 220 CPUs and 32 GPUs were used. SimpleAgent was used as the teammate and was slowly replaced by another trainable agent. Trained for dozens of days and kept improving.
1. and 3. Hakozaki, dypm - IBM Tokyo - winners of the NIPS competition
https://www.ibm.com/blogs/research/2019/03/real-time-sequential-decision-making/
https://arxiv.org/pdf/1902.10870.pdf
- Uses a pessimistic tree search approach with limited depth (see the search sketch after this list). Pessimistic scenarios can be illegal or unrealistic, for example copying the opponent into multiple positions at once, while in the actual game only one position is possible.
- Submissions - hakozaki and dypm-final (same idea, small differences)
- The authors re-ran the competition with the top 5 agents playing 200 matches against each other; search-based agents completely dominated the other entries. Note that this was done in a different setup than the competition, so the learning agents may not have benefited from the same hardware as in the competition. Hakozaki was written in Java, eisenach in C++ and dypm in Python; hakozaki and eisenach used multi-threading.
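A hedged sketch of the general shape of a depth-limited pessimistic search (maximin over opponent replies); `step` and `evaluate` are placeholder functions, not the dypm implementation:

```python
# Depth-limited pessimistic search: pick our action assuming the worst-case
# opponent reply at every level. `step` advances a forward-model state and
# `evaluate` scores a leaf; both are placeholders for illustration.
def pessimistic_search(state, depth, my_actions, opp_actions, step, evaluate):
    if depth == 0:
        return evaluate(state), None
    best_value, best_action = float("-inf"), None
    for a in my_actions:
        # Pessimism: assume the opponent reply that is worst for us.
        worst = min(
            pessimistic_search(step(state, a, o), depth - 1,
                               my_actions, opp_actions, step, evaluate)[0]
            for o in opp_actions
        )
        if worst > best_value:
            best_value, best_action = worst, a
    return best_value, best_action
```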
2nd place, Eisenach (Gorog Marton) - implemented in C++, with many engineering tricks to achieve an average depth of 2 in the tree search, searching as deep as time allows.
Backplay: "Man muss immer umkehren"
https://arxiv.org/pdf/1807.06919.pdf
- Pommerman maps are random, but there is a guaranteed path between any two agents
- The observation space is 19 11x11 maps
- Uses the FFA setting
- Max steps - 800; reward - +1 for a win, -1 otherwise
- RL - 4 conv layers with 256 output channels, PPO for optimization, batch size 102400, 60 parallel workers
- Trained for about 50M frames over 72 hours
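Backplay itself is a curriculum idea: training episodes start from states near the end of a demonstration, and the starting point is moved earlier as training progresses. A minimal sketch of that start-state sampling, assuming a recorded list of demonstration states and a simple linear schedule (both assumptions):

```python
# Sketch of Backplay's start-state curriculum: sample episode starts from a
# demonstration, beginning near its end and moving backwards over training.
import random

def backplay_start_state(demo_states, progress):
    """demo_states: states of one demonstration; progress in [0, 1]."""
    # progress = 0 -> only the final state; progress = 1 -> any state.
    earliest = int((1.0 - progress) * (len(demo_states) - 1))
    return demo_states[random.randint(earliest, len(demo_states) - 1)]
```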
Play against top players
Run `docker pull multiagentlearning/<agent>`, where `<agent>` is one of hakozakijunctions, eisenach, dypm.1, dypm.2, navocado or skynet955, to pull them, then play with:

```
pom_battle --agents=MyAgent,docker::multiagentlearning/navocado,player::arrows,docker::multiagentlearning/eisenach --config=PommeTeamCompetition-v0
```