Py Pommerman Past Competition Entries
Pommerman: A Multi-Agent Playground
https://arxiv.org/pdf/1809.07124.pdf
Pommerman - 11x11 grid, 4 agents
- 6 actions - STOP, UP, DOWN, LEFT, RIGHT, LAY BOMB (see the enum sketch after this list)
- 3 tile types - passage, rigid wall, wooden wall; a destroyed wooden wall has a 50% chance of becoming a passage and a 50% chance of revealing a hidden power-up
- Power-ups - extra ammo, extra range, can kick (bombs)
- Tie - the match is rerun in case of a tie; if still tied, walls collapse until there is a single winner
- Communication - every turn the agent emits a message of 2 words from a dictionary of size 8
- Agent sees a 7x7 area centered on its position (possibly only in the NIPS competition)
- Teammates spawn diagonally from each other
- Maps are procedurally generated; there is a guaranteed path between every pair of agents
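For reference, the six actions as a small Python enum; the 0-5 ordering is an assumption matching the response format described in the NIPS 2018 section below:

```python
# Illustrative action enum; the 0..5 ordering is an assumption about the
# playground's encoding, not taken from official documentation.
from enum import IntEnum

class Action(IntEnum):
    STOP = 0
    UP = 1
    DOWN = 2
    LEFT = 3
    RIGHT = 4
    BOMB = 5
```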
Input:
- Board - 11x11 ints, flattened
- Fog has value 5 in the partially observable setting
- The agent gets a purview of 5x5
- Position - 2 ints, x and y in [0, 10]
- Ammo - 1 int
- Blast strength - 1 int
- Can kick - 1 int, [0, 1]
- Teammate - 1 int, [-1, 3]; a non-teammate has value -1
- Enemies - 3 ints, [-1, 3]; in the team variant, the teammate has value -1
- Bomb blast strength - list of ints, one for each bomb in the agent's purview
- Bomb life - list of ints
- Message - 2 ints, [0, 8]; both ints are 0 when the teammate is dead
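Put together, a single observation could look like the sketch below; the key names follow the Python playground's conventions, but treat the exact names and types as assumptions:

```python
# Hedged sketch of one observation, mirroring the fields listed above.
observation = {
    "board": [[0] * 11 for _ in range(11)],  # 11x11 ints; fog cells are 5
    "position": (1, 1),                      # x, y in [0, 10]
    "ammo": 1,
    "blast_strength": 2,
    "can_kick": 0,                           # 0 or 1
    "teammate": -1,                          # -1 when there is no teammate
    "enemies": [1, 2, 3],                    # 3 ints; teammate slot is -1 in the team variant
    "bomb_blast_strength": [2],              # one entry per bomb in the purview
    "bomb_life": [10],
    "message": (0, 0),                       # (0, 0) when the teammate is dead
}
```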
FFA - single-agent submission, no teams
NIPS 2018 - required the submission of 2 agents, packaged with Docker (2 agents, 2 Docker files)
- The act function takes in a dictionary of observations
- HTTP is used for communication
- The agent's response is a single int in [0, 5] representing the action to take; the team variant adds 2 extra ints in [0, 8] representing the message. On timeout (100 ms) a stop action with message (0, 0) is issued. The stop-on-timeout rule is specific to the competition, not the framework itself.
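As an illustration, a minimal HTTP agent in the spirit of this protocol is sketched below; the route, port and JSON schema are assumptions, not the official competition API:

```python
# Minimal sketch of an HTTP agent: receives an observation dict as JSON,
# replies with an action int in [0, 5] plus a message for the team variant.
# Port and reply schema are assumptions for illustration only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        obs = json.loads(self.rfile.read(length))  # observation dictionary
        action = 0                                 # STOP; a real agent computes this from obs
        body = json.dumps({"action": action, "message": [0, 0]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 10080), AgentHandler).serve_forever()
```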
Mechanics
- The agent starts with 1 ammo; laying a bomb costs 1 ammo (-1) and the ammo is returned when the bomb explodes (+1)
- Blast strength is initially 2
- Bomb life - 10 steps
- Wooden wall - 50% chance of hiding a power-up
- Power-ups
  - Extra bomb
  - Increase range
  - Can kick
    - permanently allows kicking bombs by moving into them
    - a kicked bomb travels 1 unit per time step in its direction until it hits a player, a wall or another bomb
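As a toy illustration of the kick rule, one step of a kicked bomb might be resolved like this; the board encoding and names are hypothetical:

```python
# Toy sketch of the kick mechanic: a kicked bomb advances one cell per time
# step in its direction until a player, wall or another bomb blocks it.
def step_kicked_bomb(board, pos, direction):
    """Return the bomb's next cell, or pos unchanged if it is blocked."""
    dr, dc = direction                                # e.g. (0, 1) for RIGHT
    r, c = pos[0] + dr, pos[1] + dc
    if not (0 <= r < len(board) and 0 <= c < len(board[0])):
        return pos                                    # board edge stops it
    if board[r][c] in ("wall", "bomb", "agent"):      # hypothetical encoding
        return pos
    return (r, c)
```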
5. Skynet
- 2nd place in the learning track, 5th in the global ranking at NIPS 2018 https://www.borealisai.com/en/blog/pommerman-team-competition-or-how-we-learned-stop-worrying-and-love-battle/
- Flames have a lifetime of 2 game steps
- Noisy reward - when an opponent commits suicide, the agent receives +1 as reward
- The lack of a fast forward simulator makes vanilla search algorithms ineffective
- SimpleAgent uses Dijkstra's algorithm to get away from bomb explosions
- The trained agent learned to exploit a bug in SimpleAgent by forcing it to commit suicide
- Input 14x11x11 - includes the previous observation for memory
- The competition's top-3 agents used a forward model for search
- They filtered the action space - to avoid suicide, to not place a bomb when a teammate is nearby, and to escape when the agent's position is covered by a bomb's blast (see the sketch below)
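A hedged sketch of what such an action filter could look like; the unsafe-cell set and move table are assumptions, not Skynet's actual code:

```python
# Sketch of action filtering: drop moves that land on a cell covered by an
# imminent blast. `unsafe_cells` would be computed from bomb_life and
# bomb_blast_strength; here it is assumed to be given.
MOVES = {0: (0, 0), 1: (-1, 0), 2: (1, 0), 3: (0, -1), 4: (0, 1)}

def filter_actions(legal_actions, position, unsafe_cells):
    safe = []
    for a in legal_actions:
        dr, dc = MOVES.get(a, (0, 0))
        nxt = (position[0] + dr, position[1] + dc)
        if nxt not in unsafe_cells:
            safe.append(a)
    return safe or legal_actions  # never hand back an empty action set
```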
4. Continual Match Based Training in Pommerman: Technical Report
https://arxiv.org/pdf/1812.07297.pdf
- Navocado - 4th place overall, winner of the learning track
- COMBAT - randomly pick 4 agents, play them, remove converged weak ones and create new ones
- Population-based A2C agents; round-robin matches with an Elo score for ranking (see the rating sketch after this list)
- State space - an 11x11x11 encoding of all information on the map
- Local optimum - the agent learns not to lay bombs at all, which motivated reshaping the action space as below
- Modified action space - 122 dimensions: 121 for the flattened board (target position) plus one for laying a bomb
- Dijkstra is used to plan the path to the predicted destination position
- Network: 3x3 conv layers with 16, 32 and 64 filters, flattened; hyperbolic tangent activation; A2C for the policy
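The Elo update used for such round-robin rankings is a standard formula; a minimal sketch (the K-factor of 32 is an arbitrary choice):

```python
# Standard Elo rating update for one match between players A and B.
def elo_update(r_a, r_b, score_a, k=32):
    """score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```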
Won the 2018 NeurIPS learning competition. Trained using round-robin competitions against other agents, ranking them continuously. 220 CPUs and 32 GPUs were used. SimpleAgent was used as the teammate and was slowly replaced by another trainable agent. Trained for dozens of days and kept improving.
1. and 3. Hakozaki, dypm - IBM Tokyo - winners of the NIPS competition
https://www.ibm.com/blogs/research/2019/03/real-time-sequential-decision-making/
https://arxiv.org/pdf/1902.10870.pdf
- Uses a pessimistic tree search approach with limited depth (see the search sketch after this list). Pessimistic scenarios can be illegal or unrealistic, for example copying the opponent into multiple positions at once, while in the actual game only one position is possible.
- Submissions - hakozaki and dypm-final (same idea, small differences)
- The authors re-ran the competition with the top 5 agents playing 200 matches against each other; search-based agents completely dominated the other entries. Note that this was done in a different setup than the competition, so the learning agents may not have benefited from the same hardware as in the competition. Hakozaki was written in Java, eisenach in C++ and dypm in Python; hakozaki and eisenach used multi-threading.
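A hedged sketch of the general shape of a depth-limited pessimistic search (maximin over opponent replies); `step` and `evaluate` are placeholder functions, not the dypm implementation:

```python
# Depth-limited pessimistic search: pick our action assuming the worst-case
# opponent reply at every level. `step` advances a forward-model state and
# `evaluate` scores a leaf; both are placeholders for illustration.
def pessimistic_search(state, depth, my_actions, opp_actions, step, evaluate):
    if depth == 0:
        return evaluate(state), None
    best_value, best_action = float("-inf"), None
    for a in my_actions:
        # Pessimism: assume the opponent reply that is worst for us.
        worst = min(
            pessimistic_search(step(state, a, o), depth - 1,
                               my_actions, opp_actions, step, evaluate)[0]
            for o in opp_actions
        )
        if worst > best_value:
            best_value, best_action = worst, a
    return best_value, best_action
```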
2nd place, Eisenach (Gorog Marton) - implemented in C++, with many engineering tricks to achieve an average depth of 2 in the tree search, searching as deep as time allows.
Backplay: "Man muss immer umkehren"
https://arxiv.org/pdf/1807.06919.pdf
- Pommerman maps are random, but there is a guaranteed path between any two agents
- The observation space is 19 11x11 maps
- Uses the FFA setting
- Max steps - 800; reward - +1 for a win, -1 otherwise
- RL - 4 conv layers with 256 output channels, PPO for optimization, batch size 102400, 60 parallel workers
- Trained for about 50M frames over 72 hours
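Backplay itself is a curriculum idea: training episodes start from states near the end of a demonstration, and the starting point is moved earlier as training progresses. A minimal sketch of that start-state sampling, assuming a recorded list of demonstration states and a simple linear schedule (both assumptions):

```python
# Sketch of Backplay's start-state curriculum: sample episode starts from a
# demonstration, beginning near its end and moving backwards over training.
import random

def backplay_start_state(demo_states, progress):
    """demo_states: states of one demonstration; progress in [0, 1]."""
    # progress = 0 -> only the final state; progress = 1 -> any state.
    earliest = int((1.0 - progress) * (len(demo_states) - 1))
    return demo_states[random.randint(earliest, len(demo_states) - 1)]
```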
Play against top players
Run `docker pull multiagentlearning/<agent>`, where `<agent>` is one of hakozakijunctions, eisenach, dypm.1, dypm.2, navocado or skynet955, to pull them, then play with:

```
pom_battle --agents=MyAgent,docker::multiagentlearning/navocado,player::arrows,docker::multiagentlearning/eisenach --config=PommeTeamCompetition-v0
```