PRIMAL: Pathfinding via Reinforcement and Imitation Multi-Agent Learning

By Guillaume Sartoretti, Justin Kerr, Yunfei Shi, Glenn Wagner, T. K. Satish Kumar, Sven Koenig, and Howie Choset

This paper deals with multi-agent pathfinding (MAPF) using reinforcement and imitation learning. Unlike other state-of-the-art MAPF planners, which rely on centralized planning and scale poorly with the number of agents, PRIMAL combines reinforcement and imitation learning to learn fully decentralized policies, where agents plan paths online in a partially observable world while exhibiting implicit coordination.

The paper focuses on training a decentralized policy, where each agent learns its own policy that must still encode the degree of cooperation MAPF requires. To better match real-world deployment, agents are trained and tested in a partially observable discrete grid world, where partial observability is implemented as a limited field of view (FOV) centered on each agent (10 x 10 in practice). Fully observable agents can also be trained by making the FOV sufficiently large. During training, actions are sampled only from the set of valid actions in the current state; the authors' experiments show that this is more stable than giving agents negative rewards for selecting invalid moves (a minimal sketch of such valid-action sampling is given below). The reward structure penalizes staying still slightly more than moving, which is necessary to encourage exploration.
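To illustrate the valid-action sampling idea, here is a minimal Python sketch. It assumes a grid world with five discrete actions (stay plus the four cardinal moves) and a policy that outputs a probability per action; the function names and grid encoding are illustrative and not taken from the PRIMAL codebase.

```python
import numpy as np

# Illustrative 5-action discrete space: stay, up, down, left, right.
ACTIONS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

def valid_action_mask(grid, pos):
    """Boolean mask over ACTIONS that keeps only moves into free, in-bounds cells."""
    mask = np.zeros(len(ACTIONS), dtype=bool)
    for i, (dr, dc) in enumerate(ACTIONS):
        r, c = pos[0] + dr, pos[1] + dc
        in_bounds = 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]
        mask[i] = in_bounds and grid[r, c] == 0  # 0 marks a free cell
    return mask

def sample_valid_action(policy_probs, grid, pos, rng=None):
    """Renormalize the policy over valid actions and sample from it,
    rather than penalizing the agent after it picks an invalid move."""
    rng = rng or np.random.default_rng()
    probs = np.asarray(policy_probs) * valid_action_mask(grid, pos)
    probs = probs / probs.sum()
    return rng.choice(len(ACTIONS), p=probs)

# Example: a 4x4 world with one obstacle and a uniform policy over the 5 actions.
grid = np.zeros((4, 4), dtype=int)
grid[1, 1] = 1  # obstacle
action = sample_valid_action(np.full(5, 0.2), grid, pos=(0, 0))
```

Restricting the sampling distribution itself keeps invalid moves out of the training data entirely, which is what the paper reports as the more stable choice.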

To teach collaborative behavior, the following techniques are employed:

- Blocking penalty: a sharp penalty of -2 is added if an agent chooses to stay on its goal while preventing another agent from reaching its own goal (see the reward sketch after this list).
- Combining imitation learning with RL, which helps agents quickly identify high-quality regions of the action space.
- Environment sampling: randomizing both the size and the obstacle density of the world at the start of each episode exposes agents to enough situations where coordination is needed.

Together, these techniques help agents learn to coordinate and avoid collisions.
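The sketch below makes the reward shaping concrete. Only the -2 blocking penalty comes from the text; the other magnitudes are placeholder assumptions, and the predicates (`moved`, `on_goal`, `blocking_other`) are assumed to be computed by the environment.

```python
def step_reward(moved: bool, on_goal: bool, blocking_other: bool) -> float:
    """Hypothetical per-step reward; only the -2 blocking penalty is from the summary above."""
    if on_goal and not moved and blocking_other:
        return -2.0   # sharp penalty: resting on goal while blocking another agent
    if on_goal and not moved:
        return 0.0    # resting on one's own goal is not penalized (assumption)
    if moved:
        return -0.3   # small per-step cost for moving (assumed magnitude)
    return -0.5       # slightly larger cost for idling off-goal, encouraging exploration (assumed)
```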

Decentralized policies often lead to each agent acting selfishly, which can cause deadlocks. The main contribution of this paper is the ability to train agents to learn non-selfish behavior despite being fully decentralized. The results are impressive in spaces of low obstacle density: PRIMAL handles teams of up to 1024 agents with a nearly perfect success rate.

PRIMAL sometimes generates paths twice as long as those produced by other planners. The model also does not generalize to larger worlds (e.g., larger than 70 x 70), since agents were never exposed to such sizes during training.

Return to main page.