Value Iteration

Theory

Value Iteration (VI) is a model-based approach. It is a method of generating a policy over an MDP (Markov Decision Process). The key step is to find the optimal value function V* by solving the Bellman equations. The Bellman equations can be rewritten in terms of Q-values, which is the form used in Q-learning, so VI and Q-learning have a certain amount in common.
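
For reference, the standard value-iteration backup and the equivalent Q-value form that connects it to Q-learning are shown below (textbook MDP notation with transition model T, reward R, and discount factor γ; this is not notation taken from our code):

```math
V_{k+1}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_k(s') \right]
```

```math
Q(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') \right], \qquad V^*(s) = \max_{a} Q(s, a)
```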

Evolution and Experiments

Build map

The first step is to build a map in which each cell is assigned a reward according to its content. These rewards serve as the initial Q values of the cells.

Reward

  • Enemy Pacman - positive value
  • Enemy ghost - negative value
  • Food-lost penalty - negative value × number of food being carried
  • Food and capsules - positive value
  • Mid line (delivery line) - positive value × number of food being carried
  • Walls - no value (never updated)

Note that the reward values are set differently for different agents, so that the agents behave differently.
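
A minimal sketch of how such a reward map could be built is shown below. The constant values, the `build_reward_map` name, and its parameters are illustrative assumptions rather than our exact code; only the grid indexing follows the usual Berkeley Pacman `Grid` convention (`walls[x][y]`, `food[x][y]`).

```python
# Illustrative sketch only: the constants and names below are assumptions,
# not the exact values used in our agents.
ENEMY_PACMAN_REWARD = 50      # positive: chase invaders
ENEMY_GHOST_REWARD = -100     # negative: avoid defenders
FOOD_REWARD = 10
CAPSULE_REWARD = 20
FOOD_LOST_PENALTY = -5        # scaled by the food we are carrying
MID_LINE_REWARD = 2           # scaled by the food we are carrying

def build_reward_map(walls, food, capsules, enemy_pacmen, enemy_ghosts,
                     mid_line_cells, num_carrying):
    """Return {(x, y): reward} for every non-wall cell."""
    rewards = {}
    for x in range(walls.width):
        for y in range(walls.height):
            if walls[x][y]:
                continue  # walls get no value and are never updated
            r = 0.0
            if (x, y) in enemy_pacmen:
                r += ENEMY_PACMAN_REWARD
            if (x, y) in enemy_ghosts:
                r += ENEMY_GHOST_REWARD + FOOD_LOST_PENALTY * num_carrying
            if food[x][y]:
                r += FOOD_REWARD
            if (x, y) in capsules:
                r += CAPSULE_REWARD
            if (x, y) in mid_line_cells:
                r += MID_LINE_REWARD * num_carrying
            rewards[(x, y)] = r
    return rewards
```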

Iteration

During iteration, the Q value of every cell (except walls) keeps being updated until the time budget runs out, which is 1 second per move in this game.
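
A sketch of the iteration loop under a time budget is shown below. It uses a simplified deterministic Bellman backup (reward plus discounted best neighbour value); the 0.9 s margin, the function name, and the discount value are assumptions for illustration.

```python
import time

def value_iteration(rewards, gamma=0.9, time_budget=0.9):
    """Repeatedly back up each cell's value from its best neighbour until
    the time budget (kept below the 1 s move limit) runs out.
    Movement is treated as deterministic, so the Bellman backup reduces to
    reward + gamma * max over neighbouring values."""
    values = dict(rewards)          # initial values are the rewards themselves
    start = time.time()
    while time.time() - start < time_budget:
        new_values = {}
        for (x, y), r in rewards.items():
            neighbours = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
            best = max((values[n] for n in neighbours if n in values),
                       default=0.0)
            new_values[(x, y)] = r + gamma * best
        values = new_values
    return values
```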

Build Policies

The map of values then leads to a policy map: the policy (action) for a cell points towards the neighbouring cell with the maximum Q value.
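
A sketch of how the policy map could be read off the value map is shown below. The action names follow the Pacman `Directions` convention; the function name is an illustrative assumption.

```python
def extract_policy(values):
    """For each cell, point towards the neighbouring cell with the highest value."""
    moves = {'East': (1, 0), 'West': (-1, 0), 'North': (0, 1), 'South': (0, -1)}
    policy = {}
    for (x, y) in values:
        best_action, best_value = 'Stop', float('-inf')
        for action, (dx, dy) in moves.items():
            neighbour = (x + dx, y + dy)
            if neighbour in values and values[neighbour] > best_value:
                best_action, best_value = action, values[neighbour]
        policy[(x, y)] = best_action
    return policy
```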

Implementation

During implementation we found several issues.

  1. Pitfall: Our Pacmen do not go back until they have eaten all of the opponent's food.

    Improvement: We assign rewards to the middle line of the map, so our Pacmen return to our own half, which ensures the food they carry is converted into score. By adjusting the reward value of the middle line, we can control how much food a Pacman eats before it goes back.

  2. Pitfall: At this stage, our Pacmen would go back, but they sometimes kept trying to eat more food even when the food already secured (score + food being carried) was enough. The rule is that once a team's score reaches (total number of food) - 2, the game is over. The Pacmen do not need to eat the last two food pellets and should go back as soon as possible.

    Improvement: By changing the reward of food to 0 once no more than 2 food pellets are left, the Pacmen go back as soon as they have gathered enough food.

  3. Pitfall: The defender does not correctly chase invaders, because it cannot perceive the invaders' exact positions.

    Improvement: We designed an enemy-prediction function. The game has "fog of war", but we can still infer an enemy's position: if one of the food pellets we are defending disappears, an enemy must have been on that cell in the last time step. With this set of inferred positions, our defender has a better chance of finding the enemy Pacman and defeating it. A sketch of this idea is shown after this list.
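
The sketch below illustrates the idea behind the enemy-prediction function. The function name, signature, and data structures are illustrative assumptions, not the exact code.

```python
def infer_invader_positions(prev_defending_food, curr_defending_food,
                            observed_enemies):
    """Combine directly observed enemy positions with positions inferred
    from disappearing food: if a pellet on our side vanished, an invader
    must have been on that cell in the previous time step."""
    eaten = [pos for pos in prev_defending_food
             if pos not in curr_defending_food]
    candidates = set(observed_enemies)
    candidates.update(eaten)
    return candidates
```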

Performance

After comparing the three approaches, we chose VI as our final approach because it had the best performance. The performance of our submitted VI version is shown below.

https://gitlab.eng.unimelb.edu.au/zichunz/comp90054-pacman/tree/submission-contest

Demo:

Red team: VI Team
Blue team: Staff_super_team

[Demo GIF: VI]

Result: VI Team Against the Baseline Team in 10 Random Games

[Result image: VI]

Result: VI Team Against the Staff Teams in 5 Random Games

[Result image: VI_vs_Staff]