Project Reflection
Monte Carlo Tree Search (MCTS)
Table of Contents
- Justification for Technique Selected
- Problem Modeling and Implementation Design
- Advantages and Disadvantages
- Solved Challenges
- Future Improvements
Justification for Technique Selected
The Monte Carlo Tree Search (MCTS) algorithm is the main technique used by my agent. Its balance of exploration and exploitation is well suited to strategic games with a large search space. With the upper confidence bound for trees (UCT), it concentrates on promising actions with better win rates, which keeps memory consumption and computation time under control. In Azul there are multiple ways to score points, but all of them require strategic planning. For example, earning bonus points at the end of the game increases the chance of winning, and placing adjacent tiles on the wall is critical for earning extra points in each round. MCTS can rank actions by UCT score and select the best-performing action as the next step. At the same time, the game only allows 0.9 seconds of thinking time per move; prioritising actions during simulation prevents the agent from brute-forcing every possibility and makes decision making more efficient. Lastly, the algorithm adapts quickly to the opponent's moves, adjusting its plan so the agent can react appropriately to unexpected play from the competitor.
Problem Modeling and Implementation Design
Problem modeling
The MCTS problem can be modelled as a Markov Decision Process (MDP) of the form $M = \langle S, s_0, A, P_a(s' \mid s), r(s, a, s') \rangle$, so the agent selects actions based on transition probabilities and the maximum value of the evaluation function.
Symbol | Representation
---|---
$S$ | { All possible AzulStates }
$s_0$ | Initial AzulState
$A$ | { Legal actions of each agent in state $s \in S$ }
$P_a(s'\mid s)$ | Transition probability of reaching state $s'$ when taking action $a$ in state $s$
$r$ | Reward function: agent score $-$ opponent score
A state in Azul comprises the tiles in the factories and the centre, together with each player's wall, pattern lines, and floor line. The initial state has 4 tiles in each of the 5 factories, an empty centre, and empty walls, pattern lines, and floor lines for both players. A goal state is reached when there are no more tiles left to select and the final score of each agent has been computed. The legal actions are, on each turn, taking tiles of one colour from a single factory or from the centre and placing them on a pattern line to work towards the wall, or on the floor line at the cost of a points deduction. The reward function is our agent's score minus the opponent's score: a positive value means the selected actions are helping the agent win, while actions leading to a negative value are avoided.
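A minimal sketch of this reward signal, assuming the scores are available as a simple list indexed by agent id; the real framework exposes them through the game state, so the interface below is only illustrative:

```python
def reward(scores, agent_id):
    """Reward described above: our agent's score minus the opponent's."""
    opponent_id = 1 - agent_id  # two-player game
    return scores[agent_id] - scores[opponent_id]

# Example: 37 points for us and 29 for the opponent gives a reward of +8,
# so the actions that led here are treated as favourable.
print(reward([37, 29], agent_id=0))  # 8
```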
Implementation Design
There are four steps in each MCTS iteration: (1) selection, (2) expansion, (3) simulation, and (4) backpropagation.
1. Selection
In this phase, the algorithm walks down the tree and selects the action with the highest evaluation value. The evaluation function is the upper confidence bound for trees (UCT), which combines the average outcome from MCTS with the UCB1 exploration term. We set the C parameter to 0.1 so the agent favours stable, exploitative play and secures more bonus points at the end of the game. UCT balances the trade-off between exploration and exploitation, which matters for maintaining both the short-term goal of maximising immediate points in each round and the long-term goal of collecting as many bonus points as possible. When the MCTS agent moves first in the game, it selects randomly among all legal actions, but only from those that fill a pattern line and score at least one point immediately. After the first turn, it selects the action with the highest evaluation value. Because the choice depends on the opponent's moves and on the tiles still available in the centre or factories, the selection is dynamic and can fall back to an alternative when the originally preferred move is no longer available.
$$ \text{UCT} = \arg\max_{i} \left( \frac{w_i}{n_i} + C \sqrt{\frac{2 \ln N}{n_i}} \right) $$

where $w_i$ is the difference between wins and losses backed up into child $i$, $n_i$ is the number of times child $i$ has been visited, $N$ is the number of times the parent (agent) node has been visited, and $C = 0.1$.
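A minimal sketch of this selection rule; the `Node` class below is a simplified stand-in for the project's tree node, not its actual implementation:

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    wins: float   # backed-up (wins - losses) total for this child
    visits: int   # number of times this child has been visited

def uct(child: Node, parent_visits: int, c: float = 0.1) -> float:
    """Exploitation term plus the C-weighted exploration term from the formula above."""
    return child.wins / child.visits + c * math.sqrt(
        2 * math.log(parent_visits) / child.visits)

def select_child(children: list[Node], parent_visits: int, c: float = 0.1) -> Node:
    """The argmax over children in the UCT equation."""
    return max(children, key=lambda ch: uct(ch, parent_visits, c))
```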
2. Expansion
During expansion, a new child node is appended along the path chosen in the selection phase, and that child node is returned. In our implementation, `self._untried_actions` holds all the legal actions, which we first order with the `sort_actions` function. The actions are sorted in the following sequence: actions that score at least one point, actions that can complete a row, actions that can complete a column, and actions that can complete a set. The reasoning is the bonus earned by placing adjacent tiles on the wall: although completing a set gives the largest bonus (10 points), tiles of the same colour are never adjacent on the wall, so focusing on sets does not maximise the immediate score in each round. Since the column bonus is higher than the row bonus and completing a column also places adjacent tiles, we prioritise those actions ahead of sets.
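A hypothetical sketch of this ordering, assuming each candidate action has already been annotated with boolean flags; the real `sort_actions` derives these checks from the game state rather than from such flags:

```python
# Lower priority value = expanded earlier.
def action_priority(action: dict) -> int:
    if action.get("gains_point"):
        return 0
    if action.get("completes_row"):
        return 1
    if action.get("completes_column"):
        return 2
    if action.get("completes_set"):
        return 3
    return 4  # anything else, e.g. moves that only feed the floor line

def sort_actions(actions: list[dict]) -> list[dict]:
    return sorted(actions, key=action_priority)

# Example: a point-scoring move is tried before a set-completing move.
moves = [{"completes_set": True}, {"gains_point": True}]
print([action_priority(m) for m in sort_actions(moves)])  # [0, 3]
```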
3. Simulation
In the simulation stage, the game is played out from the expanded node until a terminal state is reached. For our agent, we decided to run 300 simulations, as this gave the best performance without causing timeouts. While the simulated state is not terminal, actions are chosen with the `rollout_policy`, which returns the action with the highest value in the `GetOnePoint` category or, failing that, the action that sends the fewest tiles to the floor line. At the end of each round, the score difference between our agent and the competitor is used as the reward for backpropagation, so the tree learns which strategy to prefer and can return a better plan in the next round.
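A minimal sketch of such a rollout policy, assuming each legal action is represented by a small dict holding its immediate points and floor-line tiles; these keys are illustrative stand-ins for values the real agent reads from the game state, not the project's actual API:

```python
def rollout_policy(actions):
    """Prefer the action that scores the most immediate points; if nothing
    scores, take the one that puts the fewest tiles on the floor line."""
    scoring = [a for a in actions if a["points"] > 0]
    if scoring:
        return max(scoring, key=lambda a: a["points"])
    return min(actions, key=lambda a: a["floor_tiles"])

# Example: the 2-point move is chosen over a move that only fills the floor.
legal = [{"points": 0, "floor_tiles": 3}, {"points": 2, "floor_tiles": 0}]
print(rollout_policy(legal))  # {'points': 2, 'floor_tiles': 0}
```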
4. Backpropagation
After the value of the newly appended node has been determined, the rest of the tree must be updated. Backpropagation propagates the information from the new node back up to the root: for each simulation it adds the reward to the accumulated outcome and increments the visit count of every node along the path. The information carried back to the root shapes the strategy the agent will follow in the next round. In the `best_action` function, we use `tree_policy` to check whether the tree has already been expanded to a child node, then the reward is computed and passed to the `backpropagation` function, which updates the statistics used to return the best child node and its action.
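A minimal sketch of this update, using a simplified node type with a parent pointer; the project's own tree node class differs, so this is only illustrative:

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0
        self.value = 0.0  # accumulated reward (score differences)

    def backpropagate(self, reward):
        """Add the simulation result to this node and every ancestor up to the root."""
        node = self
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent

# Example: a result of +5 from a leaf two levels deep is reflected at the root.
root = Node()
leaf = Node(parent=Node(parent=root))
leaf.backpropagate(5.0)
print(root.visits, root.value)  # 1 5.0
```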
Advantages and Disadvantages
Advantages
1. High adaptability:
MCTS is well suited to combinatorial games: games played by two players who take turns, with a well-defined set of moves. Azul satisfies all of these preconditions, so applying MCTS is a reasonable choice. Because of its adaptability, the tree explores the better-performing nodes more frequently and exploits the value they bring to the player. This avoids going too deep or too broad into irrelevant parts of the tree; instead, the agent focuses on high-value nodes and branches towards the optimal plan. This matters in Azul because the agent only has a limited number of moves in each round, and a single wrong step can cost the game, so MCTS reduces the risk of heading down the wrong path.
2. Ability to handle uncertainty:
In Azul, uncertainty comes from the dynamically changing game state and from incomplete information. Since MCTS relies on statistical sampling and empirical data, it can cope with the unknown by expanding nodes, simulating actions, and backpropagating the results to the root, so it learns whether to switch paths or stay on the current one. It therefore does not need a complete model of every rule and condition to produce a good solution. To handle uncertainty, our agent balances exploration and exploitation with the UCT function and can shift between the two at different stages of the game. As a result, even when the opponent plays unexpectedly, our agent can find an alternative action.
Disadvantages
1. High memory consumption:
MCTS depends on many simulations to explore for good solutions. In Azul, the rules are complex and the game state changes continuously, so the algorithm consumes a lot of memory while running: it must compute and store the values of all expanded nodes in order to compare them and return the best child's action. The more simulations are run, the more space is needed as data accumulates with every iteration. If the server does not have sufficient memory, the algorithm runs inefficiently and performance suffers. When we added multiple restrictions in the selection stage, it did not perform as expected and ended with poor results. For example, the function we added to avoid placing tiles on the floor line made the selection too expensive, because it had to check one by one whether each action would deduct points; that brute-force check used up memory and eventually resulted in 0 wins.
2. Slow:
As with the memory problem, more simulations generally mean better performance from the MCTS agent. However, because of the 0.9-second thinking limit, our design can run at most 300 simulations per move, which achieved 23 wins as its highest server score. Going above 300 simulations not only causes timeouts but also performs worse than the 300-simulation version. From watching it compete against our team's heuristic and minimax agents, I also noticed a short lag whenever the MCTS agent is choosing its action on its turn.
3. Reliance on empirical data:
Azul is a deterministic game in which the agent reaches its goal through a sequence of actions. MCTS suits this type of game because it encourages exploration and uses simulation and backpropagation to evaluate actions, passing the recorded information back to the root node for decision making. However, there is no data at the start of the first round, so the agent can only select actions randomly. If MCTS competes against a stronger algorithm such as minimax, it may be caught off guard and fall behind by many points because it cannot build a meaningful search tree at the beginning. From my observations, the MCTS agent scores at most 3 points in the first round regardless of the opponent, which shows that it often suffers from a cold start.
Solved Challenges
Sort actions:
In the beginning, the MCTS agent was not working well and won only 1 game out of 40 when I applied the basic algorithm. My first attempt at sorting actions involved too many conditions, so the agent could not run efficiently because most nodes were blocked from expansion; it timed out repeatedly and still won only around 1 game against the server agents. I then simplified the logic. After applying the `sort_actions` function, which prioritises actions guaranteed to score at least one point among the untried actions before selection, expansion, and simulation, the agent became much more efficient because it eliminates useless actions that might end up deducting points. I then optimised it further by prioritising actions that complete columns, rows, and sets, to accumulate more adjacency points and end-of-game bonuses. The agent's best result improved to 23 wins out of 40 games.
Add weights to certain actions:
As mentioned above, I sorted the actions that complete columns, rows, and sets. If none of the beneficial actions is available in the current situation, the agent selects the action with the smallest points deduction. To detect when the agent is close to filling part of the wall, I added a weight of 3 points to actions that are one step away from completing a column, row, or set, and a weight of 1 point to actions that are two steps away, so the agent makes better decisions and performs better (a sketch of this weighting is given below).
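A hypothetical sketch of the weighting, where `steps_remaining` stands in for the number of further placements a column, row, or set still needs after the action; computing that count from the real wall state is done inside the actual agent:

```python
def completion_weight(steps_remaining: int) -> int:
    if steps_remaining == 1:
        return 3   # one placement away from a column/row/set bonus
    if steps_remaining == 2:
        return 1   # two placements away
    return 0

def weighted_value(immediate_points: int, steps_remaining: int) -> int:
    """Immediate round points plus the extra weight for near-complete lines."""
    return immediate_points + completion_weight(steps_remaining)

print(weighted_value(2, 1))  # 5: a 2-point move that nearly completes a column
```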
Experiments on the value of C parameter and number of simulations:
MCTS relies on simulations to search for good solutions. We tested several simulation counts (300, 1000, 500, 350, 200) to find which worked best, running matches against the minimax and heuristic agents we implemented as well as the agents on the server. 1000 simulations ends in timeouts, and every count other than 300 loses more games. We therefore concluded that 300 simulations is the most suitable budget for our agent.
Future Improvements
Currently, the MCTS agent focuses on maximising its own score to beat the competitor. In the future, the agent could be improved by observing the competitor's wall and estimating its next move in order to block it. The improved agent would keep maximising its own score in each round while also minimising the competitor's points by setting traps that force deductions. This could be done by defining heuristic functions that simulate the competitor's actions and take them into account. The blocking heuristic should identify actions that deny the competitor access to the tiles it needs to complete columns or rows, forcing it to take useless tiles that occupy space in its pattern lines or, even worse, land on its floor line and trigger penalties. A rough sketch of such a heuristic is given below.
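A rough, hypothetical sketch of the proposed blocking idea (not part of the current agent); `opponent_needs` stands in for a count of how many tiles of each colour the opponent still needs to finish a pattern line, which would have to be derived from the real opponent state:

```python
def blocking_bonus(tile_colour: str, tiles_taken: int, opponent_needs: dict) -> int:
    """Extra reward for denying tiles the opponent still needs for a pattern line."""
    needed = opponent_needs.get(tile_colour, 0)
    denied = min(tiles_taken, needed)
    # Weight a denial more heavily when the opponent was only one tile short.
    return denied * (2 if needed == 1 else 1)

# Example: grabbing 2 blue tiles when the opponent needs just 1 more blue.
print(blocking_bonus("blue", tiles_taken=2, opponent_needs={"blue": 1}))  # 2
```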
Another improvement would be to extend the `sort_actions` function. Currently we only sort the actions that complete columns, rows, or sets and the actions guaranteed to score at least one point in the round. We could sort further by preferring to take larger counts of the same tile from one factory or the centre to fill the longer pattern lines, since these are harder to complete and take more turns. For example, for the pattern line that needs 5 tiles, we could design a heuristic function that simulates how many turns are required based on the initial state of each round, then decides whether to start filling it immediately or to wait for tiles to accumulate and take them all at once, preserving more turns in the round for other moves.