AI Method 3 - leungjunbob/AIP-project GitHub Wiki

MCTS Agent

This agent was developed by team member: Hange Cui - [email protected] - 1104225

MCTS (Monte Carlo Tree Search) is a powerful, general search method, implemented in agents/t_085/mcts.py. It suits complex games whose decision process involves many choices and random factors: it uses simulation to explore the actions most likely to lead to victory and copes well with incomplete information and deep strategy.

The MCTS agent did not fare well against the teaching team's agents: its best record was only 4 wins out of 40 games (a 10.0% win rate). The specific reasons are explored in the sections below.

Justification for Technique

In a game like Splendor, MCTS can adapt to the dynamically changing state of the game while handling its diverse action choices and their complex consequences. Through repeated simulations starting from the current game state, it evaluates many possible lines of play and makes decisions based on the latest information. In addition, it simulates different opponent strategies and estimates how these unknown factors may affect the game's outcome. In this way, MCTS naturally balances the trade-off between exploring new strategies and exploiting known ones, allowing the AI to weigh long-term against short-term goals and improving the strategic depth and flexibility of the game AI.

Problem Modeling and Implementation Design

In this project, game state modeling and the implementation of Monte Carlo Tree Search (MCTS) are the core technical decisions. The game state is modeled as the combination of each player's resources (gems and cards), points, reserved cards, and acquired nobles; this lets the AI accurately track every player's progress and resource status, which is the basis for effective decision-making. On top of this model, the weights of the different gem colours are computed dynamically by the evaluate_gemcolor_weight function to reflect changing card demand during the game and to sharpen the AI's resource-allocation strategy.
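
A minimal sketch of how such a weighting could be computed is shown below. It assumes the board exposes its face-up cards via a dealt_list() accessor and that each card stores its cost as a colour-to-amount dictionary; both names are assumptions about the Splendor framework, and the real evaluate_gemcolor_weight in mcts.py may differ.

from collections import Counter

def evaluate_gemcolor_weight(game_state):
    # Count how often each gem colour appears in the costs of the face-up cards,
    # then normalise: colours in high demand receive a larger weight.
    demand = Counter()
    for card in game_state.board.dealt_list():       # assumed accessor for face-up cards
        for colour, amount in card.cost.items():     # assumed cost dictionary
            demand[colour] += amount
    total = sum(demand.values()) or 1
    return {colour: count / total for colour, count in demand.items()}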

The entire process starts with the game state, which is the starting point of the analysis and describes the current position. The AI first searches the possible actions, then, in the selection phase, picks a promising node; next the system expands the search tree by adding new action or state nodes for later exploration; then a simulation plays the game out from the current node until an end state (or depth limit) is reached; finally, in the backpropagation stage, the statistics in the search tree are updated with the simulation result. The action-selection logic used inside this loop is described below.

First of all, the Node class, as the basic unit of MCTS, stores the current game state and its possible actions, and records which actions are still untried, providing the basis for tree expansion. This is accomplished by calling getStrategicActions(state) on initialization, which dynamically generates a list of candidate actions from the current game state. Nodes also maintain references to their child nodes and parent node, along with visit counts and accumulated scores, which are the key quantities for the four core MCTS steps (selection, expansion, simulation, and backpropagation).
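
A compact sketch of such a node is given below. The constructor signature and field names are assumptions based on the description above; getStrategicActions is sketched further down this page, and game_rule and agent_id are passed in so the node can ask for candidate actions.

class Node:
    # Basic MCTS tree node: holds the state it represents, tree links,
    # the actions not yet expanded, and the visit/score statistics.
    def __init__(self, state, game_rule, agent_id, parent=None, action=None):
        self.state = state                # game state this node represents
        self.game_rule = game_rule        # rules object used to generate successors
        self.agent_id = agent_id          # id of the agent running the search
        self.parent = parent              # parent node (None for the root)
        self.action = action              # action that led from the parent to this node
        self.children = []                # expanded child nodes
        self.untried_moves = getStrategicActions(state, game_rule, agent_id)
        self.visits = 0                   # how many times this node has been visited
        self.score = 0.0                  # accumulated simulation reward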

In the selection phase, a variant of the UCB1 formula balances exploration and exploitation: the score/visit ratio of each child node plus an exploration bonus term is computed, and the child with the highest value is selected, with the exploration constant set to math.sqrt(2) * 0.3. This ensures the AI is not overly focused on already explored paths while still probing new possibilities. After a node is selected, the expansion phase removes one action from the node's untried_moves and generates a new child node from it.
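
The selection and expansion steps could look roughly like the following sketch, reusing the Node class above. generateSuccessor is assumed to be the framework call that applies an action to a (copied) state; its name and signature are assumptions.

import math
from copy import deepcopy

EXPLORATION_C = math.sqrt(2) * 0.3   # exploration constant mentioned above

def select_child(node):
    # UCB1-style score: average reward plus an exploration bonus that shrinks
    # as a child is visited more often.
    def ucb(child):
        exploit = child.score / child.visits
        explore = EXPLORATION_C * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=ucb)

def expand(node):
    # Pop one untried action, apply it to a copy of the state, and attach the child.
    action = node.untried_moves.pop()
    next_state = node.game_rule.generateSuccessor(deepcopy(node.state), action, node.agent_id)
    child = Node(next_state, node.game_rule, node.agent_id, parent=node, action=action)
    node.children.append(child)
    return child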

The simulation phase is the core of MCTS. Starting from the new node, strategic decisions based on SelectAction_v2 are used to play the game until the end, or until the preset maximum simulation depth MAX_SIM_DEPTH is reached, to evaluate the potential effect of the action. The result of the simulation is fed back to all relevant nodes through backpropagation, updating node statistics to improve future decisions. During backpropagation, each node's visit count and score are adjusted according to the simulation score and the path depth, using a decay factor (depth_factor *= 0.99) to reflect the impact of the decision path's length on the score.
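
The rollout and backpropagation might be sketched as follows. MAX_SIM_DEPTH, the 15-point cut-off and the tiny evaluate_player_score stand-in are illustrative assumptions; the real heuristic in mcts.py weighs gems and cards more carefully.

from copy import deepcopy

MAX_SIM_DEPTH = 10   # illustrative depth cap; the real constant lives in mcts.py

def evaluate_player_score(state, agent_id):
    # Minimal stand-in for the agent's heuristic: points plus a small credit for gems held.
    agent = state.agents[agent_id]
    return agent.score + 0.1 * sum(agent.gems.values())

def simulate(node, rollout_policy):
    # Play the game forward from this node with a cheap policy (SelectAction_v2
    # in the real agent) until the depth cap or a winning score is reached.
    state = deepcopy(node.state)
    for _ in range(MAX_SIM_DEPTH):
        if state.agents[node.agent_id].score >= 15:   # Splendor's victory threshold
            break
        actions = node.game_rule.getLegalActions(state, node.agent_id)
        action = rollout_policy(actions, state, node.game_rule, node.agent_id)
        state = node.game_rule.generateSuccessor(state, action, node.agent_id)
    return evaluate_player_score(state, node.agent_id)

def backpropagate(node, score):
    # Push the simulation result back up the tree, decaying it with depth so that
    # longer decision paths contribute slightly less.
    depth_factor = 1.0
    while node is not None:
        node.visits += 1
        node.score += score * depth_factor
        depth_factor *= 0.99
        node = node.parent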

The getStrategicActions method is one of the core functions, responsible for filtering and sorting the possible actions from the current game state so that the AI can make the most strategic decisions. It first calculates the total number of gems the player currently holds, which helps determine the player's freedom of movement, and obtains all legal actions by calling the game_rule.getLegalActions function. When the gem count is close to the limit of ten, non-gem-collecting actions are given priority. During sorting, purchase actions are ordered by card points, gem-collecting actions are ordered by amount collected when the player holds few gems, and reserving cards is usually kept as a backup option. Finally, based on game progress and the player's needs, the action lists are combined appropriately to optimize strategy execution, for example prioritizing purchase actions when victory is close.
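
Under the same assumptions as the earlier sketches (dictionary-style actions with a "type" key, as in the snippet near the end of this page), the filtering and ordering could be sketched like this; the 12-point and two-gem margins and the "collected_gems" key are illustrative, not the agent's exact values.

GEM_LIMIT = 10   # Splendor's limit on gems held in hand

def getStrategicActions(state, game_rule, agent_id):
    # Fetch the legal actions, split them by type, order each group, and combine
    # the groups according to the current situation.
    actions = game_rule.getLegalActions(state, agent_id)
    buys = [a for a in actions if a["type"].startswith("buy")]
    collects = [a for a in actions if a["type"].startswith("collect")]
    reserves = [a for a in actions if a["type"] == "reserve"]

    buys.sort(key=lambda a: a["card"].points, reverse=True)                        # high-point cards first
    collects.sort(key=lambda a: sum(a["collected_gems"].values()), reverse=True)   # bigger hauls first

    me = state.agents[agent_id]
    total_gems = sum(me.gems.values())

    if me.score >= 12:                    # close to the 15-point goal: buying comes first
        return buys + collects + reserves
    if total_gems >= GEM_LIMIT - 2:       # near the gem limit: avoid over-collecting
        return buys + reserves + collects
    return collects + buys + reserves     # default: keep building up resources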

The core of the action selection strategy is reflected in four main functions: SelectAction_v1, SelectAction_v2, SelectAction and GreedySelectAction. These functions demonstrate different levels of strategy for different stages and situations of the game, aiming to provide the optimal choice of actions.

SelectAction_v1 is mainly used in quick decision-making scenarios. It first calculates the player's score in the current game state by calling evaluate_player_score, which takes into account elements such as the player's gems and cards. The function then iterates through each possible action; for each one, it makes a deep copy of the current game state and applies the action to produce a new state. evaluate_player_score is called again on each new state, and the score improvement is the difference between the new score and the current one. The action with the highest improvement is returned. This method is useful in the early stages of the game or when the AI needs to respond quickly to changes, especially sudden moves by the opponent.
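
A sketch of this one-step lookahead, reusing the evaluate_player_score stand-in and the assumed generateSuccessor call from the simulation sketch above (the real heuristic is richer), might look like:

from copy import deepcopy

def SelectAction_v1(actions, game_state, game_rule, agent_id):
    # Greedy one-step lookahead: score each action by the heuristic improvement
    # it produces on a copied state and return the best one.
    base = evaluate_player_score(game_state, agent_id)
    best_action, best_gain = None, float("-inf")
    for action in actions:
        next_state = game_rule.generateSuccessor(deepcopy(game_state), action, agent_id)
        gain = evaluate_player_score(next_state, agent_id) - base
        if gain > best_gain:
            best_action, best_gain = action, gain
    return best_action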

Different from SelectAction_v1, SelectAction_v2 introduces gem weights into the decision-making process, adding another level of consideration to action selection. This function not only considers the current score but also calculates the impact of the different gem colours on the current game state, which is achieved through the evaluate_gemcolor_weight function. When selecting an action, SelectAction_v2 applies a different evaluation strategy per action type: for gem-collecting actions it computes the score increase based on the gem weights; for card purchases it considers the card's points and the adjusted purchase cost; and for reserving a card it adds extra weight for the gold (wild) gem the reservation yields. With this approach, SelectAction_v2 assesses the potential value of each action more fully, making decisions more granular and adaptable.
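
The weighted variant could be sketched roughly as follows, reusing evaluate_gemcolor_weight from the earlier sketch. The per-type scoring terms (a 0.1 cost discount, a flat 0.5 reservation bonus standing in for the gold-gem weight) are illustrative assumptions, not the agent's actual numbers.

def SelectAction_v2(actions, game_state, game_rule, agent_id):
    # Score each action by type: gem collection uses the colour weights, purchases
    # trade card points against remaining cost, reservations get a flat gold bonus.
    weights = evaluate_gemcolor_weight(game_state)
    me = game_state.agents[agent_id]
    best_action, best_score = None, float("-inf")
    for action in actions:
        if action["type"].startswith("collect"):
            score = sum(weights.get(c, 0) * n for c, n in action["collected_gems"].items())
        elif action["type"].startswith("buy"):
            owned = {c: len(cards) for c, cards in me.cards.items()}   # permanent discounts
            remaining = sum(max(0, n - owned.get(c, 0)) for c, n in action["card"].cost.items())
            score = action["card"].points - 0.1 * remaining
        else:  # reserve (or other) actions get a flat bonus for the gold gem
            score = 0.5
        if score > best_score:
            best_action, best_score = action, score
    return best_action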

The SelectAction function first uses SelectAction_v1 for quick decision-making, and if time permits, Monte Carlo Tree Search (MCTS) will be enabled for more in-depth analysis. This design allows the AI to make quick decisions during the regular game process, while exploring better game strategies through MCTS at critical moments. This ensures the AI's adaptability and competitiveness in different game stages and complex situations. During the action selection process, the function continuously monitors the elapsed time to ensure that the decision-making process does not exceed the predetermined think time (THINKTIME).
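
Putting the sketches above together, the time-budgeted hybrid could look as below; THINKTIME is assumed to be the per-move budget in seconds, and the loop structure is an illustration rather than the agent's exact implementation.

import time
from copy import deepcopy

THINKTIME = 0.9   # assumed per-move budget in seconds

def SelectAction(actions, game_state, game_rule, agent_id):
    # Take the quick one-step answer first, then refine it with MCTS while time remains.
    start = time.time()
    best = SelectAction_v1(actions, game_state, game_rule, agent_id)
    root = Node(deepcopy(game_state), game_rule, agent_id)
    while time.time() - start < THINKTIME:
        node = root
        while not node.untried_moves and node.children:   # selection
            node = select_child(node)
        if node.untried_moves:                            # expansion
            node = expand(node)
        score = simulate(node, SelectAction_v2)           # simulation
        backpropagate(node, score)                        # backpropagation
    if root.children:
        best = max(root.children, key=lambda c: c.visits).action
    return best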

GreedySelectAction is responsible for selecting, via a greedy algorithm, the action that maximizes immediate benefit among the currently available actions. The function first calculates the total number of gems currently held by the player and sets a gem threshold based on the progress of the game; this threshold is adjusted dynamically according to the current progress and the number of remaining noble cards, and it decides whether to prioritize collecting gems or other types of action. For each possible action, the function evaluates its potential score increase, with specific score calculations for collecting gems, purchasing cards, and reserving cards. It then compares these scores and selects the highest-scoring action, ensuring that each step maximizes the score; when several actions tie, one may be chosen randomly or according to specific strategy preferences.
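
The dynamic threshold could, for instance, look like the small illustration below; the round counts, noble counts and returned values are purely hypothetical placeholders for the real tuning.

def gem_threshold(game_state, rounds_played):
    # Early game: happy to stockpile gems. Late game, or few nobles left:
    # tighten the limit so purchases take priority over collecting.
    remaining_nobles = len(game_state.board.nobles)   # assumed accessor
    if rounds_played < 5:
        return 8
    if remaining_nobles <= 1:
        return 5
    return 6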

Advantage & Disadvantage

A core advantage of MCTS is that, through its exploration and exploitation mechanism, it can naturally discover over a large number of simulations which actions improve the win rate and which are likely to lead to failure. This ability to explore is especially important for a game with variable strategies like Splendor, because the optimal strategy in each game may change dramatically with the cards dealt and the opponent's strategy. At the same time, MCTS gives the Splendor agent an efficient way to evaluate different game actions by constructing and continuously updating a decision tree: the expansion and selection processes let the AI base every decision step on the latest game information, making the decision-making process more accurate and timely.

Although the Monte Carlo Tree Search (MCTS) algorithm brings significant strategic advantages, its implementation also has shortcomings that cannot be ignored. First, MCTS consumes a lot of computing resources when running large numbers of simulations to build and evaluate the decision tree; this can cause performance bottlenecks, especially in time-constrained environments. Furthermore, the decision quality of MCTS depends heavily on the quality of the simulations: if the simulation strategy is inaccurate or does not cover critical game dynamics, decisions end up based on incomplete or misleading information. As the game grows more complex, for example with more cards or more intricate logic, the state space MCTS must handle expands sharply, driving up the number of simulations required, further straining performance and resources and reducing the algorithm's efficiency on large state spaces.

Solved Challenges

In the "Splendor" game AI project, a series of complex challenges were successfully met through a carefully designed multi-layer strategic decision-making system, dynamic resource evaluation tools and advanced search algorithms.

MCTS makes decisions by simulating many possible game outcomes, which is particularly computationally intensive when the tree is deep and wide. This is the problem I mainly tried to solve, because if the simulation depth and breadth are insufficient, MCTS cannot achieve good results in this game; at the same time, the response-time limit means the agent must strike a balance between fast response and in-depth analysis. I spent a lot of time trying to apply pruning techniques to this problem, but due to time constraints I did not implement them well. Instead, I designed SelectAction to provide quick decisions in routine turns and to run MCTS for in-depth analysis of possible lines of play at critical moments, effectively balancing quick response and deep analysis. By combining instant decision-making with deep, simulation-based analysis, the AI can make more precise and strategic choices at critical moments, improving decision quality especially at key turning points in the game.

Secondly, players must make a series of complex decisions under restricted conditions: when to buy cards, when to collect or spend gems, and when to reserve cards to deter the opponent. The agent needs to adapt quickly to the changing game situation and make optimized strategic choices. This was the first problem I faced, and it is addressed through the multi-strategy decision functions SelectAction_v1, SelectAction_v2 and GreedySelectAction. By implementing several action-selection functions with different priorities and strategic focuses, the AI can flexibly adjust its behaviour based on the current game environment and specific needs. This diversity of strategies keeps the AI efficient and competitive in the face of complex and changing game dynamics.

In addition, the resources in the game are limited, and winning is inseparable from using them rationally: the AI needs to evaluate and optimize the use of gems and cards to maximize the benefit of every resource, since each decision can affect the final outcome. To solve this I designed several functions. The evaluate_gemcolor_weight and card_cost functions optimize resource management: the former dynamically adjusts the strategic value of each gem colour, and the latter computes the purchase cost of new cards given the resources already held, ensuring that every use of resources is the best choice under the current conditions and maximizing resource-utilization efficiency.
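
A possible sketch of the card_cost helper, under the same accessor assumptions as the earlier snippets (owned cards grouped by colour act as permanent discounts), is:

def card_cost(card, game_state, agent_id):
    # Gems still required for a card after subtracting the permanent discounts
    # provided by the cards the agent already owns.
    me = game_state.agents[agent_id]
    owned = {colour: len(cards) for colour, cards in me.cards.items()}
    return sum(max(0, amount - owned.get(colour, 0)) for colour, amount in card.cost.items())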

Future Improvement

The first and most important improvement is the introduction of Alpha-Beta-style pruning: improving search efficiency by excluding obviously suboptimal or low-contribution paths during the simulation phase. My plan is to quickly estimate the result of each possible action during simulation; if an action's potential value is significantly lower than the best value found on already explored paths, that path can be terminated early. This would significantly reduce the number of useless simulations and therefore the computational cost. I would also introduce more explicit stopping conditions in the simulate function, such as terminating a simulation early when a certain score difference is reached or when one side has a clear advantage, and add a pre-evaluation step for each action choice in the simulation phase: a simplified scoring function quickly estimates the potential value of each action, and if an action's pre-evaluation score falls below the threshold set by the currently known best action, the deep simulation of that action can be skipped, as sketched below.

SOME_THRESHOLD = 1   # illustrative placeholder; the real cut-off would need tuning

def pre_evaluate_action(action, current_state):
    # Cheap estimate: only buying a face-up card contributes immediate points.
    potential_score = action["card"].points if action["type"] == "buy_available" else 0
    return potential_score

def simulate(self, node, startTime):
    for action in node.untried_moves:
        # Skip the expensive rollout for actions that fail the quick pre-check.
        if pre_evaluate_action(action, node.state) < SOME_THRESHOLD:
            continue
        # ... run the full simulation only for the remaining actions ...

At the same time, the evaluate_player_score function would be improved to evaluate the value of actions through more sophisticated heuristics. My plan is to strengthen the stage-specific strategy: the current evaluation is based mainly on a weighting of player score, gems and cards, and I plan to add terms for the opponent's resources and potential threats. For example, as the opponent's score approaches 15 points, actions that prevent the opponent from obtaining a specific card should receive a higher score. More logic around noble cards would also be added, triggered only once the game enters the mid-game, that is, once a certain number of gems are already in hand, so that the strategy can adjust dynamically and try to claim a noble. SelectAction_v2 would then be reformulated around these added scoring perspectives so that the weights of different actions can be fine-tuned across multiple dimensions at each stage.
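
As an illustration of this planned direction (not existing code), the extended heuristic could add opponent awareness and a mid-game noble term along the following lines; all weights and thresholds are placeholders, and the two-player indexing is an assumption.

def evaluate_player_score_v2(game_state, agent_id):
    # Current heuristic plus a penalty when the opponent nears 15 points and a
    # mid-game bonus for nobles already claimed.
    me = game_state.agents[agent_id]
    opponent = game_state.agents[1 - agent_id]   # assumes a two-player game
    score = me.score + 0.1 * sum(me.gems.values())
    if opponent.score >= 12:                     # opponent is closing in on the goal
        score -= 2.0 * (opponent.score - 12)
    if sum(me.gems.values()) >= 6:               # mid-game: start chasing nobles
        score += 1.0 * len(me.nobles)
    return score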