AI Method 2

SARSA Agent

This agent was developed by team member: Junbo Liang - [email protected] - 1019905

The SARSA (State-Action-Reward-State-Action) agent, implemented in agents/t_085/leung_sarsa.py, is one of our attempts at playing Splendor automatically. SARSA is a reinforcement learning algorithm used to solve the optimal policy problem in a Markov Decision Process (MDP), and it is an on-policy learning method. It updates the Q-value of an action taken in the current state, Q(s, a), using the formula Q(s, a) = (1 - α)Q(s, a) + α * [r + γ * Q(s', a')], where α is the learning rate, r is the reward, γ is the discount factor and Q(s', a') is the estimated value of choosing a' in the next state s' (a' ∈ A(s')).

The SARSA agent has demonstrated only moderate performance when competing against agents from the teaching team. The best record for this SARSA agent is 20 wins out of 40 games (win rate: 50.0%), which is still an imperfect approach.

Justification for Technique

Splendor is a game with a clear distinction between game states, since it is turn-based and players must make a choice at each stage. SARSA evaluates the value of different actions in each state through the state-action value function, and selects the optimal action based on the current policy. This enables SARSA to self-learn and update its Q-values so that it can select the best action in each state.

After enough epochs of running the training function, the parameters of the SARSA algorithm will smoothly converge to values that are optimal with respect to the design. This stability enables SARSA to perform more reliably in complex and changing game environments. Once the parameters have been trained, the agent can also produce a decision within a very short time, since it only needs to consider the actions available in the current state.

Problem Modelling and Implementation Design

Each state admits a large number of actions, and the state is composed of many factors, such as the number of gems the agent holds, the cards available on the table, etc. It is impossible to store all the information of a state, and it is rare to encounter exactly the same game state twice. Rather than updating Q-values for discrete game states, we extract features from a game state and combine the values of these features through a function, so that similar states share a representation. We store the weight used to score each feature, and use these weights to compute the Q-value of the current state.

Here are the features used as the state-potential heuristic in my SARSA algorithm:

| Feature | Description | Reason |
| --- | --- | --- |
| score | The score of the current state | The direct factor that determines win or lose |
| gems | The number of gems held (including yellow gems, but not the permanent gems from cards) | Used to determine when to stop accumulating gems |
| noble | The number of nobles attracted | Bonus points which are an advantage to have |
| card_point | If the selected action is buying a card, the points of that card | Evaluates the value of buying this card |
| card_cost | If the selected action is buying a card, the cost of buying this card | Evaluates the value of buying this card |
| level_1_cards | The number of level 1 cards held | Evaluates how much holding level 1 cards benefits winning |
| level_2_cards | The number of level 2 cards held | Evaluates how much holding level 2 cards benefits winning |
| level_3_cards | The number of level 3 cards held | Evaluates how much holding level 3 cards benefits winning |

Here is the code for the featureScore function:
  def featureScore(self, game_state, action):
      feature = {}
      # current score of this agent
      feature['score'] = self.game_rule.calScore(game_state, self.id)
      # total gems held (including yellow, excluding permanent gems from cards)
      feature['gems'] = sum(game_state.agents[self.id].gems.values())
      # number of nobles attracted so far
      feature['noble'] = len(game_state.agents[self.id].nobles)
      feature['card_point'] = 0
      feature['card_cost'] = 0
      # if the action involves a card, score its points and (negated) cost
      if 'card' in action:
          feature['card_point'] = action['card'].points
          feature['card_cost'] = - sum(action['card'].cost.values())
      # count the cards held at each level
      count1 = 0
      count2 = 0
      count3 = 0
      for key, items in game_state.agents[self.id].cards.items():
          for card in items:
              if card.code in LEVEL_1_CARDS:
                  count1 += 1
              if card.code in LEVEL_2_CARDS:
                  count2 += 1
              if card.code in LEVEL_3_CARDS:
                  count3 += 1
      feature['level_1_cards'] = count1
      feature['level_2_cards'] = count2
      feature['level_3_cards'] = count3
      return feature

By passing the game state and the chosen action to this function, it counts or sums up each feature's value and returns the result as a dictionary. This function is called both when updating the parameters and when calculating the Q-value.
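
The Q-value of a state-action pair is then a combination of these feature values with the learned weights. The getQ helper used later in updateAll is not reproduced on this page, so the following is only a minimal sketch, assuming it is a simple weighted sum:

  # Minimal sketch (assumed, not the actual implementation): Q(s, a) as a
  # weighted sum of the features returned by featureScore.
  def getQ(self, game_state, action):
      q_value = 0
      for feature, feature_value in self.featureScore(game_state, action).items():
          # self.weight is the feature-weight dictionary saved in weight.json
          q_value += self.weight.get(feature, 0) * feature_value
      return q_value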

The parameters are stored as a dictionary in a JSON file. After training for 100 games against the BFS agent (agents/t_085/otherAgents/leung_bfs.py), the parameter values (the weight for each feature) are stored in weight.json. When the training mode of the agent is turned off (TRAINING = False), it can use these weights to calculate Q-values and produce an answer in a very short time.

The SelectAction function depends on the TRAINING status: if the agent is not in training mode, it calls the onPolicy function to return the action with the best Q-value; otherwise it runs the series of training functions.

Here is the code for the SelectAction function:
  def SelectAction(self, actions, game_state):
      # when not in training mode, act greedily using the learned weights
      if not TRAINING:
          return self.onPolicy(actions, game_state)
      # otherwise run the training loop within the time limit
      startTime = time.time()
      path = []
      new_path = self.train(game_state, startTime, path)
      return new_path[0] if new_path else random.choice(actions)
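
The onPolicy function itself is not reproduced on this page. As a minimal sketch, assuming it simply evaluates every legal action with the learned weights (via getQ) and returns the one with the highest Q-value:

  # Minimal sketch (assumed, not the actual implementation): greedy action
  # selection over the legal actions using the learned feature weights.
  def onPolicy(self, actions, game_state):
      best_action = None
      best_q = float('-inf')
      for action in actions:
          q = self.getQ(game_state, action)
          if q > best_q:
              best_q = q
              best_action = action
      return best_action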

During training, this is the main loop that keeps updating the state until the time limit is reached or the target state of winning the game is reached. An ε mechanism determines whether, in the current state, the agent updates the parameters or uses the current parameters to pick the action with the best Q-value before moving to the next state. In this method the ε value is 0.9. Since the agent may not be able to update the values and reach the goal state at the same time, a list is used from the beginning to store the action path, similar to the BFS method in the preliminary attempt. In case it cannot reach the goal state within the time limit, it can still return a temporarily best path. There is also a function that calculates the reward of the selected action, which is used to update the Q-value, i.e. the parameters in weight.json.

In contrast with pure SARSA or Q-learning, which do not have this loop and return the optimal action according to their rules, my version of SARSA has a loop that updates the value of each possible action that can be taken in the current state. As in the preliminary BFS method, there is a filter that selects the valuable actions based on the defined greedy method.

Here is the code for the train function:
  def train(self, game_state, startTime, path):
      new_path = []
      # keep exploring until the time limit is reached or the goal score (15) is achieved
      while time.time() - startTime < MAX_TIME and self.game_rule.calScore(game_state, self.id) < 15:
          actions = self.game_rule.getLegalActions(game_state, self.id)
          if random.random() < self.epsilon:
              # exploit: filter actions with the greedy heuristic and update the weights
              new_actions = self.greedyAction(game_state, actions)
              new_action = random.choice(new_actions)
              next_state = self.game_rule.generateSuccessor(deepcopy(game_state), new_action, self.id)
              reward = self.calReward(game_state, next_state, new_action)
              next_path = path + [new_action]
              self.updateAll(game_state, next_state, new_action, reward)
              new_path = self.train(next_state, startTime, next_path)
          else:
              # explore: take a random legal action without updating the weights
              new_action = random.choice(actions)
              next_path = path + [new_action]
              next_state = self.game_rule.generateSuccessor(deepcopy(game_state), new_action, self.id)
              new_path = self.train(next_state, startTime, next_path)
      return new_path if new_path else path

The above train function relies on three important functions, namely greedyAction, calReward and updateAll, which respectively implement the action-priority heuristic, the estimated-reward heuristic and the resource-weight heuristic. greedyAction directly inherits from my preliminary method: it first collects gems until a card can be purchased, and gem-gathering actions that make a card purchasable in the next round are preferred over other gem-gathering actions.
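
The greedyAction function is not shown on this page; the following is only a minimal sketch of the filtering idea described above. The substring checks on the action 'type' strings and the canBuyAny helper are assumptions for illustration.

  # Minimal sketch (assumed, not the actual implementation) of the
  # action-priority filter described above.
  def greedyAction(self, game_state, actions):
      # if any card can be bought right now, only consider buying actions
      buy_actions = [a for a in actions if 'buy' in a['type']]
      if buy_actions:
          return buy_actions
      preferred = []
      for action in actions:
          if 'collect' in action['type']:
              next_state = self.game_rule.generateSuccessor(deepcopy(game_state), action, self.id)
              # canBuyAny is a hypothetical helper that checks whether any card
              # on the table would be affordable in the successor state
              if self.canBuyAny(next_state):
                  preferred.append(action)
      # prefer gem collections that enable a purchase next round,
      # otherwise fall back to all legal actions
      return preferred if preferred else actions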

The calReward function evaluates the reward of taking the chosen action in the current state; this reward is used for updating the Q-value. The reward is designed to range roughly from -1 to 1.

| Action | Reward |
| --- | --- |
| Attract a noble | +1 |
| Gain a permanent gem from a card | +0.5 |
| Reduce gems (return or purchase a card) | +0.1 * number of gems |
| Obtain points | +0.3 * number of points |
| Pass | -1 |

Here is the code for the calReward function:
  def calReward(self, game_state, next_state, action):
      reward = 0
      # reward for attracting nobles in this action
      reward += 1 * (len(next_state.agents[self.id].nobles) - len(game_state.agents[self.id].nobles))
      # reward for obtaining a permanent gem from a card
      reward += 0.5 * (len(next_state.agents[self.id].cards) - len(game_state.agents[self.id].cards))
      # reward proportional to the change in gems held
      reward += 0.1 * (sum(next_state.agents[self.id].gems.values()) - sum(game_state.agents[self.id].gems.values()))
      # reward for gaining score
      reward += 0.3 * (self.game_rule.calScore(next_state, self.id) - self.game_rule.calScore(game_state, self.id))
      # penalty for passing
      if action['type'] == 'pass':
          reward -= 1
      return reward

The updateAll function uses the formula Q(s, a) = (1 - α)Q(s, a) + α * [r + γ * Q(s', a')], where α is the learning rate, r is the reward, γ is the discount factor and Q(s', a') is the estimated value of choosing a' in the next state s' (a' ∈ A(s')). In this program the learning rate α is initialised to 0.01, the discount factor γ is 0.9, and the learning rate decays (by a factor of 0.9 each update) down to a minimum of 0.0001. The reason for including a decreasing learning rate is to prevent the model from overfitting to one particular game. Since we use feature weights to compute Q(s, a), the formula is rewritten as Q(s, a) = Q(s, a) + α * [r + γ * Q(s', a') - Q(s, a)], where the delta in the function represents α * [r + γ * Q(s', a') - Q(s, a)]. Each feature weight is then updated as old feature weight + delta * feature value, and the .json file is overwritten to save the new feature weights.

Here is the code for the updateAll function:
  def updateAll(self, game_state, next_state, action, reward):
      # choose a' from the legal actions of the next state s' (a' ∈ A(s'))
      actions = self.game_rule.getLegalActions(next_state, self.id)
      next_actions = self.greedyAction(next_state, actions)
      next_action = random.choice(next_actions)
      # delta = α * [r + γ * Q(s', a') - Q(s, a)]
      delta = self.lr * (reward + self.gamma * self.getQ(next_state, next_action) - self.getQ(game_state, action))
      # update each feature weight: new weight = old weight + delta * feature value
      for feature, feature_value in self.featureScore(game_state, action).items():
          old_weight = self.weight.get(feature, 0)
          gradient = feature_value * delta
          self.weight[feature] = old_weight + gradient
      # decay the learning rate, keeping it above the minimum of 0.0001
      self.lr = max(0.0001, self.lr * 0.9)
      # overwrite the JSON file with the updated weights
      json.dump(self.weight, open(PATH, "w"))

The commands used to obtain information from the game program, such as obtaining the available actions in the current state and calculating the next game state, are introduced in the Problem Analysis section.

Advantages & Disadvantages

Advantages:

  • Computation time is very short after training. Once a converged set of feature weights is obtained, training mode is turned off. The model then calls the onPolicy function to calculate the Q-values of all available actions in the current game state and selects the action with the highest Q-value. This can be computed in a very short time and could also be adapted as an AI to simulate opponents' moves in other methods.

  • Safe learning and policy stability: the Q-value and feature weights are only updated according to the action actually selected. During training, the chosen actions are always valid in the current game state, so the values and feature weights of the algorithm will not be updated by invalid moves.

  • Learning without supervision: it does not require manual intervention such as providing optimal solutions to compare losses and update values. SARSA updates the feature weights in each iteration based on the actions chosen by the algorithm itself, and repeats these weight updates throughout the game.

Disadvantages:

  • It is complicated to define the game state. The selected features can roughly represent similar stages, but there may still be differences between those stages, and an action that is optimal in one stage might not be optimal in a similar one. The weights in the reward function are also difficult to design; it is very hard to determine a set of weights that does not over- or under-value some features.

  • Extra time is needed to train the agent before it can be used in competition, and the feature weights converge slowly. It is also not easy to adapt the algorithm to other problems: unlike other methods that can be reused after adjusting a few lines of code, SARSA requires modifying the code and further training time to learn the optimal parameters for the new problem.

Solved Challenges

  1. I encountered the time limit being exceeded when training my SARSA method, since my design uses a greedy epsilon value and a loop that tries to explore in depth in each iteration, which is very time-consuming. I wanted a method that trains relatively quickly and is robust enough, so the features mentioned above should not be changed, and the computation time of each iteration is somewhat long because it needs to calculate the Q-value, the reward and the updated feature weights. I decided to include a timer that records the start time and a time threshold: when the running time of a round exceeds this threshold, the training function terminates and returns a locally optimal action, ensuring my program does not exceed the 1s time limit.

  2. Another challenge was determining which gem-collecting action would allow my agent to buy a card in the next round. My first idea was to convert the card's cost into a matrix and subtract the agent's current gems, but this is more complicated and requires adding many constraints and conversions to get the result. I then realised that I could simply pass the collect-gems action to the game engine, obtain the next game state, and use the "can buy" check to determine whether this action enables the agent to buy a card in the next round.

  3. I first tried to store the game state and the action taken as a Q(s, a) entry directly, but it did not work. I then tried using the current score as the game state, which also did not work well, since the game state is more complicated than the agent's score alone. I then considered which features could represent the game state more comprehensively, and came up with the list of features mentioned above.

  4. I found that without learning-rate decay, after a few games of learning some weights became 100 times larger than others, which made the decision dominated solely by those features. After I introduced learning-rate decay and a minimum learning rate, the weight updates became more reasonable and smooth.

Future Improvement

Add more functions to guide the agent toward action paths that attract nobles, and give the noble feature more weight. Since the learned feature weight for nobles is currently negative, nobles are treated as a disadvantage by this agent, which contradicts the strategy for winning the game and does not match my design purpose. Possible options are to extend the action-priority heuristic to give higher priority to actions that can attract a noble, or to increase the reward for obtaining a noble.

Introduce a system to evaluate the value of a card by calculating its reward (gem colour and points) and the cost of purchasing it, possibly also relating it to the gem requirements of the nobles to weight each card further. For example, maintain a priority order of cards: at level 1, a card that can be purchased with 3 gems is better than one that needs 4 gems, and if the permanent gem of a card matches a noble's requirement, its priority is increased further.
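
The following is only a rough sketch of such a card-value heuristic, not part of the current agent. The card fields (points, cost) follow those already used in featureScore, while card.colour and the noble_requirements argument are assumptions for illustration.

  # Rough sketch of a possible card-value heuristic (not implemented).
  def cardValue(self, card, noble_requirements):
      total_cost = sum(card.cost.values())
      # cheaper cards and higher-point cards are preferred
      value = card.points - 0.2 * total_cost
      # assumed bonus when the card's permanent gem colour is still
      # needed by one of the remaining nobles
      if card.colour in noble_requirements:
          value += 0.5
      return value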

Take the game stage into account, for example by dividing the game into three stages: a warm-up stage where agents hold fewer than 4 cards and have fewer than 6 points, a middle stage where agents have fewer than 10 points, and a final stage where agents have 10 or more points. Each stage would have its own set of feature weights so that action priorities are adjusted dynamically.
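
A minimal sketch of how such stage-dependent weights could be selected; the thresholds come from the description above, while the per-stage weight dictionaries (warmup_weights, medium_weights, final_weights) are hypothetical.

  # Rough sketch: choose a feature-weight dictionary based on the game stage.
  def stageWeights(self, game_state):
      score = self.game_rule.calScore(game_state, self.id)
      num_cards = sum(len(cards) for cards in game_state.agents[self.id].cards.values())
      if num_cards < 4 and score < 6:
          return self.warmup_weights   # hypothetical per-stage weight dictionaries
      elif score < 10:
          return self.medium_weights
      else:
          return self.final_weights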

Use a normalisation method to smooth the feature values, which can reduce oscillations in the feature weights and lead to relatively smoother value updates. Since a larger feature value is amplified more than a smaller one when multiplied by the same delta, some features will otherwise dominate and there will not be enough diversity.
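
One simple option, sketched here under the assumption of fixed per-feature maxima (the values in FEATURE_MAX below are illustrative, not tuned), is to rescale each raw feature value to a comparable scale before it enters the weight update.

  # Rough sketch: rescale raw feature values using assumed upper bounds,
  # so that no single feature dominates the weight update.
  FEATURE_MAX = {'score': 15, 'gems': 10, 'noble': 3, 'card_point': 5,
                 'card_cost': 14, 'level_1_cards': 10, 'level_2_cards': 10,
                 'level_3_cards': 10}

  def normalisedFeatureScore(self, game_state, action):
      raw = self.featureScore(game_state, action)
      return {f: v / FEATURE_MAX.get(f, 1) for f, v in raw.items()}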
