Reinforcement learning - Falmouth-Games-Academy/comp250-wiki GitHub Wiki
To act on its own,
And find the consequences,
With rewards are taught
- Reinforcement learning can be understood using the concepts of agents, environments, states, actions and rewards.
Reinforcement learning (RL) is an area of machine learning concerned with how agents ought to take actions in an environment so as to maximize some notion of cumulative reward. As a machine learning method, RL is inspired by behaviorist psychology, in particular the way humans and animals learn to make decisions from the (positive or negative) rewards they receive from their environment.[1][2]
In reinforcement learning, the training signal of the algorithm is provided by the environment, based on how the agent interacts with it. At a particular point in time (t), the agent is in a particular state (S) and takes an action (A) from the actions available in its current state. In response, the environment delivers an immediate reward (R).
It is this continuous interaction between the agent and its environment that allows the agent to gradually learn to select actions that maximize its sum of rewards.
To elaborate: at each time step, the agent performs an action, which changes the state of the environment and may result in the agent receiving a reward (or penalty) from the environment.
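The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `CorridorEnv` is a hypothetical toy environment (a corridor of cells where reaching the last cell gives a reward of +1), and the agent simply picks actions at random, so it shows the state/action/reward cycle without any learning yet.

```python
import random
from typing import Tuple


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..length-1.

    The agent starts in cell 0. Reaching the last cell yields a reward of +1
    and ends the interaction; every other step yields a reward of 0.
    """

    ACTIONS = (-1, +1)  # move left, move right

    def __init__(self, length: int = 5) -> None:
        self.length = length
        self.state = 0

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool]:
        # The environment changes state in response to the action (walls clamp movement)...
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        # ...and delivers an immediate reward.
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_interaction(env: CorridorEnv, max_steps: int = 100, seed: int = 0) -> float:
    """Run the agent/environment loop and return the sum of rewards received."""
    rng = random.Random(seed)
    env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = rng.choice(CorridorEnv.ACTIONS)  # agent picks A_t (randomly, no learning yet)
        _, reward, done = env.step(action)         # environment returns S_{t+1}, R_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward


print(run_interaction(CorridorEnv()))
```

A learning agent would replace the random `rng.choice` with a policy that improves as rewards come in.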
The goal of the agent is to discover an optimal policy (i.e. which action to take in each state) that maximizes the total reward received from the environment in response to its actions. A Markov decision process (MDP) is used to describe the agent/environment interaction formally.[2]
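The "total reward" the agent maximizes is often made precise as the return. One common formulation (an assumption here; the page does not define it) weights future rewards by a discount factor γ with 0 ≤ γ ≤ 1: G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... With γ = 1 this is just the plain sum of rewards, which is well defined for finite episodes. A minimal sketch:

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Return r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite sequence of rewards.

    gamma = 1.0 gives the plain (undiscounted) total reward; gamma < 1 weights
    rewards received sooner more heavily than rewards received later.
    """
    g = 0.0
    # Accumulate backwards, using G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [0.0, 0.0, 1.0]  # a reward only at the third step
print(discounted_return(rewards))       # 1.0
print(discounted_return(rewards, 0.5))  # 0.25
```

Discounting makes the agent prefer reaching a reward sooner rather than later, which is why it is a common default.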
An MDP generally consists of four elements:
- S: the set of states. At each time step, the state of the environment is an element s ∈ S.
- A: the set of actions. At each time step, the agent chooses an action a ∈ A to perform.
- p(s_{t+1} | s_t, a_t): the state transition model, which describes how the environment state changes when the agent performs action a_t in the current state s_t.
- p(r_{t+1} | s_t, a_t): the reward model, which describes the real-valued reward the agent receives from the environment after performing an action. In an MDP, the reward depends on the current state and the action performed.
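For a small, finite MDP, these elements can be written down directly as tables. The sketch below is a hypothetical two-state example (the state names, probabilities and rewards are made up for illustration). The transition model p(s_{t+1} | s_t, a_t) is a table of probabilities, and, as a simplification, the reward model is a deterministic reward for each (state, action) pair, which is a special case of p(r_{t+1} | s_t, a_t).

```python
import random
from typing import Dict, Tuple

# Hypothetical two-state MDP, for illustration only.
STATES = ("cool", "hot")
ACTIONS = ("slow", "fast")

# Transition model p(s' | s, a): for each (s, a), a distribution over next states.
TRANSITIONS: Dict[Tuple[str, str], Dict[str, float]] = {
    ("cool", "slow"): {"cool": 1.0},
    ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
    ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
    ("hot", "fast"): {"hot": 1.0},
}

# Reward model, simplified here to a deterministic reward r(s, a).
REWARDS: Dict[Tuple[str, str], float] = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("hot", "slow"): 1.0,
    ("hot", "fast"): -10.0,
}


def step(state: str, action: str, rng: random.Random) -> Tuple[str, float]:
    """Sample s_{t+1} ~ p(. | s_t, a_t) and return it together with the reward."""
    dist = TRANSITIONS[(state, action)]
    next_state = rng.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, REWARDS[(state, action)]


# Sanity check: every transition distribution sums to 1.
for (s, a), dist in TRANSITIONS.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)

rng = random.Random(42)
print(step("cool", "fast", rng))
```

Note that the next state depends only on the current state and action, not on earlier history; this is the Markov property that gives the MDP its name.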
The way the agent chooses which action to perform is called the agent's policy: a function that maps the current environment state to an action. The policy is often denoted by the symbol π.
A policy is thus the strategy the agent follows when selecting actions, given the state it is in. If a function that gives the value of each action in each state is known or has been learned, the optimal policy (π∗) can be derived by selecting, in each state, the action with the highest value. The interactions with the environment described above occur in discrete time steps (t = 0, 1, 2, ...) and are modeled as an MDP.
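Deriving a policy by "selecting the action with the highest value" is the greedy rule π(s) = argmax_a Q(s, a). The sketch below builds such a policy from a table of action values; the state and action names and the values themselves are hypothetical. The resulting policy is optimal only if the table holds the true optimal action values; with learned estimates it is simply greedy with respect to the current estimates.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

State = Hashable
Action = Hashable


def greedy_policy(q: Dict[Tuple[State, Action], float],
                  actions: Iterable[Action]) -> Callable[[State], Action]:
    """Build pi(s) = argmax_a Q(s, a) from a table of action values.

    Unseen (state, action) pairs are treated as having value 0.0 (an
    assumption of this sketch). Ties go to the first action listed.
    """
    actions = list(actions)

    def pi(state: State) -> Action:
        return max(actions, key=lambda a: q.get((state, a), 0.0))

    return pi


# Hypothetical action values for a two-action problem.
q_values = {("s0", "left"): 0.2, ("s0", "right"): 0.7,
            ("s1", "left"): 0.9, ("s1", "right"): 0.1}
pi = greedy_policy(q_values, ["left", "right"])
print(pi("s0"), pi("s1"))  # right left
```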
Q-learning is a basic reinforcement learning algorithm that uses Q-values (also called action values) to iteratively improve the behavior of the learning agent.
Q-values or action values: Q-values are defined for state-action pairs. Q(S, A) is an estimate of how good it is to take action A in state S. This estimate is computed iteratively using the temporal-difference (TD) update rule.
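In Q-learning, the standard TD update after observing a transition (S, A, R, S') is Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)], where α is a learning rate and γ a discount factor. The term in brackets is the TD error: the gap between the current estimate and a one-step target. A minimal sketch of a single update on a dictionary-based Q-table (the `td_update` helper and its defaults are illustrative, not a library API):

```python
from typing import Dict, Hashable, Iterable, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def td_update(q: QTable, s, a, r: float, s_next, actions: Iterable,
              alpha: float = 0.1, gamma: float = 0.9, terminal: bool = False) -> float:
    """Apply one Q-learning TD update to q[(s, a)] in place and return the new value.

    target = r + gamma * max_a' Q(s_next, a')   (just r if s_next is terminal)
    Q(s, a) <- Q(s, a) + alpha * (target - Q(s, a))
    Missing table entries are treated as 0.0.
    """
    best_next = 0.0 if terminal else max(q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    td_error = td_target - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return q[(s, a)]


q = {("s1", "right"): 1.0}
print(td_update(q, "s0", "right", r=0.0, s_next="s1", actions=["left", "right"],
                alpha=0.5, gamma=0.5))  # 0.25
```

Here the target is 0 + 0.5 × 1.0 = 0.5, and moving halfway (α = 0.5) from the old value 0 gives 0.25.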
Rewards and episodes: over the course of its lifetime, an agent starts from a start state and makes a number of transitions from its current state to a next state, based on its choice of action and on the environment it is interacting with. At every step, the agent takes an action in its current state, observes a reward from the environment, and then transitions to another state. If at any point the agent ends up in one of the terminal states, no further transitions are possible; this marks the completion of an episode.[3]
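Putting these pieces together, the sketch below runs tabular Q-learning over many episodes on a hypothetical five-cell corridor: each episode starts in state 0 and ends when the agent reaches the terminal state 4, which gives a reward of +1. The exploration rule (epsilon-greedy, with random tie-breaking) and the hyperparameters (α = 0.5, γ = 0.9, ε = 0.1, 200 episodes) are assumptions chosen for illustration; this is an independent sketch, not the code from [3].

```python
import random
from typing import Dict, Tuple

# Hypothetical corridor: states 0..4, start in state 0; state 4 is terminal and gives +1.
N_STATES = 5
START, GOAL = 0, N_STATES - 1
ACTIONS = (-1, +1)  # move left, move right


def env_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Deterministic transition: move one cell (clamped at the walls)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes: int = 200, alpha: float = 0.5, gamma: float = 0.9,
               epsilon: float = 0.1, seed: int = 0) -> Dict[Tuple[int, int], float]:
    """Tabular Q-learning; each episode runs from START until the terminal GOAL state."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = START, False
        while not done:
            # Epsilon-greedy action choice, breaking ties between equal Q-values at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = env_step(state, action)
            # TD update; a terminal next state has no future value.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)
```

After training, the greedy policy should choose +1 (move right) in every non-terminal state, since that is the shortest route to the reward. Random tie-breaking matters here: with all Q-values starting at 0, always picking the first action would send the agent into the left wall for a very long time before it ever found the goal.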
Many AI learning techniques are employed during the development of video games, but games are rarely released with these mechanisms turned on. There are, however, a few notable exceptions. One is the Black and White series, which contains one of the first examples of reinforcement learning in a commercial video game. In the game, you play a "god" with an avatar creature/pet that learns to behave in certain ways based on how the player treats it. The creature learns which actions and routines to follow based on how the player scolds or rewards it throughout the game, using a variety of learning algorithms including decision trees and neural networks.[4]
Other potential uses of machine learning in game development include:
- Adapting to player preferences (content generation)
- Computational creativity: environment and character design/implementation
- NPCs: realistic interactions and dialogue [5]
[1] Wikipedia, "Reinforcement learning". https://en.wikipedia.org/wiki/Reinforcement_learning
[2] G. Yannakakis and J. Togelius, Artificial Intelligence and Games. Cham: Springer International Publishing, 2018.
[3] GeeksforGeeks. https://www.geeksforgeeks.org/q-learning-in-python/
[4] D. E. Diller et al., "Behavior modeling in commercial games," in Proceedings of the 2004 Conference on Behavior Representation in Modeling and Simulation (BRIMS), 2004.
[5] J. Stephenson (Logikk.com), "6 ways machine learning will be used in game development". https://www.logikk.com/articles/machine-learning-in-game-development/
[6] KDnuggets (image), reinforcement learning. https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html
[7] Towards Data Science (image), Q-learning. https://cdn-images-1.medium.com/max/1200/1*0_TNa54fr_LsLOllgIsrcw.png


