Reinforcement Learning - nickagliano/bub GitHub Wiki
Good robot, bad robot
Reinforcement learning is a subset of machine learning dedicated to optimizing the behaviors of an artificial intelligence in the context of external rewards. I like to oversimplify it with the anthropomorphic phrase "good robot, bad robot".
Where can reinforcement learning be used?
Games such as chess, poker, Space Invaders, or even Jeopardy! are natural candidates for reinforcement learning. Games tend to have a definitive start and end state, with the end being a win, a loss, or sometimes a tie -- and between the start and end are discrete, quantifiable states (often turns, or perhaps atomic game-clock ticks in some video games). Because these games can be broken down this way, an artificial intelligence can be trained to "complete" or "solve" them.
A simple model of reinforcement learning in the context of a chess match:
If the AI loses an important piece such as its queen, it is fed a negative reward -- it is mathematically chastised for its mistake. Conversely, if the AI takes the opponent's queen, it is fed a positive reward. Good robot.
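The greedy model above can be sketched as a simple reward function. The piece values and function name here are hypothetical, chosen just to illustrate the "good robot, bad robot" signal:

```python
# Toy reward signal for captures, using conventional material values
# (hypothetical simplification; a real engine also weighs position).
PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def capture_reward(piece: str, captured_by_ai: bool) -> float:
    """Positive reward when the AI captures a piece, negative when it loses one."""
    value = PIECE_VALUES[piece]
    return float(value) if captured_by_ai else float(-value)
```

Under this scheme, taking the opponent's queen yields +9, and losing your own yields -9.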
A more complex, less greedy model of reinforcement learning in the context of a chess match:
The AI loses an important piece such as its queen, but the loss is a sacrifice that earns better positioning on the board and eventually wins the game. Even though the queen is lost, the AI is still fed a positive reward. This idea of a more forward-thinking AI is captured by a form of reinforcement learning known as Q-learning.
These techniques played a large part in the development of the AIs that beat the world's best chess, Jeopardy!, Dota 2, and Go players.
B.U.B.'s Reinforcement Learning
Pokémon battling, like many other games (chess, poker, Space Invaders), is an area where reinforcement learning can be applied, for the reasons stated in the introduction above. If you're still confused, and/or want to learn more about reinforcement learning, this article does a good job of explaining the concepts in much more detail.
Defining a state
- data on all Pokémon: stats, status conditions, HP, items (does it hold an item? has it been consumed?), known moves, types, type matchups, abilities
- which Pokémon are active
- field conditions: weather, screens, hazards, Tailwind, terrain
- data on moves: damage, effects, types, PP
How many options does B.U.B. have for a given turn? It is choosing between:
- Using one of its usable moves (the count can change based on PP, Taunt, Torment, held item, moveset, etc.)
- Switching to another Pokémon
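Before any of this reaches the input layer, the state has to be flattened into numbers. A heavily simplified sketch, with hypothetical field names covering only a few of the features listed above:

```python
from dataclasses import dataclass

@dataclass
class BattleState:
    """Hypothetical, simplified battle state; the real state would include
    every feature above (items, abilities, hazards, terrain, ...)."""
    active_hp_fraction: float    # active Pokémon's HP as a fraction, 0.0-1.0
    opponent_hp_fraction: float  # opponent's active Pokémon's HP fraction
    usable_moves: int            # shrinks with PP, Taunt, Torment, etc.
    weather: int                 # e.g. 0 = none, 1 = rain, 2 = sun

def encode(state: BattleState) -> list[float]:
    """Flatten the state into the numeric vector the input layer consumes."""
    return [state.active_hp_fraction, state.opponent_hp_fraction,
            float(state.usable_moves), float(state.weather)]
```

Categorical features like weather are shown here as plain integers for brevity; in practice they would likely be one-hot encoded.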
Input layer
Hidden layers
Output layer
- Gives a Q-value for each possible action (move or switch) in the current game state
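The three layers above can be sketched as a single forward pass. The layer sizes here are hypothetical (4 state features in, 9 actions out: 4 moves plus 5 switches), and the weights are random placeholders standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network shape: 4 input features -> 16 hidden units -> 9 actions.
W1 = rng.normal(size=(4, 16)); b1 = np.zeros(16)   # input -> hidden
W2 = rng.normal(size=(16, 9)); b2 = np.zeros(9)    # hidden -> output

def q_values(state_vec: np.ndarray) -> np.ndarray:
    """Input layer -> one hidden layer (ReLU) -> one Q-value per action."""
    hidden = np.maximum(0.0, state_vec @ W1 + b1)
    return hidden @ W2 + b2

# The action with the highest Q-value is the greedy choice for this turn.
state = np.array([1.0, 0.5, 4.0, 0.0])
best_action = int(np.argmax(q_values(state)))
```

Training would adjust W1/W2 with the Q-learning update described earlier, typically using a deep-learning framework rather than raw NumPy.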