An Introduction To Reinforcement Learning Algorithms

Introduction

Reinforcement Learning (RL) was recently popularized after the AI program AlphaGo beat Lee Sedol, one of the strongest players in the history of Go.

Fig.1 The Netflix documentary on AlphaGo vs. Lee Sedol
RL is not only used in games; it is also widely applied in autonomous driving (Shalev-Shwartz et al., 2016), robotics tasks (Levine et al., 2015), natural language processing (NLP) (Luketina et al., 2019), and more. This wiki article introduces several popular and classic RL algorithms.

Introduction To Reinforcement Learning

Before moving to the RL algorithms, let’s talk about what reinforcement learning is. RL studies how an intelligent agent should take sequential actions in an unknown environment through trial and error. The state describes the current situation of the environment, and the agent uses it to decide which action to take. The Markov property means that the next state depends only on the current state and action, not on the full history of previous states. Under this assumption, we can model the environment as a Markov decision process (MDP) and frame RL tasks as MDPs. The main components of RL can be seen in Figure 2 below.

Fig.2 The agent–environment interaction in a Markov decision process
  1. Agent: the agent is the main player in reinforcement learning and is responsible for making decisions and learning from experience in the environment.
  2. Environment: the environment provides the state of the system, defines the actions that the agent can take, and provides a reward signal based on the agent's actions.
  3. Reward Signal: the reward signal is the key feedback mechanism in reinforcement learning that allows the agent to learn from experience and improve its policy over time.
  4. Policy: The policy is the mapping from states to actions that the agent takes. The policy defines the behavior of the agent in the environment and is updated over time as the agent receives feedback in the form of rewards.
  5. Value Function: the value function is an important component of reinforcement learning that provides the agent with a way to estimate the expected long-term reward of its actions and to guide its decision-making.
  6. Model: optionally, the agent may learn a model of the environment's dynamics. Algorithms that work without one are called model-free, while model-based algorithms plan with the learned model, and this choice has a significant impact on the performance of the reinforcement learning system.

With all of these components in place, it is time to combine them to solve real-life tasks with RL algorithms; a minimal agent–environment interaction loop is sketched below.
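
As a concrete illustration, here is a minimal sketch of the agent–environment loop from Fig.2, written in Python against the Gymnasium API (an assumption; any similar environment interface would do). The CartPole-v1 environment and the random policy stand in for a real task and a learned agent.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")           # the environment
observation, info = env.reset(seed=0)   # initial state

for t in range(200):
    action = env.action_space.sample()  # a placeholder "agent": a random policy
    # the environment transitions to a new state and returns a reward signal
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:         # episode ended; start a new one
        observation, info = env.reset()
env.close()
```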

Introduction To Reinforcement Learning Algorithms

Scientists and engineers develop RL algorithms to solve problems that are hard for humans. Here is an overview of some algorithms in modern RL, which can be broadly categorized into model-free and model-based approaches.

Model-free:

Within the model-free approach, there are also value-based and policy-based methods.

Value-based

In value-based RL, the agent learns to estimate the value of different states (or state-action pairs) and uses these estimates to choose the best action to take. Popular algorithms in this family include Q-learning, Deep Q-Networks (DQN), and hierarchical DQN (h-DQN).

  1. Q-learning: Q-learning is a classic value-based RL algorithm (Watkins & Dayan, 1992). The goal of Q-learning is to learn an optimal action-value function; the policy is implicit, obtained by selecting the action that maximizes that value function. This is achieved by maintaining a map of state-action pairs and using a quality function (Q-function) to estimate the value of each action for achieving future rewards. The Q-function is updated iteratively using the Bellman equation, which relates the value of taking an action in a given state to the immediate reward plus the discounted value of the best next action. A minimal tabular implementation is sketched after this list.
    However, one disadvantage of Q-learning is that the learning process can become computationally expensive, particularly when dealing with large state and action spaces.
  2. DQN: Deep Q-Networks use a neural network to approximate the action-value function, addressing the scalability problem when the state space is too large for a tabular Q-function to be feasible. DQN introduces two key mechanisms: experience replay and a periodically updated target network. Experience replay stores past episode steps in a replay memory; during Q-learning updates, samples are drawn at random from this memory, so each sample can be reused multiple times. This improves sample efficiency and breaks correlations in the observation sequences. With the periodically updated target, the Q-network is optimized toward a target network that is held fixed: the Q-network is cloned, the clone is kept frozen as the optimization target, and it is refreshed every k steps (k is a hyperparameter controlling the update frequency). This modification makes the learning process more stable because it dampens short-term oscillations (Mnih et al., 2015).
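
To make the update rule concrete, below is a minimal tabular Q-learning sketch in Python. The Gymnasium FrozenLake-v1 environment, the NumPy Q-table, and the hyperparameter values are illustrative assumptions rather than the setup of any specific paper.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
n_states = env.observation_space.n      # discrete states, so a table fits in memory
n_actions = env.action_space.n

alpha, gamma, epsilon = 0.1, 0.99, 0.1  # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))     # Q-table of state-action value estimates

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection: explore sometimes, otherwise act greedily
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Bellman update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
```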

Policy-based

Unlike value-based methods, policy-based RL algorithms aim to learn an optimal policy directly by maximizing the expected cumulative reward, without first computing a value function. Popular policy optimization methods include policy gradients, Advantage Actor-Critic and its asynchronous variant (A2C/A3C), and Proximal Policy Optimization (PPO).

  1. Policy Gradient: gradient descent is an optimization algorithm that finds the minimum of a function by iteratively adjusting the parameters in the direction of steepest descent. In RL, policy gradient algorithms instead use gradient ascent: they update the policy parameters in the direction of higher expected return until the policy converges. Policy gradient methods have been applied to a wide range of continuous and discrete control problems, such as Atari games (Schulman et al., 2015; Mnih et al., 2016) and real-world robots (Levine et al., 2015). A minimal REINFORCE-style sketch appears after this list.
  2. PPO: Proximal Policy Optimization is designed to be more sample-efficient and stable than vanilla policy gradient methods. It approximates the trust-region idea of constraining each update by clipping the policy probability ratio, which prevents excessively large policy changes (Schulman et al., 2017).
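
As a concrete example of a policy gradient method, here is a minimal REINFORCE-style sketch in Python using PyTorch and Gymnasium (both assumed installed). The network size, learning rate, and return normalization are illustrative choices, not the exact setup of the cited papers.

```python
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # maps 4-dim state to 2 action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)   # stochastic policy
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # compute the discounted return G_t for every step of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize for stability

    # gradient ascent on expected return == gradient descent on the negated objective
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```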

Actor-Critic

Policy-based methods have a primary strength in directly optimizing the objective we care about, which tends to make them more stable and reliable. In contrast, value-based methods optimize the agent's performance only indirectly, through a self-consistency (Bellman) equation, which introduces more instability and failure modes. However, when value-based methods work, they are more sample-efficient because they can reuse data more effectively than policy-based techniques. One might then ask: can we take advantage of both? Fortunately, researchers have created a range of actor-critic algorithms that live between the two extremes, for example Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC).

  1. DDPG: DDPG uses neural networks to approximate both the policy (actor) and the action-value function (critic). An innovative point is that DDPG learns a deterministic policy instead of a stochastic one, which makes it well suited to tasks with continuous action spaces. DDPG also uses experience replay and target networks to stabilize the learning process.
  2. SAC: Soft Actor-Critic is an actor-critic RL algorithm that learns a stochastic policy and a soft Q-function. It maximizes a trade-off between expected return and policy entropy (see the objective below), which encourages exploration and can lead to better performance in high-dimensional, continuous action spaces; discrete-action variants have also been developed. Additionally, SAC uses a replay buffer and periodically updated target networks to improve learning stability, borrowing ideas from the DQN algorithm.
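
For reference, the entropy-regularized objective that SAC maximizes can be written as follows (notation after Haarnoja et al., 2018), where $\rho_\pi$ is the state-action distribution induced by the policy $\pi$, $\mathcal{H}$ is the entropy, and $\alpha$ is a temperature coefficient that weights the entropy bonus:

```math
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```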

Model-based:

Model-based algorithms learn a model of the environment and use it to make decisions: the learned model is used to simulate potential outcomes, and the agent selects the action that leads to the best simulated outcome. This approach can achieve higher sample efficiency than model-free methods, but it requires accurate modeling of the environment, so researchers often have to fine-tune the model for each specific problem. A toy planning sketch is given below.
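
As a toy illustration of how a learned model can drive decisions, the sketch below performs "random shooting" planning: it simulates candidate action sequences through hypothetical `dynamics_model` and `reward_model` functions (assumed to have been learned already) and returns the first action of the best sequence. This is one simple planning scheme among many, not the method of any particular paper.

```python
import numpy as np

def plan_action(state, dynamics_model, reward_model, action_dim,
                n_candidates=100, horizon=10):
    """Pick an action by simulating random action sequences with a learned model."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        # sample a random candidate action sequence over the planning horizon
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total_reward = state, 0.0
        for a in actions:
            total_reward += reward_model(s, a)   # predicted reward for this step
            s = dynamics_model(s, a)             # predicted next state
        if total_reward > best_return:
            best_return, best_first_action = total_reward, actions[0]
    # execute only the first action, then replan at the next step
    return best_first_action
```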

Fig.4 A non-exhaustive taxonomy of RL algorithms mentioned above

Challenges in Reinforcement Learning Algorithms

There are several open problems and challenges with the core algorithms in reinforcement learning that require careful attention and further research.

Does the RL algorithm converge?

On the engineering side, stability is the ability of the algorithm to converge to the optimal policy. Convergence is strongly affected by hyperparameters, so hyperparameter tuning is a key concern. However, it is very hard to devise stable RL algorithms when we cannot run hyperparameter sweeps in the real world. For example, in certain healthcare settings, tuning an online RL agent could raise safety concerns, because the agent's behavior may not be predictable. More research is needed to develop stable algorithms that are less sensitive to hyperparameters, making RL a viable tool for real-world problems.

How long does the RL algorithm take to converge?

Put simply, it depends on the sample complexity, and the sample efficiency of RL algorithms is a major challenge. For example, if an RL algorithm controls a self-driving car that could cause harm when operated incorrectly, a failure to converge poses a safety risk: the system may make decisions that endanger people or property. We need to design faster algorithms and develop better model-based RL algorithms in the future.

After the RL algorithm converges, does it generalize?

Scaling and generalization are another challenging direction in RL. We need RL algorithms that learn effectively in problems with larger state and action spaces and in more complex environments, so that solutions scale. Beyond scale, we also hope to make RL algorithms generalize better, transferring knowledge from one problem to another. Overall, addressing these challenges is crucial for the success of RL algorithms in real-world problems, although the real world is rarely so simple.

Comparing RL with other Learning Models

To better understand reinforcement learning, we should understand how it differs from the other prominent machine learning paradigms: supervised and unsupervised learning.

Supervised Learning

Let’s first start with a high-level understanding. Supervised learning is similar to training someone through examples. For instance, a student would learn to do a type of problem in class, understand the process or features, and then extrapolate this to answer test questions.

A supervised model is a machine learning model that gains knowledge from labeled instances. The training dataset comprises input data, which consists of features, and associated target labels. The primary objective is to learn the relationship or pattern between the input features and the target labels, so the model can provide predictions or classifications for unseen data. A simple example is training an ML model to decide whether a picture is of a dog or a cat. During training, the model extracts features from numerous labeled pictures and uses them to classify the test data. We can measure the model's accuracy by comparing its guesses with the correct image labels. A minimal sketch is shown below.
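
The following is a minimal supervised-learning sketch using scikit-learn (assumed available); the built-in iris dataset stands in for the labeled cat/dog images in the example above, and the classifier choice is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)      # input features and target labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)            # learn from labeled examples
predictions = model.predict(X_test)    # classify unseen data
print("accuracy:", accuracy_score(y_test, predictions))
```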


Unsupervised Learning

At a high level, unsupervised learning is like learning without instructions, meaning there is no predefined label to work with. It is like visiting a new city without a guide. You analyze your surroundings by observing and categorizing them based on similar or different characteristics.

Unsupervised learning happens when the computer analyzes data that has no labels or specific instructions and tries to find patterns or structure in the data by itself. It is as if the program were exploring a large pile of objects without knowing what they are or how they should be grouped. Let's take the same cat and dog example from above. Here, the model focuses on finding distinguishing features in the photos and uses them to group the images. For example, it may notice that cats have long whiskers, whereas dogs do not, or that dogs tend to have bigger ears than cats. A minimal clustering sketch is shown below.
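
For contrast, here is a minimal unsupervised-learning sketch, again assuming scikit-learn: k-means groups unlabeled 2-D points by similarity without ever seeing a target label. The synthetic data and the cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two unlabeled "blobs" of 2-D points (think whisker length vs. ear size)
data = np.vstack([rng.normal(0, 1, size=(50, 2)),
                  rng.normal(5, 1, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])       # cluster assignments discovered from the data alone
print(kmeans.cluster_centers_)   # learned group centers
```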


We already know that reinforcement learning operates in a feedback-driven setting. An RL agent learns from the actions it takes within an environment and the feedback it receives, evaluating actions against each other. There is not necessarily a "training set" for it to analyze before exploring an environment; it instead learns "on the job". Another way to see this distinction is that supervised learning strives for accurate predictions or classifications, unsupervised learning aims to discover patterns or representations, and RL focuses on learning how to make sequential decisions in uncertain situations for the best outcome.


As seen with AlphaGo, RL has made significant strides within real-world applications. One prominent example is autonomous vehicles, where RL algorithms enable them to navigate complex environments, make critical decisions, and control their movements. RL has also found practical implementation in recommendation systems, enhancing user experiences by personalizing content. Recommendation systems can learn from user interactions and feedback to provide tailored suggestions for products, movies, music, etc. As the user continues to interact with the application, the system provides better guidance.

RL has the potential to solve far more complex, real-world challenges. Ongoing research and development continue to push its boundaries, empowering the creation of intelligent, adaptive systems that learn from experience and improve their performance in diverse domains.

Conclusion

In conclusion, modern reinforcement learning algorithms can solve complex problems in a range of fields, from robotics to game playing and beyond. However, these algorithms also come with challenges, such as training instability, long convergence times, and limited generalization. As RL continues to evolve, it holds tremendous potential for solving real-world problems and advancing the field of AI.

References

  1. Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust Region Policy Optimization. arXiv preprint.
  2. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. arXiv preprint arXiv:1602.01783.
  3. Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2015). End-to-End Training of Deep Visuomotor Policies. arXiv preprint arXiv:1504.00702.
  4. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv preprint.
  5. Kalashnikov, D., et al. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. arXiv preprint.
  6. Luketina, J., Nardelli, N., Farquhar, G., Foerster, J., Andreas, J., Grefenstette, E., Whiteson, S., & Rocktäschel, T. (2019). A Survey of Reinforcement Learning Informed by Natural Language. arXiv preprint arXiv:1906.03926.
  7. Shalev-Shwartz, S., Shammah, S., & Shashua, A. (2016). Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. arXiv preprint arXiv:1610.03295.
  8. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
  9. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous Control with Deep Reinforcement Learning. arXiv preprint arXiv:1509.02971.
  10. Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. arXiv preprint arXiv:1802.09477.
  11. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv preprint arXiv:1801.01290.
  12. Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., & Zaremba, W. (2017). Hindsight Experience Replay. arXiv preprint arXiv:1707.01495.
  13. Jeffares, A. (2020). Supervised vs Unsupervised Learning in 2 Minutes. Towards Data Science. Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-in-2-minutes-72dad148f242
  14. Paperspace Machine Learning Wiki. (n.d.). Supervised, Unsupervised, and Reinforcement Learning. Retrieved from https://machine-learning.paperspace.com/wiki/supervised-unsupervised-and-reinforcement-learning