Deep Q Networks (DQN) With the Cartpole Environment
This article gives a brief explanation of the DQN algorithm for reinforcement learning, focusing on the Cartpole-v1 environment from OpenAI gym.
This article explores the topic of reinforcement learning (RL), giving a brief introduction, before diving into how to use the Deep Q Network (DQN) for RL and applying it to the cartpole problem.
Table of Contents
Deep Reinforcement Learning
OpenAI's gym and The Cartpole Environment
Results of Applying DQN to the Cartpole problem
Hyperparameter Sweep
Conclusion
Deep Reinforcement Learning
Reinforcement learning (RL) is a technique to create a self-learning agent (often represented using a neural net) that finds the best way to interact with an environment. This interaction comes in the form of some set of actions and for every action, the environment may confer a reward on the agent (e.g. +1 for winning a game and -1 for losing).
Video games are often a good abstraction for studying RL. The discipline itself is relatively old but saw a resurgence in 2015, when deep RL agents were shown to perform better than humans on many Atari games.

One of the core algorithms for RL is Q-learning. The goal of Q-learning is to construct a function that, given the current state of the agent, predicts the total reward the agent can expect from each of the available actions (these predictions are called Q values).
The function may be a simple table, or it may be a complicated neural network (often referred to as a Deep Q Network, or DQN for short). Once this function is learned, the best way for the agent to act is simply to pick the action with the largest predicted Q value.
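To make this concrete, here is a minimal sketch of what such a network could look like for CartPole, assuming a PyTorch setup with a 4-dimensional state and 2 actions (the layer sizes are illustrative, not the exact ones used in the stream):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """A small fully connected network mapping a state to one Q value per action."""

    def __init__(self, state_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Returns a tensor of shape (batch, n_actions) with the estimated Q values.
        return self.net(state)
```

Acting greedily then just means taking the argmax over the network's output for the current state.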

This post is written in parallel with some of my ongoing livestreams about implementing Deep Q-learning with PyTorch. You can find the latest one below.


The Q function is learned using the Bellman equation, which asserts that the Q value of the current state and action is equal to the immediate reward plus the (discounted) maximum Q value that can be expected from the next state. This recursive relationship allows the Q function to be learned starting from a random initialization.
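In code, the Bellman equation becomes a one-step target that the online network is trained to match. The sketch below assumes PyTorch tensors for a batch of transitions and a separate target network (more on that later); the variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def td_targets(rewards, next_states, dones, target_net, gamma=0.99):
    """One-step Bellman targets: r + gamma * max_a' Q_target(s', a') for non-terminal states."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    # Terminal transitions get no future reward.
    return rewards + gamma * next_q * (1.0 - dones)

# Illustrative loss for a batch (states, actions, rewards, next_states, dones):
# the online network's Q value for the action taken should move toward the target.
# q_taken = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
# loss = F.mse_loss(q_taken, td_targets(rewards, next_states, dones, target_net))
```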

There is one more intricacy to consider: exploration. Since the agent doesn't start with any knowledge of how to act, it needs some way to keep exploring the environment while it learns. One simple (and reasonably effective) approach is to start off acting completely randomly to generate the data to train on. Over time, the rate of random actions is decreased and the learned function is used more and more instead (exploitation). This strategy is called epsilon-greedy.
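A sketch of epsilon-greedy action selection might look like the following (again assuming the PyTorch network from above; the names are illustrative):

```python
import random
import torch

def select_action(state, online_net, epsilon, n_actions=2):
    """Random action with probability epsilon, otherwise the greedy (argmax Q) action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_values = online_net(state.unsqueeze(0))  # add a batch dimension
    return int(q_values.argmax(dim=1).item())
```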

OpenAI's gym and The Cartpole Environment
The OpenAI gym is an API built to make environment simulation and interaction for reinforcement learning simple. It also contains a number of built-in environments (Atari games, classic control problems, and more).
One such classic control problem is cartpole, in which a cart carrying an inverted pendulum needs to be controlled such that the pendulum stays upright. The reward mechanics are described on the gym page for this environment.

https://gym.openai.com/envs/CartPole-v1/
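A minimal interaction loop with the CartPole environment could look like this. Note that the exact return signatures depend on the gym version; the sketch below assumes the newer API (gym >= 0.26), where reset() returns (obs, info) and step() returns a five-tuple:

```python
import gym

env = gym.make("CartPole-v1")

obs, info = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random policy, just to show the interaction loop
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode finished with total reward {total_reward}")
env.close()
```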
Results of Applying DQN to the Cartpole problem
As part of the first stream in the series mentioned above, I put together a model for solving the cartpole environment and got it training with a set of parameters that felt right from past experience. Usually, a learning rate somewhere between 1e-3 and 1e-4 tends to work well. I set the epsilon decay factor so that epsilon would reach its minimum of 5% by about half a million steps.
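As a rough illustration of how such a decay factor can be derived (the exact schedule used in the stream may differ), a multiplicative per-step decay that reaches 5% after roughly half a million steps works out to about 0.999994:

```python
# Solving eps_start * decay**N = eps_min for the per-step decay factor.
eps_start, eps_min, decay_steps = 1.0, 0.05, 500_000
decay = (eps_min / eps_start) ** (1.0 / decay_steps)  # ~0.999994

epsilon = eps_start
for step in range(decay_steps):
    epsilon = max(eps_min, epsilon * decay)
```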
One thing I haven't mentioned so far is the target model. DQN tends to be very unstable unless you use two copies of the same model: an online network that is updated every batch, and a target network that is updated only rarely by copying over the online network's weights. How often that copy happens becomes another hyperparameter. I set it to 5,000 epochs (which equals 50,000 steps in the environment).
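A sketch of this two-network setup, assuming the online network is a PyTorch module like the one above:

```python
import copy
import torch.nn as nn

def make_target(online_net: nn.Module) -> nn.Module:
    """Create a frozen copy of the online network to serve as the target network."""
    target_net = copy.deepcopy(online_net)
    target_net.eval()
    return target_net

def maybe_sync_target(online_net: nn.Module, target_net: nn.Module,
                      batch_idx: int, every: int = 5_000):
    """Copy the online weights into the target network every `every` training batches."""
    if batch_idx % every == 0:
        target_net.load_state_dict(online_net.state_dict())
```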


Hyperparameter Sweep
Hyperparameters are a particular pain point in reinforcement learning, even more so than in supervised deep learning, since it can take a long time before any signs of progress show up. By running a hyperparameter sweep in W&B, we can test how well the parameters in the example above were chosen.
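A sweep like this can be defined directly in Python. The configuration below is only illustrative: the parameter names, ranges, and project name are placeholders rather than the exact sweep behind these results, and `train` stands in for the training entry point:

```python
import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "test_score", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [1e-3, 3e-4, 1e-4]},
        "epsilon_decay": {"values": [0.99999, 0.999994, 0.999999]},
        "min_epsilon": {"values": [0.01, 0.05, 0.1]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="dqn-cartpole")
wandb.agent(sweep_id, function=train, count=20)  # `train` is a placeholder for the training loop
```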
The epsilon decay factor seems to be the most important parameter here. Increasing the learning rate also appears to hurt the test score. The minimum epsilon appears to be more forgiving, but it's important to note that this can be very environment specific: in a game where a single wrong move can mean the end, a large minimum epsilon can effectively cap performance.

Conclusion
Reinforcement learning is a very interesting idea, and in the past few years, it has become even more powerful through the use of deep learning and modern hardware. DQN makes for a relatively pain-free starting point for beginners who can focus on the simpler environments in OpenAI's gym.