The theory of reinforcement learning provides a normative account[1], deeply rooted in psychological[2] and neuroscientific[3] perspectives on animal behaviour, of how agents may optimize their control of an environment.
We developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network[16] known as deep neural networks.
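As a minimal sketch (not the authors' implementation) of this combination, the snippet below uses a neural network to parameterize the action-value function Q(s, a; θ) and trains it towards the one-step Q-learning target r + γ max_a' Q(s', a'). The tiny fully connected network, optimizer settings and random toy batch are illustrative assumptions; the actual agent uses a deep convolutional network over stacked screen frames.

```python
import torch
import torch.nn as nn

n_actions, gamma = 4, 0.99

# Illustrative stand-in for the Q-network (the real agent is convolutional).
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

# One toy batch of transitions (s, a, r, s', done).
s, s_next = torch.randn(32, 8), torch.randn(32, 8)
a = torch.randint(n_actions, (32,))
r, done = torch.randn(32), torch.zeros(32)

with torch.no_grad():
    # Bootstrapped one-step target; terminal states contribute the reward only.
    y = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the actions taken
loss = nn.functional.smooth_l1_loss(q_sa, y)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```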
In additional simulations, we demonstrate the importance of the individual core components of the deep Q-network agent—the replay memory, separate target Q-network and deep convolutional network architecture—by disabling them and demonstrating the detrimental effects on performance.
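A minimal sketch of two of those components as they are commonly implemented: a replay memory of recent transitions sampled uniformly for minibatch updates, and a separate target network that is synchronized with the online Q-network only periodically. The capacity and sync interval below are illustrative assumptions, not the paper's exact hyperparameter values.

```python
import random
from collections import deque

replay_memory = deque(maxlen=100_000)   # oldest transitions are discarded first
SYNC_EVERY = 10_000                     # updates between online -> target weight copies

def store(transition):
    """Add one (s, a, r, s_next, done) tuple to the replay memory."""
    replay_memory.append(transition)

def sample_minibatch(batch_size=32):
    """Sample past transitions uniformly, breaking the strong correlation
    between consecutive frames that purely online updates would suffer from."""
    return random.sample(replay_memory, batch_size)

def maybe_sync_target(step, q_net, target_net):
    """Every SYNC_EVERY updates, freeze a copy of the online network; the
    frozen copy is used to compute the Q-learning targets, which keeps the
    target from shifting with every gradient step."""
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```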
We used the same network architecture, hyperparameter values and learning procedure throughout—taking high-dimensional data (210×160 colour video at 60 Hz) as input—to demonstrate that the approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge.
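A rough sketch, not the exact pipeline, of how raw 210×160 colour frames can be turned into the small stacked greyscale inputs a convolutional Q-network consumes. The 84×84 target size and the four-frame stack follow the Methods; the resizing library and interpolation choice here are assumptions.

```python
import numpy as np
import cv2  # OpenCV, used here only for greyscale conversion and resizing

def preprocess(frame_rgb):
    """Convert one 210x160x3 RGB frame to an 84x84 greyscale image in [0, 1]."""
    grey = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(grey, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0

def stack_frames(last_four):
    """Stack the four most recent preprocessed frames so the network can
    infer motion (e.g. object velocity) from a single input."""
    return np.stack(last_four, axis=0)   # shape (4, 84, 84)

# Example with a synthetic frame standing in for an emulator screen.
frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
state = stack_frames([preprocess(frame)] * 4)
```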
We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available[12,15].
In addition to the learned agents, we report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods).
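A small sketch of the normalization implied by that 0–100% scale: an agent scoring at the random policy's level maps to 0% and one scoring at the professional human tester's level maps to 100%. The exact formula is given in the Methods; this is the straightforward reading of it.

```python
def normalized_score(agent, human, random_play):
    """Express an agent's raw game score as a percentage of the
    human-versus-random score range for that game."""
    return 100.0 * (agent - random_play) / (human - random_play)

# e.g. an agent scoring 4000 on a game where random play scores 200
# and the human tester scores 6000:
print(normalized_score(4000, 6000, 200))   # ~65.5%
```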
We examined the representations learned by DQN that underpinned the successful performance of the agent in the context of the game Space Invaders, by using a technique developed for the visualization of high-dimensional data called ‘t-SNE’[25] (Fig. 4).
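A sketch, under assumed shapes, of the kind of analysis described here: record the activations of the network's last hidden layer for a set of game states and embed them in two dimensions with t-SNE. The random `hidden_features` array stands in for recorded activations, and scikit-learn's t-SNE is a stand-in for whatever implementation was actually used.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
hidden_features = rng.normal(size=(2000, 512))   # one row per recorded game state
state_values = rng.normal(size=2000)              # e.g. predicted V(s), for colouring

embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(hidden_features)

# `embedding` is (2000, 2); plotting it coloured by `state_values` gives the
# kind of map in which states with similar expected reward end up nearby.
```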
We found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but perceptually dissimilar (Fig. 4, bottom right, top left and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs.
We show that the representations learned by DQN are able to generalize to data generated from policies other than its own—in simulations where we presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion).
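A sketch of that generalization check, with hypothetical file names standing in for recorded data: last-hidden-layer activations collected while replaying human play and agent play are embedded jointly, so states from the two sources can be compared in the same t-SNE map.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical arrays of recorded last-hidden-layer activations.
agent_feats = np.load("agent_play_features.npy")
human_feats = np.load("human_play_features.npy")

combined = np.concatenate([agent_feats, human_feats], axis=0)
source = np.array([0] * len(agent_feats) + [1] * len(human_feats))  # 0 = agent, 1 = human

joint_embedding = TSNE(n_components=2, random_state=0).fit_transform(combined)
# Colouring the points by `source` shows whether states from human play land
# in the same regions of the map as states from the agent's own play.
```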
Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values.