DeepMind's 2015 paper on deep reinforcement learning notes that "previous attempts to combine RL with neural networks had largely failed due to unstable learning", and lists several causes of this instability related to the correlations between observations.
Specifically, can this be understood as coming from two sides: the neural network itself and the interactive nature of the learning?
Let me offer my own, admittedly shallow, take: reinforcement learning produces time series, and the samples within a single trajectory are temporally correlated, so they do not satisfy the requirement of being independent and uncorrelated, which makes training hard to converge. That is also why "experience replay" is introduced: it breaks the correlation within a training batch, which helps training.
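To make that last point concrete, here is a minimal sketch of an experience-replay buffer (the class and method names are my own, not from the DQN paper): transitions are stored in arrival order, but mini-batches are drawn uniformly at random, so consecutive steps of the same trajectory rarely end up in the same batch.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity=100_000):
        # Oldest transitions are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Consecutive steps sit side by side in the buffer, but they will be
        # mixed with much older experience when we sample.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling is what breaks the temporal correlation
        # inside a training batch.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```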
I really like this question. I think the reasons are complicated.
Sutton, in Chapter 11 of his book, describes three factors that together can make RL training unstable or even divergent: function approximation, bootstrapping, and off-policy training (the "deadly triad"). Yet since the rise of DRL, represented by DQN, DDPG, and TRPO, DRL seems to have become detached from these three factors.
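For readers who have not seen Chapter 11, here is a rough sketch (my own, not from the book) of how a single DQN-style update combines all three ingredients at once; the linear feature representation and the function name are purely illustrative, and terminal-state handling is omitted for brevity.

```python
import numpy as np

def semi_gradient_q_update(w, phi_sa, reward, phi_next_all, alpha=0.01, gamma=0.99):
    """One semi-gradient Q-learning step with a linear value function q(s, a) = w . phi(s, a)."""
    q_sa = w @ phi_sa                                                # (1) function approximation
    target = reward + gamma * max(w @ phi for phi in phi_next_all)   # (2) bootstrapping on our own estimate
    td_error = target - q_sa
    # (3) off-policy: the max over actions evaluates the greedy target policy,
    # while the data may have been collected by an epsilon-greedy behaviour policy.
    return w + alpha * td_error * phi_sa                             # "semi-gradient": the target is treated as a constant
```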
I would also add: training an RL agent inevitably involves exploration, and the exploration mechanism brings fluctuations in the reward; how large the amplitude of those fluctuations will be, though, is hard to say.
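To illustrate the remark about exploration, a tiny epsilon-greedy sketch (names are my own): the random branch is exactly what injects noise into the collected returns, so episode-reward curves fluctuate even after the value estimates have largely stabilised.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: the source of reward variance
    return int(np.argmax(q_values))               # exploit
```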
The main problem is that, as in many other fields, DNNs can be hard to train. Here, one problem is the correlation of the input data: if you think about a video game (they actually use video games to test their algorithms), you can imagine that screenshots taken one step after another are highly correlated: the game evolves "continuously". That can be a problem for NNs: doing many iterations of gradient descent on similar, correlated inputs may lead to overfitting them and/or falling into a local minimum. This is why they use experience replay: they store a series of "snapshots" of the game, shuffle them, and sample them some steps later for training. In this way, the data is no longer correlated.

They also notice that during training the Q values (predicted by the NN) can change the ongoing policy, making the agent prefer only a subset of actions and causing it to collect data that is correlated for the same reasons as before. This is why they delay training and update Q periodically: to ensure that the agent keeps exploring the game and trains on shuffled, uncorrelated data.
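The "update Q periodically" part of the answer above refers to keeping a frozen target network. A PyTorch-flavoured sketch (the helper name and update period are my own choices, not necessarily the paper's exact values):

```python
import torch.nn as nn

def maybe_sync_target(online_net: nn.Module, target_net: nn.Module,
                      step: int, update_period: int = 10_000) -> None:
    """Every `update_period` steps, copy the online Q-network into the frozen target copy.

    Bootstrap targets r + gamma * max_a Q_target(s', a) are then computed with the
    frozen copy, so the regression target does not drift at every gradient step.
    """
    if step % update_period == 0:
        target_net.load_state_dict(online_net.state_dict())
```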
https://stats.stackexchange.com/questions/265964/why-is-deep-reinforcement-learning-unstable
https://stackoverflow.com/questions/52770780/why-is-my-deep-q-net-and-double-deep-q-net-unstable