Can research carried out on smaller computational budgets yield valuable scientific insights? Given the staggering training times and budgets involved, it is natural to wonder whether anything worthwhile in AI can be achieved on the cheap. So far, attention has focused on the training costs of language models, which have grown enormous. But what about deep reinforcement learning (RL) algorithms, the brains behind self-driving cars, warehouse robots and even AI that beats chess grandmasters?
Deep RL combines RL with deep learning. It made a splash in 2015 when Alphabet's DeepMind published their work on Deep Q-Networks (DQN). When tested on Atari 2600 games, the DQN agent surpassed the performance of all previous algorithms and reached a level comparable to that of a professional human games tester.
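To make the DQN idea concrete, the agent trains a neural network to predict action values against a bootstrapped target. The sketch below is a minimal PyTorch-style version of that loss, assuming hypothetical `q_net`, `target_net` and `batch` objects; it is illustrative, not DeepMind's original implementation:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One step of the standard DQN objective (illustrative sketch only).

    `batch` is assumed to hold tensors: states, actions, rewards,
    next_states and dones; both networks map states to per-action Q-values.
    """
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), cut off at episode end.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # DQN traditionally uses the Huber (smooth L1) loss.
    return F.smooth_l1_loss(q_values, targets)
```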
However, according to Google researchers, the progress of deep RL has come at a cost: a computational one. Over the years, the original DQN algorithm has been tweaked to beat the Arcade Learning Environment (ALE) benchmark, which is widely used as an interface for benchmarking deep RL models on Atari games. The Rainbow algorithm is one such improvement, and it helped push the DQN paradigm to state-of-the-art performance. However, Rainbow is extremely heavy in terms of computation.
Rainbow was first introduced in 2018. The experiments reportedly required a large research-lab setup, as it took roughly five days to fully train using specialised hardware such as the NVIDIA Tesla P100 GPU. According to Google researchers, proving Rainbow’s superiority required approximately 34,200 GPU hours (roughly 1,425 days on a single GPU). Moreover, this cost does not include the hyper-parameter tuning that was necessary to optimise the various components. “Considering that the cost of a Tesla P100 GPU is around $6,000, providing this evidence will take an unreasonably long time as it is prohibitively expensive to have multiple GPUs in a typical academic lab so they can be used in parallel,” according to the Google researchers.
In their work titled “Revisiting Rainbow”, the researchers at Google tried to answer the following questions:
Would state-of-the-art performance on the ALE have been possible with smaller-scale experiments, unlike those used for Rainbow back in 2018?
How good are these algorithms in non-ALE environments?
Is there scientific value in conducting empirical research in reinforcement learning when restricting oneself to small- to mid-scale environments?
(Image credits: Google AI blog)
To demonstrate the effectiveness of small- to mid-scale experiments, the researchers evaluated a set of four classic control environments, shown above. Agents on these environments, according to the researchers, can be fully trained within 10-20 minutes (compared to five days for ALE games); a minimal sketch of loading them follows the list below:
CartPole: Here the agent is tasked to balance a pole on a cart that the agent can move left and right.
Acrobot: The agent has to apply force to the joint between the two arms in order to raise the lower arm above a threshold.
LunarLander: The agent is meant to land the spaceship between the two flags.
MountainCar: The agent must build up momentum between two hills to drive to the top of the rightmost hill.
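All four tasks ship with OpenAI Gym, so spinning them up takes one line per environment. The sketch below uses the classic Gym API with a random policy purely to show the interaction loop; it is not the paper's training code, and LunarLander additionally needs Gym's Box2D extra:

```python
import gym

# The four classic control tasks used in the study, under their Gym IDs.
ENV_IDS = ["CartPole-v1", "Acrobot-v1", "LunarLander-v2", "MountainCar-v0"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        # A random policy, just to exercise the environment; a DQN agent
        # would pick actions from its Q-network here instead.
        obs, reward, done, info = env.step(env.action_space.sample())
        total_reward += reward
    print(f"{env_id}: episode return {total_reward:.1f}")
    env.close()
```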
(Image credits: Paper by Castro et al.)
In their experiments, the researchers progressively added double Q-learning, prioritized experience replay, dueling networks, multi-step learning, distributional RL and other components to the DQN agent, while also removing individual components from the full Rainbow algorithm. They found that certain combinations of these components performed on par with the full Rainbow agent.
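One way to picture this ablation is as a set of boolean switches on a single agent: start from plain DQN and turn components on one at a time, or start from full Rainbow and turn them off. The sketch below is a hypothetical configuration pattern, not the paper's actual code; the flag names and the `build_agent`/`train` placeholders are purely illustrative:

```python
from dataclasses import dataclass, replace

# Hypothetical flags for the Rainbow components discussed in the article.
@dataclass(frozen=True)
class AgentConfig:
    double_q: bool = False
    prioritized_replay: bool = False
    dueling: bool = False
    multi_step: bool = False
    distributional: bool = False
    noisy_nets: bool = False

DQN = AgentConfig()  # plain DQN: every extension switched off
RAINBOW = AgentConfig(True, True, True, True, True, True)  # everything on

def additions_to_dqn():
    """Yield configs where one component at a time is added to DQN."""
    for name in DQN.__dataclass_fields__:
        yield name, replace(DQN, **{name: True})

def ablations_from_rainbow():
    """Yield configs where one component at a time is removed from Rainbow."""
    for name in RAINBOW.__dataclass_fields__:
        yield name, replace(RAINBOW, **{name: False})

# A study would train one agent per config on the four small environments
# and compare learning curves, e.g.:
# for name, cfg in additions_to_dqn():
#     train(build_agent(cfg), env_id="CartPole-v1")  # build_agent/train are placeholders
```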
The original loss functions and optimisers were also examined in these experiments. The Huber loss and the RMSProp optimiser are commonly used when developing DQN models. The researchers additionally ran these experiments with the Adam optimiser and the mean squared error (MSE) loss. The results show that Adam+MSE is a superior combination to RMSProp+Huber.
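In PyTorch terms, the two combinations boil down to a choice of optimiser and loss function. The helper below is a minimal sketch assuming a `q_net` like the one sketched earlier; the learning rate is an illustrative default, not a value from the paper:

```python
import torch
import torch.nn as nn

def make_training_setup(q_net, variant="adam_mse", lr=1e-4):
    """Return (optimiser, loss_fn) for the two combinations discussed above.

    The learning rate here is illustrative, not taken from the paper.
    """
    if variant == "adam_mse":
        optimiser = torch.optim.Adam(q_net.parameters(), lr=lr)
        loss_fn = nn.MSELoss()          # mean squared error
    elif variant == "rmsprop_huber":
        optimiser = torch.optim.RMSprop(q_net.parameters(), lr=lr)
        loss_fn = nn.SmoothL1Loss()     # Huber loss (smooth L1 in PyTorch)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return optimiser, loss_fn
```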
The researchers were able to reproduce the results of the original Rainbow paper on a limited computational budget and even uncovered new and interesting phenomena, making a strong case for the relevance and significance of empirical research on small- and medium-scale environments. They believe that these less computationally intensive environments lend themselves well to a more critical and thorough analysis of the performance, behaviours, and intricacies of new algorithms.