Reinforcement learning powers DeepMind’s MuZero, AlphaStar and Agent57, while imitation learning is at the heart of Waymo’s self-driving cars. But what exactly are these two training methods, and how do they stack up against each other? Let’s find out.
What is reinforcement learning?
Reinforcement learning refers to how a model learns to perform a task through repeated trial-and-error interactions with a dynamic environment. The system learns to make decisions based on a reward function, without human intervention or explicit programming. RL is considered a viable path to AGI as it does not depend on historical datasets. To that end, tech companies like Facebook, Google, DeepMind, Amazon, and Microsoft have committed substantial resources to pushing the frontiers of RL. The goal of RL is to learn an optimal policy that maximises the long-term cumulative reward.
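To make the idea concrete, here is a minimal sketch of tabular Q-learning, one classic RL algorithm, on a toy gridworld. The environment, reward and hyperparameters are illustrative assumptions, not drawn from any of the systems mentioned above:

```python
import numpy as np

# A toy 1-D gridworld: states 0..4, the agent starts at state 0 and
# receives a reward of +1 only on reaching state 4 (a sparse reward).
N_STATES, N_ACTIONS = 5, 2   # actions: 0 = left, 1 = right
GOAL = N_STATES - 1

def step(state, action):
    """Move left or right; reward is given only at the goal."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Q-learning: learn action values purely from trial-and-error interaction.
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the current values, occasionally explore.
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done = step(state, action)
        # Update the value towards the reward plus discounted future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

# The greedy policy should now head right, towards the goal.
print(np.argmax(Q, axis=1))   # expected: mostly 1s (move right)
```

No demonstrations or labelled data appear anywhere in this loop; the policy emerges entirely from the reward signal.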
What is imitation learning?
Imitation learning is a training method where the computer imitates human behaviour. In IL, instead of a reward function, an expert, usually a human, provides the agent with a set of demonstrations. The agent then tries to learn the optimal policy by following and imitating the expert’s decisions. In effect, the agent learns a mapping between observations and actions from the demonstrations.
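The simplest form of this is behavioural cloning: plain supervised learning on the expert’s observation-action pairs. A minimal sketch, using synthetic demonstrations invented purely for illustration:

```python
import numpy as np

# Synthetic expert demonstrations: observations (e.g. lane offset and
# heading) paired with the expert's action (e.g. a steering command).
rng = np.random.default_rng(0)
observations = rng.normal(size=(1000, 2))               # (lane_offset, heading)
expert_actions = observations @ np.array([-0.8, -0.3])  # expert's (unknown) rule
expert_actions += rng.normal(scale=0.05, size=1000)     # demonstration noise

# Behavioural cloning = supervised learning on the demonstrations:
# fit a policy that maps observations to the expert's actions.
weights, *_ = np.linalg.lstsq(observations, expert_actions, rcond=None)

def policy(obs):
    """Imitated policy: predicts the action the expert would take."""
    return obs @ weights

# The learned policy should closely reproduce the expert's decisions.
test_obs = np.array([0.5, -0.2])
print(policy(test_obs))   # ~ -0.34, matching the expert's -0.8*0.5 + -0.3*(-0.2)
```

Note the contrast with the RL sketch: there is no reward signal here, only labelled demonstrations, and the policy can only be as good as the expert it copies.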
Benefits
Reinforcement learning does not need large datasets or historical data to train the agent. Hence, RL bypasses the challenges of data labelling and the pitfalls of biased and incorrect data. The method allows the agent to be innovative and design solutions humans may not have thought of, furthering its adaptability.
Imitation learning doesn’t face training issues such as the absence of a reward function or the need for explicit programming. Research shows generative adversarial imitation learning has ‘tremendous effectiveness, especially when paired with neural network parameterisation’ in some use cases.
Limitations
RL comes with its own set of challenges. Agents can be very hard to train in environments with sparse or no rewards. RL is also sample-inefficient: the system needs a huge number of interactions before it performs well. For instance, DeepMind’s AlphaGo Zero played nearly five million games against itself before surpassing the version of AlphaGo that beat the world champion. The lack of reproducibility and agents not performing well in real-life scenarios are other major limitations.
Imitation learning is inherently data-driven, which means a model built on biased historical data can inherit those biases and pose problems. IL also doesn’t generalise well because the demonstrations it is fed cover only a small sample of all possible situations. This is partly why models like GPT-3, with its billions of parameters, can still go rogue.
Learning efficiency
Since reinforcement learning is based on a reward mechanism, the trainer has to define the reward rules. RL works best when the action space of the model is different from that of the expert, allowing the model to learn and innovate based on the problem. However, given the sparse nature of rewards and the constant learning and re-learning, reinforcement learning requires a large number of training episodes.
Imitation learning is efficient when the action spaces of the model and the trainer overlap. For instance, in a self-driving scenario, the action space of the model and the human driver consists of the same brake, steering and accelerator controls. Therefore, imitation learning requires far fewer training episodes.
Use cases
Reinforcement learning is used for text summarisation, chatbots, self-driving cars, online stock trading, automating data centre cooling, and recommendation systems. It is also used in games like Pac-Man. DeepMind’s AlphaGo Zero is another example, where the model learns to play Go from scratch by playing against itself.
One of the first self-driving cars, ALVINN, is a classic example of imitation learning. The vehicle was fitted with sensors, and its neural network learned to map the sensor inputs to steering angles in order to drive autonomously. Today, companies like Tesla and Waymo leverage imitation learning for their self-driving cars. DeepMind has also leveraged the technique in its MIA model.