In this introductory tutorial, we'll apply reinforcement learning (RL) to train an agent to solve the 'Taxi' environment from OpenAI Gym. We'll cover:
A basic introduction to RL
Setting up OpenAI Gym & Taxi
Step-by-step tutorial on how to train a Taxi agent in Python3 using RL
Before we start, what's 'Taxi'?
Taxi is one of many environments available on OpenAI Gym. These environments are used to develop and benchmark reinforcement learning algorithms.
The goal of Taxi is to pick up passengers and drop them off at their destination in the fewest possible moves. In this tutorial, you'll start with a taxi agent that takes actions randomly:
…and train the agent to be a better taxi driver using reinforcement learning:
💡 An introduction to Reinforcement Learning
Think about how you might teach a dog a new trick, like telling it to sit:
If it performs the trick correctly (it sits), you'll reward it with a treat (positive feedback) ✔️
If it doesn't sit correctly, it doesn't get a treat (negative feedback) ❌
By continuing to do things that lead to positive outcomes, the dog will learn to sit when it hears the command in order to get its treat. Reinforcement learning is a subdomain of machine learning in which we train an 'agent' (the dog) to learn the correct sequence of actions to take in its environment (sitting in response to the command 'sit') in order to maximise its reward (getting a treat). This can be illustrated more formally as:
Source: Sutton & Barto
🏋️ Installing OpenAI Gym and Taxi
We'll be using the 'Taxi-v3' environment for this tutorial.
You'll need to install:
Python 3
OpenAI Gym (pip install gym)
NumPy (pip install numpy)
The following snippet will import the necessary packages, and create the Taxi environment:
```python
import numpy as np
import gym
import random

# create Taxi environment
env = gym.make('Taxi-v3')
```
🎲 Random Agent
We'll start by implementing an agent that doesn't learn at all. Instead, it will sample actions at random. This will be our baseline.
The first step is to give our agent an initial state of its environment. A state is how our agent will observe its environment. In Taxi, a state defines the current positions of the taxi, passenger, and pick-up and drop-off locations. Below are examples of three different states for Taxi:
Note: Yellow = taxi, Blue letter = pickup location, Purple letter = drop-off destination
To get the initial state:
```python
# create a new instance of taxi, and get the initial state
state = env.reset()
```
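If you're curious what a 'state' actually looks like here: with the classic Gym API used in this tutorial, env.reset() returns a single integer that indexes into Taxi's 500 possible states, and there are 6 possible actions. A quick check:

```python
# a Taxi state is just an integer encoding the positions of the taxi,
# the passenger, and the destination
print(state)                     # e.g. 328 (the exact number varies each reset)
print(env.observation_space.n)   # 500 possible states
print(env.action_space.n)        # 6 possible actions
```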
Next, we'll run a for-loop to cycle through the game. At each iteration, our agent will:
Take a random action from the action space (0 - south, 1 - north, 2 - east, 3 - west, 4 - pick-up, 5 - drop-off)
Receive the new state
Here's our random agent script:
```python
import gym
import numpy as np
import random

# create Taxi environment
env = gym.make('Taxi-v3')

# create a new instance of taxi, and get the initial state
state = env.reset()

num_steps = 99
for s in range(num_steps+1):
    print(f"step: {s} out of {num_steps}")

    # sample a random action from the list of available actions
    action = env.action_space.sample()

    # perform this action on the environment
    env.step(action)

    # print the new state
    env.render()

# end this instance of the taxi environment
env.close()
```
You can run this and watch your agent make random moves. Not super exciting, but hopefully this helped you get familiar with the OpenAI Gym toolkit.
Next, we'll implement the Q-learning algorithm that will enable our agent to learn from rewards.
📖 Q-Learning Agent
Q-learning is a reinforcement learning algorithm that seeks to find the best possible next action given the agent's current state, in order to maximise the reward it receives (the 'Q' in Q-learning stands for quality - i.e. how valuable an action is).
Let's take the following starting state:
Which action (south, north, east, west, pick-up or drop-off) should it take in order to maximise its reward? (Note: blue = pick-up location and purple = drop-off destination)
First, let's take a look at how our agent is 'rewarded' for its actions. Remember in reinforcement learning, we want our agent to take actions that will maximise the possible rewards it receives from its environment.
'Taxi' reward system
According to the Taxi documentation:
"…you receive +20 points for a successful drop-off, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions."
Looking back at our original state, the possible actions it can take and the corresponding rewards it will receive are shown below:
In the image above, the agent loses 1 point per timestep it takes. It will also lose 10 points if it uses the pick-up or drop-off action here.
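If you'd like to see this reward signal for yourself, you can step the environment once and print the reward it returns (this assumes the classic four-value Gym step API used throughout this tutorial):

```python
import gym

env = gym.make('Taxi-v3')
state = env.reset()

# action 4 is pick-up
new_state, reward, done, info = env.step(4)
print(reward)   # -10 if the pick-up is illegal here, -1 if the taxi happens to start on the passenger
```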
We want our agent to go North towards the pick-up location denoted by a blue R - but how will it know which action to take if all of the movement actions are equally punishing?
Exploration
Our agent currently has no way of knowing which action will lead it closest to the blue R. This is where trial-and-error comes in - we'll have our agent take random actions, and observe what rewards it gets (i.e. our agent will explore).
Over many iterations, our agent will observe that certain sequences of actions are more rewarding than others. Along the way, our agent will need to keep track of which actions led to which rewards.
Introducing… Q-tables
A Q-table is simply a look-up table storing the maximum expected future reward for each action in each state (these values are known as Q-values). It tells our agent that when it encounters a certain state, some actions are more likely than others to lead to higher rewards. It becomes a 'cheatsheet' telling our agent the best action to take in each state.
The image below illustrates what our 'Q-table' will look like:
Each row corresponds to a unique state in the 'Taxi' environment
Each column corresponds to an action our agent can take
Each cell corresponds to the Q-value for that state-action pair - a higher Q-value means a higher maximum reward our agent can expect to get if it takes that action in that state.
Before we begin training our agent, we'll need to initialize our Q-table like so:
```python
state_size = env.observation_space.n   # total number of states (S)
action_size = env.action_space.n       # total number of actions (A)

# initialize a qtable with 0's for all Q-values
qtable = np.zeros((state_size, action_size))
```
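As a quick sanity check, the table has one row per state and one column per action, and indexing a row gives the current Q-value estimates for the six actions in that state:

```python
print(qtable.shape)       # (500, 6) - one row per state, one column per action
print(qtable[state, :])   # the six Q-values for the current state (all 0.0 before training)
```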
As our agent explores, it will update the Q-table with the Q-values it finds. To calculate our Q-values, we'll introduce the Q-learning algorithm.
Q-Learning Algorithm
The Q-learning algorithm is given below. We won't go into details, but you can read more about it in Ch 6 of Sutton & Barto (2018).
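Written out, the update rule (it also appears as a comment in our implementation further below) is:

Q(s, a) := Q(s, a) + α * (reward + γ * max Q(s', a') − Q(s, a))

where s' is the next state, a' ranges over the actions available in that state, and α and γ are the learning rate and discount factor described below.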
The Q-learning algorithm will help our agent update the current Q-value, Q(St, At), with its observations after taking an action - that is, increase Q if the action led to a positive reward, or decrease Q if it led to a negative one.
Note that in Taxi, our agent doesn't receive a positive reward until it successfully drops off a passenger (+20 points). Hence even if our agent is heading in the correct direction, there will be a delay in the positive reward it should receive. The following term in the Q-learning equation addresses this:
This term adjusts our current Q-value to include a portion of the rewards it may receive sometime in the future (from state St+1 onwards). The 'a' term refers to all the possible actions available in that state. The equation also contains two hyperparameters which we can specify:
Learning rate (α): how easily the agent should accept new information over previously learnt information
Discount factor (γ): how much the agent should take into consideration the rewards it could receive in the future versus its immediate reward
Here's our implementation of the Q-learning algorithm:
```python
# hyperparameters to tune
learning_rate = 0.9
discount_rate = 0.8

# Q-learning algorithm: Q(s,a) := Q(s,a) + learning_rate * (reward + discount_rate * max Q(s',a') - Q(s,a))
qtable[state, action] += learning_rate * (reward + discount_rate * np.max(qtable[new_state, :]) - qtable[state, action])
```
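As a quick worked example (with illustrative numbers, not taken from an actual run): early in training all Q-values are still 0, so if the agent takes a movement action and receives a reward of -1, the update with these hyperparameters works out to:

```python
learning_rate = 0.9
discount_rate = 0.8

current_q = 0.0     # Q(s, a) before the update
reward = -1         # one timestep penalty for a movement action
max_next_q = 0.0    # max Q(s', a'), still 0 early in training

new_q = current_q + learning_rate * (reward + discount_rate * max_next_q - current_q)
print(new_q)        # -0.9
```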
Exploration vs Exploitation Trade-off
We can let our agent explore to update our Q-table using the Q-learning algorithm. As our agent learns more about the environment, we can let it use this knowledge to take more optimal actions and converge faster - known as exploitation.
During exploitation, our agent will look at its Q-table and select the action with the highest Q-value (instead of a random action). Over time, our agent will need to explore less, and start exploiting what it knows instead.
Here's our implementation of an exploration-exploitation strategy:
```python
# exploration-exploitation tradeoff
epsilon = 1.0        # probability that our agent will explore
decay_rate = 0.01    # decay rate of epsilon

if random.uniform(0, 1) < epsilon:
    # explore
    action = env.action_space.sample()
else:
    # exploit
    action = np.argmax(qtable[state, :])

# epsilon decreases exponentially --> our agent will explore less and less
epsilon = np.exp(-decay_rate * episode)
```
In the example above, we set some value epsilon between 0 and 1. If epsilon is 0.7, there is a 70% chance that on this step our agent will explore instead of exploit. epsilon decays exponentially with each episode, so our agent explores less and less over time.
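To get a feel for how fast epsilon shrinks with a decay_rate of 0.01, you can evaluate the decay curve at a few episode numbers:

```python
import numpy as np

decay_rate = 0.01
for episode in [0, 10, 100, 300, 500]:
    print(episode, round(np.exp(-decay_rate * episode), 3))
# 0 1.0
# 10 0.905
# 100 0.368
# 300 0.05
# 500 0.007
```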
Bringing it all together
We're done with all the building blocks needed for our reinforcement learning agent. The process for training our agent will look like:
Initialise our Q-table with 0's for all Q-values
Let our agent play Taxi over a large number of episodes
Continuously update the Q-table using the Q-learning algorithm and an exploration-exploitation strategy
Here's the full implementation:
```python
import numpy as np
import gym
import random

def main():

    # create Taxi environment
    env = gym.make('Taxi-v3')

    # initialize q-table
    state_size = env.observation_space.n
    action_size = env.action_space.n
    qtable = np.zeros((state_size, action_size))

    # hyperparameters
    learning_rate = 0.9
    discount_rate = 0.8
    epsilon = 1.0
    decay_rate = 0.005

    # training variables
    num_episodes = 1000
    max_steps = 99  # per episode

    # training
    for episode in range(num_episodes):

        # reset the environment
        state = env.reset()
        done = False

        for s in range(max_steps):

            # exploration-exploitation tradeoff
            if random.uniform(0, 1) < epsilon:
                # explore
                action = env.action_space.sample()
            else:
                # exploit
                action = np.argmax(qtable[state, :])

            # take action and observe reward
            new_state, reward, done, info = env.step(action)

            # Q-learning algorithm
            qtable[state, action] = qtable[state, action] + learning_rate * (reward + discount_rate * np.max(qtable[new_state, :]) - qtable[state, action])

            # Update to our new state
            state = new_state

            # if done, finish episode
            if done:
                break

        # Decrease epsilon
        epsilon = np.exp(-decay_rate * episode)

    print(f"Training completed over {num_episodes} episodes")
    input("Press Enter to watch trained agent...")

    # watch trained agent
    state = env.reset()
    done = False
    rewards = 0

    for s in range(max_steps):

        print("TRAINED AGENT")
        print("Step {}".format(s + 1))

        action = np.argmax(qtable[state, :])
        new_state, reward, done, info = env.step(action)
        rewards += reward

        env.render()
        print(f"score: {rewards}")
        state = new_state

        if done:
            break

    env.close()

if __name__ == "__main__":
    main()
```
👏 What's next?
There are many other environments available on OpenAI Gym for you to try (e.g. Frozen Lake). You can also try optimising the implementation above to solve Taxi in fewer episodes.
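One way to approach that is to measure how well a trained agent performs on average, then compare runs with different hyperparameters or episode counts. Here's a minimal sketch of such an evaluation loop; it assumes env, qtable, and np already exist (as in the full implementation above), and num_eval_episodes is just an illustrative choice:

```python
# minimal evaluation sketch - assumes `env`, `qtable`, and `np` are already
# defined as in the full implementation above
num_eval_episodes = 100   # illustrative choice
total_reward = 0

for episode in range(num_eval_episodes):
    state = env.reset()
    done = False
    for s in range(99):
        action = np.argmax(qtable[state, :])          # always exploit during evaluation
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break

print(f"average reward per episode: {total_reward / num_eval_episodes}")
```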
Below are some other useful resources to check out.
Lectures and further reading
Tutorials
Reinforcement learning project ideas
P.S. If you enjoyed this article, check out Bomberland: an open machine learning challenge for the community.