Train an AI Agent to play Mountain Car with Reinforcement Learning
A few weeks ago I was chatting with a friend who is just getting into reinforcement learning. He asked me for some resources to help him get started, so naturally I pointed him to the classic RL playground Gymnasium (formerly known as OpenAI Gym), which I had a lot of fun solving when I first started learning.
This also reminded me of how rusty my reinforcement learning has become, so I decided to challenge myself to complete these puzzles once again.
This post will showcase how to train an AI agent to play the Mountain Car game using Reinforcement Learning.
Mountain Car
I’m starting with the Mountain Car environment here, as it is one of the easiest games in the package and has dead simple controls: accelerate left, do nothing, or accelerate right. You start with a car somewhere near the bottom of the valley, and the goal is to drive it all the way up to the right peak where the yellow flag stands. The tricky part of this game is that the car does not have enough power to accelerate all the way uphill from a standstill. The player has to “swing” the car back and forth in the valley to gather enough momentum for the climb.
The code below lets you play the game manually, using j (move left) and k (move right) as controls.
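The original interactive listing is not reproduced here, so below is a minimal stand-in that reads a key from the terminal on each step (type a key and press Enter, a simplification of real keypress handling). The gymnasium calls are the standard API; everything else is illustrative.

```python
import gymnasium as gym

# Minimal manual-control sketch (not the original listing):
# type j or k and press Enter to push the car left or right; anything else coasts.
env = gym.make("MountainCar-v0", render_mode="human")
obs, info = env.reset()
while True:
    key = input("j = push left, k = push right, Enter = coast, q = quit > ").strip()
    if key == "q":
        break
    action = {"j": 0, "k": 2}.get(key, 1)  # MountainCar-v0 actions: 0 left, 1 no-op, 2 right
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```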
To train an AI agent to solve this game, we must first model the environment in mathematical terms.
Markov Decision Process
We’ll model the Mountain Car puzzle as a Markov Decision Process (MDP). A Markov Decision Process is made up of States, Actions, Transition Probabilities, Rewards, and a Policy; brief definitions below:
- States ($S$): Current situation of the agent (e.g. position, velocity, etc.).
- Actions ($A$): Set of all possible actions of the agent, allowing the agent to affect its current state.
- Transition Probability ($P(s, a, s^\prime)$): The probability of the agent transitioning from state $s$ to state $s^\prime$ by taking action $a$.
- Rewards ($R(s, a, s^\prime)$): The cost/gain for taking an action $a$ from state $s$ and ending up in state $s^\prime$.
- Policy ($\pi(s)$): The “decision-making” algorithm that decides what action to take given a specific state.
At the core of the Markov Decision Process is the Markov Property. The Markov Property states that the next state of the agent depends only on its current state, not on any of the history before it. This implies that the AI agent should be able to choose the optimal action based only on its current state (which is why the optimal policy can be written as $\pi(s)$).
Here is an example MDP of the average life of a college student.
We can visualize an MDP by drawing out the states, the available actions, and the probability of landing in each other state after taking an action. Using the sample MDP in the figure above, we see that as a broke college student, a trip to the casino gives you a 1% chance of becoming filthy rich, a 29% chance of breaking even (still a broke college student), and a whopping 70% chance of ending up homeless and broke.
Jokes aside, to “solve” the MDP, and thereby find the optimal policy $\pi(s)$, we need to use the Bellman Equation.
To implement an MDP in code, we usually start by creating the states of the MDP. In the case of Mountain Car, there are two observable values from the environment: the horizontal position $x$ and the car velocity $v$. An easy way to do this in Python would be to divide $x$ and $v$ into multiple “bins”, and create a state for every possible combination of $x$ and $v$ bins.
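The article’s excerpt is not reproduced here; the sketch below shows one way to build the discretised state list just described. The bin counts are the ones used later in the article, the observation bounds are those of MountainCar-v0, and the variable names are my own.

```python
import numpy as np
from itertools import product

# Discretise the two observables into bins.
x_bin, vel_bin = 18, 10
x_edges = np.linspace(-1.2, 0.6, x_bin + 1)        # MountainCar-v0 position range
vel_edges = np.linspace(-0.07, 0.07, vel_bin + 1)  # MountainCar-v0 velocity range

# A state is one (x interval, v interval) pair; enumerate every combination.
x_intervals = [(float(lo), float(hi)) for lo, hi in zip(x_edges[:-1], x_edges[1:])]
vel_intervals = [(float(lo), float(hi)) for lo, hi in zip(vel_edges[:-1], vel_edges[1:])]
states = list(product(x_intervals, vel_intervals))  # x_bin * vel_bin = 180 states
```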
This step comes from the full Python implementation located at the bottom of the article: it first bins x and v into several intervals and then creates a list of all possible combinations of the x and v intervals as states.
The resulting list of states looks something like this:
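The original output is not shown here; with the sketch above, the first few entries would look roughly like this (values rounded for readability):

```python
>>> states[:3]
[((-1.2, -1.1), (-0.07, -0.056)),
 ((-1.2, -1.1), (-0.056, -0.042)),
 ((-1.2, -1.1), (-0.042, -0.028))]
```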
Next, we have to compute the transition probabilities $P$ and the corresponding rewards $R$. The most straightforward way is the Monte Carlo method, where we try out every possible action from every possible state and record the corresponding destination states and their rewards. Code excerpt below.
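That excerpt is not reproduced here either; the sketch below shows the sampling idea, reusing states, x_edges, and vel_edges from the previous sketch. Setting the unwrapped environment’s state directly and the dictionary layout for P and R are my assumptions, not necessarily the article’s approach.

```python
import gymnasium as gym
import numpy as np
from collections import defaultdict

def state_of(x, v):
    # Map a raw (x, v) observation to the index of its discretised state (illustrative helper).
    xi = min(int(np.digitize(x, x_edges)) - 1, x_bin - 1)
    vi = min(int(np.digitize(v, vel_edges)) - 1, vel_bin - 1)
    return xi * vel_bin + vi

env = gym.make("MountainCar-v0").unwrapped  # unwrapped so we can set the state directly
env.reset()

counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
R = defaultdict(float)                          # (s, a, s') -> observed reward
n_samples = 20                                  # Monte Carlo samples per (state, action) pair

for s, (x_iv, v_iv) in enumerate(states):
    for a in range(env.action_space.n):
        for _ in range(n_samples):
            # Drop the car at a random point inside this state's bins, then take action a.
            env.state = np.array([np.random.uniform(*x_iv), np.random.uniform(*v_iv)])
            obs, r, terminated, truncated, info = env.step(a)
            counts[(s, a)][state_of(*obs)] += 1
            R[(s, a, state_of(*obs))] = r

# Normalise visit counts into transition probabilities P(s, a, s').
P = {sa: {sn: c / sum(d.values()) for sn, c in d.items()} for sa, d in counts.items()}
env.close()
```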
Again, this might not make complete sense at the moment; see the full implementation in Python at the end of the article and try running it yourself.
Bellman Equation
The Bellman Equation states that the expected long-term reward of an action is equal to the immediate reward of that action plus the expected reward to be earned from all future actions taken from that point on. With this idea in mind, we can define a Value Function $V(s)$: the expected future reward starting from the current state $s$.
$$ V(s) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s] $$
Here $R_{t+1}$ denotes the reward received when moving to the next state, and $S_{t+1}$ is the next state. Notice how the value function is recursive. The $\gamma$ term is the discount factor (a value between 0 and 1): rewards further away in the future are less important, and it also makes this infinite series converge.
With this value function, the optimal policy would then be picking the action that gives the highest expected rewards, given the current state. Now all that’s left is to solve for the optimal value function $V^*$.
Value Iteration
Value Iteration is a value-based method for solving an MDP: it finds the optimal value function $V^*$ using dynamic programming, essentially running the Bellman equation iteratively until the values converge. Algorithm below.
repeat
    $\delta \leftarrow 0$
    for each $s \in S$
        $V^\prime(s) \leftarrow \max_{a \in A} \sum_{s^\prime \in S} P(s, a, s^\prime)[R(s, a, s^\prime) + \gamma V(s^\prime)]$
        $\delta \leftarrow \max(\delta, |V^\prime(s) - V(s)|)$
    $V \leftarrow V^\prime$
until $\delta \leq \theta$
Usually, when implementing this in code, we store the value of every state-action pair in a table $Q(s, a)$, also known as the Q-table. This way we can easily extract the optimal policy later on without having to recompute all these values.
repeat
    $\delta \leftarrow 0$
    for each $s \in S$
        for each $a \in A$
            $Q(s, a) \leftarrow \sum_{s^\prime \in S} P(s, a, s^\prime)[R(s, a, s^\prime) + \gamma V(s^\prime)]$
        $\delta \leftarrow \max(\delta, |\max_{a \in A} Q(s, a) - V(s)|)$
        $V(s) \leftarrow \max_{a \in A} Q(s, a)$
until $\delta \leq \theta$
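In Python, and using the P and R dictionaries from the Monte Carlo sketch earlier, the Q-table loop above might look like the following. This is a sketch with guessed defaults for gamma and theta, not the article’s code.

```python
import numpy as np

def value_iteration(n_states, n_actions, P, R, gamma=0.99, theta=1e-6):
    # P[(s, a)] maps s' -> probability; R[(s, a, s')] is the recorded reward.
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    while True:
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                Q[s, a] = sum(p * (R[(s, a, sn)] + gamma * V[sn])
                              for sn, p in P.get((s, a), {}).items())
            delta = max(delta, abs(Q[s].max() - V[s]))  # how much V(s) is about to change
            V[s] = Q[s].max()
        if delta <= theta:  # stop once no state's value moved by more than theta
            return V, Q
```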
Policy Extraction
As discussed earlier, the optimal policy at state $s$ is the action $a$ that maximizes the expected future reward $Q(s, a)$. If you implemented the Q-table approach above, this is equivalent to taking the argmax of the Q-table along the action axis.
$$ \pi(s) = \arg\max_{a \in A} Q(s, a) $$
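In code, with the Q array from the value-iteration sketch above, this is a one-liner:

```python
import numpy as np

# For each state (row), pick the action (column) with the largest Q-value.
policy = np.argmax(Q, axis=1)
```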
This step is also known as policy extraction: from the value function we extract the optimal action for our AI agent to take in each state.
Python Implementation
Full implementation of the Mountain Car MDP in Python below:
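The article’s full listing is not reproduced here. As a rough stand-in, the sketches above can be collected into a single class along the following lines; the class name, method names, and default values are guesses based on how the code is used in the rest of this section, not the author’s actual interface.

```python
import gymnasium as gym
import numpy as np
from collections import defaultdict
from itertools import product


class MountainCarMDP:
    """Rough reconstruction of the Mountain Car MDP wrapper; names and defaults are guesses."""

    def __init__(self, x_bin=18, vel_bin=10):
        self.x_bin, self.vel_bin = x_bin, vel_bin
        self.x_edges = np.linspace(-1.2, 0.6, x_bin + 1)
        self.vel_edges = np.linspace(-0.07, 0.07, vel_bin + 1)
        self.states = list(product(zip(self.x_edges[:-1], self.x_edges[1:]),
                                   zip(self.vel_edges[:-1], self.vel_edges[1:])))
        self.n_actions = 3  # MountainCar-v0: push left, no-op, push right
        self.P, self.R = self._estimate_model()

    def _state_of(self, x, v):
        # Index of the discretised state that contains the raw observation (x, v).
        xi = min(int(np.digitize(x, self.x_edges)) - 1, self.x_bin - 1)
        vi = min(int(np.digitize(v, self.vel_edges)) - 1, self.vel_bin - 1)
        return xi * self.vel_bin + vi

    def _estimate_model(self, n_samples=20):
        # Monte Carlo estimate of the transition probabilities P and rewards R.
        env = gym.make("MountainCar-v0").unwrapped
        env.reset()
        counts = defaultdict(lambda: defaultdict(int))
        R = defaultdict(float)
        for s, (x_iv, v_iv) in enumerate(self.states):
            for a in range(self.n_actions):
                for _ in range(n_samples):
                    env.state = np.array([np.random.uniform(*x_iv), np.random.uniform(*v_iv)])
                    obs, r, terminated, truncated, info = env.step(a)
                    counts[(s, a)][self._state_of(*obs)] += 1
                    R[(s, a, self._state_of(*obs))] = r
        env.close()
        P = {sa: {sn: c / sum(d.values()) for sn, c in d.items()} for sa, d in counts.items()}
        return P, R

    def value_iteration(self, gamma=0.99, theta=1e-6):
        # Q-table value iteration; also stores the extracted policy (argmax over actions).
        n = len(self.states)
        V, Q = np.zeros(n), np.zeros((n, self.n_actions))
        while True:
            delta = 0.0
            for s in range(n):
                for a in range(self.n_actions):
                    Q[s, a] = sum(p * (self.R[(s, a, sn)] + gamma * V[sn])
                                  for sn, p in self.P.get((s, a), {}).items())
                delta = max(delta, abs(Q[s].max() - V[s]))
                V[s] = Q[s].max()
            if delta <= theta:
                break
        self.policy = np.argmax(Q, axis=1)
        return V, Q

    def play(self, agent=False, max_steps=1000):
        # Render the game; follow the extracted policy if agent=True, otherwise read j/k input.
        env = gym.make("MountainCar-v0", render_mode="human")
        obs, info = env.reset()
        for step in range(1, max_steps + 1):
            if agent:
                action = int(self.policy[self._state_of(*obs)])  # call value_iteration() first
            else:
                action = {"j": 0, "k": 2}.get(input("j / k > ").strip(), 1)
            obs, reward, terminated, truncated, info = env.step(action)
            if terminated or truncated or step % 100 == 0:  # reset every 100 steps at the latest
                obs, info = env.reset()
        env.close()
```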
To create an instance of the Mountain Car MDP, run the following line.
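Assuming the reconstruction sketched above (the real constructor may of course differ), that line would look something like:

```python
mdp = MountainCarMDP(x_bin=18, vel_bin=10)
```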
The x_bin and vel_bin arguments specify the number of bins used to divide the horizontal axis $x$ and the velocity of the car $v$ when building the MDP.
The resulting MDP will have x_bin * vel_bin states (18 * 10 = 180 states in this case).
Setting these numbers higher results in a more fine-grained state definition, but at the cost of memory and time.
To manually play the game, run this.
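In terms of the same sketch, for example:

```python
mdp.play(agent=False)  # control the car with j / k as described earlier
```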
Note that the game resets once every 100 steps.
To run value iteration, run this.
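Again in terms of the sketch (the theta value here is only a guess):

```python
V, Q = mdp.value_iteration(theta=1e-6)
```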
You can toy with the theta argument to stop the value iteration earlier (or later).
Finally, to let the AI agent trained using value iteration play the game, run this.
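And finally, with the same sketch:

```python
mdp.play(agent=True)  # follow the policy extracted from the Q-table
```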
That’s it! You can now try to solve all the classic control environments in Gymnasium using this same exact framework.