For my first project with Reinforcement Learning (RL), I wanted to pick a problem that could use straightforward approaches yet was not trivial to design. I wanted to use a full-state (or table-lookup) approach instead of a function-approximation (such as neural network) approach. I also wanted to have a deterministic and fully-observable environment (i.e. I wanted actions to have predictable results, and I wanted just a single agent in the environment.)
The Race Car problem I chose to tackle is simply an agent that can drive through a given racetrack in an optimal manner (i.e. taking the least amount of time to complete the loop). I initially didn’t believe I could tackle this with a table-lookup approach but upon looking at the numbers I realized I could condense it to a manageable number of states and still have a good outcome.
States and Actions
In order to keep the state-space to a manageable size, the main concession I made was to have only 28 x 5 locations on the race track. More specifically, the driver/agent was allowed to take an action (such as steering) at only 28 different “junctures” along the race track (points of progress along the track), each of which was split into 5 “lanes”.
At each of these 140 locations, the car’s instantaneous state was allowed to be at one of 3 “speeds” (i.e., 1, 2 or 3) and facing one of 20 directions (i.e. 0, 18, 36, … 342 degrees). So the total number of car states was 8400.
At each of these 8400 car states, the driver/agent was allowed to choose an action consisting of one of 3 accelerations (i.e. -1, 0 or +1) and one of 3 steerings (left, straight, right).
So, in all, I had to deal with 75,600 states and I suspected that a lot of those states would never need to be visited (e.g. most of the directions faced at a given location), thereby making a full-state (table-lookup) solution tractable.
Environment and Rewards
I started off with a simple circular race track, with the 28 “junctures” (or action points) distributed evenly around the track (one at every 12.8 degrees), the width of each juncture evenly split into 5 lanes.
The “environment” simulates the actual physics of the game, keeping track of the precise (floating-point) location and velocity of the car, and is responsible for incrementing time-steps, “moving” the car between junctures, and for determining whether the car is still within the borders or whether it has crossed the finish line.
The environment also handles the mapping/approximating of these physical quantities into states for the driver’s benefit.
The environment is responsible for handing out rewards for the outcome of the chosen action. Defining the rewards was surprisingly important as well as tricky to get right. There are many options with different pros and cons but I ended up going with this:
- Each time step (of car movement) earned a -1 reward.
- Each juncture reached (“milestone”) earned a +10 reward.
- The end of episode (car crash or finish line) earned a “progress” reward (i.e. 5000 * percentage of track completed)
(Notably, this set of rewards gives the agent substantial information about its progress along the track and therefore makes the problem easier — as compared to, say, an environment which only rewarded the agent at the finish line.)
Results
I learned a lot with this project, which I’ve described extensively under “Notes” below. The end results are as shown here.
Circular track
For the circle track that I started with, the table-lookup q-learn agent was able to learn behavior optimized to minimize the amount of time steps taken. As you can see, having started at the bottom, the car speeds up as quickly as possible initially, and then sticks to the innermost edge as much as possible.
I trained this for 200,000 episodes (~ 8 minutes on my laptop.), and the algorithm was able to find a path that was faster (198 steps) than I was able to find manually (199 steps).
Rectangular track
For a harder problem involving corners, I chose a simple rectangular track. For this track I provided 50 junctures/milestones (unlike the circular track’s 28) and was able to find a fast (thought not optimal) path to the finish line.
Giant X track
Finally, for a longer track involving tight corners, I designed this Giant X track. It was harder to learn driving around the tight corners, and the learning took longer overall, but I was able to get a fairly satisfactory outcome.
Notes and lessons learned
- State space: Despite the “Race Car problem” seeming like a straightforward problem to solve, my initial preferences for the allowable ranges of feature-values was blowing up the state space, and so I had reduce the allowable ranges, e.g. limiting speeds to a range of only 3, having only 3 steering options, etc. A big chunk of the savings came from defining locations as a small number of “junctures” and “lanes” instead of the bigger range of “pixels”. Thankfully despite the simplification the problem was still interesting and visually satisfying.
- Quicker death: Initially I rewarded every car-crash with a single big penalty value -1000. However, with that reward system, the agent learned to crash earlier rather than later, since each time step had an additional -1 penalty. So, I modified the crash reward to be the amount of ‘progress’ along the track at the point of crash.
- Guessing values of unvisited states: I was very tempted to implement some type of guessing/interpolation for the value of an unvisited state based on its nearby states, because it seemed wasteful to have to learn every such state from scratch when interpolation would’ve given a great starting value. I think this is indeed a good idea but it falls under function approximation (!) so I’ll explore that later.
- Debugging with Q heat-map: A useful debugging tool was to get a glimpse into the state values (i.e. the lookup table Q) along certain axes and plotting the value as a color. E.g. to get insight into Q values for speed/direction along each juncture, I collapsed (summed over) the steering and acceleration axes, and obtained:
Similarly, to get insight into steering/acceleration along each juncture, I summed out the speed and direction axes.
I found it useful to plot the total/average Q over each milestone as the iterations progressed.
I also found it useful to plot the rate-of-change of Q per juncture as the iterations progressed.
I eventually landed upon my favorite debugging tool – an animation to show the progression of Q values as iterations progress, where the topmost row represent the innermost of the 5 lanes, and the rightmost column represents the finish line, and for every location we display the maximum of the Q values at that location.
- Debugging with exploration heat-map: I kept track of the number of times every state-action was visited and calculated its “exploration intensity.” Plotting this value as a heat-map was also often useful.
- Debugging with visualizing actual episode: Usually at the end of training, I simply displayed the actual track and played the “best path” followed by the car as when the “best value next action” is picked at every step.
- Debugging with junctures plot: I plotted for every juncture the number of times the episode ended (car crashed) immediately after. This gave me a sense of what the agent’s current “frontier” was for exploration.
- Initialization of Q: In my experiments, the initial value of Q didn’t seem to have much effect on the learning. Perhaps this was due to the high alpha value I use (which lets the initial value be quickly overridden), combined with the high inclination to explore in early stages.
- Adding Exploration: I initially had a fixed epsilon value, (i.e., the probability of picking a random action instead of the best action at that state, a.k.a. epsilon-greedy). This caused the learning to be stuck in the initial sections of the track, and prevented episodes from reaching the later sections very often, thereby slowing down learning. So I switched to using an epsilon that depended upon the number of times the state had been visited, n. Specifically, epsilon = N0 / (N0 + n), where N0 is a hyperparameter that I’m calling “explorate.” With N0 set to something like 2000, the learning improved significantly.
- An agent for finding bugs? It’s funny that the agent was able to point out where my programming bugs were located. For example, at one point the agent always arrived close to the finish line but crashed the car rather than drive across the finish line. This was because my code was bypassing handing a “progress” reward upon crossing the finish line. Crashing the car on the other hand gave a very high progress reward.
- Junctures vs Milestones: Initially I had designed the system such that every juncture could perform an action and collect a reward. Later on, I decided that the locations where actions should take place and the locations of actual rewards need not be the same. So I separated them out, to have Junctures where actions ought to be decided, and Milestones where rewards can be collected. This potentially allows me to increase the number of junctures while keeping the milestones fixed (or even eliminating them completely.)
- Manual driving: A good way to test out the environment and see whether a good solution is at all possible, is to feed a hard-coded sequence of actions to the environment and track the rewards.
Moreover, we can use this “model path” to pinpoint problem areas where the learning is getting “stuck” by looking at the deviations between learned and model path and digging deeper. E.g. At a certain problem location, we could plot the Q value of the “ideal action” at a given state vs. that of the learned/chosen action: - Learning rate parameter alpha: When I started getting my first successes on the circular track, the car was making the full loop but it wasn’t quite doing it the fastest way. Notably, the car often strayed closer to the outside edge of the circle rather than the inside edge. I theorized that this behavior was caused by the nearby outer edges being able to propagate their juicy reward quicker towards the current state than the (higher) rewards acquired further off in the future. I confirmed this by plotting the max Q value at every location along the circle (the top row representing the innermost lane, and the bottom the outermost):
At this early stage my learning parameter alpha was set to 0.5. Change this to alpha=1 fixed this sub-optimal-path issue completely. Now there was less learning lag for distant future rewards as compared to the (smaller) nearby rewards.
Technically this shouldn’t work in general. Setting alpha to 1 works okay for race tracks because the environment is fairly deterministic. - Optimizing explorate: I attempted to discover an optimal value for explorate by running a form of cross-validation across different values and with different random seeds. It seemed like I got better optimizations the higher I set the explorate. Specifically, I got the best result when N0=1000, but it seems that N0=10,000 wasn’t able to arrive at the finish line within the 300k iteration limit.
It makes sense that the higher the propensity to explore, the better it finds a good optimization but the longer it takes to make progress towards the finish. - Dual responsibility of Q: There are two mutually-conflicting goals for updating Q during learning. On the one hand, obviously, we want Q to represent the value of the current state-action as accurately as possible, by exploring its neighbors thoroughly. On the other hand, we want to be able to bypass this current state as quickly as possible so we can move to states further down the line that need more exploration and potentially have richer rewards to propagate back. In the case of the Race car problem, the states further down (closer to the finish line) clearly have the bigger rewards and can potentially override any “optimal paths” discovered earlier at the beginning states. So these two goals need to be kept in balance carefully.
Note that using an algorithm like Q(λ) that uses look-ahead (or back updates) would probably take care of the second goal (propagating values back). - Improving exploration further: I was very tempted to improve the system of exploration somehow. It seemed to me that the simple epsilon-greedy approach (while having the right idea) caused the initial states to get too “entrenched” by the time the future states started propagating good rewards back to them, and therefore unable to explore ways to use those better rewards.
- I considered increasing exploration of a state, if its nearby next-states had Q values that had recently improved a lot. (Motivated by a desire to have the future rewards propagate backwards faster.)
- I considered calculating an “explorability” value for each potential action at a given state and leaning towards choosing paths where there was still much left to explore.
- I also considered keeping track of “time since last exploration” of a given action, and exploring those paths that hadn’t been explored in a long time.
- But in the end, I decided not to pursue them because these solutions might be too specific to table-lookup algorithms (i.e. not applicable to function-approximation algorithms) and moreover because some of these problems might be addressed by using a lookahead algorithm like Q(λ).
- Restarting explorability: One thing I did try, though, was that upon successfully learning up to the finish line (along a not-quite-optimal path), the training would reset the exploration probability (epsilon) of all states back to being unexplored (without resetting the learned values in Q, of course). The idea was to re-explore more thoroughly the earlier sections of the track, now that the algorithm was fairly aware of the higher rewards to be gained further down the track, and thereby hopefully to “iron out” the sub-optimal sections of the track, such as:
Unfortunately, I found limited success with this approach, and in fact I often found the solutions to be getting worse with every restart! I tried doubling the epsilon at every restart but that didn’t do much either. Eventually, I realized that it was probably incorrect in the first place for me to expect the track to get “ironed out” with every exploration restart… because of an underlying unpredictability of state values, which I explain next. - Deterministic, yet unpredictable: It turned out that there was about a 3% probability of a give state’s chosen action producing a different next-state than it usually produced. Why should this be the case, when the environment is completely deterministic? The key to this mystery is that in fact every Q state is only an approximation of the actual physical state of the car. It is very possible that a given Q state is visited by several different paths whose physical lengths differ in slight amounts and whose current physical location is therefore slightly different from the others. Slight differences in physical location can unfortunately have a big effect on what an action will do at that state and what rewards it might produce, e.g., the car could (just barely) avoid crashing into the border, or the car could (just barely) squeeze into a different lane. (When these actions caused “detours”, I found the physical locations to be about 6.9 times further away than otherwise from the average location for that state. The biggest variances appears to occur near the edges, as shown, with the hotter areas in the top and bottom rows.)
Thus, as far as the agent is concerned, despite residing in a deterministic environment, taking the same action at the same state could still result in an unpredictable outcome. - Source code on github
Next steps
There are many things left to try out with the Race Car problem. I could stick with the full-state table-lookup approach but try a different algorithm such as Q(λ) or Sarsa(λ) which I expect to work better or at least faster. I could try increasing the size/density of the state space, e.g. dividing the track into more junctures with more lanes, or allowing finer changes in direction and speed, etc. I could experiment with other reward schemes, such as offering minimal rewards (i.e. hints) until the car actually arrives at the finish line.
Beyond table-lookup approaches, of course, there remains the whole world of Function Approximation (with linear regression or neural networks) which is something I look forward to.
Going further, it would be fantastic to have trained the “driver” agent such that it could successfully race through previously unseen race tracks instead of merely being optimized for a specific one.