Yosh

Bonus video #3 - PPO vs DQN

In the noseboost video, I mentioned reworking the AI’s training algorithm to push its limits. Here’s a more detailed explanation of what happened!

Reinforcement Learning is a vast field, and researchers have developed many different algorithms over the years. Some of the most commonly used ones include Deep Q-Networks (DQN), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO). Although these algorithms share a common goal, each has its own strengths and weaknesses. Comparing them in Trackmania was one of the key things I focused on while improving my AI's training setup.

In previous videos, I trained my AI using SAC. However, for the noseboost project, I couldn't get the AI to sustain a noseboost with this algorithm. I believe the main issue was that SAC is designed for continuous action spaces, which are a poor fit for noseboosts, as these require quickly chaining very different steering actions. So instead, I experimented with two other algorithms: PPO and DQN. The DQN setup also included several extensions (e.g., IQN, n-step returns, double DQN, and dueling networks). By using both algorithms with discrete action spaces, I was able to successfully train the AI to sustain a noseboost for several minutes.
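To give an idea of what two of these extensions do, here is a minimal sketch of an n-step target combined with the double DQN trick. It is only an illustration, not the training code used for the video: the function name, its arguments, and the plain Python lists standing in for network outputs are all assumptions.

```python
def n_step_double_dqn_target(rewards, gamma, q_online_next, q_target_next, done):
    """Hypothetical helper: n-step return bootstrapped the double-DQN way.

    rewards        -- the n rewards observed after the current step
    q_online_next  -- per-action values from the online network, n steps ahead
    q_target_next  -- per-action values from the target network, n steps ahead
    done           -- True if the episode ended within those n steps
    """
    # n-step part: discounted sum of the rewards that were actually observed.
    ret = sum((gamma ** k) * r for k, r in enumerate(rewards))
    if done:
        return ret  # nothing to bootstrap from after a terminal state

    # Double DQN part: the online network chooses the action,
    # the target network evaluates it.
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return ret + (gamma ** len(rewards)) * q_target_next[best_action]


target = n_step_double_dqn_target(
    rewards=[1.0, 1.0, 1.0],
    gamma=0.5,
    q_online_next=[0.0, 2.0],
    q_target_next=[4.0, 8.0],
    done=False,
)
print(target)
```

Splitting action selection and action evaluation across two networks is meant to reduce the overestimation that comes from taking a max over noisy value estimates.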

Now, an interesting observation is that PPO and DQN performed differently on the benchmark track, which is what I'm showcasing in this bonus video. Visually, I think the main difference is that the AI trained with DQN is faster because it takes more risks: it drives closer to the edges of the road, resulting in better trajectories. I suspect the DQN agent takes more risks because it's based on Q-Learning, an off-policy method, whereas PPO is on-policy, like SARSA. An on-policy method such as SARSA evaluates actions under the assumption that it keeps following its own stochastic policy in subsequent time steps, which tends to make it learn a safer driving style that accounts for possible random exploratory actions.
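The Q-learning versus SARSA contrast can be illustrated with the textbook one-step targets. This is a simplified sketch with made-up numbers: PPO does not literally use a SARSA update, and the function names and values below are assumptions chosen only to show how an on-policy target gets pulled down by risky exploratory actions.

```python
def q_learning_target(reward, gamma, next_q):
    # Off-policy: assumes the best action will be taken at the next step.
    return reward + gamma * max(next_q)


def sarsa_target(reward, gamma, next_q, next_action):
    # On-policy: uses the action the behaviour policy actually took next.
    return reward + gamma * next_q[next_action]


def expected_sarsa_target(reward, gamma, next_q, epsilon):
    # Average SARSA target when the next action is epsilon-greedy:
    # greedy with probability 1 - epsilon, uniformly random otherwise.
    mean_q = sum(next_q) / len(next_q)
    return reward + gamma * ((1.0 - epsilon) * max(next_q) + epsilon * mean_q)


# Hypothetical state right next to the edge of the road:
# action 0 keeps the car on track, action 1 sends it off the road.
next_q = [10.0, -100.0]

print(q_learning_target(0.0, 1.0, next_q))           # ignores the risky action
print(expected_sarsa_target(0.0, 1.0, next_q, 0.1))  # penalised by the risky action
```

With these numbers, the off-policy target stays at the value of the safe action, while the on-policy average is noticeably lower, so states near the edge look less attractive to the on-policy learner.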

Additionally, I believe there is another explanation for the higher performance of DQN on this track. In Trackmania, the epsilon-greedy exploration used in DQN appears to be more effective than the stochastic exploration used in PPO. In my experience, PPO tends to quickly assign very low probabilities to most actions (even after tuning the entropy hyperparameter), which severely limits its ability to explore very different strategies.
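Here is a small sketch of why the two exploration schemes can behave so differently. The Q-values, logits, and epsilon below are invented for illustration and do not come from the actual training runs; the helper names are assumptions too.

```python
import math


def epsilon_greedy_probs(q_values, epsilon):
    # Every action gets epsilon / n; the greedy action gets the remainder on top.
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def softmax_probs(logits):
    # Categorical policy as a policy-gradient method would sample from it.
    # Subtracting the max keeps the exponentials numerically stable.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values for three discrete steering actions.
dqn_probs = epsilon_greedy_probs([1.0, 0.9, 0.2], epsilon=0.1)
ppo_probs = softmax_probs([8.0, 2.0, 0.0])  # a policy that has become very peaked

print(dqn_probs)
print(ppo_probs)
```

With epsilon-greedy, every non-greedy action keeps a probability of epsilon / n no matter how confident the value estimates are. A peaked softmax policy has no such floor, so rarely chosen actions can end up almost never being tried.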

Oh, and in the second part of this bonus video, I'm comparing the AI's performance with my personal best time on this track—just to give you an idea of how absurdly fast both the PPO and DQN AIs are! Even though I've been playing this game for over 15 years, these AIs are on a completely different level!

Let me know if you're interested in more technical posts like this! Note that I could be wrong in the analysis I've just made. So if any Reinforcement Learning specialists are reading this, don't hesitate to share your thoughts with me :)

Comments

Good question! From my experience, it's often tricky to know if more training will be worth it. And like most questions concerning RL, it probably depends on the environment you are working with. You can sometimes get a significant improvement after a long period of stagnation (from what I remember, the hide & seek paper from OpenAI is a good example of that: https://arxiv.org/abs/1909.07528). I've also observed this in Trackmania. Sometimes, after some stagnation, the AI would suddenly start drifting in a specific section of a map, which would lead to faster times. However, in my experience, stagnation isn't a good sign when you're still far from the desired performance. I'd say that the smoother the progress, the better. Then, as you get closer to optimal performance, it's normal for each additional improvement to take longer.

Yosh

I'm working on an RL project and I'm wondering if you've ever seen the algorithm suddenly find a good solution after a long time of flatlining. Said differently, is it worth it to keep training, even if the algorithm doesn't seem to be making any progress?

Tom _____

