Yosh

Bonus video #3 - PPO vs DQN

Added 2025-02-14 11:54:34 +0000 UTC

In the noseboost video, I mentioned reworking the AI’s training algorithm to push its limits. Here’s a more detailed explanation of what happened!

Reinforcement Learning is a vast field, and researchers have developed many different algorithms over the years. Some of the most commonly used ones include Deep Q-Networks (DQN), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO). Although these algorithms share a common goal, each has its own strengths and weaknesses. Comparing them in Trackmania was one of the key things I focused on while improving my AI's training setup.

In previous videos, I trained my AI using SAC. However, for the noseboost project, I couldn't get the AI to sustain a noseboost with this algorithm. I believe the main issue was that SAC is designed for continuous action spaces, which is a poor fit for noseboosts, since sustaining one requires quickly chaining very different steering actions. So instead, I experimented with two other algorithms: PPO and DQN. The DQN setup also included several extensions (e.g., IQN, n-step returns, double DQN, and dueling networks). By using both algorithms with discrete action spaces, I was able to successfully train the AI to sustain a noseboost for several minutes.
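To give a rough idea of what two of those DQN extensions do, here is a minimal sketch of a training target that combines n-step returns with double DQN. This is not my actual training code: the function name, the arguments, and the numbers are all made up for illustration, and IQN and dueling networks are left out entirely.

```python
def n_step_double_dqn_target(rewards, gamma, q_online_next, q_target_next, done):
    """Illustrative n-step double DQN target (hypothetical helper).

    rewards:        the n rewards collected after the current state
    gamma:          discount factor
    q_online_next:  online-network Q-values for the state n steps ahead
    q_target_next:  target-network Q-values for that same state
    done:           True if the episode ended within those n steps
    """
    # n-step part: discounted sum of the rewards actually observed.
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward

    if not done:
        # Double DQN part: the online network picks the action,
        # the target network evaluates it.
        best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
        target += (gamma ** len(rewards)) * q_target_next[best_action]
    return target


# Made-up numbers: 3 rewards, 2 discrete actions.
example = n_step_double_dqn_target(
    rewards=[1.0, 1.0, 1.0],
    gamma=0.5,
    q_online_next=[0.0, 2.0],
    q_target_next=[5.0, 4.0],
    done=False,
)
print(example)
```

Note that the bootstrap value comes from the target network's estimate for the action the online network prefers (4.0 here), not from the target network's own maximum (5.0). That is the whole point of double DQN: it reduces the overestimation you get when the same network both picks and scores the action.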

Now, an interesting observation is that PPO and DQN performed differently on the benchmark track, which is what I'm showcasing in this bonus video. Visually, I think the main difference is that the AI trained with DQN is faster because it takes more risks: it drives closer to the edges of the road, resulting in better trajectories. I suspect the DQN agent takes more risks because it's based on Q-Learning, which is off-policy, whereas PPO is on-policy, much like SARSA. An on-policy method such as SARSA evaluates actions under the assumption that it keeps following its stochastic policy in subsequent time steps, which tends to make it learn a safer driving style that accounts for possible random exploratory actions.
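The Q-Learning versus SARSA contrast above can be sketched with a toy example. To keep it deterministic I'm using Expected SARSA (which averages over the epsilon-greedy policy instead of sampling one next action), and all the values are invented. Keep in mind that PPO doesn't literally compute targets like this; it's only an analogy for the on-policy effect I suspect is at play.

```python
def q_learning_target(reward, gamma, next_q):
    # Off-policy: assume the best action is always taken afterwards.
    return reward + gamma * max(next_q)


def expected_sarsa_target(reward, gamma, next_q, epsilon):
    # On-policy: average over what an epsilon-greedy policy would really do,
    # including the occasional random (possibly disastrous) action.
    n = len(next_q)
    greedy = max(range(n), key=lambda a: next_q[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    expected_value = sum(p * q for p, q in zip(probs, next_q))
    return reward + gamma * expected_value


# Invented next-state values for two racing lines (2 actions each):
# near the edge, one action is great and the other one crashes.
near_edge = [10.0, -100.0]
mid_road = [8.0, 7.0]

for name, next_q in [("near edge", near_edge), ("mid road", mid_road)]:
    off_policy = q_learning_target(0.0, 1.0, next_q)
    on_policy = expected_sarsa_target(0.0, 1.0, next_q, epsilon=0.1)
    print(name, off_policy, round(on_policy, 2))
```

With these numbers, the Q-Learning target rates the line near the edge higher (10 versus 8), while the on-policy target rates it lower (about 4.5 versus 7.95), because a 5% chance of the crashing action is enough to drag the average down.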

Additionally, I believe there is another explanation for the higher performance of DQN on this track. In Trackmania, the epsilon-greedy exploration used in DQN appears to be more effective than the stochastic exploration used in PPO. In my experience, PPO tends to quickly assign very low probabilities to most actions (even after tuning the entropy hyperparameter), which severely limits its ability to explore very different strategies.
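Here is a small sketch of that exploration difference. The scores and the epsilon value are made up, and a real PPO policy outputs its probabilities from a neural network rather than from a fixed list, but it shows how epsilon-greedy keeps a floor under every action while a peaked softmax policy does not.

```python
import math


def epsilon_greedy_probs(q_values, epsilon):
    # Every action gets at least epsilon / n, whatever its Q-value is.
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def softmax_probs(logits):
    # Stochastic policy: probabilities shrink exponentially with the logit gap.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Made-up scores for 4 discrete actions, with one clear favorite.
scores = [8.0, 0.0, 0.0, 0.0]

eps_probs = epsilon_greedy_probs(scores, epsilon=0.1)
soft_probs = softmax_probs(scores)

print("epsilon-greedy:", eps_probs, "entropy:", entropy(eps_probs))
print("softmax:       ", soft_probs, "entropy:", entropy(soft_probs))
```

In this toy case, each non-favorite action is still picked 2.5% of the time under epsilon-greedy, but well under 0.1% of the time under the softmax policy, so very different strategies are almost never tried.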

Oh, and in the second part of this bonus video, I'm comparing the AI's performance with my personal best time on this track—just to give you an idea of how absurdly fast both the PPO and DQN AIs are! Even though I've been playing this game for over 15 years, these AIs are on a completely different level!

Let me know if you're interested in more technical posts like this! Note that I could be wrong in the analysis I've just made. So if any Reinforcement Learning specialists are reading this, don't hesitate to share your thoughts with me :)