Keras Implementation of a Deep-Q Network on Catcher of Pygame Learning Environment

Recommended Reading

Introduction

One of the highlights of recent machine learning has been the success of convolutional neural networks (CNNs) on image-based problems. In 2015, DeepMind created a Deep-Q Network (DQN) with a CNN to play a number of Atari games, some of them at a high level. What stood out was how well the DQN generalized across many games with only the raw screen images as the input. Since then there have been many improvements on DQN (prioritized experience replay, double DQN, dueling architectures), and there are better RL algorithms for solving these types of problems. Still, I wanted to implement a DQN because it is relatively simple to understand and code from scratch as far as RL algorithms go, and I'm hoping it serves as a good entry point into more advanced RL methods.

In hindsight, I'm not sure how useful it was to spend so much time trying to implement a DQN. There are better methods that converge faster and produce better results. My main takeaway from DQN is that it's hard to get the hyperparameters right, and there are many hyperparameters to tinker with. The Denny Britz DQN implementation linked above has an open issue from 2016 basically asking why the DQN results aren't better. Some smart people have tried to figure out how to tune that DQN properly and haven't come up with a good answer. There are many things that can be adjusted, and the DQN paper runs for 4,000,000 steps, so it's not a fast experiment.

It took me a while to get my DQN to converge to something decent. Along the way I adjusted or considered changing the following (a sketch of example values follows the list):
  • number of timesteps
  • exploration epsilon starting and ending values
  • size of the action space (do I include a no-op action?)
  • size of the replay memory
  • batch size
  • learning rate
  • gradient descent algorithm
  • neural network (NN) architecture (if you re-design the network completely, this could add dozens more hyperparameters)
  • loss function on the NN
  • how often to update the target Q network
  • how/whether to clip the gradients
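
To make those knobs concrete, here is a sketch of the settings gathered into one place. The values below are illustrative defaults in the spirit of the DQN paper and the FlappyBird example, not the ones my code actually ended up with:

```python
# Illustrative DQN hyperparameters (hypothetical values, for orientation only).
HYPERPARAMS = {
    "num_timesteps": 1000000,        # total environment steps to train for
    "epsilon_start": 1.0,            # initial exploration rate
    "epsilon_end": 0.1,              # final exploration rate after annealing
    "epsilon_anneal_steps": 100000,  # steps over which epsilon is linearly annealed
    "num_actions": 3,                # left, right, and (optionally) no action
    "replay_memory_size": 50000,     # max transitions kept in the replay buffer
    "batch_size": 32,                # minibatch size sampled from replay memory
    "learning_rate": 1e-4,           # step size for the optimizer
    "optimizer": "adam",             # gradient descent algorithm
    "loss": "mse",                   # loss function on the NN
    "target_update_freq": 10000,     # steps between target Q network updates
    "clipnorm": 1.0,                 # gradient clipping threshold
    "gamma": 0.99,                   # discount factor
}
```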

As you learn RL and see different methods being implemented, it's easy to forget how tricky some of these setups are, how carefully the hyperparameters have been tuned, and that for every successful result there are many, many more failed experiments.

The Code

Once you have the dependencies installed and have explored Pygame Learning Environment a bit (I have some test code here), you can find the code for my Keras DQN on Catcher here. Catcher is a basic game where pixels fall from the ceiling and you have to move the paddle left or right to catch them before they fall below the paddle.
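
For reference, a minimal sketch of driving Catcher through PLE's standard interface; the fps, display setting, and random-agent loop here are assumptions for illustration, not taken from the linked code:

```python
import random

from ple import PLE
from ple.games.catcher import Catcher

# Create the game and wrap it in the PLE interface.
game = Catcher()
env = PLE(game, fps=30, display_screen=False)
env.init()

actions = env.getActionSet()  # the valid actions for the game

# Random agent: act, collect the reward, and grab the screen each step.
for step in range(200):
    if env.game_over():
        env.reset_game()
    reward = env.act(random.choice(actions))
    frame = env.getScreenGrayscale()  # 2D numpy array of the current frame
```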

Some of the code highlights:

This is the CNN, copied from the FlappyBird example. The only change I made was adding gradient clipping.
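
As a sketch, the network looks roughly like the following. The input shape, layer sizes, and clipnorm value are assumptions based on the FlappyBird example and the DQN Nature architecture, not the exact settings in my code:

```python
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten
from keras.optimizers import Adam

def build_q_network(input_shape=(80, 80, 4), num_actions=3, learning_rate=1e-4):
    """Build a DQN-style CNN: stacked frames in, one Q-value per action out."""
    model = Sequential()
    model.add(Conv2D(32, (8, 8), strides=(4, 4), activation="relu",
                     input_shape=input_shape))
    model.add(Conv2D(64, (4, 4), strides=(2, 2), activation="relu"))
    model.add(Conv2D(64, (3, 3), strides=(1, 1), activation="relu"))
    model.add(Flatten())
    model.add(Dense(512, activation="relu"))
    model.add(Dense(num_actions))  # linear output: one Q-value per action

    # clipnorm is the gradient clipping added on top of the FlappyBird example.
    model.compile(loss="mse", optimizer=Adam(lr=learning_rate, clipnorm=1.0))
    return model
```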

The train_network() function is basically: set up the training loop, add the results to the experience replay, and train on the experience replay. For the training, I combined the parts from the FlappyBird example and Denny Britz's implementation that I liked best. In particular, the FlappyBird example didn't use a target Q network and computed values with the Q network on the current state, instead of on the next state as the DQN Nature paper does. My training section:
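
In rough terms, the update on a sampled minibatch looks something like the sketch below; the variable names and exact bookkeeping are assumptions, not a copy of the real function:

```python
import random
import numpy as np

def train_step(q_network, target_network, replay_memory,
               batch_size=32, gamma=0.99):
    """One gradient update on a minibatch sampled from the replay memory."""
    # Each entry in replay_memory is (state, action_index, reward, next_state, done).
    batch = random.sample(replay_memory, batch_size)
    states = np.array([b[0] for b in batch])
    next_states = np.array([b[3] for b in batch])

    # Predict current Q-values, then overwrite the taken action's value with
    # the bootstrapped target computed from the *target* network on the next state.
    q_values = q_network.predict(states)
    next_q_values = target_network.predict(next_states)
    for i, (_, action, reward, _, done) in enumerate(batch):
        if done:
            q_values[i][action] = reward
        else:
            q_values[i][action] = reward + gamma * np.max(next_q_values[i])

    # Fit the Q network toward the updated targets for this minibatch.
    q_network.train_on_batch(states, q_values)
```

The target network is then refreshed every so many steps in the outer loop (e.g. with target_network.set_weights(q_network.get_weights())), which is the piece the FlappyBird example skips.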


The remainder of the code calls the functions to actually do the training and runs the final model to see how the DQN performs. The final result is a bit underwhelming. It has learned to play the game okay, but I would have hoped for better considering how simple Catcher is. The final learned network plays a bit oddly: it hangs out on one side of the game until a piece falls. If a piece falls on the opposite side, it swoops over and then swoops back, not appreciating the symmetry of the game and that being on one side rather than the other doesn't matter. This could be a weakness of the DQN approach to this game that a policy gradient method might handle better. One thing policy gradients do better than a Q network is allow for a stochastic policy rather than the deterministic one a Q network provides, since a Q network always takes the highest-valued action. In Catcher, the Q network ends up valuing one side over the other when in reality the two sides are about the same. A stochastic policy that says 'when no pieces are available to catch, go right 51% of the time and go left 49% of the time' may be preferable to a Q network that says 'when there are no pieces available, left is worth 0.0001 more value points than right, so always go left even though the two states are about equally valuable.'
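
As a tiny illustration of that difference, with made-up numbers: a Q network always takes the argmax, however small the gap, while a stochastic policy samples from a distribution over actions.

```python
import numpy as np

# Hypothetical values for the "no piece on screen" state.
q_values = np.array([0.5001, 0.5000])  # [left, right]: nearly identical Q-values
policy_probs = np.array([0.51, 0.49])  # [left, right]: a stochastic policy

# Greedy Q-network behavior: always the same action, however tiny the gap.
greedy_action = int(np.argmax(q_values))  # always 0 (left)

# Policy-gradient-style behavior: sample, so both actions get played.
sampled_action = int(np.random.choice(len(policy_probs), p=policy_probs))
```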

Feel free to comment if anything is unclear or if you have any questions. Up next is a more sophisticated implementation of DQN applicable to all the Pygame Learning Environment Games, piggy-backing on some publicly available code that lets us try out more advanced DQN features.
