Reinforcement learning plays Metroid II. 'Nuff said.
I've changed the model plenty of times, lots is changing, ill document it later
PyBoy provides "game wrappers" for various games to make AI work easier. I am working on implementing one for Metroid II and will eventually get my code pulled into the project. For now, I have enough implemented for it to work.
Currently, a pixel-based observation approach is going to be used. Due to the environment, a tile-based approach could be used, and that may much faster, however it has a lot of its own issues and those may or may not be explored here.
Through testing and some math, roughly 216,000 iterations correspond to an hour of "real life" game time. This was calculated by measuring 1000 iteration's time. The output of the script was as follows. This assumes 1000 iterations
Human Time: 16.650566339492798
Machine time: 0.2655789852142334
Machine is 62.69534589139785 times faster
time per step HUMAN: 0.016650566339492797
time per step FAST: 0.0002655789852142334
One hour of human gameplay = 216208.8620049702
the math for this was simply
time / (measured_time / iterations) --> 3600 / (16.65/1000)
I'm focusing much more on the model, as well as the "real AI" portion of the project. I'm currently tweaking and tuning my reward function a ton.
Currently, the agent makes it pretty reliably out of the main portion of the level, however it runs straight through enemies, and has learned to completely not shoot at all.
At this point, a pull request has been made for PyBoy, and many changes need to be made to that repository, but before I do that, I want to focus more on the training and AI portion of things.
- explore starting area
- Stop shooting randomly and "spazzing out"
- Get out of starting area relatively quickly
- Avoid enemies just outside starting area
- Kill enemies just outside starting area
- Drop down through first major shaft (requires downward jump shooting)
- Find first Metroid
- Kill first Metroid
- get wandb integration working
- Tweak exploration to include a "staleness" i.e. punish not moving, or not getting an exploration reward
- Potentially implement Random Network Distillation
- figure out a way to incentivise killing (sprite based?)
- Add missle count, "gun state", and health to observation space
- Re-write reward functions to punish when losing health and missles, rather than when below max
- Add frame stacking to observation space (Added LSTM instead)
- improve observations of environments to be tiles, rather than pixels
- Change action space so agent can hold buttons (SML code has simmilar)
- Define a baseline test reward function
- make observations of environments pixels of screen
- Write simple exploration function using game coordinate hashing
- Do some kind of extremely bare-bones training to just explore
- Change to Pufferlib native environment, rather than gym wrapper (?)
Environment is pretty much set. Its plenty good enough to start training
- Verify ROM integrity (the rom works)
- Take random actions in the environment (veryify pyboy interface working)
- Implement a
game_wrapperfor Metroid II (makes AI stuff easier)- Potentially use RAM mappings to calculate more useful info (particularly health)
- Improve
game_overto check health, may be slightly faster than using "GAME OVER" screen - Make a pull request for PyBoy to merge my code in
- Fix code to comply with PyBoy coding standards (see PR in PyBoy)
- Implement the
start_gamefunction to skip though the menu - Implement
game_overfunction to check if agent is dead - Integrate and verify the RAM mappings
- Determine and implement all possible button combos for "actions" (may need minor improvements)
- Containerize the program to make running on other machines easy (either with conda, or docker potentially, or a setup script)
- Overall, clean up the code a LOT, including but not limited to complete restructuring for the sake of pufferlib. (I'm using three different repos! my Pufferlib fork, PyBoy fork, and my gymnasium environment.
A majority of this projects initial work (January - February) was dedicated to many many iterations of the PyBoy wrapper, action space, and overall implementation, wrappers, libraries oh my! At this point, I'm switching gears to 90/10 doing AI and "backend" development. Up to this point, most of the training was "proof of concept". And many bugs and kinks were worked out along the way before switching to a more AI focused development.
The first real model was using simply the observation of the screen this worked great for getting past the initial
A list of extremely silly mistakes
- agent couldn't actually jump! My original implementation of the button presses was simply single frame tapping the button. Needed to re-implement the emulator interfacing to check next action, and release buttons not in the next action.
- At one point, I had my exploration reward scale with explorations. I.e. for every N new coordinate points, we'd give N*W reward, the reward very quickly balooned, and the agent would just get stuck going to the right.
- By shrinking the screen by four, after CNN layers, the screen was too small, and got reduced to one single feature! This is sub-optimal for AI.
- Gymnasium can't handle (width, height) for images even if its black and white. It needs to be (width, height, 1) even though its grayscale.