How to train an artificial neural network to play Diablo 2 using visual input?

The problem you are pursuing is intractable in the way you have defined it. It is usually a mistake to think that a neural network would "magically" learn a rich reprsentation of a problem. A good fact to keep in mind when deciding whether ANN is the right tool for a task is that it is an interpolation method. Think, whether you can frame your problem as finding an approximation of a function, where you have many points from this function and lots of time for designing the network and training it.

The problem you propose does not pass this test. Game control is not a function of the image on the screen. There is a lot of information the player has to keep in memory. For a simple example, it is often true that every time you enter a shop in a game, the screen looks the same. However, what you buy depends on the circumstances. No matter how complicated the network, if the screen pixels are its input, it would always perform the same action upon entering the store.

Besides, there is the problem of scale. The task you propose is simply too complicated to learn in any reasonable amount of time. You should see aigamedev.com for how game AI works. Artitificial Neural Networks have been used successfully in some games, but in very limited manner. Game AI is difficult and often expensive to develop. If there was a general approach of constructing functional neural networks, the industry would have most likely seized on it. I recommend that you begin with much, much simpler examples, like tic-tac-toe.


UPDATE 2018-07-26: That's it! We are now approaching the point where this kind of game will be solvable! Using OpenAI and based on the game DotA 2, a team could make an AI that can beat semi-professional gamers in a 5v5 game. If you know DotA 2, you know this game is quite similar to Diablo-like games in terms of mechanics, but one could argue that it is even more complicated because of the team play.

As expected, this was achieved thanks to the latest advances in reinforcement learning with deep learning, and using open game frameworks like OpenAI which eases the development of an AI since you get a neat API and also because you can accelerate the game (the AI played the equivalent of 180 years of gameplay against itself everyday!).

On the 5th of August 2018 (in 10 days!), it is planned to pit this AI against top DotA 2 gamers. If this works out, expect a big revolution, maybe not as mediatized as the solving of the Go game, but it will nonetheless be a huge milestone for games AI!

UPDATE 2017-01: The field is moving very fast since AlphaGo's success, and there are new frameworks to facilitate the development of machine learning algorithms on games almost every months. Here is a list of the latest ones I've found:

  • OpenAI's Universe: a platform to play virtually any game using machine learning. The API is in Python, and it runs the games behind a VNC remote desktop environment, so it can capture the images of any game! You can probably use Universe to play Diablo II through a machine learning algorithm!
  • OpenAI's Gym: Similar to Universe but targeting reinforcement learning algorithms specifically (so it's kind of a generalization of the framework used by AlphaGo but to a lot more games). There is a course on Udemy covering the application of machine learning to games like breakout or Doom using OpenAI Gym.
  • TorchCraft: a bridge between Torch (machine learning framework) and StarCraft: Brood War.
  • pyGTA5: a project to build self-driving cars in GTA5 using only screen captures (with lots of videos online).

Very exciting times!

IMPORTANT UPDATE (2016-06): As noted by OP, this problem of training artificial networks to play games using only visual inputs is now being tackled by several serious institutions, with quite promising results, such as DeepMind Deep-Qlearning-Network (DQN).

And now, if you want to get to take on the next level challenge, you can use one of the various AI vision game development platforms such as ViZDoom, a highly optimized platform (7000 fps) to train networks to play Doom using only visual inputs:

ViZDoom allows developing AI bots that play Doom using only the visual information (the screen buffer). It is primarily intended for research in machine visual learning, and deep reinforcement learning, in particular. ViZDoom is based on ZDoom to provide the game mechanics.

And the results are quite amazing, see the videos on their webpage and the nice tutorial (in Python) here!

There is also a similar project for Quake 3 Arena, called Quagents, which also provides easy API access to underlying game data, but you can scrap it and just use screenshots and the API only to control your agent.

Why is such a platform useful if we only use screenshots? Even if you don't access underlying game data, such a platform provide:

  • high performance implementation of games (you can generate more data/plays/learning generations with less time so that your learning algorithms can converge faster!).
  • a simple and responsive API to control your agents (ie, if you try to use human inputs to control a game, some of your commands may be lost, so you'd also deal with unreliability of your outputs...).
  • easy setup of custom scenarios.
  • customizable rendering (can be useful to "simplify" the images you get to ease processing)
  • synchronized ("turn-by-turn") play (so you don't need your algorithm to work in realtime at first, that's a huge complexity reduction).
  • additional convenience features such as crossplatform compatibility, retrocompatibility (you don't risk your bot not working with the game anymore when there is a new game update), etc.

To summarize, the great thing about these platforms is that they alleviate much of the previous technical issues you had to deal with (how to manipulate game inputs, how to setup scenarios, etc.) so that you just have to deal with the learning algorithm itself.

So now, get to work and make us the best AI visual bot ever ;)


Old post describing the technical issues of developping an AI relying only on visual inputs:

Contrary to some of my colleagues above, I do not think this problem is intractable. But it surely is a hella hard one!

The first problem as pointed out above is that of the representation of the state of the game: you can't represent the full state with just a single image, you need to maintain some kind of memorization (health but also objects equipped and items available to use, quests and goals, etc.). To fetch such informations you have two ways: either by directly accessing the game data, which is the most reliable and easy; or either you can create an abstract representation of these informations by implementing some simple procedures (open inventory, take a screenshot, extract the data). Of course, extracting data from a screenshot will either have you to put in some supervised procedure (that you define completely) or unsupervised (via a machine learning algorithm, but then it'll scale up a lot the complexity...). For unsupervised machine learning, you will need to use a quite recent kind of algorithms called structural learning algorithms (which learn the structure of data rather than how to classify them or predict a value). One such algorithm is the Recursive Neural Network (not to confuse with Recurrent Neural Network) by Richard Socher: http://techtalks.tv/talks/54422/

Then, another problem is that even when you have fetched all the data you need, the game is only partially observable. Thus you need to inject an abstract model of the world and feed it with processed information from the game, for example the location of your avatar, but also the location of quest items, goals and enemies outside the screen. You may maybe look into Mixture Particle Filters by Vermaak 2003 for this.

Also, you need to have an autonomous agent, with goals dynamically generated. A well-known architecture you can try is BDI agent, but you will probably have to tweak it for this architecture to work in your practical case. As an alternative, there is also the Recursive Petri Net, which you can probably combine with all kinds of variations of the petri nets to achieve what you want since it is a very well studied and flexible framework, with great formalization and proofs procedures.

And at last, even if you do all the above, you will need to find a way to emulate the game in accelerated speed (using a video may be nice, but the problem is that your algorithm will only spectate without control, and being able to try for itself is very important for learning). Indeed, it is well-known that current state-of-the-art algorithm takes a lot more time to learn the same thing a human can learn (even more so with reinforcement learning), thus if can't speed up the process (ie, if you can't speed up the game time), your algorithm won't even converge in a single lifetime...

To conclude, what you want to achieve here is at the limit (and maybe a bit beyond) of current state-of-the-art algorithms. I think it may be possible, but even if it is, you are going to spend a hella lot of time, because this is not a theoretical problem but a practical problem you are approaching here, and thus you need to implement and combine a lot of different AI approaches in order to solve it.

Several decades of research with a whole team working on it would may not suffice, so if you are alone and working on it in part-time (as you probably have a job for a living) you may spend a whole lifetime without reaching anywhere near a working solution.

So my most important advice here would be that you lower down your expectations, and try to reduce the complexity of your problem by using all the information you can, and avoid as much as possible relying on screenshots (ie, try to hook directly into the game, look for DLL injection), and simplify some problems by implementing supervised procedures, do not let your algorithm learn everything (ie, drop image processing for now as much as possible and rely on internal game informations, later on if your algorithm works well, you can replace some parts of your AI program with image processing, thus gruadually attaining your full goal, for example if you can get something to work quite well, you can try to complexify your problem and replace supervised procedures and memory game data by unsupervised machine learning algorithms on screenshots).

Good luck, and if it works, make sure to publish an article, you can surely get renowned for solving such a hard practical problem!


I can see that you are worried about how to train the ANN, but this project hides a complexity that you might not be aware of. Object/character recognition on computer games through image processing it's a highly challenging task (not say crazy for FPS and RPG games). I don't doubt of your skills and I'm also not saying it can't be done, but you can easily spend 10x more time working on recognizing stuff than implementing the ANN itself (assuming you already have experience with digital image processing techniques).

I think your idea is very interesting and also very ambitious. At this point you might want to reconsider it. I sense that this project is something you are planning for the university, so if the focus of the work is really ANN you should probably pick another game, something more simple.

I remember that someone else came looking for tips on a different but somehow similar project not too long ago. It's worth checking it out.

On the other hand, there might be better/easier approaches for identifying objects in-game if you're accepting suggestions. But first, let's call this project for what you want it to be: a smart-bot.

One method for implementing bots accesses the memory of the game client to find relevant information, such as the location of the character on the screen and it's health. Reading computer memory is trivial, but figuring out exactly where in memory to look for is not. Memory scanners like Cheat Engine can be very helpful for this.

Another method, which works under the game, involves manipulating rendering information. All objects of the game must be rendered to the screen. This means that the locations of all 3D objects will eventually be sent to the video card for processing. Be ready for some serious debugging.

In this answer I briefly described 2 methods to accomplish what you want through image processing. If you are interested in them you can find more about them on Exploiting Online Games (chapter 6), an excellent book on the subject.