Q-Learning Applied to a Two-Player Game

In general, directly applying Q-learning to a two-player game (or other kind of multi-agent environment) isn't likely to lead to very good results if you assume that the opponent can also learn. However, you specifically mentioned

for playing against a random agent

and that means it actually can work, because the opponent isn't learning or changing its behaviour, so you can reliably treat the opponent as "a part of the environment".

Doing exactly that is also likely the best approach you can take. Treating the opponent (and its actions) as a part of the environment means that you simply ignore all of the states in which the opponent is to move. Whenever your agent takes an action, you immediately generate an action for the opponent as well, and only then take the resulting state as the next state.
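A minimal sketch of what such an environment wrapper can look like, assuming a hypothetical `game` object with `initial_state()`, `legal_actions(state)`, `apply(state, action)`, `is_terminal(state)` and `reward(state)` methods (these names are placeholders, not from any particular library):

```python
import random

class OpponentFoldedEnv:
    """Wraps a two-player game so that the fixed (e.g. random) opponent's
    move is folded into every step, and the learning agent only ever
    observes states in which it is to move."""

    def __init__(self, game, opponent_policy=None):
        self.game = game
        # Default opponent: uniformly random over legal moves.
        self.opponent_policy = opponent_policy or (
            lambda state: random.choice(game.legal_actions(state)))

    def reset(self):
        return self.game.initial_state()

    def step(self, state, action):
        # 1. Apply the learning agent's action.
        state = self.game.apply(state, action)
        if self.game.is_terminal(state):
            return state, self.game.reward(state), True
        # 2. Immediately apply the opponent's action; the agent never
        #    observes the intermediate state in which the opponent moves.
        opp_action = self.opponent_policy(state)
        state = self.game.apply(state, opp_action)
        done = self.game.is_terminal(state)
        return state, (self.game.reward(state) if done else 0.0), done
```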

So, in the tuple (s, a, r, s'), we have:

  • s = state in which your agent is to move
  • a = action executed by your agent
  • r = one-step reward
  • s' = next state in which your agent is to move again

The state in which the opponent is to move, and the action it takes, do not appear at all. They are simply treated as unobservable, nondeterministic parts of the environment. From the point of view of your algorithm, there are no intermediate states between s and s' in which an opponent can act; the environment just appears nondeterministic, meaning that taking action a in state s will sometimes randomly lead to s', but may also sometimes randomly lead to a different state s''.
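Putting the two together, a tabular Q-learning loop over these agent-only transitions could look like the sketch below. It reuses the hypothetical wrapper from above, and the hyperparameters (alpha, gamma, epsilon, episode count) are arbitrary illustrative choices; states are assumed to be hashable (e.g. tuples) so they can serve as dictionary keys.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=10000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # maps (state, action) -> estimated value

    def greedy(state, actions):
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            actions = env.game.legal_actions(s)
            # Epsilon-greedy exploration over the agent's own actions only.
            a = random.choice(actions) if random.random() < epsilon else greedy(s, actions)
            # The opponent's move is hidden inside env.step(); from the
            # agent's perspective the transition is just nondeterministic.
            s_next, r, done = env.step(s, a)
            if done:
                target = r
            else:
                next_actions = env.game.legal_actions(s_next)
                target = r + gamma * max(Q[(s_next, a2)] for a2 in next_actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```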


Note that this only works because you wrote that the opponent is a random agent (or, more importantly, a non-learning agent with a fixed policy). As soon as the opponent also gains the ability to learn, this breaks down completely, and you'd have to move on to proper multi-agent versions of Reinforcement Learning algorithms.