How to deal with different state space size in reinforcement learning?

For the paper, I'll give the same reference as in the other post: Benchmarks for reinforcement learning in mixed-autonomy traffic.

In this approach, indeed, an expected number of agents (expected to be present in the simulation at any moment in time) is predetermined. During runtime, the observations of the agents currently present in the simulation are retrieved and squashed into a container (tensor) of fixed size (let's call it the overall observation container), which can hold as many individual observations as there are agents expected to be present at any moment in time. Just to be clear: size(overall observation container) = expected number of agents * individual observation size. Since the actual number of agents present in the simulation may vary from time step to time step, the following applies:

  • If fewer agents than expected are present in the environment, and hence fewer observations are provided than fit into the overall observation container, zero-padding is used to fill the empty observation slots.
  • If the number of agents exceeds the expected number of agents, only a subset of the provided observations is used: the observations of a randomly selected subset of the available agents are put into the fixed-size overall observation container. The controller then computes actions only for the chosen agents, while the "excess" agents have to be treated as non-controlled agents in the simulation (a short sketch of both cases follows below).

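To make the sizing and the two cases concrete, here is a minimal sketch (the variable names expected_agents and obs_size are mine, purely for illustration):

import numpy as np

expected_agents = 3   # Predetermined number of agents expected at any moment in time
obs_size = 4          # Size of one agent's individual observation

# Case 1: fewer agents present than expected -> zero-pad the remaining slots
observations = np.random.random(size=[2, obs_size])                 # Only 2 agents present
container = np.zeros([expected_agents, obs_size])                   # Fixed-size overall observation container
container[:observations.shape[0], :] = observations                 # Slots 0 and 1 filled, slot 2 stays all-zero

# Case 2: more agents present than expected -> use a random subset
observations = np.random.random(size=[5, obs_size])                 # 5 agents present
chosen = np.random.choice(5, size=expected_agents, replace=False)   # Randomly pick 3 of the 5 agents
container = observations[chosen, :]                                 # Only the chosen agents' observations are used

# In both cases: container.shape == (expected_agents, obs_size)
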
Coming back to your sample code, there are a few things I would do differently.

First, I was wondering why you have both the variable state (passed to the function get_state_new) and the call get_state(env), since I would expect the information returned by get_state(env) to be the same as what is already stored in the variable state. As a tip, the code would be a bit easier to read if you used the state variable only (provided the variable and the function call indeed carry the same information).

The second thing I would do differently is how you process states: p = np.exp(p), p = p * (1. / p.sum()). This normalizes the overall observation container by the sum of all exponentiated values present in all individual observations. In contrast, I would normalize each individual observation in isolation.

The reason is the following: if you provide a small number of observations, the sum of the exponentiated values contained in all individual observations can be expected to be smaller than the sum taken over the exponentiated values of a larger number of individual observations. These differences in the sum, which is then used for normalization, result in different magnitudes of the normalized values (roughly speaking, as a function of the number of individual observations). Consider the following example:

import numpy as np

# Fewer state representations
state = np.array([1,1,1])
state = state/state.sum()
state
# Output: array([0.33333333, 0.33333333, 0.33333333])

# More state representations
state = np.array([1,1,1,1,1])
state = state/state.sum()
state
# Output: array([0.2, 0.2, 0.2, 0.2, 0.2])

The same input state representation, as obtained by an individual agent, should always result in the same output state representation after normalization, regardless of the number of agents currently present in the simulation. So, please make sure to normalize all observations on their own. I'll give an example below.
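
For contrast, here is a minimal sketch of normalizing each observation (row) in isolation, using the same sum-based normalization as above; an agent's normalized observation then no longer depends on how many other agents are present:

import numpy as np

# Two containers with a different number of individual observations
few_obs  = np.array([[1., 2., 1.],
                     [3., 3., 3.]])
many_obs = np.array([[1., 2., 1.],
                     [3., 3., 3.],
                     [5., 1., 4.],
                     [2., 2., 6.]])

# Normalize each individual observation (row) on its own
few_obs  = few_obs  / few_obs.sum(axis=1, keepdims=True)
many_obs = many_obs / many_obs.sum(axis=1, keepdims=True)

# The first agent's normalized observation is identical in both cases
print(few_obs[0])   # -> [0.25 0.5  0.25]
print(many_obs[0])  # -> [0.25 0.5  0.25]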

Also, please make sure to keep track of which agents' observations (and in which order) have been squashed into your variable statappend. This is important for the following reason.

If there are agents A1 through A5, but the overall observation container can take only three observations, three out of five state representations are going to be selected at random. Say the observations randomly selected to be squashed into the overall observation container stem from the following agents in the following order: A2, A5, A1. Then, these agents' observations will be squashed into the overall observation container in exactly this order: first the observation of A2, then that of A5, and finally that of A1. Correspondingly, given the aforementioned overall observation container, the three actions predicted by your reinforcement learning controller will correspond to agents A2, A5, and A1 (in that order), respectively. In other words, the order of the agents on the input side also dictates which agents the predicted actions correspond to on the output side.
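
As a tiny illustration of this bookkeeping (the agent names and the action values are placeholders, not from your code):

agent_ids = ["A1", "A2", "A3", "A4", "A5"]   # All agents currently in the simulation
order_agents = [1, 4, 0]                     # Indices of the randomly selected agents: A2, A5, A1
actions = [0.3, -1.2, 0.7]                   # Placeholder: one predicted action per container slot

# The i-th predicted action belongs to the agent whose observation filled the i-th slot
for action, agent_index in zip(actions, order_agents):
    print(f"{agent_ids[agent_index]} performs action {action}")
# A3 and A4 received no action and are treated as non-controlled agents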

I would propose something like the following:

import numpy as np

def get_overall_observation(observations, expected_observations=5):
    # Return values:
    #   overall_observation: Fixed-size container holding the (normalized, possibly padded or subsampled) observations
    #   order_agents: The returned observations stem from this ordered set of agents (in sequence); -1 marks an empty (padded) slot

    # Get some info
    n_observations = observations.shape[0]  # Actual nr of observations
    observation_size = list(observations.shape[1:])  # Shape of an agent's individual observation

    # Normalize each individual observation in isolation
    for i in range(n_observations):
        max_value = observations[i,:].max()
        if max_value > 0:  # Guard against division by zero for all-zero observations
            observations[i,:] = observations[i,:] / max_value

    if n_observations == expected_observations:
        # Return (normalized) observations as they are & sequence of agents in order (i.e. no randomization)
        order_agents = np.arange(n_observations)
        return observations, order_agents
    if n_observations < expected_observations:
        # Return padded observations as they are & padded sequence of agents in order (i.e. no randomization)
        padded_observations = np.zeros([expected_observations]+observation_size)
        padded_observations[0:n_observations,:] = observations
        order_agents = np.array(list(range(n_observations)) + [-1]*(expected_observations-n_observations))  # -1 == agent absent (padded slot)
        return padded_observations, order_agents
    if n_observations > expected_observations:
        # Return random selection of observations in random order
        order_agents = np.random.choice(range(n_observations), size=expected_observations, replace=False)
        selected_observations = np.zeros([expected_observations] + observation_size)
        for i_selected, i_given_observations in enumerate(order_agents):
            selected_observations[i_selected,:] = observations[i_given_observations,:]
        return selected_observations, order_agents


# Example usage
n_observations = 5      # Number of actual observations
width = height =  2     # Observation dimension
state = np.random.random(size=[n_observations,height,width])  # Random state
print(state)
print(get_overall_observation(state))
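
Building on this, one could then map the controller's output back to the selected agents roughly like this (the call with expected_observations=3 and the random predicted_actions are just placeholders for whatever your actual controller provides):

# Force the subsampling case: 5 agents present, but only 3 expected
overall_observation, order_agents = get_overall_observation(state, expected_observations=3)

# Placeholder: pretend the controller predicted one scalar action per container slot
predicted_actions = np.random.random(size=3)

# Dispatch each action to the agent whose observation filled the corresponding slot
for slot, agent_index in enumerate(order_agents):
    if agent_index == -1:
        continue  # Padded slot, no agent behind it
    print(f"Action {predicted_actions[slot]:.3f} goes to agent {agent_index}")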