Assistants that simulate users

In many cases it is useful to have an assistant maintain a model of a user, for example to perform single-shot predictions (what will the user do next?) or full simulations by means of rollouts (what is the end result if I follow this policy throughout the task?). These predictions can then be used in the assistant's decision-making process.

Since CoopIHC defines users as classes (and/or instances), it seems natural to pass a user to an assistant, which can then query it to perform predictions.

Single-shot predictions

Single-shot predictions are easy to implement with basic CoopIHC functions. Below is an example where we consider a coordination task: the user and the assistant have to select the same action to make the task state increase by one. The coordination succeeds because the assistant manages a simulation of the user that provides single-shot predictions of the next user action.

We first modify the ExampleTask that we used in the Quickstart:

class CoordinatedTask(InteractionTask):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.state["x"] = discrete_array_element(init=0, low=0, high=9)

    def reset(self, dic=None):
        self.state["x"] = 0
        return

    def on_user_action(self, *args, **kwargs):
        is_done = False

        if self.state["x"] == 9:
            is_done = True

        if self.round_number == 100:
            is_done = True

        reward = -1
        return self.state, reward, is_done

    def on_assistant_action(self, *args, **kwargs):
        is_done = False

        if self.user_action == self.assistant_action:
            self.state["x"] += 1

        reward = -1
        return self.state, reward, is_done

The premise of the task is very simple: if the assistant and the user select the same action (self.user_action == self.assistant_action), the task state is incremented. When the task state reaches 9, the task is finished.

We then create a pseudorandom user that uses a pseudorandom policy. It picks the action prescribed by the formula \(8 + p_0 \cdot{} x + p_1 \cdot{} x^2 + p_2 \cdot{} x^3 \pmod{10}\), where the \(p_i\) are user parameters and \(x\) is the task state.

class PseudoRandomPolicy(BasePolicy):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @BasePolicy.default_value
    def sample(self, agent_observation=None, agent_state=None):
        x = agent_observation.task_state.x

        _action_value = (
            8 + self.state.p0 * x + self.state.p1 * x * x + self.state.p2 * x * x * x
        ) % 10
        print(f"sampled: {_action_value}")
        return _action_value, 0

The assistant is then constructed as follows:

class CoordinatedAssistant(BaseAgent):
    def __init__(self, user_model=None, *args, **kwargs):

        self.user_model = user_model

        # Action state for the CoordinatedPolicy defined below
        action_state = State()
        action_state["action"] = cat_element(N=10, init=0)

        # Use default observation and inference engines
        observation_engine = None
        inference_engine = None

        super().__init__(
            "assistant",
            *args,
            agent_policy=CoordinatedPolicy(action_state=action_state),
            agent_observation_engine=observation_engine,
            agent_inference_engine=inference_engine,
            **kwargs
        )

    def finit(self):
        copy_task = copy.deepcopy(self.task)
        self.simulation_bundle = Bundle(task=copy_task, user=self.user_model)

Notice that:

  • it expects a user_model to be passed during initialization.

  • it uses the finit mechanism to create a simulation that the assistant can use. That simulation is nothing more than a Bundle of the task and the user model.

This simulation is actually used in the policy of the assistant:

class CoordinatedPolicy(BasePolicy):
    @property
    def simulation_bundle(self):
        return self.host.simulation_bundle

    @BasePolicy.default_value
    def sample(self, agent_observation=None, agent_state=None):

        reset_dic = {"task_state": agent_observation.task_state}

        self.simulation_bundle.reset(dic=reset_dic)
        self.simulation_bundle.step(turn=2)

        _action_value = self.simulation_bundle.user.action

        return _action_value, 0

The policy of the assistant is straightforward:

  • Observe the current state of the game and put the simulation in that state, via the reset mechanism.

  • Play the simulation just far enough that the user model takes an action.

  • Observe the action taken by the user model and pick the same one.

At each turn, the assistant takes the same action as the user model. If we provide the assistant with the true model of the user, then the coordination is perfect:

user = PseudoRandomUser()
user_model = PseudoRandomUser()  # The same as user

assistant = CoordinatedAssistant(user_model=user_model)
bundle = Bundle(task=CoordinatedTask(), user=user, assistant=assistant)
bundle.reset(go_to=3)
while True:
    obs, rewards, is_done = bundle.step()
    # print(bundle.game_state)
    if is_done:
        break

Rollouts

Usually, we need a more comprehensive simulation that spans several steps and features both a user and an assistant. Using the same assistant simultaneously in a bundle and in a simulation is not straightforward, so CoopIHC provides a few helper classes. In particular, the inference engines, policies, etc. used during the simulation cannot be the same as the ones used during execution of the bundle (otherwise you would get infinite recursion). CoopIHC therefore offers the possibility of having two different inference engines and policies, via the so-called DualInferenceEngine and DualPolicy. Depending on the state of the engine, either the primary or the dual engine is used (and similarly for the policy).
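To make this concrete, here is a purely conceptual sketch of how such a dual object behaves. This is not the actual CoopIHC implementation, and the use_primary flag is hypothetical (CoopIHC handles the switch for you, e.g. when a Simulator runs); the real DualInferenceEngine is constructed in the assistant code further below:

# Conceptual sketch only -- not the CoopIHC implementation.
class ConceptualDualEngine:
    def __init__(self, primary_engine, dual_engine):
        self.primary_engine = primary_engine
        self.dual_engine = dual_engine
        # Hypothetical flag; CoopIHC switches between primary and dual itself.
        self.use_primary = True

    def infer(self, *args, **kwargs):
        # Dispatch to the primary engine during normal bundle execution, and
        # to the dual engine inside a simulation, which avoids infinite
        # recursion.
        engine = self.primary_engine if self.use_primary else self.dual_engine
        return engine.infer(*args, **kwargs)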

To illustrate these, let’s go over a variation of the previous example:

task = CoordinatedTask()
task_model = CoordinatedTask()

user = PseudoRandomUserWithParams(p=[1, 5, 7])
user_model = copy.deepcopy(user)

assistant = CoordinatedAssistantWithRollout(task_model, user_model, [5, 7])
bundle = Bundle(task=task, user=user, assistant=assistant)
bundle.reset()

while True:
    obs, rewards, is_done = bundle.step()
    if is_done:
        break

Notice that the parameters of the PseudoRandomPolicy are now given at initialization via PseudoRandomUserWithParams (before, they were hard-coded in the user). If you look at the assistant, you see that we pass it a model of the task, a model of the user, as well as two parameters. These parameters are the last two parameters of the user model; the first one is unknown to the assistant. The point of the assistant is now to infer that parameter using the models of the task and the user it was given.
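The PseudoRandomUserWithParams class is not shown in this section. Below is a minimal sketch of what it could look like, assuming it simply stores the three parameters in its state (where the PseudoRandomPolicy defined earlier reads them); the actual class shipped with the CoopIHC examples may differ:

class PseudoRandomUserWithParams(BaseAgent):
    def __init__(self, p=[0, 0, 0], *args, **kwargs):
        # Store the policy parameters in the user's state so that the
        # PseudoRandomPolicy can read them as self.state.p0, p1, p2.
        state = State()
        state["p0"] = discrete_array_element(init=p[0], low=0, high=9)
        state["p1"] = discrete_array_element(init=p[1], low=0, high=9)
        state["p2"] = discrete_array_element(init=p[2], low=0, high=9)

        action_state = State()
        action_state["action"] = discrete_array_element(init=0, low=0, high=9)

        super().__init__(
            "user",
            *args,
            agent_state=state,
            agent_policy=PseudoRandomPolicy(action_state=action_state),
            **kwargs
        )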

The code for the assistant is as follows:

class CoordinatedAssistantWithRollout(BaseAgent):
    def __init__(self, task_model, user_model, p, **kwargs):
        state = State()
        state["p0"] = discrete_array_element(init=0, low=0, high=9)
        state["p1"] = discrete_array_element(init=p[0], low=0, high=9)
        state["p2"] = discrete_array_element(init=p[1], low=0, high=9)

        # Use default observation engine
        inference_engine = DualInferenceEngine(
            primary_inference_engine=RolloutCoordinatedInferenceEngine(
                task_model, user_model, self
            ),
            dual_inference_engine=BaseInferenceEngine(),
            primary_kwargs={},
            dual_kwargs={},
        )

        policy = PseudoRandomPolicy(
            action_state=State(
                **{"action": discrete_array_element(init=0, low=0, high=9)}
            )
        )

        super().__init__(
            "assistant",
            agent_state=state,
            agent_policy=policy,
            agent_observation_engine=None,
            agent_inference_engine=inference_engine,
            **kwargs
        )

The state p0 is the one that needs to be determined. Once it is known, the assistant can simply use the PseudoRandomPolicy to select the same action as the user.

The DualInferenceEngine holds two inference engines: the primary RolloutCoordinatedInferenceEngine which is used during the bundle execution, and the dual BaseInferenceEngine which is used for the simulation.

The remaining code is in the RolloutCoordinatedInferenceEngine:

class RolloutCoordinatedInferenceEngine(BaseInferenceEngine):
    def __init__(self, task_model, user_model, assistant, **kwargs):
        super().__init__(**kwargs)
        self.task_model = task_model
        self.user_model = user_model
        self.assistant = assistant
        self._simulator = None
        self.__inference_count = 0

    # Define the simulator here. A Simulator is called like a Bundle, but it
    # will use the dual version of objects if available.
    @property
    def simulator(self):
        if self._simulator is None:
            self._simulator = Simulator(
                task_model=self.task_model,
                user_model=self.user_model,
                assistant=self.assistant,
            )
        return self._simulator

    @BaseInferenceEngine.default_value
    def infer(self, agent_observation=None):

        # Only perform the rollout inference the first time; afterwards, just
        # perform a BaseInference. We can do this because we know the
        # parameter p0 will not evolve over time.
        if self.__inference_count > 0:
            return super().infer(agent_observation=agent_observation)

        self.__inference_count += 1

        agent_state = getattr(agent_observation, f"{self.role}_state")

        # The agent state will be altered in the simulator, so keep a copy of
        # it for reference.
        mem_state = copy.deepcopy(agent_state)

        # For the 10 possible values of p0, run a complete simulation. The
        # right parameter is the one that leads to the maximum reward.
        rew = [0 for i in range(10)]
        for i in range(10):  # Exhaustively try out all cases
            # Load the simulation with the right parameters
            reset_dic = copy.deepcopy(agent_observation)
            # Try out a new state
            del reset_dic["assistant_state"]
            del reset_dic["game_info"]
            reset_dic = {
                **{"game_info": {"round_index": 0, "turn_index": 0}},
                **reset_dic,
                **{
                    "assistant_state": {
                        "p0": i,
                        "p1": mem_state.p1,
                        "p2": mem_state.p2,
                    }
                },
            }
            self.simulator.reset(dic=reset_dic)
            while True:
                state, rewards, is_done = self.simulator.step()
                rew[i] += sum(rewards.values())

                if is_done:
                    break

        # Don't forget to close the simulator when you are finished.
        self.simulator.close()
        index = numpy.argmax(rew)
        self.state["p0"] = index
        return self.state, 0

First, we define a simulator object. For that, simply instantiate a Simulator as you would a Bundle. The difference between a simulator and a bundle is that the former considers the dual versions of the objects. The inference is then straightforward: all possible values of p0 are tested, and the correct one is the one that leads to the highest reward.
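As a quick sanity check (a sketch assuming the earlier snippet, where the user was created with p=[1, 5, 7]), the inferred parameter can be read from the assistant's state once the bundle has run:

# After the while loop of the earlier snippet has finished:
print(assistant.state["p0"])  # expected to match the user's first parameter, i.e. 1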