Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)

Authors: Ruaridh Mon-Williams, Max Taylor-Davies, Elizabeth Mieczkowski, Natalia Vélez, Neil Bramley, Yanwei Wang, Tom Griffiths, Christopher G Lucas

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We investigate this question in a fully cooperative setting, where agents optimise a shared goal (i.e., a single reward function), but have no prior knowledge of each other's attributes or action policies. Crucially, we train reinforcement learning agents with generic recurrent architectures and only task reward supervision; there are no auxiliary objectives or architectural priors pushing agents to model one another. This stands in contrast to prior work, such as Rabinowitz et al.'s Machine Theory of Mind framework, which relies on specialised components optimised explicitly to infer other agents' internal states [12]. We find that despite these minimal inductive biases, agents develop structured, internal representations that (i) encode the different competencies of their partners; (ii) generalise to previously unseen collaborators; and (iii) emerge selectively, depending on agents' ability to control task allocation. Together, our findings suggest that partner modelling can arise within artificial agents solely from the demands of flexible cooperation, without explicit incentives or specialised architectures.
Researcher Affiliation	Academia	1University of Edinburgh 2Princeton University 3Massachusetts Institute of Technology
Pseudocode	No	The paper describes methods and procedures in paragraph form and does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The full code for this paper is available at: https://github.com/ruaridhmon/emergent_partner_modelling
Open Datasets	Yes	We provide this through Overcooked-AI [43], a fully cooperative environment where agents work together to prepare soups, tasked with maximising the throughput r = Soup/Time (for additional results in a second cooperative environment, see Appendix A). Each agent must navigate a shared kitchen to gather ingredients, cook them, and serve the completed soups, making success heavily dependent on coordination and division of labour. The environment difficulty can be modified via different recipes, which vary in complexity and number of ingredients, and different kitchen layouts, which introduce specific constraints (e.g. encouraging agents to pass items to perform the task successfully, or forcing agents to navigate around each other in cramped spaces). Figure 1 illustrates one such layout.
Dataset Splits	Yes	Each probe is trained for a total of 1e3 steps using the Adam optimiser with a learning rate of 1e-2. We use a train-test split of 80-20 over rollout seeds (i.e. where we have 20 seeds per partner speed combination, we randomly assign 16 of those seeds to the train set and 4 to the test set).
Hardware Specification	Yes	All experiments were run on A100 GPUs, totalling 462 GPU-hours. Policy training used 225 GPU-hours for all three experiments. A total of 37,400 rollouts were simulated, totalling 207 GPU-hours.
Software Dependencies	Yes	The ego policy is trained using Proximal Policy Optimisation (PPO) [45], implemented in JAX [46] to facilitate efficient parallel training.
Experiment Setup	Yes	The ego policy is trained on a single GPU using Proximal Policy Optimisation (PPO), running synchronously across 256 parallel Overcooked-AI environment instances. Both agent and environment are implemented in JAX with jax.jit for accelerated gradient updates and rollout collection. Each rollout lasts 400 timesteps in Experiments 1 and 3, and 600 timesteps in Experiment 2, and agents are rewarded for every successful soup delivery. Training runs for 15 million timesteps against a distribution of partner agents. For the first 5 million timesteps, a decayed reward shaping is used to aid policy learning (rewarding the agent for putting onions in the pot and for cooking the soup when it contains the correct ingredients). ... The RNN hyperparameters used for the experiments are shown in Table 1.
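
The Dataset Splits entry above describes an 80-20 train-test split made over rollout seeds rather than over individual timesteps. A minimal sketch of such a seed-level split follows. The function name `split_seeds`, the `rng_seed` parameter, and the use of Python's standard `random` module are illustrative assumptions, not taken from the paper's repository, which may implement the split differently.

```python
import random


def split_seeds(seeds, train_fraction=0.8, rng_seed=0):
    """Randomly assign rollout seeds to a train set and a test set.

    Hypothetical helper: splitting at the seed level keeps every
    timestep of a given rollout on one side of the split.
    """
    rng = random.Random(rng_seed)  # local RNG so the split is repeatable
    shuffled = list(seeds)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


# 20 seeds per partner-speed combination, as in the quoted text.
train_seeds, test_seeds = split_seeds(range(20))
print(len(train_seeds), len(test_seeds))
```

With 20 seeds per combination this yields 16 training seeds and 4 test seeds, matching the counts quoted in the table.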