On the Critical Role of Conventions in Adaptive Human-AI Collaboration
Authors: Andy Shih, Arjun Sawhney, Jovana Kondic, Stefano Ermon, Dorsa Sadigh
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose a learning framework that teases apart rule-dependent representation from convention-dependent representation in a principled way. We show that, under some assumptions, our rule-dependent representation is a sufficient statistic of the distribution over best-response strategies across partners. Using this separation of representations, our agents are able to adapt quickly to new partners, and to coordinate with old partners on new tasks in a zero-shot manner. We experimentally validate our approach on three collaborative tasks varying in complexity: a contextual multi-armed bandit, a block placing task, and the card game Hanabi. |
| Researcher Affiliation | Academia | Andy Shih, Arjun Sawhney, Jovana Kondic, Stefano Ermon & Dorsa Sadigh; Stanford University, Princeton University; {andyshih,arjunsawhney,ermon,dorsa}@cs.stanford.edu, jkondic@princeton.edu |
| Pseudocode | Yes | Algorithm 1: Learning Separate Representations for Partners and Tasks (an illustrative modular-policy sketch follows the table) |
| Open Source Code | Yes | Code for the experiments in our paper is available at https://github.com/Stanford-ILIAD/Conventions-ModularPolicy. |
| Open Datasets | Yes | Task setup: We used the Hanabi Learning Environment package (Bard et al., 2020), with the following configuration: 1 color, 5 ranks, 2 players, hand size 2, 3 information tokens, and 3 life tokens. The maximum score is 5 points. (A configuration sketch follows the table.) |
| Dataset Splits | No | The paper discusses training and adapting to partners and tasks, but it does not explicitly provide specific dataset splits (e.g., percentages or sample counts for training, validation, and test sets) for reproducibility of data partitioning. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU models, GPU types, or memory configurations. |
| Software Dependencies | No | The paper mentions using 'Proximal Policy Optimization (PPO)' and the 'Stable Baselines software package (Raffin et al., 2019)' but does not specify version numbers for these software libraries or other key dependencies beyond the publication year of the Stable Baselines paper. |
| Experiment Setup | Yes | Appendix B ARCHITECTURE DETAILS AND HYPERPARAMETERS contains tables that list specific hyperparameters for training and adapting models. For example, Table 2 lists 'Timesteps', 'Minibatch size', 'Num. epochs', and 'Learning Rate' for training self-play partners, and Table 1 details layer sizes for the modules. (A PPO training sketch follows the table.) |
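
The Research Type and Pseudocode rows point to the paper's central idea: a rule-dependent (task) representation shared across partners, composed with convention-dependent (partner) modules. The sketch below is a minimal illustration of that separation, not the paper's actual Algorithm 1 or architecture; the class names, layer sizes, latent dimension, and the way the two modules are composed are all assumptions.

```python
# Illustrative only: a modular policy that separates a shared, rule-dependent
# task module from per-partner, convention-dependent heads. Names, sizes, and
# the composition order are assumptions, not the authors' released code.
import torch
import torch.nn as nn

class TaskModule(nn.Module):
    """Rule-dependent representation, shared across all partners."""
    def __init__(self, obs_dim, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, z_dim), nn.ReLU())

    def forward(self, obs):
        return self.net(obs)

class PartnerModule(nn.Module):
    """Convention-dependent head, trained separately for each partner."""
    def __init__(self, z_dim, act_dim):
        super().__init__()
        self.head = nn.Linear(z_dim, act_dim)

    def forward(self, z):
        return torch.distributions.Categorical(logits=self.head(z))

class ModularPolicy(nn.Module):
    """Compose the shared task module with one partner-specific module."""
    def __init__(self, task_module, partner_module):
        super().__init__()
        self.task_module = task_module
        self.partner_module = partner_module

    def forward(self, obs):
        return self.partner_module(self.task_module(obs))

# Adapting to a new partner reuses the shared task module and trains only a
# new head, e.g.:
# new_policy = ModularPolicy(shared_task_module, PartnerModule(64, act_dim))
```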
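The Open Datasets row quotes the Hanabi task setup (1 color, 5 ranks, 2 players, hand size 2, 3 information tokens, 3 life tokens). Below is a minimal sketch of instantiating that configuration with the hanabi_learning_environment package's rl_env interface; whether the authors used this exact entry point, observation type, or seed handling is an assumption.

```python
# Minimal sketch: building the Hanabi configuration quoted above with the
# hanabi_learning_environment package. The authors' exact setup may differ.
from hanabi_learning_environment import rl_env, pyhanabi

config = {
    "colors": 1,                   # 1 color
    "ranks": 5,                    # 5 ranks, so the maximum score is 5
    "players": 2,                  # 2 players
    "hand_size": 2,                # hand size 2
    "max_information_tokens": 3,   # 3 information tokens
    "max_life_tokens": 3,          # 3 life tokens
    # Observation type is an assumption; CARD_KNOWLEDGE is the package default.
    "observation_type": pyhanabi.AgentObservationType.CARD_KNOWLEDGE.value,
}

env = rl_env.HanabiEnv(config=config)
obs = env.reset()
```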
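The Software Dependencies and Experiment Setup rows mention PPO via the Stable Baselines package and hyperparameters such as timesteps, minibatch size, number of epochs, and learning rate. The following is a hedged sketch written against the Stable-Baselines3 API, which may not match the version the authors cite; the environment and every hyperparameter value are placeholders, not the numbers from the paper's Table 2.

```python
# Hedged sketch of a PPO training loop in the style of Stable-Baselines3.
# The environment and all hyperparameter values are placeholders; the paper's
# Appendix B gives the actual architecture and values, and the authors'
# Stable Baselines version may expose a different API.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # stand-in for the collaborative task environment

model = PPO(
    "MlpPolicy",
    env,
    batch_size=64,       # "Minibatch size" in Table 2 (placeholder value)
    n_epochs=10,         # "Num. epochs" (placeholder value)
    learning_rate=3e-4,  # "Learning Rate" (placeholder value)
    verbose=1,
)
model.learn(total_timesteps=100_000)  # "Timesteps" (placeholder value)
model.save("selfplay_partner")
```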