On the Critical Role of Conventions in Adaptive Human-AI Collaboration

Authors: Andy Shih, Arjun Sawhney, Jovana Kondic, Stefano Ermon, Dorsa Sadigh

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we propose a learning framework that teases apart rule-dependent representation from convention-dependent representation in a principled way. We show that, under some assumptions, our rule-dependent representation is a sufficient statistic of the distribution over best-response strategies across partners. Using this separation of representations, our agents are able to adapt quickly to new partners, and to coordinate with old partners on new tasks in a zero-shot manner. We experimentally validate our approach on three collaborative tasks varying in complexity: a contextual multi-armed bandit, a block placing task, and the card game Hanabi. (An illustrative sketch of such a modular split follows the table.)
Researcher Affiliation | Academia | Andy Shih, Arjun Sawhney, Jovana Kondic, Stefano Ermon & Dorsa Sadigh. Stanford University, Princeton University. {andyshih,arjunsawhney,ermon,dorsa}@cs.stanford.edu, jkondic@princeton.edu
Pseudocode | Yes | Algorithm 1: Learning Separate Representations for Partners and Tasks
Open Source Code | Yes | Code for the experiments in our paper is available at https://github.com/Stanford-ILIAD/Conventions-ModularPolicy.
Open Datasets | Yes | Task setup: We used the Hanabi Learning Environment package (Bard et al., 2020), with the following configuration: 1 color, 5 ranks, 2 players, hand size 2, 3 information tokens, and 3 life tokens. The maximum score is 5 points. (A configuration sketch follows the table.)
Dataset Splits | No | The paper discusses training on and adapting to partners and tasks, but it does not provide explicit dataset splits (e.g., percentages or sample counts for training, validation, and test sets) that would allow the data partitioning to be reproduced.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU models, GPU types, or memory configurations.
Software Dependencies | No | The paper mentions 'Proximal Policy Optimization (PPO)' and the 'Stable Baselines software package (Raffin et al., 2019)' but does not give version numbers for these libraries or for other key dependencies beyond the publication year of the Stable Baselines paper. (A version-logging sketch follows the table.)
Experiment Setup | Yes | Appendix B (Architecture Details and Hyperparameters) contains tables listing the hyperparameters for training and adaptation: for example, Table 2 gives 'Timesteps', 'Minibatch size', 'Num. epochs', and 'Learning Rate' for training self-play partners, and Table 1 details the layer sizes for the modules. (A hedged PPO setup sketch follows the table.)
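
The 'Research Type' and 'Pseudocode' rows describe a policy that separates a rule-dependent (task) representation from a convention-dependent (partner) representation, trained with Algorithm 1. Below is a minimal PyTorch-style sketch of one way such a split could be organized; the layer sizes, the way the two modules are composed, and the freeze-and-finetune adaptation step are illustrative assumptions, not the authors' exact architecture or algorithm.

import torch
import torch.nn as nn

class ModularPolicy(nn.Module):
    """Toy split of a policy into a rule-dependent (task) module and a
    convention-dependent (partner) module."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        # Rule-dependent representation: intended to depend only on the task.
        self.task_module = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Convention-dependent head: one such module per partner.
        self.partner_module = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.partner_module(self.task_module(obs))

policy = ModularPolicy(obs_dim=10, act_dim=4)

# Adapting to a new partner on the same task: keep the task module fixed and
# train only a fresh partner module (the paper's actual training and
# adaptation procedure is given by Algorithm 1; this is only an illustration).
for p in policy.task_module.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(policy.partner_module.parameters(), lr=3e-4)
logits = policy(torch.randn(1, 10))  # action logits for a dummy observation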
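
The 'Open Datasets' row reports the exact Hanabi configuration. Assuming the standard Python interface of the Hanabi Learning Environment package (Bard et al., 2020), that configuration could be instantiated roughly as follows; the config keys should be checked against the installed version of the package.

# pip install hanabi-learning-environment
from hanabi_learning_environment import rl_env

# Small Hanabi variant from the paper: 1 color, 5 ranks, 2 players,
# hand size 2, 3 information tokens, 3 life tokens (maximum score 5).
config = {
    "colors": 1,
    "ranks": 5,
    "players": 2,
    "hand_size": 2,
    "max_information_tokens": 3,
    "max_life_tokens": 3,
}
env = rl_env.HanabiEnv(config=config)  # remaining settings use package defaults
observations = env.reset()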
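
The 'Software Dependencies' row flags that no library versions are given. A reproduction would need to record them explicitly; the snippet below is one generic way to log installed versions, and the package names listed are assumptions about what a reproduction might use rather than a list taken from the paper.

# Log exact versions of key dependencies alongside experiment results.
from importlib.metadata import PackageNotFoundError, version

# Hypothetical dependency list for a reproduction attempt; adjust as needed.
PACKAGES = ["stable-baselines3", "hanabi-learning-environment", "torch", "numpy"]

for name in PACKAGES:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")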
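
The 'Experiment Setup' row points to Appendix B, where 'Timesteps', 'Minibatch size', 'Num. epochs', and 'Learning Rate' are listed for PPO training. These map directly onto PPO arguments; the sketch below uses the Stable-Baselines3 API as a stand-in for the Stable Baselines package cited in the paper, with placeholder values and a placeholder environment rather than the settings from Table 2.

# pip install stable-baselines3
from stable_baselines3 import PPO

# Placeholder hyperparameters and environment; substitute the values from
# Appendix B, Table 2, and the actual collaborative task environment.
model = PPO(
    "MlpPolicy",
    "CartPole-v1",        # stand-in environment for illustration
    learning_rate=3e-4,   # "Learning Rate"
    batch_size=64,        # "Minibatch size"
    n_epochs=10,          # "Num. epochs"
    verbose=0,
)
model.learn(total_timesteps=100_000)  # "Timesteps"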