Language Instructed Reinforcement Learning for Human-AI Coordination

Authors: Hengyuan Hu, Dorsa Sadigh

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We initially evaluate our method in the purposely designed Say-Select game discussed above, where we show that our method learns intuitive, human-compatible policies as instructed by the language instructions. Then, we evaluate our method in the large-scale Hanabi benchmark (Bard et al., 2020).
Researcher Affiliation | Academia | Hengyuan Hu and Dorsa Sadigh, Stanford University. Correspondence to: Hengyuan Hu <hengyuan.hhu@gmail.com>.
Pseudocode | Yes | We summarize our method in Algorithm 1 and provide an illustration of the instructQ version in Figure 2.
Open Source Code | Yes | In this section, we will demonstrate instructRL in two multi-agent coordination game settings: Say-Select in Sec. 5.1 and Hanabi in Sec. 5.2. We will open-source code and models for both experiments.
Open Datasets | Yes | Then, we evaluate our method in the large-scale Hanabi benchmark (Bard et al., 2020).
Dataset Splits | No | The paper uses reinforcement learning environments (Say-Select, Hanabi) rather than fixed datasets with explicit training, validation, and test splits. It mentions training over mini-batches and running multiple seeds, but does not specify dataset percentages or counts for traditional data splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments.
Software Dependencies | No | The paper mentions specific LLMs such as GPT-J and GPT-3.5 and bases its implementation on the open-sourced repository of off-belief learning (OBL) (Hu et al., 2021), but does not give version numbers for general software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | We use tabular Q-learning with no neural network, as the state space is small enough, and regularization weight λ = 0.25 for instructQ. Details on the hyper-parameters are in Appendix A.1. For instructQ, we set the regularization weight to λ = 0.15 initially and anneal λ by half every 50K mini-batches. The policy is trained for a total of 250K mini-batches. Each mini-batch contains 128 episodes of games. For instructPPO, we set the regularization weight to λ = 0.05 initially and linearly anneal it by 0.008 every 50K mini-batches until it reaches 0.01 after 250K mini-batches. We then train for another 500K mini-batches with λ = 0.01.
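The quoted setup pins down the regularization-weight schedules precisely enough to sketch. Below is a minimal Python reconstruction of the two λ schedules for the Hanabi runs, offered as an illustration rather than the authors' implementation: the function names are hypothetical, the step-wise reading of the instructQ halving and the continuous reading of the instructPPO linear anneal are assumptions, and only the constants (0.15, 0.25, 0.05, 0.008, 0.01, 50K, 250K, 128) come from the excerpt above.

```python
# Hypothetical sketch of the lambda (regularization weight) schedules described
# in the paper's experiment setup. Only the numeric constants are taken from the
# text; everything else (names, step-wise vs. continuous interpretation) is an
# assumption. For Say-Select, the paper uses tabular Q-learning with a fixed
# lambda = 0.25, so no schedule is needed there.

def instructq_lambda(num_minibatches: int, lam0: float = 0.15) -> float:
    """instructQ (Hanabi): start at lambda = 0.15 and halve it every 50K
    mini-batches; training runs for 250K mini-batches total, with 128 episodes
    per mini-batch."""
    return lam0 * 0.5 ** (num_minibatches // 50_000)


def instructppo_lambda(num_minibatches: int, lam0: float = 0.05) -> float:
    """instructPPO (Hanabi): start at lambda = 0.05 and anneal linearly by
    0.008 per 50K mini-batches until it reaches 0.01 at 250K mini-batches,
    then hold it at 0.01 for another 500K mini-batches. A continuous ramp is
    assumed here; the paper's wording would also admit 50K-step increments."""
    return max(lam0 - 0.008 * (num_minibatches / 50_000), 0.01)


if __name__ == "__main__":
    # Print the schedules at a few checkpoints (instructQ itself only trains
    # to 250K mini-batches; instructPPO continues to 750K).
    for step in (0, 50_000, 100_000, 250_000, 750_000):
        print(f"{step:>7d}  instructQ lambda={instructq_lambda(step):.4f}  "
              f"instructPPO lambda={instructppo_lambda(step):.4f}")
```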