Language Instructed Reinforcement Learning for Human-AI Coordination
Authors: Hengyuan Hu, Dorsa Sadigh
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We initially evaluate our method in the purposely designed Say-Select game discussed above, where we show that our method learns intuitive, human-compatible policies as instructed by the language instructions. Then, we evaluate our method in the large scale Hanabi benchmark (Bard et al., 2020). |
| Researcher Affiliation | Academia | Hengyuan Hu¹, Dorsa Sadigh¹ (¹Stanford University). Correspondence to: Hengyuan Hu <hengyuan.hhu@gmail.com>. |
| Pseudocode | Yes | We summarize our method in Algorithm 1 and provide an illustration of the instructQ version in Figure 2. |
| Open Source Code | Yes | In this section, we will demonstrate instructRL in two multiagent coordination game settings: Say-Select in Sec. 5.1 and Hanabi in Sec. 5.2. We will open-source code and models for both experiments. |
| Open Datasets | Yes | Then, we evaluate our method in the large scale Hanabi benchmark (Bard et al., 2020). |
| Dataset Splits | No | The paper uses reinforcement learning environments (Say-Select, Hanabi) rather than fixed datasets with explicit training, validation, and test splits. It mentions training over mini-batches and running multiple seeds, but does not specify dataset percentages or counts for traditional data splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions specific LLMs like GPT-J and GPT-3.5, and bases its implementation on an 'open sourced repository of off-belief learning (OBL) (Hu et al., 2021)', but does not provide specific version numbers for general software dependencies like Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | We use tabular Q-learning with no neural network as the state space is small enough and regularization weight λ = 0.25 for instructQ. Details on the hyper-parameters are in Appendix A.1. For instructQ, we set the regularization weight to λ = 0.15 initially and anneal λ by half every 50K mini-batches. The policy is trained for a total of 250K mini-batches. Each mini-batch contains 128 episodes of games. For instructPPO, we set the regularization weight to λ = 0.05 initially and linearly anneal it by 0.008 every 50K mini-batches until it reaches 0.01 after 250K mini-batches. We then train for another 500K mini-batches with λ = 0.01. |
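
The regularization-weight schedules quoted in the Experiment Setup row can be made concrete with a short sketch. This is not the authors' released code: the function names and the assumption that λ changes in discrete steps at every 50K mini-batch boundary are our own reading of the quoted description.

```python
# Minimal sketch (assumed, not the authors' implementation) of the lambda
# schedules described in the Experiment Setup quote. For the Say-Select
# experiments the paper uses a fixed lambda = 0.25 with tabular Q-learning;
# the schedules below correspond to the Hanabi experiments.

def instruct_q_lambda(minibatch: int) -> float:
    """instructQ: lambda starts at 0.15 and is halved every 50K mini-batches;
    training runs for 250K mini-batches in total."""
    return 0.15 * 0.5 ** (minibatch // 50_000)


def instruct_ppo_lambda(minibatch: int) -> float:
    """instructPPO: lambda starts at 0.05 and is reduced by 0.008 every
    50K mini-batches until it reaches 0.01 at 250K mini-batches, then stays
    at 0.01 for another 500K mini-batches."""
    return max(0.01, 0.05 - 0.008 * (minibatch // 50_000))


if __name__ == "__main__":
    # Spot-check the schedules at a few points (each mini-batch = 128 episodes).
    for mb in (0, 50_000, 100_000, 250_000, 750_000):
        print(f"{mb:>7}  instructQ={instruct_q_lambda(mb):.5f}  "
              f"instructPPO={instruct_ppo_lambda(mb):.5f}")
```

Under these assumptions, the instructPPO weight hits its floor of 0.01 exactly at 250K mini-batches (0.05 − 5 × 0.008), matching the quoted description.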