Reward Design with Language Models
Authors: Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DEALORNODEAL negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning. |
| Researcher Affiliation | Collaboration | Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh; Stanford University, DeepMind |
| Pseudocode | No | The paper describes the framework and methods in narrative text and figures, but does not include any explicitly labeled pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Code and prompts can be found here. |
| Open Datasets | Yes | We evaluate our approach on three tasks: the Ultimatum Game, 2-player Matrix Games, and the DEALORNODEAL negotiation task (Lewis et al., 2017). We use a version of the DEALORNODEAL environment used in Kwon et al. (2021). |
| Dataset Splits | No | The paper provides training details and mentions evaluation on a test set, but does not explicitly specify a validation dataset split or a separate validation process for hyperparameter tuning. |
| Hardware Specification | No | The paper mentions using "a large language model (LLM) such as GPT-3" and discusses computational resources in general (e.g., "the rise of compute and data"), but does not provide specific hardware details like GPU models, CPU types, or memory specifications used for experiments. |
| Software Dependencies | No | The paper mentions using "the text-davinci-002 GPT-3 model", "Stable Baselines3 implementation", "DQN", "REINFORCE (Williams, 1992)", "GRUs (Chung et al., 2014)", and "Adam optimizer", but it does not specify version numbers for any of these software components, libraries, or frameworks. |
| Experiment Setup | Yes | We train DQN agents using the Stable Baselines3 implementation for 1e4 timesteps with a learning rate of 1e-4 across 3 seeds. We instantiate our policy as an MLP with the default parameters used in Stable Baselines3. We use a learning rate of 1.0 and batch size of 16. We then fine-tune these agents using RL where they optimize the expected reward of each dialogue act using REINFORCE (Williams, 1992). Agents are trained on 250 contexts for 1 epoch with a learning rate of 0.1. We instantiate our policy with four GRUs (Chung et al., 2014). |
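For context, the sketch below mirrors the DQN portion of the reported setup (Stable Baselines3, 1e4 timesteps, learning rate 1e-4, default MLP policy, 3 seeds). It is a minimal illustration, not the authors' released code: the `UltimatumGame-v0` environment ID and `make_env` helper are hypothetical placeholders, and the LLM-based reward signal described in the paper is assumed to be provided inside the environment rather than shown here.

```python
# Minimal sketch of the reported DQN configuration, assuming a hypothetical
# Ultimatum Game environment whose step() already returns the LLM-derived reward.
import gymnasium as gym
from stable_baselines3 import DQN


def make_env() -> gym.Env:
    # Placeholder: the actual environment comes from the authors' released code.
    return gym.make("UltimatumGame-v0")  # hypothetical registered ID


for seed in (0, 1, 2):  # 3 seeds, as reported
    env = make_env()
    model = DQN(
        "MlpPolicy",          # default MLP policy parameters in Stable Baselines3
        env,
        learning_rate=1e-4,   # reported learning rate
        seed=seed,
        verbose=0,
    )
    model.learn(total_timesteps=10_000)  # 1e4 timesteps
    model.save(f"dqn_ultimatum_seed{seed}")
```

The remaining hyperparameters in the quoted response (learning rate 1.0 with batch size 16, REINFORCE fine-tuning over 250 contexts, four GRUs) refer to the DEALORNODEAL training pipeline described in the paper and are not reproduced in this sketch.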