Reward Design with Language Models

Authors: Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DEALORNODEAL negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning.
Researcher Affiliation | Collaboration | Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh; Stanford University, DeepMind
Pseudocode | No | The paper describes the framework and methods in narrative text and figures, but does not include any explicitly labeled pseudocode blocks or algorithm listings.
Open Source Code | Yes | Code and prompts can be found here.
Open Datasets | Yes | We evaluate our approach on three tasks: the Ultimatum Game, 2-player Matrix Games, and the DEALORNODEAL negotiation task (Lewis et al., 2017). We use a version of the DEALORNODEAL environment used in Kwon et al. (2021).
Dataset Splits | No | The paper provides training details and mentions evaluation on a test set, but does not explicitly specify a validation dataset split or a separate validation process for hyperparameter tuning.
Hardware Specification | No | The paper mentions using "a large language model (LLM) such as GPT-3" and discusses computational resources in general (e.g., "the rise of compute and data"), but does not provide specific hardware details like GPU models, CPU types, or memory specifications used for experiments.
Software Dependencies | No | The paper mentions using "the text-davinci-002 GPT-3 model", "Stable Baselines3 implementation", "DQN", "REINFORCE (Williams, 1992)", "GRUs (Chung et al., 2014)", and "Adam optimizer", but it does not specify version numbers for any of these software components, libraries, or frameworks.
Experiment Setup | Yes | We train DQN agents using the Stable Baselines3 implementation for 1e4 timesteps with a learning rate of 1e-4 across 3 seeds. We instantiate our policy as a MLP with the default parameters used in Stable Baselines3. We use a learning rate of 1.0 and batch size of 16. We then fine-tune these agents using RL where they optimize the expected reward of each dialogue act using REINFORCE (Williams, 1992). Agents are trained on 250 contexts for 1 epoch with a learning rate of 0.1. We instantiate our policy with four GRUs (Chung et al., 2014).
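For context, the quoted DQN configuration maps onto the Stable Baselines3 API roughly as follows. This is a minimal sketch only, assuming Stable Baselines3 2.x with Gymnasium; the CartPole environment is a placeholder, since the paper's Ultimatum Game / matrix game environments and the LLM-derived proxy reward are not reproduced here, and the REINFORCE fine-tuning stage for DEALORNODEAL is omitted.

```python
# Sketch of the reported DQN training setup (assumed: Stable Baselines3 >= 2.0).
import gymnasium as gym
from stable_baselines3 import DQN

for seed in (0, 1, 2):  # "across 3 seeds"
    env = gym.make("CartPole-v1")  # placeholder for the paper's task environment
    model = DQN(
        policy="MlpPolicy",   # MLP policy with Stable Baselines3 default parameters
        env=env,
        learning_rate=1e-4,   # learning rate quoted above
        seed=seed,
        verbose=0,
    )
    model.learn(total_timesteps=int(1e4))  # 1e4 timesteps, as quoted
    model.save(f"dqn_seed_{seed}")
```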