Reward Design with Language Models

Authors: Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DEALORNODEAL negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning.
Researcher Affiliation | Collaboration | Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh; Stanford University, DeepMind
Pseudocode | No | The paper describes the framework and methods in narrative text and figures, but does not include any explicitly labeled pseudocode blocks or algorithm listings.
Open Source Code | Yes | Code and prompts can be found here.
Open Datasets | Yes | We evaluate our approach on three tasks: the Ultimatum Game, 2-player Matrix Games, and the DEALORNODEAL negotiation task (Lewis et al., 2017). We use a version of the DEALORNODEAL environment used in Kwon et al. (2021).
Dataset Splits | No | The paper provides training details and mentions evaluation on a test set, but does not explicitly specify a validation dataset split or a separate validation process for hyperparameter tuning.
Hardware Specification | No | The paper mentions using "a large language model (LLM) such as GPT-3" and discusses computational resources in general (e.g., "the rise of compute and data"), but does not provide specific hardware details like GPU models, CPU types, or memory specifications used for experiments.
Software Dependencies | No | The paper mentions using "the text-davinci-002 GPT-3 model", "Stable Baselines3 implementation", "DQN", "REINFORCE (Williams, 1992)", "GRUs (Chung et al., 2014)", and "Adam optimizer", but it does not specify version numbers for any of these software components, libraries, or frameworks.
Experiment Setup | Yes | We train DQN agents using the Stable Baselines3 implementation for 1e4 timesteps with a learning rate of 1e-4 across 3 seeds. We instantiate our policy as a MLP with the default parameters used in Stable Baselines3. We use a learning rate of 1.0 and batch size of 16. We then fine-tune these agents using RL where they optimize the expected reward of each dialogue act using REINFORCE (Williams, 1992). Agents are trained on 250 contexts for 1 epoch with a learning rate of 0.1. We instantiate our policy with four GRUs (Chung et al., 2014).
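For context, the quoted DQN configuration maps onto the Stable Baselines3 API roughly as follows. This is a minimal sketch only, assuming Stable Baselines3 2.x with Gymnasium; the CartPole environment is a placeholder, since the paper's Ultimatum Game / matrix game environments and the LLM-derived proxy reward are not reproduced here, and the REINFORCE fine-tuning stage for DEALORNODEAL is omitted.

```python
# Sketch of the reported DQN training setup (assumed: Stable Baselines3 >= 2.0).
import gymnasium as gym
from stable_baselines3 import DQN

for seed in (0, 1, 2):  # "across 3 seeds"
    env = gym.make("CartPole-v1")  # placeholder for the paper's task environment
    model = DQN(
        policy="MlpPolicy",   # MLP policy with Stable Baselines3 default parameters
        env=env,
        learning_rate=1e-4,   # learning rate quoted above
        seed=seed,
        verbose=0,
    )
    model.learn(total_timesteps=int(1e4))  # 1e4 timesteps, as quoted
    model.save(f"dqn_seed_{seed}")
```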