On the Effectiveness of Offline RL for Dialogue Response Generation

Authors: Paloma Sodhi, Felix Wu, Ethan R. Elenberg, Kilian Q. Weinberger, Ryan McDonald

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a comprehensive evaluation across multiple datasets, models, and metrics.
Researcher Affiliation | Collaboration | ASAPP, New York, United States; Cornell University, New York, United States.
Pseudocode | No | The paper describes methods in text and equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/asappresearch/dialogue-offline-rl
Open Datasets | Yes | MultiWOZ 2.2 (Zang et al., 2020) is a widely used dataset created to evaluate the performance of dialogue systems in multi-domain settings. Action Based Conversations Dataset (ABCD) (Chen et al., 2021a) contains customer-agent conversations... Taskmaster-3 (Byrne et al., 2019) contains 23,789 conversations between users and a system on movie ticketing. (A loading sketch follows the table.)
Dataset Splits | No | The paper mentions using 'validation loss' to pick checkpoints, but does not explicitly provide the train/validation/test splits (e.g., percentages or counts) for the primary datasets (MultiWOZ 2.2, ABCD, Taskmaster-3). (A seeded-split sketch follows the table.)
Hardware Specification | Yes | Training is done on an AWS EC2 g5.12xlarge instance which has 4 Nvidia A10G GPUs.
Software Dependencies | No | The paper mentions using the 'huggingface transformers library' (Wolf et al., 2019) and 'trlx' for implementation, but does not specify their version numbers. (A version-recording sketch follows the table.)
Experiment Setup | Yes | More hyperparameter details are in Tables 7, 8, and 9. These tables specify Model, Batch size, Block size, Max number of epochs, Optimizer, Learning rate, Adam (β1, β2), Adam ϵ, Learning rate scheduler, CQL Scale, τ, γ, PPO value coefficient, and PPO KL initial coefficient. (A config skeleton follows the table.)
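
For the Open Datasets row, the sketch below shows one way to load MultiWOZ 2.2 through the Hugging Face datasets hub. The hub ID "multi_woz_v22" and the field names are assumptions about the public release, not the authors' own loading code; ABCD and Taskmaster-3 are distributed through their own GitHub repositories.

# Minimal sketch, assuming the public Hugging Face hub release of
# MultiWOZ 2.2 under the ID "multi_woz_v22"; not the authors' pipeline.
from datasets import load_dataset

multiwoz = load_dataset("multi_woz_v22")
print(multiwoz)  # lists the available splits and their example counts

# Each example is a full dialogue; "turns" holds the alternating
# user/system utterances (field names per the hub release).
dialogue = multiwoz["train"][0]
print(dialogue["turns"]["utterance"][:2])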
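
For the Dataset Splits row, one common mitigation when a paper does not report its splits is to derive them deterministically from a fixed seed and log the resulting counts. The sketch below uses the standard train_test_split method from the datasets library; the 90/5/5 ratio is an illustrative assumption, not a value from the paper.

# Illustrative only: carve a seeded 90/5/5 train/validation/test split
# and record the counts, since the paper does not report its splits.
from datasets import load_dataset

raw = load_dataset("multi_woz_v22", split="train")
tmp = raw.train_test_split(test_size=0.10, seed=42)               # 90% train
val_test = tmp["test"].train_test_split(test_size=0.50, seed=42)  # 5% / 5%

splits = {"train": tmp["train"], "validation": val_test["train"], "test": val_test["test"]}
for name, ds in splits.items():
    print(name, len(ds))  # report these counts alongside the seed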
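
For the Software Dependencies row, the missing version numbers are straightforward to recover on the machine that ran the experiments: record the installed versions of the two named libraries. importlib.metadata is in the Python standard library (3.8+), so this snippet runs as-is.

# Record the installed versions of the libraries the paper names but
# does not pin; nothing here is specific to the paper's environment.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("transformers", "trlx"):
    try:
        print(f"{pkg}=={version(pkg)}")  # e.g. for a pinned requirements file
    except PackageNotFoundError:
        print(f"{pkg} is not installed")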
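
For the Experiment Setup row, the skeleton below gathers the hyperparameters named in Tables 7, 8, and 9 into a single config object. Every default value is a placeholder for illustration, not a setting from the paper; only the field names are grounded in the text.

# Placeholder config covering the hyperparameters reported in Tables 7-9.
# None of the default values below are taken from the paper.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ExperimentConfig:
    model: str = "MODEL_NAME"        # placeholder; see Table 7
    batch_size: int = 8              # placeholder
    block_size: int = 512            # placeholder
    max_epochs: int = 10             # placeholder
    optimizer: str = "adam"          # placeholder
    learning_rate: float = 1e-5      # placeholder
    adam_betas: Tuple[float, float] = (0.9, 0.999)  # Adam (β1, β2); placeholder
    adam_eps: float = 1e-8           # Adam ϵ; placeholder
    lr_scheduler: str = "linear"     # placeholder
    cql_scale: float = 1.0           # CQL loss weight; placeholder
    tau: float = 0.7                 # τ; placeholder
    gamma: float = 0.99              # discount γ; placeholder
    ppo_value_coef: float = 1.0      # placeholder
    ppo_kl_init_coef: float = 0.2    # placeholder

print(ExperimentConfig())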