On the Effectiveness of Offline RL for Dialogue Response Generation
Authors: Paloma Sodhi, Felix Wu, Ethan R. Elenberg, Kilian Q. Weinberger, Ryan McDonald
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comprehensive evaluation across multiple datasets, models, and metrics. |
| Researcher Affiliation | Collaboration | ASAPP, New York, United States; Cornell University, New York, United States. |
| Pseudocode | No | The paper describes methods in text and equations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/asappresearch/dialogue-offline-rl |
| Open Datasets | Yes | MultiWOZ 2.2 (Zang et al., 2020) is a widely used dataset created to evaluate performance of dialogue systems in multi-domain settings. Action-Based Conversations Dataset (ABCD) (Chen et al., 2021a) contains customer-agent conversations... Taskmaster-3 (Byrne et al., 2019): contains 23,789 conversations between users and a system on movie ticketing. |
| Dataset Splits | No | The paper mentions using 'validation loss' to pick checkpoints, but does not explicitly provide the train/validation/test splits (e.g., percentages or counts) for the primary datasets (MultiWOZ 2.2, ABCD, Taskmaster-3). |
| Hardware Specification | Yes | Training is done on an AWS EC2 g5.12xlarge instance, which has 4 NVIDIA A10G GPUs. |
| Software Dependencies | No | The paper mentions using 'huggingface transformers library' (Wolf et al., 2019) and 'trlx' for implementation, but does not specify their version numbers. |
| Experiment Setup | Yes | Hyperparameter details appear in Tables 7, 8, and 9, which specify Model, Batch size, Block size, Max number of epochs, Optimizer, Learning rate, Adam (β1, β2), Adam ϵ, Learning rate scheduler, CQL Scale, τ, γ, PPO value coefficient, and PPO KL initial coefficient (see the configuration sketch after this table). |
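
For anyone attempting a re-run, the sketch below gathers the hyperparameter fields tabulated in Tables 7-9 into a single Python configuration object. It is a minimal illustration only: the field names mirror the paper's tables, but every default value is a placeholder (not an author-reported setting), and the class itself is not part of the released codebase.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class OfflineRLTrainConfig:
    """Hypothetical container for the hyperparameters listed in Tables 7-9.

    Every default below is a placeholder for illustration, not a value
    reported in the paper.
    """
    model: str = "gpt2"                             # base LM checkpoint
    batch_size: int = 8
    block_size: int = 512                           # max token context per example
    max_epochs: int = 10
    optimizer: str = "adamw"
    learning_rate: float = 1e-5
    adam_betas: Tuple[float, float] = (0.9, 0.999)  # Adam (β1, β2)
    adam_eps: float = 1e-8                          # Adam ϵ
    lr_scheduler: str = "linear"
    cql_scale: float = 1.0                          # weight on the CQL regularizer
    tau: float = 0.7                                # τ from the paper's tables
    gamma: float = 0.99                             # discount factor γ
    ppo_value_coef: float = 0.5                     # PPO value-loss coefficient
    ppo_kl_init_coef: float = 0.2                   # initial KL penalty for PPO
```

Recording such a config alongside pinned versions of transformers and trlx would also address the missing software-dependency details noted above.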