Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management
Authors: Dhawal Gupta, Yinlam Chow, Azamat Tulepbergenov, Mohammad Ghavamzadeh, Craig Boutilier
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our methods in open-domain dialogue to demonstrate their effectiveness with respect to the diversity of intent in generated utterances and overall DM performance. ... Finally, in Section 6, we evaluate our algorithms in open-domain dialogues against their ability to generate utterances with diverse intents and their overall DM performance. |
| Researcher Affiliation | Collaboration | Dhawal Gupta (University of Massachusetts, dgupta@cs.umass.edu); Yinlam Chow (Google Research, yinlamchow@google.com); Aza Tulepbergenov (Google Research, atulep@google.com); Mohammad Ghavamzadeh (Amazon, ghavamza@amazon.com); Craig Boutilier (Google Research, cboutilier@google.com) |
| Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks, though it describes algorithms and provides mathematical formulations. |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | We obtained these datasets from the Neural Chat datasets of the MIT Media Lab, which is available at the following link: https://affect.media.mit.edu/neural_chat/datasets. |
| Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test splits. It mentions using datasets for evaluation but does not detail the splits. |
| Hardware Specification | Yes | Training and evaluation were run on 8 GPU instances with 32GB of RAM and an NVIDIA Tesla P100 graphics card. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers. |
| Experiment Setup | Yes | Table 8 summarizes the hyper-parameters that were used for training the Q, V functions: Number of layers (Q, V): 3; Activation: ReLU; Hidden Size: 512; Epochs: 100; Max Unroll: 30; Batch Size: 256; Learning Rate: 2×10⁻³; Optimizer: Adam; τ (IQL): 0.9; Dropout (Ens Q, KLC): 0.5 |
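The Table 8 hyper-parameters above can be collected into a machine-readable config for anyone attempting a re-run. The sketch below is ours, not the authors' code: the dict keys are hypothetical names, and `expectile_loss` is only an illustration of how the IQL parameter τ = 0.9 is conventionally used in value regression.

```python
# Hedged sketch: Table 8 hyper-parameters as a config dict, plus an
# illustrative IQL expectile loss. Key names are our own invention.

TABLE8_CONFIG = {
    "num_layers_q_v": 3,       # Number of layers (Q, V)
    "activation": "ReLU",
    "hidden_size": 512,
    "epochs": 100,
    "max_unroll": 30,
    "batch_size": 256,
    "learning_rate": 2e-3,     # 2 x 10^-3
    "optimizer": "Adam",
    "tau_iql": 0.9,            # expectile parameter for IQL
    "dropout_ens_q_klc": 0.5,  # dropout for Ens-Q / KLC variants
}

def expectile_loss(diff: float, tau: float = TABLE8_CONFIG["tau_iql"]) -> float:
    """Asymmetric squared loss used in IQL-style value regression.

    Positive errors (diff = target - prediction > 0) are weighted by tau,
    negative errors by (1 - tau), so tau = 0.9 biases V toward an upper
    expectile of the Q-value distribution.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff
```

With τ = 0.9, an error of +1 costs 0.9 while an error of −1 costs only 0.1, which is the asymmetry that lets IQL approximate a maximum without querying out-of-distribution actions.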