Multi-Action Dialog Policy Learning from Logged User Feedback
Authors: Shuo Zhang, Junzhou Zhao, Pinghui Wang, Tianxiang Wang, Zi Liang, Jing Tao, Yi Huang, Junlan Feng
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on the benchmark MultiWOZ dataset, and empirical results show that our BanditMatch achieves state-of-the-art performance in task success rate while generating much more concise and informative responses (9%–23% increase in Inform F1). |
| Researcher Affiliation | Collaboration | 1 MOE KLINNS Lab, Xi'an Jiaotong University, Xi'an 710049, P. R. China; 2 JIUTIAN Team, China Mobile Research |
| Pseudocode | No | The paper does not contain a dedicated section or figure labeled 'Pseudocode' or 'Algorithm' with structured steps. |
| Open Source Code | Yes | The source code and the appendix of this paper can be obtained from https://github.com/ShuoZhangXJTU/BanditMatch. |
| Open Datasets | Yes | We use MultiWOZ 2.0 (Budzianowski et al. 2018), a large-scale multi-domain benchmark dataset, to validate the effectiveness of our method and apply the agenda-based user simulator as the interaction environment to evaluate the generalization ability. |
| Dataset Splits | No | The paper mentions using the MultiWOZ 2.0 dataset and splitting its training set for specific purposes (creating labeled data and bandit data), but it does not explicitly provide the train/validation/test splits or their percentages/counts for the overall experimental setup. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instances) used for running its experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as Python versions, deep learning frameworks (e.g., PyTorch, TensorFlow) with their versions, or other libraries. |
| Experiment Setup | Yes | Our proposed BanditMatch algorithm optimizes end-to-end with the overall loss as the weighted sum of four losses: L = L_L + λ_p·L_P + λ_b·L_B + λ_k·L_K, where λ_p, λ_b, and λ_k are hyper-parameters that balance each term's intensity. We find that simply setting them to 1 can lead to proper performance through experimental evaluation. (A minimal code sketch of this objective follows the table.) |
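
For concreteness, below is a minimal PyTorch sketch of the weighted-sum objective quoted in the Experiment Setup row. It is not the authors' implementation: the function and argument names are hypothetical, and the four component losses are assumed to be precomputed scalar tensors whose actual definitions live in the authors' repository.

```python
# Minimal sketch of the overall objective L = L_L + λ_p·L_P + λ_b·L_B + λ_k·L_K.
# Placeholder names only; the paper defines each component loss in its own notation.
import torch

def overall_loss(loss_l: torch.Tensor,   # L_L
                 loss_p: torch.Tensor,   # L_P
                 loss_b: torch.Tensor,   # L_B
                 loss_k: torch.Tensor,   # L_K
                 lambda_p: float = 1.0,
                 lambda_b: float = 1.0,
                 lambda_k: float = 1.0) -> torch.Tensor:
    # The paper reports that simply setting all three weights to 1
    # already yields proper performance.
    return loss_l + lambda_p * loss_p + lambda_b * loss_b + lambda_k * loss_k
```

Because the combined loss is a single tensor, calling `.backward()` on its return value propagates gradients through all four terms at once, which matches the end-to-end optimization the paper describes.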