Multi-Action Dialog Policy Learning from Logged User Feedback

Authors: Shuo Zhang, Junzhou Zhao, Pinghui Wang, Tianxiang Wang, Zi Liang, Jing Tao, Yi Huang, Junlan Feng

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on the benchmark MultiWOZ dataset, and empirical results show that our BanditMatch achieves state-of-the-art performance in task success rate while generating much more concise and informative responses (9%-23% increase in Inform F1).
Researcher Affiliation | Collaboration | 1 MOE KLINNS Lab, Xi'an Jiaotong University, Xi'an 710049, P. R. China; 2 JIUTIAN Team, China Mobile Research
Pseudocode | No | The paper does not contain a dedicated section or figure labeled 'Pseudocode' or 'Algorithm' with structured steps.
Open Source Code | Yes | The source code and the appendix of this paper can be obtained from https://github.com/ShuoZhangXJTU/BanditMatch.
Open Datasets | Yes | We use MultiWOZ 2.0 (Budzianowski et al. 2018), a large-scale multi-domain benchmark dataset, to validate the effectiveness of our method, and apply the agenda-based user simulator as the interaction environment to evaluate the generalization ability.
Dataset Splits | No | The paper mentions using the MultiWOZ 2.0 dataset and splitting its training set for specific purposes (creating labeled data and bandit data), but it does not explicitly provide the overall train/validation/test splits or their percentages/counts needed for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or cloud instances) used for running its experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as Python versions, deep learning frameworks (e.g., PyTorch, TensorFlow) with their versions, or other libraries.
Experiment Setup | Yes | Our proposed BanditMatch algorithm optimizes end-to-end with the overall loss as the weighted sum of four losses: L = L_L + λ_p L_P + λ_b L_B + λ_k L_K, where λ_p, λ_b, and λ_k are hyper-parameters that balance each term's intensity. We find through experimental evaluation that simply setting them to 1 can lead to proper performance.
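
The quoted setup specifies only how the four loss terms are combined; the terms themselves are defined in the paper. A minimal PyTorch-style sketch of such a weighted-sum combination follows, assuming the four terms have already been computed as scalar tensors; the argument names are hypothetical placeholders matching the subscripts L, P, B, K in the equation, and the λ weights default to 1 as the paper reports.

```python
import torch

def combined_loss(loss_l: torch.Tensor,
                  loss_p: torch.Tensor,
                  loss_b: torch.Tensor,
                  loss_k: torch.Tensor,
                  lambda_p: float = 1.0,
                  lambda_b: float = 1.0,
                  lambda_k: float = 1.0) -> torch.Tensor:
    """Weighted sum L = L_L + λ_p·L_P + λ_b·L_B + λ_k·L_K.

    The paper reports that simply setting all three λ weights
    to 1 already yields proper performance, so those are the
    defaults here.
    """
    return loss_l + lambda_p * loss_p + lambda_b * loss_b + lambda_k * loss_k

# Hypothetical usage inside a training step, with the four terms
# already computed upstream (names are placeholders, not the
# paper's identifiers):
#   loss = combined_loss(l_l, l_p, l_b, l_k)
#   loss.backward()
#   optimizer.step()
```

Because the single combined scalar is backpropagated, all four objectives are optimized jointly end-to-end, which matches the setup the paper describes.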