Efficient Dialog Policy Learning by Reasoning with Contextual Knowledge

Authors: Haodi Zhang, Zhichao Zeng, Keting Lu, Kaishun Wu, Shiqi Zhang (pp. 11667–11675)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have extensively conducted experiments using a realistic dialog platform, PyDial (Ultes et al. 2017). Compared with baselines from the literature and ablations of our own approach, we observe significant improvements in dialog learning efficiency and policy quality.
Researcher Affiliation | Collaboration | Haodi Zhang^1, Zhichao Zeng^1, Keting Lu^2, Kaishun Wu^1, Shiqi Zhang^3. ^1 Computer Science and Software Engineering, Shenzhen University; ^2 Baidu, Inc.; ^3 Computer Science, SUNY Binghamton.
Pseudocode | Yes | Algorithm 1: Dialog policy learning by reasoning with contextual knowledge
Open Source Code | Yes | More details are available in the supplementary appendix and code (https://github.com/ResearchGroupHdZhang/DPLAAAI22).
Open Datasets | No | In the experiments, we use a revised version of a hotel booking domain in PyDial (Casanueva et al. 2017), where the main slots, i.e., internal factors, are the same as I in the previous section. Besides the revised evaluation criteria, we also modified the database to evaluate the reasoning capabilities of our developed approach. We enlarged the original database so that user goals would not be frequently rejected due to the lack of diverse data entities. More details are available in the supplementary appendix and code. (The paper mentions a modified database and 'historical data' for the MLN without providing explicit access to either. While PyDial is cited, the modified dataset actually used is neither released nor linked.)
Dataset Splits | No | The environment parameters are selected via a validation set. In each run, we use 40 batches, each containing 100 dialogs. After training with each batch, the policy is evaluated using 100 dialogs. (While a validation set is mentioned, its size or proportion is never specified. A minimal sketch of this batch-train/evaluate loop appears below the table.)
Hardware Specification | No | No specific hardware details (e.g., CPU or GPU models, memory, or cloud instances) are mentioned in the paper's description of the experimental setup.
Software Dependencies | No | In the experiment, we used a revised version of a hotel booking domain in PyDial (Casanueva et al. 2017)... For internal knowledge, we utilize Alchemy (Kok et al. 2005) to train an MLN... For external knowledge, we use Clingo (Gebser et al. 2014) to ground and solve our ASP logic programs. (No version numbers are given for PyDial, Alchemy, Clingo, or any other dependency. A sketch of the Clingo grounding-and-solving step appears below the table.)
Experiment Setup | No | In the experiment, we used several popular dialog strategy algorithms as baselines, including A2C (Fatemi et al. 2016), DQN, ACER (Weisz et al. 2018), and BBQN (Lipton et al. 2018). The environment parameters are selected via a validation set. In each run, we use 40 batches, each containing 100 dialogs. After training with each batch, the policy is evaluated using 100 dialogs. (Specific hyperparameter values such as learning rates, optimizer settings, or epoch counts for the DRL models are not provided.)
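
The "Dataset Splits" and "Experiment Setup" rows describe the paper's evaluation protocol: 40 training batches of 100 dialogs each, with the policy evaluated on 100 dialogs after every batch. The sketch below illustrates that loop only; `DialogEnv`, `Policy`, and their methods are hypothetical stand-ins, not the authors' PyDial code.

```python
import random


class DialogEnv:
    """Stand-in for a PyDial-style simulated-user environment (hypothetical)."""

    def run_dialog(self, policy, train=True):
        # A real environment would run a full user/system exchange and
        # return the dialog outcome; here we return a random success flag.
        return random.random() < 0.5


class Policy:
    """Placeholder dialog policy; a real one would be a DRL agent."""

    def update(self, outcome):
        pass  # an RL update (e.g., a DQN gradient step) would go here


def train_and_evaluate(env, policy, n_batches=40, batch_size=100, n_eval=100):
    """Train in batches of dialogs, evaluating the policy after each batch."""
    success_curve = []
    for _ in range(n_batches):
        for _ in range(batch_size):            # 100 training dialogs per batch
            policy.update(env.run_dialog(policy, train=True))
        wins = sum(env.run_dialog(policy, train=False) for _ in range(n_eval))
        success_curve.append(wins / n_eval)    # evaluate on 100 dialogs
    return success_curve


if __name__ == "__main__":
    curve = train_and_evaluate(DialogEnv(), Policy())
    print(f"success rate after final batch: {curve[-1]:.2f}")
```

Even this skeleton makes the reproducibility gap concrete: without the authors' environment parameters and policy hyperparameters, every placeholder above must be guessed.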
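The "Software Dependencies" row notes that Clingo is used to ground and solve the ASP programs encoding external knowledge. Below is a minimal sketch of that step using Clingo's Python API; the rules are invented hotel-domain facts for illustration, not the authors' actual logic program.

```python
import clingo

# Hypothetical external knowledge: a hotel near the beach is of interest
# to a user whose stated purpose is a vacation.
PROGRAM = """
interested(hotel_a) :- near_beach(hotel_a), purpose(vacation).
near_beach(hotel_a).
purpose(vacation).
"""


def solve(program):
    ctl = clingo.Control(["0"])        # "0" = enumerate all answer sets
    ctl.add("base", [], program)       # register the program as part "base"
    ctl.ground([("base", [])])         # ground it
    answer_sets = []
    with ctl.solve(yield_=True) as handle:
        for model in handle:
            answer_sets.append([str(atom) for atom in model.symbols(shown=True)])
    return answer_sets


if __name__ == "__main__":
    for answer_set in solve(PROGRAM):
        print(answer_set)
```

Because the paper's ASP programs and the Alchemy-trained MLN are only described in the supplementary material and repository, reproducing the reasoning component requires consulting those sources; the version of Clingo used is not stated.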