Efficient Dialog Policy Learning by Reasoning with Contextual Knowledge
Authors: Haodi Zhang, Zhichao Zeng, Keting Lu, Kaishun Wu, Shiqi Zhang (pp. 11667–11675)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have extensively conducted experiments using a realistic dialog platform PyDial (Ultes et al. 2017). Compared with baselines from the literature and ablations of our own approach, we observe significant improvements in dialog learning efficiency and policy quality. |
| Researcher Affiliation | Collaboration | Haodi Zhang1, Zhichao Zeng1, Keting Lu2, Kaishun Wu1, Shiqi Zhang3 1 Computer Science and Software Engineering, Shenzhen University 2 Baidu, Inc. 3 Computer Science, SUNY Binghamton |
| Pseudocode | Yes | Algorithm 1: Dialog policy learning by reasoning with contextual knowledge |
| Open Source Code | Yes | More details are available in the supplementary appendix and code1. 1https://github.com/ResearchGroupHdZhang/DPLAAAI22 |
| Open Datasets | No | In the experiments, we use a revised version of a hotel booking domain in PyDial (Casanueva et al. 2017), where the main slots, i.e., internal factors, are the same with I in the previous section. Besides the revised evaluation criteria, we also modified the database to evaluate the reasoning capabilities of our developed approach. We enlarged the original database, so that the user goals would not be frequently rejected due to the lack of diverse data entities. More details are available in the supplementary appendix and code. (The paper mentions a modified database and 'historical data' for the MLN without providing explicit access to these specific datasets. While PyDial is cited, the *modified* dataset used is not directly made available or linked.) |
| Dataset Splits | No | The environment parameters are selected via a validation set. In each run, we use 40 batches and each of them contains 100 dialogs. After training with each batch, the policy is evaluated using 100 dialogs. (While a validation set is mentioned, its specific size or proportion within the dataset splits is not provided.) |
| Hardware Specification | No | No specific hardware details (e.g., CPU, GPU models, memory, or cloud instances) are mentioned in the paper regarding the experimental setup. |
| Software Dependencies | No | In the experiment, we used a revised version of a hotel booking domain in PyDial (Casanueva et al. 2017)... For internal knowledge, we utilize Alchemy (Kok et al. 2005) to train a MLN... For external knowledge, we use Clingo (Gebser et al. 2014) to ground and solve our ASP logic programs. (No specific version numbers are provided for PyDial, Alchemy, Clingo, or any other software dependencies.) |
| Experiment Setup | No | In the experiment, we used several popular dialog strategy algorithms as baselines, including A2C (Fatemi et al. 2016), DQN, ACER (Weisz et al. 2018) and BBQN (Lipton et al. 2018). The environment parameters are selected via a validation set. In each run, we use 40 batches and each of them contains 100 dialogs. After training with each batch, the policy is evaluated using 100 dialogs. (Specific hyperparameter values like learning rate, optimizer settings, or epoch counts for the DRL models are not provided.) |
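The training and evaluation schedule the paper does report (40 batches of 100 training dialogs each, with the policy evaluated on 100 dialogs after every batch) can be sketched as below. This is an illustrative reconstruction only; `train_on_dialogs` and `evaluate` are hypothetical placeholders, not functions from the authors' codebase:

```python
# Sketch of the reported schedule: 40 batches x 100 training dialogs,
# evaluating the policy on 100 dialogs after each batch.
# All names here are hypothetical placeholders, not the paper's code.

NUM_BATCHES = 40        # batches per run, as stated in the paper
DIALOGS_PER_BATCH = 100  # training dialogs per batch
EVAL_DIALOGS = 100       # evaluation dialogs after each batch


def run_schedule(train_on_dialogs, evaluate):
    """Run the batch/evaluation loop; return one metric value per batch."""
    results = []
    for _ in range(NUM_BATCHES):
        train_on_dialogs(DIALOGS_PER_BATCH)       # one training batch
        results.append(evaluate(EVAL_DIALOGS))    # evaluation pass
    return results


if __name__ == "__main__":
    # Stub callables standing in for a real dialog policy and simulator.
    trained_batches = []
    rates = run_schedule(
        trained_batches.append,
        lambda n: len(trained_batches) / NUM_BATCHES,
    )
    print(len(rates))  # 40 evaluation points, one per batch
```

Note that the per-batch hyperparameters of the underlying DRL learners (learning rate, optimizer, epochs) are not given in the paper, so they do not appear in this sketch either.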