Goal-Oriented Dialogue Policy Learning from Failures
Authors: Keting Lu, Shiqi Zhang, Xiaoping Chen (pp. 2596-2603)
AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments using a realistic user simulator show that our HER methods perform better than existing experience replay methods (as applied to deep Q-networks) in learning rate. |
| Researcher Affiliation | Academia | Keting Lu (1), Shiqi Zhang (2), Xiaoping Chen (1); (1) School of Computer Science, University of Science and Technology of China; (2) Department of Computer Science, SUNY Binghamton |
| Pseudocode | Yes | Algorithm 1 Dialogue Segmentation |
| Open Source Code | No | No explicit statement or link providing access to the open-source code for the described methodology was found. |
| Open Datasets | Yes | Our complex HER methods were evaluated using a dialogue simulation environment, where a dialogue agent communicates with simulated users on movie-booking tasks (Li et al. 2016; 2017). |
| Dataset Splits | No | The paper reports the number of dialogue episodes and runs but does not specify explicit training, validation, and test splits, since dialogues are generated on the fly by a simulation environment. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running experiments. |
| Software Dependencies | No | The paper mentions the use of Deep Q-Networks (DQNs) but does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | The size of the experience pool is 100k, and the experience replay strategy is uniform sampling. The value of α in Equation 2 is 1.0, and an ε-greedy policy is used, where ε is initialized to 0.3 and decayed to 0.01 during training. Each experiment includes 1000 epochs, and each epoch includes 100 dialogue episodes. At the end of each epoch, we update the weights of the target network using the current behavior network; this update executes once per epoch. |
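
The Experiment Setup row maps onto a small training-harness configuration. The sketch below is a minimal illustration under stated assumptions: only the numeric values (pool size, α, ε schedule endpoints, epoch and episode counts, once-per-epoch target update) come from the row above, while the linear shape of the ε decay, the batch size, and all identifiers (`sample_uniform`, `epsilon_at`, the placeholder loop body) are assumptions, not the authors' code.

```python
import random
from collections import deque

# Minimal sketch of the reported setup; the dialogue agent, user simulator,
# and DQN update are out of scope and only noted in comments.

REPLAY_CAPACITY = 100_000   # "size of the experience pool is 100k"
ALPHA = 1.0                 # alpha in the paper's Equation 2
EPSILON_START = 0.3         # epsilon-greedy exploration, initialized to 0.3
EPSILON_END = 0.01          # decayed to 0.01 during training
NUM_EPOCHS = 1000           # each experiment includes 1000 epochs
EPISODES_PER_EPOCH = 100    # each epoch includes 100 dialogue episodes

replay_pool = deque(maxlen=REPLAY_CAPACITY)  # experience pool

def sample_uniform(pool, batch_size):
    """Uniform-sampling experience replay, as stated in the setup."""
    return random.sample(pool, min(batch_size, len(pool)))

def epsilon_at(epoch):
    """Exploration schedule; a linear decay is assumed (the paper gives only the endpoints)."""
    frac = epoch / max(NUM_EPOCHS - 1, 1)
    return EPSILON_START + frac * (EPSILON_END - EPSILON_START)

for epoch in range(NUM_EPOCHS):
    epsilon = epsilon_at(epoch)
    for _ in range(EPISODES_PER_EPOCH):
        # Here the agent would run one dialogue episode against the simulator
        # with epsilon-greedy action selection, append the transitions to
        # replay_pool, and perform DQN updates on uniformly sampled batches.
        pass
    # At the end of each epoch, copy the behavior network's weights into the
    # target network (once per epoch, per the reported setup).
```

The target-network synchronization is left as a comment because the paper, as summarized here, does not name a specific framework or weight-copy API.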