Dynamic Reward-Based Dueling Deep Dyna-Q: Robust Policy Learning in Noisy Environments

Authors: Yangyang Zhao, Zhenyu Wang, Kai Yin, Rui Zhang, Zhenhua Huang, Pei Wang

AAAI 2020, pp. 9676-9684 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments using simulation and human evaluation show that the DR-D3Q significantly improves the performance of policy learning tasks in noisy environments.
Researcher Affiliation | Academia | Yangyang Zhao, Zhenyu Wang, Kai Yin, Rui Zhang, Zhenhua Huang, Pei Wang; School of Software, South China University of Technology; {msyyz, sekaiyin, sewangpei}@mail.scut.edu.cn, wangzy@scut.edu.cn, zhang1rui4@outlook.com, zhhuangscut@gmail.com
Pseudocode | Yes | A more detailed procedure is shown in Algorithm 1 (Dynamic Dialogue Segmentation).
Open Source Code | Yes | Source code is available at https://github.com/zhaoyangyangHH/DRD3Q.
Open Datasets | No | In the experiment, we use a movie-ticket booking dataset which contains raw conversational data collected via Amazon Mechanical Turk. The dataset has been manually labeled based on a schema defined by domain experts, as shown in Table 1, consisting of 11 intents and 16 slots. No specific link, DOI, or citation to a publicly available version of this dataset is provided. (An illustrative sketch of such an annotation schema appears after this table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or sample counts). It mentions buffer sizes for reinforcement learning experience replay, but these are not the same as dataset splits for supervised learning.
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments, such as specific GPU or CPU models.
Software Dependencies | No | The paper mentions using MLPs and specific activation functions but does not provide version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | For all the models (DQN, DDQ, and DR-D3Q) and their variants, as well as the world models, we use MLPs to parameterize the value networks Q(·) with one hidden layer of size 80 and ReLU activation. [...] We set the discount factor γ = 0.9. The buffer sizes of Bu and Bs are set to 2000 and 2000 × K (K planning steps), respectively. The batch size is 16, and the learning rate is 0.001. We applied gradient clipping on all the model parameters with a maximum norm of 1 to prevent gradient explosion. The target network is updated at the beginning of each training episode.
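
The Experiment Setup row above pins down most of the value-network hyperparameters. Below is a minimal PyTorch sketch of that configuration, not the authors' code: the state/action dimensions, the optimizer type, and the loss function are assumptions, since the excerpt only specifies the hidden size, activation, discount factor, batch size, learning rate, and gradient clipping. The replay buffers Bu (size 2000, real experience) and Bs (2000 × K, simulated experience) and the per-episode target-network copy would sit in the surrounding training loop, and the dueling value/advantage decomposition named in the paper's title is not shown here.

```python
import torch
import torch.nn as nn

# Stated hyperparameters; STATE_DIM / NUM_ACTIONS are hypothetical placeholders.
STATE_DIM, NUM_ACTIONS = 213, 30
HIDDEN, GAMMA, BATCH_SIZE, LR, MAX_GRAD_NORM = 80, 0.9, 16, 1e-3, 1.0


class QNetwork(nn.Module):
    """MLP value network Q(.) with one hidden layer of size 80 and ReLU."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = HIDDEN):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


q_net = QNetwork(STATE_DIM, NUM_ACTIONS)
target_net = QNetwork(STATE_DIM, NUM_ACTIONS)
target_net.load_state_dict(q_net.state_dict())  # refreshed at the start of each episode
# Optimizer type is an assumption; the excerpt only gives the learning rate.
optimizer = torch.optim.Adam(q_net.parameters(), lr=LR)


def td_update(batch):
    """One Q-learning update on a minibatch of (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + GAMMA * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)  # loss choice is an assumption
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(q_net.parameters(), MAX_GRAD_NORM)  # max norm 1
    optimizer.step()
    return loss.item()


# Dummy minibatch, purely to show the expected tensor shapes.
s = torch.randn(BATCH_SIZE, STATE_DIM)
a = torch.randint(0, NUM_ACTIONS, (BATCH_SIZE,))
r = torch.randn(BATCH_SIZE)
s_next = torch.randn(BATCH_SIZE, STATE_DIM)
done = torch.zeros(BATCH_SIZE)
td_update((s, a, r, s_next, done))
```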
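
For the Open Datasets row, the paper describes a manually labeled movie-ticket booking corpus with 11 intents and 16 slots (its Table 1). The sketch below is purely illustrative of how such an intent/slot annotation schema might be represented; the intent and slot names are hypothetical placeholders, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Placeholder intents/slots for a movie-ticket booking domain; the paper's
# Table 1 defines 11 intents and 16 slots, which are not reproduced here.
EXAMPLE_INTENTS = ["request", "inform", "confirm_answer", "deny", "thanks", "closing"]
EXAMPLE_SLOTS = ["moviename", "theater", "starttime", "date", "city", "numberofpeople"]


@dataclass
class AnnotatedTurn:
    """One manually labeled turn: utterance text plus its dialogue act."""
    text: str
    intent: str
    slot_values: dict = field(default_factory=dict)


turn = AnnotatedTurn(
    text="I want two tickets for tonight in Seattle",
    intent="request",
    slot_values={"numberofpeople": "two", "date": "tonight", "city": "Seattle"},
)
```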