Dynamic Reward-Based Dueling Deep Dyna-Q: Robust Policy Learning in Noisy Environments

Authors: Yangyang Zhao, Zhenyu Wang, Kai Yin, Rui Zhang, Zhenhua Huang, Pei Wang

AAAI 2020, pp. 9676-9684 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments using simulation and human evaluation show that the DR-D3Q significantly improves the performance of policy learning tasks in noisy environments.
Researcher Affiliation | Academia | Yangyang Zhao, Zhenyu Wang, Kai Yin, Rui Zhang, Zhenhua Huang, Pei Wang; School of Software, South China University of Technology; {msyyz, sekaiyin, sewangpei}@mail.scut.edu.cn, wangzy@scut.edu.cn, zhang1rui4@outlook.com, zhhuangscut@gmail.com
Pseudocode | Yes | A more detailed procedure is shown in Algorithm 1 (Dynamic Dialogue Segmentation).
Open Source Code | Yes | Source code is available at https://github.com/zhaoyangyangHH/DRD3Q.
Open Datasets | No | In the experiment, we use a movie-ticket booking dataset which contains raw conversational data collected via Amazon Mechanical Turk. The dataset has been manually labeled based on a schema defined by domain experts, as shown in Table 1, consisting of 11 intents and 16 slots. No specific link, DOI, or citation to a publicly available version of this dataset is provided. (An illustrative sketch of such an annotation schema appears after this table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or sample counts). It mentions buffer sizes for reinforcement learning experience replay, but these are not the same as dataset splits for supervised learning.
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments, such as specific GPU or CPU models.
Software Dependencies | No | The paper mentions using MLPs and specific activation functions but does not provide version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | For all the models (DQN, DDQ, and DR-D3Q) and their variants, as well as the world models, we use MLPs to parameterize the value networks Q(·) with one hidden layer of size 80 and ReLU activation. [...] We set the discount factor γ = 0.9. The buffer sizes of Bu and Bs are set to 2000 and 2000 × K (K planning steps), respectively. The batch size is 16, and the learning rate is 0.001. We applied gradient clipping on all the model parameters with a maximum norm of 1 to prevent gradient explosion. The target network is updated at the beginning of each training episode.
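
The Experiment Setup row above pins down most of the value-network hyperparameters. Below is a minimal PyTorch sketch of that configuration, not the authors' code: the state/action dimensions, the optimizer type, and the loss function are assumptions, since the excerpt only specifies the hidden size, activation, discount factor, batch size, learning rate, and gradient clipping. The replay buffers Bu (size 2000, real experience) and Bs (2000 × K, simulated experience) and the per-episode target-network copy would sit in the surrounding training loop, and the dueling value/advantage decomposition named in the paper's title is not shown here.

```python
import torch
import torch.nn as nn

# Stated hyperparameters; STATE_DIM / NUM_ACTIONS are hypothetical placeholders.
STATE_DIM, NUM_ACTIONS = 213, 30
HIDDEN, GAMMA, BATCH_SIZE, LR, MAX_GRAD_NORM = 80, 0.9, 16, 1e-3, 1.0


class QNetwork(nn.Module):
    """MLP value network Q(.) with one hidden layer of size 80 and ReLU."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = HIDDEN):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


q_net = QNetwork(STATE_DIM, NUM_ACTIONS)
target_net = QNetwork(STATE_DIM, NUM_ACTIONS)
target_net.load_state_dict(q_net.state_dict())  # refreshed at the start of each episode
# Optimizer type is an assumption; the excerpt only gives the learning rate.
optimizer = torch.optim.Adam(q_net.parameters(), lr=LR)


def td_update(batch):
    """One Q-learning update on a minibatch of (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + GAMMA * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)  # loss choice is an assumption
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(q_net.parameters(), MAX_GRAD_NORM)  # max norm 1
    optimizer.step()
    return loss.item()


# Dummy minibatch, purely to show the expected tensor shapes.
s = torch.randn(BATCH_SIZE, STATE_DIM)
a = torch.randint(0, NUM_ACTIONS, (BATCH_SIZE,))
r = torch.randn(BATCH_SIZE)
s_next = torch.randn(BATCH_SIZE, STATE_DIM)
done = torch.zeros(BATCH_SIZE)
td_update((s, a, r, s_next, done))
```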
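
For the Open Datasets row, the paper describes a manually labeled movie-ticket booking corpus with 11 intents and 16 slots (its Table 1). The sketch below is purely illustrative of how such an intent/slot annotation schema might be represented; the intent and slot names are hypothetical placeholders, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Placeholder intents/slots for a movie-ticket booking domain; the paper's
# Table 1 defines 11 intents and 16 slots, which are not reproduced here.
EXAMPLE_INTENTS = ["request", "inform", "confirm_answer", "deny", "thanks", "closing"]
EXAMPLE_SLOTS = ["moviename", "theater", "starttime", "date", "city", "numberofpeople"]


@dataclass
class AnnotatedTurn:
    """One manually labeled turn: utterance text plus its dialogue act."""
    text: str
    intent: str
    slot_values: dict = field(default_factory=dict)


turn = AnnotatedTurn(
    text="I want two tickets for tonight in Seattle",
    intent="request",
    slot_values={"numberofpeople": "two", "date": "tonight", "city": "Seattle"},
)
```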