Dynamic Reward-Based Dueling Deep Dyna-Q: Robust Policy Learning in Noisy Environments
Authors: Yangyang Zhao, Zhenyu Wang, Kai Yin, Rui Zhang, Zhenhua Huang, Pei Wang
AAAI 2020, pp. 9676-9684 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments using simulation and human evaluation show that the DR-D3Q significantly improves the performance of policy learning tasks in noisy environments. |
| Researcher Affiliation | Academia | Yangyang Zhao, Zhenyu Wang, Kai Yin, Rui Zhang, Zhenhua Huang, Pei Wang School of Software, South China University of Technology {msyyz, sekaiyin, sewangpei}@mail.scut.edu.cn, wangzy@scut.edu.cn, zhang1rui4@outlook.com, zhhuangscut@gmail.com |
| Pseudocode | Yes | A more detailed procedure is shown in Algorithm 1 (Dynamic Dialogue Segmentation). |
| Open Source Code | Yes | Source code is at https://github.com/zhaoyangyangHH/DRD3Q. |
| Open Datasets | No | In the experiment, we use a movie-ticket booking dataset which contains raw conversational data collected via Amazon Mechanical Turk. The dataset has been manually labeled based on a schema defined by domain experts, as shown in Table 1, consisting of 11 intents and 16 slots. No specific link, DOI, or citation to a publicly available version of this dataset is provided. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or sample counts). It mentions buffer sizes for reinforcement learning experience replay, but these are not the same as dataset splits for supervised learning. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions using MLPs and specific activation functions but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For all the models (DQN, DDQ, and DR-D3Q) and their variants, we use MLPs to parameterize the value networks Q(·) with one hidden layer of size 80 and ReLU activation. [...] We set the discount factor γ = 0.9. The buffer sizes of B^u and B^s are set to 2000 and 2000 × K (K planning steps), respectively. The batch size is 16, and the learning rate is 0.001. We applied gradient clipping on all the model parameters with a maximum norm of 1 to prevent gradient explosion. The target network is updated at the beginning of each training episode. (A hedged configuration sketch follows the table.) |
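
The sketch below is a minimal illustration, not the authors' released code, of the reported hyperparameters (one hidden layer of size 80 with ReLU, γ = 0.9, batch size 16, learning rate 0.001, gradient clipping at max norm 1, target network copied at episode start). It assumes PyTorch, an Adam optimizer, placeholder state/action dimensions (`STATE_DIM`, `ACTION_DIM`), and a plain DQN-style update in place of the full dueling/Dyna-Q machinery and world model described in the paper.

```python
# Minimal sketch (not the authors' implementation) of the value-network setup
# described in the "Experiment Setup" row. Dimensions and optimizer are assumptions.
import torch
import torch.nn as nn
import torch.optim as optim

STATE_DIM, ACTION_DIM = 270, 30  # hypothetical sizes; the paper does not report them


class QNetwork(nn.Module):
    """MLP value network Q(s) -> action values: one hidden layer of size 80, ReLU."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


q_net = QNetwork(STATE_DIM, ACTION_DIM)
target_net = QNetwork(STATE_DIM, ACTION_DIM)
target_net.load_state_dict(q_net.state_dict())  # target network synced at episode start

optimizer = optim.Adam(q_net.parameters(), lr=0.001)  # optimizer choice is an assumption
GAMMA, BATCH_SIZE = 0.9, 16


def update(batch):
    """One DQN-style update using the reported discount factor and gradient clipping.

    `batch` is assumed to be (states, actions, rewards, next_states, dones) tensors,
    with `actions` as int64 indices and `dones` as 0/1 floats, batch size 16.
    """
    states, actions, rewards, next_states, dones = batch
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * (1 - dones) * next_q
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping on all model parameters with a maximum norm of 1.
    nn.utils.clip_grad_norm_(q_net.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```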