Hill Climbing on Value Estimates for Search-control in Dyna
Authors: Yangchen Pan, Hengshuai Yao, Amir-massoud Farahmand, Martha White
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide an empirical demonstration on four classical domains that our algorithm, HC-Dyna, can obtain significant sample efficiency improvements. We conduct experiments showing improved performance in four benchmark domains. |
| Researcher Affiliation | Collaboration | Yangchen Pan (1), Hengshuai Yao (2), Amir-massoud Farahmand (3,4) and Martha White (1). (1) Department of Computing Science, University of Alberta, Canada; (2) Huawei HiSilicon, Canada; (3) Vector Institute, Canada; (4) Department of Computer Science, University of Toronto, Canada. pan6@ualberta.ca, hengshuai.yao@huawei.com, farahmand@vectorinstitute.ai, whitem@ualberta.ca |
| Pseudocode | Yes | Algorithm 1 HC-Dyna (a hedged sketch of its search-control step appears after this table) |
| Open Source Code | No | The paper does not provide any explicit statement about making its source code publicly available or provide a link to a code repository. |
| Open Datasets | Yes | In this section, we present empirical results on four classic domains: the Grid World (Figure 1(a)), Mountain Car, Cart Pole and Acrobot. We test on a simplified Tabular Grid World domain of size 20 × 20. |
| Dataset Splits | No | The paper discusses training and evaluating models within reinforcement learning environments, but it does not specify explicit train/validation/test dataset splits in the conventional sense of data partitioning for supervised learning. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to conduct the experiments (e.g., GPU/CPU models, memory specifications). |
| Software Dependencies | No | The paper mentions using DQN and DDPG algorithms and a two-layer NN, but does not specify any software libraries (e.g., TensorFlow, PyTorch) or their version numbers. |
| Experiment Setup | Yes | Input: budget k for the number of gradient ascent steps (e.g., k=100), stochasticity η for gradient ascent (e.g., η=0.1), ρ percentage of updates from SC queue (e.g., ρ=0.5), d the number of state variables, i.e. S ⊆ ℝ^d. The agents all use a two-layer NN, with ReLU activations and 32 nodes in each layer. We set the step size to α = 0.1/‖Σ̂_s g‖ across all results in this work. In all further experiments in this paper, we set ρ=0.5. The continuous-state setting uses NNs ... with a minibatch size of 32. For the tabular setting, the mini-batch size is 1. We further include multiple planning steps n, where for each real environment step, the agent does n updates with a mini-batch of size 32. (Sketches of these settings follow the table.) |
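
The Pseudocode and Experiment Setup rows above describe the core of HC-Dyna's search-control: noisy gradient ascent on the learned value estimate, run for a budget of k steps with stochasticity η, with the visited states pushed onto a search-control (SC) queue. The following is a minimal Python sketch under stated assumptions, not the paper's implementation: `value_grad` is a hypothetical callable returning ∇_s V̂(s), the step size follows the quoted 0.1/‖·‖ normalization, and the paper's empirical state-covariance scaling Σ̂_s is approximated by the identity.

```python
import numpy as np

def hill_climb_search_control(value_grad, s0, k=100, eta=0.1, rng=None):
    """Sketch of HC-Dyna-style search-control: noisy hill climbing on V-hat.

    value_grad(s) -> ndarray is a hypothetical callable giving the gradient of
    the learned value estimate at state s. Starting from s0 (e.g., a state
    sampled from the ER buffer), take k noisy gradient-ascent steps and collect
    every visited state for the search-control (SC) queue. The paper scales the
    gradient and noise by an empirical state covariance; this sketch uses the
    identity instead.
    """
    rng = rng or np.random.default_rng()
    s = np.asarray(s0, dtype=float)
    visited = []
    for _ in range(k):
        g = value_grad(s)
        alpha = 0.1 / (np.linalg.norm(g) + 1e-8)  # quoted step-size normalization
        s = s + alpha * g + eta * rng.standard_normal(s.shape)  # ascent step plus stochasticity
        visited.append(s.copy())
    return visited  # states to push onto the SC queue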
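The ρ = 0.5 setting quoted in the Experiment Setup row controls how planning mini-batches mix states from the SC queue with ordinary experience replay. Below is a hedged sketch of that mixing; the function and argument names are hypothetical, both buffers are plain lists for simplicity, and in the paper the SC states are paired with model-simulated transitions before the update.

```python
import random

def sample_planning_batch(sc_queue, er_buffer, batch_size=32, rho=0.5):
    """Draw roughly rho * batch_size items from the search-control queue and
    fill the remainder from the experience-replay buffer, mirroring the quoted
    rho = 0.5 mixing and mini-batch size of 32.
    """
    n_sc = min(int(rho * batch_size), len(sc_queue))
    sc_part = random.sample(sc_queue, n_sc)
    er_part = random.sample(er_buffer, min(batch_size - n_sc, len(er_buffer)))
    return sc_part + er_part
```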