Hill Climbing on Value Estimates for Search-control in Dyna

Authors: Yangchen Pan, Hengshuai Yao, Amir-massoud Farahmand, Martha White

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide an empirical demonstration on four classical domains that our algorithm, HC-Dyna, can obtain significant sample efficiency improvements. We conduct experiments showing improved performance in four benchmark domains.
Researcher Affiliation | Collaboration | Yangchen Pan¹, Hengshuai Yao², Amir-massoud Farahmand³,⁴ and Martha White¹. ¹Department of Computing Science, University of Alberta, Canada; ²Huawei HiSilicon, Canada; ³Vector Institute, Canada; ⁴Department of Computer Science, University of Toronto, Canada. pan6@ualberta.ca, hengshuai.yao@huawei.com, farahmand@vectorinstitute.ai, whitem@ualberta.ca
Pseudocode | Yes | Algorithm 1 HC-Dyna (a hedged sketch of the search-control step appears after this table)
Open Source Code | No | The paper does not provide any explicit statement about making its source code publicly available or provide a link to a code repository.
Open Datasets | Yes | In this section, we present empirical results on four classic domains: the Grid World (Figure 1(a)), Mountain Car, Cart Pole and Acrobot. We test on a simplified Tabular Grid World domain of size 20 × 20.
Dataset Splits | No | The paper discusses training and evaluating models within reinforcement learning environments, but it does not specify explicit train/validation/test dataset splits in the conventional sense of data partitioning for supervised learning.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to conduct the experiments (e.g., GPU/CPU models, memory specifications).
Software Dependencies | No | The paper mentions using DQN and DDPG algorithms and a two-layer NN, but does not specify any software libraries (e.g., TensorFlow, PyTorch) or their version numbers.
Experiment Setup | Yes | Input: budget k for the number of gradient ascent steps (e.g., k = 100), stochasticity η for gradient ascent (e.g., η = 0.1), ρ percentage of updates from the SC queue (e.g., ρ = 0.5), d the number of state variables, i.e., S ⊆ ℝ^d. The agents all use a two-layer NN, with ReLU activations and 32 nodes in each layer. We set the step size to α = 0.1/||Σ̂_s g|| across all results in this work. In all further experiments in this paper, we set ρ = 0.5. The continuous-state setting uses NNs ... with a minibatch size of 32. For the tabular setting, the mini-batch size is 1. We further include multiple planning steps n, where for each real environment step, the agent does n updates with a mini-batch of size 32. (A sketch of this mini-batch mixing follows below.)
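
For the pseudocode row: Algorithm 1 itself is not reproduced on this page. The following is a minimal Python sketch of the noisy hill-climbing search-control step it describes, written against the hyperparameters quoted in the experiment-setup row (budget k, stochasticity η, normalized step size). The names value_grad and state_cov are stand-ins assumed for illustration (the gradient of the learned value estimate and the empirical state covariance used as a preconditioner); details such as how the noise is scaled, projection back onto the valid state space, and when states are pushed to the SC queue follow the paper rather than this sketch.

import numpy as np

def hill_climb_search_control(s, value_grad, state_cov, k=100, eta=0.1, rng=None):
    """Noisy hill climbing on a learned value estimate (search-control sketch).

    s          : starting state (d-dimensional array), e.g. sampled from the ER buffer
    value_grad : callable returning g = dV/ds for the learned value estimate (assumed)
    state_cov  : empirical covariance of visited states, used as a preconditioner (assumed)
    k          : budget of gradient-ascent steps (setup row example: k = 100)
    eta        : stochasticity of the ascent (setup row example: eta = 0.1)
    Returns the states visited along the ascent, as candidates for the SC queue.
    """
    rng = np.random.default_rng() if rng is None else rng
    visited = []
    for _ in range(k):
        g = value_grad(s)                              # gradient of the value estimate at s
        step = state_cov @ g                           # precondition with the state covariance
        alpha = 0.1 / (np.linalg.norm(step) + 1e-8)    # normalized step size, as in the setup row
        noise = eta * rng.standard_normal(s.shape)     # Gaussian stochasticity of the ascent
        s = s + alpha * step + noise
        visited.append(s.copy())
    return visited

# Toy usage with a quadratic stand-in for the learned value, V(s) = -||s||^2 (hypothetical).
d = 4
sc_candidates = hill_climb_search_control(
    s=np.ones(d),
    value_grad=lambda s: -2.0 * s,
    state_cov=np.eye(d),
)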
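
For the experiment-setup row: the quoted text also fixes how planning mini-batches are assembled. Below is a minimal sketch, assuming a hypothetical agent.update(batch) hook and simple list-backed buffers, of drawing a fraction ρ of each mini-batch of 32 from the search-control queue and the rest from the ER buffer, repeated for n planning updates per real environment step; how SC states are expanded into full transitions via the model is omitted here.

import random

def planning_updates(agent, er_buffer, sc_queue, n, batch_size=32, rho=0.5):
    """n Dyna-style planning updates on mixed mini-batches (setup sketch)."""
    n_sc = int(rho * batch_size)            # e.g. rho = 0.5 -> 16 samples from the SC queue
    n_er = batch_size - n_sc                # remaining samples from the regular ER buffer
    for _ in range(n):                      # n planning updates per real environment step
        batch = (random.sample(sc_queue, min(n_sc, len(sc_queue)))
                 + random.sample(er_buffer, min(n_er, len(er_buffer))))
        agent.update(batch)                 # hypothetical DQN/DDPG-style update on the mixed batch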