Frequency-based Search-control in Dyna

Authors: Yangchen Pan, Jincheng Mei, Amir-massoud Farahmand

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically show that a high-frequency function is more difficult to approximate. This suggests a search-control strategy: we should use states from high-frequency regions of the value function to query the model to acquire more samples. We develop a simple strategy to locally measure the frequency of a function by gradient and Hessian norms, and provide theoretical justification for this approach. We then apply our strategy to search-control in Dyna, and conduct experiments to show its properties and effectiveness on benchmark domains. (A hedged sketch of this local frequency measure appears below the table.)
Researcher Affiliation | Academia | Yangchen Pan & Jincheng Mei, Department of Computing Science, University of Alberta, Edmonton, AB, Canada ({pan6,jmei2}@ualberta.ca); Amir-massoud Farahmand, Vector Institute & University of Toronto, Toronto, ON, Canada (farahmand@vectorinstitute.ai)
Pseudocode | Yes | Algorithm 1: Dyna architecture with Frequency-based search-control ... Algorithm 4: Dyna architecture with Frequency-based search-control with additional details. (A control-flow sketch appears below the table.)
Open Source Code | No | The paper does not provide a statement about releasing its source code or a link to a code repository for the methodology described.
Open Datasets | Yes | The Mountain Car (Brockman et al., 2016) domain is well-studied... Then we illustrate the utility of our algorithm on a challenging self-designed Maze Grid World domain... Hopper-v2 and Walker2d-v2 from MuJoCo (Todorov et al., 2012). (An environment-construction snippet appears below the table.)
Dataset Splits | No | The paper does not specify training, validation, and test dataset splits with percentages, sample counts, or references to predefined splits for its main reinforcement learning experiments.
Hardware Specification | No | The paper does not specify the hardware used for running experiments, such as particular GPU or CPU models.
Software Dependencies | Yes | All of our implementations are based on TensorFlow version 1.13.0 (Abadi et al., 2015). For the DQN update, we use the Adam optimizer (Kingma & Ba, 2014).
Experiment Setup | Yes | For the DQN update, we use the Adam optimizer (Kingma & Ba, 2014). We use mini-batch size b = 32, except in the supervised learning experiment, where we use 128. For the reinforcement learning experiments, we use a buffer size of 100k. All activation functions are tanh, except the output layer of the Q-value, which is linear. Except for the output-layer parameters, which are initialized from a uniform distribution [-0.003, 0.003], all other parameters are initialized using Xavier initialization (Glorot & Bengio, 2010). For model learning, we use a neural network with 64 × 64 ReLU units to predict s' − s given a state-action pair, with mini-batch size 128 and learning rate 0.0001. (A model-network sketch under these settings appears below the table.)
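
The local frequency measure quoted in the Research Type row combines the gradient norm and Hessian norm of the value function. The sketch below is only an illustration of that idea, not the authors' implementation: it estimates ||∇v(s)||² + ||H_v(s)||²_F for a scalar value estimate v at a state s using central finite differences; the step size eps and the toy example are our own assumptions.

```python
import numpy as np

def local_frequency_measure(v, s, eps=1e-4):
    """Estimate ||grad v(s)||^2 + ||Hess v(s)||_F^2 with central finite differences.

    v   : callable mapping a state vector (1-D np.ndarray) to a scalar value estimate
    s   : np.ndarray, state at which the local frequency of v is measured
    eps : finite-difference step size (an arbitrary choice, not from the paper)
    """
    d = s.shape[0]
    grad = np.zeros(d)
    hess = np.zeros((d, d))
    for i in range(d):
        ei = np.zeros(d)
        ei[i] = eps
        # central difference for the i-th partial derivative
        grad[i] = (v(s + ei) - v(s - ei)) / (2.0 * eps)
        for j in range(d):
            ej = np.zeros(d)
            ej[j] = eps
            # central difference for the (i, j) second partial derivative
            hess[i, j] = (v(s + ei + ej) - v(s + ei - ej)
                          - v(s - ei + ej) + v(s - ei - ej)) / (4.0 * eps ** 2)
    return float(grad @ grad + np.sum(hess ** 2))

if __name__ == "__main__":
    # A faster-oscillating (higher-frequency) function yields a larger measure
    # at the same state, which is the property the search-control strategy exploits.
    s0 = np.array([0.3, -0.2])
    slow = lambda s: np.sin(s[0]) + np.cos(s[1])
    fast = lambda s: np.sin(10 * s[0]) + np.cos(10 * s[1])
    print(local_frequency_measure(slow, s0), local_frequency_measure(fast, s0))
```

States where this quantity is large are the ones the paper's search-control strategy prefers to use for querying the model.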
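Algorithms 1 and 4 are only named in the Pseudocode row, so the skeleton below is our reading of the general control flow of Dyna with a search-control queue filled by hill climbing on the frequency measure. All object interfaces (env, agent, model, hill_climb), the default constants, and the half-simulated/half-real mini-batch mixing are placeholders and assumptions, not the paper's exact procedure.

```python
import random
from collections import deque

def dyna_fsc(env, agent, model, hill_climb, num_steps=10000,
             planning_steps=10, batch_size=32, queue_size=1000):
    """Rough control-flow skeleton of Dyna with frequency-based search-control.

    env        : placeholder environment; reset() -> s, step(a) -> (s2, r, done)
    agent      : placeholder learner; act(s) -> a, update(list of transitions)
    model      : learned model; train(transition), predict(s, a) -> (r, s2, done)
    hill_climb : maps a visited state to a nearby state with a larger
                 gradient/Hessian-norm (frequency) measure
    """
    replay = deque(maxlen=100000)              # experience replay buffer (paper: 100k)
    search_control = deque(maxlen=queue_size)  # queue of high-frequency states

    s = env.reset()
    for _ in range(num_steps):
        # Real interaction with the environment.
        a = agent.act(s)
        s2, r, done = env.step(a)
        replay.append((s, a, r, s2, done))
        model.train((s, a, r, s2, done))

        # Search-control: climb from a previously visited state toward
        # a high-frequency region of the value function.
        start_state = random.choice(replay)[0]
        search_control.append(hill_climb(start_state))

        # Planning: query the model at search-control states and mix the
        # simulated transitions with real experience in each update.
        for _ in range(planning_steps):
            half = batch_size // 2
            if len(search_control) >= half and len(replay) >= half:
                simulated = []
                for sp in random.sample(list(search_control), half):
                    ap = agent.act(sp)
                    rp, sp2, dp = model.predict(sp, ap)
                    simulated.append((sp, ap, rp, sp2, dp))
                real = random.sample(list(replay), half)
                agent.update(simulated + real)

        s = env.reset() if done else s2
```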
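The benchmark domains quoted in the Open Datasets row are standard OpenAI Gym / MuJoCo tasks. A minimal way to instantiate them is shown below; "MountainCar-v0" is our guess at the registered Gym ID for the Mountain Car domain, and the self-designed Maze Grid World is not a publicly registered environment.

```python
import gym  # OpenAI Gym (Brockman et al., 2016); Hopper/Walker2d also require mujoco-py and a MuJoCo install

env_ids = ["MountainCar-v0", "Hopper-v2", "Walker2d-v2"]
envs = {name: gym.make(name) for name in env_ids}
for name, env in envs.items():
    print(name, env.observation_space, env.action_space)
```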
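The quoted experiment setup pins down the model-learning network fairly precisely (64 × 64 ReLU hidden units, predicts s' − s, mini-batch 128, learning rate 0.0001, Xavier initialization). Below is a minimal TensorFlow 1.13-style sketch consistent with that description; the squared-error loss, placeholder names, and function name are our assumptions, not code from the paper.

```python
import tensorflow as tf  # written against the TF 1.13-style API reported by the paper

def build_model_network(state_dim, action_dim, lr=0.0001):
    """64x64 ReLU network that predicts s' - s from a state-action pair.
    Layer sizes, target, learning rate, and Xavier init follow the quoted setup;
    the mean-squared-error loss is our assumption."""
    sa = tf.placeholder(tf.float32, [None, state_dim + action_dim], name="state_action")
    delta_target = tf.placeholder(tf.float32, [None, state_dim], name="next_state_minus_state")

    init = tf.glorot_uniform_initializer()  # Xavier initialization (Glorot & Bengio, 2010)
    h = tf.layers.dense(sa, 64, activation=tf.nn.relu, kernel_initializer=init)
    h = tf.layers.dense(h, 64, activation=tf.nn.relu, kernel_initializer=init)
    delta_pred = tf.layers.dense(h, state_dim, kernel_initializer=init)

    loss = tf.reduce_mean(tf.squared_difference(delta_pred, delta_target))
    train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)
    return sa, delta_target, delta_pred, loss, train_op

# Training would feed mini-batches of 128 concatenated (state, action) inputs and
# targets s' - s, e.g.:
#   sess.run(train_op, feed_dict={sa: batch_sa, delta_target: batch_delta})
```

The Q-network described in the same row (tanh hidden units, linear output layer initialized from uniform [-0.003, 0.003], Adam updates) would be built analogously.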