Frequency-based Search-control in Dyna
Authors: Yangchen Pan, Jincheng Mei, Amir-massoud Farahmand
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically show that a high frequency function is more difficult to approximate. This suggests a search-control strategy: we should use states from high frequency regions of the value function to query the model to acquire more samples. We develop a simple strategy to locally measure the frequency of a function by gradient and Hessian norms, and provide theoretical justification for this approach. We then apply our strategy to search-control in Dyna, and conduct experiments to show its property and effectiveness on benchmark domains. *(See the frequency-measure sketch after this table.)* |
| Researcher Affiliation | Academia | Yangchen Pan & Jincheng Mei Department of Computing Science University of Alberta Edmonton, AB, Canada {pan6,jmei2}@ualberta.ca Amir-massoud Farahmand Vector Institute & University of Toronto Toronto, ON, Canada farahmand@vectorinstitute.ai |
| Pseudocode | Yes | Algorithm 1 Dyna architecture with Frequency-based search-control ... Algorithm 4 Dyna architecture with Frequency-based search-control with additional details |
| Open Source Code | No | The paper does not provide a statement about releasing its source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | The Mountain Car (Brockman et al., 2016) domain is well-studied... Then we illustrate the utility of our algorithm on a challenging self-designed Maze Grid World domain... Hopper-v2 and Walker2d-v2 from Mujoco (Todorov et al., 2012) |
| Dataset Splits | No | The paper does not specify training, validation, and test dataset splits with percentages, sample counts, or references to predefined splits for its main reinforcement learning experiments. |
| Hardware Specification | No | The paper does not specify the hardware used for running experiments, such as particular GPU or CPU models. |
| Software Dependencies | Yes | All of our implementations are based on tensorflow with version 1.13.0 (Abadi et al., 2015). For DQN update, we use Adam optimizer (Kingma & Ba, 2014). |
| Experiment Setup | Yes | For DQN update, we use Adam optimizer (Kingma & Ba, 2014). We use mini-batch size b = 32 except on the supervised learning experiment where we use 128. For reinforcement learning experiment, we use buffer size 100k. All activation functions are tanh except the output layer of the Q-value is linear. Except the output layer parameters which were initialized from a uniform distribution [-0.003, 0.003], all other parameters are initialized using Xavier initialization (Glorot & Bengio, 2010). For model learning, we use a 64 × 64 relu units neural network to predict s' − s given a state-action pair with mini-batch size 128 and learning rate 0.0001. *(See the configuration sketch after this table.)* |
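
The "Research Type" row quotes the paper's strategy of measuring local frequency via gradient and Hessian norms of the learned value function and hill-climbing on that measure to collect states for Dyna's search-control queue. Below is a minimal sketch, assuming a small PyTorch value network (the paper's implementation uses TensorFlow 1.13) and a random-perturbation ascent in place of the paper's gradient-based hill climbing; `value_net`, `frequency_score`, and `hill_climb` are illustrative names, not the authors' code.

```python
import torch
import torch.nn as nn

# Hypothetical 2-D state space and a small value network (illustrative only;
# the paper's architectures and TensorFlow implementation differ).
value_net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

def frequency_score(state):
    """Local 'frequency' proxy: ||grad_s V(s)|| + ||Hess_s V(s)||_F."""
    s = state.clone().requires_grad_(True)
    v = value_net(s).squeeze()
    grad = torch.autograd.grad(v, s, create_graph=True)[0]
    # Build the Hessian row by row by differentiating each gradient component.
    hess_rows = [torch.autograd.grad(grad[i], s, retain_graph=True)[0]
                 for i in range(s.shape[0])]
    hess = torch.stack(hess_rows)
    return (grad.norm() + hess.norm()).item()

def hill_climb(start_state, steps=20, noise=0.05):
    """Collect states from high-frequency regions by stochastic ascent on the
    frequency score (a simplification of the paper's hill climbing)."""
    s, best = start_state.clone(), frequency_score(start_state)
    visited = [s.clone()]
    for _ in range(steps):
        candidate = s + noise * torch.randn_like(s)
        score = frequency_score(candidate)
        if score > best:
            s, best = candidate, score
            visited.append(s.clone())
    return visited  # candidate states for the Dyna search-control queue
```

In the Dyna loop described by Algorithm 1, states gathered this way would be paired with actions and passed to the learned model to generate additional planning updates.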
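
The "Experiment Setup" row reports tanh hidden units with a linear Q-value output, Xavier initialization except a uniform [-0.003, 0.003] output layer, Adam updates with mini-batch size 32, and a 64 × 64 ReLU model network trained to predict s' − s with mini-batch size 128 and learning rate 0.0001. The following is a minimal sketch of that configuration, again in PyTorch rather than the paper's TensorFlow 1.13; the Q-network hidden widths and the helper names are assumptions.

```python
import torch
import torch.nn as nn

def make_q_network(state_dim, num_actions, hidden=64):
    """Q-network per the reported setup: tanh hidden units, linear output,
    Xavier init for hidden layers, uniform [-0.003, 0.003] output layer.
    Hidden widths here are placeholders; the paper's exact sizes vary by domain."""
    net = nn.Sequential(
        nn.Linear(state_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, num_actions),  # linear output for Q-values
    )
    for layer in net[:-1]:
        if isinstance(layer, nn.Linear):
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)
    nn.init.uniform_(net[-1].weight, -0.003, 0.003)
    nn.init.uniform_(net[-1].bias, -0.003, 0.003)
    return net

def make_model_network(state_dim, action_dim):
    """Environment model: 64 x 64 ReLU network predicting s' - s from (s, a),
    trained with Adam at learning rate 1e-4 and mini-batch size 128 (as reported)."""
    net = nn.Sequential(
        nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, state_dim),
    )
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
    return net, optimizer
```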