Value Prediction Network
Authors: Junhyuk Oh, Satinder Singh, Honglak Lee
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that VPN has several advantages over both model-free and model-based baselines in a stochastic environment where careful planning is required but building an accurate observation-prediction model is difficult. Furthermore, VPN outperforms Deep Q-Network (DQN) on several Atari games even with short-lookahead planning, demonstrating its potential as a new way of learning a good state representation. Our experiments investigated the following questions: 1) Does VPN outperform model-free baselines (e.g., DQN)? 2) What is the advantage of planning with a VPN over observation-based planning? 3) Is VPN useful for complex domains with high-dimensional sensory inputs, such as Atari games? |
| Researcher Affiliation | Collaboration | Junhyuk Oh, Satinder Singh, Honglak Lee (University of Michigan; Google Brain). {junhyuk,baveja,honglak}@umich.edu, honglak@google.com |
| Pseudocode | Yes | Algorithm 1: Q-value from d-step planning (sketched in Python after the table). |
| Open Source Code | Yes | The code is available on https://github.com/junhyukoh/value-prediction-network. |
| Open Datasets | Yes | Furthermore, we show that our VPN outperforms DQN on several Atari games [2] even with short-lookahead planning... (Ref [2] is for The Arcade Learning Environment, a standard public platform/dataset). |
| Dataset Splits | No | The paper does not explicitly provide specific train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. It mentions training trajectories and evaluation but lacks explicit split details. |
| Hardware Specification | No | The paper mentions that '16 threads are used' for asynchronous n-step Q-learning but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | Our implementation is based on TensorFlow [1]. (While TensorFlow is named, no specific version number is provided for it or any other software dependencies.) |
| Experiment Setup | Yes | The target network is synchronized every 10K steps. We used the Adam optimizer [14], and the best learning rate and its decay were chosen from {0.0001, 0.0002, 0.0005, 0.001} and {0.98, 0.95, 0.9, 0.8}, respectively. The learning rate is multiplied by the decay every 1M steps. Our implementation is based on TensorFlow [1]. VPN has four more hyperparameters: 1) the number of prediction steps (k) during training, 2) the plan depth (d_train) during training, 3) the plan depth (d_test) during evaluation, and 4) the branching factor (b), which indicates the number of options to be simulated at each expansion step during planning. We used k = d_train = d_test throughout the experiments unless otherwise stated. VPN(d) denotes the model that learns to predict and simulate up to d-step futures during training and evaluation. The branching factor b was set to 4 up to a depth of 3 and to 1 beyond that, meaning VPN simulates the 4 best options up to depth 3 and only the best option afterwards. (See the training-schedule sketch after the table.) |
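
The "Pseudocode" row refers to the paper's Algorithm 1 (Q-value from d-step planning). Below is a minimal Python sketch of that recursion, assuming the learned VPN modules are supplied as plain callables named `value_fn`, `outcome_fn`, and `trans_fn`; these names and the fixed branching factor are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Algorithm 1 (Q-value from d-step planning).
# value_fn(s) -> v(s), outcome_fn(s, o) -> (reward, discount), trans_fn(s, o) -> s'
# stand in for the learned VPN value, outcome, and transition modules.

def q_plan(s, o, d, value_fn, outcome_fn, trans_fn, options, b=4):
    """Return the d-step planning estimate Q^d(s, o)."""
    reward, discount = outcome_fn(s, o)
    s_next = trans_fn(s, o)

    if d == 1:
        # Base case: one-step backup through the predicted next abstract state.
        return reward + discount * value_fn(s_next)

    # Expand only the b best options at s_next, ranked by their 1-step estimates Q^1.
    # (In the experiments b = 4 up to depth 3 and 1 beyond that; a fixed b is used here.)
    one_step = {o2: q_plan(s_next, o2, 1, value_fn, outcome_fn, trans_fn, options, b)
                for o2 in options}
    best = sorted(one_step, key=one_step.get, reverse=True)[:b]

    # Recurse on the selected options with one fewer planning step.
    q_deep = max(q_plan(s_next, o2, d - 1, value_fn, outcome_fn, trans_fn, options, b)
                 for o2 in best)

    # Mix the immediate value estimate with the deeper backup (weights 1/d and (d-1)/d).
    return reward + discount * (value_fn(s_next) / d + (d - 1) / d * q_deep)
```

The planned Q-values are then used for (ε-)greedy option selection, i.e., taking the argmax of `q_plan(s, o, d_test, ...)` over the available options, as in the paper.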
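
The schedule details quoted in the "Experiment Setup" row (learning-rate decay every 1M steps, target-network synchronization every 10K steps) reduce to simple step functions. The sketch below is illustrative only; the function names and the inspection loop are assumptions, and the default values are drawn from the grids quoted above.

```python
# Illustrative sketch of the training schedule quoted above; function names and
# the inspection loop are assumptions, not the authors' code.

def learning_rate(step, base_lr=0.0001, decay=0.98, decay_every=1_000_000):
    # The learning rate is multiplied by the decay factor every 1M steps; base_lr and
    # decay are selected from the grids listed in the Experiment Setup row.
    return base_lr * decay ** (step // decay_every)

def sync_target(step, sync_every=10_000):
    # The target network is synchronized with the online network every 10K steps.
    return step > 0 and step % sync_every == 0

# Inspect the schedule at a few points in training.
for step in (0, 10_000, 1_000_000, 5_000_000):
    print(step, learning_rate(step), sync_target(step))
```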