Value Prediction Network

Authors: Junhyuk Oh, Satinder Singh, Honglak Lee

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that VPN has several advantages over both model-free and model-based baselines in a stochastic environment where careful planning is required but building an accurate observation-prediction model is difficult. Furthermore, VPN outperforms Deep Q-Network (DQN) on several Atari games even with short-lookahead planning, demonstrating its potential as a new way of learning a good state representation. Our experiments investigated the following questions: 1) Does VPN outperform model-free baselines (e.g., DQN)? 2) What is the advantage of planning with a VPN over observation-based planning? 3) Is VPN useful for complex domains with high-dimensional sensory inputs, such as Atari games?
Researcher Affiliation | Collaboration | Junhyuk Oh, Satinder Singh, Honglak Lee; University of Michigan and Google Brain; {junhyuk,baveja,honglak}@umich.edu, honglak@google.com
Pseudocode | Yes | Algorithm 1: Q-value from d-step planning (a hedged sketch of this recursion is given below the table).
Open Source Code | Yes | The code is available on https://github.com/junhyukoh/value-prediction-network.
Open Datasets | Yes | "Furthermore, we show that our VPN outperforms DQN on several Atari games [2] even with short-lookahead planning..." (Reference [2] is the Arcade Learning Environment, a standard public platform/dataset; a minimal environment-loading sketch is given below the table.)
Dataset Splits | No | The paper does not explicitly provide train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. It mentions training trajectories and evaluation but lacks explicit split details.
Hardware Specification | No | The paper mentions that "16 threads are used" for asynchronous n-step Q-learning but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | "Our implementation is based on TensorFlow [1]." (TensorFlow is named, but no version number is provided for it or for any other software dependency.)
Experiment Setup | Yes | The target network is synchronized after every 10K steps. We used the Adam optimizer [14], and the best learning rate and its decay were chosen from {0.0001, 0.0002, 0.0005, 0.001} and {0.98, 0.95, 0.9, 0.8} respectively. The learning rate is multiplied by the decay every 1M steps. Our implementation is based on TensorFlow [1]. VPN has four more hyperparameters: 1) the number of prediction steps (k) during training, 2) the plan depth (d_train) during training, 3) the plan depth (d_test) during evaluation, and 4) the branching factor (b), which indicates the number of options to be simulated at each expansion step during planning. We used k = d_train = d_test throughout the experiments unless otherwise stated. VPN(d) represents our model, which learns to predict and simulate up to d-step futures during training and evaluation. The branching factor (b) was set to 4 up to a depth of 3 and to 1 beyond that, meaning that VPN simulates the 4 best options up to depth 3 and only the best option after that. (A hedged configuration sketch of these settings is given below the table.)
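The quoted "Algorithm 1: Q-value from d-step planning" can be read as a simple recursion over VPN's learned modules. Below is a minimal, hedged Python sketch of that recursion as I read it from the paper: value(), outcome(), and transition() are placeholder stand-ins for the learned value, outcome, and transition modules (here they return dummy numbers so the snippet runs), OPTIONS is a hypothetical option set, and the 1/d vs. (d-1)/d mixing of the shallow value estimate with the deeper backup follows the paper's description of d-step planning. None of this is taken from the released code.

```python
import random

OPTIONS = list(range(4))  # hypothetical option set; the real set is domain-specific

# Placeholder stand-ins for VPN's learned modules; they return dummy numbers
# so the sketch is runnable. In the real model these are neural networks
# operating on abstract states.
def value(s):
    return random.random()            # V_theta(s): value of abstract state s

def outcome(s, o):
    return random.random(), 0.99      # predicted (reward, discount) for option o

def transition(s, o):
    return (s, o)                     # predicted next abstract state

def q_plan(s, o, d, b=4):
    """Q-value of option o at abstract state s from d-step planning,
    following the recursion sketched by Algorithm 1 of the paper."""
    r, gamma = outcome(s, o)
    s_next = transition(s, o)
    if d == 1:
        return r + gamma * value(s_next)
    # Expand only the b most promising options, ranked by 1-step Q-values.
    ranked = sorted(OPTIONS, key=lambda op: q_plan(s_next, op, 1, b), reverse=True)
    best_backup = max(q_plan(s_next, op, d - 1, b) for op in ranked[:b])
    # Mix the shallow value estimate with the deeper backup
    # (1/d vs. (d-1)/d weighting, per my reading of the paper).
    v_d = value(s_next) / d + (d - 1) / d * best_backup
    return r + gamma * v_d

if __name__ == "__main__":
    s0 = ("root",)
    print({o: round(q_plan(s0, o, d=3), 3) for o in OPTIONS})
```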
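For the Atari experiments referenced under Open Datasets, the environments come from the Arcade Learning Environment. The sketch below shows one way to load an ALE game today; it assumes the current gymnasium + ale-py packaging, which postdates the paper's 2017 setup, and the choice of Seaquest is purely illustrative.

```python
# Load an Atari game through the Arcade Learning Environment (assumed
# gymnasium + ale-py packaging; not the authors' original setup).
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)               # registers the ALE/* environment ids
env = gym.make("ALE/Seaquest-v5")
obs, info = env.reset(seed=0)
for _ in range(10):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```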
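The hyperparameters quoted under Experiment Setup can be collected into a single configuration sketch. The key names below are my own, the depth value of 3 is only an example (the paper sets k = d_train = d_test per experiment), and the thread count comes from the Hardware Specification row; the remaining values are copied from the quoted text.

```python
# Hedged summary of the quoted setup as a plain config dict; key names are
# assumptions, values come from the table above.
vpn_config = {
    "optimizer": "Adam",
    "learning_rate_grid": [0.0001, 0.0002, 0.0005, 0.001],  # best chosen per task
    "lr_decay_grid": [0.98, 0.95, 0.9, 0.8],                # lr *= decay every 1M steps
    "lr_decay_interval_steps": 1_000_000,
    "target_network_sync_steps": 10_000,
    "prediction_steps_k": 3,       # example value; the paper uses k = d_train = d_test
    "plan_depth_train": 3,
    "plan_depth_test": 3,
    "branching_factor": {"up_to_depth_3": 4, "beyond_depth_3": 1},
    "num_actor_threads": 16,       # asynchronous n-step Q-learning workers
}

if __name__ == "__main__":
    for key, val in vpn_config.items():
        print(f"{key}: {val}")
```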