Proper Value Equivalence

Authors: Christopher Grimm, Andre Barreto, Greg Farquhar, David Silver, Satinder Singh

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first provide results from tabular experiments on a stochastic version of the Four Rooms domain which serve to corroborate our theoretical claims. Then, we present results from experiments across the full Atari 57 benchmark [3] showcasing that the insights from studying PVE and its relationship to MuZero can provide a benefit in practice at scale.
Researcher Affiliation | Collaboration | Christopher Grimm, Computer Science & Engineering, University of Michigan (crgrimm@umich.edu); Andre Barreto, Gregory Farquhar, David Silver, Satinder Singh, DeepMind ({andrebarreto, gregfar, davidsilver, baveja}@google.com)
Pseudocode | No | The paper describes its algorithms and derivations in prose and mathematical formulas but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code for the illustrative experiments is available at a URL provided in Appendix A.3.
Open Datasets | Yes | We use the standard OpenAI Gym wrapper for Atari environments [3]. (See the environment-construction sketch below the table.)
Dataset Splits | No | The paper evaluates on the Atari 57 benchmark and states that hyperparameters were not explicitly tuned but taken as defaults from a previous MuZero paper, implying standard evaluation protocols. However, it does not explicitly state train/validation/test dataset splits.
Hardware Specification | Yes | We run our experiments on a custom internal cluster using Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions using the "OpenAI Gym wrapper for Atari environments [3]" but does not specify version numbers for this or any other software dependency.
Experiment Setup | Yes | All experiments ran on 64 actors/1 learner on a custom internal cluster for 500 million frames with a batch size of 2048. For the Four Rooms domain experiments we use a learning rate of 1e-4 and a discount factor of 0.99 for all agents. The model updates are performed via Adam with epsilon 1e-3, beta1 0.9, and beta2 0.999. Atari experiments used a learning rate of 2e-4, Adam with epsilon 1e-3, beta1 0.9, and beta2 0.999. The agent was trained for 500M frames for all games. (See the optimizer configuration sketch below the table.)
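
The Open Datasets row cites the standard OpenAI Gym wrapper for Atari environments. The following is a minimal sketch, assuming a pre-0.26 `gym` API with ALE support and the common DeepMind-style preprocessing wrappers; the game choice and wrapper parameters are illustrative assumptions, not settings taken from the paper.

```python
# Minimal sketch: building an Atari environment through the OpenAI Gym wrapper.
# The game and preprocessing parameters below are illustrative assumptions,
# not values reported in the paper. Assumes a pre-0.26 gym API (4-tuple step).
import gym


def make_atari_env(game: str = "PongNoFrameskip-v4", num_stacked_frames: int = 4):
    env = gym.make(game)
    # Standard Atari 57 preprocessing: frame skipping, grayscale, 84x84 resize.
    env = gym.wrappers.AtariPreprocessing(
        env, frame_skip=4, screen_size=84, grayscale_obs=True
    )
    # Stack recent frames so the agent can infer motion.
    env = gym.wrappers.FrameStack(env, num_stacked_frames)
    return env


if __name__ == "__main__":
    env = make_atari_env()
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
```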
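
The Experiment Setup row reports the optimizer settings (Adam with epsilon 1e-3, betas 0.9/0.999, learning rates 1e-4 for Four Rooms and 2e-4 for Atari, batch size 2048 for Atari, discount 0.99 for the Four Rooms agents). Below is a minimal sketch of wiring those values into a standard PyTorch Adam optimizer; the config names and placeholder network are illustrative, not the authors' implementation.

```python
# Hedged sketch: the numeric values come from the Experiment Setup row above;
# everything else (config names, the placeholder network) is illustrative.
import torch

FOUR_ROOMS_CONFIG = {
    "learning_rate": 1e-4,
    "discount": 0.99,
    "adam_eps": 1e-3,
    "adam_betas": (0.9, 0.999),
}
ATARI_CONFIG = {
    "learning_rate": 2e-4,
    "batch_size": 2048,
    "adam_eps": 1e-3,
    "adam_betas": (0.9, 0.999),
}


def build_optimizer(model: torch.nn.Module, cfg: dict) -> torch.optim.Adam:
    """Create an Adam optimizer with the reported hyperparameters."""
    return torch.optim.Adam(
        model.parameters(),
        lr=cfg["learning_rate"],
        betas=cfg["adam_betas"],
        eps=cfg["adam_eps"],
    )


# Example usage with a throwaway placeholder network (not the paper's model).
model = torch.nn.Linear(64, 1)
optimizer = build_optimizer(model, ATARI_CONFIG)
```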