Proper Value Equivalence
Authors: Christopher Grimm, Andre Barreto, Gregory Farquhar, David Silver, Satinder Singh
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first provide results from tabular experiments on a stochastic version of the Four Rooms domain, which serve to corroborate our theoretical claims. Then, we present results from experiments across the full Atari 57 benchmark [3], showcasing that the insights from studying PVE and its relationship to MuZero can provide a benefit in practice at scale. (A toy tabular sketch of value equivalence follows the table.) |
| Researcher Affiliation | Collaboration | Christopher Grimm, Computer Science & Engineering, University of Michigan, crgrimm@umich.edu; Andre Barreto, Gregory Farquhar, David Silver, Satinder Singh, DeepMind, {andrebarreto,gregfar,davidsilver,baveja}@google.com |
| Pseudocode | No | The paper describes algorithms and derivations in text and mathematical formulas but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the illustrative experiments is available at a URL provided in Appendix A.3. |
| Open Datasets | Yes | We use the standard OpenAI Gym wrapper for Atari environments [3]. (An environment-setup sketch follows the table.) |
| Dataset Splits | No | The paper states that hyperparameters for the Atari 57 benchmark were not explicitly tuned but were instead taken as defaults from a previous MuZero paper, suggesting standard evaluation conventions. It does not, however, explicitly state train/validation/test dataset splits in the provided text. |
| Hardware Specification | Yes | We run our experiments on a custom internal cluster using Nvidia V100 GPUs. |
| Software Dependencies | No | The paper mentions using the "OpenAI Gym wrapper for Atari environments [3]" but does not specify any version numbers for this or other software dependencies. |
| Experiment Setup | Yes | All experiments ran with 64 actors and 1 learner on a custom internal cluster for 500 million frames with a batch size of 2048. The Four Rooms domain experiments used a learning rate of 1e-4 and a discount factor of 0.99 for all agents; the Atari experiments used a learning rate of 2e-4. In both cases, model updates were performed via Adam with epsilon 1e-3, beta1 0.9, and beta2 0.999. (An optimizer-configuration sketch follows the table.) |
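
The Research Type row references the paper's central notion of (proper) value equivalence. As a minimal sketch, assuming a toy tabular MDP with random dynamics (none of these quantities come from the paper's Four Rooms domain): a model is value equivalent for a set of policies if it reproduces their value functions, and proper value equivalence extends this to all policies.

```python
import numpy as np

def policy_value(P, r, pi, gamma=0.99):
    """Exact policy evaluation: v_pi = (I - gamma * P_pi)^{-1} r_pi.

    P: (A, S, S) transition tensor, r: (S, A) rewards, pi: (S, A) policy.
    """
    n_states = r.shape[0]
    P_pi = np.einsum("sa,ast->st", pi, P)  # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, r)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Toy MDP with random dynamics (a placeholder, not the paper's Four Rooms domain).
rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
P_env = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.standard_normal((n_states, n_actions))
P_model = P_env.copy()  # candidate model; identical here, so trivially PVE

# A properly value-equivalent model matches the environment's value function
# for every policy; here we spot-check a few random policies.
for _ in range(5):
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)
    gap = np.abs(policy_value(P_env, r, pi) - policy_value(P_model, r, pi))
    print(f"max value gap: {gap.max():.2e}")  # 0 for a PVE model
```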
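
For the Open Datasets row: the paper reports using the standard OpenAI Gym wrapper for Atari [3]. Below is a minimal sketch of instantiating such an environment, assuming a Gym version contemporary with the paper (the classic pre-0.26 reset/step API) and an illustrative environment ID; the paper pins neither.

```python
import gym  # Atari support requires the extras, e.g. `pip install gym[atari]`

# Illustrative environment ID; the paper evaluates the full Atari 57 suite
# and does not single out a specific game or Gym version.
env = gym.make("BreakoutNoFrameskip-v4")

obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()  # random policy as a placeholder agent
    obs, reward, done, info = env.step(action)  # classic (pre-0.26) Gym API
    episode_return += reward
env.close()
print(f"episode return: {episode_return}")
```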
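
For the Experiment Setup row, the reported optimizer settings translate directly into code. A minimal PyTorch sketch follows, assuming PyTorch as the framework (the paper does not name one) and a stand-in network in place of the MuZero-style model:

```python
import torch

# Stand-in for the agent network; the paper's actual architecture follows
# MuZero and is not reproduced here.
model = torch.nn.Linear(512, 18)

LEARNING_RATE = 2e-4  # Atari experiments (the Four Rooms experiments used 1e-4)
BATCH_SIZE = 2048     # reported batch size for all experiments

# Adam with epsilon 1e-3, beta1 0.9, beta2 0.999, as reported in the paper.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=LEARNING_RATE,
    betas=(0.9, 0.999),
    eps=1e-3,
)
```

Note that the reported epsilon of 1e-3 is much larger than PyTorch's Adam default of 1e-8; larger epsilon values are a common stabilizing choice in deep RL training.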