Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction

Authors: Michael Janner, Igor Mordatch, Sergey Levine

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental evaluation is designed to study the viability of γ-models as a replacement for conventional single-step models for long-horizon state prediction and model-based control. Figure 5 shows learning curves for all methods. We find that γ-MVE converges faster than prior algorithms, twice as quickly as SAC, while retaining their asymptotic performance.
Researcher Affiliation | Collaboration | Michael Janner (UC Berkeley), Igor Mordatch (Google Brain), Sergey Levine (UC Berkeley and Google Brain); {janner, svlevine}@eecs.berkeley.edu, imordatch@google.com
Pseudocode | Yes | Algorithm 1: γ-model training without density evaluation. Algorithm 2: γ-model training with density evaluation. (A minimal training sketch based on these algorithms appears after this table.)
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository.
Open Datasets | Yes | We investigate γ-model predictions as a function of discount in continuous-action versions of two benchmark environments suitable for visualization: acrobot (Sutton, 1996) and pendulum. The training data come from a mixture distribution over all intermediate policies of 200 epochs of optimization with soft actor-critic (SAC; Haarnoja et al. 2018).
Dataset Splits | No | The paper mentions 'training data' but does not specify explicit training, validation, or test dataset splits with percentages, counts, or a detailed splitting methodology.
Hardware Specification | No | The acknowledgements section mentions 'computational resource donations from Amazon', but no specific hardware details such as GPU/CPU models, processors, or cloud instance specifications are provided for running the experiments.
Software Dependencies | Yes | All models were implemented using PyTorch 1.5.0 and CUDA 10.2.
Experiment Setup | Yes | Further implementation details, including all hyperparameter settings and network architectures, are included in Appendix C. (Appendix C reports: MLPs with 2 hidden layers of 256 units and ReLU activations, learning rate of 3e-4, batch size of 256, target update coefficient of 0.005, discount factor γ = 0.99 for all MuJoCo experiments, etc.) A configuration sketch matching these reported values appears after the table.
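
The Pseudocode row above refers to the paper's two training procedures. As a rough illustration only, the sketch below shows a bootstrapped, TD-style maximum-likelihood update in the spirit of Algorithm 2 (the density-evaluation variant), written in PyTorch as named under Software Dependencies. The GammaModel class is a stand-in conditional Gaussian rather than the generative model used in the paper, and names such as gamma_model_update, policy, and the batch format are assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class GammaModel(nn.Module):
    """Stand-in conditional Gaussian gamma-model.

    The paper uses a more expressive generative model; a diagonal Gaussian is
    substituted here so the sketch stays short and self-contained.
    """

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim)
        self.log_std = nn.Linear(hidden, state_dim)

    def dist(self, s, a):
        h = self.trunk(torch.cat([s, a], dim=-1))
        return torch.distributions.Normal(self.mean(h), self.log_std(h).clamp(-5, 2).exp())

    def log_prob(self, s, a, s_e):
        return self.dist(s, a).log_prob(s_e).sum(-1)

    def sample(self, s, a):
        return self.dist(s, a).sample()


def gamma_model_update(model, target_model, policy, batch, optimizer, gamma=0.99, tau=0.005):
    """One bootstrapped maximum-likelihood (TD-style) update on the gamma-model."""
    s, a, s_next = batch  # tensors of shape (batch, state_dim) / (batch, action_dim)
    with torch.no_grad():
        a_next = policy(s_next)                      # a' ~ pi(. | s')
        bootstrap = target_model.sample(s_next, a_next)
        # Bootstrapped target mixture: (1 - gamma) * p(s' | s, a) + gamma * mu_target(. | s', a')
        use_bootstrap = (torch.rand(s.shape[0], 1) < gamma).float()
        s_e = use_bootstrap * bootstrap + (1.0 - use_bootstrap) * s_next

    loss = -model.log_prob(s, a, s_e).mean()         # maximize likelihood of target samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Polyak averaging of the target model (coefficient 0.005 per Appendix C)
    with torch.no_grad():
        for p, p_t in zip(model.parameters(), target_model.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
    return loss.item()
```

Roughly speaking, the variant without density evaluation (Algorithm 1) keeps the same bootstrapped target construction but replaces the likelihood term with a sample-based objective, since the model's density cannot be evaluated directly.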
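
The Experiment Setup row quotes Appendix C's hyperparameters. A minimal sketch of a configuration object and network builder consistent with those reported values follows; the dataclass, its field names, and the choice of Adam in the usage note are illustrative assumptions rather than details confirmed by the paper (apart from the quoted numbers themselves).

```python
import torch
import torch.nn as nn
from dataclasses import dataclass


@dataclass
class GammaModelConfig:
    # Values quoted from Appendix C; the field names themselves are illustrative.
    hidden_layers: int = 2
    hidden_units: int = 256
    learning_rate: float = 3e-4
    batch_size: int = 256
    target_update_tau: float = 0.005
    discount: float = 0.99  # gamma used for all MuJoCo experiments


def build_mlp(in_dim: int, out_dim: int, cfg: GammaModelConfig) -> nn.Sequential:
    """Two hidden layers of 256 units with ReLU activations, per Appendix C."""
    layers, dim = [], in_dim
    for _ in range(cfg.hidden_layers):
        layers += [nn.Linear(dim, cfg.hidden_units), nn.ReLU()]
        dim = cfg.hidden_units
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)


# Example usage (Adam is an assumption; the paper reports only the learning rate):
# cfg = GammaModelConfig()
# net = build_mlp(in_dim=4, out_dim=4, cfg=cfg)
# optimizer = torch.optim.Adam(net.parameters(), lr=cfg.learning_rate)
```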