Value Gradient weighted Model-Based Reinforcement Learning
Authors: Claas A Voelcker, Victor Liao, Animesh Garg, Amir-massoud Farahmand
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify our analysis by showing that our loss function is able to achieve high returns on the MuJoCo benchmark suite while being more robust than maximum likelihood based approaches. Our experiments show, qualitatively and quantitatively, that the VaGraM loss impacts the resulting state and value prediction accuracy, and that it solves the optimization problems of previously published approaches. Beyond pedagogical domains, we show that VaGraM performs on par with a current state-of-the-art MBRL algorithm in more complex continuous control domains, while improving robustness to irrelevant dimensions in the state space and smaller model sizes. |
| Researcher Affiliation | Collaboration | Claas A. Voelcker (Vector Institute; University of Toronto), Victor Liao (Vector Institute; University of Waterloo), Animesh Garg (Vector Institute; University of Toronto; Nvidia), Amir-massoud Farahmand (Vector Institute; University of Toronto). Correspondence to c.voelcker@cs.toronto.edu |
| Pseudocode | Yes | Algorithm 1: Value-Gradient weighted Model learning (VaGraM). A hedged sketch of the corresponding loss function appears after this table. |
| Open Source Code | Yes | We provide an open-source version of our code at https://github.com/pairlab/vagram. |
| Open Datasets | Yes | We used the Hopper environment from the OpenAI Gym benchmark (Brockman et al., 2016). |
| Dataset Splits | No | The paper mentions evaluating on a "held out dataset" but does not provide specific percentages or counts for training, validation, and test splits. It describes data usage for training the value function (e.g., "linearly increasing the amount of model samples from 0 to 95% of the SAC replay buffer over the first 40 epochs of training"), which is a training strategy, not a dataset split description. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as CPU models, GPU models, or memory. It only mentions general support from institutions. |
| Software Dependencies | No | The paper states: "All experiments were implemented in PyTorch (Paszke et al., 2019)". While PyTorch is mentioned and cited, a specific version number (e.g., PyTorch 1.9) is not stated in the paper itself; the paper notes that exact versions are documented in the released source code. |
| Experiment Setup | Yes | More information on the implementation and hyperparameters of all of our experiments can be found in Appendix E. In Appendix E: For the Pendulum experiments, we use a simple fully connected neural network with a single layer, and a linear regression without feature transformations as architectures. The used non-linearity is ReLU. ... To assure a fair comparison we used the hyperparameters provided by Janner et al. (2019) for all experiments with our approach and the NLL loss function used for the baseline. All models used were fully connected neural networks with SiLU non-linearities and standard initialization. For all experiments with full model size we followed MBPO and used seven ensemble members with four layers and 200 neurons per layer. ... it was necessary to make a small alteration to the training setup: in the provided implementation, the value function is solely trained on model samples. Since our model is directly dependent on the value function, we need to break the inter-dependency between model and value function in the early training iterations. Hence, we used both real environment data and model data to train the value function, linearly increasing the amount of model samples from 0 to 95% of the SAC replay buffer over the first 40 epochs of training (corresponding to 40,000 real environment steps). Hedged sketches of the ensemble architecture and this mixing schedule appear after this table. |
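
For readers reproducing Algorithm 1, below is a minimal PyTorch sketch of the value-gradient weighted model loss that VaGraM substitutes for maximum-likelihood (or MSE) model training: the one-step prediction error is weighted by the gradient of the value function at the observed next state, so errors along value-sensitive directions are penalized most. The names `vagram_loss`, `model`, and `value_fn` are ours, and the deterministic model interface is an assumption; the authors' implementation is at https://github.com/pairlab/vagram.

```python
import torch

def vagram_loss(model, value_fn, s, a, s_next):
    """Value-gradient weighted model loss (sketch):
    ((grad_s V(s'))^T (f(s, a) - s'))^2, averaged over the batch."""
    # Detached copy of the observed next state that we can differentiate
    # the value function at.
    s_tgt = s_next.detach().requires_grad_(True)
    v = value_fn(s_tgt).sum()
    # Gradient of the value function w.r.t. the observed next state;
    # this acts as a per-dimension weight on the model error.
    grad_v = torch.autograd.grad(v, s_tgt)[0].detach()   # (batch, state_dim)
    pred = model(s, a)                                   # (batch, state_dim)
    # Squared inner product of value gradient and prediction error,
    # instead of the raw mean-squared error.
    err = pred - s_next.detach()
    return ((grad_v * err).sum(dim=-1) ** 2).mean()
```

Compared with an MSE loss, which treats all state dimensions equally, this objective lets the model ignore dimensions the value function is insensitive to, which is consistent with the paper's reported robustness to irrelevant state dimensions.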
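The model description in Appendix E (seven ensemble members; four hidden layers of 200 units; SiLU non-linearities) translates directly into code. A sketch of a single member follows; the deterministic output head and the Hopper dimensions (11-dim state, 3-dim action) are our simplifying assumptions, since MBPO-style models typically output Gaussian mean and variance.

```python
import torch.nn as nn

def make_member(state_dim: int, action_dim: int,
                hidden: int = 200, depth: int = 4) -> nn.Sequential:
    """One dynamics-ensemble member: fully connected, SiLU non-linearities,
    four hidden layers of 200 units, per Appendix E."""
    dims = [state_dim + action_dim] + [hidden] * depth
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.SiLU()]
    layers.append(nn.Linear(hidden, state_dim))  # deterministic head (assumption)
    return nn.Sequential(*layers)

# Seven members, matching MBPO's full model size; Hopper dims as an example.
ensemble = nn.ModuleList(make_member(11, 3) for _ in range(7))
```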
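Finally, the real-versus-model data mixing quoted in the Experiment Setup row is a simple linear ramp. A sketch, with a function name and signature of our own choosing:

```python
def model_sample_fraction(epoch: int, ramp_epochs: int = 40,
                          max_fraction: float = 0.95) -> float:
    """Fraction of the value function's training data drawn from model
    rollouts: linearly increased from 0 to 95% of the SAC replay buffer
    over the first 40 epochs (40,000 real environment steps); the rest
    comes from real transitions."""
    return min(epoch / ramp_epochs, 1.0) * max_fraction

assert model_sample_fraction(0) == 0.0
assert model_sample_fraction(20) == 0.475
assert model_sample_fraction(40) == model_sample_fraction(100) == 0.95
```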