Value Gradient weighted Model-Based Reinforcement Learning

Authors: Claas A Voelcker, Victor Liao, Animesh Garg, Amir-massoud Farahmand

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify our analysis by showing that our loss function is able to achieve high returns on the MuJoCo benchmark suite while being more robust than maximum likelihood based approaches. Our experiments show, qualitatively and quantitatively, that the VaGraM loss impacts the resulting state and value prediction accuracy, and that it solves the optimization problems of previously published approaches. Beyond pedagogical domains, we show that VaGraM performs on par with a current state-of-the-art MBRL algorithm in more complex continuous control domains, while improving robustness to irrelevant dimensions in the state space and smaller model sizes.
Researcher Affiliation | Collaboration | Claas A. Voelcker1,2, Victor Liao1,3, Animesh Garg1,2,4, Amir-massoud Farahmand1,2. 1 Vector Institute, 2 University of Toronto, 3 University of Waterloo, 4 Nvidia. Correspondence to c.voelcker@cs.toronto.edu
Pseudocode | Yes | Algorithm 1: Value-Gradient weighted Model learning (VaGraM)
Open Source Code | Yes | We provide an open-source version of our code at https://github.com/pairlab/vagram.
Open Datasets | Yes | We used the Hopper environment from the OpenAI Gym benchmark (Brockman et al., 2016).
Dataset Splits | No | The paper mentions evaluating on a "held out dataset" but does not provide specific percentages or counts for training, validation, and test splits. It describes data usage for training the value function (e.g., "linearly increasing the amount of model samples from 0 to 95% of the SAC replay buffer over the first 40 epochs of training"), which is a training strategy, not a dataset split description.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as CPU models, GPU models, or memory. It only mentions general support from institutions.
Software Dependencies | No | The paper states: "All experiments were implemented in PyTorch (Paszke et al., 2019)". While PyTorch is mentioned and cited, a specific version number (e.g., PyTorch 1.9) is not explicitly stated in the text of the paper itself. The paper notes that versions are documented in the source code, but this information is not given directly in the paper.
Experiment Setup | Yes | More information on the implementation and hyperparameters of all of our experiments can be found in Appendix E. In Appendix E: For the Pendulum experiments, we use a simple fully connected neural network with a single layer, and a linear regression without feature transformations as architectures. The used non-linearity is ReLU. ... To assure a fair comparison we used the hyperparameters provided by Janner et al. (2019) for all experiments with our approach and the NLL loss function used for the baseline. All models used were fully connected neural networks with SiLU non-linearities and standard initialization. For all experiments with full model size we followed MBPO and used seven ensemble members with four layers and 200 neurons per layer. ... it was necessary to make a small alteration to the training setup: in the provided implementation, the value function is solely trained on model samples. Since our model is directly dependent on the value function, we need to break the inter-dependency between model and value function in the early training iterations. Hence, we used both real environment data and model data to train the value function, linearly increasing the amount of model samples from 0 to 95% of the SAC replay buffer over the first 40 epochs of training (corresponding to 40,000 real environment steps).
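The linearly increasing model-sample ratio described in the experiment setup can be sketched as a simple schedule. This is a minimal illustration of the quoted description (0 to 95% over the first 40 epochs, then constant); the function name, signature, and defaults are assumptions for illustration, not taken from the authors' code.

```python
def model_sample_fraction(epoch: int,
                          max_fraction: float = 0.95,
                          ramp_epochs: int = 40) -> float:
    """Fraction of value-function training data drawn from model rollouts.

    Sketch of the schedule quoted from Appendix E: the share of
    model-generated samples in the SAC replay mix rises linearly from
    0 to ``max_fraction`` (95%) over the first ``ramp_epochs`` (40)
    epochs of training, then stays constant. Names are hypothetical.
    """
    if epoch >= ramp_epochs:
        return max_fraction
    return max_fraction * epoch / ramp_epochs
```

Under this sketch, a batch at epoch `e` would combine `model_sample_fraction(e)` model samples with `1 - model_sample_fraction(e)` real environment samples, which breaks the early-training inter-dependency between the value-aware model and the value function.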