Control-Oriented Model-Based Reinforcement Learning with Implicit Differentiation

Authors: Evgenii Nikishin, Romina Abachi, Rishabh Agarwal, Pierre-Luc Bacon (pp. 7886-7894)

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental "We provide theoretical and empirical evidence highlighting the benefits of our approach in the model misspecification regime compared to likelihood-based methods."; "This section aims to test the following hypotheses: The OMD agent with approximations from Section 6 achieves near-optimal returns. The performance of OMD is better compared to MLE under the model misspecification. Parameters θ of the OMD model have low likelihood, yet an agent acting with the Q-function trained with the model achieves near-optimal returns in the true MDP."; and Section 7, "Experiments with Function Approximation".
Researcher Affiliation Collaboration 1Mila, Université de Montréal; 2Vector Institute, University of Toronto; 3Google Research, Brain Team; 4Facebook CIFAR AI Chair
Pseudocode Yes Algorithm 1: Model-Based RL with OMD
Open Source Code No The paper links to a third-party implementation of Soft Actor-Critic (https://github.com/ikostrikov/jaxrl), which was used as the inner optimizer, but it does not provide access to the authors' own source code for the OMD method.
Open Datasets Yes Setup. We first use CartPole (Barto, Sutton, and Anderson 1983) and later include results on MuJoCo HalfCheetah (Todorov, Erez, and Tassa 2012) with similar findings further supporting our conclusions.
Dataset Splits No The paper mentions using the CartPole and HalfCheetah environments but does not provide specific details on the training, validation, and test dataset splits used for reproduction, such as exact percentages or sample counts.
Hardware Specification No The paper acknowledges Compute Canada for computational resources but does not provide specific hardware details, such as GPU or CPU models or memory specifications, used for running the experiments.
Software Dependencies No The paper mentions JAX and refers to a JAX-based implementation of Soft Actor-Critic, but it does not provide specific version numbers for these or any other software dependencies required to replicate the experiment.
Experiment Setup No The paper mentions some general settings, such as a soft Bellman operator with temperature α = 0.01 and the use of K steps for optimization, but it does not provide a comprehensive list of experimental details, such as learning rates, batch sizes, exact optimizer settings, or other hyperparameters required for full reproducibility. Hedged sketches of the soft Bellman update from Algorithm 1 and of the implicit model gradient follow below.
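
To make the Pseudocode and Experiment Setup rows above more concrete, here is a minimal sketch of the control-oriented training idea behind Algorithm 1, written for a tiny tabular MDP. It is an illustrative assumption rather than the authors' code: the MDP sizes, the reward table, the helper names soft_bellman_update, solve_q, and control_objective, and the choice of the soft value at state 0 as the outer objective are all hypothetical; only the soft-Bellman temperature α = 0.01 is taken from the paper's reported setup.

import jax
import jax.numpy as jnp

S, A = 5, 2                  # assumed tiny tabular MDP
GAMMA, ALPHA = 0.99, 0.01    # discount (assumed) and soft-Bellman temperature from the paper

def soft_bellman_update(q, theta, rewards):
    # One application of the soft Bellman operator under model theta,
    # where theta holds transition logits and softmax gives P(s' | s, a).
    p = jax.nn.softmax(theta, axis=-1)                 # (S, A, S)
    v = ALPHA * jax.nn.logsumexp(q / ALPHA, axis=-1)   # soft state values, (S,)
    return rewards + GAMMA * jnp.einsum('sat,t->sa', p, v)

def solve_q(theta, rewards, k=50):
    # Approximate the fixed point with K operator applications,
    # mirroring the truncated inner optimization the paper alludes to.
    q = jnp.zeros((S, A))
    for _ in range(k):
        q = soft_bellman_update(q, theta, rewards)
    return q

def control_objective(theta, rewards):
    # Outer objective: soft value of an assumed initial state 0 under the
    # Q-function induced by the model. (The paper's actual objective evaluates
    # the induced policy in the true MDP; this is a deliberate simplification.)
    q = solve_q(theta, rewards)
    return ALPHA * jax.nn.logsumexp(q[0] / ALPHA)

theta = jnp.zeros((S, A, S))   # model (transition) parameters
rewards = jnp.ones((S, A))     # assumed known reward table
model_grad = jax.grad(control_objective)(theta, rewards)
print(model_grad.shape)        # (5, 2, 5): a control-oriented update direction for the model

Here the model gradient is obtained by differentiating through the unrolled inner loop; the next sketch replaces that unrolling with implicit differentiation, which is the point of the paper's title.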
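
The "implicit differentiation" in the title refers to computing that model gradient at the fixed point Q* = T_θ(Q*) rather than backpropagating through the unrolled iterations. The sketch below applies the implicit function theorem and solves the resulting adjoint linear system; the name implicit_grad and the dense solve are illustrative choices under the same toy assumptions, not the authors' implementation (the paper relies on further approximations, described in its Section 6, to scale this up).

import jax
import jax.numpy as jnp

def implicit_grad(bellman_op, theta, q_star, outer_loss):
    # Gradient of outer_loss(Q*) w.r.t. theta, where Q* = bellman_op(Q*, theta).
    # Implicit function theorem: dQ*/dtheta = (I - dT/dQ)^{-1} dT/dtheta, hence
    # dL/dtheta = lam^T dT/dtheta with (I - dT/dQ)^T lam = dL/dQ*.
    flat_q = q_star.ravel()
    n = flat_q.size

    def op_flat(q_flat, th):
        return bellman_op(q_flat.reshape(q_star.shape), th).ravel()

    dT_dq = jax.jacobian(op_flat, argnums=0)(flat_q, theta)                    # (n, n)
    dL_dq = jax.grad(lambda q: outer_loss(q.reshape(q_star.shape)))(flat_q)    # (n,)

    lam = jnp.linalg.solve((jnp.eye(n) - dT_dq).T, dL_dq)                      # adjoint solve

    # Contract the adjoint vector with dT/dtheta via a vector-Jacobian product,
    # so the full Jacobian with respect to theta is never materialized.
    _, pullback = jax.vjp(lambda th: op_flat(flat_q, th), theta)
    (d_theta,) = pullback(lam)
    return d_theta

# Usage with the hypothetical names from the previous sketch:
# q_star = solve_q(theta, rewards)
# g = implicit_grad(lambda q, th: soft_bellman_update(q, th, rewards),
#                   theta, q_star,
#                   lambda q: ALPHA * jax.nn.logsumexp(q[0] / ALPHA))

The dense n-by-n adjoint solve is only viable for tabular toys; with function approximation one would need iterative or approximate solvers, which is presumably what the Section 6 approximations quoted in the Research Type row address.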