Control-Oriented Model-Based Reinforcement Learning with Implicit Differentiation
Authors: Evgenii Nikishin, Romina Abachi, Rishabh Agarwal, Pierre-Luc Bacon
AAAI 2022, pp. 7886-7894 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We provide theoretical and empirical evidence highlighting the benefits of our approach in the model misspecification regime compared to likelihood-based methods."; "This section aims to test the following hypotheses: The OMD agent with approximations from Section 6 achieves near-optimal returns. The performance of OMD is better compared to MLE under the model misspecification. Parameters θ of the OMD model have low likelihood, yet an agent acting with the Q-function trained with the model achieves near-optimal returns in the true MDP."; "7 Experiments with Function Approximation" |
| Researcher Affiliation | Collaboration | ¹Mila, Université de Montréal; ²Vector Institute, University of Toronto; ³Google Research, Brain Team; ⁴Facebook CIFAR AI Chair |
| Pseudocode | Yes | Algorithm 1: Model-Based RL with OMD (an illustrative sketch of this update appears below the table) |
| Open Source Code | No | The paper links to a third-party JAX implementation of Soft Actor-Critic (https://github.com/ikostrikov/jaxrl), which was used as the inner optimizer, but it does not provide source code for the OMD method itself. |
| Open Datasets | Yes | Setup. We first use CartPole (Barto, Sutton, and Anderson 1983) and later include results on MuJoCo HalfCheetah (Todorov, Erez, and Tassa 2012) with similar findings further supporting our conclusions. |
| Dataset Splits | No | The paper mentions using the CartPole and HalfCheetah environments but does not report training, validation, and test splits, such as exact percentages or sample counts, needed for reproduction. |
| Hardware Specification | No | The paper acknowledges 'Compute Canada for computational resources' but does not give specific hardware details such as GPU or CPU models, processor types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions JAX and refers to a JAX-based implementation of Soft Actor-Critic, but it does not provide specific version numbers for these or any other software dependencies required to replicate the experiment. |
| Experiment Setup | No | The paper states some general settings, such as a soft Bellman operator with temperature α = 0.01 (written out below the table) and K steps of inner optimization, but it does not give a complete list of experimental details such as learning rates, batch sizes, or exact optimizer settings required for full reproducibility. |
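
For reference, the soft Bellman operator with temperature α named in the table above, together with the implicit-function-theorem gradient at the core of OMD, can be written as follows. This is a standard rendering under assumed notation ($\hat{r}_\theta$ and $\hat{p}_\theta$ for the learned reward and transition model, $L_{\text{in}}$ for the inner Bellman-error objective), not a transcription of the paper's exact equations.

```latex
% Soft Bellman operator under the learned model (temperature alpha = 0.01):
\mathcal{T}_\theta Q(s, a) = \hat{r}_\theta(s, a)
    + \gamma \, \mathbb{E}_{s' \sim \hat{p}_\theta(\cdot \mid s, a)}
      \left[ \alpha \log \sum_{a'} \exp\!\big( Q(s', a') / \alpha \big) \right]

% OMD differentiates through the fixed point phi*(theta) of the inner
% objective L_in via the implicit function theorem:
\frac{\partial \phi^*}{\partial \theta}
    = - \left( \frac{\partial^2 L_{\text{in}}}{\partial \phi \, \partial \phi^\top} \right)^{-1}
        \frac{\partial^2 L_{\text{in}}}{\partial \phi \, \partial \theta^\top}
```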
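
Below is a minimal, self-contained JAX sketch of an OMD-style update on a toy tabular MDP, mirroring the structure of Algorithm 1: an inner Q-optimization under the learned model, an outer TD loss on real transitions, and a model gradient obtained through the implicit function theorem. Everything here is an illustrative assumption rather than the authors' code: the function names (`inner_loss`, `outer_loss`, `implicit_model_grad`), the tabular parameterization, `GAMMA = 0.99`, the inner step size, and the number of inner steps; only the temperature α = 0.01 comes from the paper.

```python
# Illustrative OMD-style update (hypothetical sketch, not the authors' code).
import jax
import jax.numpy as jnp

ALPHA = 0.01      # soft Bellman temperature, as reported in the paper
GAMMA = 0.99      # discount factor (assumed for this toy example)
S, A = 4, 2       # tiny tabular MDP sizes, purely for illustration

def soft_v(q):
    # Soft state value: alpha * log sum_a exp(Q(s, a) / alpha).
    return ALPHA * jax.scipy.special.logsumexp(q / ALPHA, axis=-1)

def inner_loss(phi, theta):
    # Squared soft Bellman residual under the *learned* model theta.
    # phi: flattened Q-table (S*A,); theta = (r_hat (S, A), p_logits (S, A, S)).
    q = phi.reshape(S, A)
    r_hat, p_logits = theta
    p_hat = jax.nn.softmax(p_logits, axis=-1)     # learned P(s' | s, a)
    target = r_hat + GAMMA * p_hat @ soft_v(q)    # expected soft backup
    return jnp.sum((q - target) ** 2)

def outer_loss(phi, batch):
    # TD error on *real* transitions: the control-oriented outer objective.
    s, a, r, s_next = batch
    q = phi.reshape(S, A)
    td_target = r + GAMMA * soft_v(q[s_next])
    return jnp.mean((q[s, a] - td_target) ** 2)

def implicit_model_grad(phi_star, theta, batch):
    # d outer / d theta through the fixed point phi*(theta), via the implicit
    # function theorem: d phi*/d theta = -H^{-1} J, with H the inner Hessian
    # in phi and J the mixed phi-theta derivative. We contract with the outer
    # gradient instead of materializing d phi*/d theta.
    g_outer = jax.grad(outer_loss)(phi_star, batch)
    hess = jax.hessian(inner_loss)(phi_star, theta)
    hess = hess + 1e-6 * jnp.eye(hess.shape[0])   # small ridge for stability
    v = jnp.linalg.solve(hess, g_outer)           # v = H^{-1} g_outer (H symmetric)
    # -J^T v, computed as the theta-gradient of -v . grad_phi L_in.
    return jax.grad(lambda th: -v @ jax.grad(inner_loss)(phi_star, th))(theta)

# Usage: approximate phi*(theta) with K inner gradient steps, then take the
# implicit gradient of the outer TD loss with respect to the model.
theta = (jnp.zeros((S, A)), jnp.zeros((S, A, S)))
phi = jnp.zeros(S * A)
for _ in range(500):                              # K inner steps (K assumed)
    phi = phi - 0.1 * jax.grad(inner_loss)(phi, theta)
batch = (jnp.array([0, 1]), jnp.array([0, 1]),
         jnp.array([1.0, 0.0]), jnp.array([1, 2]))
g_theta = implicit_model_grad(phi, theta, batch)  # pytree matching theta
```

One design note: the sketch solves the linear system `H v = g_outer` rather than inverting the Hessian, which is the usual way to apply the implicit function theorem at small scale; the paper's function-approximation experiments necessarily rely on approximations of this inverse rather than an exact solve.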