On-Policy Model Errors in Reinforcement Learning
Authors: Lukas Froehlich, Maksym Lefarov, Melanie Zeilinger, Felix Berkenkamp
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on MuJoCo and PyBullet benchmarks show that our method can drastically improve existing model-based approaches without introducing additional tuning parameters. (Section 4, Experimental Results) |
| Researcher Affiliation | Collaboration | Lukas P. Fröhlich, Institute for Dynamic Systems and Control, ETH Zürich, Zurich, Switzerland (lukasfro@ethz.ch); Maksym Lefarov, Bosch Center for Artificial Intelligence, Renningen, Germany (Maksym.Lefarov@de.bosch.com); Melanie N. Zeilinger, Institute for Dynamic Systems and Control, ETH Zürich, Zurich, Switzerland (mzeilinger@ethz.ch); Felix Berkenkamp, Bosch Center for Artificial Intelligence, Renningen, Germany (Felix.Berkenkamp@de.bosch.com) |
| Pseudocode | Yes | Algorithm 2: Branched rollout scheme with OPC model (differences to MBPO highlighted in blue); see the hedged sketch after this table. |
| Open Source Code | No | Our implementation is based on the code from MBPO (Janner et al., 2019), which is open-sourced under the MIT license. The paper states that *their implementation is based on* open-source code from another paper, but does not explicitly state that the code for *their method* (OPC) is open-sourced or provide a link to it. |
| Open Datasets | Yes | on various continuous control tasks from the MuJoCo control suite and their PyBullet variants. MuJoCo control suite (Todorov et al., 2012) and their respective PyBullet variants (Ellenberger, 2018–2019). |
| Dataset Splits | No | The paper mentions 'training data' and 'evaluation return' but does not specify explicit train/validation/test dataset splits. RL methods are typically evaluated directly in the environment rather than on a held-out validation set from a static dataset split. |
| Hardware Specification | Yes | All experiments were run on an HPC cluster, where each individual experiment used one Nvidia V100 GPU and four Intel Xeon CPUs. |
| Software Dependencies | No | The paper mentions using the Soft Actor-Critic (SAC) algorithm and that its implementation is based on MBPO's code, but it does not provide specific version numbers for any software libraries, frameworks, or dependencies used in the experiments. |
| Experiment Setup | Yes | The rollout horizon to generate training data is set to H = 10 for all experiments. Table 1: Hyperparameter settings for OPC (blue) and MBPO( ) (red) for results shown in Fig. 3. Note that the respective hyperparameters for each environment are shared across the different implementations, i.e., MuJoCo and PyBullet. |
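
For concreteness, the snippet below is a minimal Python sketch of the kind of branched rollout scheme referenced in the Pseudocode row: MBPO-style rollouts that branch from real on-policy trajectory segments, with the model prediction shifted by the observed one-step model error of the corresponding real transition (an assumed form of the on-policy correction, not the authors' exact Algorithm 2). All names (`branched_opc_rollouts`, `model_predict`, `policy_act`, etc.) are illustrative, and the default `horizon=10` mirrors the rollout horizon H = 10 reported in the Experiment Setup row.

```python
import numpy as np


def branched_opc_rollouts(model_predict, model_reward, policy_act,
                          real_s, real_a, real_s_next, horizon=10):
    """Generate model-based rollouts branched from real on-policy trajectory
    segments, correcting each model prediction with the observed model error.

    real_s, real_a, real_s_next: arrays of shape (num_branches, horizon, dim)
    holding length-`horizon` segments of trajectories collected by the policy.
    """
    synthetic = []              # model-generated transitions for the agent
    s_hat = real_s[:, 0]        # branch from the real start states

    for t in range(horizon):
        a_hat = policy_act(s_hat)

        # One-step model error observed on the real on-policy transition at
        # this step of the segment.
        err = real_s_next[:, t] - model_predict(real_s[:, t], real_a[:, t])

        # Plain MBPO would use the learned model directly:
        #   s_hat_next = model_predict(s_hat, a_hat)
        # Corrected prediction (assumed form): shift by the on-policy error.
        s_hat_next = model_predict(s_hat, a_hat) + err

        synthetic.append((s_hat, a_hat, model_reward(s_hat, a_hat), s_hat_next))
        s_hat = s_hat_next

    # As in MBPO, these transitions would be appended to the policy's replay
    # buffer (e.g., for SAC) alongside the real environment data.
    return synthetic


if __name__ == "__main__":
    # Smoke test with a toy linear "model" and a random policy (illustrative only).
    rng = np.random.default_rng(0)
    B, H, D, A = 4, 10, 3, 2
    s = rng.normal(size=(B, H, D))
    a = rng.normal(size=(B, H, A))
    s_next = rng.normal(size=(B, H, D))
    model = lambda st, at: 0.9 * st                     # toy dynamics
    reward = lambda st, at: -np.sum(st ** 2, axis=-1)   # toy reward
    policy = lambda st: rng.normal(size=(st.shape[0], A))
    rollouts = branched_opc_rollouts(model, reward, policy, s, a, s_next, horizon=H)
    print(len(rollouts), rollouts[0][0].shape)          # 10 (4, 3)
```

The sketch only illustrates the data-generation step; model training and the SAC policy update are unchanged relative to MBPO and are omitted here.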