On-Policy Model Errors in Reinforcement Learning
Authors: Lukas Froehlich, Maksym Lefarov, Melanie Zeilinger, Felix Berkenkamp
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on MuJoCo and PyBullet benchmarks show that our method can drastically improve existing model-based approaches without introducing additional tuning parameters. (Section 4, Experimental Results) |
| Researcher Affiliation | Collaboration | Lukas P. Fröhlich, Institute for Dynamic Systems and Control, ETH Zürich, Zurich, Switzerland (lukasfro@ethz.ch); Maksym Lefarov, Bosch Center for Artificial Intelligence, Renningen, Germany (Maksym.Lefarov@de.bosch.com); Melanie N. Zeilinger, Institute for Dynamic Systems and Control, ETH Zürich, Zurich, Switzerland (mzeilinger@ethz.ch); Felix Berkenkamp, Bosch Center for Artificial Intelligence, Renningen, Germany (Felix.Berkenkamp@de.bosch.com) |
| Pseudocode | Yes | Algorithm 2: Branched rollout scheme with OPC model (differences to MBPO highlighted in blue); see the hedged sketch after this table. |
| Open Source Code | No | Our implementation is based on the code from MBPO (Janner et al., 2019), which is open-sourced under the MIT license. The paper states that *their implementation is based on* open-source code from another paper, but does not explicitly state that the code for *their method* (OPC) is open-sourced or provide a link to it. |
| Open Datasets | Yes | on various continuous control tasks from the MuJoCo control suite and their PyBullet variants. MuJoCo control suite (Todorov et al., 2012) and their respective PyBullet variants (Ellenberger, 2018–2019). |
| Dataset Splits | No | The paper mentions 'training data' and 'evaluation return' but does not specify explicit train/validation/test dataset splits. RL methods are typically evaluated directly in the environment rather than on a held-out validation set from a static dataset split. |
| Hardware Specification | Yes | All experiments were run on an HPC cluster, where each individual experiment used one Nvidia V100 GPU and four Intel Xeon CPUs. |
| Software Dependencies | No | The paper mentions using the Soft Actor-Critic (SAC) algorithm and that its implementation is based on MBPO's code, but it does not provide specific version numbers for any software libraries, frameworks, or dependencies used in the experiments. |
| Experiment Setup | Yes | The rollout horizon to generate training data is set to H = 10 for all experiments. Table 1: Hyperparameter settings for OPC (blue) and MBPO( ) (red) for results shown in Fig. 3. Note that the respective hyperparameters for each environment are shared across the different implementations, i.e., MuJoCo and PyBullet. |
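
For concreteness, the snippet below is a minimal Python sketch of the kind of branched rollout scheme referenced in the Pseudocode row: MBPO-style rollouts that branch from real on-policy trajectory segments, with the model prediction shifted by the observed one-step model error of the corresponding real transition (an assumed form of the on-policy correction, not the authors' exact Algorithm 2). All names (`branched_opc_rollouts`, `model_predict`, `policy_act`, etc.) are illustrative, and the default `horizon=10` mirrors the rollout horizon H = 10 reported in the Experiment Setup row.

```python
import numpy as np


def branched_opc_rollouts(model_predict, model_reward, policy_act,
                          real_s, real_a, real_s_next, horizon=10):
    """Generate model-based rollouts branched from real on-policy trajectory
    segments, correcting each model prediction with the observed model error.

    real_s, real_a, real_s_next: arrays of shape (num_branches, horizon, dim)
    holding length-`horizon` segments of trajectories collected by the policy.
    """
    synthetic = []              # model-generated transitions for the agent
    s_hat = real_s[:, 0]        # branch from the real start states

    for t in range(horizon):
        a_hat = policy_act(s_hat)

        # One-step model error observed on the real on-policy transition at
        # this step of the segment.
        err = real_s_next[:, t] - model_predict(real_s[:, t], real_a[:, t])

        # Plain MBPO would use the learned model directly:
        #   s_hat_next = model_predict(s_hat, a_hat)
        # Corrected prediction (assumed form): shift by the on-policy error.
        s_hat_next = model_predict(s_hat, a_hat) + err

        synthetic.append((s_hat, a_hat, model_reward(s_hat, a_hat), s_hat_next))
        s_hat = s_hat_next

    # As in MBPO, these transitions would be appended to the policy's replay
    # buffer (e.g., for SAC) alongside the real environment data.
    return synthetic


if __name__ == "__main__":
    # Smoke test with a toy linear "model" and a random policy (illustrative only).
    rng = np.random.default_rng(0)
    B, H, D, A = 4, 10, 3, 2
    s = rng.normal(size=(B, H, D))
    a = rng.normal(size=(B, H, A))
    s_next = rng.normal(size=(B, H, D))
    model = lambda st, at: 0.9 * st                     # toy dynamics
    reward = lambda st, at: -np.sum(st ** 2, axis=-1)   # toy reward
    policy = lambda st: rng.normal(size=(st.shape[0], A))
    rollouts = branched_opc_rollouts(model, reward, policy, s, a, s_next, horizon=H)
    print(len(rollouts), rollouts[0][0].shape)          # 10 (4, 3)
```

The sketch only illustrates the data-generation step; model training and the SAC policy update are unchanged relative to MBPO and are omitted here.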