Constrained Policy Optimization via Bayesian World Models

Authors: Yarden As, Ilnura Usmanova, Sebastian Curi, Andreas Krause

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation. We conduct our experiments with the SG6 benchmark as described by Ray et al. (2019), aiming to answer the following questions: How does our model-based approach compare to model-free variants in terms of performance, sample efficiency and constraint violation?"
Researcher Affiliation | Academia | Yarden As, Ilnura Usmanova, Sebastian Curi, and Andreas Krause (all ETH Zurich)
Pseudocode | Yes | Algorithm 1: upper confidence bounds estimation via posterior sampling. Algorithm 2: LAMBDA. Algorithm 3: sampling from the predictive density $p(s_{\tau:\tau+H} \mid s_{\tau-1}, a_{\tau-1:\tau+H-1}, \theta)$. (A minimal sketch of the posterior-sampling pattern behind Algorithms 1 and 3 appears as the first code example after this table.)
Open Source Code | Yes | "We provide open-source code for our experiments, including videos of the trained agents, at https://github.com/yardenas/la-mbda."
Open Datasets | Yes | "We conduct our experiments with the SG6 benchmark as described by Ray et al. (2019)."
Dataset Splits | No | The paper mentions "evaluation episodes" used for computing performance metrics, but it does not specify a separate validation split for hyperparameter tuning or model selection during training.
Hardware Specification | Yes | "The main bottleneck is the gradient step computation, which takes roughly 0.5 seconds on a single Nvidia GeForce RTX 2080 Ti GPU."
Software Dependencies | No | The paper mentions software components such as the Adam optimizer and the ELU and ReLU activation functions, but it does not provide version numbers for these components or for the underlying libraries (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | "Table 1: Hyperparameters for LAMBDA. For other safety tasks, we recommend first tuning the initial Lagrangian, penalty and penalty power factor at different scales, and then fine-tuning the safety discount factor to improve constraint satisfaction. We emphasize that it is possible to improve the performance of each task separately by fine-tuning the hyperparameters on a per-task basis." (A sketch of how these Lagrangian hyperparameters typically interact appears as the second code example after this table.)
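
The Pseudocode row above names posterior sampling as the mechanism behind the paper's confidence bounds. The following is a minimal, self-contained Python sketch of that pattern, using a toy linear-Gaussian model in place of LAMBDA's recurrent latent state-space model; the function names (`posterior_sample`, `sample_trajectory`, `confidence_bounds`) are hypothetical illustrations, not the repository's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_sample(state_dim, action_dim):
    # Hypothetical posterior over model parameters theta. LAMBDA's
    # posterior is over a recurrent state-space model; this toy
    # linear-Gaussian stand-in only mirrors the sampling pattern.
    return {
        "A": np.eye(state_dim) + 0.01 * rng.standard_normal((state_dim, state_dim)),
        "B": 0.1 * rng.standard_normal((state_dim, action_dim)),
        "sigma": 0.01,
    }

def sample_trajectory(theta, s_prev, actions):
    # Algorithm 3 pattern: sample s_{tau:tau+H} from the predictive
    # density p(s_{tau:tau+H} | s_{tau-1}, a_{tau-1:tau+H-1}, theta)
    # by chaining one-step predictive samples under a fixed theta.
    states, s = [], s_prev
    for a in actions:
        mean = theta["A"] @ s + theta["B"] @ a
        s = mean + theta["sigma"] * rng.standard_normal(mean.shape)
        states.append(s)
    return np.stack(states)

def confidence_bounds(s_prev, actions, value_fn, n_models=10):
    # Algorithm 1 pattern: Monte Carlo upper/lower bounds on a value
    # by evaluating rollouts under several posterior model samples.
    # LAMBDA pairs the optimistic bound with the reward value and the
    # pessimistic bound with the safety-cost value.
    values = [
        value_fn(sample_trajectory(
            posterior_sample(s_prev.shape[0], actions.shape[1]),
            s_prev, actions))
        for _ in range(n_models)
    ]
    return max(values), min(values)

# Usage: bound a toy return of a fixed 15-step action sequence.
s0 = np.zeros(4)
acts = rng.standard_normal((15, 2))
upper, lower = confidence_bounds(s0, acts, value_fn=lambda traj: -np.abs(traj).sum())
print(f"optimistic estimate: {upper:.3f}, pessimistic estimate: {lower:.3f}")
```

The design choice mirrored here is that a single model sample theta is held fixed along each rollout, so the spread across rollouts reflects epistemic (model) uncertainty rather than per-step transition noise.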
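The Experiment Setup row names an initial Lagrangian, a penalty, and a penalty power factor. Assuming these parameterize a standard augmented-Lagrangian scheme, which the hyperparameter names suggest, a sketch of one update step is shown below; `augmented_lagrangian_step` and all numeric values are illustrative placeholders, not values from Table 1.

```python
def augmented_lagrangian_step(lagrangian, penalty, constraint_gap, power_factor):
    """One dual/penalty update, sketched under the assumption that the
    'initial Lagrangian', 'penalty', and 'penalty power factor' of
    Table 1 parameterize a standard augmented-Lagrangian scheme.

    constraint_gap: estimated episodic cost minus the allowed cost
    budget (positive when the constraint is violated).
    """
    # Dual ascent on the multiplier, clipped at zero to keep it valid.
    lagrangian = max(lagrangian + penalty * constraint_gap, 0.0)
    # Grow the penalty so that later violations are punished more
    # strongly; power_factor > 1 is a hypothetical choice.
    penalty = penalty * power_factor
    return lagrangian, penalty

# Usage with placeholder values (the real ones come from Table 1):
lam, mu = 0.01, 1e-4
for cost_gap in (2.0, 1.0, -0.5):  # toy sequence of constraint gaps
    lam, mu = augmented_lagrangian_step(lam, mu, cost_gap, power_factor=1.1)
    print(f"lambda={lam:.5f}, penalty={mu:.6f}")
```

This matches the paper's tuning advice in spirit: the initial multiplier and penalty set the scale of the safety pressure early in training, while the power factor controls how quickly that pressure grows if violations persist.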