Constrained Policy Optimization via Bayesian World Models

Authors: Yarden As, Ilnura Usmanova, Sebastian Curi, Andreas Krause

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation. We conduct our experiments with the SG6 benchmark as described by Ray et al. (2019), aiming to answer the following questions: How does our model-based approach compare to model-free variants in terms of performance, sample efficiency and constraint violation?"
Researcher Affiliation | Academia | Yarden As, Ilnura Usmanova, Sebastian Curi, and Andreas Krause (all ETH Zurich)
Pseudocode | Yes | Algorithm 1: upper confidence bounds estimation via posterior sampling. Algorithm 2: LAMBDA. Algorithm 3: sampling from the predictive density $p(s_{\tau:\tau+H} \mid s_{\tau-1}, a_{\tau-1:\tau+H-1}, \theta)$. (A minimal sketch of the posterior-sampling pattern behind Algorithms 1 and 3 appears as the first code example after this table.)
Open Source Code | Yes | "We provide open-source code for our experiments, including videos of the trained agents, at https://github.com/yardenas/la-mbda."
Open Datasets | Yes | "We conduct our experiments with the SG6 benchmark as described by Ray et al. (2019)."
Dataset Splits | No | The paper mentions "evaluation episodes" used for computing performance metrics, but it does not specify a separate validation split for hyperparameter tuning or model selection during training.
Hardware Specification | Yes | "The main bottleneck is the gradient step computation, which takes roughly 0.5 seconds on a single Nvidia GeForce RTX 2080 Ti GPU."
Software Dependencies | No | The paper mentions software components such as the Adam optimizer and the ELU and ReLU activation functions, but it does not provide version numbers for these components or for the underlying libraries (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | "Table 1: Hyperparameters for LAMBDA. For other safety tasks, we recommend first tuning the initial Lagrangian, penalty and penalty power factor at different scales, and then fine-tuning the safety discount factor to improve constraint satisfaction. We emphasize that it is possible to improve the performance of each task separately by fine-tuning the hyperparameters on a per-task basis." (A sketch of how these Lagrangian hyperparameters typically interact appears as the second code example after this table.)
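
The Pseudocode row above names posterior sampling as the mechanism behind the paper's confidence bounds. The following is a minimal, self-contained Python sketch of that pattern, using a toy linear-Gaussian model in place of LAMBDA's recurrent latent state-space model; the function names (`posterior_sample`, `sample_trajectory`, `confidence_bounds`) are hypothetical illustrations, not the repository's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_sample(state_dim, action_dim):
    # Hypothetical posterior over model parameters theta. LAMBDA's
    # posterior is over a recurrent state-space model; this toy
    # linear-Gaussian stand-in only mirrors the sampling pattern.
    return {
        "A": np.eye(state_dim) + 0.01 * rng.standard_normal((state_dim, state_dim)),
        "B": 0.1 * rng.standard_normal((state_dim, action_dim)),
        "sigma": 0.01,
    }

def sample_trajectory(theta, s_prev, actions):
    # Algorithm 3 pattern: sample s_{tau:tau+H} from the predictive
    # density p(s_{tau:tau+H} | s_{tau-1}, a_{tau-1:tau+H-1}, theta)
    # by chaining one-step predictive samples under a fixed theta.
    states, s = [], s_prev
    for a in actions:
        mean = theta["A"] @ s + theta["B"] @ a
        s = mean + theta["sigma"] * rng.standard_normal(mean.shape)
        states.append(s)
    return np.stack(states)

def confidence_bounds(s_prev, actions, value_fn, n_models=10):
    # Algorithm 1 pattern: Monte Carlo upper/lower bounds on a value
    # by evaluating rollouts under several posterior model samples.
    # LAMBDA pairs the optimistic bound with the reward value and the
    # pessimistic bound with the safety-cost value.
    values = [
        value_fn(sample_trajectory(
            posterior_sample(s_prev.shape[0], actions.shape[1]),
            s_prev, actions))
        for _ in range(n_models)
    ]
    return max(values), min(values)

# Usage: bound a toy return of a fixed 15-step action sequence.
s0 = np.zeros(4)
acts = rng.standard_normal((15, 2))
upper, lower = confidence_bounds(s0, acts, value_fn=lambda traj: -np.abs(traj).sum())
print(f"optimistic estimate: {upper:.3f}, pessimistic estimate: {lower:.3f}")
```

The design choice mirrored here is that a single model sample theta is held fixed along each rollout, so the spread across rollouts reflects epistemic (model) uncertainty rather than per-step transition noise.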
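The Experiment Setup row names an initial Lagrangian, a penalty, and a penalty power factor. Assuming these parameterize a standard augmented-Lagrangian scheme, which the hyperparameter names suggest, a sketch of one update step is shown below; `augmented_lagrangian_step` and all numeric values are illustrative placeholders, not values from Table 1.

```python
def augmented_lagrangian_step(lagrangian, penalty, constraint_gap, power_factor):
    """One dual/penalty update, sketched under the assumption that the
    'initial Lagrangian', 'penalty', and 'penalty power factor' of
    Table 1 parameterize a standard augmented-Lagrangian scheme.

    constraint_gap: estimated episodic cost minus the allowed cost
    budget (positive when the constraint is violated).
    """
    # Dual ascent on the multiplier, clipped at zero to keep it valid.
    lagrangian = max(lagrangian + penalty * constraint_gap, 0.0)
    # Grow the penalty so that later violations are punished more
    # strongly; power_factor > 1 is a hypothetical choice.
    penalty = penalty * power_factor
    return lagrangian, penalty

# Usage with placeholder values (the real ones come from Table 1):
lam, mu = 0.01, 1e-4
for cost_gap in (2.0, 1.0, -0.5):  # toy sequence of constraint gaps
    lam, mu = augmented_lagrangian_step(lam, mu, cost_gap, power_factor=1.1)
    print(f"lambda={lam:.5f}, penalty={mu:.6f}")
```

This matches the paper's tuning advice in spirit: the initial multiplier and penalty set the scale of the safety pressure early in training, while the power factor controls how quickly that pressure grows if violations persist.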