Constrained Policy Optimization via Bayesian World Models
Authors: Yarden As, Ilnura Usmanova, Sebastian Curi, Andreas Krause
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate LAMBDA's state-of-the-art performance on the Safety-Gym benchmark suite in terms of sample efficiency and constraint violation. We conduct our experiments with the SG6 benchmark as described by Ray et al. (2019), aiming to answer the following questions: How does our model-based approach compare to model-free variants in terms of performance, sample efficiency and constraint violation? |
| Researcher Affiliation | Academia | Yarden As (ETH Zurich), Ilnura Usmanova (ETH Zurich), Sebastian Curi (ETH Zurich), Andreas Krause (ETH Zurich) |
| Pseudocode | Yes | Algorithm 1: Upper confidence bounds estimation via posterior sampling. Algorithm 2: LAMBDA. Algorithm 3: Sampling from the predictive density p_θ(s_{τ:τ+H} \| s_{τ−1}, a_{τ−1:τ+H−1}, θ). (A hedged sketch of the posterior-sampling bound estimation appears after the table.) |
| Open Source Code | Yes | We provide an open-source code for our experiments, including videos of the trained agents at https://github.com/yardenas/la-mbda. |
| Open Datasets | Yes | We conduct our experiments with the SG6 benchmark as described by Ray et al. (2019) |
| Dataset Splits | No | The paper mentions 'evaluation episodes' used for computing performance metrics but does not specify a separate 'validation' dataset split for hyperparameter tuning or model selection during training. |
| Hardware Specification | Yes | The main bottleneck is the gradient step computation, which takes roughly 0.5 seconds on a single Nvidia GeForce RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer and the ELU and ReLU activation functions, but does not provide version numbers for these components or for the underlying libraries (e.g., TensorFlow or PyTorch). |
| Experiment Setup | Yes | Table 1: Hyperparameters for LAMBDA. For other safety tasks, we recommend first tuning the initial Lagrangian, penalty and penalty power factor at different scales, and then fine-tuning the safety discount factor to improve constraint satisfaction. We emphasize that it is possible to improve the performance of each task separately by fine-tuning the hyperparameters on a per-task basis. (An illustrative penalty-update sketch follows the table.) |
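
To make the pseudocode row concrete, here is a minimal sketch of the idea behind Algorithm 1 (upper confidence bounds via posterior sampling): sample several models from a posterior, roll the policy out under each, and take an optimistic bound on the task value and a pessimistic (upper) bound on the cost value. All names and the toy linear dynamics below are illustrative assumptions, not taken from the LAMBDA code base.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior_models(n_samples, dim):
    # Each draw stands in for theta_i ~ p(theta | D); a real agent would
    # sample world-model parameters from the learned Bayesian posterior.
    return [np.eye(dim) + rng.normal(0.0, 0.1, size=(dim, dim))
            for _ in range(n_samples)]

def rollout_returns(A, policy, s0, horizon, gamma=0.99):
    """Discounted task return and cost return of `policy` under one
    sampled dynamics matrix `A` (toy reward and cost for illustration)."""
    s, task_ret, cost_ret = s0.copy(), 0.0, 0.0
    for t in range(horizon):
        s = A @ s + policy(s)
        task_ret += gamma**t * -float(s @ s)                 # toy reward
        cost_ret += gamma**t * float(np.abs(s).max() > 1.0)  # toy cost
    return task_ret, cost_ret

models = sample_posterior_models(n_samples=10, dim=2)
policy = lambda s: -0.5 * s                 # toy proportional controller
returns = [rollout_returns(A, policy, np.ones(2), horizon=15) for A in models]
task_ucb = max(r for r, _ in returns)   # optimistic bound on the task value
cost_ucb = max(c for _, c in returns)   # pessimistic (upper) bound on cost
print(f"task UCB: {task_ucb:.3f}, cost UCB: {cost_ucb:.3f}")
```

Taking the maximum over posterior samples on both quantities encodes optimism for the reward objective and caution for the safety constraint, which is the bound structure the algorithm's name refers to.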
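
The experiment-setup row names three coupled hyperparameters: the initial Lagrangian, the penalty, and the penalty power factor. The sketch below shows how they interact in a standard augmented-Lagrangian update for a constraint J_c ≤ d; the exact update rule used in the paper may differ, and the numeric values are purely illustrative.

```python
def penalized_objective(cost_value, budget, lagrangian, penalty):
    """Augmented-Lagrangian penalty term for the constraint J_c <= d.
    Standard textbook recipe; a hedged stand-in for the paper's rule."""
    g = cost_value - budget                      # constraint violation
    if lagrangian + penalty * g >= 0.0:
        # Constraint active: linear multiplier term plus quadratic penalty.
        loss = lagrangian * g + 0.5 * penalty * g ** 2
        new_lagrangian = lagrangian + penalty * g
    else:
        # Constraint comfortably satisfied: keep the penalty term smooth.
        loss = -lagrangian ** 2 / (2.0 * penalty)
        new_lagrangian = 0.0
    return loss, new_lagrangian

# One illustrative training step: compute the penalty, update the
# multiplier, then grow the penalty coefficient by the power factor.
lagrangian, penalty, power_factor = 1e-2, 1e-4, 1.0 + 1e-5
loss, lagrangian = penalized_objective(
    cost_value=30.0, budget=25.0, lagrangian=lagrangian, penalty=penalty)
penalty *= power_factor
print(f"penalty loss: {loss:.4f}, new multiplier: {lagrangian:.4f}")
```

Under this scheme, the initial Lagrangian sets how strongly violations are penalized at the start of training, the penalty controls the quadratic term and the multiplier's step size, and the power factor determines how quickly the penalty grows, which matches the tuning order the authors recommend.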