Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs

Authors: Harsh Satija, Philip S. Thomas, Joelle Pineau, Romain Laroche

NeurIPS 2021

Reproducibility Variable Result LLM Response
Research Type Experimental We extensively test our approach on a synthetic safety-gridworld task in Section 4 and show that the proposed algorithm achieves better data efficiency than the existing approaches. Finally, we show its benefits on a critical-care task in Section 5.
Researcher Affiliation Collaboration Harsh Satija, McGill University, Mila, harsh.satija@mail.mcgill.ca; Philip S. Thomas, University of Massachusetts, pthomas@cs.umass.edu; Joelle Pineau, McGill University, Mila, Facebook AI Research, jpineau@cs.mcgill.ca; Romain Laroche, Microsoft Research, romain.laroche@microsoft.com
Pseudocode No The paper describes the proposed algorithms using mathematical formulations and textual descriptions, but it does not include any explicitly labeled pseudocode blocks or algorithm listings.
Open Source Code Yes The accompanying codebase is available at https://github.com/hercky/mo-spibb-codebase.
Open Datasets Yes We use the publicly available ICU dataset MIMIC-III (Johnson et al., 2016), with the setup described by Komorowski et al. (2018); Tang et al. (2020) and build on top of their data pre-processing and MDP construction methodology. This leaves us with a cohort of 20,954 unique patients.
Dataset Splits Yes We run our methods for 10 runs with different random seeds, where for each run the cohort dataset was split into train/valid/test sets in the ratios of 0.7/0.1/0.2.
Hardware Specification Yes The full pipeline including data processing and training took roughly 2 days on a single GPU (NVIDIA 1080 Ti).
Software Dependencies No The paper mentions using "standard solvers, such as cvxpy" but does not specify version numbers for cvxpy or any other software dependencies, making it difficult to reproduce the exact software environment.
Experiment Setup Yes We test on different combinations of user preference (λ) and baseline's quality (ρ) on 100 randomly generated CMDPs, where λ_i ∈ {0, 1}, ρ ∈ {0.1, 0.4, 0.7, 0.9} and |D| ∈ {10, 50, 500, 2000}. We evaluate under two settings: (i) we use a fixed set of parameters across different (λ, ρ) combinations, where we run S-OPT with ϵ ∈ {0.01, 0.1, 1.0} and H-OPT with Doubly Robust IS estimator (Jiang and Li, 2015) and Student's t-test concentration inequality
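
The Dataset Splits row above describes 10 runs with different random seeds, each splitting the cohort into train/valid/test sets in the ratios 0.7/0.1/0.2. A minimal sketch of such a protocol follows. It is an illustration only, not the authors' code: the function name `split_cohort`, the seed values 0 to 9, the use of integer placeholders for patient IDs, and the choice to split at the patient level are all assumptions.

```python
import random


def split_cohort(patient_ids, seed, ratios=(0.7, 0.1, 0.2)):
    """Shuffle IDs with a seeded RNG and cut them into train/valid/test.

    Hypothetical helper: the name, signature, and patient-level split
    are assumptions, not taken from the paper or its codebase.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded, so each run is repeatable

    n_train = int(ratios[0] * len(ids))
    n_valid = int(ratios[1] * len(ids))

    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]  # remainder, so no ID is dropped
    return train, valid, test


# Placeholder IDs for a cohort of 20,954 patients (size from the source).
cohort = range(20954)

# One split per run; seeds 0..9 are an assumed choice of "different seeds".
splits = [split_cohort(cohort, seed) for seed in range(10)]
```

Taking the test set as the remainder keeps the three sets disjoint and exhaustive even when the ratios do not divide the cohort size evenly.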