Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs
Authors: Harsh Satija, Philip S. Thomas, Joelle Pineau, Romain Laroche
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively test our approach on a synthetic safety-gridworld task in Section 4 and show that the proposed algorithm achieves better data efficiency than the existing approaches. Finally, we show its benefits on a critical-care task in Section 5. |
| Researcher Affiliation | Collaboration | Harsh Satija, McGill University, Mila, harsh.satija@mail.mcgill.ca; Philip S. Thomas, University of Massachusetts, pthomas@cs.umass.edu; Joelle Pineau, McGill University, Mila, Facebook AI Research, jpineau@cs.mcgill.ca; Romain Laroche, Microsoft Research, romain.laroche@microsoft.com |
| Pseudocode | No | The paper describes the proposed algorithms using mathematical formulations and textual descriptions, but it does not include any explicitly labeled pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | The accompanying codebase is available at https://github.com/hercky/mo-spibb-codebase. |
| Open Datasets | Yes | We use the publicly available ICU dataset MIMIC-III (Johnson et al., 2016), with the setup described by Komorowski et al. (2018); Tang et al. (2020) and build on top of their data pre-processing and MDP construction methodology. This leaves us with a cohort of 20,954 unique patients. |
| Dataset Splits | Yes | We run our methods for 10 runs with different random seeds, where for each run the cohort dataset was split into train/valid/test sets in the ratios of 0.7/0.1/0.2. |
| Hardware Specification | Yes | The full pipeline including data processing and training took roughly 2 days on a single GPU (NVIDIA 1080 Ti). |
| Software Dependencies | No | The paper mentions using "standard solvers, such as cvxpy" but does not specify version numbers for cvxpy or any other software dependencies, making it difficult to reproduce the exact software environment. |
| Experiment Setup | Yes | We test on different combinations of user preference (λ) and baseline's quality (ρ) on 100 randomly generated CMDPs, where λ_i ∈ {0, 1}, ρ ∈ {0.1, 0.4, 0.7, 0.9} and \|D\| ∈ {10, 50, 500, 2000}. We evaluate under two settings: (i) we use a fixed set of parameters across different (λ, ρ) combinations, where we run S-OPT with ϵ ∈ {0.01, 0.1, 1.0} and H-OPT with Doubly Robust IS estimator (Jiang and Li, 2015) and Student's t-test concentration inequality |
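
The Dataset Splits row describes 10 runs with different random seeds, each splitting the cohort into train/valid/test sets in the ratios 0.7/0.1/0.2. A minimal sketch of such a seeded split is given below. It is not the authors' code (their pipeline is in the linked repository); the function name `split_cohort`, the use of seeds 0 to 9, and the choice to split at the patient level with integer IDs are assumptions made only for illustration.

```python
import random


def split_cohort(patient_ids, seed, ratios=(0.7, 0.1, 0.2)):
    """Shuffle patient IDs with a fixed seed and cut them into three parts.

    Hypothetical helper: the ratios follow the 0.7/0.1/0.2 split quoted in
    the table; everything else is an illustrative assumption.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seeded, so each run is reproducible

    n_train = round(ratios[0] * len(ids))
    n_valid = round(ratios[1] * len(ids))

    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]  # remainder, so no patient is dropped
    return train, valid, test


# One independent split per run; seeds 0..9 are an assumption.
cohort = range(20954)  # stand-in IDs for the 20,954-patient cohort
splits = {seed: split_cohort(cohort, seed) for seed in range(10)}
```

Assigning the remainder to the test set keeps the three parts disjoint and exhaustive even when the ratios do not divide the cohort size evenly.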