Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding
Authors: Alizée Pace, Hugo Yèche, Bernhard Schölkopf, Gunnar Rätsch, Guy Tennenholtz
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. |
| Researcher Affiliation | Collaboration | 1 ETH AI Center 2 Department of Computer Science, ETH Zürich 3 Max Planck Institute for Intelligent Systems, Tübingen 4 Google Research |
| Pseudocode | Yes | Algorithm 1: Delphic Offline Reinforcement Learning |
| Open Source Code | Yes | We include source code as supplementary material. |
| Open Datasets | Yes | Our real-world data experiment is based on the publicly available HiRID dataset (Hyland et al., 2020). |
| Dataset Splits | Yes | Training is carried out for 50 epochs or until loss on the validation subset (10% of training data) increases for more than 5 consecutive epochs. |
| Hardware Specification | Yes | Training is carried out on NVIDIA RTX2080Ti GPUs on our local cluster, using the Adam optimiser with default learning rate and a batch size of 32. |
| Software Dependencies | No | All reinforcement learning algorithms and baselines are implemented based on the open access d3rlpy library (Seno and Imai, 2022). |
| Experiment Setup | Yes | Between world models w, the confounder space dimensionality is randomly varied over |Z| = {1, 2, 4, 8, 16}, and the prior p(z) = N(z; 0, Σ²) is randomly varied through the variance for each z-dimension, Σ²_ii = {1.0, 0.1, 0.01}. The discount factor used is γ = 0.99, and states and actions are normalised to mean 0 and variance 1 (Fujimoto and Gu, 2021) for all algorithms. Training is carried out... with a batch size of 32. Discrete CQL (Kumar et al., 2019) is implemented with a penalty hyperparameter α of 1.0 (sepsis environment) and 0.5 (ICU dataset), tuned over the following values: {0.1, 0.5, 1.0, 2.0, 5.0}. |
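The stopping rule quoted in the Dataset Splits row (train for at most 50 epochs, halting once the validation loss on a held-out 10% split has increased for more than 5 consecutive epochs) can be sketched as below. This is a minimal illustration, not the authors' code; the function name and the `epoch_losses` callable standing in for a real train-then-validate step are our own assumptions.

```python
def train_with_early_stopping(epoch_losses, max_epochs=50, patience=5):
    """Return the number of epochs actually run.

    `epoch_losses` is a hypothetical callable mapping an epoch index to
    that epoch's validation loss, standing in for one real epoch of
    training followed by evaluation on the 10% validation subset.
    """
    prev_loss = float("inf")
    consecutive_increases = 0
    for epoch in range(max_epochs):
        val_loss = epoch_losses(epoch)
        if val_loss > prev_loss:
            consecutive_increases += 1
            # Stop once the loss has risen for MORE than `patience`
            # consecutive epochs, matching the quoted criterion.
            if consecutive_increases > patience:
                return epoch + 1
        else:
            consecutive_increases = 0
        prev_loss = val_loss
    return max_epochs
```

For example, a loss curve that decreases for three epochs and then rises monotonically triggers the stop on the sixth consecutive increase, while a monotonically decreasing curve runs the full 50 epochs.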