Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding

Authors: Alizée Pace, Hugo Yèche, Bernhard Schölkopf, Gunnar Rätsch, Guy Tennenholtz

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records."
Researcher Affiliation | Collaboration | "1 ETH AI Center; 2 Department of Computer Science, ETH Zürich; 3 Max Planck Institute for Intelligent Systems, Tübingen; 4 Google Research"
Pseudocode | Yes | "Algorithm 1: Delphic Offline Reinforcement Learning"
Open Source Code | Yes | "We include source code as supplementary material."
Open Datasets | Yes | "Our real-world data experiment is based on the publicly available HiRID dataset (Hyland et al., 2020)."
Dataset Splits | Yes | "Training is carried out for 50 epochs or until loss on the validation subset (10% of training data) increases for more than 5 consecutive epochs."
Hardware Specification | Yes | "Training is carried out on NVIDIA RTX2080Ti GPUs on our local cluster, using the Adam optimiser with default learning rate and a batch size of 32."
Software Dependencies | No | "All reinforcement learning algorithms and baselines are implemented based on the open access d3rlpy library (Seno and Imai, 2022)."
Experiment Setup | Yes | "Between world models w, the confounder space dimensionality is randomly varied over |Z| = {1, 2, 4, 8, 16}, and the prior p(z) = N(z; 0, Σ²) is randomly varied through the variance for each z-dimension, Σ²_ii ∈ {1.0, 0.1, 0.01}. The discount factor used is γ = 0.99, and state and actions are normalised to mean 0 and variance 1 (Fujimoto and Gu, 2021) for all algorithms. Training is carried out... with a batch size of 32. Discrete CQL (Kumar et al., 2019) is implemented with a penalty hyperparameter α of 1.0 (sepsis environment) and 0.5 (ICU dataset), tuned over the following values: {0.1, 0.5, 1.0, 2.0, 5.0}."
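The Dataset Splits and Hardware Specification rows describe a concrete training protocol: a 10% validation split, at most 50 epochs, early stopping after more than 5 consecutive epochs of rising validation loss, Adam with its default learning rate, and a batch size of 32. The sketch below illustrates that protocol; it is not the authors' code, and the model, data, and loss function are placeholders standing in for the paper's world models.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder data and model: stand-ins for the paper's world-model inputs/targets.
X, y = torch.randn(1000, 16), torch.randn(1000, 1)
dataset = TensorDataset(X, y)

# 90% / 10% train/validation split, as described in the Dataset Splits row.
n_val = int(0.1 * len(dataset))
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimiser = torch.optim.Adam(model.parameters())  # default learning rate (1e-3)
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(50):  # 50 epochs or early stopping
    model.train()
    for xb, yb in train_loader:
        optimiser.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimiser.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader) / len(val_loader)

    # Stop once validation loss has increased for more than 5 consecutive epochs.
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs > patience:
            break
```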
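The Software Dependencies and Experiment Setup rows report that baselines are built on d3rlpy, with Discrete CQL using γ = 0.99, batch size 32, observation standardisation, and a penalty α of 1.0 (sepsis) or 0.5 (ICU). The following is a minimal sketch of how such a configuration might look, assuming the d3rlpy 1.x API; the dataset arrays are synthetic placeholders, and parameter names should be checked against the installed d3rlpy version.

```python
import numpy as np
from d3rlpy.dataset import MDPDataset
from d3rlpy.algos import DiscreteCQL

# Placeholder offline dataset with discrete actions (stand-in for the sepsis benchmark).
observations = np.random.randn(1000, 16).astype(np.float32)
actions = np.random.randint(0, 4, size=1000)
rewards = np.random.randn(1000).astype(np.float32)
terminals = np.zeros(1000, dtype=np.float32)
terminals[99::100] = 1.0  # mark an episode boundary every 100 steps
dataset = MDPDataset(observations, actions, rewards, terminals)

# Hyperparameters from the Experiment Setup row: gamma = 0.99, batch size 32,
# CQL penalty alpha = 1.0 (sepsis) / 0.5 (ICU), standardised observations.
cql = DiscreteCQL(
    alpha=1.0,
    gamma=0.99,
    batch_size=32,
    scaler="standard",  # normalise observations to zero mean, unit variance
    use_gpu=False,      # set True to train on a GPU, as reported in the paper
)
cql.fit(dataset, n_epochs=50)
```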