Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding
Authors: Alizée Pace, Hugo Yèche, Bernhard Schölkopf, Gunnar Rätsch, Guy Tennenholtz
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. |
| Researcher Affiliation | Collaboration | 1 ETH AI Center 2 Department of Computer Science, ETH Zürich 3 Max Planck Institute for Intelligent Systems, Tübingen 4 Google Research |
| Pseudocode | Yes | Algorithm 1: Delphic Offline Reinforcement Learning |
| Open Source Code | Yes | We include source code as supplementary material. |
| Open Datasets | Yes | Our real-world data experiment is based on the publicly available HiRID dataset (Hyland et al., 2020). |
| Dataset Splits | Yes | Training is carried out for 50 epochs or until loss on the validation subset (10% of training data) increases for more than 5 consecutive epochs. |
| Hardware Specification | Yes | Training is carried out on NVIDIA RTX2080Ti GPUs on our local cluster, using the Adam optimiser with default learning rate and a batch size of 32. |
| Software Dependencies | No | All reinforcement learning algorithms and baselines are implemented based on the open access d3rlpy library (Seno and Imai, 2022). |
| Experiment Setup | Yes | Between world models w, the confounder space dimensionality is randomly varied over $\lvert Z \rvert = \{1, 2, 4, 8, 16\}$, and the prior for $p(z) = \mathcal{N}(z; 0, \Sigma^2)$ is randomly varied through the variance for each z-dimension, $\Sigma^2_{ii} = \{1.0, 0.1, 0.01\}$. The discount factor used is γ = 0.99, and state and actions are normalised to mean 0 and variance 1 (Fujimoto and Gu, 2021) for all algorithms. Training is carried out... with a batch size of 32. Discrete CQL (Kumar et al., 2019) is implemented with a penalty hyperparameter α of 1.0 (sepsis environment) and 0.5 (ICU dataset), tuned over the following values: {0.1, 0.5, 1.0, 2.0, 5.0}. |
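
The stopping rule quoted in the Dataset Splits row (at most 50 epochs, a 10% validation subset, and a halt once validation loss has increased for more than 5 consecutive epochs) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `split_train_val` and `stopping_epoch` are hypothetical, and reading "increases" as "higher than the previous epoch's loss" is an assumption, since the quoted text does not say whether the comparison is against the previous epoch or the best loss so far.

```python
import random

MAX_EPOCHS = 50      # epoch cap quoted in the table
PATIENCE = 5         # stop once loss has risen for MORE than 5 epochs in a row
VAL_FRACTION = 0.1   # 10% of the training data held out for validation


def split_train_val(n_samples, val_fraction=VAL_FRACTION, seed=0):
    """Hypothetical helper: shuffle indices and hold out a validation subset."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]


def stopping_epoch(val_losses, max_epochs=MAX_EPOCHS, patience=PATIENCE):
    """Return how many epochs are run for a given sequence of validation losses.

    Assumption: an "increase" means the loss is strictly higher than in the
    previous epoch; any non-increase resets the counter.
    """
    consecutive_increases = 0
    previous = None
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if previous is not None and loss > previous:
            consecutive_increases += 1
        else:
            consecutive_increases = 0
        previous = loss
        if consecutive_increases > patience:
            break
    return epochs_run
```

In a real training loop the per-epoch validation loss would be computed inside the loop rather than passed in as a list; the list form is used here only so the rule can be checked in isolation.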