Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Automatic Reward Shaping from Confounded Offline Data
Authors: Mingxuan Li, Junzhe Zhang, Elias Bareinboim
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulations support the theoretical findings. In this section, we show simulation results verifying that: (1) Q-UCB with our proposed shaping function enjoys better sample efficiency, and (2) the policy learned by our shaping pipeline at convergence is the optimal policy for an interventional agent. |
| Researcher Affiliation | Academia | (1) Causal AI Lab, Columbia University, New York, USA; (2) Department of Electrical Engineering and Computer Science, Syracuse University, New York, USA. Correspondence to: Mingxuan Li <EMAIL>. |
| Pseudocode | Yes | Algo. 2 in App. C shows the full pseudo-code for approximating the optimal value upper bound from offline datasets. Details of the algorithm is described in Algo. 1. See also App. F for the pseudo-code of the vanilla Q-UCB. |
| Open Source Code | No | The paper does not contain any explicit statements about the release of source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We test those algorithms in a series of customized windy MiniGrid environments (Zhang & Bareinboim, 2024; Chevalier-Boisvert et al., 2018). ... Chevalier-Boisvert, M., Willems, L., and Pal, S. Minimalistic gridworld environment for gymnasium, 2018. URL https://github.com/Farama-Foundation/Minigrid. |
| Dataset Splits | No | The paper describes the collection of 'offline datasets' and 'data-generating process' for different behavioral policies, but it does not specify any training, test, or validation splits for the experimental evaluation of their methods. |
| Hardware Specification | Yes | All of our experiment results are obtained from a 2021 MacBook Pro with M1 chip and 32GB memory. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | Q-UCB (Jin et al., 2018), to leverage the potential function ϕ extrapolated from offline data. Details of the algorithm is described in Algo. 1. Compared with the original Q-UCB, we make a few modifications for Q-UCB to work with PBRS: (1) Zero initializing the Q-values; (2) Using potential function dependent UCB bonus and value clipping; and finally, (3) Incorporating shaped reward during learning updates. ... the episode length is set to 15 while the Lava Cross series has a horizon of 20. To compensate for the hard exploration situation, we allow random initial starting states over the whole map walkable area. For training steps, we set a total of 100K environment steps for Windy Empty World and 20K for the Lava Cross series. ... There is a step penalty of 0.1, +0.2 for getting a coin, 0 for reaching the goal, and -1 for touching the lava. |
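
The Experiment Setup row mentions "incorporating shaped reward during learning updates" under potential-based reward shaping (PBRS). As a rough illustration only, the standard PBRS form adds `gamma * phi(s') - phi(s)` to the environment reward. The sketch below is not the paper's implementation: the function names, the `gamma=1.0` default, the zero-potential-at-termination convention, and the example potentials are all assumptions made here, and the paper's potential-dependent UCB bonus and value clipping are not reproduced.

```python
from typing import Sequence


def shaped_reward(r: float, phi_s: float, phi_next: float,
                  gamma: float = 1.0, terminal: bool = False) -> float:
    """Standard PBRS: r + gamma * phi(s') - phi(s).

    Assumption: the potential of a terminal state is treated as 0,
    a common convention that the source does not confirm.
    """
    next_potential = 0.0 if terminal else phi_next
    return r + gamma * next_potential - phi_s


def shape_episode(rewards: Sequence[float], potentials: Sequence[float],
                  gamma: float = 1.0) -> list[float]:
    """Shape every step of one episode.

    potentials[t] is phi(s_t) for the state visited at step t
    (hypothetical values); the last step is treated as terminal.
    """
    last = len(rewards) - 1
    shaped = []
    for t, r in enumerate(rewards):
        phi_next = potentials[t + 1] if t < last else 0.0
        shaped.append(shaped_reward(r, potentials[t], phi_next,
                                    gamma, terminal=(t == last)))
    return shaped


# Three steps, each paying only the 0.1 step penalty quoted in the table.
raw = [-0.1, -0.1, -0.1]
phi = [0.3, 0.6, 0.9]  # made-up potentials for illustration
shaped = shape_episode(raw, phi)
```

With `gamma = 1` and a zero terminal potential, the shaping terms telescope, so the shaped return differs from the raw return only by `-phi(s_0)`. That property is why PBRS leaves the ranking of policies unchanged.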