Information-Directed Pessimism for Offline Reinforcement Learning

Authors: Alec Koppel, Sujay Bhatt, Jiacheng Guo, Joe Eappen, Mengdi Wang, Sumitra Ganesh

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with a Portfolio Optimization task (Neuneier, 1997; Moody & Saffell, 2001), Frozen Lake (Brockman et al., 2016), Deep Sea (Osband & Van Roy, 2017a), and Prior MDP (Markou & Rasmussen, 2019); see Appendix G for detailed descriptions, and Appendix G.2.5 for additional experiments with a random walk MDP. [...] Table 2: Results across tasks for fixed batch size & sampling regime.
Researcher Affiliation | Collaboration | (1) J.P. Morgan AI Research, 383 Madison Ave., 9th floor, New York, NY 10017; (2) Dept. of ECE, Princeton University, Princeton, NJ 08544; (3) Dept. of ECE, Purdue University, West Lafayette, IN 47906.
Pseudocode | Yes | Algorithm 1: Estimating True Transition Model; Algorithm 2: IDP-VI, Information-Directed Pessimistic Value Iteration; Algorithm 3: IDP-Q, Information-Directed Pessimistic Asynchronous Q-Learning. (A hedged sketch of the value-iteration variant appears after this table.)
Open Source Code | Yes | Code is available here: https://github.com/jeappen/idp-offline-rl
Open Datasets | Yes | We experiment with a Portfolio Optimization task (Neuneier, 1997; Moody & Saffell, 2001), Frozen Lake (Brockman et al., 2016), Deep Sea (Osband & Van Roy, 2017a), and Prior MDP (Markou & Rasmussen, 2019); see Appendix G for detailed descriptions, and Appendix G.2.5 for additional experiments with a random walk MDP.
Dataset Splits | No | The paper mentions creating datasets by sampling different policies and dataset sampling ratios (Easy (1:1:1), Hard (0:1:0.1), and Random (0:1:0)), and evaluates performance on a 'test' set (e.g., 'Test cumulative return'), but it does not specify explicit train/validation/test dataset splits with percentages or counts for reproducibility. (Section 5, Experiments)
Hardware Specification | Yes | All experiments were run on an AWS c5.2xlarge instance, except for the Random MDP experiments, which used a cluster with an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz and 252 GB of RAM.
Software Dependencies | No | The paper mentions software like 'OpenAI Gym' and 'pymdptoolbox' (with a GitHub link for the latter), but it does not specify exact version numbers for these or any other key software libraries or solvers used for the experiments. (Section 5, Appendix G)
Experiment Setup | Yes | In this section, we detail the hyperparameter selection of the pessimistic penalty coefficient α in equations 3.8-3.9, and the coefficients used in the penalty in (Rashidinejad et al., 2021)... In Table 4 we present the range of values used, as well as the value actually selected for each experimental instance over all seeds. Similarly, Table 5 requires specifying a learning rate, a penalty coefficient α for DSD, or otherwise a multiplicative constant C_b that determines the scale of the penalty in (Yan et al., 2023). (An illustrative α-sweep sketch follows the table.)
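
The Pseudocode row names IDP-VI, the paper's pessimistic value-iteration routine. Below is a minimal sketch of the generic pessimistic value-iteration template on a tabular MDP, assuming a simple α/√n(s,a) lower-confidence penalty in the spirit of Rashidinejad et al. (2021). The paper's information-directed penalty, given by its equations 3.8-3.9, has a different data-dependent form, so the function name and penalty here are illustrative rather than the authors' implementation.

import numpy as np

def pessimistic_value_iteration(P_hat, R_hat, counts, alpha, gamma=0.99, n_iters=200):
    """Pessimistic value iteration on a tabular MDP (sketch).

    P_hat  : (S, A, S) plug-in transition estimate from the offline batch
    R_hat  : (S, A)    plug-in mean-reward estimate
    counts : (S, A)    visit counts n(s, a) in the batch
    alpha  : pessimism coefficient (the paper's alpha in eqs. 3.8-3.9)

    NOTE: the alpha / sqrt(n) penalty below is a generic LCB-style
    placeholder (Rashidinejad et al., 2021); the paper's
    information-directed penalty has a different, data-dependent form.
    """
    S, A, _ = P_hat.shape
    penalty = alpha / np.sqrt(np.maximum(counts, 1))   # (S, A) pessimism term
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = R_hat - penalty + gamma * (P_hat @ V)      # pessimistic Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V                         # greedy policy and its values

Subtracting the penalty before the backup is what makes it a lower confidence bound: poorly covered (s, a) pairs are discounted, which is the shared mechanism behind all three algorithms listed above.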
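The Experiment Setup row describes selecting the penalty coefficient α from a grid (Tables 4-5 of the paper's appendix). The following is a self-contained toy sweep on a synthetic tabular MDP, reusing pessimistic_value_iteration from the sketch above; the grid, MDP sizes, and scoring are all illustrative assumptions, not the paper's protocol.

rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9

# Synthetic ground-truth tabular MDP (purely illustrative).
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # (S, A, S) transitions
R_true = rng.uniform(size=(S, A))                 # (S, A) mean rewards

# Simulate an offline batch: uneven coverage and noisy plug-in estimates.
counts = rng.integers(1, 50, size=(S, A))
P_hat = np.array([[rng.dirichlet(counts[s, a] * P_true[s, a] + 1)
                   for a in range(A)] for s in range(S)])
R_hat = R_true + rng.normal(scale=1 / np.sqrt(counts))

def policy_value(P, R, pi):
    """Exact evaluation of deterministic policy pi: solve (I - gamma * P_pi) V = R_pi."""
    idx = np.arange(S)
    return np.linalg.solve(np.eye(S) - gamma * P[idx, pi], R[idx, pi])

# Sweep alpha (illustrative grid; the paper's actual ranges are in its Table 4)
# and score each greedy policy against the ground-truth model.
for alpha in [0.0, 0.1, 1.0, 10.0]:
    pi, _ = pessimistic_value_iteration(P_hat, R_hat, counts, alpha, gamma=gamma)
    print(f"alpha={alpha:5.1f}  mean true value={policy_value(P_true, R_true, pi).mean():.3f}")

Scoring against the ground-truth model stands in for the paper's test cumulative return; in the actual experiments the selected α is the one reported per instance over all seeds in Table 4.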