OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation
Authors: Jongmin Lee, Wonseok Jeon, Byungjun Lee, Joelle Pineau, Kee-Eung Kim
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods. In the experiments, we demonstrate this using the D4RL offline RL benchmarks (Fu et al., 2021). 4. Experiments In this section, we evaluate OptiDICE for both tabular and continuous MDPs. |
| Researcher Affiliation | Collaboration | Jongmin Lee 1 * Wonseok Jeon 2 3 * Byung-Jun Lee 4 Joelle Pineau 2 3 5 Kee-Eung Kim 1 6 1School of Computing, KAIST 2Mila, Quebec AI Institute 3School of Computer Science, McGill University 4Gauss Labs Inc. 5Facebook AI Research 6Graduate School of AI, KAIST. |
| Pseudocode | Yes | Algorithm 1 OptiDICE |
| Open Source Code | No | The paper does not state that its own source code is openly available, nor does it provide a direct link to it. It only mentions using the original code for a baseline: "For CQL, we use the original code by authors with hyperparameters reported in the CQL paper (Kumar et al., 2020)." |
| Open Datasets | Yes | D4RL offline RL benchmarks (Fu et al., 2021). Fu et al., 2021. URL https://openreview.net/forum?id=px0-N3_KjA. |
| Dataset Splits | No | The paper uses benchmark datasets for evaluation but does not specify how it performs train/validation/test splits of these datasets for its experiments. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU, CPU models, or cloud resources) used for running the experiments. |
| Software Dependencies | No | The paper mentions using deep neural networks and refers to models like CQL, implying common ML frameworks, but it does not specify versions for any software dependencies (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | For the f-divergence, we chose f(x) = (1/2)(x − 1)², i.e., the χ²-divergence, for the tabular-MDP experiment, while we use its softened version for continuous MDPs (See Appendix E for details). We provide detailed information of the experimental setup in Appendix F.2. γ = 0.99 is used for all algorithms. |
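The χ²-divergence generator quoted above, f(x) = (1/2)(x − 1)², has a closed-form convex conjugate, f*(y) = y + (1/2)y², which is what typically enters the dual objective in DICE-style methods. A minimal sketch of both functions (not taken from the authors' code; function names are illustrative):

```python
def f_chi2(x: float) -> float:
    """chi^2-divergence generator: f(x) = 0.5 * (x - 1)^2."""
    return 0.5 * (x - 1.0) ** 2


def f_chi2_conjugate(y: float) -> float:
    """Convex conjugate f*(y) = max_x [x*y - f(x)] = y + 0.5 * y^2.

    The inner maximum is attained at x = 1 + y (set d/dx [x*y - f(x)] = 0).
    """
    return y + 0.5 * y ** 2
```

By the Fenchel-Young inequality, x*y ≤ f(x) + f*(y) for all x and y, with equality at x = 1 + y; e.g., for y = 0.5 the bound is tight at x = 1.5.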