OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation
Authors: Jongmin Lee, Wonseok Jeon, Byungjun Lee, Joelle Pineau, Kee-Eung Kim
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods. In the experiments, we demonstrate this using the D4RL offline RL benchmarks (Fu et al., 2021). 4. Experiments In this section, we evaluate OptiDICE for both tabular and continuous MDPs. |
| Researcher Affiliation | Collaboration | Jongmin Lee 1 * Wonseok Jeon 2 3 * Byung-Jun Lee 4 Joelle Pineau 2 3 5 Kee-Eung Kim 1 6 1School of Computing, KAIST 2Mila, Quebec AI Institute 3School of Computer Science, McGill University 4Gauss Labs Inc. 5Facebook AI Research 6Graduate School of AI, KAIST. |
| Pseudocode | Yes | Algorithm 1 OptiDICE |
| Open Source Code | No | The paper does not state that its own source code is openly available, nor does it provide a direct link to it. It only mentions using the original code for a baseline: "For CQL, we use the original code by authors with hyperparameters reported in the CQL paper (Kumar et al., 2020)." |
| Open Datasets | Yes | D4RL offline RL benchmarks (Fu et al., 2021). Fu et al., 2021. URL https://openreview.net/forum?id=px0-N3_KjA. |
| Dataset Splits | No | The paper uses benchmark datasets for evaluation but does not specify how it performs train/validation/test splits of these datasets for its experiments. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU, CPU models, or cloud resources) used for running the experiments. |
| Software Dependencies | No | The paper mentions using deep neural networks and refers to models like CQL, implying common ML frameworks, but it does not specify versions for any software dependencies (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | For the f-divergence, we chose f(x) = (1/2)(x − 1)², i.e., the χ²-divergence, for the tabular-MDP experiment, while we use its softened version for continuous MDPs (See Appendix E for details). We provide detailed information of the experimental setup in Appendix F.2. γ = 0.99 is used for all algorithms. |
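The χ²-divergence generator quoted above, f(x) = (1/2)(x − 1)², has a closed-form convex conjugate, f*(y) = y + (1/2)y², which is what typically enters the dual objective in DICE-style methods. A minimal sketch of both functions (not taken from the authors' code; function names are illustrative):

```python
def f_chi2(x: float) -> float:
    """chi^2-divergence generator: f(x) = 0.5 * (x - 1)^2."""
    return 0.5 * (x - 1.0) ** 2


def f_chi2_conjugate(y: float) -> float:
    """Convex conjugate f*(y) = max_x [x*y - f(x)] = y + 0.5 * y^2.

    The inner maximum is attained at x = 1 + y (set d/dx [x*y - f(x)] = 0).
    """
    return y + 0.5 * y ** 2
```

By the Fenchel-Young inequality, x*y ≤ f(x) + f*(y) for all x and y, with equality at x = 1 + y; e.g., for y = 0.5 the bound is tight at x = 1.5.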