Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

Authors: Yuanzhao Zhai, Yiying Li, Zijian Gao, Xudong Gong, Kele Xu, Dawei Feng, Ding Bo, Huaimin Wang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our framework significantly outperforms P-MDP baselines by a margin of 30%, achieving state-of-the-art performance on the widely used benchmark. Moreover, ORPO exhibits notable advantages in problems that require generalization. In our experiment, we aim to investigate three primary research questions (RQs): RQ1 (Performance): How does ORPO perform on standard offline RL benchmarks and tasks requiring generalization compared to state-of-the-art baselines? RQ2 (Effectiveness of optimistic rollout policy): How does the proposed optimistic rollout policy compare to various other rollout policies? RQ3 (Ablation study): How does each design in ORPO affect performance?
Researcher Affiliation | Collaboration | Yuanzhao Zhai (1,2), Yiying Li (3), Zijian Gao (1,2), Xudong Gong (1,2), Kele Xu (1,2), Dawei Feng (1,2), Ding Bo (1,2), Huaimin Wang (1,2). 1: National University of Defense Technology, Changsha, China; 2: State Key Laboratory of Complex & Critical Software Environment; 3: Artificial Intelligence Research Center, DII, Beijing, China
Pseudocode | Yes | Algorithm 1: Framework for Optimistic Rollout for Pessimistic Policy Optimization (ORPO)
Open Source Code | No | The paper does not contain any explicit statement about releasing its source code or a link to a code repository.
Open Datasets | Yes | To answer the above questions, we conducted our experiments on the D4RL benchmark suite (Fu et al. 2020) as well as two datasets that require generalization to related but previously unseen tasks using the MuJoCo simulator (Todorov, Erez, and Tassa 2012). For the practical implementation of the ORPO algorithm, we utilized SAC (Haarnoja et al. 2018) to train the optimistic rollout policy, and for pessimistic offline policy optimization we used TD3+BC (Fujimoto and Gu 2021). Most of the hyper-parameters were inherited from the optimized MOPO (Lu et al. 2022). We evaluated on the Halfcheetah-jump dataset proposed by Yu et al. (2020). (A hedged code sketch of this rollout-then-pessimistic-update loop is given after the table.)
Dataset Splits | No | The paper mentions using the D4RL benchmark suite, which has predefined splits, but does not explicitly state the training/validation/test splits within the text or cite how these splits are applied.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions utilizing SAC and TD3+BC algorithms but does not provide specific version numbers for these or any other ancillary software dependencies.
Experiment Setup | No | The paper states that 'Most of the hyper-parameters were inherited from the optimized MOPO (Lu et al. 2022)' and refers to 'Appendix C.1 for the detailed experimental setup' (Figure 2 caption), but it does not provide concrete hyperparameter values or detailed training configurations in the main text.
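Since no source code is released, the following is a minimal, hypothetical sketch of how the two-policy loop quoted in the Open Datasets row might be wired together: an optimistic rollout policy (e.g. SAC) generates short model rollouts under an uncertainty bonus, while the output policy (e.g. TD3+BC) is trained pessimistically on real data plus MOPO-style penalized rollouts. The class names (DynamicsEnsemble-like `dynamics`, `ReplayBuffer`), method signatures, batch sizes, and the `lambda_pessimism` / `beta_optimism` coefficients are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def orpo_sketch(offline_dataset, dynamics, rollout_policy, output_policy,
                n_iters=1000, rollout_len=5, rollout_batch=50_000,
                lambda_pessimism=1.0, beta_optimism=1.0):
    """Schematic ORPO-style loop (assumed interfaces, not the authors' code).

    dynamics        -- learned ensemble model: step(s, a) -> (s', r, uncertainty)
    rollout_policy  -- optimistic policy (e.g. SAC) used only to generate rollouts
    output_policy   -- pessimistic offline learner (e.g. TD3+BC), returned at the end
    """
    model_buffer = ReplayBuffer(capacity=1_000_000)  # assumed helper class

    for _ in range(n_iters):
        # 1) Branch short rollouts from states sampled out of the offline data.
        states = offline_dataset.sample_states(rollout_batch)
        for _ in range(rollout_len):
            actions = rollout_policy.act(states)
            next_states, rewards, uncertainty = dynamics.step(states, actions)

            # Pessimistic (P-MDP) reward stored for the output policy:
            # subtract a model-uncertainty penalty, as in MOPO.
            model_buffer.add(states, actions,
                             rewards - lambda_pessimism * uncertainty,
                             next_states)

            # Optimistic (O-MDP) reward used to train the rollout policy:
            # an uncertainty bonus pushes rollouts beyond the dataset support.
            rollout_policy.update(states, actions,
                                  rewards + beta_optimism * uncertainty,
                                  next_states)
            states = next_states

        # 2) Pessimistic offline policy optimization on a mixture of real
        #    offline transitions and penalized model-generated transitions.
        real = offline_dataset.sample(256)
        synthetic = model_buffer.sample(256)
        batch = {k: np.concatenate([real[k], synthetic[k]]) for k in real}
        output_policy.update(batch)

    return output_policy
```

The split between a pessimistic reward written to the buffer and an optimistic reward used to update the rollout policy is the point of the sketch; the exact penalty form, rollout length, and mixing ratio would have to be taken from the paper's Appendix C.1 and the optimized MOPO hyper-parameters (Lu et al. 2022).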