Efficient Offline Policy Optimization with a Learned Model

Authors: Zichen Liu, Siyi Li, Wee Sun Lee, Shuicheng Yan, Zhongwen Xu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model.
Researcher Affiliation | Collaboration | Sea AI Lab; National University of Singapore. {liuzc,xuzw}@sea.com; {zichen,leews}@comp.nus.edu.sg
Pseudocode | Yes | A.1 Pseudocode: We present the detailed learning procedure of ROSMO in Algorithm 2. (An illustrative sketch of the core loss follows this table.)
Open Source Code | Yes | Our implementation is open-sourced at https://github.com/sail-sg/rosmo.
Open Datasets | Yes | We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. The datasets we collected do not contain any sensitive information and will be released.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or counts. It mentions using "sub-sampled datasets of different fractions" but no specific train/validation/test splits.
Hardware Specification | Yes | We use TPUv3-8 machines for all the experiments in Atari and use CPU servers with 60 cores for BSuite experiments.
Software Dependencies | No | The paper mentions "JAX (Bradbury et al., 2018)" but does not provide specific version numbers for JAX or any other software dependencies.
Experiment Setup | Yes | The hyperparameters shared by ROSMO and MuZero Unplugged for Atari environments are given in Table 3, and those for BSuite environments are given in Table 4. In addition, the behavior regularization strength (α) used in ROSMO is chosen to be 0.2.
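For readers skimming this report, the following is a minimal, hypothetical JAX sketch of the kind of one-step, behavior-regularized policy-improvement loss that Algorithm 2 of the paper describes. It is not the authors' implementation: the function name rosmo_policy_loss, the shapes, the exponential re-weighting of the prior policy by one-step advantages, and the behavior-cloning-style regularizer are all illustrative assumptions; only the regularization strength α = 0.2 comes from the paper, and the official code at https://github.com/sail-sg/rosmo is the reference.

import jax
import jax.numpy as jnp


def rosmo_policy_loss(policy_logits, prior_logits, q_values, value,
                      behavior_action, alpha=0.2):
    """Sketch of a one-step improved policy target plus behavior regularization.

    policy_logits:   [A] current policy logits for state s.
    prior_logits:    [A] prior (e.g. target-network) policy logits.
    q_values:        [A] model-based one-step returns r_hat(s, a) + gamma * v_hat(s').
    value:           []  predicted state value v(s).
    behavior_action: []  integer action taken in the dataset at s.
    alpha:           behavior regularization strength (0.2 per the paper).
    """
    advantages = q_values - value  # one-step advantages from the learned model
    # Adding advantages to the prior logits and normalizing yields a target
    # proportional to pi_prior(a) * exp(A(s, a)), without explicit normalization.
    target_logits = jax.lax.stop_gradient(prior_logits + advantages)
    target_probs = jax.nn.softmax(target_logits)
    # Cross-entropy between the improved target and the current policy.
    log_probs = jax.nn.log_softmax(policy_logits)
    improvement_loss = -jnp.sum(target_probs * log_probs)
    # Behavior-cloning-style regularizer (assumed form): keep the policy close
    # to the dataset action to limit off-support generalization of the model.
    regularization = -log_probs[behavior_action]
    return improvement_loss + alpha * regularization


# Tiny usage example with random inputs (illustrative only).
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
num_actions = 4
loss = rosmo_policy_loss(
    policy_logits=jax.random.normal(k1, (num_actions,)),
    prior_logits=jax.random.normal(k2, (num_actions,)),
    q_values=jax.random.normal(k3, (num_actions,)),
    value=jnp.array(0.0),
    behavior_action=2,
    alpha=0.2,
)
print(loss)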