SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets

Authors: Shenghua Wan, Ziyuan Chen, Le Gan, Shuai Feng, De-Chuan Zhan

Venue: ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method substantially outperforms all baseline methods, and further analytical experiments validate the critical designs in our method.
Researcher Affiliation | Academia | (1) School of Artificial Intelligence, Nanjing University, China; (2) National Key Laboratory for Novel Software Technology, Nanjing University, China; (3) School of Mathematical Sciences, Center for Statistical Science, Peking University, Beijing, China; (4) School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China.
Pseudocode | Yes | The overall method of SeMOPO is shown in Figure 1 and summarized in Algorithm 1.
Open Source Code | Yes | The project website is https://sites.google.com/view/semopo.
Open Datasets | Yes | The backgrounds of each locomotion task's observations are replaced with videos from the driving car category of the Kinetics dataset (Kay et al., 2017), as used in DBC (Zhang et al., 2021).
Dataset Splits | No | The paper mentions training, testing, and evaluation metrics but does not specify validation splits explicitly in terms of percentages or counts, nor does it cite predefined validation splits.
Hardware Specification | Yes | We implement the proposed algorithm using TensorFlow 2 and conduct all experiments on an NVIDIA RTX 3090, totaling approximately 1000 GPU hours.
Software Dependencies | Yes | We implement the proposed algorithm using TensorFlow 2 and conduct all experiments on an NVIDIA RTX 3090, totaling approximately 1000 GPU hours. The recurrent state-space model from DreamerV2 (Hafner et al., 2021) is employed for both forward dynamics and the posterior encoder.
Experiment Setup | Yes | The ADAM optimizer is employed to train the network with batches of 64 sequences, each of length 50. The learning rate is 6e-5 for both the endogenous and exogenous models and 8e-5 for the action and value nets. We stabilize the training process by clipping gradient norms to 100 and set λ = 10 for the uncertainty penalty term. The imagine horizon of 5, as used in Offline DV2 (Lu et al., 2022), is adopted for policy optimization.
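As a hedged illustration, the sketch below collects the hyperparameters quoted in the Experiment Setup row into a TensorFlow 2 configuration. Only the numeric values come from the excerpt above; the variable names, the optimizer objects, and the penalized-reward form r − λu are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the reported training configuration (not the authors' code).
# Numbers are taken from the Experiment Setup row; everything else is assumed.
import tensorflow as tf

config = {
    "batch_size": 64,            # sequences per batch
    "sequence_length": 50,       # timesteps per sequence
    "model_lr": 6e-5,            # endogenous and exogenous model learning rate
    "actor_value_lr": 8e-5,      # action and value net learning rate
    "grad_clip_norm": 100.0,     # gradient-norm clipping
    "uncertainty_lambda": 10.0,  # weight of the uncertainty penalty term
    "imagination_horizon": 5,    # as in Offline DV2
}

# Adam optimizers with gradient-norm clipping, matching the reported settings.
model_opt = tf.keras.optimizers.Adam(config["model_lr"],
                                     clipnorm=config["grad_clip_norm"])
actor_opt = tf.keras.optimizers.Adam(config["actor_value_lr"],
                                     clipnorm=config["grad_clip_norm"])
value_opt = tf.keras.optimizers.Adam(config["actor_value_lr"],
                                     clipnorm=config["grad_clip_norm"])

def penalized_reward(reward, uncertainty, lam=config["uncertainty_lambda"]):
    """Uncertainty-penalized reward r - lambda * u for pessimistic policy
    optimization (an assumed MOPO-style form, shown for illustration only)."""
    return reward - lam * uncertainty
```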