SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets

Authors: Shenghua Wan, Ziyuan Chen, Le Gan, Shuai Feng, De-Chuan Zhan

Venue: ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our method substantially outperforms all baseline methods, and further analytical experiments validate the critical designs in our method.
Researcher Affiliation | Academia | (1) School of Artificial Intelligence, Nanjing University, China; (2) National Key Laboratory for Novel Software Technology, Nanjing University, China; (3) School of Mathematical Sciences, Center for Statistical Science, Peking University, Beijing, China; (4) School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China.
Pseudocode | Yes | The overall method of SeMOPO is shown in Figure 1 and summarized in Algorithm 1.
Open Source Code | Yes | The project website is https://sites.google.com/view/semopo.
Open Datasets | Yes | The backgrounds of each locomotion task's observations are replaced with videos from the driving car category of the Kinetics dataset (Kay et al., 2017), as used in DBC (Zhang et al., 2021).
Dataset Splits | No | The paper mentions training, testing, and evaluation metrics but does not specify validation splits explicitly in terms of percentages or counts, nor does it cite predefined validation splits.
Hardware Specification | Yes | We implement the proposed algorithm using TensorFlow 2 and conduct all experiments on an NVIDIA RTX 3090, totaling approximately 1000 GPU hours.
Software Dependencies | Yes | We implement the proposed algorithm using TensorFlow 2 and conduct all experiments on an NVIDIA RTX 3090, totaling approximately 1000 GPU hours. The recurrent state-space model from DreamerV2 (Hafner et al., 2021) is employed for both forward dynamics and the posterior encoder.
Experiment Setup | Yes | The ADAM optimizer is employed to train the network with batches of 64 sequences, each of length 50. The learning rate is 6e-5 for both the endogenous and exogenous models and 8e-5 for the action and value nets. We stabilize the training process by clipping gradient norms to 100 and set λ = 10 for the uncertainty penalty term. The imagine horizon of 5, as used in Offline DV2 (Lu et al., 2022), is adopted for policy optimization.
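As a hedged illustration, the sketch below collects the hyperparameters quoted in the Experiment Setup row into a TensorFlow 2 configuration. Only the numeric values come from the excerpt above; the variable names, the optimizer objects, and the penalized-reward form r − λu are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the reported training configuration (not the authors' code).
# Numbers are taken from the Experiment Setup row; everything else is assumed.
import tensorflow as tf

config = {
    "batch_size": 64,            # sequences per batch
    "sequence_length": 50,       # timesteps per sequence
    "model_lr": 6e-5,            # endogenous and exogenous model learning rate
    "actor_value_lr": 8e-5,      # action and value net learning rate
    "grad_clip_norm": 100.0,     # gradient-norm clipping
    "uncertainty_lambda": 10.0,  # weight of the uncertainty penalty term
    "imagination_horizon": 5,    # as in Offline DV2
}

# Adam optimizers with gradient-norm clipping, matching the reported settings.
model_opt = tf.keras.optimizers.Adam(config["model_lr"],
                                     clipnorm=config["grad_clip_norm"])
actor_opt = tf.keras.optimizers.Adam(config["actor_value_lr"],
                                     clipnorm=config["grad_clip_norm"])
value_opt = tf.keras.optimizers.Adam(config["actor_value_lr"],
                                     clipnorm=config["grad_clip_norm"])

def penalized_reward(reward, uncertainty, lam=config["uncertainty_lambda"]):
    """Uncertainty-penalized reward r - lambda * u for pessimistic policy
    optimization (an assumed MOPO-style form, shown for illustration only)."""
    return reward - lam * uncertainty
```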