SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets
Authors: Shenghua Wan, Ziyuan Chen, Le Gan, Shuai Feng, De-Chuan Zhan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our method substantially outperforms all baseline methods, and further analytical experiments validate the critical designs in our method. |
| Researcher Affiliation | Academia | 1School of Artificial Intelligence, Nanjing University, China 2National Key Laboratory for Novel Software Technology, Nanjing University, China 3School of Mathematical Sciences, Center for Statistical Science, Peking University, Beijing, China 4School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China. |
| Pseudocode | Yes | The overall method of SeMOPO is shown in Figure 1 and summarized in Algorithm 1. |
| Open Source Code | Yes | The project website is https://sites.google.com/view/semopo. |
| Open Datasets | Yes | The backgrounds of each locomotion task's observations are replaced with videos from the driving car category of the Kinetics dataset (Kay et al., 2017), as used in DBC (Zhang et al., 2021). |
| Dataset Splits | No | The paper mentions training, testing, and evaluation metrics but does not specify validation splits explicitly in terms of percentages or counts, nor does it cite predefined validation splits. |
| Hardware Specification | Yes | We implement the proposed algorithm using TensorFlow 2 and conduct all experiments on an NVIDIA RTX 3090, totaling approximately 1000 GPU hours. |
| Software Dependencies | Yes | We implement the proposed algorithm using TensorFlow 2 and conduct all experiments on an NVIDIA RTX 3090, totaling approximately 1000 GPU hours. The recurrent state-space model from DreamerV2 (Hafner et al., 2021) is employed for both forward dynamics and the posterior encoder. |
| Experiment Setup | Yes | The ADAM optimizer is employed to train the network with batches of 64 sequences, each of length 50. The learning rate is 6e-5 for both the endogenous and exogenous models and 8e-5 for the action and value nets. We stabilize the training process by clipping gradient norms to 100 and set λ = 10 for the uncertainty penalty term. The imagination horizon of 5, as used in Offline DV2 (Lu et al., 2022), is adopted for policy optimization. |
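
For reference, the reported hyperparameters from the experiment-setup row can be collected into a single training configuration. The sketch below is a minimal illustration assuming a Dreamer-style TensorFlow 2 training loop; the dictionary keys and the `penalized_reward` helper are hypothetical names chosen for clarity, not identifiers from the released code.

```python
import tensorflow as tf

# Hyperparameters as reported in the paper (the values come from the table
# above; the key names themselves are illustrative, not from the official repo).
CONFIG = {
    "batch_size": 64,            # sequences per batch
    "sequence_length": 50,       # timesteps per sequence
    "lr_world_model": 6e-5,      # endogenous and exogenous dynamics models
    "lr_actor_critic": 8e-5,     # action and value networks
    "grad_clip_norm": 100.0,     # gradient-norm clipping threshold
    "uncertainty_lambda": 10.0,  # weight of the uncertainty penalty
    "imagine_horizon": 5,        # imagination rollout length for policy optimization
}

def make_optimizers(config):
    """Adam optimizers with the reported learning rates (sketch)."""
    wm_opt = tf.keras.optimizers.Adam(config["lr_world_model"])
    ac_opt = tf.keras.optimizers.Adam(config["lr_actor_critic"])
    return wm_opt, ac_opt

def penalized_reward(reward, uncertainty, lam=CONFIG["uncertainty_lambda"]):
    """MOPO-style pessimistic reward: subtract a scaled model-uncertainty
    estimate from the predicted reward (assumed form, for illustration)."""
    return reward - lam * uncertainty
```

Gradient-norm clipping at 100 would typically be applied with `tf.clip_by_global_norm` on the gradients before each optimizer step; the exact placement in SeMOPO's training loop is not detailed in this table.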