Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
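As a rough illustration (not the actual pipeline or data from [1]), validating LLM-assigned labels against a manually labeled set amounts to computing agreement metrics such as raw accuracy and Cohen's kappa between the two label columns. All labels and names below are hypothetical:

```python
# Hypothetical sketch: comparing LLM-assigned reproducibility labels
# against manual (gold) labels. Labels are illustrative, not from [1].
from collections import Counter

def accuracy(pred, gold):
    """Fraction of variables where the LLM label matches the manual label."""
    assert len(pred) == len(gold)
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohens_kappa(pred, gold):
    """Chance-corrected agreement between LLM and manual labels."""
    n = len(gold)
    po = sum(p == g for p, g in zip(pred, gold)) / n  # observed agreement
    cp, cg = Counter(pred), Counter(gold)
    # Expected agreement if both labelers drew labels independently
    # from their own marginal distributions.
    pe = sum(cp[k] * cg[k] for k in set(pred) | set(gold)) / (n * n)
    return (po - pe) / (1 - pe)

# Illustrative Yes/No outcomes for eight variables.
manual = ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"]
llm    = ["Yes", "Yes", "No", "No",  "No", "No", "Yes", "Yes"]

print(accuracy(llm, manual))                  # 0.875
print(round(cohens_kappa(llm, manual), 3))    # 0.75
```

Kappa is the more informative of the two here, since a classifier that always outputs the majority label can still score high raw accuracy.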

UniZero: Generalized and Efficient Planning with Scalable Latent World Models

Authors: Yuan Pu, Yazhe Niu, Zhenjie Yang, Jiyuan Ren, Hongsheng Li, Yu Liu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that UniZero significantly outperforms existing baselines in benchmarks that require long-term memory. Additionally, UniZero demonstrates superior scalability in multitask learning experiments conducted on Atari benchmarks. In standard single-task RL settings, such as Atari and DMControl, UniZero matches or even surpasses the performance of current state-of-the-art methods. Finally, extensive ablation studies and visual analyses validate the effectiveness and scalability of UniZero's design choices.
Researcher Affiliation | Academia | Yuan Pu EMAIL Shanghai Artificial Intelligence Laboratory; Yazhe Niu EMAIL The Chinese University of Hong Kong; Zhenjie Yang EMAIL Shanghai Jiao Tong University; Jiyuan Ren EMAIL Shanghai Artificial Intelligence Laboratory; Hongsheng Li EMAIL The Chinese University of Hong Kong; Yu Liu EMAIL Shanghai Artificial Intelligence Laboratory
Pseudocode | Yes | Algorithm 1 presents the pseudocode for the entire training pipeline.
Open Source Code | Yes | Our code is available at https://github.com/opendilab/LightZero.
Open Datasets | Yes | Specifically, we evaluate UniZero on the Atari 100k benchmark (short-term dependency, discrete actions) (Bellemare et al., 2013; Kaiser et al., 2024), DMControl (short-term dependency, continuous actions) (Tunyasuvunakool et al., 2020), and Visual Match (long-term dependency, discrete actions) (Ni et al., 2024).
Dataset Splits | No | The paper uses online reinforcement learning, where data is generated through interaction with environments (Atari 100k, DMControl, Visual Match) rather than drawn from a pre-defined static dataset with explicit train/validation/test splits. It specifies interaction limits (e.g., '100,000 steps per game') and episode lengths, but no traditional dataset splits for reproducibility.
Hardware Specification | Yes | The following computational overhead experiments were conducted on a Kubernetes cluster with the following specifications: a single NVIDIA A100 80GB GPU, a 24-core CPU, and 100GB of memory.
Software Dependencies | No | Our MuZero implementation is based on the LightZero (Niu et al., 2024) framework. Unless otherwise stated, all references to MuZero in this work denote its variant augmented with self-supervised learning regularization (MuZero w/ SSL), as discussed in Section 6. (1) Visual Match Baselines: We compare against MuZero and the SAC-Discrete variant combined with the GPT backbone, as proposed in Ni et al. (2024), referred to as SAC-GPT. (2) Atari 100k Baselines: The baseline used is MuZero. (3) DMControl Baselines: DreamerV3 (Hafner et al., 2023) is used as the baseline, a model-based approach that optimizes a model-free policy using rollouts generated from a learned environment model. For a comprehensive comparison with prior model-based RL algorithms such as TWM (Robine et al., 2023), IRIS (Micheli et al., 2022), DreamerV3 (Hafner et al., 2023), STORM (Zhang et al., 2023), TDMPC2 (Hansen et al., 2023) and MuZero (Schrittwieser et al., 2019), please refer to Table 12.
Experiment Setup | Yes | Table 8 outlines the key hyperparameters for UniZero, which are closely aligned with those reported in Niu et al. (2024).