Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
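As a rough illustration (not the actual pipeline or data from [1]), validating LLM-assigned labels against a manually labeled set amounts to computing agreement metrics such as raw accuracy and Cohen's kappa between the two label columns. All labels and names below are hypothetical:

```python
# Hypothetical sketch: comparing LLM-assigned reproducibility labels
# against manual (gold) labels. Labels are illustrative, not from [1].
from collections import Counter

def accuracy(pred, gold):
    """Fraction of variables where the LLM label matches the manual label."""
    assert len(pred) == len(gold)
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def cohens_kappa(pred, gold):
    """Chance-corrected agreement between LLM and manual labels."""
    n = len(gold)
    po = sum(p == g for p, g in zip(pred, gold)) / n  # observed agreement
    cp, cg = Counter(pred), Counter(gold)
    # Expected agreement if both labelers drew labels independently
    # from their own marginal distributions.
    pe = sum(cp[k] * cg[k] for k in set(pred) | set(gold)) / (n * n)
    return (po - pe) / (1 - pe)

# Illustrative Yes/No outcomes for eight variables.
manual = ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"]
llm    = ["Yes", "Yes", "No", "No",  "No", "No", "Yes", "Yes"]

print(accuracy(llm, manual))                  # 0.875
print(round(cohens_kappa(llm, manual), 3))    # 0.75
```

Kappa is the more informative of the two here, since a classifier that always outputs the majority label can still score high raw accuracy.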

UniZero: Generalized and Efficient Planning with Scalable Latent World Models

Authors: Yuan Pu, Yazhe Niu, Zhenjie Yang, Jiyuan Ren, Hongsheng Li, Yu Liu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that UniZero significantly outperforms existing baselines in benchmarks that require long-term memory. Additionally, UniZero demonstrates superior scalability in multitask learning experiments conducted on Atari benchmarks. In standard single-task RL settings, such as Atari and DMControl, UniZero matches or even surpasses the performance of current state-of-the-art methods. Finally, extensive ablation studies and visual analyses validate the effectiveness and scalability of UniZero's design choices.
Researcher Affiliation | Academia | Yuan Pu EMAIL Shanghai Artificial Intelligence Laboratory; Yazhe Niu EMAIL The Chinese University of Hong Kong; Zhenjie Yang EMAIL Shanghai Jiao Tong University; Jiyuan Ren EMAIL Shanghai Artificial Intelligence Laboratory; Hongsheng Li EMAIL The Chinese University of Hong Kong; Yu Liu EMAIL Shanghai Artificial Intelligence Laboratory
Pseudocode | Yes | Algorithm 1 presents the pseudocode for the entire training pipeline.
Open Source Code | Yes | Our code is available at https://github.com/opendilab/LightZero.
Open Datasets | Yes | Specifically, we evaluate UniZero on the Atari 100k benchmark (short-term dependency, discrete actions) (Bellemare et al., 2013; Kaiser et al., 2024), DMControl (short-term dependency, continuous actions) (Tunyasuvunakool et al., 2020), and Visual Match (long-term dependency, discrete actions) (Ni et al., 2024).
Dataset Splits | No | The paper uses online reinforcement learning, where data is generated through interaction with environments (Atari 100k, DMControl, Visual Match) rather than drawn from a pre-defined static dataset with explicit train/validation/test splits. It specifies interaction limits (e.g., '100,000 steps per game') and episode lengths, but no traditional dataset splits for reproducibility.
Hardware Specification | Yes | The following computational overhead experiments were conducted on a Kubernetes cluster with the following specifications: a single NVIDIA A100 80GB GPU, a 24-core CPU, and 100GB of memory.
Software Dependencies | No | Our MuZero implementation is based on the LightZero (Niu et al., 2024) framework. Unless otherwise stated, all references to MuZero in this work denote its variant augmented with self-supervised learning regularization (MuZero w/ SSL), as discussed in Section 6. (1) Visual Match Baselines: We compare against MuZero and the SAC-Discrete variant combined with the GPT backbone, as proposed in Ni et al. (2024), referred to as SAC-GPT. (2) Atari 100k Baselines: The baseline used is MuZero. (3) DMControl Baselines: DreamerV3 (Hafner et al., 2023) is used as the baseline, a model-based approach that optimizes a model-free policy using rollouts generated from a learned environment model. For a comprehensive comparison with prior model-based RL algorithms such as TWM (Robine et al., 2023), IRIS (Micheli et al., 2022), DreamerV3 (Hafner et al., 2023), STORM (Zhang et al., 2023), TDMPC2 (Hansen et al., 2023) and MuZero (Schrittwieser et al., 2019), please refer to Table 12.
Experiment Setup | Yes | Table 8 outlines the key hyperparameters for UniZero, which are closely aligned with those reported in Niu et al. (2024).