EfficientZero V2: Mastering Discrete and Continuous Control with Limited Data

Authors: Shengjie Wang, Shaohuai Liu, Weirui Ye, Jiacheng You, Yang Gao

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method in multiple benchmarks, outperforming the previous SOTA algorithms under limited data. As shown in Fig. 1, the performance of EZ-V2 exceeds DreamerV3, a universal algorithm, by a large margin covering multiple domains with a data budget of 50k to 200k interactions.
Researcher Affiliation | Academia | 1) Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; 2) Shanghai Qi Zhi Institute, Shanghai, China; 3) Shanghai Artificial Intelligence Laboratory, Shanghai, China; 4) Texas A&M University. Correspondence to: Yang Gao <gaoyangiiis@mail.tsinghua.edu.cn>.
Pseudocode | No | The paper describes the components and training process of EZ-V2 using prose and diagrams (Figure 2), but does not include a structured pseudocode or algorithm block.
Open Source Code | No | The paper provides links to official implementations of various baselines (e.g., DreamerV3, TD-MPC2), but does not provide an explicit statement or link for the open-source code of its own method, EfficientZero V2.
Open Datasets | Yes | In discrete control, we use the Atari 100k benchmark (Brockman et al., 2016), encompassing 26 Atari games and limiting training to 400k environment steps, equivalent to 100k steps with action repeats of 4. For continuous control evaluation, we utilize the DeepMind Control Suite (DMControl; Tassa et al., 2018).
Dataset Splits | No | To assess sample efficiency, we measure algorithm performance with limited environment steps. In discrete control, we use the Atari 100k benchmark (Brockman et al., 2016), encompassing 26 Atari games and limiting training to 400k environment steps, equivalent to 100k steps with action repeats of 4. While standard benchmarks are used, the paper does not explicitly define training, validation, and test splits with percentages or sample counts for its own experiments beyond the stated limits on environment interactions.
Hardware Specification | Yes | The methods were benchmarked on a server equipped with 8 RTX 3090 graphics cards.
Software Dependencies | No | The paper mentions optimizers (Adam, SGD) and general software components such as neural networks, but does not provide specific version numbers for any key software libraries or environments (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | Table 3 lists comprehensive hyperparameters for EfficientZero V2, including 'Batch size B 256', 'Discount γ 0.997', 'Number of simulations in search Nsim 32', 'Optimizer Adam', 'Optimizer: learning rate 3 × 10^-4', and many others.
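
For readers scripting a reproduction, the Table 3 values quoted above can be gathered into a small configuration object. The sketch below is illustrative only: the dictionary keys and the script are assumptions, not code from the paper or an official release; only the numeric values come from the reported table.

# Minimal sketch (assumed names) of the EfficientZero V2 settings quoted from Table 3.
# Only the values are from the paper; the key names and this script are illustrative.
EZ_V2_HYPERPARAMS = {
    "batch_size": 256,        # Batch size B
    "discount": 0.997,        # Discount gamma
    "num_simulations": 32,    # Number of simulations in search N_sim
    "optimizer": "Adam",      # Optimizer
    "learning_rate": 3e-4,    # Optimizer: learning rate 3 x 10^-4
}

if __name__ == "__main__":
    # Print the settings so a reproduction run can log them alongside its results.
    for name, value in EZ_V2_HYPERPARAMS.items():
        print(f"{name}: {value}")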