Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

A Mixture Of Surprises for Unsupervised Reinforcement Learning

Authors: Andrew Zhao, Matthieu Lin, Yangguang Li, Yong-jin Liu, Gao Huang

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our simple method achieves state-of-the-art performance on the URLB benchmark, outperforming previous pure surprise maximization-based objectives. Our code is available at: https://github.com/LeapLabTHU/MOSS. Experimental results in Section 5 shows that on URLB [29] and ViZDoom [27], our MOSS method improves upon previous pure maximization and minimization methods.
Researcher Affiliation | Collaboration | 1 Department of Automation, BNRist, Tsinghua University; 2 Department of Computer Science, BNRist, Tsinghua University; 3 SenseTime
Pseudocode | Yes | Pseudocode for MOSS is provided in Alg. 1.
Open Source Code | Yes | Our code is available at: https://github.com/LeapLabTHU/MOSS.
Open Datasets | Yes | We present the main results of method by evaluating on the Unsupervised Reinforcement Learning Benchmark (URLB) [29], which is a standard unsupervised RL benchmark for continuous control. We also evaluated our method on ViZDoom [27].
Dataset Splits | No | The paper describes pretraining and finetuning steps and uses a standard benchmark procedure, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts.
Hardware Specification | No | The paper mentions 'generous donation of computing resources by High-Flyer AI' in the Acknowledgement section but does not specify any particular hardware details such as GPU/CPU models or memory.
Software Dependencies | No | The paper mentions using 'DDPG [31] as implemented in DrQ-v2 [51]' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | To ensure fairness, all hyper-parameters are kept as in CIC [28] and we use the same RL algorithm (i.e., DDPG [31] as implemented in DrQ-v2 [51]). Specifically, unless specified otherwise, we set M = 0 for the first half of steps and M = 1 for the other half of steps in the episode. We follow the benchmark's standard training procedure by pretraining the agent for 2 million steps in each of the three domains and then finetuning the pre-trained agent for 100k steps with downstream task rewards.
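
The mode schedule and step budgets quoted in the Experiment Setup row can be illustrated with a minimal sketch. This is not taken from the MOSS repository: the function name `mode_for_step`, the constant names, and the example episode length of 10 are hypothetical, and the rounding for odd episode lengths (the extra step goes to M = 1) is an assumption the excerpt does not settle.

```python
# Step budgets quoted in the Experiment Setup excerpt.
PRETRAIN_STEPS = 2_000_000  # per domain, without task reward
FINETUNE_STEPS = 100_000    # with downstream task rewards


def mode_for_step(step_in_episode: int, episode_length: int) -> int:
    """Return M for a step: 0 in the first half of the episode, 1 in the second.

    Hypothetical helper. For odd episode lengths the middle step is
    assigned to M = 1, which is an assumption of this sketch.
    """
    if not 0 <= step_in_episode < episode_length:
        raise ValueError("step_in_episode must lie in [0, episode_length)")
    return 0 if step_in_episode < episode_length // 2 else 1


# Example with an assumed episode length of 10 steps.
schedule = [mode_for_step(t, 10) for t in range(10)]
print(schedule)
```

The sketch covers only the default schedule described in the excerpt ("unless specified otherwise"); it says nothing about how the agent uses M.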