A Mixture Of Surprises for Unsupervised Reinforcement Learning

Authors: Andrew Zhao, Matthieu Lin, Yangguang Li, Yong-jin Liu, Gao Huang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that our simple method achieves state-of-the-art performance on the URLB benchmark, outperforming previous pure surprise maximization-based objectives. Our code is available at: https://github.com/LeapLabTHU/MOSS. Experimental results in Section 5 show that on URLB [29] and ViZDoom [27], our MOSS method improves upon previous pure maximization and minimization methods.
Researcher Affiliation | Collaboration | 1 Department of Automation, BNRist, Tsinghua University; 2 Department of Computer Science, BNRist, Tsinghua University; 3 SenseTime
Pseudocode | Yes | Pseudocode for MOSS is provided in Alg. 1.
Open Source Code | Yes | Our code is available at: https://github.com/LeapLabTHU/MOSS.
Open Datasets | Yes | We present the main results of our method by evaluating on the Unsupervised Reinforcement Learning Benchmark (URLB) [29], which is a standard unsupervised RL benchmark for continuous control. We also evaluated our method on ViZDoom [27].
Dataset Splits | No | The paper describes pretraining and finetuning steps and uses a standard benchmark procedure, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts.
Hardware Specification | No | The paper mentions the 'generous donation of computing resources by High-Flyer AI' in the Acknowledgements section but does not specify any particular hardware details such as GPU/CPU models or memory.
Software Dependencies | No | The paper mentions using 'DDPG [31] as implemented in DrQ-V2 [51]' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | To ensure fairness, all hyper-parameters are kept as in CIC [28] and we use the same RL algorithm (i.e., DDPG [31] as implemented in DrQ-V2 [51]). Specifically, unless specified otherwise, we set M = 0 for the first half of steps and M = 1 for the other half of steps in the episode. We follow the benchmark's standard training procedure by pretraining the agent for 2 million steps in each of the three domains and then finetuning the pretrained agent for 100k steps with downstream task rewards.
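
The episode-level mode switch quoted in the Experiment Setup row (M = 0 for the first half of each episode's steps, M = 1 for the second half) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' released implementation: the names `surprise_mode` and `signed_intrinsic_reward` are hypothetical, and mapping M = 1 to surprise maximization and M = 0 to minimization is one plausible reading of the quote.

```python
# Minimal sketch (not the official MOSS code) of the episode-level mode
# schedule quoted above: M = 0 for the first half of the episode's steps
# and M = 1 for the second half. The convention that M = 1 maximizes
# surprise and M = 0 minimizes it is an assumption for illustration.

def surprise_mode(step_in_episode: int, episode_length: int) -> int:
    """Return the mixture indicator M for the current step."""
    return 0 if step_in_episode < episode_length // 2 else 1


def signed_intrinsic_reward(surprise_estimate: float, mode: int) -> float:
    """Flip the sign of a per-step surprise (e.g., entropy) estimate so the
    agent maximizes surprise when M = 1 and minimizes it when M = 0."""
    return surprise_estimate if mode == 1 else -surprise_estimate


if __name__ == "__main__":
    episode_length = 1000  # illustrative episode length, not from the paper
    for t in (0, 499, 500, 999):
        m = surprise_mode(t, episode_length)
        print(t, m, signed_intrinsic_reward(0.3, m))
```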