A Mixture of Surprises for Unsupervised Reinforcement Learning
Authors: Andrew Zhao, Matthieu Lin, Yangguang Li, Yong-jin Liu, Gao Huang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our simple method achieves state-of-the-art performance on the URLB benchmark, outperforming previous pure surprise maximization-based objectives. Our code is available at: https://github.com/LeapLabTHU/MOSS. Experimental results in Section 5 show that on URLB [29] and ViZDoom [27], our MOSS method improves upon previous pure maximization and minimization methods. |
| Researcher Affiliation | Collaboration | 1 Department of Automation, BNRist, Tsinghua University; 2 Department of Computer Science, BNRist, Tsinghua University; 3 SenseTime |
| Pseudocode | Yes | Pseudocode for MOSS is provided in Alg. 1. |
| Open Source Code | Yes | Our code is available at: https://github.com/LeapLabTHU/MOSS. |
| Open Datasets | Yes | We present the main results of our method by evaluating on the Unsupervised Reinforcement Learning Benchmark (URLB) [29], which is a standard unsupervised RL benchmark for continuous control. We also evaluated our method on ViZDoom [27]. |
| Dataset Splits | No | The paper describes pretraining and finetuning steps, and uses a standard benchmark procedure, but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper mentions 'generous donation of computing resources by High-Flyer AI' in the Acknowledgement section but does not specify any particular hardware details like GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions using 'DDPG [31] as implemented in DrQ-v2 [51]' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | To ensure fairness, all hyper-parameters are kept as in CIC [28] and we use the same RL algorithm (i.e., DDPG [31] as implemented in DrQ-v2 [51]). Specifically, unless specified otherwise, we set M = 0 for the first half of steps and M = 1 for the other half of steps in the episode. We follow the benchmark's standard training procedure by pretraining the agent for 2 million steps in each of the three domains and then finetuning the pre-trained agent for 100k steps with downstream task rewards. (A minimal sketch of this mode schedule follows the table.) |
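
The "Experiment Setup" row quotes MOSS's per-episode mode schedule (M = 0 for the first half of the episode, M = 1 for the second half). Below is a minimal sketch of that schedule, not the authors' implementation: the function name `surprise_mode`, its signature, and the example episode length of 1000 steps are assumptions; only the half-episode split of M follows the quoted description.

```python
# Minimal sketch (not the authors' code) of the MOSS mode schedule quoted in
# the "Experiment Setup" row: M = 0 for the first half of each episode and
# M = 1 for the second half. The function name, signature, and the example
# episode length below are hypothetical.

def surprise_mode(step_in_episode: int, episode_length: int) -> int:
    """Return the MOSS mode flag M for the given step of an episode."""
    # First half of the episode: M = 0; second half: M = 1.
    return 0 if step_in_episode < episode_length // 2 else 1


if __name__ == "__main__":
    # Example: a (hypothetical) 1000-step episode switches modes at step 500.
    assert surprise_mode(0, 1000) == 0
    assert surprise_mode(499, 1000) == 0
    assert surprise_mode(500, 1000) == 1
```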