EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model
Authors: Yifu Yuan, Jianye Hao, Fei Ni, Yao Mu, Yan Zheng, Yujing Hu, Jinyi Liu, Yingfeng Chen, Changjie Fan
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, basically solving the state-based URLB benchmark and reaching a mean normalized score of 104.0±1.2% in downstream tasks with 100k fine-tuning steps, which is equivalent to DDPG's performance at 2M interactive steps with 20× more data. More visualization videos are released on our homepage. |
| Researcher Affiliation | Collaboration | Yifu Yuan1, Jianye Hao1, Fei Ni1, Yao Mu3, Yan Zheng1, Yujing Hu2, Jinyi Liu1, Yingfeng Chen2, Changjie Fan2 1College of Intelligence and Computing, Tianjin University, 2Fuxi AI Lab, Netease, Inc., Hangzhou, China, 3The University of Hong Kong |
| Pseudocode | Yes | The detailed pseudocode is given in Algorithm 1. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a direct link to a code repository. |
| Open Datasets | Yes | Benchmarks: We evaluate our approach on tasks from URLB (Laskin et al., 2021), which consists of three domains (walker, quadruped, jaco) and twelve challenging continuous control downstream tasks. Besides, we extend the URLB benchmark (URLB-Extension) by adding a more complex humanoid domain and three corresponding downstream tasks based on the Deep Mind Control Suite (DMC) (Tunyasuvunakool et al., 2020) to further demonstrate the efficiency improvement of EUCLID on more challenging environments. |
| Dataset Splits | No | The paper does not explicitly specify exact training/test/validation dataset splits (e.g., percentages or sample counts). It refers to the URLB benchmark and fine-tuning steps, but not data partitioning specifics. |
| Hardware Specification | Yes | We conducted our experiments on an Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz processor based system. The system consists of 2 processors, each with 26 cores running at 2.60GHz (52 cores in total) with 32KB of L1, 1024 KB of L2, 40MB of unified L3 cache, and 250 GB of memory. Besides, we use a single Nvidia RTX3090 GPU to facilitate the training procedure. The operating system is Ubuntu 16.04. |
| Software Dependencies | No | The paper mentions DDPG and implies a standard deep-learning stack (e.g., PyTorch), but it does not explicitly list software dependencies or their version numbers. |
| Experiment Setup | Yes | Table 4: Hyper-parameters of the world model, policy, and planner. World model: Batch size 1024; Max buffer size 1e6; Latent dim 50 (default), 100 (Humanoid); MLP hidden dim 256 (encoder), 1024 (otherwise); MLP activation ELU; Optimizer (θ) Adam; Learning rate 1e-4 (PT), 1e-3 (FT); Reward loss coefficient (c1) 0.5; Consistency loss coefficient (c2) 2; Value loss coefficient (c3) 0.1; θ update frequency 2. Policy: Seed steps 0 (PT), 4000 (FT); Discount factor (γ) 0.99; Action repeat 2 (default), 4 (Quadruped). Planning (FT phase only): Iterations 6; Planning horizon (L) 5; CEM population size 512; CEM elite fraction 12; CEM policy fraction (Policy/CEM) 0.05; CEM temperature 0.5. |
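For readers who want to mirror the quoted experiment setup, the Table 4 hyper-parameters can be gathered into a single configuration object. The sketch below is a minimal, hypothetical Python config; the class and field names are illustrative assumptions, not taken from the authors' codebase, and PT/FT-specific values are kept as separate fields.

```python
from dataclasses import dataclass


@dataclass
class EuclidConfig:
    """Hypothetical container for the Table 4 hyper-parameters (names are illustrative)."""
    # --- World model ---
    batch_size: int = 1024
    max_buffer_size: int = int(1e6)
    latent_dim: int = 50                 # 100 for Humanoid
    mlp_hidden_dim: int = 1024           # 256 for the encoder
    mlp_activation: str = "elu"
    optimizer: str = "adam"
    lr_pretrain: float = 1e-4            # PT phase
    lr_finetune: float = 1e-3            # FT phase
    reward_loss_coef: float = 0.5        # c1
    consistency_loss_coef: float = 2.0   # c2
    value_loss_coef: float = 0.1         # c3
    theta_update_freq: int = 2           # θ update frequency

    # --- Policy ---
    seed_steps_pretrain: int = 0         # PT phase
    seed_steps_finetune: int = 4000      # FT phase
    discount: float = 0.99               # γ
    action_repeat: int = 2               # 4 for Quadruped

    # --- Planner (FT phase only) ---
    plan_iterations: int = 6
    plan_horizon: int = 5                # L
    cem_population: int = 512
    cem_elite_fraction: int = 12
    cem_policy_fraction: float = 0.05    # Policy/CEM mixing ratio
    cem_temperature: float = 0.5


# Example: only the domain-dependent fields change across environments in the quoted table.
humanoid_cfg = EuclidConfig(latent_dim=100)
quadruped_cfg = EuclidConfig(action_repeat=4)
```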