DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization

Authors: Guowei Xu, Ruijie Zheng, Yongyuan Liang, Xiyao Wang, Zhecheng Yuan, Tianying Ji, Yu Luo, Xiaoyu Liu, Jiaxin Yuan, Pu Hua, Shuzhen Li, Yanjie Ze, Hal Daumé III, Furong Huang, Huazhe Xu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that DrM achieves significant improvements in sample efficiency and asymptotic performance with no broken seeds (76 seeds in total) across three continuous control benchmarks: the DeepMind Control Suite, Meta-World, and Adroit.
Researcher Affiliation | Academia | Guowei Xu¹, Ruijie Zheng², Yongyuan Liang², Xiyao Wang², Zhecheng Yuan¹, Tianying Ji¹, Yu Luo¹, Xiaoyu Liu², Jiaxin Yuan², Pu Hua¹, Shuzhen Li¹, Yanjie Ze³⁴, Hal Daumé III², Furong Huang², Huazhe Xu¹⁴⁵. ¹Tsinghua University; ²University of Maryland, College Park; ³Shanghai Jiao Tong University; ⁴Shanghai Qi Zhi Institute; ⁵Shanghai AI Lab
Pseudocode | Yes | Algorithm 1: Dormant Ratio Calculation; Algorithm 2: Dormant-Ratio-Guided Perturbation; Algorithm 3: Awaken Exploration Scheduler; Algorithm 4: Dormant-Ratio-Guided Exploitation
Open Source Code | No | Please refer to https://drm-rl.github.io/ for experiment videos and benchmark results.
Open Datasets | Yes | DrM is evaluated across three different domains: the DeepMind Control Suite (Tassa et al., 2018), Meta-World (Yu et al., 2019), and Adroit (Rajeswaran et al., 2018).
Dataset Splits | No | The paper mentions evaluating performance over millions of frames and using mini-batch sizes for training, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts) for the environments used.
Hardware Specification | Yes | To assess the algorithms' speed, we measure their frames per second (FPS) on the same DeepMind Control Suite task, Dog Walk, using an identical Nvidia RTX A5000 GPU.
Software Dependencies | No | The paper mentions using 'Optimizer Adam' and building 'DrM upon the publicly available source code of DrQ-v2', but does not specify version numbers for any software dependencies (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We summarize all the hyperparameters of DrM in Table 1. While we try to keep the settings identical across tasks, there are a few task-specific deviations of DrM's hyperparameters.

Parameter | Setting
Replay buffer capacity | 10^6
Action repeat | 2
Seed frames | 4000
Exploration steps | 2000
n-step returns | 3
Mini-batch size | 256
Discount γ | 0.99
Optimizer | Adam
Learning rate | 8e-5 (DeepMind Control Suite), 10e-4 (Meta-World & Adroit)
Agent update frequency | 2
Soft update rate | 0.01
Features dimension | 100 (Humanoid & Dog), 50 (others)
Hidden dimension | 1024
τ-dormant ratio | 0.025
Dormant ratio threshold β̂ | 0.2
Minimum perturb factor αmin | 0.2
Maximum perturb factor αmax | 0.9
Perturb rate k | 2
Perturb frames | 200000
Linear exploration stddev. clip | 0.3
Linear exploration stddev. schedule | linear(1.0, 0.1, 2000000) (DeepMind Control Suite), linear(1.0, 0.1, 300000) (Meta-World & Adroit)
Awaken exploration temperature T | 0.1
Target exploitation parameter λ̂ | 0.6
Exploitation temperature T | 0.02
Exploitation expectile | 0.9
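The dormant ratio that gives the method its name (Algorithm 1 above, with τ = 0.025 in the hyperparameter table) follows the dormant-neuron definition of Sokar et al. (2023): a neuron is τ-dormant when its batch-mean activation, normalized by its layer's average, falls at or below τ. A minimal sketch of that calculation; the function name and the list-of-activations interface are illustrative assumptions, not the paper's code:

```python
import numpy as np

def dormant_ratio(layer_activations, tau=0.025):
    """Fraction of tau-dormant neurons across a network.

    layer_activations: list of (batch, num_neurons) post-activation arrays,
    one per layer, collected from a forward pass on a mini-batch.
    """
    dormant = total = 0
    for h in layer_activations:
        score = np.abs(h).mean(axis=0)           # per-neuron mean |activation|
        score = score / (score.mean() + 1e-9)    # normalize by the layer average
        dormant += int((score <= tau).sum())     # tau-dormant neurons in this layer
        total += score.size
    return dormant / max(total, 1)
```

For example, a layer where one of two neurons never fires yields a dormant ratio of 0.5.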
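The stddev schedule entries in the table use the linear(init, final, duration) convention of DrQ-v2, on which DrM is built: the exploration noise decays linearly from `init` to `final` over `duration` environment steps and then stays at `final`. A minimal sketch of such a schedule (the factory-function shape is an assumption):

```python
def linear(init: float, final: float, duration: int):
    """Return a step -> value schedule that linearly anneals
    from `init` to `final` over `duration` steps, then holds `final`."""
    def schedule(step: int) -> float:
        mix = min(max(step / duration, 0.0), 1.0)  # clip progress to [0, 1]
        return (1.0 - mix) * init + mix * final
    return schedule

# DeepMind Control Suite setting from the table: linear(1.0, 0.1, 2000000)
stddev = linear(1.0, 0.1, 2_000_000)
```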