Exploration and Anti-Exploration with Distributional Random Network Distillation

Authors: Kai Yang, Jian Tao, Jiafei Lyu, Xiu Li

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and effectively serves as an anti-exploration mechanism in D4RL offline tasks. In this section, we provide empirical evaluations of DRND. Initially, we demonstrate that DRND offers a better bonus than RND, both before and after training. Our online experiments reveal that DRND surpasses numerous baselines, achieving the best results in exploration-intensive environments.
Researcher Affiliation | Academia | Tsinghua Shenzhen International Graduate School, Tsinghua University. Correspondence to: Xiu Li <li.xiu@sz.tsinghua.edu.cn>.
Pseudocode | Yes | Algorithm 1 (PPO-DRND online pseudo-code) and Algorithm 2 (SAC-DRND offline pseudo-code); a hedged sketch of the DRND bonus these algorithms build on follows the table.
Open Source Code | Yes | Our code is publicly available at https://github.com/yk7333/DRND.
Open Datasets | Yes | Our code is publicly available at https://github.com/yk7333/DRND. Furthermore, we demonstrate that DRND can also serve as a good anti-exploration penalty term in the offline setting, confirming its ability to provide a better bonus based on the dataset distribution. We follow the setting of SAC-RND (Nikulin et al., 2023) and propose a novel offline RL algorithm, SAC-DRND. We run experiments on D4RL (Fu et al., 2020) offline tasks and find that SAC-DRND outperforms many recent strong baselines across various D4RL locomotion and Antmaze datasets. We chose three Atari games, Montezuma's Revenge, Gravitar, and Venture, to evaluate our algorithms. We further delve into the Adroit continuous control tasks (Rajeswaran et al., 2017) and the Fetch manipulation tasks, which involve various gym-robotics environments (Plappert et al., 2018).
Dataset Splits | No | The paper uses established datasets like D4RL, but it does not explicitly describe how these datasets are split into training, validation, or test sets, nor does it reference a specific split methodology for them.
Hardware Specification | Yes | GPU: NVIDIA GeForce RTX 3090; CPU: Intel(R) Xeon(R) Gold 6226R (from the Figure 10 caption).
Software Dependencies | Yes | Python 3.10.8, NumPy 1.23.4, Gymnasium 0.28.1, Gymnasium-Robotics 1.2.2, PyTorch 1.13.0, mujoco-py 2.1.2.14, MuJoCo 2.3.1.
Experiment Setup | Yes | The hyperparameters used in the online experiments are shown in Table 6. We employ distinct parameters and networks for Atari games and continuous control environments because Atari game observations are images, while observations for Adroit and Fetch tasks consist of states. The hyperparameters we use in the D4RL offline experiments are shown in Table 4. For the D4RL offline datasets, we apply varying scales in each experiment due to the differing dataset qualities, as illustrated in Table 5.
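
The Pseudocode row refers to Algorithm 1 (PPO-DRND) and Algorithm 2 (SAC-DRND), both of which are built around a DRND intrinsic bonus: a single predictor is distilled toward an ensemble of frozen random target networks, and the bonus mixes a first-moment prediction error with a second-moment (pseudo-count style) term. The code below is a minimal, non-authoritative sketch of that idea, not the authors' implementation; the class name DRNDSketch, the network sizes, the number of targets, the mixing coefficient alpha, and the exact form of the second term are assumptions for illustration and should be checked against the paper and the public repository.

# Minimal sketch of a DRND-style intrinsic bonus (PyTorch).
# Assumptions: network widths, num_targets, alpha, and the exact bonus form
# are illustrative placeholders, not the paper's verbatim settings.
import torch
import torch.nn as nn


def make_net(obs_dim: int, out_dim: int = 64) -> nn.Sequential:
    """Small MLP used for both the frozen targets and the trainable predictor."""
    return nn.Sequential(
        nn.Linear(obs_dim, 256), nn.ReLU(),
        nn.Linear(256, out_dim),
    )


class DRNDSketch(nn.Module):
    def __init__(self, obs_dim: int, num_targets: int = 10, alpha: float = 0.9):
        super().__init__()
        self.alpha = alpha
        # N randomly initialized, frozen target networks.
        self.targets = nn.ModuleList(make_net(obs_dim) for _ in range(num_targets))
        for t in self.targets:
            t.requires_grad_(False)
        # Single trainable predictor distilled toward the target distribution.
        self.predictor = make_net(obs_dim)

    def _moments(self, obs: torch.Tensor):
        outs = torch.stack([t(obs) for t in self.targets])  # (N, batch, out_dim)
        mu = outs.mean(dim=0)            # first moment of the target ensemble
        b2 = (outs ** 2).mean(dim=0)     # second moment of the target ensemble
        return mu, b2

    def bonus(self, obs: torch.Tensor) -> torch.Tensor:
        """Two-term bonus: distillation error to the mean target plus a
        second-moment (pseudo-count style) term, mixed by alpha."""
        mu, b2 = self._moments(obs)
        pred = self.predictor(obs)
        term1 = ((pred - mu) ** 2).mean(dim=-1)
        ratio = ((pred ** 2 - mu ** 2).abs() / (b2 - mu ** 2 + 1e-8)).clamp(min=1e-8)
        term2 = ratio.sqrt().mean(dim=-1)
        return self.alpha * term1 + (1.0 - self.alpha) * term2

    def loss(self, obs: torch.Tensor) -> torch.Tensor:
        """Predictor regression loss toward a randomly sampled target's output."""
        with torch.no_grad():
            idx = torch.randint(len(self.targets), (1,)).item()
            target_out = self.targets[idx](obs)
        return ((self.predictor(obs) - target_out) ** 2).mean()

Consistent with the Research Type and Open Datasets rows above, such a bonus would be added to the environment reward as an exploration signal in the online (PPO-DRND) setting, and subtracted as an anti-exploration penalty in the offline (SAC-DRND) setting.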