Exploration and Anti-Exploration with Distributional Random Network Distillation
Authors: Kai Yang, Jian Tao, Jiafei Lyu, Xiu Li
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both theoretical analysis and experimental results demonstrate the superiority of our approach over the original RND algorithm. Our method excels in challenging online exploration scenarios and effectively serves as an anti-exploration mechanism in D4RL offline tasks. In this section, we provide empirical evaluations of DRND. Initially, we demonstrate that DRND offers a better bonus than RND, both before and after training. Our online experiments reveal that DRND surpasses numerous baselines, achieving the best results in exploration-intensive environments. |
| Researcher Affiliation | Academia | 1Tsinghua Shenzhen International Graduate School, Tsinghua University. Correspondence to: Xiu Li <li.xiu@sz.tsinghua.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 PPO-DRND online pseudo-code and Algorithm 2 SAC-DRND offline pseudo-code (a hedged PyTorch sketch of the bonus appears after the table) |
| Open Source Code | Yes | Our code is publicly available at https://github.com/yk7333/DRND. |
| Open Datasets | Yes | Our code is publicly available at https://github.com/yk7333/DRND. Furthermore, we demonstrate that DRND can also serve as a good anti-exploration penalty term in the offline setting, confirming its ability to provide a better bonus based on the dataset distribution. We follow the setting of SAC-RND (Nikulin et al., 2023) and propose a novel offline RL algorithm, SAC-DRND. We run experiments in D4RL (Fu et al., 2020) offline tasks and find that SAC-DRND outperforms many recent strong baselines across various D4RL locomotion and Antmaze datasets. We chose three Atari games, Montezuma's Revenge, Gravitar, and Venture, to evaluate our algorithms. We further delve into the Adroit continuous control tasks (Rajeswaran et al., 2017) and the Fetch manipulation tasks, which involve various gym-robotics environments (Plappert et al., 2018). |
| Dataset Splits | No | The paper uses established datasets like D4RL, but it does not explicitly describe how these datasets are split into training, validation, or test sets within the paper's text, nor does it reference a specific split methodology for these datasets. |
| Hardware Specification | Yes | GPU: NVIDIA GeForce RTX 3090; CPU: Intel(R) Xeon(R) Gold 6226R (from Figure 10 caption) |
| Software Dependencies | Yes | Python 3.10.8, NumPy 1.23.4, Gymnasium 0.28.1, Gymnasium-Robotics 1.2.2, PyTorch 1.13.0, MuJoCo-py 2.1.2.14, MuJoCo 2.3.1 (a version-check snippet appears after the table) |
| Experiment Setup | Yes | The hyperparameters used in online experiments are shown in Table 6. We employ distinct parameters and networks for Atari games and continuous control environments because Atari game observations are images, while observations for Adroit and Fetch tasks consist of states. The hyperparameters we use in the D4RL offline experiment are shown in Table 4. In D4RL offline datasets, we apply varying scales in each experiment due to the differing dataset qualities, as illustrated in Table 5 (a hedged sketch of how such a scale enters the offline target follows the table). |
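
To make the Pseudocode row concrete, the following is a minimal, hypothetical sketch of a distributional-RND-style bonus in PyTorch. It is not the authors' released implementation: the network sizes, the number of target networks, the mixing coefficient `alpha`, the per-batch choice of distillation target, and the clipping in the pseudo-count-style term are all assumptions; the exact bonus terms are given in Algorithm 1 of the paper and in the repository linked above.

```python
# Hypothetical sketch (not the authors' released code): a distributional-RND-style
# bonus with N fixed random target networks and a single trained predictor.
import random

import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class DRNDBonusSketch(nn.Module):
    def __init__(self, in_dim, feat_dim=32, n_targets=10, alpha=0.9):
        super().__init__()
        # Fixed, randomly initialized targets; only the predictor is trained.
        self.targets = nn.ModuleList(mlp(in_dim, feat_dim) for _ in range(n_targets))
        for t in self.targets:
            for p in t.parameters():
                p.requires_grad_(False)
        self.predictor = mlp(in_dim, feat_dim)
        self.alpha = alpha  # mixes the two bonus terms (value assumed here)

    def predictor_loss(self, x):
        # Distill the predictor toward one randomly chosen target per batch.
        target = self.targets[random.randrange(len(self.targets))]
        with torch.no_grad():
            y = target(x)
        return ((self.predictor(x) - y) ** 2).mean()

    @torch.no_grad()
    def bonus(self, x):
        outs = torch.stack([t(x) for t in self.targets])         # [N, B, D]
        mu, second = outs.mean(0), (outs ** 2).mean(0)            # moments over targets
        pred = self.predictor(x)
        term1 = ((pred - mu) ** 2).mean(-1)                       # distillation error
        # Pseudo-count-style term, clipped to [0, 1] (simplified form).
        ratio = ((pred ** 2 - mu ** 2).abs()
                 / (second - mu ** 2).clamp_min(1e-8)).clamp(0, 1)
        term2 = ratio.sqrt().mean(-1)
        return self.alpha * term1 + (1.0 - self.alpha) * term2
```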
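
Continuing the sketch above, the Experiment Setup row notes that the penalty scale varies per D4RL dataset (Table 5 of the paper, not reproduced here). One hypothetical way such a scale could enter an offline Bellman target is shown below; `beta`, the dimensions, and the target form are illustrative placeholders, not the paper's values.

```python
# Hypothetical usage of the sketch above as an anti-exploration penalty.
obs_dim, act_dim = 17, 6                          # e.g. HalfCheetah-sized dims (illustrative)
drnd = DRNDBonusSketch(in_dim=obs_dim + act_dim)  # state-action input for the offline setting
beta, gamma = 1.0, 0.99                           # beta is tuned per D4RL dataset in the paper

def penalized_target(reward, next_q, next_state_action):
    # Subtract the DRND bonus from the bootstrapped value instead of adding it,
    # discouraging out-of-distribution actions in the offline setting.
    return reward + gamma * (next_q - beta * drnd.bonus(next_state_action))
```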
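
Finally, the Software Dependencies row can be turned into a quick environment check. The PyPI distribution names below (`torch`, `mujoco-py`, `mujoco`, `gymnasium-robotics`) are assumptions about how the listed packages are published; the pinned versions are those reported in the paper.

```python
# Quick check that the local environment matches the versions listed in the paper.
from importlib.metadata import PackageNotFoundError, version

expected = {
    "numpy": "1.23.4",
    "gymnasium": "0.28.1",
    "gymnasium-robotics": "1.2.2",
    "torch": "1.13.0",          # PyTorch
    "mujoco-py": "2.1.2.14",
    "mujoco": "2.3.1",
}

for pkg, want in expected.items():
    try:
        have = version(pkg)
    except PackageNotFoundError:
        have = "not installed"
    flag = "" if have == want else "  <-- differs"
    print(f"{pkg}: installed {have}, paper used {want}{flag}")
```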