Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework
Authors: Chuheng Zhang, Yuanying Cai, Longbo Huang, Jian Li
AAAI 2021, pp. 10859-10867 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that our exploration algorithm is effective and sample efficient, and results in superior policies for arbitrary reward functions in the planning phase. We conduct experiments on several environments with discrete, continuous or high-dimensional state spaces. |
| Researcher Affiliation | Academia | Chuheng Zhang , Yuanying Cai , Longbo Huang, Jian Li Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University |
| Pseudocode | Yes | Algorithm 1: Maximizing the state-action space Rényi entropy for the reward-free RL framework (see the illustrative sketch below the table) |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We first conduct experiments on the Multi Rooms environment from minigrid (Chevalier-Boisvert, Willems, and Pal 2018)... Then, we conduct experiments on a set of Atari (with image-based observations) (Machado et al. 2018) and Mujoco (Todorov, Erez, and Tassa 2012) tasks |
| Dataset Splits | No | The paper describes collecting samples and datasets (e.g., 'collect a dataset with 100M (5M) samples'), but it does not specify numerical train/validation/test splits (e.g., an 80/10/10 split or exact per-split sample counts) in the main text, so the data partitioning cannot be reproduced from the paper alone. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library names with specific versions, or solver names with versions) needed to replicate the experiments. |
| Experiment Setup | Yes | In the exploration phase, we run different exploration algorithms in the reward-free environment of Atari (Mujoco) for 200M (10M) steps and collect a dataset with 100M (5M) samples by executing the learned policy. More experiments and the detailed experiment settings/hyperparameters can be found in Appendix G. |
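
The Pseudocode row above refers to the paper's Algorithm 1, which learns an exploration policy by maximizing the Rényi entropy of the state-action visitation distribution, H_α(d) = log(Σ_i d_i^α) / (1 − α). For reference only, below is a minimal sketch of that entropy quantity computed on an empirical visitation distribution; it is not the paper's policy-optimization procedure, and the function name, the choice of α, and the visit counts are illustrative assumptions.

```python
import numpy as np

def renyi_entropy(d, alpha=0.5):
    """Renyi entropy H_alpha(d) = log(sum_i d_i**alpha) / (1 - alpha)
    of a discrete distribution d; recovers Shannon entropy as alpha -> 1."""
    d = np.asarray(d, dtype=float)
    d = d / d.sum()                      # normalize counts into a distribution
    if np.isclose(alpha, 1.0):           # Shannon limit as alpha -> 1
        nz = d[d > 0]
        return float(-np.sum(nz * np.log(nz)))
    return float(np.log(np.sum(d ** alpha)) / (1.0 - alpha))

# Hypothetical visit counts over four state-action pairs from exploration rollouts.
counts = np.array([40, 30, 20, 10])
print(renyi_entropy(counts, alpha=0.5))  # ~1.33, above the Shannon value of ~1.28
```

For α < 1, rarely visited state-action pairs contribute relatively more to the objective than under Shannon entropy, which matches the paper's motivation for preferring Rényi entropy as an exploration objective in the reward-free phase.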