Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework

Authors: Chuheng Zhang, Yuanying Cai, Longbo Huang, Jian Li

AAAI 2021, pp. 10859-10867 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that our exploration algorithm is effective and sample efficient, and results in superior policies for arbitrary reward functions in the planning phase. We conduct experiments on several environments with discrete, continuous or high-dimensional state spaces.
Researcher Affiliation | Academia | Chuheng Zhang, Yuanying Cai, Longbo Huang, Jian Li; Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University
Pseudocode | Yes | Algorithm 1: Maximizing the state-action space Rényi entropy for the reward-free RL framework (a hedged sketch of the Rényi entropy objective appears after this table).
Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | We first conduct experiments on the MultiRoom environment from minigrid (Chevalier-Boisvert, Willems, and Pal 2018)... Then, we conduct experiments on a set of Atari (with image-based observations) (Machado et al. 2018) and Mujoco (Todorov, Erez, and Tassa 2012) tasks (an illustrative environment-setup sketch appears after this table).
Dataset Splits | No | The paper describes collecting samples and datasets (e.g., 'collect a dataset with 100M (5M) samples'), but it does not specify explicit training/validation/test splits (e.g., an '80/10/10 split' or exact per-split sample counts) in the main text, so the data partitioning cannot be reproduced exactly.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper does not list ancillary software dependencies with version numbers (e.g., specific library or solver versions) needed to replicate the experiments.
Experiment Setup | Yes | In the exploration phase, we run different exploration algorithms in the reward-free environment of Atari (Mujoco) for 200M (10M) steps and collect a dataset with 100M (5M) samples by executing the learned policy. More experiments and the detailed experiment settings/hyperparameters can be found in Appendix G.
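
For reference, the objective named in Algorithm 1 is built around the Rényi entropy of a state-action visitation distribution. The snippet below is a minimal sketch of that quantity for a tabular (discrete) distribution, assuming only the standard definition H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha); the function name, the example counts, and the choice alpha = 0.5 are illustrative and not taken from the paper.

```python
import numpy as np

def renyi_entropy(p, alpha=0.5, eps=1e-12):
    """Renyi entropy H_alpha(p) = log(sum_i p_i**alpha) / (1 - alpha).

    As alpha -> 1 this recovers the Shannon entropy -sum_i p_i * log(p_i).
    """
    p = np.asarray(p, dtype=np.float64)
    p = p / p.sum()                      # normalize counts to a valid distribution
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p + eps)))   # Shannon limit
    return float(np.log(np.sum(p ** alpha) + eps) / (1.0 - alpha))

# Hypothetical visitation counts over a 4-state x 2-action tabular MDP.
counts = np.array([[10.0, 2.0],
                   [5.0, 5.0],
                   [1.0, 1.0],
                   [0.0, 6.0]])
print(renyi_entropy(counts.flatten(), alpha=0.5))
```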
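
Similarly, the benchmark suites quoted under Open Datasets are standard Gym-registered environments. The snippet below is an environment-setup sketch only: the exact environment IDs, the gym/gym_minigrid versions, and the pre-0.26 step API are assumptions, since the paper does not pin software versions (see Software Dependencies above).

```python
import gym
import gym_minigrid  # registers the MiniGrid environments (Chevalier-Boisvert et al. 2018)

# Environment IDs are assumptions chosen for illustration; the paper does not list exact IDs.
multiroom = gym.make("MiniGrid-MultiRoom-N6-v0")    # discrete grid-world observations
atari = gym.make("MontezumaRevengeNoFrameskip-v4")  # high-dimensional image observations
mujoco = gym.make("HalfCheetah-v2")                 # continuous state/action spaces

obs = multiroom.reset()
for _ in range(10):
    obs, reward, done, info = multiroom.step(multiroom.action_space.sample())
    if done:
        obs = multiroom.reset()
```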