reproducibilityindex.ai

Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate

Authors: Mirco Mutti, Lorenzo Pratissoli, Marcello Restelli
Pages: 9028-9036

AAAI 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Then, we empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning meaningful reward-based tasks downstream.
Researcher Affiliation	Academia	Mirco Mutti1,2,*, Lorenzo Pratissoli1, and Marcello Restelli1; 1 Politecnico di Milano, Milan, Italy; 2 Università di Bologna, Bologna, Italy; mirco.mutti@polimi.it, lorenzo.pratissoli@mail.polimi.it, marcello.restelli@polimi.it
Pseudocode	Yes	Algorithm 1 MEPOL
Open Source Code	Yes	The implementation of MEPOL can be found at https://github.com/muttimirco/mepol.
Open Datasets	Yes	Then, we consider a set of continuous control, high-dimensional environments from the Mujoco suite (Todorov, Erez, and Tassa 2012): Ant (29D, 8D), Humanoid (47D, 20D), Hand Reach (63D, 20D).
Dataset Splits	No	The paper uses an interactive reinforcement learning setup where data is collected through trajectories, rather than fixed train/validation/test dataset splits. No explicit percentages or counts for such splits are provided.
Hardware Specification	No	The paper does not provide specific details about the hardware (e.g., CPU, GPU models, or cloud instance types) used for running the experiments.
Software Dependencies	No	The paper mentions software like the 'Mujoco suite' and 'scikit-learn' but does not provide specific version numbers for any software dependencies required to replicate the experiments.
Experiment Setup	Yes	Algorithm 1 MEPOL, Inputs: exploration horizon T, sample-size N, trust-region threshold δ, learning rate α, nearest neighbors k
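
The "nearest neighbors k" input listed above refers to the non-parametric state entropy estimate named in the paper title. As a rough illustration only, the sketch below implements the standard Kozachenko-Leonenko k-nearest-neighbor entropy estimator on a batch of sampled states. It is not the MEPOL implementation: the paper's own estimator and its policy-gradient update are not reproduced here, and the names `knn_entropy` and `digamma_int` are hypothetical helpers introduced for this example.

```python
import math
import numpy as np

EULER_GAMMA = 0.5772156649015329


def digamma_int(n: int) -> float:
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def knn_entropy(states: np.ndarray, k: int = 4) -> float:
    """Kozachenko-Leonenko estimate of differential entropy (in nats).

    states: array of shape (N, d), one sampled state per row.
    k: which nearest neighbor to use (hypothetical default, not from the paper).
    Assumes no duplicate rows; a zero k-th neighbor distance would give log(0).
    """
    n, d = states.shape
    # Brute-force pairwise Euclidean distances, shape (N, N).
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the self-distance (zero),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    # Log-volume of the d-dimensional unit ball: pi^(d/2) / Gamma(d/2 + 1).
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    return (digamma_int(n) - digamma_int(k) + log_unit_ball
            + d * float(np.mean(np.log(kth))))


# Usage: states spread over a larger region yield a larger estimate.
rng = np.random.default_rng(0)
narrow = rng.uniform(0.0, 1.0, size=(500, 2))
wide = 2.0 * narrow
print(knn_entropy(narrow, k=4), knn_entropy(wide, k=4))
```

The brute-force distance matrix keeps the sketch dependency-free but scales quadratically with the sample size N; for larger batches a nearest-neighbor index (for example the one in scikit-learn, which the paper mentions) would be the more practical choice.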