Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate

Authors: Mirco Mutti, Lorenzo Pratissoli, Marcello Restelli

AAAI 2021, pp. 9028-9036

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Then, we empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning meaningful reward-based tasks downstream.
Researcher Affiliation | Academia | Mirco Mutti (1,2,*), Lorenzo Pratissoli (1), and Marcello Restelli (1); 1: Politecnico di Milano, Milan, Italy; 2: Università di Bologna, Bologna, Italy. mirco.mutti@polimi.it, lorenzo.pratissoli@mail.polimi.it, marcello.restelli@polimi.it
Pseudocode | Yes | Algorithm 1 MEPOL
Open Source Code | Yes | The implementation of MEPOL can be found at https://github.com/muttimirco/mepol.
Open Datasets | Yes | Then, we consider a set of continuous control, high-dimensional environments from the Mujoco suite (Todorov, Erez, and Tassa 2012): Ant (29D, 8D), Humanoid (47D, 20D), Hand Reach (63D, 20D).
Dataset Splits | No | The paper uses an interactive reinforcement learning setup where data is collected through trajectories, rather than fixed train/validation/test dataset splits. No explicit percentages or counts for such splits are provided.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions software like the 'Mujoco suite' and 'scikit-learn' but does not provide specific version numbers for any software dependencies required to replicate the experiments.
Experiment Setup | Yes | Algorithm 1 MEPOL, Inputs: exploration horizon T, sample-size N, trust-region threshold δ, learning rate α, nearest neighbors k
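
For context on the Experiment Setup row: the paper's objective is a non-parametric (k-nearest-neighbor) estimate of the state entropy computed from N sampled states, and Algorithm 1 ascends its policy gradient with learning rate α under a trust-region threshold δ, using trajectories of horizon T. The snippet below is a minimal, illustrative sketch of such a k-nearest-neighbor (Kozachenko-Leonenko-style) entropy estimate built on scikit-learn, which the paper lists among its tools; the function name `knn_entropy_estimate` and all implementation details are assumptions for illustration, not the authors' estimator, whose exact form is in the linked repository.

```python
# Minimal, illustrative sketch (assumed names; not the authors' implementation):
# a Kozachenko-Leonenko-style k-nearest-neighbor entropy estimate over a batch
# of sampled states, using scikit-learn as mentioned in the paper.
import numpy as np
from scipy.special import digamma, gammaln
from sklearn.neighbors import NearestNeighbors

def knn_entropy_estimate(states: np.ndarray, k: int = 4) -> float:
    """Estimate differential entropy from N sampled states of dimension d,
    using each sample's distance to its k-th nearest neighbor."""
    n, d = states.shape
    # Query k + 1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(states)
    distances, _ = nn.kneighbors(states)
    eps = distances[:, -1]  # distance to the k-th (non-self) neighbor
    # Log-volume of the unit d-ball: ln(pi^(d/2) / Gamma(d/2 + 1)).
    log_unit_ball = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)
    # Kozachenko-Leonenko estimator of differential entropy (in nats).
    return float(digamma(n) - digamma(k) + log_unit_ball
                 + d * np.mean(np.log(np.maximum(eps, 1e-12))))

# Toy check: for states drawn uniformly on the unit cube, the true entropy is
# 0 nats, so the estimate should be close to zero.
states = np.random.rand(500, 8)
print(knn_entropy_estimate(states, k=4))
```

In the paper's Algorithm 1, an estimate of this kind is the quantity whose policy gradient is followed: batches of N states are collected with the current policy over horizon T, and gradient ascent steps with learning rate α are taken as long as the trust-region threshold δ is respected.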