Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate
Authors: Mirco Mutti, Lorenzo Pratissoli, Marcello Restelli
AAAI 2021, pp. 9028-9036
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Then, we empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning meaningful reward-based tasks downstream. |
| Researcher Affiliation | Academia | Mirco Mutti1,2,*, Lorenzo Pratissoli1, and Marcello Restelli1. 1 Politecnico di Milano, Milan, Italy; 2 Università di Bologna, Bologna, Italy. mirco.mutti@polimi.it, lorenzo.pratissoli@mail.polimi.it, marcello.restelli@polimi.it |
| Pseudocode | Yes | Algorithm 1 MEPOL |
| Open Source Code | Yes | The implementation of MEPOL can be found at https://github.com/muttimirco/mepol. |
| Open Datasets | Yes | Then, we consider a set of continuous control, high-dimensional environments from the Mujoco suite (Todorov, Erez, and Tassa 2012): Ant (29D, 8D), Humanoid (47D, 20D), Hand Reach (63D, 20D). |
| Dataset Splits | No | The paper uses an interactive reinforcement learning setup where data is collected through trajectories, rather than fixed train/validation/test dataset splits. No explicit percentages or counts for such splits are provided. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions software like the 'Mujoco suite' and 'scikit-learn' but does not provide specific version numbers for any software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | Algorithm 1 MEPOL, Inputs: exploration horizon T, sample-size N, trust-region threshold δ, learning rate α, nearest neighbors k |
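
For context on the Pseudocode and Experiment Setup rows above: MEPOL performs policy-gradient ascent on a non-parametric, k-nearest-neighbor estimate of the state entropy computed from sampled states, with `k` among the inputs of Algorithm 1. The snippet below is a minimal sketch (not the authors' implementation) of a Kozachenko-Leonenko-style k-NN entropy estimate of that kind, built on numpy, scipy, and scikit-learn; the function name `knn_entropy_estimate` and the exact bias-correction terms are illustrative assumptions, and the paper's actual objective additionally involves importance weighting and a trust-region (δ-constrained) update not shown here.

```python
# Minimal sketch, not the official MEPOL code: a Kozachenko-Leonenko-style
# k-nearest-neighbor entropy estimate over a batch of sampled states, i.e. the
# kind of non-parametric state entropy estimate that MEPOL's policy gradient
# ascends. Names and constants here are illustrative assumptions.
import numpy as np
from scipy.special import digamma, gammaln
from sklearn.neighbors import NearestNeighbors


def knn_entropy_estimate(states: np.ndarray, k: int = 4) -> float:
    """k-NN (Kozachenko-Leonenko) entropy estimate of the state distribution.

    states: (N, d) array of states collected from trajectories of length T.
    k:      number of nearest neighbors (the `k` in Algorithm 1's inputs).
    """
    n, d = states.shape
    # Distance from each sample to its k-th nearest neighbor (excluding itself:
    # the first returned neighbor is the point at distance zero).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(states)
    dists, _ = nn.kneighbors(states)
    r_k = dists[:, -1]
    # Log-volume of the unit ball in R^d: pi^(d/2) / Gamma(d/2 + 1).
    log_c_d = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)
    # Standard Kozachenko-Leonenko form; the small constant guards log(0)
    # when duplicate states appear in the batch.
    return float(digamma(n) - digamma(k) + log_c_d + d * np.mean(np.log(r_k + 1e-12)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A spread-out batch of synthetic "states" should score higher entropy
    # than a tightly clustered one.
    spread = rng.uniform(-1.0, 1.0, size=(500, 4))
    clustered = rng.normal(0.0, 0.05, size=(500, 4))
    print("spread   :", knn_entropy_estimate(spread, k=4))
    print("clustered:", knn_entropy_estimate(clustered, k=4))
```

In the paper's setting, such an estimate would be recomputed from the N samples gathered with exploration horizon T at every iteration, and the policy parameters updated with learning rate α subject to the trust-region threshold δ; for the authors' exact objective and update, see the repository linked in the Open Source Code row.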