Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Task-Agnostic Exploration via Policy Gradient of a Non-Parametric State Entropy Estimate
Authors: Mirco Mutti, Lorenzo Pratissoli, Marcello Restelli (pp. 9028-9036)
AAAI 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Then, we empirically show that MEPOL allows learning a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and how this policy facilitates learning meaningful reward-based tasks downstream. |
| Researcher Affiliation | Academia | Mirco Mutti1,2,*, Lorenzo Pratissoli1, and Marcello Restelli1 1 Politecnico di Milano, Milan, Italy 2 Università di Bologna, Bologna, Italy EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 MEPOL |
| Open Source Code | Yes | The implementation of MEPOL can be found at https://github.com/muttimirco/mepol. |
| Open Datasets | Yes | Then, we consider a set of continuous control, high-dimensional environments from the Mujoco suite (Todorov, Erez, and Tassa 2012): Ant (29D, 8D), Humanoid (47D, 20D), Hand Reach (63D, 20D). |
| Dataset Splits | No | The paper uses an interactive reinforcement learning setup where data is collected through trajectories, rather than fixed train/validation/test dataset splits. No explicit percentages or counts for such splits are provided. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions software like the 'Mujoco suite' and 'scikit-learn' but does not provide specific version numbers for any software dependencies required to replicate the experiments. |
| Experiment Setup | Yes | Algorithm 1 MEPOL, Inputs: exploration horizon T, sample-size N, trust-region threshold δ, learning rate α, nearest neighbors k |
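
The title and the Algorithm 1 inputs refer to a non-parametric state entropy estimate based on k nearest neighbors. As a rough illustration of what such an estimate looks like, the sketch below implements the generic Kozachenko-Leonenko k-NN entropy estimator with NumPy. It is an assumption-laden stand-in, not the MEPOL estimator from the paper or the linked repository: the function names `digamma_int` and `knn_entropy`, the brute-force distance computation, and the default `k=4` are all hypothetical choices made for this example.

```python
import math

import numpy as np


def digamma_int(n: int) -> float:
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -np.euler_gamma + sum(1.0 / j for j in range(1, n))


def knn_entropy(states: np.ndarray, k: int = 4) -> float:
    """Kozachenko-Leonenko entropy estimate (in nats) from an (N, d) sample.

    Hypothetical helper for illustration only. Assumes N > k and no
    duplicate states (a zero k-th neighbor distance would give log(0)).
    """
    n, d = states.shape
    # Brute-force pairwise Euclidean distances, shape (N, N).
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    # Log-volume of the unit ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1).
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    return float(
        digamma_int(n) - digamma_int(k) + log_unit_ball + d * np.mean(np.log(kth))
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    narrow = rng.normal(scale=1.0, size=(500, 2))  # stand-in for visited states
    wide = 3.0 * narrow                            # same states, spread out more
    print("narrow:", knn_entropy(narrow))
    print("wide:  ", knn_entropy(wide))
```

Spreading the same sample out by a constant factor raises the estimate by the dimension times the log of that factor, which is the sense in which a higher estimate corresponds to broader state coverage.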