Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Unsupervised Reinforcement Learning in Multiple Environments
Authors: Mirco Mutti, Mattia Mancassola, Marcello Restelli | Pages 7850-7858
AAAI 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6 Empirical Evaluation We provide an extensive empirical evaluation of the proposed methodology over the two-phase learning process described in Figure 1, which is organized as follows: 6.1 We show the ability of our method in pre-training an exploration policy in a class of continuous gridworlds, emphasizing the importance of the percentile sensitivity; 6.2 We discuss how the choice of the percentile of interest affects the exploration strategy; 6.3 We highlight the benefit that the pre-trained strategy provides to the supervised fine-tuning on the same class; |
| Researcher Affiliation | Academia | Mirco Mutti1,2,*, Mattia Mancassola1, and Marcello Restelli1 1 Politecnico di Milano, Milan, Italy 2 Università di Bologna, Bologna, Italy |
| Pseudocode | Yes | Algorithm 1 provides the pseudocode of αMEPOL. |
| Open Source Code | Yes | The αMEPOL algorithm is implemented at https://github.com/muttimirco/alphamepol. |
| Open Datasets | Yes | Mini Grid (Chevalier-Boisvert, Willems, and Pal 2018) environments |
| Dataset Splits | No | The paper describes various environments (Grid World, Ant, Mini Grid) and experimental settings, but does not specify explicit training/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as exact GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We compare the performance of the optimal exploration strategy obtained by running αMEPOL (α = 0.2) and MEPOL for 150 epochs on the Grid World with Slope class ($p_M = [0.8, 0.2]$)... αMEPOL (α = 0.2) against MEPOL on the exploration performance $E^{1}_{M}$ achieved after 500 epochs. The algorithm operates as a typical policy gradient approach (Deisenroth, Neumann, and Peters 2013). It directly searches for an optimal policy by navigating a set of parametric differentiable policies $\Pi_\Theta := \{\pi_\theta : \theta \in \Theta \subseteq \mathbb{R}^n\}$. It does so by repeatedly updating the parameters $\theta$ in the gradient direction, until a stationary point is reached. This update has the form $\theta' = \theta + \beta \nabla_\theta E^{\alpha}_{M}(\pi_\theta)$, where $\beta$ is a learning rate. We employ a principled k-Nearest Neighbors (k-NN) entropy estimator (Singh et al. 2003) of the form [formula truncated in extraction; its recoverable terms are a sum over $t = 0, \ldots, T$ of logarithms involving $k$, $\Gamma(\cdot)$, and the distance between $s_{t,\tau_i}$ and its k-th nearest neighbor $s^{k\text{-NN}}_{t,\tau_i}$]. |
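
The Experiment Setup excerpt mentions two ingredients: a gradient-ascent parameter update repeated until a stationary point, and a k-NN entropy estimator. The sketch below illustrates both in plain Python under stated assumptions. It is not the αMEPOL implementation: `gradient_ascent`, `digamma_int`, and `knn_entropy` are hypothetical names, the objective is a toy concave function rather than the paper's percentile-sensitive objective, and the estimator is the textbook Kozachenko-Leonenko-style k-NN form on i.i.d. samples, not the exact expression used in the paper (which is truncated in the excerpt above).

```python
import math
import random


def gradient_ascent(grad, theta, beta=0.1, tol=1e-8, max_steps=10_000):
    """Repeat theta <- theta + beta * grad(theta) until the gradient is nearly zero."""
    for _ in range(max_steps):
        g = grad(theta)
        if max(abs(gi) for gi in g) < tol:  # (approximate) stationary point
            break
        theta = [t + beta * gi for t, gi in zip(theta, g)]
    return theta


def digamma_int(k):
    """Digamma function at a positive integer k: -gamma + sum_{j=1}^{k-1} 1/j."""
    euler_gamma = 0.5772156649015329
    return -euler_gamma + sum(1.0 / j for j in range(1, k))


def knn_entropy(points, k=4):
    """Textbook k-NN differential entropy estimate (in nats) for p-dimensional samples.

    Assumes the points are distinct and that len(points) > k.
    """
    n, p = len(points), len(points[0])
    # Log-volume of the p-dimensional unit ball: pi^(p/2) / Gamma(p/2 + 1).
    log_unit_ball = (p / 2) * math.log(math.pi) - math.lgamma(p / 2 + 1)
    total_log_radius = 0.0
    for i, x in enumerate(points):
        # Brute-force distances to all other points; fine for small n.
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        total_log_radius += math.log(dists[k - 1])  # distance to the k-th neighbor
    return p * total_log_radius / n + math.log(n) + log_unit_ball - digamma_int(k)


# Toy usage: maximize f(theta) = -(theta - 3)^2, whose gradient is -2 * (theta - 3).
theta_star = gradient_ascent(lambda th: [-2.0 * (th[0] - 3.0)], [0.0])

# Toy usage: entropy of samples drawn uniformly from the unit square (true value: 0 nats).
rng = random.Random(0)
samples = [(rng.random(), rng.random()) for _ in range(300)]
entropy_estimate = knn_entropy(samples)
```

A useful sanity check for any estimator of this kind is scale equivariance: multiplying every sample by a constant c in p dimensions should shift the estimate by p·log c, because every neighbor distance scales by c.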