Unsupervised Reinforcement Learning in Multiple Environments
Authors: Mirco Mutti, Mattia Mancassola, Marcello Restelli (pp. 7850-7858)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Empirical Evaluation): We provide an extensive empirical evaluation of the proposed methodology over the two-phase learning process described in Figure 1, which is organized as follows: (6.1) we show the ability of our method in pre-training an exploration policy in a class of continuous gridworlds, emphasizing the importance of the percentile sensitivity; (6.2) we discuss how the choice of the percentile of interest affects the exploration strategy; (6.3) we highlight the benefit that the pre-trained strategy provides to the supervised fine-tuning on the same class. |
| Researcher Affiliation | Academia | Mirco Mutti (1,2,*), Mattia Mancassola (1), and Marcello Restelli (1); 1: Politecnico di Milano, Milan, Italy; 2: Università di Bologna, Bologna, Italy |
| Pseudocode | Yes | Algorithm 1 provides the pseudocode of αMEPOL. |
| Open Source Code | Yes | The αMEPOL algorithm is implemented at https://github.com/muttimirco/alphamepol. |
| Open Datasets | Yes | Mini Grid (Chevalier-Boisvert, Willems, and Pal 2018) environments |
| Dataset Splits | No | The paper describes various environments (Grid World, Ant, Mini Grid) and experimental settings, but does not specify explicit training/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as exact GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We compare the performance of the optimal exploration strategy obtained by running αMEPOL (α = 0.2) and MEPOL for 150 epochs on the Grid World with Slope class ($p_{\mathcal{M}} = [0.8, 0.2]$)... αMEPOL (α = 0.2) against MEPOL on the exploration performance $\mathcal{E}^{1}_{\mathcal{M}}$ achieved after 500 epochs. The algorithm operates as a typical policy-gradient approach (Deisenroth, Neumann, and Peters 2013). It directly searches for an optimal policy by navigating a set of parametric differentiable policies $\Pi_{\Theta} := \{\pi_{\theta} : \theta \in \Theta \subseteq \mathbb{R}^{n}\}$. It does so by repeatedly updating the parameters θ in the gradient direction, until a stationary point is reached. This update has the form $\theta' = \theta + \beta \nabla_{\theta} \mathcal{E}^{\alpha}_{\mathcal{M}}(\pi_{\theta})$, where β is a learning rate. We employ a principled k-Nearest Neighbors (k-NN) entropy estimator (Singh et al. 2003) of the form $\hat{H}(\tau_i) = -\frac{1}{T+1} \sum_{t=0}^{T} \log \frac{k\, \Gamma(p/2 + 1)}{(T+1)\, \pi^{p/2}\, \lVert s_{t,\tau_i} - s^{k\text{-NN}}_{t,\tau_i} \rVert_{2}^{p}} + \log k - \Psi(k)$. (Illustrative sketches of the entropy estimator and of the percentile-sensitive update follow the table.) |
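
The k-NN entropy estimator quoted in the Experiment Setup row can be illustrated with a short sketch. The snippet below is a minimal, self-contained approximation of the Singh et al. (2003) estimator applied to the states of a single trajectory; it is not the implementation from the alphamepol repository, and the function name `knn_entropy_estimate` is ours.

```python
import numpy as np
from scipy.special import gamma, digamma


def knn_entropy_estimate(states: np.ndarray, k: int = 4) -> float:
    """Differential-entropy estimate of the empirical state distribution.

    states: array of shape (T + 1, p) with the states visited along one trajectory.
    k:      number of nearest neighbors used in the local density estimate.
    """
    n, p = states.shape
    # Pairwise Euclidean distances between visited states.
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    # Distance from each state to its k-th nearest neighbor (column 0 of the
    # sorted distances is the zero distance to itself, so index k skips it).
    knn_dist = np.sort(dists, axis=1)[:, k]
    # Volume of the p-dimensional ball with that radius, clamped to avoid
    # log(0) when duplicate states occur: pi^(p/2) / Gamma(p/2 + 1) * r^p.
    ball_volume = (np.pi ** (p / 2) / gamma(p / 2 + 1)) * np.maximum(knn_dist, 1e-12) ** p
    # Singh et al. (2003) form: -mean(log(k / (n * V))) + log(k) - psi(k).
    return float(-np.mean(np.log(k / (n * ball_volume))) + np.log(k) - digamma(k))
```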
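
The percentile sensitivity can likewise be sketched. Assuming per-trajectory entropy estimates (e.g., from `knn_entropy_estimate` above) and the summed log-probabilities of each trajectory under the current policy, the surrogate below keeps only the worst α-fraction of trajectories and builds a score-function loss whose gradient step increases their mean entropy. This is a deliberately simplified stand-in for the update $\theta' = \theta + \beta \nabla_{\theta} \mathcal{E}^{\alpha}_{\mathcal{M}}(\pi_{\theta})$ quoted above, not the authors' gradient estimator, and all names are illustrative.

```python
import torch


def alpha_percentile_surrogate(entropies, traj_log_probs, alpha=0.2):
    """Surrogate loss for the lower alpha-percentile of per-trajectory entropies.

    entropies:      list of floats, one k-NN entropy estimate per trajectory.
    traj_log_probs: list of scalar tensors, each the sum of log pi_theta(a_t | s_t)
                    over one trajectory (so gradients flow to the policy parameters).
    alpha:          fraction of worst-performing trajectories to optimize for.
    """
    # Indices of trajectories sorted from lowest to highest entropy estimate.
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    tail = order[:max(1, int(alpha * len(entropies)))]
    # REINFORCE-style surrogate: minimizing this loss ascends the mean entropy
    # of the worst alpha-fraction of trajectories.
    return -torch.stack([traj_log_probs[i] * entropies[i] for i in tail]).mean()
```

In a pre-training epoch one would collect a batch of trajectories from environments sampled from the class, compute the two lists, and run `alpha_percentile_surrogate(...).backward()` followed by an optimizer step. Setting α = 1 makes the tail cover every trajectory, recovering a risk-neutral MEPOL-style objective.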