Unsupervised Reinforcement Learning in Multiple Environments

Authors: Mirco Mutti, Mattia Mancassola, Marcello Restelli

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 6 (Empirical Evaluation): We provide an extensive empirical evaluation of the proposed methodology over the two-phase learning process described in Figure 1, which is organized as follows: (6.1) we show the ability of our method in pre-training an exploration policy in a class of continuous gridworlds, emphasizing the importance of the percentile sensitivity; (6.2) we discuss how the choice of the percentile of interest affects the exploration strategy; (6.3) we highlight the benefit that the pre-trained strategy provides to the supervised fine-tuning on the same class.
Researcher Affiliation | Academia | Mirco Mutti (1,2,*), Mattia Mancassola (1), and Marcello Restelli (1); 1 Politecnico di Milano, Milan, Italy; 2 Università di Bologna, Bologna, Italy
Pseudocode | Yes | Algorithm 1 provides the pseudocode of αMEPOL.
Open Source Code | Yes | The αMEPOL algorithm is implemented at https://github.com/muttimirco/alphamepol.
Open Datasets | Yes | MiniGrid (Chevalier-Boisvert, Willems, and Pal 2018) environments
Dataset Splits | No | The paper describes various environments (Grid World, Ant, MiniGrid) and experimental settings, but does not specify explicit training/validation/test dataset splits with percentages or counts.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as exact GPU or CPU models.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | We compare the performance of the optimal exploration strategy obtained by running αMEPOL (α = 0.2) and MEPOL for 150 epochs on the Grid World with Slope class (p_M = [0.8, 0.2])... αMEPOL (α = 0.2) against MEPOL on the exploration performance E^1_M achieved after 500 epochs. The algorithm operates as a typical policy gradient approach (Deisenroth, Neumann, and Peters 2013). It directly searches for an optimal policy by navigating a set of parametric differentiable policies Π_Θ := {π_θ : θ ∈ Θ ⊆ R^n}. It does so by repeatedly updating the parameters θ in the gradient direction, until a stationary point is reached. This update has the form θ' = θ + β ∇_θ E^α_M(π_θ), where β is a learning rate. We employ a principled k-Nearest Neighbors (k-NN) entropy estimator (Singh et al. 2003) of the form −Σ_{t=0}^{T} log [ k Γ(p/2 + 1) / ((T + 1) ‖s_{t,τ_i} − s^{k-NN}_{t,τ_i}‖^p π^{p/2}) ], up to normalization and bias-correction terms.
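
For concreteness, the sketch below illustrates the two ingredients quoted above: a Singh et al. (2003)-style k-NN entropy estimate computed from the states of a single trajectory, and the percentile-sensitive objective E^α formed as the mean entropy over the worst α-fraction of trajectories (α = 0.2 in the experiments, α = 1 recovering the plain average used by MEPOL). This is a minimal NumPy sketch, not the released αMEPOL implementation; the function names, the synthetic data, and the omission of the estimator's bias-correction terms are assumptions made here.

```python
# Illustrative sketch (not the authors' implementation) of a k-NN entropy
# estimate per trajectory and the percentile-sensitive objective E^alpha.
import numpy as np
from math import gamma, pi


def knn_entropy(states: np.ndarray, k: int = 3) -> float:
    """k-NN entropy estimate over the states of a single trajectory.

    states: array of shape (T + 1, p) holding the visited states.
    """
    n, p = states.shape
    # Pairwise Euclidean distances between all visited states.
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    # Distance of each state to its k-th nearest neighbour (column 0 is self).
    knn_dist = np.sort(dists, axis=1)[:, k]
    # Volume of the unit ball in R^p: pi^(p/2) / Gamma(p/2 + 1).
    unit_ball = pi ** (p / 2) / gamma(p / 2 + 1)
    # Entropy ~ average negative log of the local k-NN density estimate
    # (bias-correction terms omitted for brevity).
    density = k / (n * unit_ball * knn_dist ** p + 1e-12)
    return float(np.mean(-np.log(density)))


def percentile_objective(traj_entropies: np.ndarray, alpha: float = 0.2) -> float:
    """CVaR-style objective: mean entropy over the worst alpha-fraction of
    trajectories (alpha = 1 gives the plain average over all trajectories)."""
    m = max(1, int(np.ceil(alpha * len(traj_entropies))))
    worst = np.sort(traj_entropies)[:m]
    return float(worst.mean())


# Usage: 10 synthetic trajectories of 50 two-dimensional states each.
rng = np.random.default_rng(0)
trajectories = [rng.normal(size=(50, 2)) for _ in range(10)]
entropies = np.array([knn_entropy(s) for s in trajectories])
print("E^1   (mean entropy):     ", entropies.mean())
print("E^0.2 (worst 20% entropy):", percentile_objective(entropies, 0.2))
```

In the full algorithm this objective is differentiated with respect to the policy parameters to perform the update θ' = θ + β ∇_θ E^α_M(π_θ); the sketch only shows how the objective value is formed from sampled trajectories.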