Unsupervised Reinforcement Learning in Multiple Environments
Authors: Mirco Mutti, Mattia Mancassola, Marcello Restelli (pp. 7850-7858)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Empirical Evaluation): We provide an extensive empirical evaluation of the proposed methodology over the two-phase learning process described in Figure 1, which is organized as follows: (6.1) we show the ability of our method in pre-training an exploration policy in a class of continuous gridworlds, emphasizing the importance of the percentile sensitivity; (6.2) we discuss how the choice of the percentile of interest affects the exploration strategy; (6.3) we highlight the benefit that the pre-trained strategy provides to the supervised fine-tuning on the same class. |
| Researcher Affiliation | Academia | Mirco Mutti (1,2,*), Mattia Mancassola (1), and Marcello Restelli (1); 1: Politecnico di Milano, Milan, Italy; 2: Università di Bologna, Bologna, Italy |
| Pseudocode | Yes | Algorithm 1 provides the pseudocode of αMEPOL. |
| Open Source Code | Yes | The αMEPOL algorithm is implemented at https://github.com/muttimirco/alphamepol. |
| Open Datasets | Yes | Mini Grid (Chevalier-Boisvert, Willems, and Pal 2018) environments |
| Dataset Splits | No | The paper describes various environments (Grid World, Ant, Mini Grid) and experimental settings, but does not specify explicit training/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as exact GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We compare the performance of the optimal exploration strategy obtained by running αMEPOL (α = 0.2) and MEPOL for 150 epochs on the Grid World with Slope class ($p_{\mathcal{M}} = [0.8, 0.2]$)... αMEPOL (α = 0.2) against MEPOL on the exploration performance $\mathcal{E}^{1}_{\mathcal{M}}$ achieved after 500 epochs. The algorithm operates as a typical policy-gradient approach (Deisenroth, Neumann, and Peters 2013). It directly searches for an optimal policy by navigating a set of parametric differentiable policies $\Pi_{\Theta} := \{\pi_{\theta} : \theta \in \Theta \subseteq \mathbb{R}^{n}\}$. It does so by repeatedly updating the parameters θ in the gradient direction, until a stationary point is reached. This update has the form $\theta' = \theta + \beta \nabla_{\theta} \mathcal{E}^{\alpha}_{\mathcal{M}}(\pi_{\theta})$, where β is a learning rate. We employ a principled k-Nearest Neighbors (k-NN) entropy estimator (Singh et al. 2003) of the form $\hat{H}(\tau_i) = -\frac{1}{T+1} \sum_{t=0}^{T} \log \frac{k\, \Gamma(p/2 + 1)}{(T+1)\, \pi^{p/2}\, \lVert s_{t,\tau_i} - s^{k\text{-NN}}_{t,\tau_i} \rVert_{2}^{p}} + \log k - \Psi(k)$. (Illustrative sketches of the entropy estimator and of the percentile-sensitive update follow the table.) |
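
The k-NN entropy estimator quoted in the Experiment Setup row can be illustrated with a short sketch. The snippet below is a minimal, self-contained approximation of the Singh et al. (2003) estimator applied to the states of a single trajectory; it is not the implementation from the alphamepol repository, and the function name `knn_entropy_estimate` is ours.

```python
import numpy as np
from scipy.special import gamma, digamma


def knn_entropy_estimate(states: np.ndarray, k: int = 4) -> float:
    """Differential-entropy estimate of the empirical state distribution.

    states: array of shape (T + 1, p) with the states visited along one trajectory.
    k:      number of nearest neighbors used in the local density estimate.
    """
    n, p = states.shape
    # Pairwise Euclidean distances between visited states.
    dists = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    # Distance from each state to its k-th nearest neighbor (column 0 of the
    # sorted distances is the zero distance to itself, so index k skips it).
    knn_dist = np.sort(dists, axis=1)[:, k]
    # Volume of the p-dimensional ball with that radius, clamped to avoid
    # log(0) when duplicate states occur: pi^(p/2) / Gamma(p/2 + 1) * r^p.
    ball_volume = (np.pi ** (p / 2) / gamma(p / 2 + 1)) * np.maximum(knn_dist, 1e-12) ** p
    # Singh et al. (2003) form: -mean(log(k / (n * V))) + log(k) - psi(k).
    return float(-np.mean(np.log(k / (n * ball_volume))) + np.log(k) - digamma(k))
```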
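
The percentile sensitivity can likewise be sketched. Assuming per-trajectory entropy estimates (e.g., from `knn_entropy_estimate` above) and the summed log-probabilities of each trajectory under the current policy, the surrogate below keeps only the worst α-fraction of trajectories and builds a score-function loss whose gradient step increases their mean entropy. This is a deliberately simplified stand-in for the update $\theta' = \theta + \beta \nabla_{\theta} \mathcal{E}^{\alpha}_{\mathcal{M}}(\pi_{\theta})$ quoted above, not the authors' gradient estimator, and all names are illustrative.

```python
import torch


def alpha_percentile_surrogate(entropies, traj_log_probs, alpha=0.2):
    """Surrogate loss for the lower alpha-percentile of per-trajectory entropies.

    entropies:      list of floats, one k-NN entropy estimate per trajectory.
    traj_log_probs: list of scalar tensors, each the sum of log pi_theta(a_t | s_t)
                    over one trajectory (so gradients flow to the policy parameters).
    alpha:          fraction of worst-performing trajectories to optimize for.
    """
    # Indices of trajectories sorted from lowest to highest entropy estimate.
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    tail = order[:max(1, int(alpha * len(entropies)))]
    # REINFORCE-style surrogate: minimizing this loss ascends the mean entropy
    # of the worst alpha-fraction of trajectories.
    return -torch.stack([traj_log_probs[i] * entropies[i] for i in tail]).mean()
```

In a pre-training epoch one would collect a batch of trajectories from environments sampled from the class, compute the two lists, and run `alpha_percentile_surrogate(...).backward()` followed by an optimizer step. Setting α = 1 makes the tail cover every trajectory, recovering a risk-neutral MEPOL-style objective.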