Policy Optimization as Online Learning with Mediator Feedback

Authors: Alberto Maria Metelli, Matteo Papini, Pierluca D'Oro, Marcello Restelli

AAAI 2021, pp. 8958-8966

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we provide numerical simulations on finite and compact policy spaces, in comparison with PO and bandit baselines. We present the numerical simulations, starting with an illustrative example and then moving to RL benchmarks.
Researcher Affiliation | Academia | Alberto Maria Metelli*, Matteo Papini*, Pierluca D'Oro, Marcello Restelli. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133, Milano, Italy. {albertomaria.metelli, matteo.papini, marcello.restelli}@polimi.it, pierluca.doro@mail.polimi.it
Pseudocode | Yes | Algorithm 1: OPTIMIST; Algorithm 2: RANDOMIST
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository. The only external link provided is to an arXiv preprint of the extended paper.
Open Datasets | Yes | Linear Quadratic Gaussian Regulator (LQG; Curtain 1997); Mountain Car environment (Sutton and Barto 2018)
Dataset Splits | No | The paper utilizes reinforcement learning environments (LQG and Mountain Car) where data is generated through interaction, rather than pre-defined static datasets with explicit train/validation/test splits. Therefore, it does not provide dataset split information.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies, such as library names with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1'), needed to replicate the experiments.
Experiment Setup | Yes | For the RL experiments, similarly to Papini et al. (2019), the evaluation is carried out in the parameter-based PO setting (Sehnke et al. 2008), where the policy parameters θ are sampled from a hyperpolicy ν_ξ and the optimization is performed in the space of hyperparameters Ξ (Appendix A). This setting is particularly convenient since the Rényi divergence between hyperpolicies can be computed exactly (at least for Gaussians). Details and an additional experiment on the Cartpole domain are reported in Appendix F. We consider the monodimensional case and a Gaussian hyperpolicy ν_ξ = N(ξ, 0.15²), where ξ is the learned parameter. [...] taking 10 steps of the Metropolis-Hastings algorithm (Owen 2013) with Gaussian proposal q_m = N(θ_m, diag(0.15, 3)²).
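
To make the quoted setup concrete, here is a minimal Python/NumPy sketch, not the authors' implementation (which is given as Algorithms 1 and 2 in the paper). It only illustrates drawing policy parameters from the Gaussian hyperpolicy ν_ξ = N(ξ, 0.15²) and taking 10 Metropolis-Hastings steps with the Gaussian proposal q_m = N(θ_m, diag(0.15, 3)²); the function names and the stand-in target log-density are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_hyperpolicy(xi, sigma=0.15):
    # Draw policy parameters theta ~ N(xi, sigma^2), the monodimensional
    # Gaussian hyperpolicy quoted in the Experiment Setup row.
    return rng.normal(loc=xi, scale=sigma)

def metropolis_hastings(target_logpdf, theta0, n_steps=10,
                        proposal_scales=(0.15, 3.0)):
    # Run n_steps of Metropolis-Hastings with the Gaussian proposal
    # q_m = N(theta_m, diag(0.15, 3)^2) quoted above. `theta0` is assumed
    # two-dimensional to match the diagonal proposal covariance.
    theta = np.asarray(theta0, dtype=float)
    logp = target_logpdf(theta)
    scales = np.asarray(proposal_scales, dtype=float)
    for _ in range(n_steps):
        proposal = theta + rng.normal(scale=scales, size=theta.shape)
        logp_prop = target_logpdf(proposal)
        # The proposal is symmetric, so the acceptance test reduces to
        # the ratio of target densities.
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = proposal, logp_prop
    return theta

# Toy usage with a hypothetical quadratic log-density standing in for the
# (unspecified) target distribution that the paper's sampler draws from.
theta_sample = metropolis_hastings(lambda th: -np.sum(th ** 2),
                                   theta0=np.zeros(2))
```

Because the Gaussian proposal is symmetric, the acceptance step only needs evaluations of the target log-density, which is why the sketch above never has to compute the proposal densities explicitly.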