Policy Optimization as Online Learning with Mediator Feedback
Authors: Alberto Maria Metelli, Matteo Papini, Pierluca D'Oro, Marcello Restelli
AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide numerical simulations on finite and compact policy spaces, in comparison with PO and bandit baselines. We present the numerical simulations, starting with an illustrative example and then moving to RL benchmarks. |
| Researcher Affiliation | Academia | Alberto Maria Metelli*, Matteo Papini*, Pierluca D'Oro, Marcello Restelli Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano Piazza Leonardo da Vinci, 32, 20133, Milano, Italy {albertomaria.metelli, matteo.papini, marcello.restelli}@polimi.it, pierluca.doro@mail.polimi.it |
| Pseudocode | Yes | Algorithm 1 OPTIMIST; Algorithm 2 RANDOMIST |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a direct link to a code repository. The only external link provided is to an arXiv preprint of the extended paper. |
| Open Datasets | Yes | Linear Quadratic Gaussian Regulator (LQG, Curtain 1997); Mountain Car environment (Sutton and Barto 2018) |
| Dataset Splits | No | The paper utilizes reinforcement learning environments (LQG and Mountain Car) where data is generated through interaction, rather than using pre-defined static datasets with explicit train/validation/test splits. Therefore, it does not provide specific dataset split information. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies, such as library names with version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1') needed to replicate the experiments. |
| Experiment Setup | Yes | For the RL experiments, similarly to Papini et al. (2019), the evaluation is carried out in the parameter-based PO setting (Sehnke et al. 2008), where the policy parameters θ are sampled from a hyperpolicy ν_ξ and the optimization is performed in the space of hyperparameters Ξ (Appendix A). This setting is particularly convenient since the Rényi divergence between hyperpolicies can be computed exactly (at least for Gaussians). Details and an additional experiment on the Cartpole domain are reported in Appendix F. We consider the monodimensional case and a Gaussian hyperpolicy ν_ξ = N(ξ, 0.15²), where ξ is the learned parameter. [...] taking 10 steps of the Metropolis-Hastings algorithm (Owen 2013) with Gaussian proposal q_m = N(θ_m, diag(0.15, 3)²). |
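
To make the quoted setup concrete, below is a minimal Python sketch (not the authors' code) of two ingredients it mentions: the exact Rényi divergence between 1-D Gaussian hyperpolicies of the form ν_ξ = N(ξ, 0.15²), and a short Metropolis-Hastings run with the quoted Gaussian proposal scale diag(0.15, 3) over 10 steps. The function names, the example parameter values, and the unnormalized `log_target` density are illustrative assumptions, not details from the paper; in RANDOMIST the target would be the algorithm's sampling distribution over (hyper)policy parameters.

```python
# Minimal sketch of the quoted experiment-setup ingredients, under the assumptions
# stated above. Not the authors' implementation.
import numpy as np


def renyi_divergence_gauss_1d(mu_p, sigma_p, mu_q, sigma_q, alpha):
    """Exact alpha-Renyi divergence D_alpha(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)).

    Valid for alpha > 0, alpha != 1, provided the mixed variance below is positive.
    """
    var_alpha = alpha * sigma_q**2 + (1.0 - alpha) * sigma_p**2
    assert var_alpha > 0.0, "Renyi divergence undefined: mixed variance must be positive"
    return (
        np.log(sigma_q / sigma_p)
        + np.log(sigma_q**2 / var_alpha) / (2.0 * (alpha - 1.0))
        + alpha * (mu_p - mu_q) ** 2 / (2.0 * var_alpha)
    )


def metropolis_hastings(log_target, theta0, proposal_std, n_steps=10, rng=None):
    """Run a few Metropolis-Hastings steps with a diagonal Gaussian proposal."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    log_p = log_target(theta)
    for _ in range(n_steps):
        candidate = theta + proposal_std * rng.standard_normal(theta.shape)
        log_p_cand = log_target(candidate)
        # Symmetric proposal: the acceptance ratio reduces to the target-density ratio.
        if np.log(rng.uniform()) < log_p_cand - log_p:
            theta, log_p = candidate, log_p_cand
    return theta


if __name__ == "__main__":
    # Two Gaussian hyperpolicies nu_xi = N(xi, 0.15^2); the divergence is in closed form.
    print(renyi_divergence_gauss_1d(mu_p=0.3, sigma_p=0.15, mu_q=0.0, sigma_q=0.15, alpha=2.0))

    # Hypothetical unnormalized target over a 2-D parameter, sampled with the quoted
    # proposal scale diag(0.15, 3) and 10 Metropolis-Hastings steps.
    def log_target(th):
        return -0.5 * np.sum((th / np.array([0.15, 3.0])) ** 2)

    theta = metropolis_hastings(log_target, theta0=np.zeros(2),
                                proposal_std=np.array([0.15, 3.0]), n_steps=10)
    print(theta)
```

Because the Gaussian proposal is symmetric, the Metropolis-Hastings acceptance test only needs the ratio of (unnormalized) target densities, which is why the sketch works with a log-density placeholder alone.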