Optimistic Policy Optimization via Multiple Importance Sampling

Authors: Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, Marcello Restelli

ICML 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms. In this section, we present the results of the numerical simulation of OPTIMIST on RL tasks with both discrete and continuous parameter spaces." |
| Researcher Affiliation | Academia | "Matteo Papini¹, Alberto Maria Metelli¹, Lorenzo Lupo¹, Marcello Restelli¹ (¹Politecnico di Milano, Milan, Italy). Correspondence to: Matteo Papini <matteo.papini@polimi.it>." |
| Pseudocode | Yes | "The pseudocode is provided in Algorithm 1. The pseudocode for this variant, called OPTIMIST2, is reported in Algorithm 2." (a hedged sketch of the selection loop follows the table) |
| Open Source Code | Yes | "The implementation of the proposed algorithms can be found at https://github.com/WolfLo/optimist." |
| Open Datasets | Yes | "The Linear Quadratic Gaussian Regulator (LQG, Dorato et al., 1995) is a benchmark problem for continuous control. The River Swim (Strehl & Littman, 2008) is a classical benchmark for exploration in RL." The Mountain Car task (Brockman et al., 2016) is also used. (a minimal LQG sketch follows the table) |
| Dataset Splits | No | The paper describes online learning within reinforcement learning environments, where data is generated through interaction. It does not provide explicit train/validation/test dataset splits as typically seen in supervised learning contexts. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., CPU, GPU models, or memory configurations). |
| Software Dependencies | No | The paper mentions the "default scikit-learn kernel (RBF)" for GPUCB but does not provide specific version numbers for any software dependencies or libraries used in its implementation. (a sketch of such a GPUCB baseline follows the table) |
| Experiment Setup | Yes | "We consider the monodimensional case in which the state space is limited to S = [-4, 4], the action space is A = [-4, 4] and the horizon is limited to 20. ... ξ is the mean parameter to be learned and σ = 0.15 fixed. ... All algorithms are run with confidence level δ = 0.2. ... The River Swim ... σ = 0.5 fixed. ... We use a Gaussian hyperpolicy with a two-dimensional learnable mean within a box [-1, 1] × [0, 20] and a fixed covariance diag(0.15, 3)². We compare OPTIMIST2 with κ = 3..." (a sketch of this hyperpolicy follows the table) |
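
The paper's core object is an optimistic index built from a multiple importance sampling (MIS) estimator over all past hyperpolicies. As a reading aid for Algorithm 1, here is a minimal Python sketch of that loop on a discrete grid of Gaussian hyperpolicy means, assuming the balance heuristic for the MIS weights. The effective-sample-size bonus and the absence of weight truncation are simplifications: the paper uses a Rényi-divergence-based bonus and a robust (truncated) estimator.

```python
import numpy as np
from scipy.stats import norm

def mis_weights(x, past_means, sigma, thetas):
    """Balance-heuristic MIS weights for a candidate Gaussian hyperpolicy
    N(x, sigma^2), given one policy parameter theta_i drawn from each
    past hyperpolicy N(past_means[i], sigma^2)."""
    thetas = np.asarray(thetas, dtype=float)
    target = norm.pdf(thetas, loc=x, scale=sigma)
    mixture = np.mean(
        [norm.pdf(thetas, loc=m, scale=sigma) for m in past_means], axis=0
    )
    return target / (len(thetas) * mixture)

def optimist_sketch(candidates, episode_return, iterations, sigma=0.15, delta=0.2):
    """OPTIMIST-style loop on a discrete grid of hyperpolicy means: pick the
    candidate maximizing (MIS return estimate + exploration bonus), draw a
    policy parameter from it, and observe one episode's return. The
    effective-sample-size bonus stands in for the paper's Renyi-divergence
    bonus, and the weight truncation of the robust estimator is omitted."""
    rng = np.random.default_rng(0)
    means, thetas, returns = [], [], []
    for t in range(1, iterations + 1):
        if not means:
            x_t = candidates[0]  # arbitrary first choice
        else:
            scores = []
            for x in candidates:
                w = mis_weights(x, means, sigma, thetas)
                estimate = float(np.sum(w * np.asarray(returns)))
                ess = w.sum() ** 2 / np.sum(w ** 2)  # effective sample size
                scores.append(estimate + np.sqrt(np.log(t / delta) / ess))
            x_t = candidates[int(np.argmax(scores))]
        theta = rng.normal(x_t, sigma)  # parameter-based exploration
        means.append(x_t)
        thetas.append(theta)
        returns.append(episode_return(theta))
    return means, returns
```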
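
The monodimensional LQG setup quoted in the table is simple enough to sketch directly. The unit quadratic reward weights and the deterministic dynamics below are assumptions, since the exact reward matrices are not quoted above:

```python
import numpy as np

def lqg_return(theta, horizon=20, rng=None):
    """One episode of a hedged 1-D LQG sketch: linear policy a = theta * s,
    states and actions clipped to [-4, 4], horizon 20. The unit quadratic
    reward weights and deterministic dynamics are assumptions."""
    rng = rng or np.random.default_rng()
    s = rng.uniform(-4.0, 4.0)  # random initial state
    total = 0.0
    for _ in range(horizon):
        a = float(np.clip(theta * s, -4.0, 4.0))
        total += -0.5 * (s ** 2 + a ** 2)     # quadratic reward (assumed)
        s = float(np.clip(s + a, -4.0, 4.0))  # linear dynamics (assumed)
    return total
```

For illustration, `optimist_sketch(list(np.linspace(-1.0, 0.0, 11)), lqg_return, 100)` runs the discrete-grid loop above on this task; the grid range is an arbitrary choice, not taken from the paper.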
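
The GPUCB baseline is described only as using scikit-learn's default RBF kernel, with no versions given. A sketch of such a baseline, with an assumed constant exploration coefficient `beta`, could look like:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gpucb_select(X_seen, y_seen, candidates, beta=2.0):
    """Pick the next point by the GP-UCB rule mu + sqrt(beta) * std.
    X_seen and candidates are 2-D arrays of hyperpolicy parameters;
    beta is an assumed constant, not the paper's schedule."""
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    gp.fit(np.asarray(X_seen), np.asarray(y_seen))
    mu, std = gp.predict(np.asarray(candidates), return_std=True)
    return int(np.argmax(mu + np.sqrt(beta) * std))
```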
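
Finally, the Mountain Car hyperparameters quoted in the setup row are enough to sketch the hyperpolicy itself. Clipping the learnable mean into the box is an illustrative way to enforce the stated constraint; the policy that consumes the sampled parameters is left abstract:

```python
import numpy as np

def sample_mountain_car_policy(mean, rng=None):
    """Sample policy parameters from the Mountain Car hyperpolicy: Gaussian
    with a two-dimensional learnable mean constrained to the box
    [-1, 1] x [0, 20] (enforced here by clipping, an illustrative choice)
    and fixed covariance diag(0.15, 3)^2."""
    rng = rng or np.random.default_rng()
    mean = np.clip(mean, [-1.0, 0.0], [1.0, 20.0])  # keep mean in the box
    std = np.array([0.15, 3.0])                     # fixed diagonal stddev
    return rng.normal(mean, std)
```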