Optimistic Policy Optimization via Multiple Importance Sampling

Authors: Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, Marcello Restelli

ICML 2019

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Finally, we evaluate our algorithms on tasks of varying difficulty, comparing them with existing MAB and RL algorithms. In this section, we present the results of the numerical simulation of OPTIMIST on RL tasks with both discrete and continuous parameter spaces." |
| Researcher Affiliation | Academia | "Matteo Papini¹, Alberto Maria Metelli¹, Lorenzo Lupo¹, Marcello Restelli¹ (¹Politecnico di Milano, Milan, Italy). Correspondence to: Matteo Papini <matteo.papini@polimi.it>." |
| Pseudocode | Yes | "The pseudocode is provided in Algorithm 1. The pseudocode for this variant, called OPTIMIST2, is reported in Algorithm 2." (a hedged sketch of the selection loop follows the table) |
| Open Source Code | Yes | "The implementation of the proposed algorithms can be found at https://github.com/WolfLo/optimist." |
| Open Datasets | Yes | "The Linear Quadratic Gaussian Regulator (LQG, Dorato et al., 1995) is a benchmark problem for continuous control. The River Swim (Strehl & Littman, 2008) is a classical benchmark for exploration in RL." The Mountain Car task (Brockman et al., 2016) is also used. (a minimal LQG sketch follows the table) |
| Dataset Splits | No | The paper describes online learning within reinforcement learning environments, where data is generated through interaction. It does not provide explicit train/validation/test dataset splits as typically seen in supervised learning contexts. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., CPU, GPU models, or memory configurations). |
| Software Dependencies | No | The paper mentions the "default scikit-learn kernel (RBF)" for GPUCB but does not provide specific version numbers for any software dependencies or libraries used in its implementation. (a sketch of such a GPUCB baseline follows the table) |
| Experiment Setup | Yes | "We consider the monodimensional case in which the state space is limited to S = [-4, 4], the action space is A = [-4, 4] and the horizon is limited to 20. ... ξ is the mean parameter to be learned and σ = 0.15 fixed. ... All algorithms are run with confidence level δ = 0.2. ... The River Swim ... σ = 0.5 fixed. ... We use a Gaussian hyperpolicy with a two-dimensional learnable mean within a box [-1, 1] × [0, 20] and a fixed covariance diag(0.15, 3)². We compare OPTIMIST2 with κ = 3..." (a sketch of this hyperpolicy follows the table) |
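
The paper's core object is an optimistic index built from a multiple importance sampling (MIS) estimator over all past hyperpolicies. As a reading aid for Algorithm 1, here is a minimal Python sketch of that loop on a discrete grid of Gaussian hyperpolicy means, assuming the balance heuristic for the MIS weights. The effective-sample-size bonus and the absence of weight truncation are simplifications: the paper uses a Rényi-divergence-based bonus and a robust (truncated) estimator.

```python
import numpy as np
from scipy.stats import norm

def mis_weights(x, past_means, sigma, thetas):
    """Balance-heuristic MIS weights for a candidate Gaussian hyperpolicy
    N(x, sigma^2), given one policy parameter theta_i drawn from each
    past hyperpolicy N(past_means[i], sigma^2)."""
    thetas = np.asarray(thetas, dtype=float)
    target = norm.pdf(thetas, loc=x, scale=sigma)
    mixture = np.mean(
        [norm.pdf(thetas, loc=m, scale=sigma) for m in past_means], axis=0
    )
    return target / (len(thetas) * mixture)

def optimist_sketch(candidates, episode_return, iterations, sigma=0.15, delta=0.2):
    """OPTIMIST-style loop on a discrete grid of hyperpolicy means: pick the
    candidate maximizing (MIS return estimate + exploration bonus), draw a
    policy parameter from it, and observe one episode's return. The
    effective-sample-size bonus stands in for the paper's Renyi-divergence
    bonus, and the weight truncation of the robust estimator is omitted."""
    rng = np.random.default_rng(0)
    means, thetas, returns = [], [], []
    for t in range(1, iterations + 1):
        if not means:
            x_t = candidates[0]  # arbitrary first choice
        else:
            scores = []
            for x in candidates:
                w = mis_weights(x, means, sigma, thetas)
                estimate = float(np.sum(w * np.asarray(returns)))
                ess = w.sum() ** 2 / np.sum(w ** 2)  # effective sample size
                scores.append(estimate + np.sqrt(np.log(t / delta) / ess))
            x_t = candidates[int(np.argmax(scores))]
        theta = rng.normal(x_t, sigma)  # parameter-based exploration
        means.append(x_t)
        thetas.append(theta)
        returns.append(episode_return(theta))
    return means, returns
```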
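
The monodimensional LQG setup quoted in the table is simple enough to sketch directly. The unit quadratic reward weights and the deterministic dynamics below are assumptions, since the exact reward matrices are not quoted above:

```python
import numpy as np

def lqg_return(theta, horizon=20, rng=None):
    """One episode of a hedged 1-D LQG sketch: linear policy a = theta * s,
    states and actions clipped to [-4, 4], horizon 20. The unit quadratic
    reward weights and deterministic dynamics are assumptions."""
    rng = rng or np.random.default_rng()
    s = rng.uniform(-4.0, 4.0)  # random initial state
    total = 0.0
    for _ in range(horizon):
        a = float(np.clip(theta * s, -4.0, 4.0))
        total += -0.5 * (s ** 2 + a ** 2)     # quadratic reward (assumed)
        s = float(np.clip(s + a, -4.0, 4.0))  # linear dynamics (assumed)
    return total
```

For illustration, `optimist_sketch(list(np.linspace(-1.0, 0.0, 11)), lqg_return, 100)` runs the discrete-grid loop above on this task; the grid range is an arbitrary choice, not taken from the paper.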
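
The GPUCB baseline is described only as using scikit-learn's default RBF kernel, with no versions given. A sketch of such a baseline, with an assumed constant exploration coefficient `beta`, could look like:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gpucb_select(X_seen, y_seen, candidates, beta=2.0):
    """Pick the next point by the GP-UCB rule mu + sqrt(beta) * std.
    X_seen and candidates are 2-D arrays of hyperpolicy parameters;
    beta is an assumed constant, not the paper's schedule."""
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
    gp.fit(np.asarray(X_seen), np.asarray(y_seen))
    mu, std = gp.predict(np.asarray(candidates), return_std=True)
    return int(np.argmax(mu + np.sqrt(beta) * std))
```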
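
Finally, the Mountain Car hyperparameters quoted in the setup row are enough to sketch the hyperpolicy itself. Clipping the learnable mean into the box is an illustrative way to enforce the stated constraint; the policy that consumes the sampled parameters is left abstract:

```python
import numpy as np

def sample_mountain_car_policy(mean, rng=None):
    """Sample policy parameters from the Mountain Car hyperpolicy: Gaussian
    with a two-dimensional learnable mean constrained to the box
    [-1, 1] x [0, 20] (enforced here by clipping, an illustrative choice)
    and fixed covariance diag(0.15, 3)^2."""
    rng = rng or np.random.default_rng()
    mean = np.clip(mean, [-1.0, 0.0], [1.0, 20.0])  # keep mean in the box
    std = np.array([0.15, 3.0])                     # fixed diagonal stddev
    return rng.normal(mean, std)
```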