High Confidence Policy Improvement
Authors: Philip Thomas, Georgios Theocharous, Mohammad Ghavamzadeh
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the viability of our approach with a simple gridworld and the standard mountain car problem, as well as with a digital marketing application that uses real world data. |
| Researcher Affiliation | Collaboration | Philip S. Thomas (PTHOMAS@CS.UMASS.EDU), University of Massachusetts Amherst; Georgios Theocharous (THEOCHAR@ADOBE.COM), Adobe Research; Mohammad Ghavamzadeh (GHAVAMZA@ADOBE.COM), Adobe Research & INRIA |
| Pseudocode | Yes | Algorithm 1 ρCI(X, δ, m): Predict what the 1 − δ confidence lower bound on E[Xi] would be if X contained m random variables rather than n. Algorithm 2 ρTT(X, δ, m): Predict what the 1 − δ confidence lower bound on E[Xi] would be if X contained m random variables rather than n. Algorithm 3 ρBCa(X, δ, m): Predict what the 1 − δ confidence lower bound on E[Xi] would be if X contained m random variables rather than n. Algorithm 4 POLICYIMPROVEMENT(Dtrain, Dtest, δ, ρ∗): Either returns NO SOLUTION FOUND (NSF) or a (semi-)safe policy. Here ∗ can denote either CI, TT, or BCa. Algorithm 5 GETCANDIDATEPOLICY_None(D, δ, ρ∗, m): Searches for the candidate policy, but does nothing to mitigate overfitting. Algorithm 6 GETCANDIDATEPOLICY_k-fold(D, δ, ρ∗, m): Searches for the candidate policy using k-fold cross-validation to avoid overfitting. Algorithm 7 CROSSVALIDATE(α, D, δ, ρ∗, m). Algorithm 8 DAEDALUS(π0, δ, ρ∗, β): Incremental policy improvement algorithm. (A minimal sketch of a ρTT-style bound routine appears after the table.) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described. |
| Open Datasets | No | The paper mentions using a "digital marketing application that uses real world data" and "data collected from a Fortune 20 company" but does not provide any access information (link, citation with authors/year, or repository) for this dataset. It also mentions "Mountain Car domain" and "gridworld domain" which are standard but again, no specific access information to the data *used in their experiments*. |
| Dataset Splits | Yes | Specifically, we first partition the data into a small training set, Dtrain, and a larger test set, Dtest. The training set is used to search for which single policy, called the candidate policy, πc, should be tested for safety using the test set. This policy improvement method, POLICYIMPROVEMENT, is reported in Algorithm 4. To simplify later pseudocode, POLICYIMPROVEMENT assumes that the trajectories have already been partitioned into Dtrain and Dtest. In practice, we place 1/5 of the trajectories in the training set and the remainder in the test set. (See the partitioning and policy-improvement sketch after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using CMA-ES (Hansen, 2006) and linear softmax action selection with the Fourier basis (Konidaris et al., 2011) but does not provide specific version numbers for any software components or libraries. |
| Experiment Setup | Yes | In all of the experiments that we present, we selected ρ− to be an empirical estimate of the performance of the initial policy and δ = 0.05. We used CMA-ES (Hansen, 2006) to solve all arg max_π, where π was parameterized by a vector of policy parameters using linear softmax action selection (Sutton & Barto, 1998) with the Fourier basis (Konidaris et al., 2011). |
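
For concreteness, here is a minimal Python sketch in the spirit of the paper's ρTT routine (Algorithms 1–3): it predicts what a one-sided 1 − δ Student-t lower confidence bound on the mean would be if the sample contained m points rather than the n actually observed. The function name `rho_tt` and the simple substitution of m for n in the standard bound are our assumptions for illustration, not the paper's exact pseudocode.

```python
import numpy as np
from scipy import stats

def rho_tt(X, delta, m):
    """Predicted 1 - delta one-sided lower confidence bound on E[X_i], as if |X| were m."""
    X = np.asarray(X, dtype=float)
    mean = X.mean()
    std = X.std(ddof=1)  # sample standard deviation from the n observed values
    return mean - (std / np.sqrt(m)) * stats.t.ppf(1.0 - delta, df=m - 1)
```

With `m = len(X)` this reduces to the ordinary one-sided t-based lower bound; with `m > len(X)` it predicts the tighter bound that a larger sample would give, which is how the candidate-policy search anticipates the later safety test.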
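The 1/5 : 4/5 partition and the train-then-test structure of POLICYIMPROVEMENT can be sketched as follows. This is a skeleton only: the helper names `get_candidate` and `estimate_returns`, the threshold argument `rho_minus`, and the importance-sampling comment are placeholders we introduce for illustration, not functions defined in the paper.

```python
import numpy as np

NO_SOLUTION_FOUND = None  # stand-in for the paper's "No Solution Found" (NSF) return value

def split_trajectories(trajectories, train_frac=0.2, seed=0):
    """Place roughly 1/5 of the trajectories in D_train and the rest in D_test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(trajectories))
    n_train = int(train_frac * len(trajectories))
    d_train = [trajectories[i] for i in order[:n_train]]
    d_test = [trajectories[i] for i in order[n_train:]]
    return d_train, d_test

def policy_improvement(d_train, d_test, delta, rho_bound, rho_minus,
                       get_candidate, estimate_returns):
    """Skeleton of the train/test structure: search on D_train, safety-test on D_test."""
    pi_c = get_candidate(d_train, delta, rho_bound, len(d_test))  # candidate policy from training data
    returns = estimate_returns(pi_c, d_test)                      # e.g. importance-sampled returns
    if rho_bound(returns, delta, len(returns)) >= rho_minus:      # 1 - delta lower bound vs. threshold
        return pi_c
    return NO_SOLUTION_FOUND
```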
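Finally, a hedged sketch of the policy parameterization and search described in the experiment setup: linear softmax action selection over a Fourier basis, with the parameter vector optimized by CMA-ES. The use of the `cma` Python package's ask/tell interface is our implementation choice, states are assumed normalized to [0, 1]^d, and a toy quadratic stands in for the paper's actual training-set objective.

```python
import itertools
import numpy as np
import cma  # CMA-ES implementation (Hansen); assumes `pip install cma`

def fourier_features(state, order):
    """Fourier basis (Konidaris et al., 2011): phi_c(s) = cos(pi * c . s), s in [0, 1]^d."""
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(state))))
    return np.cos(np.pi * coeffs @ np.asarray(state, dtype=float))

def action_probabilities(theta, state, num_actions, order):
    """Linear softmax action selection over Fourier features."""
    phi = fourier_features(state, order)
    prefs = theta.reshape(num_actions, phi.size) @ phi
    prefs -= prefs.max()  # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def neg_objective(theta):
    # Placeholder: CMA-ES minimizes, so the negated policy objective (e.g. a predicted
    # performance lower bound on the training set) would go here. Toy quadratic for now.
    return float(np.sum((theta - 1.0) ** 2))

num_actions, order, state_dim = 3, 3, 2
theta0 = np.zeros(num_actions * (order + 1) ** state_dim)
es = cma.CMAEvolutionStrategy(theta0, 0.5, {'maxiter': 100, 'verbose': -9})
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [neg_objective(np.asarray(c)) for c in candidates])
theta_best = np.asarray(es.result.xbest)
probs = action_probabilities(theta_best, np.array([0.3, 0.7]), num_actions, order)
```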