High Confidence Policy Improvement
Authors: Philip Thomas, Georgios Theocharous, Mohammad Ghavamzadeh
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the viability of our approach with a simple gridworld and the standard mountain car problem, as well as with a digital marketing application that uses real world data. |
| Researcher Affiliation | Collaboration | Philip S. Thomas (PTHOMAS@CS.UMASS.EDU), University of Massachusetts Amherst; Georgios Theocharous (THEOCHAR@ADOBE.COM), Adobe Research; Mohammad Ghavamzadeh (GHAVAMZA@ADOBE.COM), Adobe Research & INRIA |
| Pseudocode | Yes | Algorithm 1 ρCI(X, δ, m): Predict what the 1 − δ confidence lower bound on E[Xi] would be if X contained m random variables rather than n. Algorithm 2 ρTT(X, δ, m): Predict what the 1 − δ confidence lower bound on E[Xi] would be if X contained m random variables rather than n. Algorithm 3 ρBCa(X, δ, m): Predict what the 1 − δ confidence lower bound on E[Xi] would be if X contained m random variables rather than n. Algorithm 4 POLICYIMPROVEMENT(Dtrain, Dtest, δ, ρ∗): Either returns NO SOLUTION FOUND (NSF) or a (semi-)safe policy. Here ∗ can denote either CI, TT, or BCa. Algorithm 5 GETCANDIDATEPOLICY_None(D, δ, ρ∗, m): Searches for the candidate policy, but does nothing to mitigate overfitting. Algorithm 6 GETCANDIDATEPOLICY_k-fold(D, δ, ρ∗, m): Searches for the candidate policy using k-fold cross-validation to avoid overfitting. Algorithm 7 CROSSVALIDATE(α, D, δ, ρ∗, m). Algorithm 8 DAEDALUS(π0, δ, ρ∗, β): Incremental policy improvement algorithm. (A minimal sketch of a ρTT-style bound routine appears after the table.) |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described. |
| Open Datasets | No | The paper mentions using a "digital marketing application that uses real world data" and "data collected from a Fortune 20 company" but does not provide any access information (link, citation with authors/year, or repository) for this dataset. It also mentions "Mountain Car domain" and "gridworld domain" which are standard but again, no specific access information to the data *used in their experiments*. |
| Dataset Splits | Yes | Specifically, we first partition the data into a small training set, Dtrain, and a larger test set, Dtest. The training set is used to search for which single policy, called the candidate policy, πc, should be tested for safety using the test set. This policy improvement method, POLICYIMPROVEMENT, is reported in Algorithm 4. To simplify later pseudocode, POLICYIMPROVEMENT assumes that the trajectories have already been partitioned into Dtrain and Dtest. In practice, we place 1/5 of the trajectories in the training set and the remainder in the test set. (See the partitioning and policy-improvement sketch after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using CMA-ES (Hansen, 2006) and linear softmax action selection with the Fourier basis (Konidaris et al., 2011) but does not provide specific version numbers for any software components or libraries. |
| Experiment Setup | Yes | In all of the experiments that we present, we selected ρ− to be an empirical estimate of the performance of the initial policy and δ = 0.05. We used CMA-ES (Hansen, 2006) to solve all arg max_π, where π was parameterized by a vector of policy parameters using linear softmax action selection (Sutton & Barto, 1998) with the Fourier basis (Konidaris et al., 2011). |
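
For concreteness, here is a minimal Python sketch in the spirit of the paper's ρTT routine (Algorithms 1–3): it predicts what a one-sided 1 − δ Student-t lower confidence bound on the mean would be if the sample contained m points rather than the n actually observed. The function name `rho_tt` and the simple substitution of m for n in the standard bound are our assumptions for illustration, not the paper's exact pseudocode.

```python
import numpy as np
from scipy import stats

def rho_tt(X, delta, m):
    """Predicted 1 - delta one-sided lower confidence bound on E[X_i], as if |X| were m."""
    X = np.asarray(X, dtype=float)
    mean = X.mean()
    std = X.std(ddof=1)  # sample standard deviation from the n observed values
    return mean - (std / np.sqrt(m)) * stats.t.ppf(1.0 - delta, df=m - 1)
```

With `m = len(X)` this reduces to the ordinary one-sided t-based lower bound; with `m > len(X)` it predicts the tighter bound that a larger sample would give, which is how the candidate-policy search anticipates the later safety test.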
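The 1/5 : 4/5 partition and the train-then-test structure of POLICYIMPROVEMENT can be sketched as follows. This is a skeleton only: the helper names `get_candidate` and `estimate_returns`, the threshold argument `rho_minus`, and the importance-sampling comment are placeholders we introduce for illustration, not functions defined in the paper.

```python
import numpy as np

NO_SOLUTION_FOUND = None  # stand-in for the paper's "No Solution Found" (NSF) return value

def split_trajectories(trajectories, train_frac=0.2, seed=0):
    """Place roughly 1/5 of the trajectories in D_train and the rest in D_test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(trajectories))
    n_train = int(train_frac * len(trajectories))
    d_train = [trajectories[i] for i in order[:n_train]]
    d_test = [trajectories[i] for i in order[n_train:]]
    return d_train, d_test

def policy_improvement(d_train, d_test, delta, rho_bound, rho_minus,
                       get_candidate, estimate_returns):
    """Skeleton of the train/test structure: search on D_train, safety-test on D_test."""
    pi_c = get_candidate(d_train, delta, rho_bound, len(d_test))  # candidate policy from training data
    returns = estimate_returns(pi_c, d_test)                      # e.g. importance-sampled returns
    if rho_bound(returns, delta, len(returns)) >= rho_minus:      # 1 - delta lower bound vs. threshold
        return pi_c
    return NO_SOLUTION_FOUND
```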
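Finally, a hedged sketch of the policy parameterization and search described in the experiment setup: linear softmax action selection over a Fourier basis, with the parameter vector optimized by CMA-ES. The use of the `cma` Python package's ask/tell interface is our implementation choice, states are assumed normalized to [0, 1]^d, and a toy quadratic stands in for the paper's actual training-set objective.

```python
import itertools
import numpy as np
import cma  # CMA-ES implementation (Hansen); assumes `pip install cma`

def fourier_features(state, order):
    """Fourier basis (Konidaris et al., 2011): phi_c(s) = cos(pi * c . s), s in [0, 1]^d."""
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(state))))
    return np.cos(np.pi * coeffs @ np.asarray(state, dtype=float))

def action_probabilities(theta, state, num_actions, order):
    """Linear softmax action selection over Fourier features."""
    phi = fourier_features(state, order)
    prefs = theta.reshape(num_actions, phi.size) @ phi
    prefs -= prefs.max()  # subtract max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def neg_objective(theta):
    # Placeholder: CMA-ES minimizes, so the negated policy objective (e.g. a predicted
    # performance lower bound on the training set) would go here. Toy quadratic for now.
    return float(np.sum((theta - 1.0) ** 2))

num_actions, order, state_dim = 3, 3, 2
theta0 = np.zeros(num_actions * (order + 1) ** state_dim)
es = cma.CMAEvolutionStrategy(theta0, 0.5, {'maxiter': 100, 'verbose': -9})
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [neg_objective(np.asarray(c)) for c in candidates])
theta_best = np.asarray(es.result.xbest)
probs = action_probabilities(theta_best, np.array([0.3, 0.7]), num_actions, order)
```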