Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Instance-Dependent Confidence and Early Stopping for Reinforcement Learning
Authors: Eric Xia, Koulik Khamaru, Martin J. Wainwright, Michael I. Jordan
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We highlight benefit of such early stopping rules via some numerical studies. ... Figure 1 gives a preview of the results to come, including the behavior of this early stopping procedure (panel (a)), along with the attendant benefits of substantially reduced sample sizes (panel (b)). ... 3.4 Numerical simulations ... 4.4 Some numerical simulations |
| Researcher Affiliation | Academia | Eric Xia EMAIL Department of EECS Massachusetts Institute of Technology Cambridge, MA 02139, USA; Koulik Khamaru EMAIL Department of Statistics Rutgers University Piscataway, NJ 08854 USA; Martin J. Wainwright EMAIL Department of EECS and Mathematics Massachusetts Institute of Technology Cambridge, MA 02139, USA; Michael I. Jordan EMAIL Department of EECS and Statistics University of California, Berkeley Berkeley, CA 94720 USA |
| Pseudocode | Yes | Algorithm EmpIRE: Empirical Instance-optimal Markov Reward Evaluation ... Algorithm Single Epoch: RunEpoch(Q; K, B_m, {J_i}_{i ∈ C_m}) ... Algorithm VR-QL |
| Open Source Code | No | The paper does not contain any explicit statement about the release of source code or a link to a code repository for the methodology described. |
| Open Datasets | No | In the learning setting, the pair (P, r) is unknown and we assume that we have access to i.i.d. samples {(R_k, Z_k)}_{k=1}^n from the reward vector r and from the transition matrix P. ... We operate in the generative observation model: we are given n i.i.d. samples of the form {(Z_k, R_k)}_{k=1}^n ... The numerical simulations use a simple two-state Markov reward process (MRP) M = (P, r). The paper does not provide concrete access information (link, DOI, citation) to any publicly available dataset. |
| Dataset Splits | Yes | The base estimator acts on D_n with cardinality n to compute the value function estimate V̂_n = A_Eval(D_n). We then re-use the larger data set D_n to compute a portion of the covariance estimate Σ̂(V̂_n, D). In addition, we make use of a smaller set D_{2n_h} corresponding to samples that are held-out, and also enter the covariance estimate Σ̂(V̂_n, D) as detailed in Section 3.2.3 following our theorem statement. This data set consists of 2n_h samples with n_h = ⌈24 log(16|X|²/Γ)⌉ |
| Hardware Specification | No | The paper describes numerical simulations but does not specify any particular hardware (e.g., GPU/CPU models, cloud platforms) used for running these experiments. |
| Software Dependencies | No | The paper mentions using the "ROOT-SA algorithm (Mou et al., 2022)" and the "variance-reduced Q-learning algorithm from Xia et al. (2021)" as base procedures, but it does not specify any ancillary software packages or libraries with version numbers. |
| Experiment Setup | Yes | For every valid combination of (γ, λ), we ran Algorithm EmpIRE with the ROOT-SA algorithm (Mou et al., 2022) as our instance-optimal sub-procedure on the MRP. ... The γ's were chosen to be uniformly spaced between 0.9 and 0.99 on the log scale, and λ was chosen to be in the set {1.0, 1.5}. The desired tolerance was chosen to be ε = 0.1. Our results are presented in Figure 1, as previously described. The initial point V_0 was chosen by setting aside 2/(1 − γ)² samples to construct a plug-in estimate of V. ... For every combination of (γ, λ), we ran Algorithm EmpIRE with the ROOT-SA algorithm (Mou et al., 2022) as our base procedure on the MDP described in Example 1 for 1000 trials. ... The desired tolerance was set at ε = 0.05. The initialization point Q_0 was chosen via setting aside 2/(1 − γ)² samples and estimating r and P via averaging, and then solving for the optimal Q-function for this MDP. |