Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Instance-Dependent Confidence and Early Stopping for Reinforcement Learning

Authors: Eric Xia, Koulik Khamaru, Martin J. Wainwright, Michael I. Jordan

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We highlight benefit of such early stopping rules via some numerical studies. ... Figure 1 gives a preview of the results to come, including the behavior of this early stopping procedure (panel (a)), along with the attendant benefits of substantially reduced sample sizes (panel (b)). ... 3.4 Numerical simulations ... 4.4 Some numerical simulations
Researcher Affiliation | Academia | Eric Xia EMAIL Department of EECS Massachusetts Institute of Technology Cambridge, MA 02139, USA; Koulik Khamaru EMAIL Department of Statistics Rutgers University Piscataway, NJ 08854 USA; Martin J. Wainwright EMAIL Department of EECS and Mathematics Massachusetts Institute of Technology Cambridge, MA 02139, USA; Michael I. Jordan EMAIL Department of EECS and Statistics University of California, Berkeley Berkeley, CA 94720 USA
Pseudocode | Yes | Algorithm EmpIRE: Empirical Instance-optimal Markov Reward Evaluation ... Algorithm SingleEpoch: RunEpoch(Q; K, B_m, {J_i}_{i∈C_m}) ... Algorithm VR-QL
Open Source Code | No | The paper does not contain any explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets | No | In the learning setting, the pair (P, r) is unknown and we assume that we have access to i.i.d. samples {(R_k, Z_k)}_{k=1}^n from the reward vector r and from the transition matrix P. ... We operate in the generative observation model: we are given n i.i.d. samples of the form {(Z_k, R_k)}_{k=1}^n... The numerical simulations use a simple two-state Markov reward process (MRP) M = (P, r). The paper does not provide concrete access information (link, DOI, citation) to any publicly available dataset.
Dataset Splits | Yes | The base estimator acts on D_n with cardinality n to compute the value function estimate V̂_n = AEval(D_n). We then re-use the larger data set D_n to compute a portion of the covariance estimate Σ̂(V̂_n, D). In addition, we make use of a smaller set D_{2n_h} corresponding to samples that are held-out, and also enter the covariance estimate Σ̂(V̂_n, D) as detailed in Section 3.2.3 following our theorem statement. This data set consists of 2n_h samples with n_h = ⌈24 log(16|X|²/Γ)⌉
Hardware Specification | No | The paper describes numerical simulations but does not specify any particular hardware (e.g., GPU/CPU models, cloud platforms) used for running these experiments.
Software Dependencies | No | The paper mentions using the "ROOT-SA algorithm (Mou et al., 2022)" and the "variance-reduced Q-learning algorithm from Xia et al. (2021)" as base procedures, but it does not specify any ancillary software packages or libraries with version numbers.
Experiment Setup | Yes | For every valid combination of (γ, λ), we ran Algorithm EmpIRE with the ROOT-SA algorithm (Mou et al., 2022) as our instance-optimal sub-procedure on the MRP. ... The γ's were chosen to be uniformly spaced between 0.9 and 0.99 in the log-scale, and λ was chosen to be in the set {1.0, 1.5}. The desired tolerance was chosen to be ϵ = 0.1. Our results are presented in Figure 1, as previously described. The initial point V_0 was chosen by setting aside 2(1−γ)^{-2} samples to construct a plug-in estimate of V. ... For every combination of (γ, λ), we ran Algorithm EmpIRE with the ROOT-SA algorithm (Mou et al., 2022) as our base procedure on the MDP described in Example 1 for 1000 trials. ... The desired tolerance was set at ϵ = 0.05. The initialization point Q_0 was chosen via setting aside 2(1−γ)^{-2} samples and estimating r and P via averaging, and then solving for the optimal Q-function for this MDP.
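The quoted experiment setup (discount factors log-spaced on [0.9, 0.99], λ ∈ {1.0, 1.5}, and 2(1−γ)^{-2} samples set aside for initialization) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the number of grid points (10) and all variable names are assumptions, and the EmpIRE/ROOT-SA run itself is left as a placeholder.

```python
import numpy as np

# Discount factors uniformly spaced between 0.9 and 0.99 on a log scale.
# The paper does not state the number of grid points; 10 is an assumption.
gamma_grid = np.exp(np.linspace(np.log(0.9), np.log(0.99), num=10))

lambdas = [1.0, 1.5]   # λ drawn from the set {1.0, 1.5}
epsilon = 0.1          # desired tolerance for the MRP experiment
n_trials = 1000        # trials per (γ, λ) pair in the MDP experiment

for gamma in gamma_grid:
    # 2(1 - γ)^{-2} samples are set aside to build the plug-in initial point
    # (V_0 for the MRP, Q_0 for the MDP).
    n_init = int(np.ceil(2.0 * (1.0 - gamma) ** -2))
    for lam in lambdas:
        # Placeholder: run Algorithm EmpIRE with ROOT-SA as the base
        # procedure for n_trials trials at tolerance epsilon.
        pass
```

The grid makes the scaling visible: the initialization budget grows from roughly 200 samples at γ = 0.9 to roughly 20,000 at γ = 0.99, which is consistent with the (1−γ)^{-2} horizon dependence in the quoted setup.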