Replicable Reinforcement Learning
Authors: Eric Eaton, Marcel Hussing, Michael Kearns, Jessica Sorrell
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | While our asymptotic bounds have sample complexity overhead from the introduction of replicability, we would like to analyze the actual requirements in practice. We introduce a simple MDP in Figure 1 that contains several ways of reaching the two goals. We analyze the impact of the number of calls to PS(GM) on replicability for rPVI. In theory, our dependence on the number of calls is not logarithmic with respect to |S||A|, but we would like to see if we can draw a sample that is much smaller, perhaps even on the order of the logarithmic requirement. We choose accuracy ε = 0.02, failure rate δ = 0.001, and replicability ρ = 0.2. The number of calls that would be required by standard Phased Q-learning is at most m ≈ 13000 (ignoring γ factors). We take several multiples of m and measure the fraction of identical and unique value functions, treating the rSTAT parameter ρSQ as a hyperparameter. The results are presented in Figure 2, revealing that the number of samples needed to replicably produce the same value function can be several orders of magnitude lower than suggested by our bounds, and that it is feasible to use a larger ρSQ than theoretically required. |
| Researcher Affiliation | Academia | Eric Eaton University of Pennsylvania Philadelphia, PA 19104 eeaton@seas.upenn.edu Marcel Hussing University of Pennsylvania Philadelphia, PA 19104 mhussing@seas.upenn.edu Michael Kearns University of Pennsylvania Philadelphia, PA 19104 mkearns@cis.upenn.edu Jessica Sorrell University of Pennsylvania Philadelphia, PA 19104 jsorrell@seas.upenn.edu |
| Pseudocode | Yes | Algorithm 1 Replicable Phased Value Iteration (rPVI) Parameters: accuracy ε, failure probability δ, replicability failure probability ρ Input: Generative Model GM Output: ε-optimal policy π̂ ... Algorithm 2 Replicable Episodic R-max (RepRMAX) Parameters: Accuracy ε, accuracy failure probability δ, replicability failure probability ρ, horizon H Input: MDP M, maximum reward Rmax Output: ε-optimal policy π̂_MK ... Algorithm 3 RepUpdateK Parameters: Accuracy failure probability δ, replicability failure probability ρ Input: Sample of trajectories Si, set of known states K, set of state-visit counts {n(s, a)}_{(s,a)∈S×A} Output: List of new known state-action pairs Ki |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes using a 'Grid World' for its experiments (Figure 1) and refers to drawing 'samples' from a 'generative model' (Definition 2.1, 2.2). However, it does not provide concrete access information (specific link, DOI, repository name, formal citation with authors/year, or reference to established benchmark datasets) for a publicly available or open dataset used for training or evaluation. |
| Dataset Splits | No | The paper discusses the collection of 'samples' and 'trajectories' for its algorithms but does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide any specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiments. |
| Experiment Setup | Yes | We choose accuracy ε = 0.02, failure rate δ = 0.001, and replicability ρ = 0.2. The number of calls that would be required by standard Phased Q-learning is at most m ≈ 13000 (ignoring γ factors). We take several multiples of m and measure the fraction of identical and unique value functions, treating the rSTAT parameter ρSQ as a hyperparameter. |
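The replicability mechanism evaluated above builds on replicable statistical queries (rSTAT), in which an empirical mean is snapped to a grid with a random offset drawn from randomness shared across runs, so two runs on different samples return the same answer with high probability. A minimal sketch of that rounding idea for a one-dimensional query; the function and parameter names are illustrative, not from the paper:

```python
import random

def replicable_stat(samples, tau, shared_seed):
    """Sketch of a replicable statistical query via randomized rounding.

    Estimates the mean of `samples`, then rounds it to a grid of
    width 2 * tau whose offset comes from shared randomness. Two runs
    with the same `shared_seed` whose empirical means are close return
    the exact same grid point with high probability.
    """
    empirical_mean = sum(samples) / len(samples)
    rng = random.Random(shared_seed)       # randomness shared across runs
    width = 2 * tau                        # grid spacing
    offset = rng.uniform(0.0, width)       # random offset of the grid
    # Snap the estimate to the nearest shifted grid point.
    return round((empirical_mean - offset) / width) * width + offset
```

Because the grid offset depends only on the shared seed, two sample sets with means within tau of each other disagree only when a rounding boundary happens to fall between them, which occurs with probability proportional to their distance over the grid width.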