Replicable Reinforcement Learning
Authors: Eric Eaton, Marcel Hussing, Michael Kearns, Jessica Sorrell
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | While our asymptotic bounds have sample complexity overhead from the introduction of replicability, we would like to analyze the actual requirements in practice. We introduce a simple MDP in Figure 1 that contains several ways of reaching the two goals. We analyze the impact of the number of calls to PS(GM) on replicability for rPVI. In theory, our dependence on the number of calls is not logarithmic with respect to |S||A|, but we would like to see if we can draw a sample that is much smaller, perhaps even on the order of the logarithmic requirement. We choose accuracy ε = 0.02, failure rate δ = 0.001, and replicability ρ = 0.2. The number of calls that would be required by standard Phased Q-learning is at most m ≈ 13000 (ignoring γ factors). We take several multiples of m and measure the fraction of identical and unique value functions, treating the rSTAT parameter ρSQ as a hyperparameter. The results are presented in Figure 2, revealing that the number of samples needed to replicably produce the same value function can be several orders of magnitude lower than suggested by our bounds, and that it is feasible to use a larger ρSQ than theoretically required. |
| Researcher Affiliation | Academia | Eric Eaton University of Pennsylvania Philadelphia, PA 19104 eeaton@seas.upenn.edu Marcel Hussing University of Pennsylvania Philadelphia, PA 19104 mhussing@seas.upenn.edu Michael Kearns University of Pennsylvania Philadelphia, PA 19104 mkearns@cis.upenn.edu Jessica Sorrell University of Pennsylvania Philadelphia, PA 19104 jsorrell@seas.upenn.edu |
| Pseudocode | Yes | Algorithm 1 Replicable Phased Value Iteration (rPVI) Parameters: accuracy ε, failure probability δ, replicability failure probability ρ Input: Generative Model GM Output: ε-optimal policy π̂ ... Algorithm 2 Replicable Episodic R-max (RepRMAX) Parameters: Accuracy ε, accuracy failure probability δ, replicability failure probability ρ, horizon H Input: MDP M, maximum reward Rmax Output: ε-optimal policy π̂_MK ... Algorithm 3 RepUpdateK Parameters: Accuracy failure probability δ, replicability failure probability ρ Input: Sample of trajectories Si, set of known states K, set of state-visit counts {n(s, a)}_{(s,a)∈S×A} Output: List of new known state-action pairs Ki |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes using a 'Grid World' for its experiments (Figure 1) and refers to drawing 'samples' from a 'generative model' (Definition 2.1, 2.2). However, it does not provide concrete access information (specific link, DOI, repository name, formal citation with authors/year, or reference to established benchmark datasets) for a publicly available or open dataset used for training or evaluation. |
| Dataset Splits | No | The paper discusses the collection of 'samples' and 'trajectories' for its algorithms but does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide any specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiments. |
| Experiment Setup | Yes | We choose accuracy ε = 0.02, failure rate δ = 0.001, and replicability ρ = 0.2. The number of calls that would be required by standard Phased Q-learning is at most m ≈ 13000 (ignoring γ factors). We take several multiples of m and measure the fraction of identical and unique value functions, treating the rSTAT parameter ρSQ as a hyperparameter. |
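The replicability mechanism evaluated above builds on replicable statistical queries (rSTAT), in which an empirical mean is snapped to a grid with a random offset drawn from randomness shared across runs, so two runs on different samples return the same answer with high probability. A minimal sketch of that rounding idea for a one-dimensional query; the function and parameter names are illustrative, not from the paper:

```python
import random

def replicable_stat(samples, tau, shared_seed):
    """Sketch of a replicable statistical query via randomized rounding.

    Estimates the mean of `samples`, then rounds it to a grid of
    width 2 * tau whose offset comes from shared randomness. Two runs
    with the same `shared_seed` whose empirical means are close return
    the exact same grid point with high probability.
    """
    empirical_mean = sum(samples) / len(samples)
    rng = random.Random(shared_seed)       # randomness shared across runs
    width = 2 * tau                        # grid spacing
    offset = rng.uniform(0.0, width)       # random offset of the grid
    # Snap the estimate to the nearest shifted grid point.
    return round((empirical_mean - offset) / width) * width + offset
```

Because the grid offset depends only on the shared seed, two sample sets with means within tau of each other disagree only when a rounding boundary happens to fall between them, which occurs with probability proportional to their distance over the grid width.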