Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Local Differential Privacy for Regret Minimization in Reinforcement Learning
Authors: Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the empirical performance of LDP-OBI on a toy MDP. We compare LDP-OBI with the non-private algorithm UCB-VI [32]. To the best of our knowledge there is no other LDP algorithm for regret minimization in MDPs in the literature. To increase the comparators, we introduce a novel LDP algorithm based on Thompson sampling [e.g., 12]. |
| Researcher Affiliation | Collaboration | Evrard Garcelon (Facebook AI Research & CREST, ENSAE Paris, France); Vianney Perchet (CREST, ENSAE Paris & Criteo AI Lab, Palaiseau, France); Ciara Pike-Burke (Imperial College London, London, United Kingdom); Matteo Pirotta (Facebook AI Research, Paris, France) |
| Pseudocode | Yes | Algorithm 1: Locally Private Episodic RL; Algorithm 2: LDP-OBI (M) |
| Open Source Code | No | The paper does not provide any links to open-source code for the methodology described, nor does it explicitly state that code will be made available. |
| Open Datasets | No | The paper describes using a "Random MDP environment described in [25]" where parameters are sampled to generate the MDP. This indicates a synthetic environment is generated for experiments rather than using a pre-existing, publicly available dataset with concrete access information. |
| Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. It describes a randomly generated MDP environment for simulations, not a fixed dataset with partitions. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific libraries). |
| Experiment Setup | Yes | We consider the Random MDP environment described in [25] where for each state-action pair transition probabilities are sampled from a Dirichlet($\alpha$) distribution (with $\alpha_{s,a,s'} = 0.1$ for all $(s, a, s')$) and rewards are deterministic in $\{0, 1\}$ with $r(s, a) = \mathbb{1}\{U_{s,a} \geq 0.5\}$ for $(U_{s,a})_{(s,a) \in \mathcal{S} \times \mathcal{A}} \sim U([0, 1])$ sampled once when generating the MDP. We set the number of states $S = 2$, number of actions $A = 2$ and horizon $H = 2$. We evaluate the regret of our algorithm for $\varepsilon \in \{0.2, 2, 20\}$ and $K = 1 \times 10^8$ episodes. For each $\varepsilon$, we run 20 simulations. Confidence intervals are the minimum and maximum runs. |
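
Since no code is released, the environment in the Experiment Setup row can be approximated from its description alone. The following is a minimal sketch, not the authors' implementation: the function names `make_random_mdp` and `optimal_values`, the seed, and the choice of stage-independent transitions are assumptions made for illustration, and the indicator threshold is taken to be "at least 0.5". It samples Dirichlet(0.1) transitions and deterministic 0/1 rewards for S = A = 2, then computes the optimal value over a horizon of H = 2 by backward induction, which is the baseline against which regret would be measured.

```python
import numpy as np


def make_random_mdp(n_states=2, n_actions=2, alpha=0.1, seed=0):
    """Sample a small random MDP (hypothetical helper, not from the paper).

    Assumes transitions are shared across stages of the episode.
    """
    rng = np.random.default_rng(seed)
    # One Dirichlet(alpha, ..., alpha) draw per (state, action) pair;
    # result has shape (n_states, n_actions, n_states).
    transitions = rng.dirichlet(
        np.full(n_states, alpha), size=(n_states, n_actions)
    )
    # Deterministic rewards: r(s, a) = 1 if U_{s,a} >= 0.5 else 0,
    # with U_{s,a} drawn once from Uniform([0, 1]).
    u = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    rewards = (u >= 0.5).astype(float)
    return transitions, rewards


def optimal_values(transitions, rewards, horizon=2):
    """Backward induction for the optimal value at the first stage."""
    v = np.zeros(rewards.shape[0])
    for _ in range(horizon):
        # Q(s, a) = r(s, a) + sum_{s'} P(s' | s, a) * V(s')
        q = rewards + transitions @ v
        v = q.max(axis=1)
    return v


transitions, rewards = make_random_mdp()
v_star = optimal_values(transitions, rewards, horizon=2)
print("optimal first-stage values:", v_star)
```

With rewards in {0, 1} and H = 2, each optimal value lies between 0 and 2, which gives a quick sanity check on any reimplementation.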