Local Differential Privacy for Regret Minimization in Reinforcement Learning

Authors: Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta

Venue: NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the empirical performance of LDP-OBI on a toy MDP. We compare LDP-OBI with the non-private algorithm UCB-VI [32]. To the best of our knowledge, there is no other LDP algorithm for regret minimization in MDPs in the literature. To increase the comparators, we introduce a novel LDP algorithm based on Thompson sampling [e.g., 12]. (A generic regret-accounting sketch for such a comparison follows the table.)
Researcher Affiliation | Collaboration | Evrard Garcelon, Facebook AI Research & CREST, ENSAE Paris, France (evrard@fb.com); Vianney Perchet, CREST, ENSAE Paris & Criteo AI Lab, Palaiseau, France (vianney@ensae.fr); Ciara Pike-Burke, Imperial College London, London, United Kingdom (c.pikeburke@gmail.com); Matteo Pirotta, Facebook AI Research, Paris, France (matteo.pirotta@gmail.com)
Pseudocode | Yes | Algorithm 1: Locally Private Episodic RL; Algorithm 2: LDP-OBI(M)
Open Source Code | No | The paper does not provide any links to open-source code for the methodology described, nor does it explicitly state that code will be made available.
Open Datasets | No | The paper describes using a "Random MDP environment described in [25]" where parameters are sampled to generate the MDP. This indicates that a synthetic environment is generated for the experiments rather than a pre-existing, publicly available dataset with concrete access information.
Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. It describes a randomly generated MDP environment for simulations, not a fixed dataset with partitions.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific libraries).
Experiment Setup | Yes | We consider the Random MDP environment described in [25] where, for each state-action pair, transition probabilities are sampled from a Dirichlet(α) distribution (with α_{s,a,s'} = 0.1 for all (s, a, s')) and rewards are deterministic in {0, 1} with r(s, a) = 1{U_{s,a} ≤ 0.5}, where (U_{s,a})_{(s,a) ∈ S×A} ~ U([0, 1]) are sampled once when generating the MDP. We set the number of states S = 2, the number of actions A = 2, and the horizon H = 2. We evaluate the regret of our algorithm for ε ∈ {0.2, 2, 20} and K = 1 × 10^8 episodes. For each ε, we run 20 simulations. Confidence intervals are the minimum and maximum runs. (A sketch of this environment and experiment grid follows the table.)
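
As a concrete illustration of the quoted experiment setup, the following is a minimal NumPy sketch of how the Random MDP environment and experiment grid could be generated. The function name make_random_mdp, the seed handling, and the ≤ 0.5 reward threshold are assumptions made for illustration; since no code is released, this is not the authors' implementation.

import numpy as np

def make_random_mdp(n_states=2, n_actions=2, horizon=2, alpha=0.1, seed=0):
    """Sample the Random MDP described above: Dirichlet(alpha) transitions for each
    (s, a) pair and deterministic {0, 1} rewards drawn once from U([0, 1])."""
    rng = np.random.default_rng(seed)
    # P[s, a] is a distribution over next states, sampled from Dirichlet(0.1, ..., 0.1)
    P = rng.dirichlet(alpha * np.ones(n_states), size=(n_states, n_actions))
    # r(s, a) = 1{U_{s,a} <= 0.5} with U_{s,a} ~ U([0, 1]) (threshold direction assumed)
    U = rng.uniform(size=(n_states, n_actions))
    R = (U <= 0.5).astype(float)
    return P, R, horizon

# Experiment grid from the quoted setup (no learning agent is included here)
epsilons = [0.2, 2.0, 20.0]  # local differential privacy levels
K = 10**8                    # episodes per run
n_runs = 20                  # independent simulations per epsilon; reported
                             # confidence intervals are the min and max runs
P, R, H = make_random_mdp(seed=0)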
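
The Research Type row describes comparing LDP-OBI against the non-private UCB-VI and an LDP Thompson-sampling baseline in terms of regret. The sketch below shows only generic regret bookkeeping for a finite-horizon MDP (optimal values by backward induction, per-episode gap for whichever policy an agent plays); it does not implement LDP-OBI, UCB-VI, or the Thompson-sampling baseline, the uniform-random policy is a stand-in agent, and it reuses make_random_mdp from the sketch above.

import numpy as np

def optimal_values(P, R, H):
    """Backward induction: V[h][s] = max_a ( R[s, a] + sum_s' P[s, a, s'] * V[h+1][s'] )."""
    S, A = R.shape
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        Q_h = R + P @ V[h + 1]   # shape (S, A)
        V[h] = Q_h.max(axis=1)
    return V

def policy_values(P, R, H, pi):
    """Value of a deterministic, stage-dependent policy pi with shape (H, S)."""
    S, A = R.shape
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            a = pi[h, s]
            V[h, s] = R[s, a] + P[s, a] @ V[h + 1]
    return V

# Per-episode regret at a fixed initial state s0 is V*_1(s0) - V^{pi_k}_1(s0);
# cumulative regret sums this over k = 1, ..., K episodes.
rng = np.random.default_rng(1)
P, R, H = make_random_mdp(seed=0)
V_star = optimal_values(P, R, H)
s0, regret = 0, 0.0
for k in range(1000):  # short illustrative run, not the paper's K = 10^8 episodes
    pi_k = rng.integers(0, R.shape[1], size=(H, R.shape[0]))  # stand-in random policy
    regret += V_star[0, s0] - policy_values(P, R, H, pi_k)[0, s0]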