Local Differential Privacy for Regret Minimization in Reinforcement Learning

Authors: Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta

Venue: NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate the empirical performance of LDP-OBI on a toy MDP. We compare LDP-OBI with the non-private algorithm UCB-VI [32]. To the best of our knowledge, there is no other LDP algorithm for regret minimization in MDPs in the literature. To increase the comparators, we introduce a novel LDP algorithm based on Thompson sampling [e.g., 12]. (A generic regret-accounting sketch for such a comparison follows the table.)
Researcher Affiliation | Collaboration | Evrard Garcelon, Facebook AI Research & CREST, ENSAE Paris, France (evrard@fb.com); Vianney Perchet, CREST, ENSAE Paris & Criteo AI Lab, Palaiseau, France (vianney@ensae.fr); Ciara Pike-Burke, Imperial College London, London, United Kingdom (c.pikeburke@gmail.com); Matteo Pirotta, Facebook AI Research, Paris, France (matteo.pirotta@gmail.com)
Pseudocode | Yes | Algorithm 1: Locally Private Episodic RL; Algorithm 2: LDP-OBI(M)
Open Source Code | No | The paper does not provide any links to open-source code for the methodology described, nor does it explicitly state that code will be made available.
Open Datasets | No | The paper describes using a "Random MDP environment described in [25]" where parameters are sampled to generate the MDP. This indicates that a synthetic environment is generated for the experiments rather than a pre-existing, publicly available dataset with concrete access information.
Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. It describes a randomly generated MDP environment for simulations, not a fixed dataset with partitions.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific libraries).
Experiment Setup | Yes | We consider the Random MDP environment described in [25] where, for each state-action pair, transition probabilities are sampled from a Dirichlet(α) distribution (with α_{s,a,s'} = 0.1 for all (s, a, s')) and rewards are deterministic in {0, 1} with r(s, a) = 1{U_{s,a} ≤ 0.5}, where (U_{s,a})_{(s,a) ∈ S×A} ~ U([0, 1]) are sampled once when generating the MDP. We set the number of states S = 2, the number of actions A = 2, and the horizon H = 2. We evaluate the regret of our algorithm for ε ∈ {0.2, 2, 20} and K = 1 × 10^8 episodes. For each ε, we run 20 simulations. Confidence intervals are the minimum and maximum runs. (A sketch of this environment and experiment grid follows the table.)
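
As a concrete illustration of the quoted experiment setup, the following is a minimal NumPy sketch of how the Random MDP environment and experiment grid could be generated. The function name make_random_mdp, the seed handling, and the ≤ 0.5 reward threshold are assumptions made for illustration; since no code is released, this is not the authors' implementation.

import numpy as np

def make_random_mdp(n_states=2, n_actions=2, horizon=2, alpha=0.1, seed=0):
    """Sample the Random MDP described above: Dirichlet(alpha) transitions for each
    (s, a) pair and deterministic {0, 1} rewards drawn once from U([0, 1])."""
    rng = np.random.default_rng(seed)
    # P[s, a] is a distribution over next states, sampled from Dirichlet(0.1, ..., 0.1)
    P = rng.dirichlet(alpha * np.ones(n_states), size=(n_states, n_actions))
    # r(s, a) = 1{U_{s,a} <= 0.5} with U_{s,a} ~ U([0, 1]) (threshold direction assumed)
    U = rng.uniform(size=(n_states, n_actions))
    R = (U <= 0.5).astype(float)
    return P, R, horizon

# Experiment grid from the quoted setup (no learning agent is included here)
epsilons = [0.2, 2.0, 20.0]  # local differential privacy levels
K = 10**8                    # episodes per run
n_runs = 20                  # independent simulations per epsilon; reported
                             # confidence intervals are the min and max runs
P, R, H = make_random_mdp(seed=0)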
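
The Research Type row describes comparing LDP-OBI against the non-private UCB-VI and an LDP Thompson-sampling baseline in terms of regret. The sketch below shows only generic regret bookkeeping for a finite-horizon MDP (optimal values by backward induction, per-episode gap for whichever policy an agent plays); it does not implement LDP-OBI, UCB-VI, or the Thompson-sampling baseline, the uniform-random policy is a stand-in agent, and it reuses make_random_mdp from the sketch above.

import numpy as np

def optimal_values(P, R, H):
    """Backward induction: V[h][s] = max_a ( R[s, a] + sum_s' P[s, a, s'] * V[h+1][s'] )."""
    S, A = R.shape
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        Q_h = R + P @ V[h + 1]   # shape (S, A)
        V[h] = Q_h.max(axis=1)
    return V

def policy_values(P, R, H, pi):
    """Value of a deterministic, stage-dependent policy pi with shape (H, S)."""
    S, A = R.shape
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        for s in range(S):
            a = pi[h, s]
            V[h, s] = R[s, a] + P[s, a] @ V[h + 1]
    return V

# Per-episode regret at a fixed initial state s0 is V*_1(s0) - V^{pi_k}_1(s0);
# cumulative regret sums this over k = 1, ..., K episodes.
rng = np.random.default_rng(1)
P, R, H = make_random_mdp(seed=0)
V_star = optimal_values(P, R, H)
s0, regret = 0, 0.0
for k in range(1000):  # short illustrative run, not the paper's K = 10^8 episodes
    pi_k = rng.integers(0, R.shape[1], size=(H, R.shape[0]))  # stand-in random policy
    regret += V_star[0, s0] - policy_values(P, R, H, pi_k)[0, s0]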