Local Differential Privacy for Regret Minimization in Reinforcement Learning
Authors: Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the empirical performance of LDP-OBI on a toy MDP. We compare LDP-OBI with the non-private algorithm UCB-VI [32]. To the best of our knowledge, there is no other LDP algorithm for regret minimization in MDPs in the literature. To broaden the set of comparators, we introduce a novel LDP algorithm based on Thompson sampling [e.g., 12]. |
| Researcher Affiliation | Collaboration | Evrard Garcelon, Facebook AI Research & CREST, ENSAE Paris, France, evrard@fb.com; Vianney Perchet, CREST, ENSAE Paris & Criteo AI Lab, Palaiseau, France, vianney@ensae.fr; Ciara Pike-Burke, Imperial College London, London, United Kingdom, c.pikeburke@gmail.com; Matteo Pirotta, Facebook AI Research, Paris, France, matteo.pirotta@gmail.com |
| Pseudocode | Yes | Algorithm 1: Locally Private Episodic RL; Algorithm 2: LDP-OBI(M). A hedged sketch of a local randomizer in the style of Algorithm 1 follows the table. |
| Open Source Code | No | The paper does not provide any links to open-source code for the methodology described, nor does it explicitly state that code will be made available. |
| Open Datasets | No | The paper describes using a "Random MDP environment described in [25]" where parameters are sampled to generate the MDP. This indicates a synthetic environment is generated for experiments rather than using a pre-existing, publicly available dataset with concrete access information. |
| Dataset Splits | No | The paper does not specify training, validation, or test dataset splits. It describes a randomly generated MDP environment for simulations, not a fixed dataset with partitions. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific libraries). |
| Experiment Setup | Yes | We consider the Random MDP environment described in [25], where for each state-action pair the transition probabilities are sampled from a Dirichlet(α) distribution (with α_{s,a,s'} = 0.1 for all (s, a, s')) and rewards are deterministic in {0, 1}, with r(s, a) = 1{U_{s,a} ≥ 0.5} for (U_{s,a})_{(s,a) ∈ S × A} ~ U([0, 1]) sampled once when generating the MDP. We set the number of states S = 2, the number of actions A = 2, and the horizon H = 2. We evaluate the regret of our algorithm for ε ∈ {0.2, 2, 20} and K = 10^8 episodes. For each ε, we run 20 simulations. Confidence intervals are the pointwise minimum and maximum over runs. |
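
As a concrete illustration of the experiment setup above, the following is a minimal sketch (Python with NumPy) of how such a Random MDP can be generated. The function name `make_random_mdp` and the fixed seed are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def make_random_mdp(S=2, A=2, H=2, alpha=0.1, seed=0):
    """Generate a Random MDP as in the setup above: per-(s, a) transition
    distributions drawn from a symmetric Dirichlet(alpha), and deterministic
    {0, 1} rewards obtained by thresholding uniforms sampled once."""
    rng = np.random.default_rng(seed)
    # P[s, a] is a probability vector over next states: Dirichlet(alpha).
    P = rng.dirichlet(np.full(S, alpha), size=(S, A))
    # r(s, a) = 1{U_{s,a} >= 0.5} with U_{s,a} ~ U([0, 1]), drawn once.
    U = rng.uniform(size=(S, A))
    r = (U >= 0.5).astype(float)
    return P, r, H
```

With α = 0.1 the Dirichlet draws are sparse, so each state-action pair tends to concentrate its transition mass on a few next states.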
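
The paper's Algorithm 1 (Locally Private Episodic RL) has each user privatize their own trajectory statistics before anything reaches the learner. Below is a minimal sketch of one standard local randomizer in that spirit, adding Laplace noise to visit counts and reward sums. The sensitivity bound (one episode of horizon H changes each count array by at most H in L1 norm), the resulting scale 2H/ε, and the even split of the privacy budget are illustrative assumptions, not the paper's exact calibration.

```python
import numpy as np

def laplace_randomizer(visit_counts, reward_sums, eps, H, rng=None):
    """Locally privatize one user's episode statistics: the user adds
    Laplace noise to their own counts, so the server never sees a raw
    trajectory. The budget eps is split evenly across the two statistics."""
    rng = rng or np.random.default_rng()
    # One episode changes each array by at most H in L1 norm (assumption),
    # so the Laplace scale is b = H / (eps / 2) = 2H / eps per statistic.
    scale = 2.0 * H / eps
    noisy_visits = visit_counts + rng.laplace(0.0, scale, size=visit_counts.shape)
    noisy_rewards = reward_sums + rng.laplace(0.0, scale, size=reward_sums.shape)
    return noisy_visits, noisy_rewards
```

The learner then aggregates such noisy statistics across users to build the optimistic estimates that LDP-OBI uses for planning.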