Offline Reinforcement Learning with Differential Privacy
Authors: Dan Qiao, Yu-Xiang Wang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform numerical simulations to evaluate and compare the performance of our algorithm DP-VAPVI (Algorithm 2) with its non-private counterpart VAPVI [Yin et al., 2022] as well as a popular baseline PEVI [Jin et al., 2021]. The results complement the theoretical findings by demonstrating the practicality of DP-VAPVI under strong privacy parameters. |
| Researcher Affiliation | Academia | Dan Qiao, Department of Computer Science, UC Santa Barbara, Santa Barbara, CA 93106, danqiao@ucsb.edu; Yu-Xiang Wang, Department of Computer Science, UC Santa Barbara, Santa Barbara, CA 93106, yuxiangw@cs.ucsb.edu |
| Pseudocode | Yes | Algorithm 1: Differentially Private Adaptive Pessimistic Value Iteration (DP-APVI); Algorithm 2: Differentially Private Variance-Aware Pessimistic Value Iteration (DP-VAPVI). (A hedged sketch of the Gaussian-mechanism privatization step that both algorithms build on appears after the table.) |
| Open Source Code | No | The paper does not include an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We evaluate DP-VAPVI (Algorithm 2) on a synthetic linear MDP example that originates from the linear MDP in [Min et al., 2021, Yin et al., 2022] but with some modifications. For details of the linear MDP setting, please refer to Appendix F. |
| Dataset Splits | Yes | For the offline dataset, we divide it into two independent parts of equal length: D = {(s_h^τ, a_h^τ, r_h^τ, s_{h+1}^τ)}_{h ∈ [H], τ ∈ [K]} and D′ = {(s̃_h^τ, ã_h^τ, r̃_h^τ, s̃_{h+1}^τ)}_{h ∈ [H], τ ∈ [K]}, one for estimating variances and the other for calculating Q-values. (A data-split sketch appears after the table.) |
| Hardware Specification | No | The paper does not mention any specific hardware (e.g., GPU/CPU models, memory) used to run the simulations. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, or solver names with versions) used for the simulations. |
| Experiment Setup | Yes | The two MDP instances we use both have horizon H = 20. The number of episodes takes values from 5 to 1000. For details of the linear MDP setting, please refer to Appendix F. (In Appendix F: The linear MDP example we use consists of \|S\| = 2 states and \|A\| = 100 actions, while the feature dimension is d = 10. ... The behavior policy is to always choose action a = 0 with probability p, and other actions uniformly with probability (1 - p)/99. Here we choose p = 0.6.) A behavior-policy sampling sketch appears after the table. |
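
Both DP-APVI and DP-VAPVI privatize their sufficient statistics before running pessimistic value iteration. As a point of reference for that privatization step, below is a minimal sketch of the standard Gaussian mechanism (Dwork & Roth, 2014); the function name, the calibration constant, and the choice of statistic to perturb are illustrative assumptions, not the paper's exact noise schedule.

```python
import numpy as np

def gaussian_mechanism(stat: np.ndarray, l2_sensitivity: float,
                       epsilon: float, delta: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Release `stat` under (epsilon, delta)-DP by adding Gaussian noise.

    Classical calibration: sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon.
    """
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return stat + rng.normal(scale=sigma, size=stat.shape)

# Example: privatize a d x d covariance-style statistic (d = 10 as in the paper).
rng = np.random.default_rng(0)
Lambda_h = rng.standard_normal((10, 10))
private_Lambda_h = gaussian_mechanism(Lambda_h, l2_sensitivity=1.0,
                                      epsilon=1.0, delta=1e-5, rng=rng)
```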
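The two-fold data split quoted in the Dataset Splits row is straightforward to reproduce. A minimal sketch, assuming trajectories are stored as a NumPy array indexed by episode (the array layout and function name are assumptions for illustration):

```python
import numpy as np

def split_offline_dataset(trajectories: np.ndarray,
                          rng: np.random.Generator):
    """Split an offline dataset into two independent halves of equal length:
    one half for estimating conditional variances, the other for computing
    the pessimistic Q-values (the role each half plays in VAPVI/DP-VAPVI).
    """
    idx = rng.permutation(len(trajectories))
    half = len(trajectories) // 2
    return trajectories[idx[:half]], trajectories[idx[half:]]

# Example: 1000 episodes of horizon H = 20, each step a (s, a, r, s') tuple.
rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 20, 4))
D, D_prime = split_offline_dataset(data, rng)
assert len(D) == len(D_prime) == 500
```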
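Finally, the behavior policy from the Experiment Setup row (action a = 0 with probability p = 0.6, every other action uniformly with probability (1 - p)/99) can be sampled as follows; the constants mirror the quoted setup, and the function name is hypothetical.

```python
import numpy as np

NUM_ACTIONS = 100   # |A| = 100 in the paper's linear MDP example
P_PREFERRED = 0.6   # probability of choosing the preferred action a = 0

def behavior_policy_action(rng: np.random.Generator) -> int:
    """Sample a = 0 w.p. p, and each other action w.p. (1 - p) / 99."""
    probs = np.full(NUM_ACTIONS, (1.0 - P_PREFERRED) / (NUM_ACTIONS - 1))
    probs[0] = P_PREFERRED
    return int(rng.choice(NUM_ACTIONS, p=probs))

# Example: draw one trajectory's worth of actions for horizon H = 20.
rng = np.random.default_rng(0)
actions = [behavior_policy_action(rng) for _ in range(20)]
```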