Optimal Unbiased Randomizers for Regression with Label Differential Privacy

Authors: Ashwinkumar Badanidiyuru Varadaraja, Badih Ghazi, Pritish Kamath, Ravi Kumar, Ethan Leeman, Pasin Manurangsi, Avinash V Varadarajan, Chiyuan Zhang

NeurIPS 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate that these randomizers achieve state-of-the-art privacy-utility trade-offs on several datasets, highlighting the importance of reducing bias when training neural networks with label DP.
Researcher Affiliation Industry Ashwinkumar Badanidiyuru, Google, Mountain View, CA; Badih Ghazi, Google Research, Mountain View, CA; Pritish Kamath, Google Research, Mountain View, CA; Ravi Kumar, Google Research, Mountain View, CA; Ethan Leeman, Google, Cambridge, MA; Pasin Manurangsi, Google Research, Bangkok, Thailand; Avinash V Varadarajan, Google, Mountain View, CA; Chiyuan Zhang, Google Research, Mountain View, CA
Pseudocode Yes Algorithm 1 ComputeOptUnbiasedRand_ε. Parameters: privacy parameter ε > 0. Input: P = (p_y)_{y∈Y}, a prior over input labels Y; Ŷ = (ŷ_i)_{i∈I}, a finite sequence of potential output labels. Output: an ε-DP label randomizer. Solve the following LP in variables M = (M_{y,i})_{y∈Y, i∈I}: minimize Σ_{y∈Y} p_y Σ_{i∈I} M_{y,i} · g(ŷ_i, y), subject to [Non-negativity] ∀ y∈Y, i∈I: M_{y,i} ≥ 0; [Normalization] ∀ y∈Y: Σ_{i∈I} M_{y,i} = 1; [ε-Label DP] ∀ i∈I and y, y′∈Y with y ≠ y′: M_{y,i} ≤ e^ε · M_{y′,i}; [Unbiasedness] ∀ y∈Y: Σ_{i∈I} M_{y,i} · ŷ_i = y. Return the label randomizer M mapping Y to Ŷ given by Pr[M(y) = ŷ_i] = M_{y,i}.
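Algorithm 1 is a linear program, so it can be solved directly with an off-the-shelf LP solver. The following is a minimal sketch, not the authors' implementation: it assumes finite input/output label lists, uses `scipy.optimize.linprog`, and defaults `g` to the squared error; the function name and interface are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

def opt_unbiased_randomizer(eps, prior, in_labels, out_labels, g=None):
    """Sketch of Algorithm 1: solve the LP for an optimal eps-label-DP
    unbiased randomizer. g is the loss g(yhat, y); squared error by default.
    (Hypothetical helper, not from the paper's code.)"""
    if g is None:
        g = lambda yhat, y: (yhat - y) ** 2
    out = np.asarray(out_labels, dtype=float)
    Y, I = len(in_labels), len(out)
    n = Y * I                        # one LP variable M[y, i] per pair
    idx = lambda y, i: y * I + i     # flatten (y, i) -> variable index

    # Objective: minimize sum_y p_y * sum_i M[y,i] * g(yhat_i, y)
    c = np.zeros(n)
    for y in range(Y):
        for i in range(I):
            c[idx(y, i)] = prior[y] * g(out[i], in_labels[y])

    # Equality constraints: normalization and unbiasedness, one pair per y
    A_eq, b_eq = [], []
    for y in range(Y):
        row = np.zeros(n); row[y * I:(y + 1) * I] = 1.0
        A_eq.append(row); b_eq.append(1.0)            # sum_i M[y,i] = 1
        row = np.zeros(n); row[y * I:(y + 1) * I] = out
        A_eq.append(row); b_eq.append(in_labels[y])   # sum_i M[y,i]*yhat_i = y

    # eps-label-DP: M[y,i] - e^eps * M[y',i] <= 0 for all i and y != y'
    A_ub, b_ub = [], []
    for i in range(I):
        for y in range(Y):
            for yp in range(Y):
                if y == yp:
                    continue
                row = np.zeros(n)
                row[idx(y, i)] = 1.0
                row[idx(yp, i)] = -np.exp(eps)
                A_ub.append(row); b_ub.append(0.0)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    assert res.status == 0, "LP infeasible for this label set / eps"
    return res.x.reshape(Y, I)       # M[y, i] = Pr[randomizer maps y to yhat_i]
```

Note that the output label set Ŷ generally needs to extend beyond the range of the input labels for the unbiasedness constraint to be satisfiable under the DP ratio bound.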
Open Source Code No The paper does not provide any explicit statements about making the source code available or links to a code repository for the described methodology.
Open Datasets Yes The Criteo Sponsored Search Conversion Log Dataset [TY18] is a collection of 15,995,634 data points derived from a sample of 90-day logs of live traffic from Criteo Predictive Search (CPS). The 1940 US Census dataset is widely used in the evaluation of data analysis with DP [WDZ+19, CJG21, GKM+21]. This dataset, digitally released in 2012, contains 131,903,909 examples.
Dataset Splits Yes The training was performed on a random 80% of the dataset using the RMSProp algorithm with the squared loss objective, with a learning rate of 10^-4, ℓ2-regularization of 10^-4, and a batch size of 1,024 for 50 epochs. The remaining 20% of the dataset was used to report the test loss.
Hardware Specification Yes All our experiments were performed using NVIDIA P100 GPUs.
Software Dependencies No The paper mentions software components such as the RMSProp algorithm but does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup Yes The training was performed on a random 80% of the dataset using the RMSProp algorithm with the squared loss objective, with a learning rate of 10^-4, ℓ2-regularization of 10^-4, and a batch size of 1,024 for 50 epochs.
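The reported setup (random 80/20 split, RMSProp, squared loss, ℓ2-regularization 10^-4, batch size 1,024) can be sketched in plain NumPy. This is not the paper's training code: a linear model stands in for their neural network, the RMSProp decay and epsilon values are assumed defaults, and the function name is hypothetical.

```python
import numpy as np

def train_rmsprop(X, y, lr=1e-4, l2=1e-4, batch_size=1024, epochs=50,
                  decay=0.9, rms_eps=1e-8, seed=0):
    """Sketch of the reported setup: RMSProp on squared loss with l2
    regularization, random 80% train / 20% test split. The linear model
    is a stand-in for the paper's network; decay/rms_eps are assumptions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    cut = int(0.8 * len(X))                  # random 80% for training
    tr, te = perm[:cut], perm[cut:]
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)                     # RMSProp second-moment accumulator
    for _ in range(epochs):
        for s in range(0, len(tr), batch_size):
            b = tr[s:s + batch_size]
            # Gradient of mean squared loss plus l2 penalty
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b) + 2 * l2 * w
            v = decay * v + (1 - decay) * grad ** 2
            w -= lr * grad / (np.sqrt(v) + rms_eps)
    test_loss = np.mean((X[te] @ w - y[te]) ** 2)  # remaining 20% -> test loss
    return w, test_loss
```

With the paper's learning rate of 10^-4 this needs many steps to converge; the sketch only illustrates the configuration, not the reported results.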