Sparsity-Preserving Differentially Private Training of Large Embedding Models

Authors: Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, Chiyuan Zhang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we evaluate the performance of our sparsity-preserving training algorithms and compare them against vanilla DP-SGD on both the recommendation and language understanding tasks (Section 4.1).
Researcher Affiliation | Collaboration | Badih Ghazi, Google Research, Mountain View, CA, badihghazi@gmail.com; Yangsibo Huang, Princeton University, Princeton, NJ, yangsibo@princeton.edu
Pseudocode | Yes | Algorithm 1: DP-AdaFEST: Adaptive Filtering-Enabled Sparse Training.
Open Source Code | No | The paper references Google's DP library and provides a link to its GitHub repository for privacy accounting, but it does not state that the code for the methodology described in this paper is open-source or publicly available.
Open Datasets | Yes | We evaluate our algorithms on the widely-used Criteo predicted click-through rate (pCTR) dataset, which includes over four billion ad impressions over 24 days. ... https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset ... For language understanding tasks, we employ language models from the BERT family [DCLT19]. ... We fine-tune the RoBERTa model for downstream classification tasks from the GLUE benchmark [WSM+19], including SST-2 [SPW+13], QNLI [RZLL16], and QQP [IYW+17].
Dataset Splits | Yes | To simulate a real-world online training scenario, we train the model on the first 18 days of data and evaluate it on subsequent days (i.e., days 19–24).
Hardware Specification | No | The paper mentions 'Google TPUs' as a motivation for sparsity but does not provide specific hardware details (such as exact GPU/CPU models, processor types, or TPU versions) used for running its main experiments or simulations.
Software Dependencies | No | The paper mentions software like 'TensorFlow' and 'tf.keras.layers.Embedding' but does not provide specific version numbers for these or any other ancillary software components used in the experiments.
Experiment Setup | Yes | For DP-SGD, we fine-tune the clipping norm and report the best accuracy achieved. When evaluating DP-FEST, we adjust the hyper-parameter k, which represents the number of preserved top buckets, with values ranging from 100 to 300,000. For DP-AdaFEST, we tune the following hyper-parameters: the ratio of the noise added to the contribution map to that added to the sparse gradient, σ1/σ2, with options from 0.1 to 10; the thresholding value τ ∈ {0.5, 1.0, 5.0, 10.0, 20.0, 50.0, 100.0}; and the clipping norm for the gradient contribution, C1 ∈ {1.0, 5.0, 10.0}.
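
The sketches below expand on the Pseudocode, Open Datasets, Dataset Splits, and Experiment Setup rows; all are illustrative reconstructions, not the authors' code.

The Pseudocode row refers to Algorithm 1 (DP-AdaFEST). Below is a minimal NumPy sketch of one such update step, based only on the hyper-parameters quoted in the Experiment Setup row (τ, C1, σ1/σ2); every function and variable name is an assumption, and a second clipping norm c2 for the sparse gradient is introduced purely for illustration.

```python
import numpy as np

def adafest_step(per_example_grads, tau, c1, c2, sigma1, sigma2, rng):
    """One hypothetical DP-AdaFEST-style embedding update.

    per_example_grads: array of shape (batch, vocab, dim) holding each
    example's gradient with respect to the embedding table.
    """
    batch, vocab, dim = per_example_grads.shape

    # 1. Per-example contribution map: 1 where an example touches a bucket,
    #    clipped to L2 norm at most c1 (the "clipping norm for gradient
    #    contribution" from the Experiment Setup row).
    contrib = (np.abs(per_example_grads).sum(axis=2) > 0).astype(float)
    norms = np.linalg.norm(contrib, axis=1, keepdims=True)
    contrib *= np.minimum(1.0, c1 / np.maximum(norms, 1e-12))

    # 2. Add Gaussian noise (scale sigma1) to the aggregated contribution map
    #    and adaptively keep only buckets above the threshold tau.
    noisy_contrib = contrib.sum(axis=0) + rng.normal(0.0, sigma1 * c1, size=vocab)
    selected = noisy_contrib > tau

    # 3. DP-SGD on the selected rows only: clip each example's gradient to L2
    #    norm c2, sum, and add Gaussian noise (scale sigma2) to selected rows.
    g = per_example_grads.copy()
    gnorms = np.linalg.norm(g.reshape(batch, -1), axis=1).reshape(batch, 1, 1)
    g *= np.minimum(1.0, c2 / np.maximum(gnorms, 1e-12))

    noisy_grad = np.zeros((vocab, dim))
    noisy_grad[selected] = g.sum(axis=0)[selected] + rng.normal(
        0.0, sigma2 * c2, size=(int(selected.sum()), dim))
    return noisy_grad  # exactly zero outside selected buckets: sparsity preserved
```

Rows outside the selected set receive neither gradient nor noise, which is what distinguishes this from vanilla DP-SGD, where Gaussian noise densifies the entire embedding gradient.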
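The GLUE tasks cited in the Open Datasets row (SST-2, QNLI, QQP) are publicly downloadable. One way to fetch them, assuming the Hugging Face datasets library (a tooling choice the paper does not mention), is:

```python
from datasets import load_dataset

# Download the three GLUE classification tasks named in the paper.
sst2 = load_dataset("glue", "sst2")
qnli = load_dataset("glue", "qnli")
qqp = load_dataset("glue", "qqp")

for name, ds in [("SST-2", sst2), ("QNLI", qnli), ("QQP", qqp)]:
    print(name, {split: ds[split].num_rows for split in ds})
```

The Criteo pCTR logs are distributed separately, via the URL quoted in the Open Datasets row.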
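The Dataset Splits row describes a temporal split rather than a random one. A hypothetical rendering, assuming each record carries a day field (the paper does not describe its data pipeline):

```python
TRAIN_DAYS = set(range(1, 19))   # days 1-18 for training
EVAL_DAYS = list(range(19, 25))  # days 19-24, each evaluated separately

def temporal_split(records):
    """Split records by day to mimic the paper's online-training setup."""
    train = [r for r in records if r["day"] in TRAIN_DAYS]
    evals = {d: [r for r in records if r["day"] == d] for d in EVAL_DAYS}
    return train, evals
```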
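Finally, the Experiment Setup row implies a grid search for DP-AdaFEST. A sketch of that grid with itertools.product follows; the intermediate σ1/σ2 points are an assumption, since the quote gives only the range 0.1 to 10:

```python
import itertools

noise_ratios = [0.1, 0.3, 1.0, 3.0, 10.0]        # sigma1 / sigma2 (assumed spacing)
taus = [0.5, 1.0, 5.0, 10.0, 20.0, 50.0, 100.0]  # thresholding value tau
c1s = [1.0, 5.0, 10.0]                           # contribution clipping norm C1

grid = [
    {"noise_ratio": r, "tau": t, "c1": c}
    for r, t, c in itertools.product(noise_ratios, taus, c1s)
]
print(len(grid))  # 105 configurations under the assumed ratio grid
```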