Private Adaptive Optimization with Side Information

Authors: Tian Li, Manzil Zaheer, Sashank Reddi, Virginia Smith

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we leverage simple and readily available side information to explore the performance of AdaDPS in practice, comparing to strong baselines in both centralized and federated settings.
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Google DeepMind, 3 Google Research
Pseudocode | Yes | AdaDPS in centralized training is summarized in Algorithm 1 (a sketch of the update appears after the table).
Open Source Code | Yes | Our code is publicly available at github.com/litian96/AdaDPS.
Open Datasets | Yes | Datasets. We consider common benchmarks for adaptive optimization in centralized or federated settings (Amid et al., 2021; Reddi et al., 2018a; 2021) involving varying types of models (both convex and non-convex) and data (both text and image data). Stack Overflow (Authors, 2019) consists of posts on the Stack Overflow website, where the task is tag prediction (500-class classification). IMDB (Maas et al., 2011) is widely used for binary sentiment classification of movie reviews, consisting of 25,000 training and 25,000 testing samples. MNIST (LeCun et al., 1998) images are used with a deep autoencoder model (for image reconstruction), which has the same architecture as in previous work (Reddi et al., 2018a) and contains more than 2 million parameters (a model sketch follows the table).
Dataset Splits | Yes | For non-private training experiments, we fix the mini-batch size to 64, and tune fixed learning rates by performing a grid search over {0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2} separately for all methods on validation data (a grid-search sketch follows the table). IMDB (Maas et al., 2011) is widely used for binary sentiment classification of movie reviews, consisting of 25,000 training and 25,000 testing samples.
Hardware Specification | No | The paper does not explicitly describe the hardware used for experiments (e.g., specific GPU/CPU models, memory details).
Software Dependencies | No | The paper does not provide specific version numbers for ancillary software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | For non-private training experiments, we fix the mini-batch size to 64, and tune fixed learning rates by performing a grid search over {0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2} separately for all methods on validation data. For differentially private training, the δ values in the privacy budget are always the inverse of the number of training samples. We fix the noise multiplier σ for each dataset, tune the clipping threshold, and compute the final ε values. Specifically, the σ values are 1, 1, and 0.95 for IMDB (convex), IMDB (LSTM), and Stack Overflow; 1 and 0.75 for MNIST (autoencoder). The clipping threshold C (in Algorithm 1) is tuned from {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 3}, jointly with tuning the (fixed) learning rates. The number of micro-batches is 16 for all related experiments, and the mini-batch size is 64 (i.e., we privatize each gradient averaged over 4 individual ones to speed up computation; see the micro-batching sketch after the table).
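
The Pseudocode row only quotes that Algorithm 1 summarizes AdaDPS in centralized training. As a rough illustration of the idea, the sketch below preconditions each per-example gradient with non-sensitive side information before clipping and noising, so the privatized update behaves adaptively. The function names, the reciprocal-square-root preconditioner, and the noise calibration are illustrative assumptions, not the authors' exact Algorithm 1.

```python
import numpy as np

def precondition(grad, side_info, eps=1e-5):
    """Scale the gradient by side information (e.g., per-coordinate gradient
    magnitudes estimated from public data); this specific form is an assumption."""
    return grad / (np.sqrt(side_info) + eps)

def adadps_step(w, per_example_grads, side_info, lr=0.1, clip=1.0, sigma=1.0):
    """One privatized, preconditioned SGD step (illustrative sketch)."""
    acc = np.zeros_like(w)
    for g in per_example_grads:                                # one gradient per example
        g = precondition(g, side_info)                         # apply side information first
        g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))   # clip the preconditioned gradient
        acc += g
    acc += sigma * clip * np.random.normal(size=w.shape)       # Gaussian mechanism on the sum
    return w - lr * acc / len(per_example_grads)
```

The design point implied by the quoted text is that the side information is non-sensitive and readily available, so using it to precondition gradients before privatization spends no additional privacy budget.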
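The Open Datasets row mentions a deep MNIST autoencoder with more than 2 million parameters but does not spell out the layer sizes. The sketch below assumes the classic 784-1000-500-250-30 fully connected benchmark layout, which matches that parameter count; the exact architecture and activations are assumptions, not details confirmed by the quoted text.

```python
import torch.nn as nn

def mnist_autoencoder():
    """Assumed 784-1000-500-250-30 deep autoencoder with a mirrored decoder."""
    dims = [784, 1000, 500, 250, 30]
    enc, dec = [], []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        enc += [nn.Linear(d_in, d_out), nn.ReLU()]
    rev = list(reversed(dims))
    for d_in, d_out in zip(rev[:-1], rev[1:]):
        dec += [nn.Linear(d_in, d_out), nn.ReLU()]
    dec[-1] = nn.Sigmoid()  # reconstruct pixel intensities in [0, 1]
    return nn.Sequential(*enc, *dec)

model = mnist_autoencoder()
print(sum(p.numel() for p in model.parameters()))  # ~2.8M, consistent with "more than 2 million"
```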
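The tuning protocol in the Dataset Splits and Experiment Setup rows is a plain grid search over fixed learning rates on validation data. A minimal sketch follows; `train_and_eval` is a hypothetical helper (train with a given learning rate, return a validation metric) standing in for the actual training loop, not part of the paper's code.

```python
# Learning-rate grid from the quoted experiment setup; mini-batch size fixed to 64.
LEARNING_RATES = [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.5, 1, 2]

def tune_learning_rate(train_and_eval, batch_size=64):
    """Pick the fixed learning rate with the best validation metric."""
    best_lr, best_val = None, float("-inf")
    for lr in LEARNING_RATES:
        val_metric = train_and_eval(lr=lr, batch_size=batch_size)
        if val_metric > best_val:
            best_lr, best_val = lr, val_metric
    return best_lr
```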
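The Experiment Setup row describes a mini-batch of 64 split into 16 micro-batches, so each clipped-and-noised gradient is an average over 4 examples. Below is a sketch of that privatization step, assuming a hypothetical `grad_fn` that returns the average gradient of a micro-batch; the final ε would come from a privacy accountant with δ set to 1/N, which is not shown here.

```python
import numpy as np

def private_minibatch_grad(grad_fn, minibatch, w, C=0.1, sigma=1.0, n_micro=16):
    """Clip each micro-batch gradient to norm C, add Gaussian noise scaled by
    sigma * C to the sum, and return the averaged privatized gradient (sketch)."""
    micro_batches = np.array_split(minibatch, n_micro)        # 64 examples -> 16 micro-batches of 4
    total = np.zeros_like(w)
    for mb in micro_batches:
        g = grad_fn(w, mb)                                     # average gradient over the micro-batch
        g = g * min(1.0, C / (np.linalg.norm(g) + 1e-12))      # clip to norm at most C
        total += g
    total += sigma * C * np.random.normal(size=w.shape)        # Gaussian mechanism on the clipped sum
    return total / n_micro
```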