Private Adaptive Gradient Methods for Convex Optimization

Authors: Hilal Asi, John Duchi, Alireza Fallah, Omid Javidbakht, Kunal Talwar

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conclude the paper with several experiments to demonstrate the performance of the PAGAN and PASAN algorithms. We perform experiments both on synthetic data, where we may control all aspects of the experiment, and on a real-world example: training large-scale private language models.
Researcher Affiliation | Collaboration | Hilal Asi* (1,2), John Duchi (2,3), Alireza Fallah (4,1), Omid Javidbakht (5), Kunal Talwar (5). Affiliations: (1) work done while interning at Apple; (2) Department of Electrical Engineering, Stanford University; (3) Department of Statistics, Stanford University; (4) Department of Electrical Engineering & Computer Science, MIT; (5) Apple.
Pseudocode | Yes | Algorithm 1: Private Adaptive SGD with Adaptive Noise (PASAN); Algorithm 2: Private Adagrad with Adaptive Noise (PAGAN); Algorithm 3: Private Second Moment Estimation. (A hedged sketch of a private Adagrad-style step appears after the table.)
Open Source Code | Yes | The code is available online at https://github.com/apple/ml-private-adaptive-gradient-methods.
Open Datasets | Yes | We train a variant of a recurrent neural network with Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) on the WikiText-2 dataset (Merity et al., 2017), which is split into train, validation, and test sets.
Dataset Splits | Yes | We train a variant of a recurrent neural network with Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) on the WikiText-2 dataset (Merity et al., 2017), which is split into train, validation, and test sets. We further split the train set into 59,674 data points, where each data point has 35 tokens. (See the chunking sketch after the table.)
Hardware Specification | No | The paper mentions 'a standard workstation without any accelerators' for hyper-parameter tuning, but does not provide specific hardware details (e.g., CPU/GPU models, memory) for the main experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | In our experiments, we use the parameters n = 5000, d = 100, σ_j = j^{3/2}, τ = 0.01, and the batch size for all methods is b = 70. As optimization methods are sensitive to stepsize choice in general (even non-privately (Asi & Duchi, 2019)), we run each method with different values of the initial stepsize in {0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.4, 0.5, 1.0} to find the best stepsize value. ... we perform a hyper-parameter search over three algorithm-specific constants: a step-size multiplier α ∈ {0.1, 0.2, 0.4, 0.8, 1.0, 10.0, 50.0}, mini-batch size b ∈ {50, 100, 150, 200, 250}, and projection threshold B ∈ {0.05, 0.1, 0.5, 1.0}. (A grid-search sketch follows the table.)
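
To illustrate the flavour of the algorithms listed under Pseudocode, the following is a minimal sketch of a generic differentially private Adagrad-style step with per-example clipping and isotropic Gaussian noise. It is not the paper's Algorithm 2: PAGAN additionally adapts the noise to a privately estimated second moment (Algorithm 3), which this toy omits. The function name, defaults, and noise calibration below are illustrative assumptions.

```python
# Hedged sketch (not the paper's PAGAN/PASAN): a generic DP Adagrad-style step.
import numpy as np

def dp_adagrad_step(w, per_example_grads, accum, lr=0.1, clip_norm=1.0,
                    noise_mult=1.0, eps=1e-8, rng=np.random.default_rng(0)):
    """One private Adagrad-flavoured update on parameter vector w (shape (d,))."""
    b, d = per_example_grads.shape
    # Clip each example's gradient to L2 norm <= clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Average and add isotropic Gaussian noise calibrated to the clipping norm.
    noisy_grad = clipped.mean(axis=0) + rng.normal(scale=noise_mult * clip_norm / b, size=d)
    # Adagrad: accumulate squared (noisy) gradients and rescale coordinate-wise.
    accum += noisy_grad ** 2
    w = w - lr * noisy_grad / (np.sqrt(accum) + eps)
    return w, accum
```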
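
The Dataset Splits row reports 59,674 training data points of 35 tokens each. A minimal sketch of how such fixed-length examples can be produced from a tokenized corpus is shown below; the function name and the dummy token ids are illustrative, not the paper's preprocessing pipeline.

```python
# Hedged sketch: chunk a tokenized training corpus into 35-token data points.
def chunk_tokens(token_ids, seq_len=35):
    """Drop the trailing remainder and return a list of seq_len-token examples."""
    n_examples = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_examples)]

# With the real WikiText-2 train split this would yield the 59,674 examples
# reported above; here dummy ids give 100 // 35 = 2 examples of 35 tokens.
examples = chunk_tokens(list(range(100)), seq_len=35)
```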
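
Finally, the hyper-parameter grids quoted in the Experiment Setup row can be swept with a plain grid search. The sketch below assumes a placeholder evaluate function (e.g., a private training run scored on validation perplexity); only the three grids themselves are taken from the paper.

```python
# Hedged sketch of the reported hyper-parameter sweep as an exhaustive grid.
from itertools import product

alphas = [0.1, 0.2, 0.4, 0.8, 1.0, 10.0, 50.0]   # step-size multiplier alpha
batch_sizes = [50, 100, 150, 200, 250]           # mini-batch size b
thresholds = [0.05, 0.1, 0.5, 1.0]               # projection threshold B

def evaluate(alpha, b, B):
    """Placeholder objective; replace with a private training + validation run."""
    return 0.0

# Pick the configuration with the lowest score over the full grid.
best = min(product(alphas, batch_sizes, thresholds), key=lambda cfg: evaluate(*cfg))
print("best (alpha, b, B):", best)
```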