A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization
Authors: Ashwinee Panda, Xinyu Tang, Saeed Mahloujifar, Vikash Sehwag, Prateek Mittal
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We obtain state-of-the-art performance on 22 benchmark tasks, across computer vision and natural language processing, across pretraining and finetuning, across architectures and a wide range of ε ∈ [0.01, 8.0], all while accounting for the privacy cost of HPO. We provide results on a range of image classification, distribution shift, and natural language processing tasks, for both finetuning of models pretrained on public data and training from scratch without any additional data. Due to the large scope of our evaluation, we defer all experimental details and full results for all datasets and models to Appendix A. |
| Researcher Affiliation | Academia | Ashwinee Panda*, Xinyu Tang*, Saeed Mahloujifar, Vikash Sehwag, Prateek Mittal; Princeton University. |
| Pseudocode | Yes | Algorithm 1 Model Training Subroutine, Algorithm 2 Adaptive HPO Routine, Algorithm 3 Hyperparameter Sweep Subroutine |
| Open Source Code | Yes | We also provide the code to reproduce our results at this link. Our code is available at the following URL: https://github.com/kiddyboots216/dp-custom. |
| Open Datasets | Yes | Datasets. ImageNet (Deng et al., 2009), CIFAR10 (training from scratch and finetuning), CIFAR100 (Krizhevsky et al., 2009), Fashion-MNIST (Xiao et al., 2017), STL10 (Coates et al., 2011), EMNIST (Cohen et al., 2017). ... For NLP tasks we consider SQuAD (Rajpurkar et al., 2016) for question answering, text classification tasks from the GLUE benchmark (Wang et al., 2019a): SST-2, QNLI, QQP, MNLI (m/mm); for next-word generation we use PersonaChat (Zhang et al., 2018a), WikiText-2 (Merity et al., 2017), and Enron Emails (Klimt & Yang, 2004). |
| Dataset Splits | No | The paper mentions 'validation dataset' and uses train/test sets, but does not explicitly provide specific percentages, counts, or a detailed methodology for dataset splits required for reproduction. |
| Hardware Specification | Yes | However, we lack the computational resources to do full fine-tuning of large transformers, but we can do linear probing of the extracted features in under an hour on a single A100. |
| Software Dependencies | No | The paper mentions PyTorch, timm, and specific libraries for privacy accounting and clipping, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Hyperparameter Search Space. We use a logarithmic grid for the learning rate η ∈ [10⁻⁷, 10⁻⁴]. For the HPO, we do 3 runs at ε = 0.1, followed by 3 runs at ε = 0.2 and a final run at ε = 0.88, which produces a cumulative privacy cost including HPO of ε = 1.0. Table 13. Our method fixes six design choices: the architecture and initialization (for CV tasks only), the batch size (full batch), the optimizer (SGD with momentum = 0.9), the accounting method (PLV, where all prior HPO methods use RDP), and the clipping norm (unit clipping). Table 19. Set of hyperparameters used in finetuning GPT-2: Clipping Norm 0.1; Learning Rate [2, 5, 10, 20, 50] × 10⁻⁵; Batch Size [64, 128, 256, 512, 1024, 2048]; Epochs [3, 10, 20]. (See the code sketches below.) |
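
The experiment-setup row above describes an adaptive ε schedule for HPO (3 runs at ε = 0.1, 3 at ε = 0.2, one at ε = 0.88) over a logarithmic learning-rate grid. The following is a minimal, hypothetical Python sketch of that schedule; `train_and_eval`, the sweep order, and the synthetic score are illustrative assumptions, and the naive basic-composition sum shown (1.78) is only an upper bound where the paper's PLV accountant yields the reported cumulative ε = 1.0.

```python
import math
import random

# Hedged sketch of the adaptive HPO schedule quoted above: 3 probe runs
# at eps = 0.1, 3 at eps = 0.2, then a final run at eps = 0.88. Naively
# summing gives 1.78; the paper's PLV accountant composes these runs to
# the reported cumulative cost of eps = 1.0 (not implemented here).

LR_GRID = [10.0 ** e for e in range(-7, -3)]  # log grid over [1e-7, 1e-4]
SCHEDULE = [0.1] * 3 + [0.2] * 3 + [0.88]     # per-run privacy budgets

def train_and_eval(lr, epsilon):
    """Stand-in for Algorithm 1 (Model Training Subroutine).

    A real run would train with DP-SGD (full batch, SGD with momentum
    0.9, unit clipping, per the paper) under budget `epsilon` and report
    validation accuracy; a synthetic score keeps this sketch runnable.
    """
    return -abs(math.log10(lr) + 5.5) + random.gauss(0.0, 0.05)

def adaptive_hpo():
    """Hypothetical outline of Algorithm 2 (Adaptive HPO Routine)."""
    spent, best_lr, best_score = 0.0, None, -float("inf")
    for run, eps in enumerate(SCHEDULE):
        lr = LR_GRID[run % len(LR_GRID)]  # naive sweep order (assumption)
        score = train_and_eval(lr, eps)
        spent += eps                      # basic composition, not PLV
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, spent

if __name__ == "__main__":
    lr, naive_eps = adaptive_hpo()
    print(f"selected lr = {lr:.0e}, naive epsilon sum = {naive_eps:.2f}")
```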
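
Table 19's flattened search space is easier to audit as an explicit grid. The values below are taken verbatim from the table; the enumeration loop is an illustrative assumption, not the authors' published code.

```python
from itertools import product

# Search space from Table 19 (GPT-2 finetuning). Only the values are
# from the paper; the enumeration itself is an illustrative assumption.
CLIP_NORM = 0.1
LEARNING_RATES = [m * 1e-5 for m in (2, 5, 10, 20, 50)]
BATCH_SIZES = [64, 128, 256, 512, 1024, 2048]
EPOCHS = [3, 10, 20]

configs = [
    {"clip_norm": CLIP_NORM, "lr": lr, "batch_size": bs, "epochs": ep}
    for lr, bs, ep in product(LEARNING_RATES, BATCH_SIZES, EPOCHS)
]
assert len(configs) == 5 * 6 * 3  # 90 candidate settings in total
print(configs[0])
```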