AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks
Authors: Alexandra Peste, Eugenia Iofinova, Adrian Vladu, Dan Alistarh
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform an extensive empirical investigation, showing that AC/DC provides consistently good results on a wide range of models and tasks (ResNet [28] and MobileNets [30] on the ImageNet [49] / CIFAR [36] datasets, and Transformers [56, 10] on WikiText [42]), under standard values of the training hyper-parameters. Specifically, when executed on the same number of training epochs, our method outperforms all previous sparse training methods in terms of the accuracy of the resulting sparse model, often by significant margins. |
| Researcher Affiliation | Collaboration | Alexandra Peste (IST Austria), Eugenia Iofinova (IST Austria), Adrian Vladu (CNRS & IRIF), Dan Alistarh (IST Austria & Neural Magic) |
| Pseudocode | Yes | Please see Algorithm 1 for pseudocode. |
| Open Source Code | Yes | The code is available at: https://github.com/IST-DASLab/ACDC. |
| Open Datasets | Yes | We tested AC/DC on image classification tasks (CIFAR-100 [36] and ImageNet [49]) and on language modelling tasks [42] using the Transformer-XL model [10]. |
| Dataset Splits | No | The paper frequently mentions |
| Hardware Specification | No | The paper mentions |
| Software Dependencies | No | The paper states: |
| Experiment Setup | Yes | In all reported results, the models were trained for a fixed number of 100 epochs, using SGD with momentum. We use a cosine learning rate scheduler and training hyper-parameters following [37], but without label smoothing. The models were trained and evaluated using mixed precision (FP16). ... For all results, the AC/DC training schedule starts with a warm-up phase of dense training for 10 epochs, after which we alternate between compression and de-compression every 5 epochs, until the last dense and sparse phase. It is beneficial to allow these last two fine-tuning phases to run longer: the last decompression phase runs for 10 epochs, whereas the final 15 epochs are the compression fine-tuning phase. We reset SGD momentum at the beginning of every decompression phase. |
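The quoted setup describes the AC/DC phase schedule (10-epoch dense warm-up, 5-epoch alternation between compressed and decompressed phases, a final 10-epoch dense phase followed by 15 epochs of sparse fine-tuning, with SGD momentum reset at the start of each decompression phase). The sketch below illustrates how such a schedule could be wired up with global magnitude pruning in PyTorch. It is a minimal illustration of that schedule, not the authors' released code; the helper names (`magnitude_masks`, `is_sparse_phase`, `train_acdc`) and the per-step mask re-application are assumptions for clarity.

```python
# Hedged sketch of an AC/DC-style training loop (not the authors' implementation).
# Phase lengths follow the quoted setup: 10 warm-up epochs, 5-epoch alternation,
# a last 10-epoch dense phase, and 15 final epochs of sparse fine-tuning.
import torch


def magnitude_masks(model, sparsity):
    """Binary masks keeping the largest-magnitude weights of each weight tensor."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:  # skip biases / normalization parameters
            continue
        k = int(sparsity * p.numel())
        if k == 0:
            masks[name] = torch.ones_like(p)
            continue
        threshold = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() > threshold).float()
    return masks


def is_sparse_phase(epoch, total_epochs=100, warmup=10, period=5,
                    last_dense=10, final_sparse=15):
    """True if the epoch falls in a compressed (sparse) phase."""
    if epoch < warmup:
        return False                                  # dense warm-up
    if epoch >= total_epochs - final_sparse:
        return True                                   # final sparse fine-tuning
    if epoch >= total_epochs - final_sparse - last_dense:
        return False                                  # last dense phase
    return ((epoch - warmup) // period) % 2 == 0      # alternate every 5 epochs


def train_acdc(model, loader, loss_fn, sparsity=0.9, total_epochs=100, lr=0.1):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    masks, prev_sparse = None, False
    for epoch in range(total_epochs):
        sparse = is_sparse_phase(epoch, total_epochs)
        if sparse and not prev_sparse:
            masks = magnitude_masks(model, sparsity)  # recompute masks at phase start
        if prev_sparse and not sparse:
            # reset SGD momentum at the beginning of a decompression (dense) phase
            optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        prev_sparse = sparse
        for x, y in loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
            if sparse:  # keep pruned weights at zero during compressed phases
                with torch.no_grad():
                    for name, p in model.named_parameters():
                        if name in masks:
                            p.mul_(masks[name])
```

In this sketch the masks are recomputed only at the start of each compressed phase and re-applied after every optimizer step to keep pruned weights at zero; the learning-rate schedule, mixed precision, and the exact pruning criterion from the paper are omitted.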