AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks

Authors: Alexandra Peste, Eugenia Iofinova, Adrian Vladu, Dan Alistarh

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform an extensive empirical investigation, showing that AC/DC provides consistently good results on a wide range of models and tasks (ResNet [28] and MobileNets [30] on the ImageNet [49] / CIFAR [36] datasets, and Transformers [56, 10] on WikiText [42]), under standard values of the training hyper-parameters. Specifically, when executed on the same number of training epochs, our method outperforms all previous sparse training methods in terms of the accuracy of the resulting sparse model, often by significant margins.
Researcher Affiliation | Collaboration | Alexandra Peste (IST Austria); Eugenia Iofinova (IST Austria); Adrian Vladu (CNRS & IRIF); Dan Alistarh (IST Austria & Neural Magic)
Pseudocode | Yes | Please see Algorithm 1 for pseudocode.
Open Source Code | Yes | The code is available at: https://github.com/IST-DASLab/ACDC.
Open Datasets | Yes | We tested AC/DC on image classification tasks (CIFAR-100 [36] and ImageNet [49]) and on language modelling tasks [42] using the Transformer-XL model [10].
Dataset Splits | No | The paper frequently mentions
Hardware Specification | No | The paper mentions
Software Dependencies | No | The paper states:
Experiment Setup | Yes | In all reported results, the models were trained for a fixed number of 100 epochs, using SGD with momentum. We use a cosine learning rate scheduler and training hyper-parameters following [37], but without label smoothing. The models were trained and evaluated using mixed precision (FP16). ... For all results, the AC/DC training schedule starts with a warm-up phase of dense training for 10 epochs, after which we alternate between compression and de-compression every 5 epochs, until the last dense and sparse phase. It is beneficial to allow these last two fine-tuning phases to run longer: the last decompression phase runs for 10 epochs, whereas the final 15 epochs are the compression fine-tuning phase. We reset SGD momentum at the beginning of every decompression phase.
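
The pseudocode referenced above (Algorithm 1) alternates between projecting the weights onto a fixed-sparsity support via magnitude pruning (compression) and lifting that constraint (decompression). Below is a minimal PyTorch sketch of that primitive, assuming per-layer top-k magnitude masks; the helper names are ours, and the released code in the linked repository may differ in details (e.g., which layers are pruned or whether the threshold is global).

```python
import torch

def make_topk_masks(model, sparsity=0.9):
    """Build per-layer binary masks keeping the largest-magnitude weights.

    Hypothetical helper: keeps the top (1 - sparsity) fraction of entries in
    every weight tensor with >= 2 dimensions; biases and norm parameters
    stay dense.
    """
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:
            continue
        k = max(1, int(param.numel() * (1.0 - sparsity)))
        threshold = torch.topk(param.detach().abs().flatten(), k).values.min()
        masks[name] = (param.detach().abs() >= threshold).float()
    return masks

@torch.no_grad()
def compress(model, masks):
    """Compression phase: zero out all weights outside the mask support."""
    for name, param in model.named_parameters():
        if name in masks:
            param.mul_(masks[name])
```

During a sparse phase the masks also have to be re-applied after each optimizer step (or the corresponding gradients zeroed) so that pruned weights stay at zero; decompression simply stops applying the masks, so all weights receive gradient updates again.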
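The quoted schedule (10 dense warm-up epochs, 5-epoch alternation, a longer final dense phase, and 15 epochs of sparse fine-tuning, for 100 epochs in total) can be written down explicitly. The sketch below uses a helper name and defaults of our own choosing, taken from the quoted numbers; it returns per-epoch phase labels, and mask construction or momentum resets would be triggered wherever the label flips.

```python
def acdc_phase_schedule(num_epochs=100, warmup=10, phase_len=5,
                        last_dense=10, final_sparse=15):
    """Per-epoch phase labels ('dense' or 'sparse') for the AC/DC schedule.

    Defaults follow the quoted setup: 10 dense warm-up epochs, sparse/dense
    alternation every 5 epochs, a 10-epoch final dense (decompression) phase,
    and 15 epochs of sparse (compression) fine-tuning. Helper name is ours.
    """
    alternation_end = num_epochs - last_dense - final_sparse
    phases = ["dense"] * warmup
    while len(phases) < alternation_end:
        nxt = "sparse" if phases[-1] == "dense" else "dense"
        phases.extend([nxt] * phase_len)
    phases = phases[:alternation_end]
    phases += ["dense"] * last_dense + ["sparse"] * final_sparse
    return phases

schedule = acdc_phase_schedule()
assert len(schedule) == 100 and schedule[:10] == ["dense"] * 10
```

At every sparse-to-dense transition the quoted setup resets SGD momentum; in PyTorch this can be done by clearing the optimizer's state (e.g., reassigning `optimizer.state` to an empty `defaultdict(dict)`), though the exact mechanism used in the released code may differ.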