Stable and low-precision training for large-scale vision-language models

Authors: Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, Ludwig Schmidt

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) For acceleration, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge, the largest int8 training to date. Our main focus is int8, as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) For stability, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become underestimated by their AdamW second-moment estimator. As a result, we recommend an AdamW-Adafactor hybrid which avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping at the scales we test. (An illustrative int8-quantization sketch follows the table.)
Researcher Affiliation | Collaboration | 1) University of Washington; 2) Meta AI Research, FAIR Team; 3) Allen Institute for AI; 4) LAION. Equal contribution. Equal senior contribution.
Pseudocode | Yes | Algorithm 1: PyTorch code for SwitchBack; Algorithm 2: StableAdamW; Algorithm 3: memory-efficient SwitchBack (SwitchBackM); Algorithm 4: SwitchBack with row-wise and column-wise quantization for the weights (SwitchBackQ); Algorithm 5: a standard linear layer implemented with torch.autograd. (A hedged optimizer sketch follows the table.)
Open Source Code | Yes | We will provide open-source Triton [57] kernels for SwitchBack to enable future work on efficient quantization schemes. ... the code to run the benchmarks and produce Figure 6 is open sourced.
Open Datasets | Yes | To evaluate SwitchBack, we train CLIP [46] vision transformer [20] models on LAION-2B [53].
Dataset Splits | No | The paper mentions training on LAION-2B and evaluating zero-shot on ImageNet. It specifies training iterations, warmup, and the cosine-decay schedule, but it does not give train/validation/test splits (percentages or counts) for LAION-2B, nor how the data was partitioned for training versus evaluation beyond zero-shot testing on ImageNet.
Hardware Specification | Yes | For our int8 experiments we conduct the multiplications in int8 using A100 GPUs; we perform real int8 training without any simulation.
Software Dependencies | No | The paper mentions using PyTorch [43], Triton [57], and the OpenCLIP library [29], but does not provide specific version numbers for these software components.
Experiment Setup | Yes | We use batch size 16384 (per-GPU batch size of 256) and train for a total of 20k iterations. The first 5k iterations are linear warmup while the remaining 15k are cosine decay. Training and evaluation are conducted with the OpenCLIP library [29] with learning rate 2e-3, weight decay 0.2, and batch size 16384 using the optimizer described in Section 3.5. ... Concretely, we use patch-dropout 0.5 [35] and 20k iterations. (A schedule sketch follows the table.)
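
The abstract quoted above describes SwitchBack as a linear layer for int8 quantized training. As a rough illustration of the underlying idea, here is a minimal PyTorch sketch of an absmax quantize-matmul-dequantize forward pass; the row-wise activation scales, tensor-wise weight scale, and the simulated floating-point matmul are assumptions chosen for readability, not a reproduction of the paper's Triton kernels or backward pass.

    import torch

    def quantize_rowwise(x):
        # Per-row absmax quantization of the activations to int8.
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        return torch.round(x / scale).to(torch.int8), scale

    def quantize_tensorwise(w):
        # Single absmax scale for the whole weight matrix.
        scale = w.abs().amax().clamp(min=1e-8) / 127.0
        return torch.round(w / scale).to(torch.int8), scale

    def int8_linear_forward(x, w):
        # x: (tokens, in_features) activations, w: (out_features, in_features) weight.
        xq, sx = quantize_rowwise(x)
        wq, sw = quantize_tensorwise(w)
        # Simulated matmul: cast the int8 operands to float so the sketch runs
        # anywhere; real int8 kernels accumulate in int32 on tensor cores.
        acc = xq.to(torch.float32) @ wq.to(torch.float32).t()
        return (acc * sx * sw).to(x.dtype)

Per-row scales for the activations limit how much a single outlier value in one token can shrink the usable int8 range for every other token.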
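The pseudocode row lists Algorithm 2 (StableAdamW), the AdamW-Adafactor hybrid recommended in the abstract. The following is a minimal single-tensor sketch of that idea, AdamW plus Adafactor-style update clipping driven by how much the second-moment estimate underestimates the current squared gradients; the exact clipping statistic, threshold, and hyperparameter defaults are assumptions here rather than a verbatim copy of the paper's algorithm.

    import torch

    def stable_adamw_step(p, g, state, lr=2e-3, betas=(0.9, 0.999),
                          eps=1e-8, weight_decay=0.2):
        # One AdamW step with Adafactor-style update clipping (hedged sketch).
        state["t"] += 1
        t, m, v = state["t"], state["m"], state["v"]
        m.mul_(betas[0]).add_(g, alpha=1 - betas[0])
        v.mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
        m_hat = m / (1 - betas[0] ** t)
        v_hat = v / (1 - betas[1] ** t)
        # Per-tensor RMS of g^2 / v_hat: it grows when the second-moment
        # estimate underestimates the squared gradients, the condition the
        # abstract links to loss spikes 1-8 iterations later.
        rms = torch.sqrt(torch.mean(g.pow(2) / v_hat.clamp(min=eps ** 2)))
        lr_t = lr / max(1.0, float(rms))
        p.mul_(1 - lr_t * weight_decay)              # decoupled weight decay
        p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr_t)

    # Hypothetical usage on a single parameter tensor:
    w = torch.zeros(4, 4)
    opt_state = {"m": torch.zeros_like(w), "v": torch.zeros_like(w), "t": 0}
    stable_adamw_step(w, torch.randn_like(w), opt_state)

Scaling the learning rate down by a statistic of the update, rather than clipping the raw gradient norm, is what distinguishes this family of fixes from the gradient clipping the abstract compares against.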
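The experiment-setup row quotes a 20k-iteration run with 5k iterations of linear warmup to a peak learning rate of 2e-3 followed by cosine decay. A small self-contained sketch of that schedule (the decay to a zero floor is an assumption):

    import math

    def learning_rate(step, peak_lr=2e-3, warmup_steps=5_000, total_steps=20_000):
        # Linear warmup over the first 5k iterations, cosine decay over the
        # remaining 15k, matching the setup quoted above.
        if step < warmup_steps:
            return peak_lr * step / warmup_steps
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))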