Multiplication-Free Transformer Training via Piecewise Affine Operations

Authors: Atli Kosson, Martin Jaggi

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that transformers can be trained with piecewise affine matrix multiplications on both vision and language data with little to no performance impact. We compare this to AdderNet-based transformers [30], demonstrating better accuracy while replacing more multiplications. (See the piecewise affine multiplication sketch after this table.)
Researcher Affiliation | Academia | Atli Kosson, Martin Jaggi, EPFL, Switzerland, firstname.lastname@epfl.ch
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | Yes | We publicly release our code, including custom kernels, in the hopes of aiding further research into multiplication-free neural networks. Code available at https://github.com/epfml/piecewise-affine-multiplication
Open Datasets | Yes | The first one is German to English translation on the IWSLT14 DE-EN dataset [3]... We train on either CIFAR10 [19] or the ImageNet-1k [6] dataset.
Dataset Splits | Yes | CIFAR10 consists of 50K training and 10K test images of size 32×32 corresponding to 10 classes. (A loading sketch for this standard split follows the table.)
Hardware Specification | Yes | We use PyTorch [28] for our experiments and run them using either Nvidia A100 (40GB) or V100 (32GB) GPUs.
Software Dependencies | No | The paper mentions PyTorch [28], FairSeq [27], and the PyTorch Image Models project [36] but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Our baseline setup trains for 20 epochs, using a cosine decay schedule with 4000 warmup steps and a peak learning rate of 5 × 10⁻⁴ with a maximum batch size of 4096 tokens. We use AdamW [21, 17] for optimization with β1 = 0.9, β2 = 0.98 and weight decay of 10⁻⁴. During training we apply dropout with drop probability 0.3 and use cross entropy with label smoothing of 0.1. (A configuration sketch of these settings follows the table.)
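
The piecewise affine matrix multiplication referenced under Research Type replaces exact floating-point products with a cheaper piecewise affine approximation. As a rough illustration of the general idea, and not the authors' fused CUDA kernels, the sketch below uses the classic trick of adding the IEEE-754 bit patterns of the operands as integers, which is piecewise affine in the inputs; the name `pam_scalar` and the scalar-only interface are illustrative assumptions.

```python
import numpy as np

def pam_scalar(a: float, b: float) -> float:
    # Approximate a * b without a hardware multiply: add the float32 bit
    # patterns of |a| and |b| as integers, subtract the exponent-bias
    # offset 0x3F800000, and reinterpret the result as a float.
    # Assumes the approximate product stays in the normal float32 range.
    if a == 0.0 or b == 0.0:
        return 0.0
    ia = int(np.abs(np.float32(a)).view(np.int32))
    ib = int(np.abs(np.float32(b)).view(np.int32))
    result = float(np.int32(ia + ib - 0x3F800000).view(np.float32))
    # Restore the sign without multiplying.
    return result if (a > 0) == (b > 0) else -result

print(pam_scalar(1.5, 2.0), 1.5 * 2.0)  # 3.0 vs 3.0 (exact when one factor is a power of two)
print(pam_scalar(1.5, 1.5), 1.5 * 1.5)  # 2.0 vs 2.25 (worst-case mantissas)
```

The worst-case relative error of this classic construction is roughly 11%, which gives a sense of how coarse the approximation is relative to exact multiplication.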
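The CIFAR10 figures quoted under Dataset Splits correspond to the standard torchvision split. A minimal loading sketch is below; the transform and batch size are illustrative placeholders rather than the paper's vision pipeline (which builds on the PyTorch Image Models project).

```python
import torch
import torchvision
import torchvision.transforms as T

# Standard CIFAR10 split: 50,000 training and 10,000 test images, 32x32, 10 classes.
transform = T.ToTensor()  # placeholder; the paper's augmentations are not specified here

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)

print(len(train_set), len(test_set))  # 50000 10000
```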
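The translation hyperparameters quoted under Experiment Setup can be wired together as follows. The paper's IWSLT14 experiments run in FairSeq, so this is only a generic PyTorch sketch of the stated settings; the placeholder model, the total step count, and the exact warmup/decay formula are assumptions.

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # placeholder standing in for the transformer

# AdamW with the stated betas and weight decay; peak learning rate 5e-4.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4,
                              betas=(0.9, 0.98), weight_decay=1e-4)

warmup_steps = 4000
total_steps = 100_000  # assumption; depends on dataset size, 20 epochs, 4096-token batches

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak learning rate, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Cross entropy with label smoothing 0.1; dropout 0.3 is applied inside the blocks.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
dropout = torch.nn.Dropout(p=0.3)
```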