Multiplication-Free Transformer Training via Piecewise Affine Operations
Authors: Atli Kosson, Martin Jaggi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that transformers can be trained with piecewise affine matrix multiplications on both vision and language data with little to no performance impact. We compare this to AdderNet-based transformers [30], demonstrating better accuracy while replacing more multiplications. |
| Researcher Affiliation | Academia | Atli Kosson, Martin Jaggi; EPFL, Switzerland; firstname.lastname@epfl.ch |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | We publicly release our code2, including custom kernels, in the hopes of aiding further research into multiplication-free neural networks. 2Code available at https://github.com/epfml/piecewise-affine-multiplication |
| Open Datasets | Yes | The first one is German to English translation on the IWSLT14 DE-EN dataset [3]... We train on either CIFAR10 [19] or the ImageNet-1k [6] dataset. |
| Dataset Splits | Yes | CIFAR10 consists of 50K training and 10K test images of size 32×32 corresponding to 10 classes. |
| Hardware Specification | Yes | We use PyTorch [28] for our experiments and run them using either Nvidia A100 (40GB) or V100 (32GB) GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch [28]', 'fairseq [27]', and the 'PyTorch Image Models project [36]' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | Our baseline setup trains for 20 epochs, using a cosine decay schedule with 4000 warmup steps and a peak learning rate of 5×10⁻⁴ with a maximum batch size of 4096 tokens. We use AdamW [21, 17] for optimization with β1 = 0.9, β2 = 0.98 and weight decay of 10⁻⁴. During training we apply dropout with drop probability 0.3 and use cross entropy with label smoothing of 0.1. |
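
For reference, the quoted training recipe maps onto a standard PyTorch optimizer/scheduler configuration. The sketch below is a minimal illustration, assuming a stand-in model and an assumed total step count (the real model and step budget come from the paper's fairseq IWSLT14 setup); it is not the authors' training script.

```python
import math
import torch

# Minimal sketch of the quoted recipe: AdamW (beta1=0.9, beta2=0.98, weight decay 1e-4),
# peak LR 5e-4, 4000 warmup steps, cosine decay, cross entropy with label smoothing 0.1.
# `model` and `total_steps` are hypothetical placeholders, not values from the paper.
model = torch.nn.Linear(512, 512)          # stand-in for the transformer
total_steps, warmup_steps = 50_000, 4_000  # total_steps is assumed
peak_lr = 5e-4

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.98), weight_decay=1e-4
)

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```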
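
The piecewise affine matrix multiplications themselves are defined in the paper and its released code; this report does not reproduce that construction. As a loose illustration only, the sketch below shows one well-known piecewise-affine surrogate for multiplication, approximating the product of two floats by adding their integer bit representations (a Mitchell/Mogami-style log-domain trick); it may differ in detail from the paper's operation.

```python
import numpy as np

def pam_approx(a, b):
    """Piecewise-affine approximation of elementwise a * b (illustrative only).

    Adds the int32 bit patterns of the absolute float32 inputs and removes the
    duplicated exponent bias (0x3F800000, the bit pattern of 1.0f); the result
    decodes to roughly |a * b|. Zeros, infinities and denormals are not handled.
    """
    sign = np.sign(a) * np.sign(b)
    ia = np.abs(a).astype(np.float32).view(np.int32)
    ib = np.abs(b).astype(np.float32).view(np.int32)
    prod_bits = ia + ib - np.int32(0x3F800000)
    return sign * prod_bits.view(np.float32)

a = np.array([1.5, 3.0, -0.25], dtype=np.float32)
b = np.array([2.0, 0.7,  4.00], dtype=np.float32)
print(pam_approx(a, b))  # about [3.0, 1.9, -1.0]; exact products are [3.0, 2.1, -1.0]
```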