SOFT: Softmax-free Transformer with Linear Complexity
Authors: Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, Li Zhang
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on ImageNet show that our SOFT significantly improves the computational efficiency of existing ViT variants. Crucially, with a linear complexity, much longer token sequences are permitted in SOFT, resulting in a superior trade-off between accuracy and complexity. Dataset: We evaluate the proposed SOFT on the ILSVRC-2012 ImageNet-1K dataset [9] with 1.28M training images and 50K validation images from 1,000 classes. Following the common practice, we train a model on the training set and evaluate on the validation set. Metrics: For model performance, the top-1 accuracy on a single crop is reported. To assess the cost-effectiveness, we also report the model size and floating point operations (i.e., FLOPs). (A minimal top-1 evaluation sketch appears after this table.) |
| Researcher Affiliation | Collaboration | Fudan University; University of Surrey; Huawei Noah's Ark Lab |
| Pseudocode | Yes | Algorithm 1: SOFT: Softmax-free attention; Algorithm 2: NR: Newton-Raphson iteration. (A code sketch of both algorithms appears after this table.) |
| Open Source Code | No | The paper mentions a project website, 'https://fudan-zvg.github.io/SOFT', which typically hosts code, but this URL is not a direct link to a code repository. The paper also states 'We use the code base [37] with the default setting to train and test all the models' and 'We also implement our method using the Mindspore [22]'; these refer to third-party libraries/frameworks, not to the authors' own source code for the described methodology. |
| Open Datasets | Yes | Dataset: We evaluate the proposed SOFT on the ILSVRC-2012 ImageNet-1K dataset [9] with 1.28M training images and 50K validation images from 1,000 classes. [9] is cited as 'Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.' |
| Dataset Splits | Yes | Dataset: We evaluate the proposed SOFT on the ILSVRC-2012 ImageNet-1K dataset [9] with 1.28M training images and 50K validation images from 1,000 classes. Following the common practice, we train a model on the training set and evaluate on the validation set. |
| Hardware Specification | Yes | All our variants are trained with a batch size of 1024 on 32GB NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions using 'the code base [37]' (PyTorch image models) and implementing the method using 'Mindspore [22]'. However, it does not specify version numbers for these software components or any other libraries, which is required for reproducible software dependencies. |
| Experiment Setup | Yes | Specifically, we use weight decay of 0.05 and 10 epochs of linear warm-up. We conduct 300 epochs of training with an AdamW optimizer, decreasing the learning rate with a cosine annealing schedule. During training, random flipping, mixup [43] and cutmix [42] are adopted for data augmentation. Label smoothing [28] is used for loss calculation. All our variants are trained with a batch size of 1024 on 32GB NVIDIA V100 GPUs. (A training-configuration sketch follows the table.) |
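
For readers mapping Algorithms 1 and 2 onto code, below is a minimal sketch of softmax-free attention with a Newton-Raphson pseudo-inverse. This is our illustration, not the authors' released implementation: the uniform token sampling stands in for the paper's convolution/pooling bottleneck, and the helper names (`gaussian_kernel`, `newton_raphson_pinv`, `soft_attention`), iteration count, and initialization are assumptions.

```python
import torch

def gaussian_kernel(a, b, scale):
    # Pairwise Gaussian kernel exp(-||a_i - b_j||^2 / (2 * scale)),
    # replacing the dot-product-plus-softmax of standard attention.
    return torch.exp(-torch.cdist(a, b, p=2).pow(2) / (2.0 * scale))

def newton_raphson_pinv(A, iters=20):
    # Algorithm 2 (NR) sketch: V_{k+1} = V_k (2I - A V_k) converges to the
    # Moore-Penrose pseudo-inverse under a scaled-transpose initialization.
    I = torch.eye(A.shape[-1], device=A.device, dtype=A.dtype)
    V = A.T / (A.abs().sum(0).max() * A.abs().sum(1).max())
    for _ in range(iters):
        V = V @ (2.0 * I - A @ V)
    return V

def soft_attention(Q, V, m=49):
    # Algorithm 1 (sketch): keys are tied to the queries, and the full
    # n x n kernel matrix is never materialized, giving linear complexity
    # in the sequence length n via a Nystrom-style low-rank decomposition.
    n, d = Q.shape
    scale = d ** 0.5
    idx = torch.linspace(0, n - 1, m).long()   # uniform stand-in for the
    Qs = Q[idx]                                # paper's bottleneck tokens
    P = gaussian_kernel(Q, Qs, scale)          # (n, m)
    A = gaussian_kernel(Qs, Qs, scale)         # (m, m)
    # S ~= P @ pinv(A) @ P.T, applied to the values right-to-left.
    return P @ (newton_raphson_pinv(A) @ (P.T @ V))

# Example: 196 tokens (a 14x14 grid) of dimension 64.
out = soft_attention(torch.randn(196, 64), torch.randn(196, 64))
```

Evaluating the matrix products right-to-left is what keeps the cost O(n·m·d) rather than O(n²): only the n×m and m×m kernel blocks are ever formed.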
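The recipe in the Experiment Setup row corresponds to a standard timm-style configuration. A minimal sketch, assuming timm's public APIs; the base learning rate and the Mixup/CutMix alpha values are common defaults for this recipe, not values stated in the paper, and the model is a placeholder:

```python
import torch
from timm.data import Mixup
from timm.loss import SoftTargetCrossEntropy
from timm.scheduler import CosineLRScheduler

model = torch.nn.Linear(3 * 224 * 224, 1000)  # placeholder for a SOFT variant

# AdamW with weight decay 0.05, as reported; the base LR is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# 300 epochs of cosine annealing with 10 epochs of linear warm-up.
scheduler = CosineLRScheduler(optimizer, t_initial=300, warmup_t=10,
                              warmup_lr_init=1e-6)

# Mixup [43] and CutMix [42] with label smoothing [28] folded into the
# soft targets; the alpha values here are illustrative defaults.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)
criterion = SoftTargetCrossEntropy()
```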
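Finally, the reported metric is single-crop top-1 accuracy on the ImageNet-1K validation set. A minimal evaluation loop, with all names ours:

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cuda"):
    # Single-crop top-1 accuracy over a validation DataLoader.
    model.eval()
    correct, total = 0, 0
    for images, targets in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == targets.to(device)).sum().item()
        total += targets.numel()
    return correct / total
```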