SOFT: Softmax-free Transformer with Linear Complexity

Authors: Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, Li Zhang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on ImageNet show that our SOFT significantly improves the computational efficiency of existing ViT variants. Crucially, with a linear complexity, much longer token sequences are permitted in SOFT, resulting in a superior trade-off between accuracy and complexity. Dataset: We evaluate the proposed SOFT on the ILSVRC-2012 ImageNet-1K dataset [9] with 1.28M training images and 50K validation images from 1,000 classes. Following the common practice, we train a model on the training set and evaluate on the validation set. Metrics: For model performance, the top-1 accuracy on a single crop is reported. To assess the cost-effectiveness, we also report the model size and floating point operations (i.e., FLOPs).
Researcher Affiliation | Collaboration | 1 Fudan University, 2 University of Surrey, 3 Huawei Noah's Ark Lab
Pseudocode | Yes | Algorithm 1: SOFT: Softmax-free attention; Algorithm 2: NR: Newton-Raphson iteration (a hedged sketch of both algorithms is given after this table).
Open Source Code | No | The paper mentions a project website, 'https://fudan-zvg.github.io/SOFT', which would typically host code, but the URL is not a direct link to a code repository. The paper also states 'We use the code base [37] with the default setting to train and test all the models' and 'We also implement our method using the Mindspore [22]'; these refer to third-party libraries and frameworks, not the authors' own source code for the described methodology.
Open Datasets | Yes | Dataset: We evaluate the proposed SOFT on the ILSVRC-2012 ImageNet-1K dataset [9] with 1.28M training images and 50K validation images from 1,000 classes. [9] is cited as 'Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.'
Dataset Splits | Yes | Dataset: We evaluate the proposed SOFT on the ILSVRC-2012 ImageNet-1K dataset [9] with 1.28M training images and 50K validation images from 1,000 classes. Following the common practice, we train a model on the training set and evaluate on the validation set.
Hardware Specification | Yes | All our variants are trained with a batch size of 1024 on 32G NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions using 'the code base [37]' (PyTorch image models) and implementing the method using 'Mindspore [22]'. However, it does not specify version numbers for these software components or any other libraries, which is required for reproducible software dependencies.
Experiment Setup | Yes | Specifically, we use weight decay of 0.05 and 10 epochs of linear warm-up. We conduct 300 epochs training with an AdamW optimizer and decreasing learning rate with the cosine annealing schedule. During training, random flipping, mixup [43] and cutmix [42] are adopted for data augmentation. Label smoothing [28] is used for loss calculation. All our variants are trained with a batch size of 1024 on 32G NVIDIA V100 GPUs.
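To make the quoted training recipe concrete, the following is a minimal PyTorch sketch of the optimisation schedule only. The base learning rate, the label-smoothing value of 0.1, and the placeholder model are assumptions not stated in the excerpt above; the paper itself trains with the timm code base [37], whose exact configuration flags will differ.

```python
# Hedged sketch of the reported schedule: AdamW, weight decay 0.05, 10-epoch
# linear warm-up, 300 epochs total with cosine annealing. Values marked as
# assumptions are illustrative and not taken from the paper.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

EPOCHS, WARMUP_EPOCHS = 300, 10
BASE_LR, WEIGHT_DECAY = 1e-3, 0.05          # BASE_LR is an assumption

model = torch.nn.Linear(768, 1000)          # placeholder standing in for a SOFT variant
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)

def lr_lambda(epoch: int) -> float:
    """Linear warm-up for 10 epochs, then cosine annealing over the remaining epochs."""
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)

# Label smoothing [28] for the classification loss; the value 0.1 is an assumption.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

for epoch in range(EPOCHS):
    # One pass over ImageNet-1K with a global batch size of 1024 would go here;
    # random flipping, mixup [43] and cutmix [42] belong in the data pipeline.
    scheduler.step()
```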
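Regarding the pseudocode row above: Algorithm 1 is a Gaussian-kernel (softmax-free) attention whose low-rank core matrix is inverted approximately, and Algorithm 2 supplies that approximation via Newton-Raphson style iteration. The sketch below is a minimal reading of that idea under several assumptions: queries and keys share one projection, bottleneck tokens come from simple average pooling, and the kernel scaling, initialisation, and iteration count are illustrative rather than the paper's exact Algorithms 1 and 2.

```python
# Hedged sketch of softmax-free attention with a Newton-Schulz/Newton-Raphson
# style pseudoinverse; shapes and hyperparameters are illustrative assumptions.
import torch


def gaussian_kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """exp(-||a_i - b_j||^2 / (2 * sqrt(d))) for all pairs (i, j)."""
    d = a.shape[-1]
    dist = torch.cdist(a, b, p=2).pow(2)           # pairwise squared distances
    return torch.exp(-dist / (2.0 * d ** 0.5))


def newton_pinv(mat: torch.Tensor, iters: int = 20) -> torch.Tensor:
    """Iterative approximation of the Moore-Penrose inverse of a square matrix."""
    # Scaled-transpose initialisation keeps the iteration convergent.
    v = mat.t() / (mat.abs().sum(-1).max() * mat.abs().sum(-2).max())
    eye2 = 2.0 * torch.eye(mat.shape[-1], device=mat.device)
    for _ in range(iters):
        v = v @ (eye2 - mat @ v)                   # V <- V (2I - A V)
    return v


def soft_attention(q: torch.Tensor, v: torch.Tensor, m: int = 49) -> torch.Tensor:
    """Softmax-free attention; queries and keys share the projection (q == k)."""
    n, d = q.shape
    # Bottleneck tokens: average-pooled groups of queries (an assumption here).
    q_tilde = q.view(m, n // m, d).mean(dim=1)     # (m, d)
    kernel_nm = gaussian_kernel(q, q_tilde)        # (n, m) low-rank factor
    kernel_mm = gaussian_kernel(q_tilde, q_tilde)  # (m, m) core matrix
    # S ~ kernel_nm @ pinv(kernel_mm) @ kernel_nm.T, applied directly to values,
    # so the cost stays linear in the token count n.
    return kernel_nm @ (newton_pinv(kernel_mm) @ (kernel_nm.t() @ v))


if __name__ == "__main__":
    tokens = torch.randn(196, 64)                  # e.g. 14x14 tokens, dim 64
    out = soft_attention(tokens, torch.randn(196, 64))
    print(out.shape)                               # torch.Size([196, 64])
```

The linear complexity claim quoted in the Research Type row corresponds to the (n, m) factors above: only kernels against the m bottleneck tokens are formed, never the full (n, n) attention matrix.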