Bridging the Divide: Reconsidering Softmax and Linear Attention

Authors: Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xuran Pan, Xiu Li, Jiwen Lu, Shiji Song, Gao Huang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we conduct empirical verification to fully validate the importance of these two properties and the effectiveness of our methods.
Researcher Affiliation | Academia | Tsinghua University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/LeapLabTHU/InLine.
Open Datasets | Yes | ImageNet-1K [5] recognition dataset contains 1.28M training images and 50K validation images with a total of 1,000 classes. ... COCO [18] object detection and instance segmentation dataset ... ADE20K [44] is a well-established benchmark for semantic segmentation
Dataset Splits | Yes | ImageNet-1K [5] recognition dataset contains 1.28M training images and 50K validation images with a total of 1,000 classes. ... COCO [18] object detection and instance segmentation dataset has 118K training and 5K validation images. ... ADE20K [44] is a well-established benchmark for semantic segmentation which encompasses 20K training images, 2K validation images and 150 semantic categories.
Hardware Specification | Yes | Runtime and FPS are tested on an RTX 3090 GPU.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers such as Python 3.8 or PyTorch 1.9).
Experiment Setup | Yes | We use AdamW [21] optimizer to train all our models from scratch for 300 epochs, employing cosine learning rate decay with 20 epochs of linear warm-up. The initial learning rate is 1×10⁻³, and the weight decay is 0.05. Augmentation and regularization strategies consist of RandAugment [4], Mixup [42], CutMix [41], and random erasing [43].
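The training recipe quoted in the last row (300 epochs, cosine learning-rate decay with 20 epochs of linear warm-up from a base rate of 1×10⁻³) can be sketched as a per-epoch schedule function. This is a minimal illustration of that schedule only; the function name and the `min_lr` floor are assumptions, not details from the paper.

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=20, total_epochs=300, min_lr=0.0):
    """Cosine learning-rate decay with linear warm-up (illustrative sketch).

    During warm-up, the rate rises linearly to base_lr; afterwards it follows
    a half-cosine from base_lr down to min_lr at the final epoch.
    """
    if epoch < warmup_epochs:
        # Linear warm-up: reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up schedule completed, in [0, 1).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In practice this kind of schedule is usually expressed through a framework scheduler (e.g. a cosine-annealing scheduler wrapped with warm-up) rather than hand-rolled, but the closed form above makes the stated hyperparameters concrete.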