Bridging the Divide: Reconsidering Softmax and Linear Attention
Authors: Dongchen Han, Yifan Pu, Zhuofan Xia, Yizeng Han, Xuran Pan, Xiu Li, Jiwen Lu, Shiji Song, Gao Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct empirical verification to fully validate the importance of these two properties and the effectiveness of our methods. |
| Researcher Affiliation | Academia | Tsinghua University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/LeapLabTHU/InLine. |
| Open Datasets | Yes | ImageNet-1K [5] recognition dataset contains 1.28M training images and 50K validation images with a total of 1,000 classes. ... COCO [18] object detection and instance segmentation dataset ... ADE20K [44] is a well-established benchmark for semantic segmentation |
| Dataset Splits | Yes | ImageNet-1K [5] recognition dataset contains 1.28M training images and 50K validation images with a total of 1,000 classes. ... COCO [18] object detection and instance segmentation dataset has 118K training and 5K validation images. ... ADE20K [44] is a well-established benchmark for semantic segmentation which encompasses 20K training images, 2K validation images and 150 semantic categories. |
| Hardware Specification | Yes | Runtime and FPS are tested on an RTX 3090 GPU. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | We use AdamW [21] optimizer to train all our models from scratch for 300 epochs, employing cosine learning rate decay with 20 epochs of linear warm-up. The initial learning rate is 1 × 10⁻³, and the weight decay is 0.05. Augmentation and regularization strategies consist of RandAugment [4], Mixup [42], CutMix [41], and random erasing [43]. |
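
The learning-rate schedule quoted in the Experiment Setup row (300 epochs, 20 epochs of linear warm-up, cosine decay, initial rate 1 × 10⁻³) can be illustrated with a minimal sketch. This is not the authors' training code: the per-epoch granularity, the warm-up starting point, the minimum rate of zero, and the names `lr_at_epoch` and `MIN_LR` are assumptions made for illustration.

```python
import math

# Values quoted in the Experiment Setup row.
BASE_LR = 1e-3        # initial (peak) learning rate
TOTAL_EPOCHS = 300    # total training epochs
WARMUP_EPOCHS = 20    # epochs of linear warm-up
WEIGHT_DECAY = 0.05   # passed to the optimizer; not used by the schedule itself

# Assumption: the cosine curve decays to zero. The source does not state a floor.
MIN_LR = 0.0


def lr_at_epoch(epoch: int) -> float:
    """Return the learning rate for a 0-indexed epoch.

    Assumes one rate per epoch; real training loops often update per step.
    """
    if epoch < WARMUP_EPOCHS:
        # Linear warm-up: ramps so that the last warm-up epoch reaches BASE_LR.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Cosine decay over the remaining epochs.
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 10, 19, 20, 160, 299):
        print(f"epoch {e:3d}: lr = {lr_at_epoch(e):.6f}")
```

In a framework such as PyTorch, a function of this shape would typically be wrapped in a scheduler and paired with an AdamW optimizer configured with the quoted weight decay; that wiring is omitted here because the summary does not name the software stack.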