EcoFormer: Energy-Saving Attention with Linear Complexity
Authors: Jing Liu, Zizheng Pan, Haoyu He, Jianfei Cai, Bohan Zhuang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on both vision and language tasks show that EcoFormer consistently achieves comparable performance with standard attentions while consuming much fewer resources. |
| Researcher Affiliation | Academia | Department of Data Science & AI, Monash University, Australia |
| Pseudocode | No | The paper describes the proposed method in prose and through diagrams, but it does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/ziplab/EcoFormer. |
| Open Datasets | Yes | To investigate the effectiveness of the proposed method, we conduct experiments on ImageNet-1K [30], a large-scale image classification dataset that contains 1.2M training images from 1K categories and 50K validation images. |
| Dataset Splits | Yes | ImageNet-1K [30], a large-scale image classification dataset that contains 1.2M training images from 1K categories and 50K validation images. |
| Hardware Specification | Yes | All models in this experiment are trained on 8 V100 GPUs with a total batch size of 256. ... Moreover, we report the on-chip energy consumption according to Table 1 and the throughput with a mini-batch size of 32 on a single NVIDIA RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and states implementations are based on released code from other papers, but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | All training images are resized to 256 × 256, and 224 × 224 patches are randomly cropped from an image or its horizontal flip, with the per-pixel mean subtracted. ... Next, we finetune each model on ImageNet-1K with 100 epochs. ... All models in this experiment are trained on 8 V100 GPUs with a total batch size of 256. We set the initial learning rate to 2.5 × 10⁻⁵ for PVTv2 and 1.25 × 10⁻⁴ for Twins. We use AdamW optimizer [40] with a cosine decay learning rate scheduler. (A hedged configuration sketch follows below.) |
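
The training recipe quoted in the Experiment Setup row can be expressed as a short PyTorch sketch. The snippet below is an illustrative reconstruction rather than the authors' released training script: the `torchvision` transforms, the stand-in linear `model`, and the standard ImageNet normalization statistics are assumptions, while the epoch count, learning rates, AdamW optimizer, and cosine decay schedule come from the quoted setup.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

# Augmentation quoted in the Experiment Setup row: resize to 256,
# random 224x224 crop, horizontal flip. The paper's per-pixel mean
# subtraction is approximated here with standard ImageNet channel
# statistics (an assumption, not taken from the paper).
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Stand-in model; the paper fine-tunes PVTv2 and Twins backbones.
model = torch.nn.Linear(3 * 224 * 224, 1000)

epochs = 100                                  # fine-tuning schedule from the paper
lr = 2.5e-5                                   # PVTv2 setting; 1.25e-4 for Twins
optimizer = AdamW(model.parameters(), lr=lr)  # AdamW, as stated in the paper
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)  # cosine decay schedule
```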
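
As context for the "linear complexity" claim in the title, and since the Pseudocode row notes that the paper gives no algorithm blocks, the sketch below shows a generic kernel-based linear attention: queries and keys are passed through a feature map so the output can be computed in O(N) without forming the N × N attention matrix. This is a common formulation used only for illustration; it is not EcoFormer's specific energy-saving scheme, and the ELU+1 feature map and the `kernel_linear_attention` helper are assumptions.

```python
import torch
import torch.nn.functional as F

def kernel_linear_attention(q, k, v, eps=1e-6):
    """Generic kernel-based linear attention, O(N) in sequence length.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    Uses an ELU+1 feature map as an illustrative stand-in; this is
    not EcoFormer's binarization scheme.
    """
    q = F.elu(q) + 1.0  # non-negative feature map phi(q)
    k = F.elu(k) + 1.0  # non-negative feature map phi(k)

    # Aggregate keys and values once: (batch, heads, head_dim, head_dim)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)
    # Per-query normalizer from the summed key features
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    # Output without materializing the N x N attention matrix
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

# Quick shape check
q = k = v = torch.randn(2, 4, 196, 64)
print(kernel_linear_attention(q, k, v).shape)  # torch.Size([2, 4, 196, 64])
```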