Flowformer: Linearizing Transformers with Conservation Flows
Authors: Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To verify the effectiveness and generality of Flowformer, we extensively experiment on five well-established benchmarks, covering long sequence modeling, language processing, computer vision, time series, and reinforcement learning. |
| Researcher Affiliation | Academia | School of Software, BNRist, Tsinghua University. Haixu Wu <whx20@mails.tsinghua.edu.cn>. Correspondence to: Mingsheng Long <mingsheng@tsinghua.edu.cn>. |
| Pseudocode | Yes | We present the pseudo-code of normal Flow-Attention in Algorithm 1 and the causal version in Algorithm 2. (A hedged sketch of the non-causal variant follows this table.) |
| Open Source Code | Yes | The code and settings are available at this repository: https://github.com/thuml/Flowformer. |
| Open Datasets | Yes | Long-Range Arena (LRA; Tay et al., 2020c), WikiText-103 (Merity et al., 2017), ImageNet-1K (Deng et al., 2009), UEA Time Series Classification Archive (Bagnall et al., 2018), D4RL benchmark (Fu et al., 2020) |
| Dataset Splits | No | The paper references various datasets and benchmarks, but does not explicitly provide the train/validation/test dataset splits (e.g., percentages or sample counts) within its text. |
| Hardware Specification | Yes | All the experiments are conducted on 2 NVIDIA 2080 Ti GPUs. (LRA); All the models are trained from scratch without pre-training on 4 NVIDIA TITAN RTX 24GB GPUs for 150K updates after a 6K-steps warm-up. (WikiText-103); All the experiments are conducted on 8 NVIDIA TITAN RTX 24GB GPUs for 300 epochs. (ImageNet-1K); All the experiments are conducted on one single NVIDIA TITAN RTX 24GB GPU for 100 epochs. (UEA); We repeat each experiment three times with different seeds on one single NVIDIA 2080 Ti GPU for 10 epochs. (D4RL) |
| Software Dependencies | No | The paper mentions frameworks like JAX and Fairseq, but does not list specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow, or detailed library versions) used in the experiments. |
| Experiment Setup | Yes | The model architecture consists of 6 decoder layers with 8 heads and 512 hidden channels for attention mechanism (Ott et al., 2019). (Language Modeling); All the models are trained from scratch without pre-training on 4 NVIDIA TITAN RTX 24GB GPUs for 150K updates after a 6K-steps warm-up. (Language Modeling); We present Flowformer with 19 layers in a four-stage hierarchical structure, where the channels are in {96, 192, 384, 768} and the input sequence length for each stage is in {3136, 784, 196, 49} correspondingly. (Image Recognition); We use 2 layers for Transformer-based models with 512 hidden channels and 8 heads for the attention mechanism. (Time Series); We adopt 3 layers with 256 hidden channels and 4 heads in all experiments for Flowformer and other Transformers. (RL) (A consolidated configuration sketch of these settings also follows the table.) |
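
The Pseudocode row above refers to the paper's Algorithm 1, the non-causal Flow-Attention built from competition, aggregation, and allocation steps under flow conservation. Below is a minimal single-head NumPy sketch of that computation, assuming a sigmoid feature map and `(n, d)` / `(m, d)` query / key matrices; it omits the length-scaling factors and other numerical details of the official implementation, so the linked repository (https://github.com/thuml/Flowformer) remains the authoritative reference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def flow_attention(Q, K, V, eps=1e-6):
    """Sketch of non-causal Flow-Attention for a single head.

    Q: (n, d) queries (sinks), K: (m, d) keys (sources), V: (m, d_v) values.
    Linear complexity comes from contracting K^T V (a d x d_v matrix)
    instead of materializing the n x m attention map.
    """
    # Non-negative feature map (the paper uses a sigmoid kernel).
    Qp, Kp = sigmoid(Q), sigmoid(K)

    # Default incoming flow of each sink and outgoing flow of each source.
    I = Qp @ Kp.sum(axis=0) + eps                 # (n,)
    O = Kp @ Qp.sum(axis=0) + eps                 # (m,)

    # Flows recomputed after conserving the opposite side to 1.
    I_hat = Qp @ (Kp / O[:, None]).sum(axis=0)    # (n,)
    O_hat = Kp @ (Qp / I[:, None]).sum(axis=0)    # (m,)

    # Competition among sources, aggregation, then allocation to sinks.
    V_comp = softmax(O_hat)[:, None] * V          # (m, d_v)
    A = (Qp / I_hat[:, None]) @ (Kp.T @ V_comp)   # (n, d_v)
    R = sigmoid(I_hat)[:, None] * A               # (n, d_v)
    return R
```

As a quick check, `flow_attention(np.random.randn(128, 64), np.random.randn(256, 64), np.random.randn(256, 64))` returns a `(128, 64)` output while only forming `d x d_v` intermediates, which is the source of the linear complexity the paper claims.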
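
For quick reference, the per-benchmark hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The numbers are taken verbatim from the quotes above; the dictionary and its key names are illustrative only and do not correspond to identifiers in the authors' code.

```python
# Illustrative summary of the quoted per-benchmark settings;
# key names are hypothetical, values are as reported in the paper.
FLOWFORMER_SETUPS = {
    "language_modeling": {"decoder_layers": 6, "heads": 8, "hidden_channels": 512,
                          "updates": 150_000, "warmup_steps": 6_000},
    "image_recognition": {"layers": 19, "stages": 4,
                          "stage_channels": [96, 192, 384, 768],
                          "stage_seq_lengths": [3136, 784, 196, 49]},
    "time_series": {"layers": 2, "heads": 8, "hidden_channels": 512},
    "reinforcement_learning": {"layers": 3, "heads": 4, "hidden_channels": 256},
}
```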