Flowformer: Linearizing Transformers with Conservation Flows

Authors: Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To testify the effectiveness and generality of Flowformer, we extensively experiment on five well-established benchmarks, covering long sequence modeling, language processing, computer vision, time series and reinforcement learning."
Researcher Affiliation | Academia | "School of Software, BNRist, Tsinghua University. Haixu Wu <whx20@mails.tsinghua.edu.cn>. Correspondence to: Mingsheng Long <mingsheng@tsinghua.edu.cn>."
Pseudocode | Yes | "We present the pseudo-code of normal Flow-Attention in Algorithm 1 and the causal version in Algorithm 2." (See the Flow-Attention sketch after the table.)
Open Source Code | Yes | "The code and settings are available at this repository: https://github.com/thuml/Flowformer."
Open Datasets | Yes | Long-Range Arena (LRA; Tay et al., 2020c), WikiText-103 (Merity et al., 2017), ImageNet-1K (Deng et al., 2009), UEA Time Series Classification Archive (Bagnall et al., 2018), D4RL benchmark (Fu et al., 2020).
Dataset Splits | No | The paper references various datasets and benchmarks, but does not explicitly provide the train/validation/test dataset splits (e.g., percentages or sample counts) within its text.
Hardware Specification | Yes | "All the experiments are conducted on 2 NVIDIA 2080 Ti GPUs." (LRA); "All the models are trained from scratch without pre-training on 4 NVIDIA TITAN RTX 24GB GPUs for 150K updates after a 6K-steps warm-up." (WikiText-103); "All the experiments are conducted on 8 NVIDIA TITAN RTX 24GB GPUs for 300 epochs." (ImageNet-1K); "All the experiments are conducted on one single NVIDIA TITAN RTX 24GB GPU for 100 epochs." (UEA); "We repeat each experiment three times with different seeds on one single NVIDIA 2080 Ti GPU for 10 epochs." (D4RL)
Software Dependencies | No | The paper mentions frameworks like JAX and Fairseq, but does not list specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow, or detailed library versions) used in the experiments.
Experiment Setup | Yes | "The model architecture consists of 6 decoder layers with 8 heads and 512 hidden channels for attention mechanism (Ott et al., 2019)." (Language Modeling); "We present Flowformer with 19 layers in a four-stage hierarchical structure, where the channels are in {96, 192, 384, 768} and the input sequence length for each stage is in {3136, 784, 196, 49} correspondingly." (Image Recognition); "All the models are trained from scratch without pre-training on 4 NVIDIA TITAN RTX 24GB GPUs for 150K updates after a 6K-steps warm-up." (Language Modeling); "We use 2 layers for Transformer-based models with 512 hidden channels and 8 heads for the attention mechanism." (Time Series); "We adopt 3 layers with 256 hidden channels and 4 heads in all experiments for Flowformer and other Transformers." (RL) (See the configuration sketch after the table.)
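The paper's Algorithm 1 and 2 are not reproduced in this summary. Below is a minimal, single-head PyTorch sketch of the non-causal Flow-Attention pattern described in the paper: non-negative feature maps, incoming/outgoing flow conservation, competition via softmax over source flows, linear-complexity aggregation, and allocation via sigmoid gating. The sigmoid feature map, the epsilon constant, and the omission of sequence-length rescaling are assumptions of this sketch; the official repository (https://github.com/thuml/Flowformer) remains the reference implementation.

```python
import torch

def flow_attention(q, k, v, eps=1e-6):
    """Non-causal Flow-Attention sketch. q, k: (B, N, D); v: (B, N, Dv)."""
    # Non-negative feature maps (the paper only requires non-negativity;
    # a sigmoid kernel is assumed here).
    q, k = torch.sigmoid(q), torch.sigmoid(k)

    # Incoming flow of each sink (query) and outgoing flow of each source (key).
    incoming = q @ k.sum(dim=1).unsqueeze(-1) + eps            # (B, N, 1)
    outgoing = k @ q.sum(dim=1).unsqueeze(-1) + eps            # (B, N, 1)

    # Conservation: recompute each flow with the opposite side normalized.
    incoming_c = q @ (k / outgoing).sum(dim=1).unsqueeze(-1)   # (B, N, 1)
    outgoing_c = k @ (q / incoming).sum(dim=1).unsqueeze(-1)   # (B, N, 1)

    # Competition among sources: softmax over conserved outgoing flow reweights V.
    v_hat = torch.softmax(outgoing_c, dim=1) * v               # (B, N, Dv)

    # Aggregation in linear complexity, then allocation via sigmoid gating.
    context = k.transpose(1, 2) @ v_hat                        # (B, D, Dv)
    out = (q / incoming) @ context                             # (B, N, Dv)
    return torch.sigmoid(incoming_c) * out                     # (B, N, Dv)
```

Because the sequence dimension only ever enters through sums and the (D x Dv) context matrix, the cost is linear in sequence length rather than quadratic, which is the point of the linearization.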
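For readers collecting the per-benchmark setups quoted in the Experiment Setup row, a hypothetical configuration mapping such as the one below gathers the reported layer, head, and channel counts in one place; the key names and structure are illustrative and are not taken from the official code.

```python
# Hypothetical per-benchmark Flowformer settings, transcribed from the quoted
# experiment-setup text; key names are illustrative, not from the official repo.
FLOWFORMER_CONFIGS = {
    "language_modeling": {"layers": 6, "heads": 8, "hidden_channels": 512},       # WikiText-103 decoder
    "image_recognition": {
        "layers": 19,
        "stage_channels": [96, 192, 384, 768],
        "stage_seq_lengths": [3136, 784, 196, 49],
    },
    "time_series": {"layers": 2, "heads": 8, "hidden_channels": 512},             # UEA classification
    "reinforcement_learning": {"layers": 3, "heads": 4, "hidden_channels": 256},  # D4RL
}
```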