Treeformer: Dense Gradient Trees for Efficient Attention Computation

Authors: Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Using extensive experiments on standard NLP benchmarks, especially for long-sequences, we demonstrate that our TREEFORMER architecture can be almost as accurate as baseline Transformer while using 30x lesser FLOPs in the attention layer."
Researcher Affiliation | Industry | "Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain & Prateek Jain; Google Research; {lovishm,bsrinadh,himj,prajain}@google.com"
Pseudocode | Yes | "Algorithm 1: TREEFORMER 2-D Bootstrapping Algorithm"
Open Source Code | No | No explicit statement about an open-source code release or repository link was found in the paper.
Open Datasets | Yes | "Following Devlin et al. (2019) we use a masked LM (MLM) objective to pretrain our TREEFORMER BERT-Base models on Wikipedia, Books (Zhu et al., 2015), CC-News (Guu et al., 2020), and Stories (Trinh & Le, 2018) datasets."
Dataset Splits | No | "We fine-tune the pre-trained BERT-Base models on 7 datasets from the GLUE benchmark (Wang et al., 2018) including MNLI and the SQuAD (Rajpurkar et al., 2016) dataset for question answering. ... We next perform experiments on the Long Range Arena Benchmark (LRA) (Tay et al., 2021)." (The paper refers to standard benchmarks but does not explicitly state the train/validation/test splits, relying on the reader's knowledge of these benchmarks.)
Hardware Specification | Yes | "Hardware (TPUv3 slice): 8 × 16" (Table 6); "Hardware (TPUv3 slice): 2 × 2" (Table 8); "In Table 4, we finally present a comparison of inference wall time between the TREEFORMER TF-ATTENTION model and the standard attention on an Intel Xeon Platinum P-8136 CPU using one thread and 100GB of RAM."
Software Dependencies | No | "We use Adam optimizer with decoupled weight decay (AdamW)." (No software dependencies with specific version numbers were provided.)
Experiment Setup | Yes | "We list the hyper-parameters used for pre-training in Table 6. We use Adam optimizer with decoupled weight decay (AdamW). We also use a learning rate warmup for the first 10k steps, with linear decay of the learning rate afterwards. We train the TREEFORMER models for a total of 1.8M steps."
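
The Open Datasets row states that pre-training follows Devlin et al. (2019) with a masked LM objective. For reference, the sketch below shows the standard BERT-style masking recipe from Devlin et al. (select ~15% of positions; of those, 80% become [MASK], 10% a random token, 10% are left unchanged). The function name, the -100 ignore-label convention, and the NumPy implementation are illustrative assumptions, not code or details taken from the Treeformer paper, which may differ in specifics.

import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mlm_prob=0.15, rng=None):
    """BERT-style masking (Devlin et al., 2019): pick ~15% of non-special positions
    as MLM targets; of those, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    rng = rng or np.random.default_rng()
    token_ids = np.array(token_ids)                     # work on a copy
    labels = np.full_like(token_ids, fill_value=-100)   # -100 = position ignored by the loss

    candidates = ~np.isin(token_ids, special_ids)       # never mask [CLS]/[SEP]/[PAD]
    targets = candidates & (rng.random(token_ids.shape) < mlm_prob)
    labels[targets] = token_ids[targets]                # predict the original token here

    roll = rng.random(token_ids.shape)
    token_ids[targets & (roll < 0.8)] = mask_id                        # 80% -> [MASK]
    random_pos = targets & (roll >= 0.8) & (roll < 0.9)                # 10% -> random token
    token_ids[random_pos] = rng.integers(0, vocab_size, random_pos.sum())
    # the remaining 10% of targets keep their original token
    return token_ids, labels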
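
The Experiment Setup row quotes AdamW with a 10k-step linear warmup, linear decay afterwards, and 1.8M total training steps. A minimal sketch of such a schedule is given below using optax; the choice of optax, the peak learning rate, and the weight-decay coefficient are assumptions made for illustration (the paper's actual values appear in its Table 6, which is not reproduced here), not the authors' published configuration.

import optax

TOTAL_STEPS = 1_800_000   # 1.8M pre-training steps (from the paper)
WARMUP_STEPS = 10_000     # linear warmup over the first 10k steps (from the paper)
PEAK_LR = 1e-4            # placeholder; the actual value is listed in Table 6 of the paper
WEIGHT_DECAY = 0.01       # placeholder weight-decay coefficient

# Linear warmup to PEAK_LR, then linear decay to 0 over the remaining steps.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, PEAK_LR, transition_steps=WARMUP_STEPS),
        optax.linear_schedule(PEAK_LR, 0.0, transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with decoupled weight decay (AdamW), driven by the schedule above.
optimizer = optax.adamw(learning_rate=schedule, weight_decay=WEIGHT_DECAY)

In a training loop this optimizer would be used in the usual optax way: optimizer.init(params) to create the state, then optimizer.update(grads, state, params) at each step.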