Treeformer: Dense Gradient Trees for Efficient Attention Computation
Authors: Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using extensive experiments on standard NLP benchmarks, especially for long sequences, we demonstrate that our TREEFORMER architecture can be almost as accurate as the baseline Transformer while using 30x fewer FLOPs in the attention layer. |
| Researcher Affiliation | Industry | Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain & Prateek Jain Google Research {lovishm,bsrinadh,himj,prajain}@google.com |
| Pseudocode | Yes | Algorithm 1: TREEFORMER 2-D Bootstrapping Algorithm |
| Open Source Code | No | No explicit statement about open-source code release or repository link was found in the paper. |
| Open Datasets | Yes | Following Devlin et al. (2019) we use a masked LM (MLM) objective to pretrain our TREEFORMER BERT-Base models on Wikipedia, Books (Zhu et al., 2015), CC-News (Guu et al., 2020), and Stories (Trinh & Le, 2018) datasets. |
| Dataset Splits | No | We fine-tune the pre-trained BERT-Base models on 7 datasets from the GLUE benchmark (Wang et al., 2018) including MNLI and the SQuAD (Rajpurkar et al., 2016) dataset for question answering. ... We next perform experiments on the Long Range Arena Benchmark (LRA) (Tay et al., 2021). (The paper refers to standard benchmarks but does not explicitly state the train/validation/test splits, relying on the reader's knowledge of these benchmarks.) |
| Hardware Specification | Yes | Hardware (TPUv3 slice): 8 × 16 (Table 6) and 2 × 2 (Table 8). In Table 4, we finally present a comparison of inference wall time between the TREEFORMER TF-ATTENTION model and the standard attention on an Intel Xeon Platinum P-8136 CPU using one thread and 100GB of RAM. |
| Software Dependencies | No | We use Adam optimizer with decoupled weight decay (AdamW). (No software dependencies with specific version numbers were provided.) |
| Experiment Setup | Yes | We list the hyper-parameters used for pre-training in Table 6. We use Adam optimizer with decoupled weight decay (AdamW). We also use a learning rate warmup for the first 10k steps, with linear decay of the learning rate afterwards. We train the TREEFORMER models for a total of 1.8M steps. (See the optimizer sketch after this table.) |
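
The optimizer described in the Experiment Setup row (AdamW, 10k-step linear warmup, linear decay afterwards, 1.8M total steps) can be sketched as below. This is a minimal sketch assuming the JAX/optax stack; the paper does not name its training framework, and the peak learning rate and weight decay shown here are placeholders rather than the actual values listed in the paper's Table 6.

```python
import optax

# Hedged sketch of the pre-training schedule quoted above: AdamW with a
# 10k-step linear warmup followed by linear decay over the remaining steps.
TOTAL_STEPS = 1_800_000   # total pre-training steps reported in the paper
WARMUP_STEPS = 10_000     # warmup steps reported in the paper
PEAK_LR = 1e-4            # placeholder; actual value is in the paper's Table 6
WEIGHT_DECAY = 0.01       # placeholder; actual value is in the paper's Table 6

schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, PEAK_LR, WARMUP_STEPS),                # warmup
        optax.linear_schedule(PEAK_LR, 0.0, TOTAL_STEPS - WARMUP_STEPS),  # linear decay
    ],
    boundaries=[WARMUP_STEPS],
)
optimizer = optax.adamw(learning_rate=schedule, weight_decay=WEIGHT_DECAY)

# Example: learning rate at the start, at the end of warmup, and at the final step.
print(schedule(0), schedule(WARMUP_STEPS), schedule(TOTAL_STEPS))
```

The `optimizer` object would then be used with `optimizer.init(params)` and `optimizer.update(grads, opt_state, params)` inside whatever training loop the model uses; none of that loop is specified by the paper, so it is omitted here.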