Treeformer: Dense Gradient Trees for Efficient Attention Computation

Authors: Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Using extensive experiments on standard NLP benchmarks, especially for long-sequences, we demonstrate that our TREEFORMER architecture can be almost as accurate as baseline Transformer while using 30x lesser FLOPs in the attention layer."
Researcher Affiliation | Industry | "Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain & Prateek Jain; Google Research; {lovishm,bsrinadh,himj,prajain}@google.com"
Pseudocode | Yes | "Algorithm 1: TREEFORMER 2-D Bootstrapping Algorithm"
Open Source Code | No | No explicit statement about an open-source code release or repository link was found in the paper.
Open Datasets | Yes | "Following Devlin et al. (2019) we use a masked LM (MLM) objective to pretrain our TREEFORMER BERT-Base models on Wikipedia, Books (Zhu et al., 2015), CC-News (Guu et al., 2020), and Stories (Trinh & Le, 2018) datasets."
Dataset Splits | No | "We fine-tune the pre-trained BERT-Base models on 7 datasets from the GLUE benchmark (Wang et al., 2018) including MNLI and the SQuAD (Rajpurkar et al., 2016) dataset for question answering. ... We next perform experiments on the Long Range Arena Benchmark (LRA) (Tay et al., 2021)." (The paper refers to standard benchmarks but does not explicitly state the train/validation/test splits, relying on the reader's knowledge of these benchmarks.)
Hardware Specification | Yes | "Hardware (TPUv3 slice): 8 × 16" (Table 6); "Hardware (TPUv3 slice): 2 × 2" (Table 8); "In Table 4, we finally present a comparison of inference wall time between the TREEFORMER TF-ATTENTION model and the standard attention on an Intel Xeon Platinum P-8136 CPU using one thread and 100GB of RAM."
Software Dependencies | No | "We use Adam optimizer with decoupled weight decay (AdamW)." (No software dependencies with specific version numbers were provided.)
Experiment Setup | Yes | "We list the hyper-parameters used for pre-training in Table 6. We use Adam optimizer with decoupled weight decay (AdamW). We also use a learning rate warmup for the first 10k steps, with linear decay of the learning rate afterwards. We train the TREEFORMER models for a total of 1.8M steps."
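
The Open Datasets row states that pre-training follows Devlin et al. (2019) with a masked LM objective. For reference, the sketch below shows the standard BERT-style masking recipe from Devlin et al. (select ~15% of positions; of those, 80% become [MASK], 10% a random token, 10% are left unchanged). The function name, the -100 ignore-label convention, and the NumPy implementation are illustrative assumptions, not code or details taken from the Treeformer paper, which may differ in specifics.

import numpy as np

def mask_tokens(token_ids, mask_id, vocab_size, special_ids, mlm_prob=0.15, rng=None):
    """BERT-style masking (Devlin et al., 2019): pick ~15% of non-special positions
    as MLM targets; of those, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    rng = rng or np.random.default_rng()
    token_ids = np.array(token_ids)                     # work on a copy
    labels = np.full_like(token_ids, fill_value=-100)   # -100 = position ignored by the loss

    candidates = ~np.isin(token_ids, special_ids)       # never mask [CLS]/[SEP]/[PAD]
    targets = candidates & (rng.random(token_ids.shape) < mlm_prob)
    labels[targets] = token_ids[targets]                # predict the original token here

    roll = rng.random(token_ids.shape)
    token_ids[targets & (roll < 0.8)] = mask_id                        # 80% -> [MASK]
    random_pos = targets & (roll >= 0.8) & (roll < 0.9)                # 10% -> random token
    token_ids[random_pos] = rng.integers(0, vocab_size, random_pos.sum())
    # the remaining 10% of targets keep their original token
    return token_ids, labels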
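
The Experiment Setup row quotes AdamW with a 10k-step linear warmup, linear decay afterwards, and 1.8M total training steps. A minimal sketch of such a schedule is given below using optax; the choice of optax, the peak learning rate, and the weight-decay coefficient are assumptions made for illustration (the paper's actual values appear in its Table 6, which is not reproduced here), not the authors' published configuration.

import optax

TOTAL_STEPS = 1_800_000   # 1.8M pre-training steps (from the paper)
WARMUP_STEPS = 10_000     # linear warmup over the first 10k steps (from the paper)
PEAK_LR = 1e-4            # placeholder; the actual value is listed in Table 6 of the paper
WEIGHT_DECAY = 0.01       # placeholder weight-decay coefficient

# Linear warmup to PEAK_LR, then linear decay to 0 over the remaining steps.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, PEAK_LR, transition_steps=WARMUP_STEPS),
        optax.linear_schedule(PEAK_LR, 0.0, transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with decoupled weight decay (AdamW), driven by the schedule above.
optimizer = optax.adamw(learning_rate=schedule, weight_decay=WEIGHT_DECAY)

In a training loop this optimizer would be used in the usual optax way: optimizer.init(params) to create the state, then optimizer.update(grads, state, params) at each step.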