Functional Interpolation for Relative Positions Improves Long Context Transformers

Authors: Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, Srinadh Bhojanapalli

ICLR 2024

Reproducibility assessment. Each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. We next empirically show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks. We conduct an extensive empirical study to demonstrate the effectiveness of FIRE for length generalization. We benchmark FIRE as well as other positional encoding approaches on a wide range of real-world language modeling (C4, arXiv, and Github), long text benchmark (SCROLLS), zero-shot long-context question answering (NarrativeQA), and natural language understanding benchmarks (GLUE/SuperGLUE).
Researcher Affiliation: Collaboration. 1 Carnegie Mellon University, 2 Google Research, 3 Google DeepMind.
Pseudocode: Yes. In this section, we present the implementation of our proposed FIRE module in PyTorch (Paszke et al., 2019).
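
The Pseudocode entry above refers to the PyTorch listing in Appendix E. As an illustration only (not the authors' released code), the following minimal sketch implements a FIRE-style bias following the paper's formulation b(i, j) = f_theta(psi(i - j) / psi(max(i, L))) with psi(x) = log(cx + 1); the class name, MLP width, and the initial values of c and L are assumptions.

import torch
import torch.nn as nn

class FIREBias(nn.Module):  # hypothetical name; not taken from the paper's appendix
    """Sketch of a FIRE-style relative position bias for causal attention."""

    def __init__(self, num_heads: int, hidden_dim: int = 32,
                 init_c: float = 0.1, init_L: float = 512.0):
        super().__init__()
        # f_theta: small MLP mapping a scalar normalized distance to per-head biases
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_heads),
        )
        self.c = nn.Parameter(torch.tensor(init_c))  # scale inside psi (assumed init)
        self.L = nn.Parameter(torch.tensor(init_L))  # learnable threshold (assumed init)

    def _psi(self, x: torch.Tensor) -> torch.Tensor:
        # psi(x) = log(c * x + 1); abs(c) keeps the log argument positive
        return torch.log(torch.abs(self.c) * x + 1.0)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len, dtype=torch.float32)
        rel = pos[:, None] - pos[None, :]                       # i - j
        denom = self._psi(torch.maximum(pos, torch.abs(self.L)))[:, None]
        normalized = self._psi(rel.clamp(min=0)) / denom        # progressive interpolation
        bias = self.mlp(normalized.unsqueeze(-1)).permute(2, 0, 1)  # (heads, q, k)
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
        return bias.masked_fill(~causal, float("-inf"))         # also acts as causal mask

The returned tensor would be added to the pre-softmax attention logits; whether the released implementation fuses the causal mask into the bias in this way is an assumption.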
Open Source Code: No. The paper includes a PyTorch implementation of the FIRE module in Appendix E, but it does not explicitly state that this code is open source, provide a public repository link, or confirm its availability in supplementary materials with an access method.
Open Datasets: Yes. We consider language models trained on the C4 dataset (Raffel et al., 2019) with 2048 input length, with different positional encoding methods. We pretrain the models on sequence length 2048, and evaluate their zero-shot perplexity on sequence lengths {512, 1024, 2048, 4096, 8192}. The evaluation metrics are validation log perplexity on C4, arXiv, and Github (Raffel et al., 2019; Gao et al., 2020).
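
The zero-shot protocol described above (pretrain at length 2048, evaluate log perplexity at lengths 512 through 8192) can be reproduced with a generic loop such as the sketch below. It assumes a Hugging-Face-style causal LM that returns the mean cross-entropy as out.loss when labels are supplied; that interface, and the helper make_eval_batches, are assumptions rather than details from the paper.

import torch

@torch.no_grad()
def zero_shot_log_perplexity(model, batches, device="cuda"):
    # Token-averaged negative log-likelihood (log perplexity, in nats).
    total_nll, total_targets = 0.0, 0
    for input_ids in batches:                     # each: (batch, eval_seq_len)
        input_ids = input_ids.to(device)
        out = model(input_ids=input_ids, labels=input_ids)
        n_targets = input_ids.size(0) * (input_ids.size(1) - 1)  # shifted targets
        total_nll += out.loss.item() * n_targets
        total_targets += n_targets
    return total_nll / total_targets

# Evaluate one pretrained checkpoint at several context lengths, e.g.:
# for length in (512, 1024, 2048, 4096, 8192):
#     ppl = zero_shot_log_perplexity(model, make_eval_batches(c4_val, length))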
Dataset Splits: No. The paper states, "We truncate documents with length greater than 2048 to multiple sequences of length 2048 during training; similar truncation is done to construct the validation sets of different sequence lengths." However, it does not provide specific details on how the overall datasets (C4, arXiv, Github) were split into training, validation, and test sets, such as percentages or sample counts.
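
The quoted truncation procedure amounts to splitting each tokenized document into consecutive fixed-length sequences. A minimal sketch under that reading is below; whether the paper pads, drops, or packs the trailing remainder shorter than seq_len is not stated, so dropping it here is an assumption.

from typing import Iterator, List

def split_into_sequences(token_ids: List[int], seq_len: int = 2048) -> Iterator[List[int]]:
    # Yield consecutive, non-overlapping windows of length seq_len.
    # Any remainder shorter than seq_len (and any document shorter than
    # seq_len) is dropped here; that choice is an assumption, not the paper's rule.
    for start in range(0, len(token_ids) - seq_len + 1, seq_len):
        yield token_ids[start:start + seq_len]

The same routine with seq_len set to 512, 1024, 4096, or 8192 would construct the validation sets of different sequence lengths mentioned in the quote.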
Hardware Specification: Yes. Hardware (TPUv4 chips): 128; Hardware (TPUv4 chips): 256; Hardware (TPUv2 chips): 32. We measure the forward time on 4 TPUv2 chips for all the models.
Software Dependencies: No. The paper mentions using PyTorch for implementation in Appendix E: "In this section, we present the implementation of our proposed FIRE module in PyTorch (Paszke et al., 2019)." However, it does not specify version numbers for PyTorch or any other software libraries or dependencies, information that is required for reproducibility.
Experiment Setup: Yes. Table 8: Model configurations for language model pretraining (Training sequence length, Number of layers, Attention heads, Hidden layer size, Head dimensions, FFN activation, Number of parameters). Table 9: Training recipe for language model pretraining (Training sequence length, Batch size, Number of iterations, Dropout prob., Attention dropout prob., Optimizer, Learning rate). Table 10: Finetuning configurations for SCROLLS benchmark (Batch size, Number of iterations, Dropout prob., Attention dropout prob., Optimizer, Learning rate). Table 12: Finetuning configurations for GLUE/SuperGLUE benchmark (Batch size, Number of iterations, Dropout prob., Attention dropout prob., Optimizer, Learning rate).