Big Bird: Transformers for Longer Sequences

Authors: Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "In this section our goal is to showcase benefits of modeling longer input sequence for NLP tasks, for which we select three representative tasks. We begin with basic masked language modeling (MLM; Devlin et al. 22) to check if better contextual representations can be learnt by utilizing longer contiguous sequences." |
| Researcher Affiliation | Industry | "Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed; Google Research; {manzilz, gurug, avinavadubey}@google.com" |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | "code available at http://goo.gle/bigbird-transformer" |
| Open Datasets | Yes | "Natural Questions [52]: For the given question, find a short span of answer (SA) from the given evidences as well highlight the paragraph from the given evidences containing information about the correct answer (LA). TriviaQA-wiki [41]: We need to provide an answer for the given question using provided Wikipedia evidence, however, the answer might not be present in the given evidence. On a smaller verified subset of question, the given evidence is guaranteed to contain the answer. Nevertheless, we model the answer as span selection problem in this case as well. We learn contextual representation of these token on the human reference genome (GRCh37) using MLM objective." |
| Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, or explicit citations to predefined validation splits) was found in the paper. A 'development set' is mentioned, but its size and construction are not detailed. |
| Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, or detailed machine specifications) were provided for the experiments; the paper only mentions '16GB memory/chip'. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library or framework names with versions such as PyTorch 1.9) were provided. |
| Experiment Setup | Yes | "Pretraining and MLM: We follow [22, 63] to create base and large versions of BIGBIRD and pretrain it using MLM objective. We note that we trained our models on a reasonable 16GB memory/chip with batch size of 32-64. For a fair comparison, we had to use some additional regularization for training BIGBIRD, details of which are provided in App. E.2 along with exact architecture description." |
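
The MLM objective quoted in the Research Type and Experiment Setup rows above is the standard BERT-style masking scheme. Below is a minimal Python sketch of that masking step, for illustration only; it is not the authors' code, and `MASK_ID` and `VOCAB_SIZE` are assumed BERT-style placeholder values.

```python
import random

MASK_ID = 103       # assumed [MASK] token id (BERT-style WordPiece vocab)
VOCAB_SIZE = 30522  # assumed vocabulary size; both values are illustrative

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """BERT-style MLM masking: ~15% of positions become prediction targets;
    of those, 80% are replaced by [MASK], 10% by a random token, and 10%
    are left unchanged. Positions labeled -100 are ignored by the loss."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok            # this position contributes to the MLM loss
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels

# Example: masking a short id sequence
masked, targets = mask_tokens([2023, 2003, 1037, 2742, 6251], seed=0)
```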
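
The Open Datasets row above frames both Natural Questions short answers and TriviaQA-wiki as span selection. A common decoding step for such a head is to pick the highest-scoring (start, end) pair from per-token logits; the sketch below is a generic illustration under that assumption, and `best_span` and `max_span_len` are hypothetical names, not taken from the paper.

```python
def best_span(start_logits, end_logits, max_span_len=30):
    """Return the (start, end) index pair maximizing
    start_logits[s] + end_logits[e] subject to s <= e < s + max_span_len."""
    best_pair, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_span_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_pair, best_score = (s, e), score
    return best_pair, best_score

# Example: toy logits over a 6-token context
span, score = best_span([0.1, 2.0, 0.3, 0.0, 0.2, 0.1],
                        [0.0, 0.1, 1.5, 0.4, 0.2, 0.0])
# span == (1, 2): tokens 1 through 2 form the predicted answer
```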