MatFormer: Nested Transformer for Elastic Inference

Authors: Fnu Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hanna Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate the efficacy of MatFormer across different model classes (decoders and encoders) and modalities (language and vision), demonstrating its potential for real-world deployment.
Researcher Affiliation | Collaboration | Google DeepMind; University of Texas at Austin; University of Washington; Harvard University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Methods are described in prose and mathematical formulas.
Open Source Code | No | Justification: Implementation details are provided in Appendix B. While code to reproduce Section 4.2 has been open-sourced, the language model has been trained on proprietary data. ... Code to reproduce experiments will be released for camera ready.
Open Datasets | Yes | B/16 models are trained on ImageNet-1K [50] with AugReg [57] while L/16 models are pretrained on ImageNet-21K [18] followed by finetuning on ImageNet-1K. ... We evaluate all the LM models on a set of 25 English tasks similar to [8, 22, 14, 3], including: Open-domain closed-book question answering tasks: TriviaQA [28], Natural Questions [35], and WebQuestions [4]. Cloze and completion tasks: LAMBADA [46], HellaSwag [67], and StoryCloze [43]. Winograd-style tasks: Winograd [38] and WinoGrande [51]. Reading comprehension: RACE [37]. Common sense reasoning: PIQA [6], ARC [15], and OpenBookQA [42]. SuperGLUE [62]. Natural language inference: Adversarial NLI [44].
Dataset Splits | Yes | We evaluate these models on validation loss and average accuracy on 25 English tasks [8, 22, 3]. ... Table 9: Downstream eval numbers and development set log perplexity loss on 78M model size granularities.
Hardware Specification | Yes | Training foundation models remains expensive, with the largest models we discuss trained on 256 TPU-v4 cores for 3 days. ... We pretrained the 850M models on 256 v3 TPU chips.
Software Dependencies | No | We train a 256k vocabulary using the SentencePiece library [31]... All models use the training setup and optimal hyperparameters of standard ViT variants from the Scenic library [16]. ... Scenic: A JAX library for computer vision research and beyond.
Experiment Setup | Yes | For each MatLM model with fixed d_model, we optimize for g = 4 nested granularities represented by FFN ratios of {0.5, 1, 2, 4}, i.e., only the hidden representation size of the FFN block changes. ... All models have 16 layers, 16 attention heads, and a d_model : d_ff ratio of 1 : 4. We train a 256k vocabulary using the SentencePiece library [31], use a maximum context length of 1024 tokens, and a batch size of 1M tokens.
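The Experiment Setup row fully specifies the nested FFN granularities, so the core mechanism is easy to sketch. Below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of a MatFormer-style FFN in which each granularity uses only a prefix of the shared hidden units; the names NestedFFN, ffn_ratios, and granularity are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    """Shared FFN whose first d_ff[g] hidden units form the g-th nested sub-model."""
    def __init__(self, d_model=1024, ffn_ratios=(0.5, 1, 2, 4)):
        super().__init__()
        # Nested hidden sizes; the largest matches the reported d_model : d_ff = 1 : 4.
        self.d_ff = [int(r * d_model) for r in ffn_ratios]
        d_max = max(self.d_ff)
        self.w_in = nn.Linear(d_model, d_max)   # one shared set of full-size weights
        self.w_out = nn.Linear(d_max, d_model)

    def forward(self, x, granularity=-1):
        m = self.d_ff[granularity]  # pick a nested sub-block by slicing a weight prefix
        h = F.gelu(F.linear(x, self.w_in.weight[:m], self.w_in.bias[:m]))
        return F.linear(h, self.w_out.weight[:, :m], self.w_out.bias)

# Elastic inference: the same weights serve both the smallest and largest granularity.
x = torch.randn(2, 16, 1024)
ffn = NestedFFN()
y_small, y_full = ffn(x, granularity=0), ffn(x, granularity=-1)
```

The slicing reflects the nesting described in the paper: smaller granularities are strict prefixes of the largest FFN, so no extra parameters are stored for the extracted sub-models.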
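The Software Dependencies row names the tokenizer library without pinning a version. As a hedged illustration only, a 256k-entry vocabulary like the one quoted there can be trained with SentencePiece's Python API roughly as follows; the corpus path, model prefix, and model_type are assumptions, not values from the paper.

```python
import sentencepiece as spm

# Train a 256k-entry vocabulary (placeholder corpus path and model prefix).
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # assumption: plain-text corpus, one sentence per line
    model_prefix="spm_256k",  # writes spm_256k.model and spm_256k.vocab
    vocab_size=256_000,       # the 256k vocabulary size quoted above
    model_type="unigram",     # assumption: SentencePiece's default; not stated in the paper
)

sp = spm.SentencePieceProcessor(model_file="spm_256k.model")
print(sp.encode("MatFormer enables elastic inference.", out_type=str))
```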