MatFormer: Nested Transformer for Elastic Inference
Authors: Fnu Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate the efficacy of MatFormer across different model classes (decoders and encoders) and modalities (language and vision), demonstrating its potential for real-world deployment. |
| Researcher Affiliation | Collaboration | Google DeepMind; University of Texas at Austin; University of Washington; Harvard University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Methods are described in prose and mathematical formulas. |
| Open Source Code | No | Justification: Implementation details are provided in Appendix B. While code to reproduce Section 4.2 has been open-sourced, the language model has been trained on proprietary data. ... Code to reproduce experiments will be released for camera ready. |
| Open Datasets | Yes | B/16 models are trained on ImageNet-1K [50] with AugReg [57] while L/16 models are pretrained on ImageNet-21K [18] followed by finetuning on ImageNet-1K. ... We evaluate all the trained LM models on a set of 25 English tasks similar to [8, 22, 14, 3], including: Open-Domain Closed-Book Question Answering tasks: TriviaQA [28], Natural Questions [35], and Web Questions [4]. Cloze and completion tasks: LAMBADA [46], HellaSwag [67], and StoryCloze [43]. Winograd-style tasks: Winograd [38] and WinoGrande [51]. Reading comprehension: RACE [37]. Common sense reasoning: PIQA [6], ARC [15], and OpenBookQA [42]. SuperGLUE [62]. Natural language inference: Adversarial NLI [44]. |
| Dataset Splits | Yes | We evaluate these models on validation loss and average accuracy on 25 English tasks [8, 22, 3]. ... Table 9: Downstream Eval numbers and development set log perplexity loss on 78M model size granularities. |
| Hardware Specification | Yes | Training foundation models remains expensive, with the largest models we discuss trained on 256 TPU-v4 cores for 3 days. ... We pretrained the 850M models on 256 v3 TPU chips. |
| Software Dependencies | No | We train a 256k vocabulary using the SentencePiece library [31]... All models use the training setup and optimal hyperparameters of standard ViT variants from the Scenic library [16]. ... Scenic: A JAX library for computer vision research and beyond. (A minimal SentencePiece vocabulary-training sketch follows the table.) |
| Experiment Setup | Yes | For each MatLM model with fixed dmodel, we optimize for g = 4 nested granularities represented by FFN ratios of {0.5, 1, 2, 4}, i.e., only the hidden representation size of the FFN block changes. ... All models have 16 layers, 16 attention heads, and a dmodel : dff ratio of 1 : 4. We train a 256k vocabulary using the SentencePiece library [31], use a maximum context length of 1024 tokens, and a batch size of 1M tokens. (A minimal sketch of the nested FFN granularities follows the table.) |
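
The Software Dependencies row quotes the paper's mention of a 256k vocabulary trained with the SentencePiece library. The snippet below is a minimal, hypothetical sketch of how such a vocabulary could be trained with SentencePiece's Python API; the input path, output prefix, and model type are illustrative assumptions, not details taken from the paper (whose LM training data is proprietary).

```python
# Hypothetical sketch: training a 256k-entry vocabulary with SentencePiece.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder path; the paper's LM data is proprietary
    model_prefix="matlm_256k",   # illustrative output name
    vocab_size=256_000,          # matches the 256k vocabulary quoted above
    model_type="unigram",        # assumption; the paper does not state the model type
)
```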
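
The Experiment Setup row describes g = 4 nested granularities that vary only the FFN hidden size (dff ∈ {0.5, 1, 2, 4} × dmodel). Below is a minimal, hypothetical JAX sketch of that prefix-slicing idea, not the authors' implementation; the layer sizes, parameter names, initialization, and GELU activation are illustrative assumptions.

```python
# Hypothetical sketch of a MatFormer-style nested FFN: all granularities share one
# weight pair, and a smaller sub-model uses only a prefix of the hidden dimension.
import jax
import jax.numpy as jnp

D_MODEL = 256                      # illustrative size, not a configuration from the paper
FFN_RATIOS = (0.5, 1.0, 2.0, 4.0)  # nested granularities quoted above
D_FF_MAX = int(max(FFN_RATIOS) * D_MODEL)

def init_ffn_params(key):
    k1, k2 = jax.random.split(key)
    # One shared parameter set sized for the largest granularity.
    w_in = jax.random.normal(k1, (D_MODEL, D_FF_MAX)) / jnp.sqrt(D_MODEL)
    w_out = jax.random.normal(k2, (D_FF_MAX, D_MODEL)) / jnp.sqrt(D_FF_MAX)
    return {"w_in": w_in, "w_out": w_out}

def nested_ffn(params, x, ffn_ratio):
    # Select the first d_ff hidden units; smaller models are prefixes of larger ones.
    d_ff = int(ffn_ratio * D_MODEL)
    w_in = params["w_in"][:, :d_ff]
    w_out = params["w_out"][:d_ff, :]
    return jax.nn.gelu(x @ w_in) @ w_out

params = init_ffn_params(jax.random.PRNGKey(0))
x = jnp.ones((2, D_MODEL))
for r in FFN_RATIOS:
    y = nested_ffn(params, x, r)   # every granularity reuses the same parameters
    print(r, y.shape)
```

Because each granularity reads a prefix of the same shared weights, the nested sub-models add no extra parameters; serving a smaller model amounts to choosing a smaller ffn_ratio at inference time.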