Better & Faster Large Language Models via Multi-token Prediction
Authors: Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide experimental evidence that this training paradigm is beneficial at scale, with models up to 13B parameters solving around 15% more code problems on average (Section 3). |
| Researcher Affiliation | Collaboration | ¹FAIR at Meta; ²CERMICS, École des Ponts ParisTech; ³LISN, Université Paris-Saclay. |
| Pseudocode | No | The paper includes architectural diagrams (Figure 1, Figure 2) and describes procedures, but it does not present any formal pseudocode blocks or sections explicitly labeled 'Algorithm'. (An illustrative, unofficial sketch of the training objective appears after this table.) |
| Open Source Code | No | The paper mentions using 'xFormers' and cites its GitHub repository, but it does not state that the authors' own implementation code for the multi-token prediction method described in the paper is openly released or available. |
| Open Datasets | Yes | We train models of six sizes in the range 300M to 13B parameters from scratch on at least 91B tokens of code. The evaluation results in Figure 3 for MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021) show that it is possible, with the exact same computational budget, to squeeze much more performance out of large language models given a fixed dataset using multi-token prediction. |
| Dataset Splits | Yes | We finetune each pretrained model on each benchmark's training dataset for three epochs and select the checkpoint with the highest ROUGE-L F1 score on the validation dataset. |
| Hardware Specification | Yes | In aggregate, training all models reported in the paper required around 500K GPU hours of computation on hardware of type A100-80GB and H100. |
| Software Dependencies | No | The paper mentions libraries like 'xFormers' and refers to optimizers like 'Adam' but does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Please refer to Table S14 for the model architectures and to Table S13 for an overview of the hyperparameters we use in our experiments. We schedule all learning rates with a linear warmup and cosine decay (Loshchilov and Hutter, 2017) to a fraction of the peak learning rate, which is depicted in the last column ('decay ratio'). (A sketch of this schedule follows the objective sketch after the table.) |
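Since the paper provides no pseudocode, the following is an illustrative, unofficial sketch of a multi-token prediction training objective in the spirit of the paper's Figure 1: a shared trunk feeds n independent output heads, head i predicts the token i positions ahead, and the per-head cross-entropy losses are summed. The paper itself uses a transformer layer per head with a shared unembedding matrix; plain linear heads are used here only to keep the sketch short, and all class, argument, and variable names are hypothetical rather than taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTokenPredictionSketch(nn.Module):
    """Unofficial sketch of a multi-token prediction objective.

    A shared trunk maps token ids to hidden states; each of the n_future
    heads predicts the token at a different offset ahead. This is NOT the
    authors' implementation (they use a transformer layer per head with a
    shared unembedding); linear heads keep the example short.
    """

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.trunk = trunk  # any module mapping (B, T) ids -> (B, T, d_model)
        self.n_future = n_future
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size, bias=False) for _ in range(n_future)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, T) integer ids with T > n_future.
        Returns the summed cross-entropy over all future-token heads."""
        hidden = self.trunk(tokens)  # (B, T, d_model)
        loss = 0.0
        for i, head in enumerate(self.heads):
            offset = i + 1                        # head i predicts the token offset i+1 ahead
            logits = head(hidden[:, :-offset])    # (B, T - offset, vocab_size)
            targets = tokens[:, offset:]          # (B, T - offset)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return loss
```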
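The learning-rate schedule quoted in the 'Experiment Setup' row (linear warmup followed by cosine decay to a fraction of the peak learning rate) could be sketched as follows. The function name and default values are our own; `decay_ratio` stands in for the per-model values listed in the paper's Table S13.

```python
import math


def lr_at_step(step: int, peak_lr: float, warmup_steps: int,
               total_steps: int, decay_ratio: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay down to decay_ratio * peak_lr.
    decay_ratio = 0.1 is a placeholder, not a value from the paper."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = decay_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```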