Better & Faster Large Language Models via Multi-token Prediction

Authors: Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roziere, David Lopez-Paz, Gabriel Synnaeve

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide experimental evidence that this training paradigm is beneficial at scale, with models up to 13B parameters solving around 15% more code problems on average (Section 3).
Researcher Affiliation | Collaboration | (1) FAIR at Meta, (2) CERMICS, École des Ponts ParisTech, (3) LISN, Université Paris-Saclay.
Pseudocode | No | The paper includes architectural diagrams (Figure 1, Figure 2) and describes procedures, but it does not present any formal pseudocode blocks or sections explicitly labeled "Algorithm".
Open Source Code | No | The paper mentions using xFormers and cites its GitHub repository, but it does not state that the authors' own implementation of the multi-token prediction method is openly released or available.
Open Datasets | Yes | We train models of six sizes in the range 300M to 13B parameters from scratch on at least 91B tokens of code. The evaluation results in Figure 3 for MBPP (Austin et al., 2021) and HumanEval (Chen et al., 2021) show that it is possible, with the exact same computational budget, to squeeze much more performance out of large language models given a fixed dataset using multi-token prediction.
Dataset Splits | Yes | We finetune each pretrained model on each benchmark's training dataset for three epochs and select the checkpoint with the highest ROUGE-L F1 score on the validation dataset.
Hardware Specification | Yes | In aggregate, training all models reported in the paper required around 500K GPU hours of computation on hardware of type A100-80GB and H100.
Software Dependencies | No | The paper mentions libraries like xFormers and optimizers like Adam, but it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Please refer to Table S14 for the model architectures and to Table S13 for an overview of the hyperparameters we use in our experiments. We schedule all learning rates with a linear warmup and cosine decay (Loshchilov and Hutter, 2017) to a fraction of the peak learning rate, which is depicted in the last column ("decay ratio").
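
For context on the method assessed in the table: the paper trains a shared transformer trunk with several independent output heads, each predicting a token further ahead, all sharing one unembedding matrix. Since no official implementation is cited as released (see the Open Source Code row), the PyTorch sketch below is only an illustration under those assumptions; every class, function, and parameter name here (MultiTokenPredictionModel, multi_token_loss, n_future, and so on) is ours, not the authors'.

```python
# Illustrative sketch of multi-token prediction, not the paper's code:
# a shared trunk, n independent heads, and a shared unembedding matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictionModel(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, d_model=512, n_future=4, n_trunk_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_trunk_layers)
        # One independent transformer layer per predicted offset.
        self.heads = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_future)
        )
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)  # shared unembedding
        self.n_future = n_future

    def forward(self, tokens):
        # tokens: (batch, seq_len); a causal mask keeps prediction autoregressive.
        seq_len = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        h = self.trunk(self.embed(tokens), mask=mask)
        # Each head reads the shared hidden state and emits its own logits.
        return [self.unembed(head(h, src_mask=mask)) for head in self.heads]

def multi_token_loss(logits_per_head, tokens):
    # Head i is trained to predict the token (i + 1) positions ahead.
    losses = []
    for i, logits in enumerate(logits_per_head):
        offset = i + 1
        pred = logits[:, :-offset].reshape(-1, logits.size(-1))
        target = tokens[:, offset:].reshape(-1)
        losses.append(F.cross_entropy(pred, target))
    return torch.stack(losses).mean()
```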
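The Dataset Splits row quotes checkpoint selection by the highest ROUGE-L F1 score on the validation split. A minimal sketch of that selection criterion, assuming the rouge_score package and a hypothetical generate(prompt) callable, might look like this:

```python
# Sketch of checkpoint selection by validation ROUGE-L F1 (illustrative;
# `generate` and `validation_pairs` are hypothetical placeholders).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def mean_rouge_l_f1(generate, validation_pairs):
    # validation_pairs: iterable of (prompt, reference) string pairs.
    scores = [
        scorer.score(reference, generate(prompt))["rougeL"].fmeasure
        for prompt, reference in validation_pairs
    ]
    return sum(scores) / len(scores)

# After finetuning for three epochs, keep the per-epoch checkpoint whose
# mean_rouge_l_f1(...) on the validation split is highest.
```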
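The Experiment Setup row quotes a linear warmup followed by cosine decay to a fraction of the peak learning rate (the "decay ratio"). A minimal sketch of such a schedule, with assumed argument names (warmup_steps, decay_ratio, total_steps), is:

```python
# Linear warmup + cosine decay to decay_ratio * peak_lr (illustrative only).
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps, decay_ratio):
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to decay_ratio * peak_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = decay_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In PyTorch such a schedule could be attached to an optimizer via torch.optim.lr_scheduler.LambdaLR by returning lr_at_step(step, ...) / peak_lr as the multiplier.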