Meet in the Middle: A New Pre-training Paradigm

Authors: Anh Nguyen, Nikos Karampatziakis, Weizhu Chen

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on both programming and natural languages and show that MIM significantly surpasses existing pre-training paradigms, in left-to-right generation as well as infilling. In our experiments, we assess MIM's effectiveness for pre-training LMs across various domains and tasks. Using public code and language data, we pre-train LMs of different sizes and evaluate their performance based on perplexity and code completion tasks. We compare MIM with FIM [BJT+22] and other baselines, demonstrating its superior performance in terms of perplexity and task-specific evaluation metrics. Furthermore, we conduct ablation studies to validate the effectiveness of our primary proposals during training and inference.
Researcher Affiliation | Industry | Anh Nguyen, Nikos Karampatziakis, Weizhu Chen (Microsoft Azure AI)
Pseudocode | No | The paper describes procedures and illustrates them with a figure, but does not provide structured pseudocode or algorithm blocks; hedged sketches of the training objective and the infilling procedure are given after this table.
Open Source Code | Yes | Code and models available at https://github.com/microsoft/Meet-in-the-Middle
Open Datasets | Yes | Using public code and language data, we pre-train LMs of different sizes and evaluate their performance based on perplexity and code completion tasks. Our code models are pre-trained on a large and diverse corpus of public code with permissive licenses, covering multiple programming languages. We train our natural language models on data from the following datasets: CC-News, OpenWebText, CC-Stories, and CC-100 to assess the models' language modeling capability. Apart from the natural language data utilized in our language modeling experiments, we also train our models on a filtered version of the Falcon RefinedWeb corpus [PMH+23] used in the previous work of [LBE+23], which has a total of 88B tokens.
Dataset Splits | No | The paper mentions evaluating on a "held-out subset of the combined training data" for in-domain perplexity and using "validation data to select the value λ" in ablation studies. However, it does not provide specific percentages or sample counts for the training, validation, and test splits of the main pre-training datasets, nor does it explicitly cite standard predefined splits for its primary large-scale training corpus.
Hardware Specification | No | The paper mentions using the Megatron-LM framework for training but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud instance specifications) used for running the experiments.
Software Dependencies | No | The paper mentions using the Megatron-LM framework but does not specify version numbers for any software components, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | For the hyperparameters and training setup of each model size, please refer to the Appendix. To explore the impact of model size, we pre-train models with different capacities: 350M, 1.3B, 2.7B.
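
Since the paper conveys its training procedure through prose and a figure rather than pseudocode, the following is a minimal PyTorch-style sketch of what a MIM-like training step could look like: a shared decoder scores the sequence left-to-right and right-to-left, and an agreement term penalizes disagreement between the two directions' predictions for the same token. The model interface (the `direction` keyword), the regularizer weight, and the total-variation-style penalty are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def mim_training_loss(model, tokens: torch.Tensor, reg_weight: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of a MIM-style objective (not the paper's exact loss).

    tokens: (batch, seq_len) integer token ids.
    """
    # Forward direction: logits at position t predict tokens[:, t + 1].
    fwd_logits = model(tokens, direction="forward")  # (B, T, V); "direction" is a hypothetical kwarg
    # Backward direction: run the model on the reversed sequence, then flip back
    # so that logits at position t predict tokens[:, t - 1].
    bwd_logits = model(tokens.flip(dims=[1]), direction="backward").flip(dims=[1])

    # Standard next-token / previous-token cross-entropy losses.
    fwd_lm = F.cross_entropy(fwd_logits[:, :-1].transpose(1, 2), tokens[:, 1:])
    bwd_lm = F.cross_entropy(bwd_logits[:, 1:].transpose(1, 2), tokens[:, :-1])

    # Agreement regularizer: both directions produce a distribution over the token
    # at position t (forward from t - 1, backward from t + 1); penalize a
    # total-variation-style distance between the two distributions.
    p_fwd = F.softmax(fwd_logits[:, :-2], dim=-1)  # forward predictions for positions 1 .. T-2
    p_bwd = F.softmax(bwd_logits[:, 2:], dim=-1)   # backward predictions for the same positions
    agreement = 0.5 * (p_fwd - p_bwd).abs().sum(dim=-1).mean()

    return fwd_lm + bwd_lm + reg_weight * agreement
```

A single `model` object is used above on the assumption that the two directions share parameters; the exact agreement metric and its weighting should be taken from the paper and its appendix rather than from this sketch.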
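For infilling, the "meet in the middle" idea is that a forward pass extends the prefix while a backward pass extends the suffix toward it, and generation stops once the two frontiers agree. The sketch below is a deliberately simplified greedy version: `fwd_next_token` and `bwd_prev_token` are hypothetical callables wrapping the two decoding directions, and the n-gram meeting criterion stands in for the paper's actual termination rule.

```python
from typing import Callable, List, Sequence

def mim_infill(
    fwd_next_token: Callable[[Sequence[int]], int],  # hypothetical wrapper around the forward LM
    bwd_prev_token: Callable[[Sequence[int]], int],  # hypothetical wrapper around the backward LM
    prefix: Sequence[int],
    suffix: Sequence[int],
    max_steps: int = 256,
    n_match: int = 4,
) -> List[int]:
    """Greedy 'meet in the middle' infilling sketch (not the paper's exact procedure)."""
    fwd = list(prefix)  # left frontier: prefix plus forward-generated tokens
    bwd = list(suffix)  # right frontier: backward-generated tokens plus suffix, kept in left-to-right order
    for _ in range(max_steps):
        fwd.append(fwd_next_token(fwd))     # forward LM extends the prefix by one token
        bwd.insert(0, bwd_prev_token(bwd))  # backward LM extends the suffix by one token (to the left)
        # Simplified meeting criterion: the newest n_match forward tokens match the
        # newest n_match backward tokens, so the two sides have converged.
        if len(fwd) >= n_match and len(bwd) >= n_match and fwd[-n_match:] == bwd[:n_match]:
            return fwd + bwd[n_match:]      # splice the halves, dropping the overlap once
    return fwd + bwd                        # no meeting point within the budget; fall back to concatenation
```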