In-Context Pretraining: Language Modeling Beyond Document Boundaries

Authors: Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Wen-tau Yih, Mike Lewis

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show IN-CONTEXT PRETRAINING offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
Researcher Affiliation | Collaboration | 1 Meta AI, 2 University of Washington, 3 Allen Institute for AI
Pseudocode | Yes | Algorithm 1 (Maximum Traveling Salesman). Input: document graph G = (D, L), where N(d_i) returns the nearest neighbors of d_i and min_deg(D) returns a minimum-degree document. Output: a path P. (A sketch of this greedy ordering appears after the table.)
Open Source Code | Yes | Code is publicly released at github.com/swj0419/in-context-pretraining.
Open Datasets | Yes | We use the English CommonCrawl dataset (Wenzek et al., 2020), the widely used data source for pretraining LMs.
Dataset Splits | No | The paper uses evaluation datasets such as SST-2, Amazon, and Yelp, and mentions using '32 demonstration examples' or '2-shot in-context learning' for evaluation, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for these datasets or for the pretraining data.
Hardware Specification | Yes | The 7B model is pretrained using 128 A100 GPUs across 16 nodes with a batch size of 4 million tokens.
Software Dependencies | No | The paper mentions software such as LLaMA, the AdamW optimizer, FlashAttention, the Contriever model, and the Faiss library, but it does not specify version numbers for these dependencies (e.g., 'PyTorch 1.9' or 'CUDA 11.1'). (A sketch of a Contriever + Faiss neighbor search follows the table.)
Experiment Setup | Yes | We take the model architecture from LLaMA (Touvron et al., 2023a) and train models across various sizes: 0.3, 0.7, 1.5, and 7.0 billion parameters, all with an 8192-length context window. Following LLaMA, we employ the AdamW optimizer (Loshchilov & Hutter, 2018) with parameters β1 = 0.9 and β2 = 0.95, and a cosine learning rate schedule. The 7B model is pretrained using 128 A100 GPUs across 16 nodes with a batch size of 4 million tokens. (An optimizer/schedule sketch follows the table.)
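
The Software Dependencies row names Contriever and Faiss, which the paper uses to embed documents and retrieve their nearest neighbors. The sketch below shows one way a per-document neighbor graph could be built with Faiss over precomputed Contriever embeddings; the value of k, the cosine-similarity setup, and the function name are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: nearest-neighbor lists over document embeddings with Faiss.
# Embeddings are assumed to come from Contriever (or any dense encoder).
import faiss
import numpy as np

def build_neighbor_graph(embeddings: np.ndarray, k: int = 10) -> dict[int, dict[int, float]]:
    """Map each document id to its top-k most similar documents and scores."""
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(embeddings)                  # cosine similarity via inner product
    index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
    index.add(embeddings)
    scores, ids = index.search(embeddings, k + 1)   # k + 1 because each doc matches itself
    return {
        i: {int(j): float(s) for j, s in zip(ids[i], scores[i]) if int(j) != i}
        for i in range(len(embeddings))
    }
```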
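Given such a neighbor graph, the Pseudocode row summarizes Algorithm 1 as a greedy approximation to the maximum traveling salesman problem: start from a minimum-degree document, repeatedly step to the most similar unvisited neighbor, and restart when a segment dead-ends. A minimal sketch of that loop, assuming the neighbor-graph format produced above, is:

```python
# Minimal sketch of the greedy document ordering described in Algorithm 1.
# `neighbors` maps doc_id -> {neighbor_id: similarity}; names are illustrative.
def greedy_document_path(neighbors: dict[int, dict[int, float]]) -> list[int]:
    remaining = set(neighbors)
    path = []
    while remaining:
        # Start a new segment from a minimum-degree unvisited document (min_deg(D)).
        current = min(remaining, key=lambda d: sum(n in remaining for n in neighbors[d]))
        remaining.remove(current)
        path.append(current)
        # Greedily walk to the most similar unvisited neighbor until none remain.
        while True:
            candidates = {n: s for n, s in neighbors[current].items() if n in remaining}
            if not candidates:
                break
            current = max(candidates, key=candidates.get)
            remaining.remove(current)
            path.append(current)
    return path
```

The resulting path is then chunked into fixed-length training contexts, so related documents end up in the same context window.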
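The Experiment Setup row specifies AdamW with β1 = 0.9, β2 = 0.95 and a cosine learning rate schedule. A minimal PyTorch sketch of that configuration follows; the peak learning rate, weight decay, final learning rate, and step count are assumptions not stated in the quoted setup.

```python
# Hedged sketch of the optimizer and schedule named in the Experiment Setup row.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_optimization(model: torch.nn.Module, peak_lr: float = 3e-4,
                           total_steps: int = 100_000):
    # AdamW with the betas quoted from the paper; weight decay is an assumption.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)
    # Cosine decay from the peak LR; the floor of 10% of peak is an assumption.
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps,
                                  eta_min=0.1 * peak_lr)
    return optimizer, scheduler
```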