In-Context Pretraining: Language Modeling Beyond Document Boundaries
Authors: Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Wen-tau Yih, Mike Lewis
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show IN-CONTEXT PRETRAINING offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%). |
| Researcher Affiliation | Collaboration | ¹Meta AI, ²University of Washington, ³Allen Institute for AI |
| Pseudocode | Yes | Algorithm 1 (Maximum Traveling Salesman). Input: document graph G = (D, L); N(d_i) returns the nearest neighbors of d_i; min_deg(D) returns a minimum-degree document. Output: a path P. (See the greedy ordering sketch after the table.) |
| Open Source Code | Yes | Code is publicly released at github.com/swj0419/in-context-pretraining. |
| Open Datasets | Yes | We use the English CommonCrawl dataset (Wenzek et al., 2020), the widely used data source for pretraining LMs. |
| Dataset Splits | No | The paper uses evaluation datasets such as SST-2, Amazon, and Yelp, and mentions '32 demonstration examples' or '2-shot in-context learning' for evaluation, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for these datasets or for the pretraining data. |
| Hardware Specification | Yes | The 7B model is pretrained using 128 A100 GPUs across 16 nodes with a batch size of 4 million tokens. |
| Software Dependencies | No | The paper mentions software such as LLaMA, the AdamW optimizer, flash attention, the Contriever model, and the Faiss library, but it does not specify version numbers for these dependencies (e.g., 'PyTorch 1.9' or 'CUDA 11.1'). |
| Experiment Setup | Yes | We take the model architecture from LLaMA (Touvron et al., 2023a) and train models across various sizes: 0.3, 0.7, 1.5, and 7.0 billion parameters, all with an 8192-length context window. Following LLaMA, we employ the AdamW optimizer (Loshchilov & Hutter, 2018) with parameters β1 = 0.9 and β2 = 0.95, and a cosine learning rate schedule. The 7B model is pretrained using 128 A100 GPUs across 16 nodes with a batch size of 4 million tokens. (An optimizer configuration sketch follows the table.) |
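The pseudocode row refers to the paper's greedy approximation to the maximum traveling salesman problem, which orders documents so that semantically similar ones land in the same pretraining context. Below is a minimal Python sketch of that greedy traversal; the function name `order_documents` and the dict-of-dicts similarity graph are illustrative assumptions, not the authors' released implementation (see github.com/swj0419/in-context-pretraining for that).

```python
# Minimal sketch of the greedy max-TSP traversal described in Algorithm 1.
# The graph format and function name are assumptions made for illustration.

def order_documents(similarity):
    """Return a document order that greedily maximizes pairwise similarity.

    similarity: {doc_id: {neighbor_id: score}}, i.e. the sparse k-NN
                document graph G = (D, L) built from retrieval scores.
    """
    unvisited = set(similarity)
    path = []
    while unvisited:
        # Start a new segment from a minimum-degree document (min_deg(D)),
        # counting only edges to documents that are still unvisited.
        current = min(
            unvisited,
            key=lambda d: sum(n in unvisited for n in similarity[d]),
        )
        unvisited.remove(current)
        path.append(current)
        while True:
            # Extend the path to the most similar unvisited neighbor (N(d_i)).
            candidates = [n for n in similarity[current] if n in unvisited]
            if not candidates:
                break  # segment ends; restart from another min-degree document
            nxt = max(candidates, key=lambda n: similarity[current][n])
            unvisited.remove(nxt)
            path.append(nxt)
            current = nxt
    return path


if __name__ == "__main__":
    # Toy symmetric similarity graph over four documents.
    sim = {
        "a": {"b": 0.9, "c": 0.1},
        "b": {"a": 0.9, "d": 0.7},
        "c": {"a": 0.1, "d": 0.2},
        "d": {"b": 0.7, "c": 0.2},
    }
    print(order_documents(sim))  # related documents end up adjacent
```

On the real corpus the similarity graph comes from approximate nearest-neighbor retrieval (the Contriever embeddings and Faiss index noted in the dependencies row), so each document keeps only a handful of neighbors and the traversal stays roughly linear in the number of edges.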
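For the experiment-setup row, the quoted hyperparameters map onto a standard PyTorch optimizer configuration. In the sketch below, only the betas and the cosine schedule come from the quoted text; the stand-in model, peak learning rate, and step count are placeholders.

```python
# Sketch of the quoted optimizer settings: AdamW with beta1=0.9, beta2=0.95
# and a cosine learning-rate schedule. The model, learning rate, and step
# count are placeholders, not values reported in the table above.
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the LLaMA-style transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)

for step in range(10):  # skeleton of the training loop
    optimizer.zero_grad()
    loss = model(torch.randn(1, 16)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```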