In-Context Pretraining: Language Modeling Beyond Document Boundaries

Authors: Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Wen-tau Yih, Mike Lewis

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show IN-CONTEXT PRETRAINING offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
Researcher Affiliation | Collaboration | 1 Meta AI, 2 University of Washington, 3 Allen Institute for AI
Pseudocode | Yes | Algorithm 1 (Maximum Traveling Salesman). Input: document graph G = (D, L), where N(d_i) returns the nearest neighbors of d_i and min_deg(D) returns a minimum-degree document. Output: a path P. (A sketch of this greedy ordering appears after the table.)
Open Source Code | Yes | Code is publicly released at github.com/swj0419/in-context-pretraining.
Open Datasets | Yes | We use the English CommonCrawl dataset (Wenzek et al., 2020), the widely used data source for pretraining LMs.
Dataset Splits | No | The paper uses evaluation datasets such as SST-2, Amazon, and Yelp, and mentions using '32 demonstration examples' or '2-shot in-context learning' for evaluation, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for these datasets or for the pretraining data.
Hardware Specification | Yes | The 7B model is pretrained using 128 A100 GPUs across 16 nodes with a batch size of 4 million tokens.
Software Dependencies | No | The paper mentions software such as LLaMA, the AdamW optimizer, FlashAttention, the Contriever model, and the Faiss library, but it does not specify version numbers for these dependencies (e.g., 'PyTorch 1.9' or 'CUDA 11.1'). (A sketch of a Contriever + Faiss neighbor search follows the table.)
Experiment Setup | Yes | We take the model architecture from LLaMA (Touvron et al., 2023a) and train models across various sizes: 0.3, 0.7, 1.5, and 7.0 billion parameters, all with an 8192-length context window. Following LLaMA, we employ the AdamW optimizer (Loshchilov & Hutter, 2018) with parameters β1 = 0.9 and β2 = 0.95, and a cosine learning rate schedule. The 7B model is pretrained using 128 A100 GPUs across 16 nodes with a batch size of 4 million tokens. (An optimizer/schedule sketch follows the table.)
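
The Software Dependencies row names Contriever and Faiss, which the paper uses to embed documents and retrieve their nearest neighbors. The sketch below shows one way a per-document neighbor graph could be built with Faiss over precomputed Contriever embeddings; the value of k, the cosine-similarity setup, and the function name are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: nearest-neighbor lists over document embeddings with Faiss.
# Embeddings are assumed to come from Contriever (or any dense encoder).
import faiss
import numpy as np

def build_neighbor_graph(embeddings: np.ndarray, k: int = 10) -> dict[int, dict[int, float]]:
    """Map each document id to its top-k most similar documents and scores."""
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    faiss.normalize_L2(embeddings)                  # cosine similarity via inner product
    index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
    index.add(embeddings)
    scores, ids = index.search(embeddings, k + 1)   # k + 1 because each doc matches itself
    return {
        i: {int(j): float(s) for j, s in zip(ids[i], scores[i]) if int(j) != i}
        for i in range(len(embeddings))
    }
```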
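Given such a neighbor graph, the Pseudocode row summarizes Algorithm 1 as a greedy approximation to the maximum traveling salesman problem: start from a minimum-degree document, repeatedly step to the most similar unvisited neighbor, and restart when a segment dead-ends. A minimal sketch of that loop, assuming the neighbor-graph format produced above, is:

```python
# Minimal sketch of the greedy document ordering described in Algorithm 1.
# `neighbors` maps doc_id -> {neighbor_id: similarity}; names are illustrative.
def greedy_document_path(neighbors: dict[int, dict[int, float]]) -> list[int]:
    remaining = set(neighbors)
    path = []
    while remaining:
        # Start a new segment from a minimum-degree unvisited document (min_deg(D)).
        current = min(remaining, key=lambda d: sum(n in remaining for n in neighbors[d]))
        remaining.remove(current)
        path.append(current)
        # Greedily walk to the most similar unvisited neighbor until none remain.
        while True:
            candidates = {n: s for n, s in neighbors[current].items() if n in remaining}
            if not candidates:
                break
            current = max(candidates, key=candidates.get)
            remaining.remove(current)
            path.append(current)
    return path
```

The resulting path is then chunked into fixed-length training contexts, so related documents end up in the same context window.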
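The Experiment Setup row specifies AdamW with β1 = 0.9, β2 = 0.95 and a cosine learning rate schedule. A minimal PyTorch sketch of that configuration follows; the peak learning rate, weight decay, final learning rate, and step count are assumptions not stated in the quoted setup.

```python
# Hedged sketch of the optimizer and schedule named in the Experiment Setup row.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_optimization(model: torch.nn.Module, peak_lr: float = 3e-4,
                           total_steps: int = 100_000):
    # AdamW with the betas quoted from the paper; weight decay is an assumption.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)
    # Cosine decay from the peak LR; the floor of 10% of peak is an assumption.
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps,
                                  eta_min=0.1 * peak_lr)
    return optimizer, scheduler
```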