Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Scaling Embedding Layers in Language Models

Authors: Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference. Our results show that SCONE significantly boosts model performance without introducing inference latency bottlenecks; Figure 2 highlights representative findings. Notably, a SCONE model with 1B accelerator-resident parameters outperforms a 1.9B baseline that requires approximately 2× more inference FLOPS and accelerator memory. In Section 4, we evaluate SCONE in pre-training settings. In Section 4.1, we assess SCONE on GPT-2 sized models to study various design choices, and in Section 4.2, we extend the evaluation to large-scale pre-training scenarios involving trillions of tokens. Finally, in Section 4.3, we analyze the inference and storage costs during deployment.
Researcher Affiliation | Industry | Google. Correspondence to: EMAIL.
Pseudocode | Yes | Algorithm 1: SCONE method F_{T, V_f-gram, A_f-gram|F}. Algorithm 2: Basic Next-Word Prediction Model M_{T, A, D}. Algorithm 3: Constructing a set of f-grams V_f-gram. Algorithm 4: Next-word prediction with SCONE M_{T, V_f-gram, A_f-gram|F, A_main, D}.
Open Source Code | No | We will release our code after the reviewing process.
Open Datasets | Yes | We use the released GPT-2 tokenizer, which has |V_token| = 50,257, and train on the WebText dataset [Peterson et al., 2019]. ... For evaluation, we use the validation split of WebText and WikiText-103 [Merity et al., 2017]. ... Our implementation builds on the open-source OLMo codebase [Groeneveld et al., 2024], licensed under Apache 2.0. ... We uniformly sample tokenized sequences from Dolma [Soldaini et al., 2024] to vary the corpus size.
Dataset Splits | No | For evaluation, we use the validation split of WebText and WikiText-103. ... In the main text, we focus on presenting downstream accuracy results under our primary training setting. Additional results, including perplexity evaluations, training curves, and further SCONE configurations under alternative settings, are provided in Appendix E.2. ... The evaluation perplexity curves for OLMo-0.7B, OLMo-1B, and OLMo-1.3B throughout training.
Hardware Specification | Yes | We experiment with |V_f-gram| being 10M, 100M, and 1B with embedding dimension of d = 2048 and 16-bit precision per floating point value. Experiments were conducted on a workstation with 64 Intel Xeon CPU cores and 512 GB of memory. ... On a single A100 GPU. ... All measurements are taken with a context length of 2048 and a batch size of 4 on a single A100 80 GB GPU. ... Most of our experiments are conducted on 4–8 H100 nodes, while some experiments are conducted on 2–16 A100 nodes.
Software Dependencies | No | Our implementation builds on the open-source OLMo codebase [Groeneveld et al., 2024], licensed under Apache 2.0. ... For optimization, we use AdamW [Loshchilov and Hutter, 2019]. ... We use DeepSpeed [DeepSpeed, 2024] with ZeRO stage 1 that partitions the optimizer state across GPUs to reduce GPU memory usage.
Experiment Setup | Yes | For pre-training on WebText [Peterson et al., 2019], we follow Radford et al. [2019] and set the batch size and sequence length to 512 and 1024, respectively. ... We train the models for 80B tokens... For optimization, we use AdamW [Loshchilov and Hutter, 2019] with a weight decay of 0.1. Following Hoffmann et al. [2022], we set the maximum learning rate to 2 × 10^-4 and apply a cosine learning rate scheduler. ... All models use a sequence length of 2048.
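
The figures quoted in the Hardware Specification and Experiment Setup rows can be checked with some quick arithmetic. The sketch below is illustrative and is not the authors' code. All function names are hypothetical. It assumes that the batch size of 512 counts sequences, that the f-gram table stores one d-dimensional 16-bit vector per entry with no index or metadata overhead, and that the cosine schedule has no warmup and decays to zero. The excerpts above state none of these assumptions.

```python
import math


def embedding_table_bytes(num_entries: int, dim: int, bits_per_value: int) -> int:
    """Raw size of a dense embedding table (assumes no index or metadata overhead)."""
    return num_entries * dim * bits_per_value // 8


def optimizer_steps(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Number of steps needed to consume total_tokens, rounding up.

    Assumes batch_size counts sequences, so each step sees batch_size * seq_len tokens.
    """
    tokens_per_step = batch_size * seq_len
    return -(-total_tokens // tokens_per_step)  # ceiling division


def cosine_lr(step: int, total_steps: int, max_lr: float, min_lr: float = 0.0) -> float:
    """Plain cosine decay from max_lr to min_lr.

    Assumption: no warmup and min_lr = 0; the quoted setup only gives the maximum rate.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # f-gram table sizes quoted in the Hardware Specification row.
    for entries in (10_000_000, 100_000_000, 1_000_000_000):
        size_gb = embedding_table_bytes(entries, dim=2048, bits_per_value=16) / 1e9
        print(f"{entries:>13,} entries: {size_gb:,.2f} GB")

    # WebText setup quoted in the Experiment Setup row.
    steps = optimizer_steps(80_000_000_000, batch_size=512, seq_len=1024)
    print(f"steps for 80B tokens: {steps:,}")
    print(f"learning rate at the halfway point: {cosine_lr(steps // 2, steps, 2e-4):.2e}")
```

Under these assumptions, the three table sizes come to about 41 GB, 410 GB, and 4.1 TB (decimal units), and 80B tokens at 512 × 1024 tokens per step corresponds to roughly 152,600 optimizer steps.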