Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Authors: Woojin Chung, Jeonghoon Kim

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this study, we explore why enlarging vocabulary size improves the performance of language models by expanding vocabulary from 24K to 196K. Viewing BPE [54] with its pre-tokenization rules as a lossless compressor [13, 33], we assess how much the tokenized text is compressed against raw text by an upper-bound on Kolmogorov complexity and first illustrate that expanding the vocabulary reduces this complexity (§3.3). Analytic experiments trace how enlarging the vocabulary changes both training dynamics and generalization behavior through token-frequency imbalance. Our experiments further reveal that beyond a certain size, vocabulary expansion no longer improves segmentation but instead steepens the skewness of the token-frequency distribution. This sharper imbalance alone lowers global cross-entropy by reducing the top 2,500 frequent words loss despite slight degradation on the rare tail. Through cross-dataset overlap analyses, we demonstrate that exploiting, rather than mitigating, token frequency imbalance causally reduces cross-entropy and boosts downstream accuracy. Finally, we show that parameter scaling replicates the same benefit as vocabulary scaling, both primarily reduce uncertainty on the same set of frequent tokens.
Researcher Affiliation Collaboration Woojin Chung (KAIST); Jeonghoon Kim (NAVER Cloud & KAIST)
Pseudocode No The paper describes methodologies and experiments but does not include any clearly labeled pseudocode or algorithm blocks in the main text.
Open Source Code Yes github.com/Chung-Kim/vocab-imbalance
Open Datasets Yes we train a byte-pair encoding (BPE) tokenizer [54] and estimate token frequencies using a sample of 10 billion GPT-2 tokens from FineWeb-Edu [43] and the entire OpenWebText [15]. For model pre-training, we use approximately 40 billion characters, about 7.5 billion tokens for FineWeb-Edu and 7 billion for OpenWebText with a 49K vocabulary. To compute the metrics below, we also drew an additional 5 billion characters that did not overlap with the training corpus. We report word-level average loss to ensure fair comparison across vocabulary sizes. Whenever a smaller-vocabulary tokenizer splits a word into multiple tokens (i.e., subwords), we sum their individual losses. Our model comprises 85 million non-embedding parameters with pre-layer normalization (pre-LN) [62]. Training uses AdamW [36] (β1 = 0.9, β2 = 0.95, ϵ = 10^-8) with a learning rate of 6 × 10^-4 that follows a cosine-decay schedule after a 350 million-token warmup, weight decay of 0.1, and gradient clipping at 1.0. Every experiment was repeated with five seeds (see Appendix J).
Dataset Splits Yes For model pre-training, we use approximately 40 billion characters, about 7.5 billion tokens for FineWeb-Edu and 7 billion for OpenWebText with a 49K vocabulary. To compute the metrics below, we also drew an additional 5 billion characters that did not overlap with the training corpus. Models are pre-trained on 40B bytes and evaluated on a separate, non-overlapping 5B byte split of FineWeb-Edu.
Hardware Specification No The paper does not explicitly provide details about the specific hardware used (e.g., GPU/CPU models, memory) for running its experiments.
Software Dependencies No The paper mentions optimizers and training configurations but does not provide specific software dependencies with version numbers (e.g., PyTorch 1.x, Python 3.x).
Experiment Setup Yes Our model comprises 85 million non-embedding parameters with pre-layer normalization (pre-LN) [62]. Training uses AdamW [36] (β1 = 0.9, β2 = 0.95, ϵ = 10^-8) with a learning rate of 6 × 10^-4 that follows a cosine-decay schedule after a 350 million-token warmup, weight decay of 0.1, and gradient clipping at 1.0. Every experiment was repeated with five seeds (see Appendix J). In this section, we provide detailed configurations of pretraining to reproduce our results. The training setup (Table 7) uses a global batch size of 256, weight decay 0.1, and sequence length 2048. Optimization is Adam with a cosine learning-rate schedule, a 700-step warmup, and a weight-initialization scale of 0.02. The model setup (Table 8) covers two model sizes: an 85M model with 12 layers and 12 heads (d_model = 768, d_ffn = 2048, d_head = 64) and a 450M model with 21 layers and 21 heads (d_model = 1344, d_ffn = 3548, d_head = 21). Together, these tables specify the standardized training hyperparameters and the core architectural dimensions for both scales.
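
The training numbers quoted in the Experiment Setup row can be cross-checked with a little arithmetic. The following is a minimal sketch, not the authors' code: it assumes a linear warmup shape, a hypothetical run length (TOTAL_STEPS), and a decay floor of zero (MIN_LR), none of which are stated in the source. With a global batch of 256 sequences of 2048 tokens, each step covers 524,288 tokens, so a 700-step warmup spans about 367 million tokens, roughly in line with the 350 million-token warmup quoted. Under the common convention d_head = d_model / heads (also an assumption here), both model sizes give 64, which differs from the d_head = 21 listed for the 450M model; that value may be a transcription slip.

```python
import math

# Values quoted in the Experiment Setup row.
GLOBAL_BATCH_SIZE = 256   # sequences per optimizer step
SEQUENCE_LENGTH = 2048    # tokens per sequence
WARMUP_STEPS = 700
PEAK_LR = 6e-4

# Assumptions for illustration only (not stated in the source).
TOTAL_STEPS = 14_000      # hypothetical run length
MIN_LR = 0.0              # hypothetical floor of the cosine decay


def tokens_per_step(batch_size: int = GLOBAL_BATCH_SIZE,
                    seq_len: int = SEQUENCE_LENGTH) -> int:
    """Number of tokens consumed by one optimizer step."""
    return batch_size * seq_len


def learning_rate(step: int, total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup (assumed shape) to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"tokens per step: {tokens_per_step():,}")
    print(f"warmup tokens:   {WARMUP_STEPS * tokens_per_step():,}")
    # Head dimension under the assumed convention d_head = d_model / heads.
    for n_heads, d_model in ((12, 768), (21, 1344)):
        print(f"{n_heads} heads, d_model={d_model}: d_head={d_model // n_heads}")
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {step:>6}: lr={learning_rate(step):.2e}")
```

Changing TOTAL_STEPS only stretches or compresses the decay phase; the warmup token count depends solely on the batch size, sequence length, and warmup steps taken from the row above.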