Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. |
| Researcher Affiliation | Collaboration | 1The University of Hong Kong 2Sea AI Lab 3Contextual AI 4Stanford University 5The Ohio State University |
| Pseudocode | No | The paper does not contain any pseudocode blocks or algorithms explicitly labeled as such. It provides mathematical derivations in the appendix. |
| Open Source Code | Yes | The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo. |
| Open Datasets | Yes | For all experiments, we uniformly sample the training data from different domains in the Slim Pajama dataset [61]. |
| Dataset Splits | Yes | We evaluate the normalized loss Lu on a held-out validation dataset. |
| Hardware Specification | Yes | For our experiments with Nnv = 2870M, it takes about 120 hours to train on over 500B training characters with 64 GPUs in total. We use a global batch size of 512 for all runs and run all experiments on 40GB NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using "AdamW [37]" as the optimizer, "bfloat16 mixed precision training," and the "Megatron-LM framework [60]." However, specific version numbers for these software components or any other libraries (e.g., Python, PyTorch) are not provided. |
| Experiment Setup | Yes | The maximum learning rate is set to 4e-4 and decays to 10% (i.e., 4e-5), similar to prior scaling work [26, 44]. We use a global batch size of 512 for all runs... We adopt the Llama architecture [69], except for the vocabulary size. For the vocabulary size, we use numbers divisible by 128 for compatibility with NVIDIA's tensor cores to accelerate matrix multiplication. |
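The Experiment Setup row notes that vocabulary sizes are kept divisible by 128 so that the embedding/unembedding matrix dimensions align well with NVIDIA tensor cores. A minimal sketch of that rounding step, assuming a hypothetical helper name (`round_vocab_size` is not from the paper), might look like:

```python
# Hedged sketch: snap a candidate (e.g., predicted-optimal) vocabulary size
# to the nearest positive multiple of 128, matching the paper's constraint
# for tensor-core-friendly matrix shapes. Names and values are illustrative.

def round_vocab_size(v: int, multiple: int = 128) -> int:
    """Round v to the nearest positive multiple of `multiple`."""
    return max(multiple, round(v / multiple) * multiple)


if __name__ == "__main__":
    # 32000 is already a multiple of 128; 50257 (GPT-2's size) snaps to 50304.
    print(round_vocab_size(32000))  # 32000
    print(round_vocab_size(50257))  # 50304
```

Rounding to the nearest multiple (rather than always up or down) keeps the adjusted size as close as possible to the predicted optimum while satisfying the alignment constraint.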