Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. (A hedged sketch of this power-law prediction appears after the table.)
Researcher Affiliation | Collaboration | The University of Hong Kong; Sea AI Lab; Contextual AI; Stanford University; The Ohio State University
Pseudocode | No | The paper does not contain any pseudocode blocks or explicitly labeled algorithms. It provides mathematical derivations in the appendix.
Open Source Code | Yes | The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo.
Open Datasets | Yes | For all experiments, we uniformly sample the training data from different domains in the SlimPajama dataset [61].
Dataset Splits | Yes | We evaluate the normalized loss L_u on a held-out validation dataset. (An illustrative per-character loss sketch follows the table.)
Hardware Specification | Yes | For our experiments with N_nv = 2870M, it takes about 120 hours to train on over 500B training characters with 64 total GPUs. We use a global batch size of 512 for all runs and run all experiments on 40GB NVIDIA A100 GPUs. (A compute-budget sketch follows the table.)
Software Dependencies | No | The paper mentions using "AdamW [37]" as the optimizer, "bfloat16 mixed precision training," and the "Megatron-LM framework [60]." However, specific version numbers for these software components or any other libraries (e.g., Python, PyTorch) are not provided.
Experiment Setup | Yes | The maximum learning rate is set to 4e-4 and decays to 10% of its peak, i.e., 4e-5, similar to prior scaling work [26, 44]. We use a global batch size of 512 for all runs... We adopt the Llama architecture [69], except for the vocabulary size. For the vocabulary size, we use numbers divisible by 128 for compatibility with NVIDIA's tensor cores to accelerate matrix multiplication. (Both settings are sketched after the table.)
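
The Research Type row summarizes the paper's central claim: the compute-optimal vocabulary size grows with the non-vocabulary parameter count, and the released demo predicts it for a given model size. The sketch below only illustrates the general power-law form of such a prediction; the coefficient `k` and exponent `gamma` are placeholders rather than the paper's fitted values, so the authors' repository or demo should be used for actual predictions.

```python
def predict_optimal_vocab_size(non_vocab_params: float,
                               k: float = 0.02,
                               gamma: float = 0.7,
                               multiple: int = 128) -> int:
    """Illustrative power-law prediction V_opt ~ k * N_nv ** gamma.

    `k` and `gamma` are placeholder constants, NOT the coefficients fitted
    in the paper; the result is rounded to a multiple of 128, matching the
    paper's vocabulary-size convention.
    """
    v_opt = k * non_vocab_params ** gamma
    return max(multiple, round(v_opt / multiple) * multiple)


# Model sizes mentioned in the paper's setup (33M and 2870M non-vocabulary parameters).
for n_nv in (33e6, 2.87e9):
    print(f"N_nv = {n_nv:.2e} -> illustrative V_opt ~ {predict_optimal_vocab_size(n_nv):,}")
```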
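
The Dataset Splits row refers to a normalized loss L_u evaluated on a held-out validation set; the paper normalizes the loss because raw per-token losses are not comparable across different vocabulary sizes. The snippet below is not the paper's unigram-normalized formulation; it is a simpler vocabulary-insensitive stand-in (loss per character) that illustrates why such normalization is needed, with made-up example numbers.

```python
def loss_per_character(avg_token_loss: float, total_tokens: int, total_chars: int) -> float:
    """Convert an average per-token cross-entropy (nats/token) into nats per
    character, so runs tokenized with different vocabularies can be compared
    on the same character stream."""
    return avg_token_loss * total_tokens / total_chars


# Hypothetical runs over the same 1B-character validation stream: a larger
# vocabulary compresses the text into fewer (but individually harder) tokens.
small_vocab = loss_per_character(2.10, total_tokens=260_000_000, total_chars=1_000_000_000)
large_vocab = loss_per_character(2.35, total_tokens=225_000_000, total_chars=1_000_000_000)
print(f"small vocab: {small_vocab:.3f} nats/char | large vocab: {large_vocab:.3f} nats/char")
```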
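
The Hardware Specification row reports training on over 500B characters for the N_nv = 2870M model. Because the paper budgets data in characters, the token count, and hence the compute, depends on how many characters the tokenizer packs into each token. The sketch below converts a character budget into tokens and applies the standard C ≈ 6·N·D compute approximation; the characters-per-token ratio is an illustrative placeholder, not a value reported in the paper.

```python
def estimate_training_flops(total_params: float,
                            training_chars: float,
                            chars_per_token: float = 4.0) -> float:
    """Rough training-compute estimate using the common C ~= 6 * N * D rule.

    `chars_per_token` is an illustrative compression ratio; in practice it
    depends on the tokenizer and grows with vocabulary size.
    """
    tokens = training_chars / chars_per_token
    return 6.0 * total_params * tokens


# Example with the paper's largest reported run (~2.87B non-vocabulary
# parameters, up to 500B training characters); vocabulary parameters would
# add to `total_params` in a full accounting.
print(f"{estimate_training_flops(2.87e9, 500e9):.2e} FLOPs")  # ~2.15e+21 under these assumptions
```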
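
The Experiment Setup row states that vocabulary sizes are chosen to be divisible by 128 and that the learning rate decays from a peak of 4e-4 to 10% of that peak (4e-5). Both choices are sketched below; the cosine shape of the decay and the warmup length are assumptions for illustration, since the paper only states the peak and final learning rates.

```python
import math


def round_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round a vocabulary size up to the nearest multiple of 128, as the paper
    does for compatibility with NVIDIA tensor cores."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


def learning_rate(step: int, total_steps: int, peak_lr: float = 4e-4,
                  final_lr: float = 4e-5, warmup_steps: int = 1000) -> float:
    """Linear warmup followed by a decay from peak_lr to final_lr (10% of peak).
    The cosine decay shape and the warmup_steps value are illustrative assumptions."""
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


print(round_vocab_size(50257))                   # -> 50304
print(f"{learning_rate(0, 100_000):.1e}")        # 0.0 at the first warmup step
print(f"{learning_rate(100_000, 100_000):.1e}")  # -> 4.0e-05 at the end of training
```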