Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies
Authors: Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations. We validate our predictions empirically by training models with 3B parameters across different FLOPs budgets. Adopting our predicted optimal vocabulary size consistently improves downstream performance over commonly used vocabulary sizes. |
| Researcher Affiliation | Collaboration | 1The University of Hong Kong 2Sea AI Lab 3Contextual AI 4Stanford University 5The Ohio State University |
| Pseudocode | No | The paper does not contain any pseudocode blocks or algorithms explicitly labeled as such. It provides mathematical derivations in the appendix. |
| Open Source Code | Yes | The code and demo are available at https://github.com/sail-sg/scaling-with-vocab and https://hf.co/spaces/sail/scaling-with-vocab-demo. |
| Open Datasets | Yes | For all experiments, we uniformly sample the training data from different domains in the Slim Pajama dataset [61]. |
| Dataset Splits | Yes | We evaluate the normalized loss Lu on a held-out validation dataset. |
| Hardware Specification | Yes | For our experiments with Nnv = 2870M, it takes about 120 hours to train on over 500B training characters with 64 GPUs in total. We use a global batch size of 512 for all runs and run all experiments on 40GB NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using "AdamW [37]" as the optimizer, "bfloat16 mixed precision training," and the "Megatron-LM framework [60]." However, specific version numbers for these software components or any other libraries (e.g., Python, PyTorch) are not provided. |
| Experiment Setup | Yes | The maximum learning rate is set to 4e-4 and decays to 10% (i.e., 4e-5), similar to prior scaling work [26, 44]. We use a global batch size of 512 for all runs... We adopt the Llama architecture [69], except for the vocabulary size. For the vocabulary size, we use numbers divisible by 128 for compatibility with NVIDIA's tensor cores to accelerate matrix multiplication. |
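The Experiment Setup row notes that vocabulary sizes are kept divisible by 128 so that the embedding/unembedding matrix dimensions align well with NVIDIA tensor cores. A minimal sketch of that rounding step, assuming a hypothetical helper name (`round_vocab_size` is not from the paper), might look like:

```python
# Hedged sketch: snap a candidate (e.g., predicted-optimal) vocabulary size
# to the nearest positive multiple of 128, matching the paper's constraint
# for tensor-core-friendly matrix shapes. Names and values are illustrative.

def round_vocab_size(v: int, multiple: int = 128) -> int:
    """Round v to the nearest positive multiple of `multiple`."""
    return max(multiple, round(v / multiple) * multiple)


if __name__ == "__main__":
    # 32000 is already a multiple of 128; 50257 (GPT-2's size) snaps to 50304.
    print(round_vocab_size(32000))  # 32000
    print(round_vocab_size(50257))  # 50304
```

Rounding to the nearest multiple (rather than always up or down) keeps the adjusted size as close as possible to the predicted optimum while satisfying the alignment constraint.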