Scaling Laws for Fine-Grained Mixture of Experts
Authors: Jan Ludziejewski, Jakub Krajewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run over 100 experiments on the decoder-only Transformer architecture, with each feed-forward component replaced by a Mixture of Experts layer. Those experiments involve training models with sizes ranging from 129M to 3.7B parameters across different training durations, from 16B to 130B tokens. |
| Researcher Affiliation | Collaboration | ¹IDEAS NCBR, ²University of Warsaw, ³Polish Academy of Sciences, ⁴TradeLink, ⁵Nomagic. |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Additionally, we open-source the code used to produce the results described in this work at github.com/llm-random/llm-random. |
| Open Datasets | Yes | All of the models considered in this work are decoder-only Transformers trained on the C4 dataset (Raffel et al., 2023). |
| Dataset Splits | No | The paper uses the C4 dataset but does not explicitly state specific training, validation, and test splits (e.g., percentages or counts) for the dataset itself, nor does it refer to standard predefined splits with full citation for reproducibility. |
| Hardware Specification | Yes | We can see that the model with G = 8 achieves the best performance in this case. ... measured in terms of wall-clock training time on NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and GPT-2 tokenizer but does not specify version numbers for any software dependencies, such as programming languages or libraries (e.g., Python, PyTorch/TensorFlow, the HuggingFace Transformers library). |
| Experiment Setup | Yes | Each batch consists of 0.5M tokens packed into 2048 sequences. Our optimizer is AdamW (Loshchilov & Hutter, 2019), with a weight decay of 0.1. In each training run, we use the maximum learning rate of 2e-4, with linear warmup for 1% of steps and cosine decay to 2e-5. To improve stability, we initialize weights using the truncated normal distribution with reduced scale, as advised in (Fedus et al., 2022). The models are trained using mixed precision; we always keep the attention mechanism and router in high precision. |
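The architecture quoted in the Research Type row (a decoder-only Transformer with every feed-forward component replaced by a Mixture of Experts layer) can be illustrated with a minimal PyTorch-style sketch. This is not the paper's implementation (see github.com/llm-random/llm-random for that); the class name `FineGrainedMoE`, the top-k token-choice routing, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a feed-forward block replaced by a fine-grained MoE layer
# with top-k token-choice routing. Names and hyperparameters are illustrative,
# not taken from the paper's open-source code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, granularity: int, n_experts: int, top_k: int):
        super().__init__()
        # Fine-grained MoE: each expert's hidden size is the dense feed-forward
        # hidden size divided by the granularity, so more (smaller) experts can
        # be activated per token at the same compute budget.
        d_expert = d_ff // granularity
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        gate_logits = self.router(tokens)                        # (n_tokens, n_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(tokens[mask])
        return out.reshape(x.shape)
```

Here `granularity` controls how much smaller each expert is than the dense feed-forward layer, roughly corresponding to the paper's granularity parameter G referenced in the Hardware Specification row.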
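For the Open Datasets row: the paper trains on C4 with a GPT-2 tokenizer but does not publish a data-loading recipe, so the sketch below is only an assumption-laden illustration using HuggingFace `datasets`; the `allenai/c4` identifier, streaming mode, and the document-packing helper are guesses, not the authors' pipeline.

```python
# Hedged sketch of streaming the C4 corpus and tokenizing it with a GPT-2
# tokenizer. Dataset identifier, streaming mode, and packing scheme are
# assumptions for illustration; the paper gives no loading code.
from datasets import load_dataset
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
c4_train = load_dataset("allenai/c4", "en", split="train", streaming=True)


def packed_sequences(dataset, seq_len):
    """Concatenate tokenized documents and cut them into fixed-length chunks."""
    buffer = []
    for example in dataset:
        buffer.extend(tokenizer(example["text"])["input_ids"])
        buffer.append(tokenizer.eos_token_id)  # separate documents with EOS
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```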
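The Experiment Setup row pins down the optimizer and learning-rate schedule: AdamW with weight decay 0.1, a maximum learning rate of 2e-4, linear warmup over 1% of the steps, and cosine decay down to 2e-5. A minimal PyTorch sketch of that schedule follows; `build_optimizer_and_schedule`, `total_steps`, and the hand-rolled `lr_lambda` are illustrative, not the paper's training code.

```python
# Sketch of the optimizer and LR schedule described in the Experiment Setup
# row, assuming PyTorch. total_steps and the model are placeholders.
import math
import torch


def build_optimizer_and_schedule(model: torch.nn.Module, total_steps: int):
    max_lr, min_lr = 2e-4, 2e-5
    warmup_steps = max(1, int(0.01 * total_steps))  # linear warmup for 1% of steps

    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=0.1)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / warmup_steps                           # linear warmup to max_lr
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))      # 1 -> 0
        return (min_lr + (max_lr - min_lr) * cosine) / max_lr    # cosine decay to min_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called once per optimizer step so that the factor returned by `lr_lambda` tracks the global step count.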