Scaling Laws for Fine-Grained Mixture of Experts

Authors: Jan Ludziejewski, Jakub Krajewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run over 100 experiments on the decoder-only Transformer architecture, with each feed-forward component replaced by a Mixture of Experts layer. Those experiments involve training models with sizes ranging from 129M to 3.7B parameters across different training durations, from 16B to 130B tokens. (A minimal sketch of such an MoE-based decoder block appears after this table.)
Researcher Affiliation | Collaboration | 1) IDEAS NCBR, 2) University of Warsaw, 3) Polish Academy of Sciences, 4) TradeLink, 5) Nomagic.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Additionally, we open-source the code used to produce the results described in this work at github.com/llm-random/llm-random.
Open Datasets | Yes | All of the models considered in this work are decoder-only Transformers trained on the C4 dataset (Raffel et al., 2023). (A hypothetical loading sketch appears after this table.)
Dataset Splits | No | The paper uses the C4 dataset but does not state explicit training, validation, and test splits (e.g., percentages or counts), nor does it point to standard predefined splits with a citation that would allow them to be reproduced.
Hardware Specification | Yes | We can see that the model with G = 8 achieves the best performance in this case. ... measured in terms of wall-clock training time on NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions using the AdamW optimizer and the GPT-2 tokenizer but does not specify version numbers for any software dependencies, such as the programming language or libraries (e.g., Python, PyTorch/TensorFlow, or the HuggingFace Transformers library).
Experiment Setup | Yes | Each batch consists of 0.5M tokens packed into 2048 sequences. Our optimizer is AdamW (Loshchilov & Hutter, 2019), with a weight decay of 0.1. In each training run, we use the maximum learning rate of 2e-4, with linear warmup for 1% of steps and cosine decay to 2e-5. To improve stability, we initialize weights using the truncated normal distribution with reduced scale, as advised in (Fedus et al., 2022). The models are trained using mixed precision; we always keep the attention mechanism and router in high precision. (A configuration sketch for this setup appears after this table.)
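
To make the "Research Type" row concrete, the following is a minimal, self-contained PyTorch sketch of a decoder block whose feed-forward sublayer is replaced by a fine-grained, token-choice MoE layer. The class names, the dense per-token gather, and the default sizes are illustrative assumptions; the authors' actual implementation is in the llm-random repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedMoE(nn.Module):
    """Token-choice top-k MoE. Higher 'granularity' means more, smaller experts
    for roughly the same total parameter count (the paper's G knob)."""

    def __init__(self, d_model, d_ff, n_experts=32, granularity=4):
        super().__init__()
        d_expert = d_ff // granularity        # each fine-grained expert is smaller
        self.top_k = granularity              # activate more experts to keep FLOPs roughly fixed
        self.router = nn.Linear(d_model, n_experts)
        self.w_in = nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_expert))
        self.w_out = nn.Parameter(0.02 * torch.randn(n_experts, d_expert, d_model))

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = F.softmax(self.router(x.float()), dim=-1)   # router kept in high precision
        weights, idx = scores.topk(self.top_k, dim=-1)        # (batch, seq, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # dense per-token gather; real code batches this
            e = idx[..., k]                    # chosen expert id per token, (batch, seq)
            gate = weights[..., k].unsqueeze(-1).to(x.dtype)
            h = F.relu(torch.einsum("bsd,bsde->bse", x, self.w_in[e]))
            out = out + gate * torch.einsum("bse,bsed->bsd", h, self.w_out[e])
        return out


class MoEDecoderBlock(nn.Module):
    """Pre-norm decoder block with the feed-forward sublayer replaced by the MoE layer above."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.moe = FineGrainedMoE(d_model, d_ff)

    def forward(self, x):                      # x: (batch, seq, d_model)
        seq_len = x.size(1)
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        return x + self.moe(self.ln2(x))


block = MoEDecoderBlock()
tokens = torch.randn(2, 16, 512)               # (batch, seq, d_model)
print(block(tokens).shape)                     # torch.Size([2, 16, 512])
```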
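The "Open Datasets" and "Software Dependencies" rows reference the public C4 corpus and the GPT-2 tokenizer. One possible way to obtain both with the Hugging Face `datasets` and `transformers` libraries is sketched below; the paper does not state how the data was actually loaded or preprocessed, so the dataset path and streaming mode are assumptions.

```python
from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Stream the English split of C4 rather than downloading the full corpus up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # GPT-2 BPE tokenizer, as mentioned in the paper

example = next(iter(c4))                      # each record has "text", "url", "timestamp"
token_ids = tokenizer(example["text"])["input_ids"]
print(len(token_ids), token_ids[:10])
```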
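Finally, the hyperparameters quoted in the "Experiment Setup" row map onto a short optimizer and scheduler configuration. The sketch below (plain PyTorch; the placeholder model, step count, and helper name are assumptions, not the authors' code) mirrors AdamW with weight decay 0.1, a peak learning rate of 2e-4, linear warmup over the first 1% of steps, and cosine decay to 2e-5.

```python
import math
import torch


def warmup_cosine_lr(step, total_steps, peak_lr=2e-4, final_lr=2e-5, warmup_frac=0.01):
    """Learning rate at a given step: linear warmup, then cosine decay to final_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


model = torch.nn.Linear(512, 512)              # stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)

# 0.5M tokens per batch; e.g. a 16B-token run corresponds to 32,000 steps.
total_steps = 32_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda s: warmup_cosine_lr(s, total_steps) / 2e-4
)

for step in range(3):                          # training-loop skeleton
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr()[0])
```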