Scaling Laws for Fine-Grained Mixture of Experts

Authors: Jan Ludziejewski, Jakub Krajewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run over 100 experiments on the decoder-only Transformer architecture, with each feed-forward component replaced by a Mixture of Experts layer. Those experiments involve training models with sizes ranging from 129M to 3.7B parameters across different training durations, from 16B to 130B tokens. (A minimal sketch of such an MoE-based decoder block appears after this table.)
Researcher Affiliation | Collaboration | 1) IDEAS NCBR, 2) University of Warsaw, 3) Polish Academy of Sciences, 4) TradeLink, 5) Nomagic.
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Additionally, we open-source the code used to produce the results described in this work at github.com/llm-random/llm-random.
Open Datasets | Yes | All of the models considered in this work are decoder-only Transformers trained on the C4 dataset (Raffel et al., 2023). (A hypothetical loading sketch appears after this table.)
Dataset Splits | No | The paper uses the C4 dataset but does not state explicit training, validation, and test splits (e.g., percentages or counts), nor does it point to standard predefined splits with a citation that would allow them to be reproduced.
Hardware Specification | Yes | We can see that the model with G = 8 achieves the best performance in this case. ... measured in terms of wall-clock training time on NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions using the AdamW optimizer and the GPT-2 tokenizer but does not specify version numbers for any software dependencies, such as the programming language or libraries (e.g., Python, PyTorch/TensorFlow, or the HuggingFace Transformers library).
Experiment Setup | Yes | Each batch consists of 0.5M tokens packed into 2048 sequences. Our optimizer is AdamW (Loshchilov & Hutter, 2019), with a weight decay of 0.1. In each training run, we use the maximum learning rate of 2e-4, with linear warmup for 1% of steps and cosine decay to 2e-5. To improve stability, we initialize weights using the truncated normal distribution with reduced scale, as advised in (Fedus et al., 2022). The models are trained using mixed precision; we always keep the attention mechanism and router in high precision. (A configuration sketch for this setup appears after this table.)
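
To make the "Research Type" row concrete, the following is a minimal, self-contained PyTorch sketch of a decoder block whose feed-forward sublayer is replaced by a fine-grained, token-choice MoE layer. The class names, the dense per-token gather, and the default sizes are illustrative assumptions; the authors' actual implementation is in the llm-random repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedMoE(nn.Module):
    """Token-choice top-k MoE. Higher 'granularity' means more, smaller experts
    for roughly the same total parameter count (the paper's G knob)."""

    def __init__(self, d_model, d_ff, n_experts=32, granularity=4):
        super().__init__()
        d_expert = d_ff // granularity        # each fine-grained expert is smaller
        self.top_k = granularity              # activate more experts to keep FLOPs roughly fixed
        self.router = nn.Linear(d_model, n_experts)
        self.w_in = nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_expert))
        self.w_out = nn.Parameter(0.02 * torch.randn(n_experts, d_expert, d_model))

    def forward(self, x):                      # x: (batch, seq, d_model)
        scores = F.softmax(self.router(x.float()), dim=-1)   # router kept in high precision
        weights, idx = scores.topk(self.top_k, dim=-1)        # (batch, seq, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # dense per-token gather; real code batches this
            e = idx[..., k]                    # chosen expert id per token, (batch, seq)
            gate = weights[..., k].unsqueeze(-1).to(x.dtype)
            h = F.relu(torch.einsum("bsd,bsde->bse", x, self.w_in[e]))
            out = out + gate * torch.einsum("bse,bsed->bsd", h, self.w_out[e])
        return out


class MoEDecoderBlock(nn.Module):
    """Pre-norm decoder block with the feed-forward sublayer replaced by the MoE layer above."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.moe = FineGrainedMoE(d_model, d_ff)

    def forward(self, x):                      # x: (batch, seq, d_model)
        seq_len = x.size(1)
        causal = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        return x + self.moe(self.ln2(x))


block = MoEDecoderBlock()
tokens = torch.randn(2, 16, 512)               # (batch, seq, d_model)
print(block(tokens).shape)                     # torch.Size([2, 16, 512])
```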
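The "Open Datasets" and "Software Dependencies" rows reference the public C4 corpus and the GPT-2 tokenizer. One possible way to obtain both with the Hugging Face `datasets` and `transformers` libraries is sketched below; the paper does not state how the data was actually loaded or preprocessed, so the dataset path and streaming mode are assumptions.

```python
from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Stream the English split of C4 rather than downloading the full corpus up front.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # GPT-2 BPE tokenizer, as mentioned in the paper

example = next(iter(c4))                      # each record has "text", "url", "timestamp"
token_ids = tokenizer(example["text"])["input_ids"]
print(len(token_ids), token_ids[:10])
```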
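Finally, the hyperparameters quoted in the "Experiment Setup" row map onto a short optimizer and scheduler configuration. The sketch below (plain PyTorch; the placeholder model, step count, and helper name are assumptions, not the authors' code) mirrors AdamW with weight decay 0.1, a peak learning rate of 2e-4, linear warmup over the first 1% of steps, and cosine decay to 2e-5.

```python
import math
import torch


def warmup_cosine_lr(step, total_steps, peak_lr=2e-4, final_lr=2e-5, warmup_frac=0.01):
    """Learning rate at a given step: linear warmup, then cosine decay to final_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


model = torch.nn.Linear(512, 512)              # stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.1)

# 0.5M tokens per batch; e.g. a 16B-token run corresponds to 32,000 steps.
total_steps = 32_000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda s: warmup_cosine_lr(s, total_steps) / 2e-4
)

for step in range(3):                          # training-loop skeleton
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(step, scheduler.get_last_lr()[0])
```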