Mixture of Tokens: Continuous MoE through Cross-Example Aggregation
Authors: Szymon Antoniak, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Marek Cygan, Sebastian Jaszczur
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The focus of this work is to investigate the efficiency of Mixture of Tokens on autoregressive language modeling. To measure model quality, we pretrain models for a fixed number of tokens and compare final perplexity in accordance with existing MoE literature [28, 27]. In all experiments, the models are trained on the C4 dataset [41] and use the GPT-2 tokenizer. Unless specified otherwise, we use mixed precision, where all heavy computation is done in bfloat16, whereas the optimizer state and weights are kept in full precision. To study the stability of our model, we experiment with training fully in reduced precision. Our main result is a substantial speed-up of MoT models compared to dense Transformer models (Figure 7) and results comparable to sparse MoEs (Figure 6). |
| Researcher Affiliation | Collaboration | Szymon Antoniak, Michał Krutul (IDEAS NCBR, University of Warsaw); Maciej Pióro (IDEAS NCBR, Polish Academy of Sciences); Jakub Krajewski (IDEAS NCBR, University of Warsaw); Jan Ludziejewski (IDEAS NCBR, University of Warsaw); Kamil Ciebiera (IDEAS NCBR, University of Warsaw); Krystian Król (IDEAS NCBR, University of Warsaw); Tomasz Odrzygóźdź (IDEAS NCBR); Marek Cygan (University of Warsaw, Nomagic); Sebastian Jaszczur (IDEAS NCBR, University of Warsaw) |
| Pseudocode | Yes | Algorithm 1 (Mixture of Tokens layer): 1: for each E in experts do 2: weights_E = Softmax(Linear(tokens)) 3: mix = Σ_i token_i · weights_{i,E} 4: output_E = E(mix) 5: for each i do 6: for each E do 7: update_i = Σ_E output_E · weights_{i,E}. A runnable PyTorch sketch of this layer appears after the table. |
| Open Source Code | Yes | The code and configuration files used to produce the results described in this work are available in our public repository at https://github.com/llm-random/llm-random. |
| Open Datasets | Yes | In all experiments, the models are trained on the C4 dataset [41] and use the GPT-2 tokenizer. Footnote 3 links to https://huggingface.co/datasets/c4; the dataset is licensed under ODC-By. |
| Dataset Splits | No | The paper states that models are trained on the C4 dataset and that, for downstream evaluations, 'a single evaluation query is included in a batch of 32, with the remainder of the batch comprised of random sequences from the C4 training dataset, ensuring it remains zero-shot' (a small sketch of this batching appears after the table). However, it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) for the main C4 training experiments, nor does it mention a validation set used during training beyond the general perplexity evaluation. |
| Hardware Specification | Yes | All models were trained on NVIDIA A100 GPUs, with either 40 or 80 GB of RAM. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, GPT-2 tokenizer, and mixed precision (bfloat16), but it does not specify version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used. |
| Experiment Setup | Yes | We conducted all experiments with a batch size of 256 and a context length of 256 for 150K training steps (unless explicitly stated), resulting in a total of 10B training tokens. We used the AdamW optimizer with default hyperparameters. When necessary, we adopted a Fully Sharded Data Parallel approach from PyTorch to parallelize training across multiple machines. Learning rates were tuned separately based on model size and architecture. The optimal learning rate for Transformers was 1e-3 for Medium models and 4e-4 for Base models, while for both MoT and MoE, they were 7e-4 for Medium models and 2e-4 for Base models. A configuration sketch collecting these hyperparameters appears after the table. |
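
To make the Pseudocode row concrete, below is a minimal PyTorch sketch of a Mixture of Tokens layer following Algorithm 1. It is not the authors' implementation (that lives at https://github.com/llm-random/llm-random); the class name `MoTLayer`, the two-layer MLP experts, and the choice to take the controller softmax over the token group are assumptions made for illustration.

```python
# Minimal sketch of a Mixture-of-Tokens layer (Algorithm 1) in plain PyTorch.
# Assumptions: `tokens` arrive already grouped across examples, each expert is
# a two-layer MLP, and the controller softmax runs over the token group.
import torch
import torch.nn as nn


class MoTLayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int, d_expert: int):
        super().__init__()
        # Controller producing one mixing logit per (token, expert) pair.
        self.controller = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_expert),
                    nn.ReLU(),
                    nn.Linear(d_expert, d_model),
                )
                for _ in range(n_experts)
            ]
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (group_size, d_model), one token from each sequence in the group.
        weights = torch.softmax(self.controller(tokens), dim=0)  # (group_size, n_experts)
        update = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            w_e = weights[:, e].unsqueeze(-1)        # (group_size, 1)
            mix = (w_e * tokens).sum(dim=0)          # lines 2-3: weighted mixture of the group
            out = expert(mix)                        # line 4: expert processes the mixture
            update = update + w_e * out              # lines 5-7: redistribute output to tokens
        return update


# Usage: a group of 32 tokens, each taken from a different sequence in the batch.
layer = MoTLayer(d_model=512, n_experts=32, d_expert=2048)
group = torch.randn(32, 512)
print(layer(group).shape)  # torch.Size([32, 512])
```

An optimized implementation would process all experts with a single batched matrix multiplication rather than a Python loop, but the loop here keeps a one-to-one correspondence with the numbered steps of Algorithm 1.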
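
The zero-shot evaluation batching quoted in the Dataset Splits row can be sketched as follows; `build_eval_batch` is a hypothetical helper name, and the sequence representation is left abstract.

```python
# Sketch of the zero-shot downstream-evaluation batching quoted in the
# Dataset Splits row: one evaluation query placed in a batch of 32, with the
# remaining slots filled by random sequences from the C4 training set.
# The helper name and data types are illustrative assumptions.
import random


def build_eval_batch(query_sequence, c4_train_sequences, batch_size: int = 32):
    """Return a batch with the single evaluation query plus random C4 fillers."""
    fillers = random.sample(c4_train_sequences, batch_size - 1)
    return [query_sequence] + fillers
```

The filler sequences are needed only because MoT mixes tokens across the examples in a batch; as they are ordinary C4 training text unrelated to the query, the paper notes the evaluation remains zero-shot.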
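
Finally, the hyperparameters from the Experiment Setup row (plus the precision regime from the Research Type row) are collected below. Only the numbers come from the table; the dictionary layout and the `build_optimizer` helper are illustrative assumptions, not the authors' configuration format.

```python
# Hyperparameters quoted in the Experiment Setup and Research Type rows.
# The dictionary layout and `build_optimizer` helper are illustrative
# assumptions, not the authors' configuration format.
import torch

TRAIN_CONFIG = {
    "batch_size": 256,
    "context_length": 256,
    "train_steps": 150_000,      # roughly 10B training tokens in total
    "optimizer": "AdamW",        # default AdamW hyperparameters
    "precision": "bf16-mixed",   # heavy compute in bfloat16, weights/optimizer state in fp32
    "learning_rate": {
        "transformer": {"medium": 1e-3, "base": 4e-4},
        "mot_and_moe": {"medium": 7e-4, "base": 2e-4},
    },
}


def build_optimizer(model: torch.nn.Module, arch: str, size: str) -> torch.optim.AdamW:
    """Illustrative helper: AdamW with the learning rate reported for (arch, size)."""
    lr = TRAIN_CONFIG["learning_rate"][arch][size]
    return torch.optim.AdamW(model.parameters(), lr=lr)
```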