MoEUT: Mixture-of-Experts Universal Transformers

Authors: Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present our main experimental results on the performance and efficiency of MoEUT on language modeling using the popular C4 dataset [39]. To demonstrate the versatility of our model, we also show our main results on the SlimPajama [40] and peS2o [41] language modeling datasets, and code generation on The Stack [42]. For experimental evidence in support of the benefits of shared layers for compositional generalization, we refer to much previous work (e.g., [10, 15, 14, 16, 38]). Following prior work [27, 31], we measure the compute requirements in terms of the number of multiply-accumulate (MAC) operations needed in the forward pass. (An illustrative MAC-counting sketch follows the table.)
Researcher Affiliation | Academia | Róbert Csordás (1,2), Kazuki Irie (3), Jürgen Schmidhuber (2,4), Christopher Potts (1), Christopher D. Manning (1). Affiliations: (1) Stanford University, Stanford, CA, USA; (2) The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland; (3) Center for Brain Science, Harvard University, Cambridge, MA, USA; (4) AI Initiative, KAUST, Thuwal, Saudi Arabia. Contact: {rcsordas,cgpotts,manning}@stanford.edu, kirie@fas.harvard.edu, juergen@idsia.ch
Pseudocode | No | The paper describes algorithms and architectures using mathematical equations and textual descriptions, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code is public: https://github.com/robertcsordas/moeut
Open Datasets | Yes | We demonstrate their capabilities on the C4, SlimPajama, and peS2o language modeling datasets, as well as on The Stack code generation.
Dataset Splits | Yes | All experiments in this section are performed by calculating statistics on the validation set of C4 for a model with G = 2 (i.e., two layers in the group; see Sec. 2.3).
Hardware Specification | Yes | We measured the training iteration time and memory usage on 8 V100 32GB GPUs. [...] Table 6: Training hardware information for the experiments reported in the paper
Software Dependencies | No | The paper mentions 'PyTorch [57]', 'RoPE positional encodings [43]', 'SentencePiece [59] tokenizer', and 'AdamW optimizer [58]'. However, it does not provide specific version numbers for these software components, which are needed for reproducibility.
Experiment Setup | Yes | All our models are trained in PyTorch [57] with a batch size of 64, a context length of 1024, for 100k iterations, with a learning rate of 0.00025, the AdamW optimizer [58] with default hyperparameters, and a weight decay of 0.01. They are trained on a single node in a data-parallel manner. The learning rate is decayed to 10% of its initial value using cosine decay. We use a gradient clipping of κ and N_warmup linear learning rate warmup steps (see Tab. 3). None of our models uses dropout. For the entropy regularization of the MLP expert selection, we use γ = 0.01, and for SwitchHead attention, δ = 0.001. Expert dropout is not used. All of our models use a SentencePiece [59] tokenizer with 8000 tokens, trained on a subset of the training set for the given dataset. All models are trained with mixed precision. The hyperparameters of the SUT models can be found in Tab. 4. (Hedged sketches of this optimization setup and the tokenizer training follow the table.)
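The Research Type row notes that compute is measured as the number of multiply-accumulate (MAC) operations in the forward pass. The sketch below only illustrates how such a count could be estimated for a generic dense decoder-only Transformer layer; the function names and counting conventions (ignoring embeddings, softmax, normalization, and MoE routing overhead) are simplifying assumptions, not the exact accounting used in the paper or in [27, 31].

```python
# Illustrative MAC estimate for one dense decoder-only Transformer layer.
# All conventions here are simplifying assumptions made for this sketch.

def attention_macs(d_model: int, seq_len: int) -> int:
    # Q, K, V, and output projections: 4 * d_model^2 MACs per token,
    # plus attention scores and value mixing: 2 * seq_len * d_model per token.
    per_token = 4 * d_model * d_model + 2 * seq_len * d_model
    return per_token * seq_len

def mlp_macs(d_model: int, d_ff: int, seq_len: int) -> int:
    # Two dense projections: d_model -> d_ff -> d_model.
    return 2 * d_model * d_ff * seq_len

def layer_macs(d_model: int, d_ff: int, seq_len: int) -> int:
    return attention_macs(d_model, seq_len) + mlp_macs(d_model, d_ff, seq_len)

if __name__ == "__main__":
    # Hypothetical dimensions chosen only for demonstration.
    print(f"{layer_macs(d_model=1024, d_ff=4096, seq_len=1024):,} MACs per layer")
```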
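The Experiment Setup row pins down the optimizer, learning rate schedule, clipping, and precision settings. The PyTorch sketch below shows one way those settings could be wired together; the model, batch tensors, warmup length, and clipping threshold are placeholders (the paper lists κ and N_warmup per configuration in its Tab. 3), and the use of torch.cuda.amp for mixed precision is an assumption rather than the paper's actual implementation (the official code is at https://github.com/robertcsordas/moeut).

```python
# Minimal sketch of the reported optimization setup. The model and data are
# stand-ins; N_WARMUP and KAPPA are placeholders for the per-model values the
# paper reports in its Tab. 3.
import math
import torch

TOTAL_STEPS = 100_000   # 100k iterations
BASE_LR = 0.00025
N_WARMUP = 1_000        # placeholder warmup length
KAPPA = 1.0             # placeholder gradient-clipping threshold

model = torch.nn.Linear(1024, 1024)   # stand-in for the actual MoEUT model
optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.01)

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay from 1.0 down to 0.1 of the base LR.
    if step < N_WARMUP:
        return step / max(1, N_WARMUP)
    progress = (step - N_WARMUP) / max(1, TOTAL_STEPS - N_WARMUP)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # mixed-precision training (assumption)

for step in range(10):                 # a few steps for illustration only
    x = torch.randn(64, 1024)          # stand-in batch of 64 examples; the real
                                       # setup uses token sequences of length 1024
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()  # stand-in for the language-modeling loss
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), KAPPA)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```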
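The same row also mentions an 8000-token SentencePiece tokenizer trained on a subset of the training data. A minimal sketch of that step follows, assuming hypothetical file names and the library's default model type; it is not taken from the paper's code.

```python
# Minimal SentencePiece sketch: train an 8k-vocabulary tokenizer on a text
# subset, then load and apply it. File names are hypothetical.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="c4_train_subset.txt",      # hypothetical subset of the training set
    model_prefix="moeut_tokenizer",   # writes moeut_tokenizer.{model,vocab}
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="moeut_tokenizer.model")
print(sp.encode("Mixture-of-Experts Universal Transformers", out_type=str))
```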