MoEUT: Mixture-of-Experts Universal Transformers
Authors: Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, Christopher D. Manning
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present our main experimental results on the performance and efficiency of MoEUT on language modeling using the popular C4 dataset [39]. To demonstrate the versatility of our model, we also show our main results on the SlimPajama [40] and peS2o [41] language modeling datasets, and code generation on The Stack [42]. For experimental evidence in support of the benefits of shared layers for compositional generalization, we refer to much previous work (e.g., [10, 15, 14, 16, 38]). Following prior work [27, 31], we measure the compute requirements in terms of the number of multiply-accumulate (MAC) operations needed in the forward pass. (See the MAC-counting sketch after the table.) |
| Researcher Affiliation | Academia | Róbert Csordás (1,2), Kazuki Irie (3), Jürgen Schmidhuber (2,4), Christopher Potts (1), Christopher D. Manning (1). (1) Stanford University, Stanford, CA, USA; (2) The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland; (3) Center for Brain Science, Harvard University, Cambridge, MA, USA; (4) AI Initiative, KAUST, Thuwal, Saudi Arabia. {rcsordas,cgpotts,manning}@stanford.edu, kirie@fas.harvard.edu, juergen@idsia.ch |
| Pseudocode | No | The paper describes algorithms and architectures using mathematical equations and textual descriptions, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our code is public: https://github.com/robertcsordas/moeut |
| Open Datasets | Yes | We demonstrate their capabilities on the C4, SlimPajama, and peS2o language modeling datasets, as well as on The Stack code generation. |
| Dataset Splits | Yes | All experiments in this section are performed by calculating statistics on the validation set of C4 for a model with G = 2 (i.e., two layers in the group; see Sec. 2.3). |
| Hardware Specification | Yes | We measured the training iteration time and memory usage on 8 V100 32GB GPUs. [...] Table 6: Training hardware information for the experiments reported in the paper |
| Software Dependencies | No | The paper mentions 'PyTorch [57]', 'RoPE positional encodings [43]', 'SentencePiece [59] tokenizer', and 'AdamW optimizer [58]'. However, it does not provide specific version numbers for these software components, which are needed for exact reproducibility. |
| Experiment Setup | Yes | All our models are trained in PyTorch [57] with a batch size of 64, a context length of 1024, for 100k iterations, a learning rate of 0.00025, the AdamW optimizer [58] with default hyperparameters, and a weight decay of 0.01. They are trained on a single node in a data-parallel manner. The learning rate is decayed to 10% of its initial value using cosine decay. We use a gradient clipping of κ and N_warmup linear learning rate warmup steps (see Tab. 3). None of our models uses dropout. For the entropy regularization of the MLP expert selection, we use γ = 0.01, and for SwitchHead attention, δ = 0.001. Expert dropout is not used. All of our models use a SentencePiece [59] tokenizer with 8000 tokens, trained on a subset of the training set for the given dataset. All models are trained with mixed precision. The hyperparameters of the SUT models can be found in Tab. 4. (See the training-configuration sketch after the table.) |
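
The compute metric quoted in the Research Type row, forward-pass multiply-accumulate (MAC) counts, is not accompanied by a formula in the excerpt. The snippet below is a minimal illustrative sketch of how such a count can be tallied for a single dense Transformer block; the function name, the assumption of standard Q/K/V/output projections, and the example widths are ours, not the paper's, and MoEUT's expert routing would change the per-block totals.

```python
def transformer_block_macs(n_ctx: int, d_model: int, d_ff: int) -> int:
    """Rough forward-pass MAC count for one dense Transformer block.

    Assumes standard multi-head attention with Q/K/V/output projections of
    shape (d_model, d_model) and a two-matrix MLP of hidden width d_ff;
    expert routing and weight sharing are ignored.
    """
    attn_projections = 4 * n_ctx * d_model * d_model  # Q, K, V, output projections
    attn_mixing = 2 * n_ctx * n_ctx * d_model         # QK^T scores and attn @ V
    mlp = 2 * n_ctx * d_model * d_ff                  # up- and down-projection
    return attn_projections + attn_mixing + mlp


# Example: a 1024-token context with d_model = 1024 and d_ff = 4096 (illustrative values).
print(f"{transformer_block_macs(1024, 1024, 4096):,} MACs per block")
```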
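
The training recipe quoted in the Experiment Setup row maps onto standard PyTorch components. The sketch below is a hedged reconstruction under stated assumptions: the model, data, loss, warmup length `n_warmup`, and clipping norm `clip_kappa` are placeholders (the paper defers κ and N_warmup to its Tab. 3), and the MoE-specific regularizers (γ = 0.01, δ = 0.001), the SentencePiece tokenizer, mixed precision, and data parallelism are omitted. It is not the authors' training script; their released code is at https://github.com/robertcsordas/moeut.

```python
import math
import torch

# Placeholder model and synthetic data; the real setup trains MoEUT on
# 1024-token batches of size 64 for 100k iterations.
model = torch.nn.Linear(512, 512)
loader = ((torch.randn(64, 512), torch.randn(64, 512)) for _ in range(10))

n_steps = 100_000   # total iterations (paper)
n_warmup = 4_000    # placeholder: the paper's N_warmup is listed in its Tab. 3
clip_kappa = 1.0    # placeholder: the paper's κ is listed in its Tab. 3

# AdamW with the stated learning rate and weight decay, otherwise default hyperparameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4, weight_decay=0.01)

def lr_scale(step: int) -> float:
    """Linear warmup, then cosine decay from the peak LR down to 10% of it."""
    if step < n_warmup:
        return step / max(1, n_warmup)
    progress = (step - n_warmup) / max(1, n_steps - n_warmup)
    return 0.1 + 0.9 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

for inputs, targets in loader:
    loss = torch.nn.functional.mse_loss(model(inputs), targets)  # stand-in for the LM loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_kappa)  # gradient clipping at κ
    optimizer.step()
    scheduler.step()
```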