Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Authors: Robert Csordas, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that SwitchHead can achieve performance comparable to parameter-matched baselines with just a fraction of the compute and memory budget. |
| Researcher Affiliation | Academia | ¹Stanford University, Stanford, CA, USA; ²AI Initiative, KAUST, Thuwal, Saudi Arabia; ³Center for Brain Science, Harvard University, Cambridge, MA, USA; ⁴The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland |
| Pseudocode | No | The paper includes mathematical equations and schematic representations (e.g., Figure 1) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is public: https://github.com/robertcsordas/switchhead |
| Open Datasets | Yes | We evaluate our method on C4 [20], Enwik8 [21], peS2o [22] and Wikitext 103 [23] |
| Dataset Splits | No | The paper mentions using well-known datasets (C4, Enwik8, peS2o, Wikitext 103) and discusses training details like batch size and learning rate, but does not explicitly state the specific train/validation/test splits (e.g., percentages or sample counts) used for these datasets, nor does it cite predefined splits with specific details for reproducibility. |
| Hardware Specification | Yes | Table 5: Size Model ms/iteration Rel. iter. time RAM/GPU Rel. Mem. #GPUs GPU type... 47M Transformer ... 1 RTX 3090 ... 262M Transformer ... 8 V100. Table 10: Training hardware information for the experiments reported in the paper... G GPU Type ... V100-32GB-LS, RTX 4090, RTX 3090, V100-32GB, P100-16GB, A100-80GB. |
| Software Dependencies | No | The paper mentions using the 'Triton kernel of σ-MoE [17]' and 'Adam optimizer [40]', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train all our models with Adam optimizer [40], with a batch size of 64, a learning rate of 0.00025, and gradient clipping with a maximum norm of κ. ... Detailed hyperparameters are shown in the Tab. 9. |
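
The Experiment Setup row quotes a batch size of 64, a learning rate of 0.00025, and gradient clipping with a maximum norm of κ. As a rough illustration of what clipping to a maximum norm means, the sketch below rescales a set of gradient vectors so that their joint L2 norm does not exceed κ. This is not the authors' code: the function name `clip_by_global_norm`, the constant names, and the choice of a single global norm over all parameters are assumptions, and κ is left as an argument because its value is not given in this summary.

```python
import math

# Values quoted in the Experiment Setup row above.
BATCH_SIZE = 64
LEARNING_RATE = 0.00025


def clip_by_global_norm(grads, kappa):
    """Rescale gradient vectors so their joint L2 norm is at most kappa.

    grads: list of lists of floats, one inner list per parameter tensor
           (hypothetical flat representation, chosen for illustration).
    kappa: maximum allowed norm; its value is not stated in this summary.
    Returns (clipped_grads, norm_before_clipping).
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= kappa:
        # Already within the budget: return an unchanged copy.
        return [list(vec) for vec in grads], total_norm
    scale = kappa / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm
```

With gradients `[[3.0], [4.0]]` the joint norm is 5, so clipping at κ = 1 scales every entry by 1/5, while any κ of 5 or more leaves the gradients untouched.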