SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Authors: Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that SwitchHead can achieve performance comparable to parameter-matched baselines with just a fraction of the compute and memory budget.
Researcher Affiliation | Academia | Stanford University, Stanford, CA, USA; AI Initiative, KAUST, Thuwal, Saudi Arabia; Center for Brain Science, Harvard University, Cambridge, MA, USA; The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland
Pseudocode | No | The paper includes mathematical equations and schematic representations (e.g., Figure 1) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is public: https://github.com/robertcsordas/switchhead
Open Datasets | Yes | We evaluate our method on C4 [20], Enwik8 [21], peS2o [22], and Wikitext 103 [23].
Dataset Splits | No | The paper uses well-known datasets (C4, Enwik8, peS2o, Wikitext 103) and discusses training details such as batch size and learning rate, but it does not explicitly state the train/validation/test splits (e.g., percentages or sample counts), nor does it cite predefined splits in enough detail for reproducibility.
Hardware Specification | Yes | Table 5: Size, Model, ms/iteration, Rel. iter. time, RAM/GPU, Rel. Mem., #GPUs, GPU type ... 47M Transformer ... 1× RTX 3090 ... 262M Transformer ... 8× V100. Table 10: Training hardware information for the experiments reported in the paper ... GPU types: V100-32GB-LS, RTX 4090, RTX 3090, V100-32GB, P100-16GB, A100-80GB.
Software Dependencies | No | The paper mentions using the Triton kernel of σ-MoE [17] and the Adam optimizer [40], but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We train all our models with the Adam optimizer [40], with a batch size of 64, a learning rate of 0.00025, and gradient clipping with a maximum norm of κ. ... Detailed hyperparameters are shown in Tab. 9.
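For concreteness, below is a minimal PyTorch-style sketch of the optimization setup reported above (Adam, batch size 64, learning rate 0.00025, gradient clipping to a maximum norm κ). It is an illustration under stated assumptions, not the authors' training code: `model`, `batch`, and `kappa` are placeholders, and the actual value of κ and the remaining hyperparameters are given in Tab. 9 of the paper.

import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Adam with the learning rate reported in the paper.
    return torch.optim.Adam(model.parameters(), lr=0.00025)

def train_step(model: torch.nn.Module, batch, optimizer: torch.optim.Optimizer, kappa: float) -> float:
    """One optimization step with gradient norm clipping, as described in the paper.

    Assumes `model(batch)` returns a scalar language-modeling loss; this calling
    convention is a placeholder, not the authors' actual interface.
    """
    loss = model(batch)
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm to the maximum value kappa (see Tab. 9).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=kappa)
    optimizer.step()
    return loss.item()

In use, one would iterate over batches of 64 sequences and call `train_step` once per batch; only the optimizer choice, learning rate, batch size, and clipping norm are taken from the paper.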