SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

Authors: Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that SwitchHead can achieve performance comparable to parameter-matched baselines with just a fraction of the compute and memory budget.
Researcher Affiliation | Academia | Stanford University, Stanford, CA, USA; AI Initiative, KAUST, Thuwal, Saudi Arabia; Center for Brain Science, Harvard University, Cambridge, MA, USA; The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland
Pseudocode | No | The paper includes mathematical equations and schematic representations (e.g., Figure 1) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is public: https://github.com/robertcsordas/switchhead
Open Datasets | Yes | We evaluate our method on C4 [20], Enwik8 [21], peS2o [22], and Wikitext 103 [23].
Dataset Splits | No | The paper uses well-known datasets (C4, Enwik8, peS2o, Wikitext 103) and discusses training details such as batch size and learning rate, but it does not explicitly state the train/validation/test splits (e.g., percentages or sample counts), nor does it cite predefined splits in enough detail for reproducibility.
Hardware Specification | Yes | Table 5: Size, Model, ms/iteration, Rel. iter. time, RAM/GPU, Rel. Mem., #GPUs, GPU type ... 47M Transformer ... 1× RTX 3090 ... 262M Transformer ... 8× V100. Table 10: Training hardware information for the experiments reported in the paper ... GPU types: V100-32GB-LS, RTX 4090, RTX 3090, V100-32GB, P100-16GB, A100-80GB.
Software Dependencies | No | The paper mentions using the Triton kernel of σ-MoE [17] and the Adam optimizer [40], but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We train all our models with the Adam optimizer [40], with a batch size of 64, a learning rate of 0.00025, and gradient clipping with a maximum norm of κ. ... Detailed hyperparameters are shown in Tab. 9.
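For concreteness, below is a minimal PyTorch-style sketch of the optimization setup reported above (Adam, batch size 64, learning rate 0.00025, gradient clipping to a maximum norm κ). It is an illustration under stated assumptions, not the authors' training code: `model`, `batch`, and `kappa` are placeholders, and the actual value of κ and the remaining hyperparameters are given in Tab. 9 of the paper.

import torch

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Adam with the learning rate reported in the paper.
    return torch.optim.Adam(model.parameters(), lr=0.00025)

def train_step(model: torch.nn.Module, batch, optimizer: torch.optim.Optimizer, kappa: float) -> float:
    """One optimization step with gradient norm clipping, as described in the paper.

    Assumes `model(batch)` returns a scalar language-modeling loss; this calling
    convention is a placeholder, not the authors' actual interface.
    """
    loss = model(batch)
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm to the maximum value kappa (see Tab. 9).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=kappa)
    optimizer.step()
    return loss.item()

In use, one would iterate over batches of 64 sequences and call `train_step` once per batch; only the optimizer choice, learning rate, batch size, and clipping norm are taken from the paper.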