SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Authors: Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that SwitchHead can achieve performance comparable to parameter-matched baselines with just a fraction of the compute and memory budget. |
| Researcher Affiliation | Academia | (1) Stanford University, Stanford, CA, USA; (2) AI Initiative, KAUST, Thuwal, Saudi Arabia; (3) Center for Brain Science, Harvard University, Cambridge, MA, USA; (4) The Swiss AI Lab IDSIA, USI & SUPSI, Lugano, Switzerland |
| Pseudocode | No | The paper includes mathematical equations and schematic representations (e.g., Figure 1) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is public: https://github.com/robertcsordas/switchhead |
| Open Datasets | Yes | We evaluate our method on C4 [20], Enwik8 [21], peS2o [22] and Wikitext-103 [23] |
| Dataset Splits | No | The paper uses well-known datasets (C4, Enwik8, peS2o, Wikitext-103) and reports training details such as batch size and learning rate, but it does not state the train/validation/test splits used (percentages or sample counts), nor does it cite predefined splits in enough detail for reproducibility. |
| Hardware Specification | Yes | Table 5 reports per-model hardware (model size, ms/iteration, relative iteration time, RAM/GPU, relative memory, #GPUs, GPU type), e.g., the 47M Transformer trained on 1 RTX 3090 and the 262M Transformer on 8 V100s. Table 10 lists the training hardware for all experiments reported in the paper: V100-32GB-LS, RTX 4090, RTX 3090, V100-32GB, P100-16GB, A100-80GB. |
| Software Dependencies | No | The paper mentions using the 'Triton kernel of σ-MoE [17]' and the 'Adam optimizer [40]', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train all our models with the Adam optimizer [40], with a batch size of 64, a learning rate of 0.00025, and gradient clipping with a maximum norm of κ. ... Detailed hyperparameters are shown in Tab. 9. |
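
The quoted setup can be expressed as a short training-step sketch. This is a minimal, hedged illustration of the stated hyperparameters (Adam, batch size 64, learning rate 0.00025, max-norm gradient clipping), not the authors' code: the toy model, the vocabulary size, the sequence length, and the clipping value `kappa` are placeholder assumptions, with the real values deferred to Tab. 9 of the paper and the public repository.

```python
# Minimal sketch of the reported optimizer and clipping setup (not SwitchHead itself).
import torch
from torch import nn

vocab_size, d_model = 1000, 128                      # placeholder sizes (assumed)
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))  # stand-in for the actual model
optimizer = torch.optim.Adam(model.parameters(), lr=0.00025)  # Adam, lr from the paper
loss_fn = nn.CrossEntropyLoss()
kappa = 0.25                                          # assumed clipping norm; see Tab. 9

tokens = torch.randint(0, vocab_size, (64, 256))      # batch size 64 (seq len assumed)
targets = torch.randint(0, vocab_size, (64, 256))

logits = model(tokens)                                # (64, 256, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
# Gradient clipping with maximum norm kappa, as described in the quoted setup.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=kappa)
optimizer.step()
```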