Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization
Authors: James Oldfield, Markos Georgopoulos, Grigorios Chrysos, Christos Tzelepis, Yannis Panagakis, Mihalis Nicolaou, Jiankang Deng, Ioannis Patras
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present both qualitative and quantitative evidence that scaling µMoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class-level, further enabling manual bias correction in CelebA attribute classification. Finally, we show qualitative results demonstrating the expert specialism achieved when pre-training large GPT2 and MLP-Mixer models with parameter-matched µMoE blocks at every layer, maintaining comparable accuracy. Our code is available at: https://github.com/james-oldfield/muMoE. |
| Researcher Affiliation | Academia | James Oldfield^1, Markos Georgopoulos, Grigorios G. Chrysos^2, Christos Tzelepis^3, Yannis Panagakis^4,5, Mihalis A. Nicolaou^6, Jiankang Deng^7, Ioannis Patras^1; ^1 Queen Mary University of London, ^2 University of Wisconsin-Madison, ^3 City University of London, ^4 National and Kapodistrian University of Athens, ^5 Archimedes AI, Athena RC, ^6 The Cyprus Institute, ^7 Imperial College London |
| Pseudocode | Yes | We now present the derivations of the forward passes of the factorized µMoE models (with einsum pseudocode implementations in Appendix B). A hedged einsum sketch of a µMoE-style forward pass is given after this table. |
| Open Source Code | Yes | Our code is available at: https://github.com/james-oldfield/muMoE. |
| Open Datasets | Yes | To isolate the impact of µMoE layers and varying expert counts, we first explore the controlled setting of fine-tuning large foundation models CLIP [61] ViT-B-32 and DINO [62] on ImageNet1k (following the fine-tuning protocol in Ilharco et al. [63, 64]). ... CelebA [74] ... OpenWebText [82] |
| Dataset Splits | Yes | We plot in Figure 3 the average expert polysemanticity p(n) for all experts with non-zero difference vectors, observing a steady drop in its value as N increases from 32 to 1024 total experts. In other words, increasing N leads to individual experts increasingly responsible for a single subtask: classifying all inputs of just one class. As shown in Figure 3 we observe this trend both when µMoEs are used as final classification layers and as penultimate layers (followed by a ReLU activation and linear classification layer), and for multiple pre-trained foundation models. ... Figure 7: Val. accuracy for an S-16 MLP-Mixer when performing truncated SVD on all MLPs' linear layers' weights; model accuracy is closely retained even with half the singular vectors. (A minimal truncated-SVD sketch follows the table.) ... Figure 10: Training loss and validation accuracy for the MLP-Mixer models for 300 epochs. |
| Hardware Specification | Yes | MLP-Mixer: 1e-3 / 4096 / 1e-4 / 10k / 300 epochs / True / 15 / 0 / 0.5 / bf16 / 0 / 4x A100 80GB. NanoGPT: 6e-4 / 24 / 1e-1 / 2k / 100k iter. / False / 0 / 0 / 0 / fp16 / 0 / 4x A100 80GB. CLIP: 3e-5 / 4096 / 1e-1 / 500 / 10 epochs / False / 0 / 0 / 0 / fp16 / 0 / 1x A100 80GB. |
| Software Dependencies | No | The paper mentions software such as PyTorch and einops, and tools for FLOPs counting (fvcore, via the detectron2 documentation), but it does not specify concrete version numbers for any of these dependencies. For example, it states "PyTorch U[-k, k] initialization" but not "PyTorch 1.x.x". |
| Experiment Setup | Yes | Table 7: Experimental configuration and settings for the results reported in the main paper in Section 4.3. This table lists specific hyperparameters such as learning rate, batch size, weight decay, warmup steps, and training duration, and mentions techniques such as stochastic depth, RandAugment, Mixup, and mixed precision. (A hedged config sketch assembled from these values follows the table.) |
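
For context on the Pseudocode row: the paper derives µMoE forward passes and gives einsum pseudocode in Appendix B. The snippet below is only a minimal sketch of an *unfactorized*, densely gated mixture-of-experts layer written with einsum; the class name, shapes, and the softmax gating choice are our assumptions, and the paper's factorized (tensor-decomposed) variants are not reproduced here.

```python
import torch
import torch.nn as nn


class DenseMoESketch(nn.Module):
    """Minimal sketch of a densely gated MoE layer computed with a single einsum.

    Every expert is evaluated and mixed by input-dependent coefficients, in the
    spirit of a muMoE forward pass; this is NOT the paper's factorized CP/TR code.
    """

    def __init__(self, dim_in: int, dim_out: int, n_experts: int):
        super().__init__()
        # One weight matrix per expert, stacked into a 3rd-order tensor.
        self.W = nn.Parameter(torch.randn(n_experts, dim_in, dim_out) * dim_in ** -0.5)
        self.gate = nn.Linear(dim_in, n_experts)  # expert-coefficient head (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim_in)
        a = torch.softmax(self.gate(x), dim=-1)  # (batch, n_experts)
        # Mix expert outputs: y[b, o] = sum_{n, i} a[b, n] * x[b, i] * W[n, i, o]
        return torch.einsum("bn,bi,nio->bo", a, x, self.W)


# Usage example
layer = DenseMoESketch(dim_in=64, dim_out=32, n_experts=8)
y = layer(torch.randn(4, 64))  # -> shape (4, 32)
```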
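The Dataset Splits row quotes Figure 7, where truncated SVD is applied to the MLP weight matrices and accuracy is largely retained with half the singular vectors. A minimal sketch of that operation is below; the weight shape and rank are placeholders, not values from the paper.

```python
import torch


def truncate_linear_weight(weight: torch.Tensor, rank: int) -> torch.Tensor:
    """Return a rank-`rank` approximation of a linear layer's weight matrix
    via truncated SVD (keeping only the top singular vectors)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]


# Example: keep half the singular vectors of a hypothetical MLP weight matrix.
W = torch.randn(512, 256)
W_half = truncate_linear_weight(W, rank=min(W.shape) // 2)
```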
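Finally, as a companion to the Hardware Specification and Experiment Setup rows, the flattened Table 7 values can be gathered into a plain config dictionary. Only the fields named in the quoted text (learning rate, batch size, weight decay, warmup steps, training duration) plus the precision and hardware entries are mapped here; the remaining column names are not recoverable from this excerpt and are omitted.

```python
# Hedged partial reconstruction of Table 7. Values are quoted from the report
# above; any column not confidently identifiable is left out.
EXPERIMENT_CONFIGS = {
    "MLP-Mixer": {
        "learning_rate": 1e-3,
        "batch_size": 4096,
        "weight_decay": 1e-4,
        "warmup_steps": 10_000,
        "duration": "300 epochs",
        "precision": "bf16",
        "hardware": "4x A100 80GB",
    },
    "NanoGPT": {
        "learning_rate": 6e-4,
        "batch_size": 24,
        "weight_decay": 1e-1,
        "warmup_steps": 2_000,
        "duration": "100k iterations",
        "precision": "fp16",
        "hardware": "4x A100 80GB",
    },
    "CLIP": {
        "learning_rate": 3e-5,
        "batch_size": 4096,
        "weight_decay": 1e-1,
        "warmup_steps": 500,
        "duration": "10 epochs",
        "precision": "fp16",
        "hardware": "1x A100 80GB",
    },
}
```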