Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization
Authors: James Oldfield, Markos Georgopoulos, Grigorios Chrysos, Christos Tzelepis, Yannis Panagakis, Mihalis Nicolaou, Jiankang Deng, Ioannis Patras
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present both qualitative and quantitative evidence that scaling µMoE layers when fine-tuning foundation models for vision tasks leads to more specialized experts at the class-level, further enabling manual bias correction in CelebA attribute classification. Finally, we show qualitative results demonstrating the expert specialism achieved when pre-training large GPT2 and MLP-Mixer models with parameter-matched µMoE blocks at every layer, maintaining comparable accuracy. Our code is available at: https://github.com/james-oldfield/muMoE. |
| Researcher Affiliation | Academia | James Oldfield^1, Markos Georgopoulos, Grigorios G. Chrysos^2, Christos Tzelepis^3, Yannis Panagakis^4,5, Mihalis A. Nicolaou^6, Jiankang Deng^7, Ioannis Patras^1; ^1 Queen Mary University of London, ^2 University of Wisconsin-Madison, ^3 City University of London, ^4 National and Kapodistrian University of Athens, ^5 Archimedes AI, Athena RC, ^6 The Cyprus Institute, ^7 Imperial College London |
| Pseudocode | Yes | We now present the derivations of the forward passes of the factorized µMoE models (with einsum pseudocode implementations in Appendix B). A hedged einsum sketch of a µMoE-style forward pass is given after this table. |
| Open Source Code | Yes | Our code is available at: https://github.com/james-oldfield/muMoE. |
| Open Datasets | Yes | To isolate the impact of µMoE layers and varying expert counts, we first explore the controlled setting of fine-tuning large foundation models CLIP [61] ViT-B-32 and DINO [62] on ImageNet1k (following the fine-tuning protocol in Ilharco et al. [63, 64]). ... CelebA [74] ... OpenWebText [82] |
| Dataset Splits | Yes | We plot in Figure 3 the average expert polysemanticity p(n) for all experts with non-zero difference vectors, observing a steady drop in its value as N increases from 32 to 1024 total experts. In other words, increasing N leads to individual experts increasingly responsible for a single subtask: classifying all inputs of just one class. As shown in Figure 3 we observe this trend both when µMoEs are used as final classification layers and as penultimate layers (followed by a ReLU activation and linear classification layer), and for multiple pre-trained foundation models. ... Figure 7: Val. accuracy for an S-16 MLP-Mixer when performing truncated SVD on all MLPs' linear layers' weights; model accuracy is closely retained even with half the singular vectors. (A minimal truncated-SVD sketch follows the table.) ... Figure 10: Training loss and validation accuracy for the MLP-Mixer models for 300 epochs. |
| Hardware Specification | Yes | MLP-Mixer: 1e-3 / 4096 / 1e-4 / 10k / 300 epochs / True / 15 / 0 / 0.5 / bf16 / 0 / 4x A100 80GB. NanoGPT: 6e-4 / 24 / 1e-1 / 2k / 100k iter. / False / 0 / 0 / 0 / fp16 / 0 / 4x A100 80GB. CLIP: 3e-5 / 4096 / 1e-1 / 500 / 10 epochs / False / 0 / 0 / 0 / fp16 / 0 / 1x A100 80GB. |
| Software Dependencies | No | The paper mentions software such as PyTorch and einops, and tools for FLOPs counting (fvcore, via the detectron2 documentation), but it does not specify concrete version numbers for any of these dependencies. For example, it states "PyTorch U[-k, k] initialization" but not "PyTorch 1.x.x". |
| Experiment Setup | Yes | Table 7: Experimental configuration and settings for the results reported in the main paper in Section 4.3. This table lists specific hyperparameters such as learning rate, batch size, weight decay, warmup steps, and training duration, and mentions techniques such as stochastic depth, RandAugment, Mixup, and mixed precision. (A hedged config sketch assembled from these values follows the table.) |
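
For context on the Pseudocode row: the paper derives µMoE forward passes and gives einsum pseudocode in Appendix B. The snippet below is only a minimal sketch of an *unfactorized*, densely gated mixture-of-experts layer written with einsum; the class name, shapes, and the softmax gating choice are our assumptions, and the paper's factorized (tensor-decomposed) variants are not reproduced here.

```python
import torch
import torch.nn as nn


class DenseMoESketch(nn.Module):
    """Minimal sketch of a densely gated MoE layer computed with a single einsum.

    Every expert is evaluated and mixed by input-dependent coefficients, in the
    spirit of a muMoE forward pass; this is NOT the paper's factorized CP/TR code.
    """

    def __init__(self, dim_in: int, dim_out: int, n_experts: int):
        super().__init__()
        # One weight matrix per expert, stacked into a 3rd-order tensor.
        self.W = nn.Parameter(torch.randn(n_experts, dim_in, dim_out) * dim_in ** -0.5)
        self.gate = nn.Linear(dim_in, n_experts)  # expert-coefficient head (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim_in)
        a = torch.softmax(self.gate(x), dim=-1)  # (batch, n_experts)
        # Mix expert outputs: y[b, o] = sum_{n, i} a[b, n] * x[b, i] * W[n, i, o]
        return torch.einsum("bn,bi,nio->bo", a, x, self.W)


# Usage example
layer = DenseMoESketch(dim_in=64, dim_out=32, n_experts=8)
y = layer(torch.randn(4, 64))  # -> shape (4, 32)
```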
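The Dataset Splits row quotes Figure 7, where truncated SVD is applied to the MLP weight matrices and accuracy is largely retained with half the singular vectors. A minimal sketch of that operation is below; the weight shape and rank are placeholders, not values from the paper.

```python
import torch


def truncate_linear_weight(weight: torch.Tensor, rank: int) -> torch.Tensor:
    """Return a rank-`rank` approximation of a linear layer's weight matrix
    via truncated SVD (keeping only the top singular vectors)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]


# Example: keep half the singular vectors of a hypothetical MLP weight matrix.
W = torch.randn(512, 256)
W_half = truncate_linear_weight(W, rank=min(W.shape) // 2)
```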
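Finally, as a companion to the Hardware Specification and Experiment Setup rows, the flattened Table 7 values can be gathered into a plain config dictionary. Only the fields named in the quoted text (learning rate, batch size, weight decay, warmup steps, training duration) plus the precision and hardware entries are mapped here; the remaining column names are not recoverable from this excerpt and are omitted.

```python
# Hedged partial reconstruction of Table 7. Values are quoted from the report
# above; any column not confidently identifiable is left out.
EXPERIMENT_CONFIGS = {
    "MLP-Mixer": {
        "learning_rate": 1e-3,
        "batch_size": 4096,
        "weight_decay": 1e-4,
        "warmup_steps": 10_000,
        "duration": "300 epochs",
        "precision": "bf16",
        "hardware": "4x A100 80GB",
    },
    "NanoGPT": {
        "learning_rate": 6e-4,
        "batch_size": 24,
        "weight_decay": 1e-1,
        "warmup_steps": 2_000,
        "duration": "100k iterations",
        "precision": "fp16",
        "hardware": "4x A100 80GB",
    },
    "CLIP": {
        "learning_rate": 3e-5,
        "batch_size": 4096,
        "weight_decay": 1e-1,
        "warmup_steps": 500,
        "duration": "10 epochs",
        "precision": "fp16",
        "hardware": "1x A100 80GB",
    },
}
```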