Scaling Vision with Sparse Mixture of Experts
Authors: Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet. |
| Researcher Affiliation | Industry | Carlos Riquelme, Joan Puigcerver*, Basil Mustafa*, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby (all Google Brain) |
| Pseudocode | Yes | The algorithm is detailed in Algorithm 1 of Appendix C. We summarize BPR in Algorithm 2, in Appendix C. |
| Open Source Code | Yes | Mixture of experts code and models available at http://github.com/google-research/vmoe. |
| Open Datasets | Yes | We pre-train our models on JFT-300M [57], a semi-automatically noisy-labeled dataset. We also fine-tuned the pre-trained models on the full training set (ca. 1M images). We report performance in a similar regime for four other datasets in Appendix B.5. |
| Dataset Splits | Yes | It has 305M training and 50 000 validation images, organised in a hierarchy of 18 291 classes (average 1.89 labels per image). Our few-shot experiments on ImageNet (i.e. ILSVRC2012) use only 1, 5, or 10 shots per class to adapt the upstream model, evaluating the resulting model on the validation set. |
| Hardware Specification | Yes | Training this model required 16.8k TPUv3-core-days. |
| Software Dependencies | No | The paper describes the model architecture and components (e.g., ViT, MoE, MLP, GeLU activation), but does not specify software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch, CUDA versions). |
| Experiment Setup | Yes | We use by default k = 2 (see Figure 10 in Appendix B for the exploration of different values of k), while we found the total number of experts E = 32 to be the sweet spot in our setting. During upstream training, we set C = 1.05 by default to give a small amount of slack without increasing the cost noticeably. We follow the setup of [20], except that we apply a dropout rate of 0.1 on the expert MLPs (as done in [22]), and we halve the number of fine-tuning steps for all datasets other than ImageNet. |
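
The experiment-setup row above fixes k = 2 selected experts per token, E = 32 experts per MoE layer, and a buffer-capacity ratio C = 1.05, so each expert processes at most roughly C·k·N/E of the N tokens in a batch. The JAX sketch below illustrates top-k routing under such a capacity limit; it is a minimal illustration under these assumptions, not the implementation in the linked vmoe repository, and all function and variable names are illustrative.

```python
import math

import jax
import jax.numpy as jnp


def topk_routing(router_logits, k=2, capacity_ratio=1.05):
    """Assign each token to its top-k experts, dropping assignments that
    exceed an expert's buffer capacity (illustrative, not the paper's code)."""
    num_tokens, num_experts = router_logits.shape
    # Buffer capacity per expert: ceil(C * k * N / E).
    capacity = int(math.ceil(capacity_ratio * k * num_tokens / num_experts))

    gates = jax.nn.softmax(router_logits, axis=-1)         # (N, E) routing weights
    topk_gates, topk_experts = jax.lax.top_k(gates, k)     # both (N, k)

    # Flatten (token, choice) assignments in batch order: earlier tokens
    # claim buffer slots first.
    flat_experts = topk_experts.reshape(-1)                # (N * k,)
    one_hot = jax.nn.one_hot(flat_experts, num_experts)    # (N * k, E)
    # 0-based position of each assignment inside its expert's buffer.
    position = jnp.sum((jnp.cumsum(one_hot, axis=0) - 1.0) * one_hot, axis=-1)
    kept = (position < capacity).reshape(num_tokens, k)    # (N, k) boolean mask

    # Dropped assignments contribute nothing; the residual connection around
    # the MoE layer carries those tokens unchanged.
    return topk_experts, topk_gates * kept, kept
```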
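
The abstract and pseudocode rows also mention Batch Prioritized Routing (Algorithm 2 in Appendix C of the paper), which fills expert buffers in order of routing priority so that lowering C at inference time discards mostly low-priority tokens. Continuing the sketch above (same imports and shapes), here is a simplified variant that orders individual token-expert assignments by their routing weight; the paper instead scores and sorts whole tokens, so treat this only as an approximation of the idea.

```python
def batch_prioritized_routing(router_logits, k=2, capacity_ratio=1.0):
    """Simplified batch-prioritized variant: buffer slots are claimed in order
    of decreasing routing weight instead of batch order, so lowering
    capacity_ratio drops mostly low-weight assignments."""
    num_tokens, num_experts = router_logits.shape
    capacity = int(math.ceil(capacity_ratio * k * num_tokens / num_experts))

    gates = jax.nn.softmax(router_logits, axis=-1)
    topk_gates, topk_experts = jax.lax.top_k(gates, k)

    flat_gates = topk_gates.reshape(-1)
    flat_experts = topk_experts.reshape(-1)
    order = jnp.argsort(-flat_gates)                       # highest weight first
    one_hot = jax.nn.one_hot(flat_experts[order], num_experts)
    position = jnp.sum((jnp.cumsum(one_hot, axis=0) - 1.0) * one_hot, axis=-1)
    kept_sorted = position < capacity
    # Scatter the keep/drop decisions back to the original (token, choice) order.
    kept = jnp.zeros_like(kept_sorted).at[order].set(kept_sorted)
    kept = kept.reshape(num_tokens, k)
    return topk_experts, topk_gates * kept, kept
```

Calling this with a capacity_ratio well below 1.0 at inference time is, in rough terms, how the paper trades off performance and compute smoothly at test time.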