Scaling Vision with Sparse Mixture of Experts
Authors: Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet. |
| Researcher Affiliation | Industry | Carlos Riquelme, Joan Puigcerver*, Basil Mustafa*, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby (all Google Brain) |
| Pseudocode | Yes | The algorithm is detailed in Algorithm 1 of Appendix C. We summarize BPR in Algorithm 2, in Appendix C. |
| Open Source Code | Yes | Mixture of experts code and models available at http://github.com/google-research/vmoe. |
| Open Datasets | Yes | We pre-train our models on JFT-300M [57], a semi-automatically noisy-labeled dataset. We also fine-tuned the pre-trained models on the full training set (ca. 1M images). We report performance in a similar regime for four other datasets in Appendix B.5. |
| Dataset Splits | Yes | It has 305M training and 50 000 validation images, organised in a hierarchy of 18 291 classes (average 1.89 labels per image). Our few-shot experiments on ImageNet (i.e. ILSVRC2012) use only 1, 5, or 10 shots per class to adapt the upstream model, evaluating the resulting model on the validation set. |
| Hardware Specification | Yes | Training this model required 16.8k TPUv3-core-days. |
| Software Dependencies | No | The paper describes the model architecture and components (e.g., ViT, MoE, MLP, GeLU activation), but does not specify software dependencies with version numbers (e.g., Python, TensorFlow/PyTorch, CUDA versions). |
| Experiment Setup | Yes | We use by default k = 2 (see Figure 10 in Appendix B for the exploration of different values of k), while we found the total number of experts E = 32 to be the sweet spot in our setting. During upstream training, we set C = 1.05 by default to give a small amount of slack without increasing the cost noticeably. We follow the setup of [20], except that we apply a dropout rate of 0.1 on the expert MLPs (as done in [22]), and we halve the number of fine-tuning steps for all datasets other than ImageNet. |
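
The experiment-setup row above fixes k = 2 selected experts per token, E = 32 experts per MoE layer, and a buffer-capacity ratio C = 1.05, so each expert processes at most roughly C·k·N/E of the N tokens in a batch. The JAX sketch below illustrates top-k routing under such a capacity limit; it is a minimal illustration under these assumptions, not the implementation in the linked vmoe repository, and all function and variable names are illustrative.

```python
import math

import jax
import jax.numpy as jnp


def topk_routing(router_logits, k=2, capacity_ratio=1.05):
    """Assign each token to its top-k experts, dropping assignments that
    exceed an expert's buffer capacity (illustrative, not the paper's code)."""
    num_tokens, num_experts = router_logits.shape
    # Buffer capacity per expert: ceil(C * k * N / E).
    capacity = int(math.ceil(capacity_ratio * k * num_tokens / num_experts))

    gates = jax.nn.softmax(router_logits, axis=-1)         # (N, E) routing weights
    topk_gates, topk_experts = jax.lax.top_k(gates, k)     # both (N, k)

    # Flatten (token, choice) assignments in batch order: earlier tokens
    # claim buffer slots first.
    flat_experts = topk_experts.reshape(-1)                # (N * k,)
    one_hot = jax.nn.one_hot(flat_experts, num_experts)    # (N * k, E)
    # 0-based position of each assignment inside its expert's buffer.
    position = jnp.sum((jnp.cumsum(one_hot, axis=0) - 1.0) * one_hot, axis=-1)
    kept = (position < capacity).reshape(num_tokens, k)    # (N, k) boolean mask

    # Dropped assignments contribute nothing; the residual connection around
    # the MoE layer carries those tokens unchanged.
    return topk_experts, topk_gates * kept, kept
```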
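
The abstract and pseudocode rows also mention Batch Prioritized Routing (Algorithm 2 in Appendix C of the paper), which fills expert buffers in order of routing priority so that lowering C at inference time discards mostly low-priority tokens. Continuing the sketch above (same imports and shapes), here is a simplified variant that orders individual token-expert assignments by their routing weight; the paper instead scores and sorts whole tokens, so treat this only as an approximation of the idea.

```python
def batch_prioritized_routing(router_logits, k=2, capacity_ratio=1.0):
    """Simplified batch-prioritized variant: buffer slots are claimed in order
    of decreasing routing weight instead of batch order, so lowering
    capacity_ratio drops mostly low-weight assignments."""
    num_tokens, num_experts = router_logits.shape
    capacity = int(math.ceil(capacity_ratio * k * num_tokens / num_experts))

    gates = jax.nn.softmax(router_logits, axis=-1)
    topk_gates, topk_experts = jax.lax.top_k(gates, k)

    flat_gates = topk_gates.reshape(-1)
    flat_experts = topk_experts.reshape(-1)
    order = jnp.argsort(-flat_gates)                       # highest weight first
    one_hot = jax.nn.one_hot(flat_experts[order], num_experts)
    position = jnp.sum((jnp.cumsum(one_hot, axis=0) - 1.0) * one_hot, axis=-1)
    kept_sorted = position < capacity
    # Scatter the keep/drop decisions back to the original (token, choice) order.
    kept = jnp.zeros_like(kept_sorted).at[order].set(kept_sorted)
    kept = kept.reshape(num_tokens, k)
    return topk_experts, topk_gates * kept, kept
```

Calling this with a capacity_ratio well below 1.0 at inference time is, in rough terms, how the paper trades off performance and compute smoothly at test time.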