Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Authors: Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, Neil Houlsby
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate remarkable performance improvement over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6% zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled to H/14 (with additional data) it achieves 84.1%, comparable to state-of-the-art methods which use larger custom per-modality backbones and pre-training schemes. We analyse the quantitative and qualitative behavior of LIMoE, and demonstrate phenomena such as differing treatment of the modalities and the organic emergence of modality-specific experts. |
| Researcher Affiliation | Industry | Basil Mustafa , Carlos Riquelme*, Joan Puigcerver*, Rodolphe Jenatton, Neil Houlsby Google Brain {basilm, rikel, jpuigcerver, rjenatton, neilhoulsby}@google.com |
| Pseudocode | No | The paper describes algorithms and formulations (e.g., equation 1 for contrastive training objective, equation 2 for entropy losses), but it does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | Training data. By default, all models are trained on paired image-text data used in [16], consisting of 3.6B images and alt-texts scraped from the web. For the large LIMoE-H/14 experiment, we also co-train with JFT-4B [17]. |
| Dataset Splits | No | The paper mentions 'Validation accuracy' in Table 3, but it does not provide specific details on how training, validation, and test splits were performed for the datasets used to reproduce the experiments (e.g., percentages, counts, or citations to predefined splits for the data itself). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We train a range of LIMoE models at batch size 16k for 781k steps. ... In particular, we train a 32-layer LIMoE-H/14 ... It was trained at a batch size of 21k ... We train B/16 models at batch size 8096 for 100,000 steps. |
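The Pseudocode row notes that the paper's equation 1 is a contrastive training objective but provides no algorithm block. For context, a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss is given below; the function names, NumPy implementation, and temperature value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    A sketch of the CLIP-style objective the paper's equation 1 refers to;
    the temperature and all names here are illustrative.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    labels = np.arange(len(logits))                # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        # Numerically stable log-softmax followed by mean negative log-likelihood.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Perfectly aligned image/text embedding pairs drive this loss toward zero, while mismatched pairs raise it, which is the behaviour the zero-shot ImageNet evaluation relies on.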