Robustness in Multimodal Learning under Train-Test Modality Mismatch
Authors: Brandon Mckinzie, Vaishaal Shankar, Joseph Yitan Cheng, Yinfei Yang, Jonathon Shlens, Alexander T Toshev
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods. Further, we identify robustness shortcomings of these approaches and propose two intervention techniques leading to 1.5x-4x robustness improvements on three datasets, AudioSet, Kinetics-400 and ImageNet-Captions. |
| Researcher Affiliation | Industry | 1Apple ML Research 2Work done while at Apple 3Apple. Correspondence to: Alexander Toshev <toshev@apple.com>. |
| Pseudocode | No | The paper describes algorithms and models in text and diagrams (e.g., Figure 3 for MASD) but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the release of open-source code for the described methodology. |
| Open Datasets | Yes | We focus our experiments on representation learning with the AudioSet dataset (Gemmeke et al., 2017)... Additionally, we explore the generality of our results on Kinetics-400 (Kay et al., 2017) and ImageNet-Captions (Fang et al., 2022a). |
| Dataset Splits | Yes | AudioSet consists of an unbalanced training set of 1,743,790 examples, used as unlabeled pretraining data, plus training and evaluation sets of 18,649 and 17,065 examples, respectively, used for the downstream task. |
| Hardware Specification | No | The paper mentions training on specific datasets and using certain models (e.g., ViT-B/16 architecture) but does not specify any hardware details such as GPU models, CPU types, or cloud computing instances used for the experiments. |
| Software Dependencies | No | The paper mentions using specific optimizers (AdamW) and models (CLIP, VATT, MAE) but does not provide specific version numbers for any software dependencies like programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 2. Training hyperparameters used for pretraining, linear probing, and finetuning (Contrastive / MAE): global batch 1024/1024 (pretraining), 256/128 (linear probing), 128/64 (finetuning); learning rate 8e-4/8e-4, 1e-2/1e-2, 1e-4/1e-4; LR warmup 1000/2000, 200/200, 1000/2000; epochs 32/256, 360/360, 30/60; optimizer AdamW throughout. |
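The reported hyperparameters can be organized by training stage and method. The following is an illustrative sketch only, assuming a nested-dict layout; the key names and structure are not from the paper, but the numeric values mirror its Table 2.

```python
# Hypothetical config dict mirroring the paper's Table 2 hyperparameters.
# Layout and key names are assumptions; values are as reported.
HYPERPARAMS = {
    "pretraining": {
        "contrastive": {"global_batch": 1024, "learning_rate": 8e-4,
                        "lr_warmup": 1000, "epochs": 32, "optimizer": "AdamW"},
        "mae": {"global_batch": 1024, "learning_rate": 8e-4,
                "lr_warmup": 2000, "epochs": 256, "optimizer": "AdamW"},
    },
    "linear_probing": {
        "contrastive": {"global_batch": 256, "learning_rate": 1e-2,
                        "lr_warmup": 200, "epochs": 360, "optimizer": "AdamW"},
        "mae": {"global_batch": 128, "learning_rate": 1e-2,
                "lr_warmup": 200, "epochs": 360, "optimizer": "AdamW"},
    },
    "finetuning": {
        "contrastive": {"global_batch": 128, "learning_rate": 1e-4,
                        "lr_warmup": 1000, "epochs": 30, "optimizer": "AdamW"},
        "mae": {"global_batch": 64, "learning_rate": 1e-4,
                "lr_warmup": 2000, "epochs": 60, "optimizer": "AdamW"},
    },
}

if __name__ == "__main__":
    # Print each stage/method configuration for quick inspection.
    for stage, methods in HYPERPARAMS.items():
        for method, cfg in methods.items():
            print(f"{stage:14s} {method:12s} {cfg}")
```

A structure like this makes the stage/method pairs explicit, which the flattened one-line table obscures.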