Module-wise Adaptive Distillation for Multimodality Foundation Models
Authors: Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, Tianyi Zhou
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model [48] as the teacher model. |
| Researcher Affiliation | Collaboration | Chen Liang Georgia Tech cliang73@gatech.edu Jiahui Yu Google Research jiahuiyu@google.com Ming-Hsuan Yang UC Merced, Google Research minghsuan@google.com Matthew Brown Google Research mtbr@google.com Yin Cui NVIDIA Research richardaecn@gmail.com Tuo Zhao Georgia Tech tourzhao@gatech.edu Boqing Gong Google Research bgong@google.com Tianyi Zhou University of Maryland, College Park tianyi@umd.edu |
| Pseudocode | Yes | Algorithm 1 OPTIMA: Module Adaptive Distillation |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We conduct task-specific distillation on three multimodal understanding tasks: visual question answering (VQA, [14]), visual entailment (SNLI-VE, [47]), and visual reasoning (NLVR2, [37]). We further train and evaluate the model using the Microsoft COCO Caption dataset [6] and the Karpathy-test split, respectively. |
| Dataset Splits | Yes | For the VQA task, we conduct downstream fine-tuning and testing on the VQA 2.0 dataset [14], which consists of 83k images and 444k questions for training, and 41k images and 214k questions for validation. For the image captioning task on COCO, we use [6] for training and testing. It contains 113k images for training, 5k images for validation, and 5k images for testing. |
| Hardware Specification | Yes | We also extend our thanks to the TPU team for providing abundant computational infrastructure and resources. |
| Software Dependencies | No | The paper mentions software components like "Adafactor with decoupled weight decay" (an optimizer) and "sentence-piece model" but does not specify version numbers for these or other key software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For all tasks, we train the student for T = 100k steps. We use Adafactor with decoupled weight decay [34] as the optimizer with β = (0.9, 0.999) and a learning rate of 1×10⁻³ with a linear decay schedule. We set α1 = 0, α2 = 1, and α3 = 1×10⁻² for all tasks. For OPTIMA, we set γ = 0.98, T0 = 10, P = 100, and an interval length of T/P = 1k steps. Full details are deferred to Appendix A.4. |
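The quoted setup can be sketched as a small configuration plus a learning-rate function. This is a minimal illustration, not the authors' code: the paper states only a peak learning rate of 1×10⁻³ with a linear decay schedule over T = 100k steps; decaying to exactly zero with no warmup is an assumption here.

```python
# Hedged sketch of the reported training hyperparameters.
# Decay-to-zero and the absence of warmup are assumptions;
# the paper only specifies "a linear decay schedule".

TOTAL_STEPS = 100_000        # T = 100k training steps
PEAK_LR = 1e-3               # reported peak learning rate
P = 100                      # number of OPTIMA update intervals
INTERVAL = TOTAL_STEPS // P  # T/P = 1k steps per interval

def linear_decay_lr(step: int) -> float:
    """Learning rate at a given step under linear decay from PEAK_LR to 0."""
    frac = min(step, TOTAL_STEPS) / TOTAL_STEPS
    return PEAK_LR * (1.0 - frac)
```

For example, `linear_decay_lr(0)` returns the peak rate 1e-3 and `linear_decay_lr(50_000)` returns half of it, consistent with T/P = 100k/100 = 1k steps per OPTIMA interval.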