Module-wise Adaptive Distillation for Multimodality Foundation Models

Authors: Chen Liang, Jiahui Yu, Ming-Hsuan Yang, Matthew Brown, Yin Cui, Tuo Zhao, Boqing Gong, Tianyi Zhou

NeurIPS 2023

Reproducibility Variable — Result — LLM Response
Research Type — Experimental. "We evaluate the effectiveness of OPTIMA through distillation experiments on various multimodal understanding and image captioning tasks, using the CoCa-Large model [48] as the teacher model."
Researcher Affiliation — Collaboration. Chen Liang (Georgia Tech, cliang73@gatech.edu); Jiahui Yu (Google Research, jiahuiyu@google.com); Ming-Hsuan Yang (UC Merced & Google Research, minghsuan@google.com); Matthew Brown (Google Research, mtbr@google.com); Yin Cui (NVIDIA Research, richardaecn@gmail.com); Tuo Zhao (Georgia Tech, tourzhao@gatech.edu); Boqing Gong (Google Research, bgong@google.com); Tianyi Zhou (University of Maryland, College Park, tianyi@umd.edu)
Pseudocode — Yes. Algorithm 1, "OPTIMA: Module Adaptive Distillation".
Open Source Code — No. The paper does not provide any explicit statement about releasing source code for the methodology, nor does it provide a direct link to a code repository.
Open Datasets — Yes. "We conduct task-specific distillation on three multimodal understanding tasks: visual question answering (VQA, [14]), visual entailment (SNLI-VE, [47]), and visual reasoning (NLVR2, [37]). We further train and evaluate the model using the Microsoft COCO Caption dataset [6] and the Karpathy-test split, respectively."
Dataset Splits — Yes. For the VQA task, the authors conduct downstream fine-tuning and testing on the VQA 2.0 dataset [14], which consists of 83k images and 444k questions for training, and 41k images and 214k questions for validation. For the image captioning task on COCO, they use [6] for training and testing; it contains 113k images for training, 5k images for validation, and 5k images for testing.
Hardware Specification — Yes. "We also extend our thanks to the TPU team for providing abundant computational infrastructure and resources."
Software Dependencies — No. The paper mentions software components such as "Adafactor with decoupled weight decay" (an optimizer) and a sentence-piece model, but does not specify version numbers for these or for other key software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup — Yes. For all tasks, the student is trained for T = 100k steps, using Adafactor with decoupled weight decay [34] as the optimizer with β = (0.9, 0.999) and a learning rate of 1 × 10^-3 with a linear decay schedule. The authors set α1 = 0, α2 = 1, and α3 = 1 × 10^-2 for all tasks. For OPTIMA, they set γ = 0.98, T0 = 10, P = 100, and T′ = T/P = 1k. Full details are deferred to Appendix A.4.
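The reported schedule can be sketched as a few lines of Python. This is a minimal illustration, not the authors' implementation: `linear_decay_lr` is a hypothetical helper name, and decaying exactly to zero (with no warmup) is an assumption; only the base learning rate, total steps, and the T/P phase arithmetic come from the reported setup.

```python
def linear_decay_lr(step: int, base_lr: float = 1e-3, total_steps: int = 100_000) -> float:
    """Linearly decay the learning rate from base_lr at step 0 toward zero.

    base_lr = 1e-3 and total_steps = 100k match the reported setup; decaying
    all the way to zero is an assumption about the schedule's endpoint.
    """
    frac = max(0.0, 1.0 - step / total_steps)
    return base_lr * frac

# Phase length implied by the reported hyperparameters: T = 100k total steps
# split into P = 100 phases gives T/P = 1k steps per phase.
T, P = 100_000, 100
phase_len = T // P  # 1_000
```

The same schedule could equally be expressed with a library helper (e.g., a linear schedule in an optimizer framework); the plain function above just makes the arithmetic explicit.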