No Head Left Behind – Multi-Head Alignment Distillation for Transformers

Authors: Tianyang Zhao, Kunwar Yashraj Singh, Srikar Appalaraju, Peng Tang, Vijay Mahadevan, R. Manmatha, Ying Nian Wu

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment on distilling VL-T5 and BLIP, and apply AMAD loss on their T5, BERT, and ViT sub-modules. We show, under the vision-language setting, that AMAD outperforms conventional distillation methods on the VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned by ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training for BLIP models.
Researcher Affiliation | Collaboration | Tianyang Zhao^{1,2}*, Kunwar Yashraj Singh^1, Srikar Appalaraju^1, Peng Tang^1, Vijay Mahadevan^1, R. Manmatha^1, Ying Nian Wu^{1,2}; ^1 AWS AI Labs, ^2 University of California, Los Angeles; tyzhao@ucla.edu, {sinkunwa, srikara, tangpen, vmahad, manmatha, wunyin}@amazon.com
Pseudocode | No | The paper provides mathematical formulations and references PyTorch code in the appendix, but it does not include a distinct pseudocode block or algorithm description.
Open Source Code | Yes | In the Appendix, we provide... PyTorch code for the proposed AMAD loss in Section F. Please kindly refer to the Appendix via the following link: https://www.amazon.science/publications/no-head-left-behind-multi-head-alignment-distillation-for-transformers
Open Datasets | Yes | After uni-modal pre-training its T5 and Faster R-CNN (Ren et al. 2015) sub-modules, it is then VL pre-trained on MS COCO (Lin et al. 2014; Chen et al. 2015), Visual Genome (Krishna et al. 2016), VQA-2.0 (Goyal et al. 2019), GQA (Hudson and Manning 2019), and Visual7W (Zhu et al. 2016).
Dataset Splits | Yes | We evaluate image captioning performance on the MS COCO dataset (Chen et al. 2015). As in (Cho et al. 2021; Fang et al. 2021; Li et al. 2022b), we use the Karpathy split (Karpathy and Fei-Fei 2015), which re-splits train2014 and val2014 images (Lin et al. 2014) into 113K / 5K / 5K for train / validation / test.
Hardware Specification | No | The paper does not explicitly specify hardware details such as GPU models, CPU types, or cloud instance specifications used for running the experiments in the main text.
Software Dependencies | No | The paper mentions using PyTorch for batched tensor computations, but it does not specify a version number for PyTorch or any other software dependencies.
Experiment Setup | Yes | The overall training objective for the Student, L_TOTAL, is a weighted sum of the classification distillation loss L_KD and the proposed loss L_AMAD: L_TOTAL = L_KD + α · L_AMAD (Eq. 15). We apply L_AMAD to distill the self-/cross-attention maps of the last (Wang et al. 2020b; Fang et al. 2021) layers of each stream. α is tuned so that L_KD and L_AMAD scale similarly. We do not add ground-truth loss (Beyer et al. 2022). ... We report our implementation details in Appendix.
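For orientation, the sketch below shows how a combined objective of this form (L_TOTAL = L_KD + α · L_AMAD, Eq. 15) could be wired up in PyTorch. It is a minimal illustration under assumptions, not the authors' released code: kd_loss is a standard temperature-scaled KL distillation term, and amad_like_loss stands in for AMAD with an assumed cosine-similarity soft alignment between teacher and student attention heads; the exact AMAD formulation and the official PyTorch implementation are given in the paper and its Appendix (Section F). All function and argument names here are hypothetical.

```python
# Illustrative sketch (not the authors' code) of L_TOTAL = L_KD + alpha * L_AMAD.
# The soft head alignment (cosine similarity + softmax) is an assumption made
# for illustration; see the paper's Appendix Section F for the actual AMAD loss.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Standard classification distillation: KL between softened distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


def amad_like_loss(student_attn, teacher_attn):
    """Assumed soft alignment of attention maps so every teacher head contributes.

    student_attn: (B, Hs, L, L) attention maps from one student layer
    teacher_attn: (B, Ht, L, L) attention maps from the matching teacher layer
    Each teacher head is compared against a convex combination of student heads,
    weighted by cosine similarity between flattened attention maps (assumption).
    """
    B, Hs, L, _ = student_attn.shape
    Ht = teacher_attn.shape[1]

    s = student_attn.reshape(B, Hs, -1)  # (B, Hs, L*L)
    t = teacher_attn.reshape(B, Ht, -1)  # (B, Ht, L*L)

    # Pairwise teacher-to-student head similarities and soft alignment weights.
    sim = F.cosine_similarity(t.unsqueeze(2), s.unsqueeze(1), dim=-1)  # (B, Ht, Hs)
    align = sim.softmax(dim=-1)

    # Mixture of student heads aligned to each teacher head.
    s_aligned = torch.einsum("bts,bsd->btd", align, s)  # (B, Ht, L*L)
    return F.mse_loss(s_aligned, t)


def total_loss(student_logits, teacher_logits, student_attn, teacher_attn, alpha=1.0):
    # Eq. 15: alpha is tuned so the two terms have a similar scale;
    # no ground-truth loss is added, matching the setup described above.
    return kd_loss(student_logits, teacher_logits) + alpha * amad_like_loss(
        student_attn, teacher_attn
    )
```

In line with the quoted setup, such a loss would be applied to the self-/cross-attention maps of the last layer of each stream, with the classification distillation term computed on the task logits.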