No Head Left Behind – Multi-Head Alignment Distillation for Transformers
Authors: Tianyang Zhao, Kunwar Yashraj Singh, Srikar Appalaraju, Peng Tang, Vijay Mahadevan, R. Manmatha, Ying Nian Wu
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on distilling VL-T5 and BLIP, and apply AMAD loss on their T5, BERT, and ViT sub-modules. We show, under vision-language setting, that AMAD outperforms conventional distillation methods on VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned by ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training for BLIP models. |
| Researcher Affiliation | Collaboration | Tianyang Zhao1,2*, Kunwar Yashraj Singh1, Srikar Appalaraju1, Peng Tang1, Vijay Mahadevan1, R. Manmatha1, Ying Nian Wu1,2 1AWS AI Labs, 2University of California, Los Angeles tyzhao@ucla.edu, {sinkunwa, srikara, tangpen, vmahad, manmatha, wunyin}@amazon.com |
| Pseudocode | No | The paper provides mathematical formulations and references PyTorch code in the appendix, but it does not include a distinct pseudocode block or algorithm description. |
| Open Source Code | Yes | In the Appendix, we provide... PyTorch code for the proposed AMAD loss in Section F. Please kindly refer to the Appendix via the following link: https://www.amazon.science/publications/no-head-leftbehind-multi-head-alignment-distillation-for-transformers |
| Open Datasets | Yes | After uni-modal pre-training its T5 and Faster R-CNN (Ren et al. 2015) sub-modules, it is then VL pre-trained on MS COCO (Lin et al. 2014; Chen et al. 2015), Visual Genome (Krishna et al. 2016), VQA-2.0 (Goyal et al. 2019), GQA (Hudson and Manning 2019), and Visual7W (Zhu et al. 2016). |
| Dataset Splits | Yes | We evaluate image captioning performance on MS COCO dataset (Chen et al. 2015). As in (Cho et al. 2021; Fang et al. 2021; Li et al. 2022b), we use the Karpathy split (Karpathy and Fei-Fei 2015), which re-splits train2014 and val2014 images (Lin et al. 2014) into 113K / 5K / 5K for train / validation / test. |
| Hardware Specification | No | The paper does not explicitly specify hardware details such as GPU models, CPU types, or cloud instance specifications used for running the experiments in the main text. |
| Software Dependencies | No | The paper mentions using "PyTorch" for batched tensor computations, but it does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | The overall training objective for the Student, L_TOTAL, is a weighted sum of the classification distillation loss L_KD and the proposed loss L_AMAD: L_TOTAL = L_KD + α L_AMAD (Eq. 15). We apply L_AMAD to distill the self/cross-attention maps of the last (Wang et al. 2020b; Fang et al. 2021) layers of each stream. α is tuned so that L_KD and L_AMAD scale similarly. We do not add ground-truth loss (Beyer et al. 2022). ... We report our implementation details in Appendix. (A hedged PyTorch sketch of this combined objective follows the table.) |
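
To make the quoted training objective concrete, below is a minimal PyTorch sketch of a fine-tuning distillation loss of the form L_TOTAL = L_KD + α · L_AMAD. It is not the authors' released implementation (that is provided in Appendix F of the paper); the soft head-alignment step (cosine similarity between attention maps followed by a softmax over student heads) and all names here (`kd_loss`, `amad_like_loss`, `alpha`, `temperature`) are illustrative assumptions.

```python
# Illustrative sketch only: combines a classification KD term with a soft
# multi-head attention-map alignment term, following the overall form
# L_TOTAL = L_KD + alpha * L_AMAD quoted above. The alignment mechanics
# below are an assumption, not the paper's exact AMAD loss.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Standard soft-label distillation on the classification outputs."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


def amad_like_loss(student_attn, teacher_attn):
    """Softly align every teacher head to the student heads.

    student_attn: (B, Hs, L, L) attention maps from the student's last layer
    teacher_attn: (B, Ht, L, L) attention maps from the teacher's last layer
    """
    B, Hs = student_attn.shape[:2]
    Ht = teacher_attn.shape[1]
    s = student_attn.reshape(B, Hs, -1)          # (B, Hs, L*L)
    t = teacher_attn.reshape(B, Ht, -1)          # (B, Ht, L*L)
    # Cosine similarity between every (teacher head, student head) pair.
    sim = torch.einsum(
        "bhd,bkd->bhk",
        F.normalize(t, dim=-1),
        F.normalize(s, dim=-1),
    )                                            # (B, Ht, Hs)
    # Each teacher head distributes its supervision over student heads,
    # so no teacher head is left unmatched.
    w = F.softmax(sim, dim=-1)                   # (B, Ht, Hs)
    # Similarity-weighted MSE between the aligned attention maps.
    diff = (t.unsqueeze(2) - s.unsqueeze(1)).pow(2).mean(-1)  # (B, Ht, Hs)
    return (w * diff).sum(-1).mean()


def total_loss(student_out, teacher_out, alpha=1.0):
    """L_TOTAL = L_KD + alpha * L_AMAD (no ground-truth loss term)."""
    l_kd = kd_loss(student_out["logits"], teacher_out["logits"])
    l_amad = amad_like_loss(student_out["attn"], teacher_out["attn"])
    return l_kd + alpha * l_amad
```

In this sketch, `alpha` plays the role described in the quoted setup: it rescales the attention-alignment term so that L_KD and L_AMAD contribute at a similar magnitude, and no ground-truth supervision is added during fine-tuning distillation.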