Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
No Head Left Behind – Multi-Head Alignment Distillation for Transformers
Authors: Tianyang Zhao, Kunwar Yashraj Singh, Srikar Appalaraju, Peng Tang, Vijay Mahadevan, R. Manmatha, Ying Nian Wu
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on distilling VL-T5 and BLIP, and apply AMAD loss on their T5, BERT, and ViT sub-modules. We show, under vision-language setting, that AMAD outperforms conventional distillation methods on VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned by ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training for BLIP models. |
| Researcher Affiliation | Collaboration | Tianyang Zhao1,2*, Kunwar Yashraj Singh1, Srikar Appalaraju1, Peng Tang1, Vijay Mahadevan1, R. Manmatha1, Ying Nian Wu1,2; 1 AWS AI Labs, 2 University of California, Los Angeles; EMAIL, EMAIL |
| Pseudocode | No | The paper provides mathematical formulations and references PyTorch code in the appendix, but it does not include a distinct pseudocode block or algorithm description. |
| Open Source Code | Yes | In the Appendix, we provide... PyTorch code for the proposed AMAD loss in Section F. Please kindly refer to the Appendix via the following link: https://www.amazon.science/publications/no-head-leftbehind-multi-head-alignment-distillation-for-transformers |
| Open Datasets | Yes | After uni-modal pre-training its T5 and Faster R-CNN (Ren et al. 2015) sub-modules, it is then VL pretrained on MS COCO (Lin et al. 2014; Chen et al. 2015), Visual Genome (Krishna et al. 2016), VQA-2.0 (Goyal et al. 2019), GQA (Hudson and Manning 2019), and Visual7W (Zhu et al. 2016). |
| Dataset Splits | Yes | We evaluate image captioning performance on MS COCO dataset (Chen et al. 2015). As in (Cho et al. 2021; Fang et al. 2021; Li et al. 2022b), we use the Karpathy split (Karpathy and Fei-Fei 2015), which re-splits train2014 and val2014 images (Lin et al. 2014) into 11K / 5K / 5K for train / validation / test. |
| Hardware Specification | No | The paper does not explicitly specify hardware details such as GPU models, CPU types, or cloud instance specifications used for running the experiments in the main text. |
| Software Dependencies | No | The paper mentions using "PyTorch" for batched tensor computations, but it does not specify a version number for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | The overall training objective for the Student, $\mathcal{L}_{\text{TOTAL}}$, is a weighted sum of the classification distillation loss $\mathcal{L}_{\text{KD}}$ and the proposed loss $\mathcal{L}_{\text{AMAD}}$: $\mathcal{L}_{\text{TOTAL}} = \mathcal{L}_{\text{KD}} + \alpha \mathcal{L}_{\text{AMAD}}$ (15). We apply $\mathcal{L}_{\text{AMAD}}$ to distill the self/cross-attention maps of the last (Wang et al. 2020b; Fang et al. 2021) layers of each stream. $\alpha$ is tuned so that $\mathcal{L}_{\text{KD}}$ and $\mathcal{L}_{\text{AMAD}}$ scale similarly. We do not add ground-truth loss (Beyer et al. 2022). ... We report our implementation details in Appendix. |
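
The weighted objective quoted in the Experiment Setup row (Eq. 15) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `total_loss` and `estimate_alpha`, the warm-up values, and the ratio-of-means heuristic for choosing α are all assumptions made for illustration. The paper states only that α is tuned so the two loss terms have similar scale, and the AMAD loss itself is defined in the paper's appendix, not here.

```python
from statistics import mean


def total_loss(l_kd: float, l_amad: float, alpha: float) -> float:
    """Combine the two terms as in Eq. (15): L_TOTAL = L_KD + alpha * L_AMAD.

    No ground-truth loss term is added, matching the quoted setup.
    """
    return l_kd + alpha * l_amad


def estimate_alpha(kd_values, amad_values) -> float:
    """Hypothetical heuristic (an assumption, not from the paper):
    pick alpha as the ratio of mean loss magnitudes over a few warm-up
    batches, so that alpha * L_AMAD is on the same scale as L_KD.
    """
    mean_amad = mean(amad_values)
    if mean_amad == 0:
        raise ValueError("AMAD loss is zero; cannot derive a scale ratio")
    return mean(kd_values) / mean_amad


# Made-up warm-up loss values, used only to exercise the functions.
kd_warmup = [2.0, 1.8, 2.2]
amad_warmup = [0.02, 0.018, 0.022]

alpha = estimate_alpha(kd_warmup, amad_warmup)
print(total_loss(l_kd=2.0, l_amad=0.02, alpha=alpha))
```

In practice the two loss values would come from the distillation and attention-alignment computations on each batch; the sketch takes them as plain numbers so that only the weighting step is shown.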