ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Authors: Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, Qun Liu

AAAI 2021, pp. 13657-13665

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results show that our combinatorial approach is able to outperform other existing techniques." "A common practice in our field to evaluate the quality of a KD technique is to feed T and S models with instances of standard datasets and measure how they perform."
Researcher Affiliation | Industry | "Peyman Passban (2,*), Yimeng Wu (1), Mehdi Rezagholizadeh (1), Qun Liu (1) — 1 Huawei Noah's Ark Lab, 2 Amazon"
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | "We followed the same tradition in this paper and selected a set of eight GLUE tasks (Wang et al. 2018) including CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2, and STS-B datasets to benchmark our models. Detailed information about datasets is available in the appendix section."
Dataset Splits | Yes | "Similar to other papers, we evaluate our models on validation sets. Testset labels of GLUE datasets are not publicly available and researchers need to participate in leaderboard competitions to evaluate their models on testsets." "CoLA: A corpus of English sentences drawn from books and journal articles with 8,551 training and 1,043 validation instances."
Hardware Specification | Yes | "Each model is fine-tuned on a single NVIDIA 32GB V100 GPU."
Software Dependencies | No | The paper mentions various models and frameworks (e.g., BERT, Transformer blocks) and implicitly uses common ML libraries, but it does not specify version numbers for any software dependencies.
Experiment Setup | Yes | "In our setting, the batch size is set to 32 and the learning rate is selected from {1e-5, 2e-5, 5e-5}. η and λ take values from {0, 0.2, 0.5, 0.7} and β = 1 − η − λ."
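The experiment setup quoted above can be sketched as a small hyperparameter grid. This is a minimal illustration, not the authors' code: the constant names and the `loss_weights` helper are our own, and only the numeric values (batch size 32, the learning-rate set, the η/λ candidates, and β = 1 − η − λ) come from the paper.

```python
from itertools import product

# Values quoted in the paper's experiment-setup description;
# the variable names below are illustrative, not the authors'.
BATCH_SIZE = 32
LEARNING_RATES = [1e-5, 2e-5, 5e-5]
ETA_LAMBDA_VALUES = [0.0, 0.2, 0.5, 0.7]

def loss_weights():
    """Yield (eta, lam, beta) triples with beta = 1 - eta - lam.

    The paper states the constraint but not which (eta, lam) pairs
    were actually searched, so we enumerate the full cross product.
    """
    for eta, lam in product(ETA_LAMBDA_VALUES, repeat=2):
        beta = 1.0 - eta - lam
        yield eta, lam, beta

if __name__ == "__main__":
    for eta, lam, beta in loss_weights():
        print(f"eta={eta}, lambda={lam}, beta={beta:.1f}")
```

Note that some pairs (e.g., η = λ = 0.7) yield a negative β under this constraint; the paper does not say how such combinations were handled.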