ALP-KD: Attention-Based Layer Projection for Knowledge Distillation
Authors: Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, Qun Liu
AAAI 2021, pp. 13657-13665 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our combinatorial approach is able to outperform other existing techniques. ... A common practice in our field to evaluate the quality of a KD technique is to feed T and S models with instances of standard datasets and measure how they perform. |
| Researcher Affiliation | Industry | Peyman Passban²,*, Yimeng Wu¹, Mehdi Rezagholizadeh¹, Qun Liu¹ (¹Huawei Noah's Ark Lab, ²Amazon) |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We followed the same tradition in this paper and selected a set of eight GLUE tasks (Wang et al. 2018) including CoLA, MNLI, MRPC, QNLI, QQP, RTE, SST-2, and STS-B datasets to benchmark our models. Detailed information about datasets is available in the appendix section. |
| Dataset Splits | Yes | Similar to other papers, we evaluate our models on validation sets. Test-set labels of GLUE datasets are not publicly available and researchers need to participate in leaderboard competitions to evaluate their models on test sets. CoLA: A corpus of English sentences drawn from books and journal articles with 8,551 training and 1,043 validation instances. |
| Hardware Specification | Yes | Each model is fine-tuned on a single NVIDIA 32GB V100 GPU. |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., BERT, Transformer blocks) and implicitly uses common ML libraries, but it does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | In our setting, the batch size is set to 32 and the learning rate is selected from {1e-5, 2e-5, 5e-5}. η and λ take values from {0, 0.2, 0.5, 0.7} and β = 1 − η − λ. |
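
As a quick check of the dataset and split details quoted above, the sketch below loads the same eight GLUE tasks and prints their per-split sizes. The Hugging Face `datasets` library and its task identifiers are our assumptions; the paper does not say how the data was obtained.

```python
from datasets import load_dataset  # Hugging Face `datasets` (assumed; not named in the paper)

# The eight GLUE tasks the paper benchmarks on, under their hub identifiers.
GLUE_TASKS = ["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb"]

for task in GLUE_TASKS:
    ds = load_dataset("glue", task)
    # GLUE test labels are withheld (leaderboard-only), so the paper
    # evaluates on validation sets. Note that MNLI ships two validation
    # splits (matched/mismatched) rather than a single one.
    sizes = {name: len(split) for name, split in ds.items()
             if not name.startswith("test")}
    print(task, sizes)
```

For CoLA, this reproduces the 8,551 training and 1,043 validation instances quoted in the Dataset Splits row.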
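
The Experiment Setup row pins down a small hyperparameter grid. Below is a minimal sketch of that search space; only the grid values and the constraint β = 1 − η − λ come from the paper, while the configuration keys and the non-negativity filter on β are illustrative assumptions.

```python
import itertools

# Values reported in the paper's experiment setup.
BATCH_SIZE = 32
LEARNING_RATES = [1e-5, 2e-5, 5e-5]
ETA_LAMBDA_VALUES = [0.0, 0.2, 0.5, 0.7]

def configs():
    """Enumerate learning rate x (eta, lambda), with beta = 1 - eta - lambda."""
    for lr, (eta, lam) in itertools.product(
            LEARNING_RATES, itertools.product(ETA_LAMBDA_VALUES, repeat=2)):
        beta = 1.0 - eta - lam
        if beta < 0.0:
            continue  # skip weight combinations summing past 1 (assumed filter)
        yield {"batch_size": BATCH_SIZE, "lr": lr,
               "eta": eta, "lambda": lam, "beta": beta}

if __name__ == "__main__":
    for cfg in configs():
        print(cfg)
```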