IterDE: An Iterative Knowledge Distillation Framework for Knowledge Graph Embeddings

Authors: Jiajun Liu, Peng Wang, Ziyu Shang, Chenxiao Wu

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that IterDE achieves a new state-of-the-art distillation performance for KGEs compared to strong baselines on the link prediction task. Significantly, IterDE can reduce the training time by 50% on average. Finally, more exploratory experiments show that the soft-label weighting dynamic adjustment mechanism and more fine-grained iterations can improve distillation performance.
Researcher Affiliation | Academia | Jiajun Liu, Peng Wang*, Ziyu Shang, Chenxiao Wu; School of Computer Science and Engineering, Southeast University; {jiajliu, pwang, ziyus1999, chenxiaowu}@seu.edu.cn
Pseudocode | No | The paper includes a framework overview diagram (Figure 2) but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code of IterDE and the datasets can be accessed via https://github.com/seukgcode/IterDE.
Open Datasets | Yes | Datasets: We use the two popular and open datasets in KGEs: FB15K-237 (Dettmers et al. 2018) and WN18RR (Toutanova et al. 2015). ... The detailed statistical information is shown in Table 1.
Dataset Splits | Yes | The detailed statistical information is shown in Table 1: FB15K-237 has 14,541 entities, 237 relations, and 272,115 / 17,535 / 20,466 train / valid / test triples; WN18RR has 40,943 entities, 11 relations, and 86,835 / 3,034 / 3,134 train / valid / test triples. (A split-size verification sketch is given after this table.)
Hardware Specification | Yes | All experiments are implemented on GPU GeForce RTX 2080 Ti.
Software Dependencies | Yes | The experiments are extended from OpenKE (Han et al. 2018), an open source library based on PyTorch (Paszke et al. 2019), with CUDA version 10.2.89.
Experiment Setup | Yes | In all experiments, we set the teacher model dimension to 512 and the student model dimension to 32. In distillation, we set the compression ratio of each layer α to 2 and the number of iterations N to 4. We set the value of hyperparameter p to 2, while 5 and 10 give similar results. We set the batch size to 1024 and the epoch for each iteration to a maximum of 1000. We use Adagrad as the optimizer, and the learning rate is chosen among [0.5, 0.1, 0.01]. The initial soft-label weight λ0 is chosen among [1, 0.1, 0.01].
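
The Dataset Splits row quotes the counts from Table 1 of the paper. Since the experiments extend OpenKE, the benchmarks are most likely stored in the OpenKE layout, where the first line of each *2id.txt file declares how many records follow. The sketch below checks local copies against the quoted counts; the `benchmarks` directory, the dataset folder names, the file layout, and the `check_dataset` helper are assumptions for illustration, not part of the IterDE release.

```python
from pathlib import Path

# Split sizes quoted from Table 1 of the paper.
EXPECTED = {
    "FB15K-237": {"entities": 14541, "relations": 237,
                  "train": 272115, "valid": 17535, "test": 20466},
    "WN18RR":    {"entities": 40943, "relations": 11,
                  "train": 86835, "valid": 3034, "test": 3134},
}

# Assumed OpenKE-style benchmark layout: the first line of each *2id.txt
# file declares how many records follow. Actual folder names may differ.
FILES = {
    "entities": "entity2id.txt",
    "relations": "relation2id.txt",
    "train": "train2id.txt",
    "valid": "valid2id.txt",
    "test": "test2id.txt",
}

def declared_count(path: Path) -> int:
    """Read the record count declared on the first line of an OpenKE file."""
    with path.open() as f:
        return int(f.readline().strip())

def check_dataset(root: Path, name: str) -> None:
    """Compare on-disk counts for one benchmark against the quoted Table 1."""
    for key, fname in FILES.items():
        got = declared_count(root / name / fname)
        want = EXPECTED[name][key]
        status = "ok" if got == want else f"MISMATCH (expected {want})"
        print(f"{name}/{fname}: {got} {status}")

if __name__ == "__main__":
    for dataset in EXPECTED:
        check_dataset(Path("benchmarks"), dataset)
```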
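
The Experiment Setup row fixes every hyperparameter except the learning rate and the initial soft-label weight λ0, which are each chosen from three values. Below is a minimal plain-Python sketch of that search space, independent of the authors' code; the `make_run_config` helper and the field names are illustrative assumptions, and the halving dimension schedule is only an inference from α = 2 and N = 4 that happens to match the quoted 512 and 32.

```python
from itertools import product

# Fixed settings quoted in the Experiment Setup row above.
BASE_CONFIG = {
    "teacher_dim": 512,            # teacher embedding dimension
    "student_dim": 32,             # student embedding dimension
    "compression_ratio": 2,        # per-layer compression ratio (alpha)
    "num_iterations": 4,           # number of distillation iterations (N)
    "p": 2,                        # hyperparameter p (5 and 10 reported as similar)
    "batch_size": 1024,
    "max_epochs_per_iteration": 1000,
    "optimizer": "Adagrad",
}

# Values the paper reports choosing among.
LEARNING_RATES = [0.5, 0.1, 0.01]
INITIAL_SOFT_LABEL_WEIGHTS = [1, 0.1, 0.01]   # lambda_0

# Inferred (not quoted): with alpha = 2 and N = 4 the dimension halves each
# iteration, 512 -> 256 -> 128 -> 64 -> 32, ending at the student dimension.
DIMS = [BASE_CONFIG["teacher_dim"] // BASE_CONFIG["compression_ratio"] ** i
        for i in range(BASE_CONFIG["num_iterations"] + 1)]
assert DIMS[-1] == BASE_CONFIG["student_dim"]

def make_run_config(lr, lambda_0):
    """Assemble one concrete run configuration (illustrative helper)."""
    cfg = dict(BASE_CONFIG)
    cfg["learning_rate"] = lr
    cfg["initial_soft_label_weight"] = lambda_0
    return cfg

# The 3 x 3 = 9 candidate runs implied by the two quoted grids.
runs = [make_run_config(lr, lam)
        for lr, lam in product(LEARNING_RATES, INITIAL_SOFT_LABEL_WEIGHTS)]

if __name__ == "__main__":
    for i, cfg in enumerate(runs, 1):
        print(f"run {i}: lr={cfg['learning_rate']}, "
              f"lambda_0={cfg['initial_soft_label_weight']}")
```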