Reasoning on Knowledge Graphs with Debate Dynamics

Authors: Marcel Hildebrandt, Jorge Andres Quintero Serna, Yunpu Ma, Martin Ringsquandl, Mitchell Joblin, Volker Tresp

AAAI 2020, pp. 4123-4131 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark our method on the triple classification and link prediction tasks. Thereby, we find that our method outperforms several baselines on the benchmark datasets FB15k-237, WN18RR, and Hetionet.
Researcher Affiliation | Collaboration | Siemens Corporate Technology, Ludwig Maximilian University
Pseudocode | Yes | Algorithm 1 contains pseudocode of R2D2 at inference time. (A hedged sketch of such an inference loop appears after this table.)
Open Source Code | Yes | The datasets along with the code of R2D2 are available at https://github.com/m-hildebrandt/R2D2.
Open Datasets | Yes | We measure the performance of R2D2 with respect to the triple classification and the KG completion task on the benchmark datasets FB15k-237 (Toutanova et al. 2015) and WN18RR (Dettmers et al. 2018). To test R2D2 on a real world task we also consider Hetionet (Himmelstein and Baranzini 2015)...
Dataset Splits | Yes | The canonical splits of the datasets into a training, validation, and test set are used. In particular, we ensured that triples that are assigned to the validation or test set (and their respective inverse relations) are not included in the KG during training. The results on the test set of all methods are reported based on the hyperparameters that showed the best performance (the highest accuracy for triple classification and the highest MRR for link prediction) on the validation set. (This selection protocol is sketched in the second code block after this table.)
Hardware Specification | Yes | All experiments were conducted on a machine with 48 CPU cores and 96 GB RAM.
Software Dependencies | No | The paper mentions algorithms like LSTMs and Adam, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or other libraries).
Experiment Setup | Yes | We considered the following hyperparameter ranges for R2D2: The number of latent dimensions d for the embeddings is chosen from the range {32, 64, 128}. The number of LSTM layers for the agents is chosen from {1, 2, 3}. The number of layers in the MLP for the judge is tuned in the range {1, 2, 3, 4, 5}. β was chosen from {0.02, 0.05, 0.1}. The length of each argument T was tuned in the range {1, 2, 3} and the number of debate rounds N was set to 3. Moreover, the L2-regularization strength λ is set to 0.02. Furthermore, the number of rollouts is 20 during training and 50 (triple classification) or 100 (KG completion) at test time. The loss of the judge and the rewards of the agents were optimized using Adam with a learning rate of 10^-4. The best hyperparameters are reported in Table 3. (The full search space is written out in the last sketch after this table.)
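
The pseudocode row above refers to Algorithm 1 of the paper, which is not reproduced here. Below is a minimal sketch of what a debate-style inference loop could look like, assuming the setup the other rows describe: two opposing agents that alternately extract length-T path arguments over N debate rounds, a judge that scores the assembled debate, and averaging over several rollouts at test time. All names (debate_inference, sample_argument, judge.score) are hypothetical, not the authors' API.

# Hypothetical sketch of an R2D2-style debate inference loop
# (illustrative only; see the released code for the actual implementation).
def debate_inference(triple, pro_agent, con_agent, judge, kg,
                     num_rounds=3, arg_length=3, num_rollouts=50):
    """Estimate P(triple is true) by averaging judge scores over rollouts."""
    scores = []
    for _ in range(num_rollouts):
        arguments = []
        for _ in range(num_rounds):
            # The pro agent argues the triple is true, the con agent that it
            # is false; each extracts one length-T path (argument) from the KG.
            arguments.append(pro_agent.sample_argument(triple, kg, arg_length))
            arguments.append(con_agent.sample_argument(triple, kg, arg_length))
        # The judge maps the query triple plus all arguments to a score in [0, 1].
        scores.append(judge.score(triple, arguments))
    return sum(scores) / num_rollouts  # averaged over stochastic rollouts

For triple classification, the averaged score would then be thresholded to produce a true/false decision; for KG completion, candidate triples would be ranked by their scores.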
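The dataset-splits row describes a standard validation-based model-selection protocol. A minimal sketch, assuming hypothetical train_model and evaluate helpers (these names are not part of the released R2D2 code):

# Hypothetical sketch of the selection protocol: train one model per
# hyperparameter configuration, pick the one with the best validation metric
# (accuracy for triple classification, MRR for link prediction), and report
# that model's test-set results.
def select_and_report(configs, train_data, valid_data, test_data,
                      metric="MRR"):
    # train_data is assumed to already exclude validation/test triples
    # and their inverse relations, as the row above specifies.
    best_score, best_model = float("-inf"), None
    for config in configs:
        model = train_model(config, train_data)
        score = evaluate(model, valid_data)[metric]
        if score > best_score:
            best_score, best_model = score, model
    return evaluate(best_model, test_data)  # reported numbers come from test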
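The experiment-setup row enumerates the search space directly; writing it out as a plain configuration dictionary makes the quoted ranges easier to scan. Key names are descriptive, not identifiers from the R2D2 repository:

# Search space and fixed settings as quoted in the Experiment Setup row.
search_space = {
    "embedding_dim":     [32, 64, 128],     # latent dimensions d
    "agent_lstm_layers": [1, 2, 3],
    "judge_mlp_layers":  [1, 2, 3, 4, 5],
    "beta":              [0.02, 0.05, 0.1],
    "argument_length_T": [1, 2, 3],
}
fixed_settings = {
    "debate_rounds_N": 3,
    "l2_lambda": 0.02,
    "train_rollouts": 20,
    "test_rollouts": {"triple_classification": 50, "kg_completion": 100},
    "optimizer": "Adam",
    "learning_rate": 1e-4,
}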