Reasoning on Knowledge Graphs with Debate Dynamics

Authors: Marcel Hildebrandt, Jorge Andres Quintero Serna, Yunpu Ma, Martin Ringsquandl, Mitchell Joblin, Volker Tresp

AAAI 2020, pp. 4123-4131 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark our method on the triple classification and link prediction tasks. Thereby, we find that our method outperforms several baselines on the benchmark datasets FB15k-237, WN18RR, and Hetionet.
Researcher Affiliation | Collaboration | Siemens Corporate Technology, Ludwig Maximilian University
Pseudocode | Yes | Algorithm 1 contains pseudocode of R2D2 at inference time. (A hedged sketch of such an inference loop appears after this table.)
Open Source Code | Yes | The datasets along with the code of R2D2 are available at https://github.com/m-hildebrandt/R2D2.
Open Datasets | Yes | We measure the performance of R2D2 with respect to the triple classification and the KG completion task on the benchmark datasets FB15k-237 (Toutanova et al. 2015) and WN18RR (Dettmers et al. 2018). To test R2D2 on a real world task we also consider Hetionet (Himmelstein and Baranzini 2015)...
Dataset Splits | Yes | The canonical splits of the datasets into a training, validation, and test set are used. In particular, we ensured that triples that are assigned to the validation or test set (and their respective inverse relations) are not included in the KG during training. The results on the test set of all methods are reported based on the hyperparameters that showed the best performance (the highest accuracy for triple classification and the highest MRR for link prediction) on the validation set. (This selection protocol is sketched in the second code block after this table.)
Hardware Specification | Yes | All experiments were conducted on a machine with 48 CPU cores and 96 GB RAM.
Software Dependencies | No | The paper mentions algorithms like LSTMs and Adam, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or other libraries).
Experiment Setup | Yes | We considered the following hyperparameter ranges for R2D2: The number of latent dimensions d for the embeddings is chosen from the range {32, 64, 128}. The number of LSTM layers for the agents is chosen from {1, 2, 3}. The number of layers in the MLP for the judge is tuned in the range {1, 2, 3, 4, 5}. β was chosen from {0.02, 0.05, 0.1}. The length of each argument T was tuned in the range {1, 2, 3} and the number of debate rounds N was set to 3. Moreover, the L2-regularization strength λ is set to 0.02. Furthermore, the number of rollouts is 20 during training and 50 (triple classification) or 100 (KG completion) at test time. The loss of the judge and the rewards of the agents were optimized using Adam with a learning rate of 10^-4. The best hyperparameters are reported in Table 3. (The full search space is written out in the last sketch after this table.)
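
The pseudocode row above refers to Algorithm 1 of the paper, which is not reproduced here. Below is a minimal sketch of what a debate-style inference loop could look like, assuming the setup the other rows describe: two opposing agents that alternately extract length-T path arguments over N debate rounds, a judge that scores the assembled debate, and averaging over several rollouts at test time. All names (debate_inference, sample_argument, judge.score) are hypothetical, not the authors' API.

# Hypothetical sketch of an R2D2-style debate inference loop
# (illustrative only; see the released code for the actual implementation).
def debate_inference(triple, pro_agent, con_agent, judge, kg,
                     num_rounds=3, arg_length=3, num_rollouts=50):
    """Estimate P(triple is true) by averaging judge scores over rollouts."""
    scores = []
    for _ in range(num_rollouts):
        arguments = []
        for _ in range(num_rounds):
            # The pro agent argues the triple is true, the con agent that it
            # is false; each extracts one length-T path (argument) from the KG.
            arguments.append(pro_agent.sample_argument(triple, kg, arg_length))
            arguments.append(con_agent.sample_argument(triple, kg, arg_length))
        # The judge maps the query triple plus all arguments to a score in [0, 1].
        scores.append(judge.score(triple, arguments))
    return sum(scores) / num_rollouts  # averaged over stochastic rollouts

For triple classification, the averaged score would then be thresholded to produce a true/false decision; for KG completion, candidate triples would be ranked by their scores.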
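The dataset-splits row describes a standard validation-based model-selection protocol. A minimal sketch, assuming hypothetical train_model and evaluate helpers (these names are not part of the released R2D2 code):

# Hypothetical sketch of the selection protocol: train one model per
# hyperparameter configuration, pick the one with the best validation metric
# (accuracy for triple classification, MRR for link prediction), and report
# that model's test-set results.
def select_and_report(configs, train_data, valid_data, test_data,
                      metric="MRR"):
    # train_data is assumed to already exclude validation/test triples
    # and their inverse relations, as the row above specifies.
    best_score, best_model = float("-inf"), None
    for config in configs:
        model = train_model(config, train_data)
        score = evaluate(model, valid_data)[metric]
        if score > best_score:
            best_score, best_model = score, model
    return evaluate(best_model, test_data)  # reported numbers come from test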
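The experiment-setup row enumerates the search space directly; writing it out as a plain configuration dictionary makes the quoted ranges easier to scan. Key names are descriptive, not identifiers from the R2D2 repository:

# Search space and fixed settings as quoted in the Experiment Setup row.
search_space = {
    "embedding_dim":     [32, 64, 128],     # latent dimensions d
    "agent_lstm_layers": [1, 2, 3],
    "judge_mlp_layers":  [1, 2, 3, 4, 5],
    "beta":              [0.02, 0.05, 0.1],
    "argument_length_T": [1, 2, 3],
}
fixed_settings = {
    "debate_rounds_N": 3,
    "l2_lambda": 0.02,
    "train_rollouts": 20,
    "test_rollouts": {"triple_classification": 50, "kg_completion": 100},
    "optimizer": "Adam",
    "learning_rate": 1e-4,
}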