Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Neural Graph Reasoning: A Survey on Complex Logical Query Answering
Authors: Hongyu Ren, Mikhail Galkin, Zhaocheng Zhu, Jure Leskovec, Michael Cochez
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Multiple datasets have been proposed for evaluation of query reasoning models. Here we introduce the common setup for CLQA task. Given a knowledge graph G = (E, R, S), the standard practice is to split G into a training graph Gtrain, a validation graph Gval and a test graph Gtest (simulating the unobserved complete graph Ĝ from Section 2). The standard experiment protocol is to train a query reasoning model only on the training graph Gtrain, and evaluate the model on answering queries over the validation graph Gval and the test graph Gtest. [...] Several metrics have been proposed to evaluate the performance of query reasoning models that can be broadly classified into generalization, entailment, and query representation quality metrics. |
| Researcher Affiliation | Collaboration | Hongyu Ren*¹, Mikhail Galkin*², Zhaocheng Zhu³, Jure Leskovec¹, Michael Cochez⁴ (*equal contribution). ¹Stanford University, ²Intel AI Lab, ³Mila Québec AI Institute and Université de Montréal, ⁴Vrije Universiteit Amsterdam and Elsevier Discovery Lab, Amsterdam, the Netherlands |
| Pseudocode | No | The paper is a survey that describes methods conceptually and uses figures to illustrate components (e.g., Figure 7: Neural Query Execution through the Encoder-Processor-Decoder modules) but does not contain any explicit pseudocode or algorithm blocks. It focuses on reviewing existing work rather than presenting a new algorithm with structured steps. |
| Open Source Code | No | The paper is a survey and does not present new methods or implementations by the authors. Therefore, it does not contain an explicit statement from the authors about releasing their source code, nor does it provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | Multiple datasets have been proposed for evaluation of query reasoning models. ... BetaE datasets include sets of queries from denser Freebase (Bollacker et al., 2008) with average node degree of 18 and sparser WordNet (Miller, 1998) and NELL (Mitchell et al., 2015). Hyper-relational datasets WD50K (Alivanistos et al., 2022) and WD50K-NFOL (Luo et al., 2023) were sampled from Wikidata (Vrandecic & Krötzsch, 2014)... |
| Dataset Splits | Yes | Given a knowledge graph G = (E, R, S), the standard practice is to split G into a training graph Gtrain, a validation graph Gval and a test graph Gtest (simulating the unobserved complete graph Ĝ from Section 2). The standard experiment protocol is to train a query reasoning model only on the training graph Gtrain, and evaluate the model on answering queries over the validation graph Gval and the test graph Gtest. |
| Hardware Specification | No | As a survey paper, the document describes existing research and evaluation practices without conducting new experiments. Therefore, it does not provide specific details about the hardware used to run experiments, such as exact GPU or CPU models, memory, or cloud instance types. |
| Software Dependencies | No | As a survey paper, the document focuses on reviewing methodologies rather than implementing new ones. Consequently, it does not specify particular software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | No | As a survey paper, this document describes the experimental setups and evaluation methodologies found in the literature (e.g., Section 7.3 Training, which discusses training objectives of *other* methods). It does not present specific hyperparameters or training configurations for its own experiments, as it does not conduct new experimental research. |
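The split protocol quoted in the Dataset Splits row above is typically nested: the validation graph contains all training edges plus held-out validation edges, and the test graph adds further held-out test edges. The following is a minimal sketch of that convention; the function name, split fractions, and triple format are illustrative assumptions, not taken from the paper.

```python
import random

def split_kg(triples, val_frac=0.1, test_frac=0.1, seed=0):
    """Split a set of KG triples into nested train/valid/test graphs.

    Sketch of the common CLQA convention: G_train is a subset of G_val,
    which is a subset of G_test (the fractions here are illustrative).
    """
    rng = random.Random(seed)
    triples = sorted(triples)
    rng.shuffle(triples)
    n = len(triples)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test_edges = triples[:n_test]
    val_edges = triples[n_test:n_test + n_val]
    train_edges = triples[n_test + n_val:]

    g_train = set(train_edges)
    g_val = g_train | set(val_edges)    # G_train ⊆ G_val
    g_test = g_val | set(test_edges)    # G_val ⊆ G_test
    return g_train, g_val, g_test

# Toy graph: a chain of 100 (head, relation, tail) triples.
triples = {(f"e{i}", "r", f"e{i+1}") for i in range(100)}
g_train, g_val, g_test = split_kg(triples)
assert g_train <= g_val <= g_test  # nested as in the standard protocol
```

Queries for each split are then generated so that validation and test queries require at least one edge unseen during training, which is what the survey's "generalization" metrics measure.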