Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Sound Logical Explanations for Mean Aggregation Graph Neural Networks
Authors: Matthew Morris, Ian Horrocks
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that restricting mean-aggregation GNNs to have non-negative weights yields comparable or improved performance on standard inductive benchmarks, that sound rules are obtained in practice, that insightful explanations can be generated in practice, and that the sound rules can expose issues in the trained models. |
| Researcher Affiliation | Academia | Matthew Morris Department of Computer Science University of Oxford EMAIL Ian Horrocks Department of Computer Science University of Oxford EMAIL |
| Pseudocode | Yes | This procedure is described in Algorithm 1 of Appendix B.1. |
| Open Source Code | Yes | Code and data have been submitted with the supplementary material. The README in the code and hyperparameters in Section 4 provide the details necessary for reproducing the experimental results. |
| Open Datasets | Yes | We use 3 standard benchmarks: WN18RRv1, FB237v1, and NELLv1 [38], each of which provides datasets for training, validation, and testing, as well as negative examples and positive targets. Importantly, these benchmarks are also inductive, meaning that the validation and testing sets contain constants not seen during training. We also use the LUBM dataset [15, LUBM(1,0)], with the train/test split from Liu et al. [23]; this is a node classification dataset, all others are link prediction. |
| Dataset Splits | Yes | We use 3 standard benchmarks: WN18RRv1, FB237v1, and NELLv1 [38], each of which provides datasets for training, validation, and testing, as well as negative examples and positive targets. We also use the LUBM dataset [15, LUBM(1,0)], with the train/test split from Liu et al. [23]; this is a node classification dataset, all others are link prediction. |
| Hardware Specification | Yes | Experiments were run using PyTorch Geometric, with 2 CPUs and 16GB of memory on a Linux server, using 34 days of compute time. |
| Software Dependencies | No | Experiments were run using PyTorch Geometric, with 2 CPUs and 16GB of memory on a Linux server, using 34 days of compute time. The paper mentions 'PyTorch Geometric' but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | For the model architecture, we fix a hidden dimension of twice the input dimension, 2 GNN layers, ReLU after the first layer, and sigmoid after the second layer. The GNN definition given in Section 2, which was chosen for ease of presentation, describes GNNs aggregating in the reverse direction of the edges. For our experiments, we follow the standard approach and aggregate in the direction of the edges. Thus, when presenting a rule, we write each binary predicate as its inverse. For example, advisor is written as advisorOf. We use GNNs with max, sum, and mean aggregation. We train each model for 8000 epochs, stopping training early if loss does not improve for 50 epochs. For all trained models, we compute standard classification metrics, such as precision, recall, accuracy, and F1 score. For each model, we choose the classification threshold by computing the accuracy on the validation set across a range of 108 thresholds between 0 and 1, selecting the one which maximises accuracy. We train all our models using binary cross entropy loss and the Adam optimiser with a learning rate of 0.001. We run each experiment across 5 different random seeds and present the aggregated metrics. |
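
The threshold selection step quoted under Experiment Setup can be illustrated with a minimal sketch. This is not the authors' code. The function name `select_threshold`, the evenly spaced grid that includes both endpoints, the `score >= threshold` decision rule, and the tie-breaking towards the lowest threshold are all assumptions made for illustration. Only the idea of scanning a fixed number of thresholds between 0 and 1 and keeping the one with the highest validation accuracy comes from the quoted text.

```python
from typing import Sequence, Tuple


def select_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    num_thresholds: int = 108,  # count taken from the quoted setup text
) -> Tuple[float, float]:
    """Return (threshold, accuracy) maximising accuracy on a validation set.

    Assumptions (not stated in the source): the grid is evenly spaced on
    [0, 1] including both endpoints, an example is predicted positive when
    its score is >= the threshold, and ties keep the lowest threshold.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal length")
    if num_thresholds < 2:
        raise ValueError("num_thresholds must be at least 2")

    best_threshold, best_accuracy = 0.0, -1.0
    for i in range(num_thresholds):
        threshold = i / (num_thresholds - 1)
        # Count examples whose thresholded prediction matches the label.
        correct = sum(
            (score >= threshold) == bool(label)
            for score, label in zip(scores, labels)
        )
        accuracy = correct / len(scores)
        # Strict inequality keeps the first (lowest) threshold on ties.
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Hypothetical validation scores (sigmoid outputs) and binary labels.
val_scores = [0.1, 0.2, 0.8, 0.9]
val_labels = [0, 0, 1, 1]
threshold, accuracy = select_threshold(val_scores, val_labels)
```

The selected threshold would then be held fixed when computing precision, recall, accuracy, and F1 on the test set, so that test data plays no part in choosing it.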