Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scalable Feature Learning on Huge Knowledge Graphs for Downstream Machine Learning
Authors: Félix Lefebvre, Gaël Varoquaux
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SEPAL on 7 large-scale knowledge graphs and 46 downstream machine learning tasks. Our results show that SEPAL significantly outperforms previous methods on downstream tasks. In addition, SEPAL scales up its base embedding model, enabling fitting huge knowledge graphs on commodity hardware. Our code is available at: https://github.com/soda-inria/sepal. In this paper, we evaluate SEPAL's performance on knowledge graphs of increasing size between YAGO3 [2.6M entities, Mahdisoltani et al., 2014] and WikiKG90Mv2 [91M entities, Hu et al., 2020]; we study the use of the embeddings for feature enrichment on 46 downstream machine learning tasks, showing that SEPAL makes embedding methods more tractable while generating better embeddings for downstream tasks. |
| Researcher Affiliation | Academia | Félix Lefebvre, SODA Team, Inria Saclay, EMAIL; Gaël Varoquaux, SODA Team, Inria Saclay, Probabl |
| Pseudocode | Yes | Algorithm 1 BLOCS. Input: Graph G = (V, E) with nodes V and edges E, hyperparameters h and m. Output: List of overlapping connected subgraphs. S: list of subgraphs; U ← V: set of unassigned nodes |
| Open Source Code | Yes | Our code is available at: https://github.com/soda-inria/sepal. |
| Open Datasets | Yes | Knowledge graph datasets: To compare large knowledge graphs of different sizes, we use Freebase [Bollacker et al., 2008], WikiKG90Mv2 [(an extract of Wikidata) Hu et al., 2020], and three generations of YAGO: YAGO3 [Mahdisoltani et al., 2014], YAGO4 [Pellissier Tanon et al., 2020], and YAGO4.5 [Suchanek et al., 2024]. |
| Dataset Splits | Yes | We randomly split each dataset into training (90%), validation (5%), and test (5%) subsets of triples. During stratification, we ensure that the train graph remains connected by moving as few triples as required from the validation/test sets to the training set. |
| Hardware Specification | Yes | DistMult, DGL-KE, NodePiece, and SEPAL were trained on Nvidia V100 GPUs with 32 GB of memory, and 20 CPU nodes with 252 GB of RAM. |
| Software Dependencies | Yes | We use the PyKEEN [Ali et al., 2021b] implementation for DistMult and NodePiece, and the implementations provided by the authors for the others. |
| Experiment Setup | Yes | Validation/test split and hyperparameter tuning: We use 4 of the 42 WikiDBs tables as validation data: 2 for regression and 2 for classification tasks (see Figure 4). The remaining 38 WikiDBs tables, along with the 4 real-world tables, are used exclusively for testing. ... Optimizer for core training: we use the Adam optimizer with learning rate lr = 1 × 10⁻³; Number p of negative samples per positive for core training: we use p = 100 (Table 9). |
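
The Dataset Splits row describes a random 90%/5%/5% split of triples in which held-out triples are moved back to training until the train graph is connected. A minimal sketch of one way to do this follows. It assumes a union-find repair pass, which is our choice for illustration and is not confirmed as the authors' implementation. The names `split_triples`, `find`, and `union` are hypothetical, and triples are assumed to be `(head, relation, tail)` tuples from a connected graph.

```python
import random


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    # Merge the components of a and b; return True if they were distinct.
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return False
    parent[ra] = rb
    return True


def split_triples(triples, val_frac=0.05, test_frac=0.05, seed=0):
    # Random split of (head, relation, tail) triples into train/val/test.
    rng = random.Random(seed)
    shuffled = list(triples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]

    # Every entity starts as its own component, then train edges merge them.
    parent = {}
    for h, _, t in shuffled:
        parent.setdefault(h, h)
        parent.setdefault(t, t)
    for h, _, t in train:
        union(parent, h, t)

    def repair(held_out):
        # Move a held-out triple to train only if it joins two components.
        kept = []
        for h, r, t in held_out:
            if union(parent, h, t):
                train.append((h, r, t))
            else:
                kept.append((h, r, t))
        return kept

    val = repair(val)
    test = repair(test)
    return train, val, test
```

Each moved triple merges two components of the train graph, so when the full graph is connected the pass moves exactly one triple fewer than the number of components left by the random split, which is the fewest that can reconnect it. The validation and test sets can therefore end up slightly smaller than 5% each.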