Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

KGGen: Extracting Knowledge Graphs from Plain Text with Language Models

Authors: Belinda Mo, Kyssen Yu, Joshua Kazdan, Proud Mpala, Lisa Yu, Charilaos Kanatsoulis, Sanmi Koyejo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We benchmark our new tool against leading existing generators such as Microsoft’s Graph RAG; we achieve comparable retrieval accuracy on the generated graphs and better information retention.
Researcher Affiliation	Collaboration	Belinda Mo 1, Kyssen Yu 2, Joshua Kazdan 1, Joan Cabezas, Proud Mpala 1, Lisa Yu 2, Chris Cundy 3, Charilaos Kanatsoulis 1, Sanmi Koyejo 1; 1 Stanford University, 2 University of Toronto, 3 FAR AI
Pseudocode	No	The exact prompts for each step can be found in Appendix A, and the process is illustrated in Figure 1. (Figure 1 is a diagram, Appendix A contains natural language prompts, not pseudocode.)
Open Source Code	Yes	Our code is open-sourced at https://github.com/stair-lab/kg-gen/
Open Datasets	Yes	The RAG evaluation is based on the Wiki QA dataset Yang et al. [2015]
Dataset Splits	No	The paper describes MINE-1 as using '100 articles' and MINE-2 using the 'Wiki QA dataset ... which contains 20,400 questions based on 1,995 Wikipedia articles', but does not specify explicit training/validation/test splits for KGGen or the datasets it uses for evaluation.
Hardware Specification	No	Our experiments do not require special hardware, and can be run on most laptops with production models. They require only an API key from a model provider. (This is clear from the paper.)
Software Dependencies	No	The paper mentions 'Google’s Gemini 2.0 Flash', 'DSPy signatures', and 'all-MiniLM-L6-v2 model from Sentence Transformers', but only Gemini includes a specific version number. Other software dependencies are mentioned without specific versions.
Experiment Setup	Yes	For each fact, the verifier retrieves the top-k most semantically similar nodes in the KG, then expands the result to include all nodes within two relations of those top-k nodes... For each question in the dataset, we retrieve the top 10 most relevant triples... The final similarity score is obtained by combining BM25 relevance score and the cosine similarity score, weighted equally.
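
The retrieval steps quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the function names (`expand_within_hops`, `hybrid_score`, `top_k`), the adjacency-dictionary graph representation, and the assumption that the BM25 and cosine scores are already on comparable scales are all hypothetical choices made for this example. The quoted text does not say whether or how the scores are normalized before they are combined.

```python
from collections import deque


def expand_within_hops(adjacency, seeds, max_hops=2):
    """Return every node reachable from any seed in at most max_hops edges.

    adjacency: dict mapping a node to an iterable of neighbouring nodes
    (assumed here to list edges in both directions).
    """
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def hybrid_score(bm25, cosine, weight=0.5):
    """Combine two relevance scores; weight=0.5 weights them equally.

    Assumes both inputs are already on comparable scales.
    """
    return weight * bm25 + (1.0 - weight) * cosine


def top_k(scores, k):
    """Return the k keys with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy chain graph a - b - c - d - e, with edges listed in both directions.
    graph = {
        "a": ["b"],
        "b": ["a", "c"],
        "c": ["b", "d"],
        "d": ["c", "e"],
        "e": ["d"],
    }
    print(sorted(expand_within_hops(graph, ["a"])))

    # Hypothetical per-triple scores: (BM25, cosine) pairs.
    raw = {"t1": (0.9, 0.7), "t2": (0.1, 0.2), "t3": (0.5, 0.6)}
    combined = {name: hybrid_score(b, c) for name, (b, c) in raw.items()}
    print(top_k(combined, 2))
```

In the setup described above, `k` would be 10 for the question-answering retrieval, and the two-hop expansion would be applied to the top-k nodes returned by the similarity search.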