Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Can LLMs Effectively Leverage Graph Structural Information through Prompts, and Why?
Authors: Jin Huang, Xingjian Zhang, Qiaozhu Mei, Jiaqi Ma
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our exploration of these questions reveals that (i) there is no substantial evidence that the performance of LLMs is significantly attributed to data leakage; (ii) instead of understanding prompts as graph structures, LLMs tend to process prompts more as contextual paragraphs and (iii) the most efficient elements of the local neighborhood included in the prompt are phrases that are pertinent to the node label, rather than the graph structure. ... For the first question, we investigate the extent to which data leakage might artificially inflate the performance of LLMs in Section 3.2. To rigorously measure the data leakage effect, we collect a new dataset, ensuring that the test nodes are sampled from time periods post the data cut-off of ChatGPT (OpenAI, 2022) and LLaMA-2 (Touvron et al., 2023). ... As shown in Table 3, we observe the contrary: the performance drop of ChatGPT between ogbn-arxiv and arxiv-2023 is less than the drop of MPNNs on the two datasets (1.3% compared to 5.1% in rich context, 3.6% compared to 4.5% in scarce context). |
| Researcher Affiliation | Academia | Jin Huang EMAIL University of Michigan, Ann Arbor Xingjian Zhang EMAIL University of Michigan, Ann Arbor Qiaozhu Mei EMAIL University of Michigan, Ann Arbor Jiaqi Ma EMAIL University of Illinois Urbana-Champaign |
| Pseudocode | No | The paper describes prompt templates in Table 1 and various experimental strategies but does not provide structured pseudocode or algorithm blocks for any computational procedure. |
| Open Source Code | Yes | Codes and datasets are at: https://github.com/TRAIS-Lab/LLM-Structured-Data |
| Open Datasets | Yes | Codes and datasets are at: https://github.com/TRAIS-Lab/LLM-Structured-Data. ... To rigorously measure the data leakage effect, we collect a new dataset, arxiv-2023, which is designed to resemble another widely-used dataset, ogbn-arxiv (Hu et al., 2020) ... Our study involves experiments on four node classification benchmark datasets with textual node features: cora (McCallum et al., 2000; Lu & Getoor, 2003; Sen et al., 2008; Yang et al., 2016), pubmed (Namata et al., 2012; Yang et al., 2016), ogbn-arxiv (Hu et al., 2020) and ogbn-product (Hu et al., 2020). |
| Dataset Splits | Yes | The dataset splits are as follows: 1. cora: Training/Validation/Testing ratios are 0.1/0.2/0.2. 2. pubmed: Training/Validation/Testing ratios are 0.6/0.2/0.2, following He et al. (2023). 3. ogbn-arxiv: Original OGB (Hu et al., 2020) splits are used, categorizing papers by their publication year: training (pre-2017), validation (2018), and testing (2019). 4. ogbn-product: Original OGB splits are used based on sales ranking: top 8% for training, next 2% for validation, and the remainder for testing. 5. arxiv-2023: Year-based splits similar to ogbn-arxiv are adopted: training (pre-2019), validation (2020), and testing (2023). |
| Hardware Specification | No | The paper mentions using the 'ChatGPT API' (gpt-3.5-turbo-0613) and the 'LLaMA-2-7B model' for experiments, which are external models/APIs. It does not provide any specific hardware details (like GPU/CPU models, memory) for the computational resources used by the authors to run their experiments or implement their baselines. |
| Software Dependencies | No | The paper mentions using the 'ChatGPT API' (gpt-3.5-turbo-0613) and 'LLaMA-2-7B-chat' as models. It also states that 'Baseline models GCN and SAGE are implemented with PyG (Fey & Lenssen, 2019)' and refers to 'AnyStyle' for reference extraction, but specific version numbers for PyG or other local software libraries used for implementation are not provided. |
| Experiment Setup | Yes | For hyperparameter tuning, we perform a random search on the following hyperparameter tuning range for every model following Ma et al. (2022): Number of layers: {2, 3}. Hidden size: {32, 64}. Learning rate: {.001, .005, .01, .1}. Dropout rate: {.2, .4, .6, .8}. Weight decay: {.0001, .001, .01, .1}. Each model is run on 100 random configurations and each random configuration is run for 3 times on ogbn-arxiv and arxiv-2023. The max training epoch number is 2000. ... Common settings for all methods include a temperature of 0 and a maximum output token limit of 500. |
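The random-search protocol quoted in the Experiment Setup row can be sketched as below. This is a minimal illustration of the sampling step only; the function and variable names are assumptions for this sketch, not the authors' code, and the actual training loop (3 runs per configuration, up to 2000 epochs) is indicated only in comments.

```python
import random

# Hyperparameter grid quoted from the paper's experiment setup.
SEARCH_SPACE = {
    "num_layers": [2, 3],
    "hidden_size": [32, 64],
    "learning_rate": [0.001, 0.005, 0.01, 0.1],
    "dropout_rate": [0.2, 0.4, 0.6, 0.8],
    "weight_decay": [0.0001, 0.001, 0.01, 0.1],
}

def sample_configs(n_configs=100, seed=0):
    """Draw random configurations from the grid (100 per model in the paper)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        for _ in range(n_configs)
    ]

configs = sample_configs()
# Each configuration would then be trained 3 times (on ogbn-arxiv and
# arxiv-2023) for up to 2000 epochs, selecting by validation performance.
```

Random search over a small discrete grid like this is a common alternative to exhaustive search (the full grid here has 2 × 2 × 4 × 4 × 4 = 256 cells, so 100 random draws cover a substantial fraction of it).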